Stan or JAGS for Bayesian ideal-point IRT?

2016-04-13

Anybody who has ever tried to run even a moderately sized Bayesian IRT model in R (for ideal points as in the political science literature, or otherwise) will know that these models can take a long time. It's not R's fault: these are usually big models with lots of parameters, so they naturally take longer.[^1] Not to mention that Bayesian computation is more computationally intensive than other methods. Historically (okay, I'm talking about the last twenty years, maybe 'historically' is a little strong), the sampling software BUGS (Bayesian Inference Using Gibbs Sampling) and then JAGS were used to run Bayesian models (JAGS is still pretty common, and BUGS too, though not as much). Lately, Stan has been gaining ground, particularly for more complex modelling.

While the reasons for choosing Stan are often put down to speed, for many types of models there is not actually a large difference between the two; according to John Kruschke, JAGS is even faster for some models.[^2] Given the lack of a big difference between JAGS/BUGS and Stan, which sampling software should we use for IRT models? Well, first of all, a large part of the literature utilises either JAGS or BUGS; indeed, code is publicly available for many of these models, which has helped to spread the use of these two modelling languages.[^3] For beginners, this is a handy way to learn, and it's how I learned. Indeed, the language of JAGS/BUGS (I'm just going to use 'JAGS' to refer to both from now on) is a bit more intuitive for many people, and given the availability of others' code, getting started with these models can be reduced to tinkering with the small details of code that is already written.

Stan, on the other hand, is newer and has a syntax that is in some ways quite different from JAGS. Variables need to be declared, as does their type (something not many R users are familiar with; I certainly wasn't). The model code is imperative, not declarative,[^4] and the code is organised into specific 'blocks' (there's a small sketch of this further below). Stan has a different default sampler and is generally argued by its creators to be much faster. Well, in my experience, there is no contest. As much as I liked JAGS when I started out, Stan is simply incomparable to JAGS in terms of speed for these models: Stan is much, much faster. I was analysing roll-call (nominal) vote data for the Brazilian Federal Senate[^5] (these data have plenty of missing values, which are handled easily in JAGS but have to be removed before the data are passed to Stan) and, using the runjags package (and its autorun option, sketched below), I discovered that it would take around 28 hours for my two-dimensional model to reach signs of convergence (or signs of non-convergence, as Gill puts it). As I was in the middle of writing a PhD thesis with lots of these models to process, that just wasn't an option. (In any case, whenever I did let the model run like this, R crashed or became unresponsive, or the estimates were simply of bad quality.) So I started tinkering with the options in runjags, trying different samplers and so on. Then I noticed exactly why JAGS is so slow for these models.
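As an aside, the runjags workflow I was using looked something like the sketch below. This is a one-dimensional, logit-link version of the ideal-point model; the variable names, priors, and the `votes` matrix are my own illustrative placeholders, not code from any particular paper:

```r
library(runjags)

# A one-dimensional, logit-link ideal-point model in JAGS
# (illustrative names and priors, mine):
jags_model <- "
model {
  for (i in 1:L) {
    for (j in 1:J) {
      y[i, j] ~ dbern(p[i, j])
      logit(p[i, j]) <- beta[j] * x[i] - alpha[j]
    }
  }
  for (i in 1:L) { x[i] ~ dnorm(0, 1) }   # ideal points
  for (j in 1:J) {
    beta[j]  ~ dnorm(0, 0.04)             # discrimination (precision 0.04, i.e. sd 5)
    alpha[j] ~ dnorm(0, 0.04)             # difficulty
  }
}"

# 'votes' is a hypothetical L x J matrix of 0s, 1s, and NAs.
# autorun.jags keeps extending the chains until its convergence
# diagnostics are satisfied, which is where the time estimate came from.
fit <- autorun.jags(model    = jags_model,
                    monitor  = c("x", "beta", "alpha"),
                    data     = list(y = votes, L = nrow(votes), J = ncol(votes)),
                    n.chains = 2)
```

Note that the vote matrix can contain NAs: JAGS treats missing outcomes as extra nodes to be sampled, which is why missingness is so painless there.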

In order to run a model, JAGS first compiles a Directed Acyclic Graph (DAG) of all the nodes in the model (software such as NIMBLE will let you print out this graph pretty easily). But since we have a latent regression with an unobserved regressor in the equation[^6]

$$
y_{ij} = \beta_j \mathbf{x}_i - \alpha_j
$$

then JAGS is unable to build such a DAG. Since it can't build a DAG, it can't surmise that there is conjugacy in the model and then exploit that through Gibbs sampling. So JAGS just falls back on the default Metropolis-Hastings sampler (and given that JAGS stands for Just Another Gibbs Sampler, this kind of misses the point of using JAGS in the first place). This means that the gains available through Gibbs sampling simply cannot be had for latent models of this type in JAGS, and hence the sampling process runs very slowly. I'm not sure the literature was ever aware of this fact, either. Many papers and books extol the virtues of Gibbs sampling (and spend pages and pages deriving the conditional distributions involved) and then show the reader how to do it in JAGS or BUGS (see Simon Jackman's book for an example),[^7] but unbeknownst to these authors, their JAGS programs are not using Gibbs sampling.
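For comparison, here's how the same one-dimensional sketch might look in Stan, called from R via rstan. Again, the names and priors are illustrative, and `votes_long` is a hypothetical data frame of observed votes (one row per legislator-vote pair) with the missing values already dropped:

```r
library(rstan)

# The same one-dimensional model in Stan. Votes go in as 'long' format:
# one row per observed legislator-vote pair, so missing votes are simply
# dropped from the data before they ever reach Stan.
stan_model_code <- "
data {
  int<lower=1> N;                   // number of observed votes
  int<lower=1> L;                   // number of legislators
  int<lower=1> J;                   // number of roll calls
  int<lower=1, upper=L> leg[N];     // legislator index for vote n
  int<lower=1, upper=J> item[N];    // roll-call index for vote n
  int<lower=0, upper=1> y[N];       // the vote: 1 = 'Yes', 0 = 'No'
}
parameters {
  vector[L] x;                      // ideal points
  vector[J] beta;                   // discrimination
  vector[J] alpha;                  // difficulty
}
model {
  x ~ normal(0, 1);
  beta ~ normal(0, 5);
  alpha ~ normal(0, 5);
  for (n in 1:N)
    y[n] ~ bernoulli_logit(beta[item[n]] * x[leg[n]] - alpha[item[n]]);
}"

# 'votes_long' is a hypothetical data frame with columns leg, item, and y,
# already stripped of missing votes (e.g. with na.omit())
fit <- stan(model_code = stan_model_code,
            data = list(N    = nrow(votes_long),
                        L    = max(votes_long$leg),
                        J    = max(votes_long$item),
                        leg  = votes_long$leg,
                        item = votes_long$item,
                        y    = votes_long$y),
            chains = 2, iter = 2000)
```

You can see the typed declarations and the separate data, parameters, and model blocks I mentioned earlier. (A real application would also need to deal with the reflection invariance of the ideal points, for example by constraining the signs of a couple of well-known legislators.)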

So that leaves us with Stan. Use it! 😎

In a future post, I'll show some fuller examples of IRT ideal-point models in Stan. I have some on my GitHub, and Pablo Barberá also has some nice examples (hat tip: I learned from him, amongst others. Thanks, Pablo!).

Update: Guilherme Jardim Duarte also has some Bayesian IRT examples on his GitHub, in particular the dynamic model of Martin & Quinn; have a look.


[^1]: For more on how these models can have tons of parameters, see Clinton, Jackman, and Rivers (2004), 'The Statistical Analysis of Roll Call Data', American Political Science Review, Vol. 98, No. 2.

[^2]: Kruschke mentions this in his book… not sure where, exactly.

[^3]: See this paper by Curtis (pdf downloads automatically) or the book by [Armstrong et al.](https://www.crcpress.com/Analyzing-Spatial-Models-of-Choice-and-Judgment-with-R/Armstrong-II-Bakker-Carroll-Hare-Poole-Rosenthal/9781466517158).

[^4]: See here for the difference.

[^5]: You can read about this research here.

[^6]: This is the canonical statistical model for Bayesian IRT. The data ($y_{ij}$) are the votes, in binary form (1 = 'Yes'; 0 = 'No'); the $\mathbf{x}_i$ are the ideal points of the legislators; and $\beta_j$ and $\alpha_j$ are the discrimination (slope) and difficulty (intercept) parameters, respectively. See the article cited in footnote 1.

[^7]: I don't mean to denigrate Jackman's book. It's highly detailed and thorough, and he deserves a lot of credit for spearheading the use of these Bayesian IRT models in political science. I've cited his work numerous times; I'm a fan.