Bayesian IRT in R and Stan

IRT

Stan

MCMC

Published

May 21, 2016

This blog post is outdated and the code will not run – for newer, cleaner IRT R code, see this github repo and this blog post.

The code below on Stan is also available as an RPub webpage, if you’d rather work through the examples than read all of the post.

One of the first areas where Bayesian modelling gained an entry point into the social sciences (and in particular political science) was in the area of legislator ideal points, with the use of the Item-Response Theory (IRT) models from the educational testing literature in psychology. This topic proved to be the perfect subject for the comparison of Bayesian and frequentist methods, since ideal point creation usually depends on nominal voting data, which may contain a lot of missing data (legislators who miss votes or abstain) and a huge number of parameters (hundreds of roll-calls by hundreds of legislators). The benefits of Bayesian methods over frequentist techniques for ideal point analysis is discussed at length elsewhere¹, but here I’ll talk about a side-effect of using Bayesian methods for creating ideal points from roll-call data, that is, the long time it can take to run these models on a desktop computer. (In the following discussion, I refer to ‘legislators’, but these IRT models apply to all types of response to a question, whether the ‘question’ is a vote by a politician or a judge or questions on a test or survey.)

To create ideal points in R, you have three or four main options if you want to use ol’ Bayes. First, there is the ready-made ideal() command of the package pscl by Simon Jackman & co. pscl includes some very handy little functions for those interested in generating ideal points from legislative voting data – summary statistics and plots are all easy to make, and come ready-made, such as party loyalty statistics, for example. However, ideal() suffers somewhat from being so ‘ready’: it is a bit unsuited for more complex or indivualistic models compared to some of the options mentioned later. I’ve also repeatedly run into problems with ideal() when trying to use some of the pscl package options (dropList(), for example), or when estimating multidimensional models. In terms of MCMC, only one chain at a time may be run. In fact, it is what it says on the tin: it’s a Bayesian version of W-NOMINATE, which means it has the advantages of that program (easy to use) and the disadvantages (when it doesn’t work you’re not sure why…a bit ‘black-box’).

The MCMCpack package also allows for the creation of ideal points, although its output is slightly less friendly to the beginner (an mcmclist object). Its MCMCirt1d() command is pretty similar to ideal() but allows for setting ‘soft’ constraints rather than the spike prior that pscl uses to pin down the position of (at least) two legislators. This is better for two reasons, in my opinion. First, it avoids a hard constraint on legislators for legislatures in which we do not have strong a priori evidence to suppose that, for example, Legislator X is an extremist to the right, or Legislator Y to the left (the use of extremist legislators on either end of the supposed scale ‘anchors’ it). With MCMCpack, the ideal points of the constrained legislators are drawn from a truncated normal distribution (truncated at zero) and so Legislator X (our extremist to the right) simply cannot have an ideal point on the left side of the scale and the opposite for our left-side extremist legislator (the use of these soft constraints obviates the need for them actually being extremists too). I’ve also found MCMCpack to be faster, although I haven’t tested that formally. In either case, both functions are quite similar. MCMCpack also has functions for dynamic models, robust & multidimensional models, and Ordinal IRT. They’ve all worked well for me with the exception of MCMCirtkd(), the multidimensional model function, which never seems to get started.

The next option is to use the BUGS modelling language, either with BUGS itself or its cousin JAGS, both of which have been heavily used in the literature but can be extremely slow. I don’t recommend their use for ideal points.

Next, we have Stan, which doesn’t have the simpler syntax of JAGS & BUGS, but is simply incomparably better in terms of speed. However, since it’s newer, you won’t find the amount of resources available for BUGS, for example (like here). There are a few resources: a simple one-dimensional model can be seen on Pablo Barberá’s github; a friend of mine, Guilherme Duarte, has an example of a dynamic model on his github too.

There are some other resources available, but relate to slightly different IRT models, more common in the educational-testing literature, and less so in ideal point studies: the ‘Rasch’ model; the 2PL model (in which a ‘yes’ answer has a specific associated movement in the dimensional space and the discrimination parameter only takes on postive values; in the ideal-point model of Jackman it can possess negative and positive values).

Since there are so few Stan resources for ideal point IRT models, I thought I’d post a few models here. The code is also available as an RPub webpage, as mentioned earlier. The statistical model we’ll employ is:

\[ y_{ij} = \beta_j \bf{x_i} - \alpha_j, \]

where \(y_{ij}\) are the votes, in binary form (1 = ‘Yes’; 2 = ‘No’); the \(\bf x_i\) are the ideal points of the legislators; and \(\beta_j\) and \(\alpha_j\) are the discrimination and difficulty parameters of the model.

Starting from scratch in R in a new session (you’ll need a C++ compiler if you don’t have one, see here):

install.packages("rstan")
library("rstan")

Ideal points are created from a j m matrix of voting data (j legislators voting on m votes), coded 1 for ‘yes’ and 0 for ‘no’ and abstentions. Missing data are NA, and are deleted out before running in Stan. We can easily simulate data for this type of thing, but let’s use a real database. This data is from the 53rd legislature of the Brazilian Federal Senate (with thanks to CEBRAP, who built the original database, this comes from an extended version I created), we’ll download it from my Github repo. You’ll need to install readr if you don’t have it. (I also have the bad habit of naming my data as “data”… not generally a great idea. It’ll be ok here, though.)

library(readr)
data <- read_csv("https://raw.githubusercontent.com/RobertMyles/Bayesian-Ideal-Point-IRT-Models/master/Senate_Example.csv")
colnames(data)

So let’s take a look at the data. You’ll see the column names are “VoteNumber”, “SenNumber”, “SenatorUpper”, “Vote”, “Party”, “GovCoalition”, “State”, “FP”, “Origin”, “Contentious”, “IndGov”, and “VoteType”. I’ve kept them in this state so that we can tidy things up and manipulate things a little, stuff you’ll probably have to do any time you deal with real data of this sort. We can also have a look later at different plotting options using some of these variables. First, let’s change the votes, which are in the format “S” (Sim, ‘Yes’), “N” (‘No’), “A” (Abstention), and “O” (Obstruction), to numeric format.

data$Vote[data$Vote=="S"] <- 1
data$Vote[data$Vote=="N"] <- 0
data$Vote[data$Vote  %in% c(NA,"O","A")] <- NA
data$Vote <- as.numeric(data$Vote)

Next, we’ll create the ‘vote matrix’. This is the j \(\times\) m matrix that we will use to create the ideal points with Stan. The rows will be the legislators and the columns the votes. We will also need to deal with the issue of ‘constraints’: we need to identify d(d + 1) legislators in d dimensions and constrain their ideal points in some way. For now, we’ll just organise our vote matrix in such a way that the two legislators that will be constrained are placed in rows 1 and 2 of the matrix. For this example, we can use Senators Agripino and Suplicy, who belong to two parties that are generally considered to be on opposite sides of the political ‘space’ that we will place our ideal points upon. Organizing things in this way is not necessary but makes the Stan model code cleaner later on.

data$FullID <- paste(data$SenatorUpper, data$Party, sep=":")
NameID <- unique(data$FullID)
J <- length(unique(NameID))
M <- length(unique(data$VoteNumber))
grep("JOSE AGRIPINO:PFL", NameID)  #34
grep("EDUARDO SUPLICY:PT", NameID) #12
NameID <- NameID[c(34, 12, 1:11, 13:33, 35:J)]

y <- matrix(NA,J,M)
Rows <- match(data$FullID, NameID)
Cols <- unique(data$VoteNumber)
Columns <- match(data$VoteNumber, Cols)

for(i in 1:dim(data)[1]){
  y[Rows[i], Columns[i]] <- data$Vote[i]
}

dimnames(y) <- list(unique(NameID), unique(data$VoteNumber))

I presume you’re using RStudio. Clicking on the viewer should show you the vote matrix, which should look like this:

Next we’ll make a dataframe of legislator variables which we’ll use later on, and one of vote characteristics.

ldata <- data.frame(FullID = unique(NameID),
                    Party = data$Party[match(unique(NameID),
                                             data$FullID)],
                    GovCoalition = data$GovCoalition[match(unique(NameID),
                                                           data$FullID)],
                    Name = data$SenatorUpper[match(unique(NameID),
                                                   data$FullID)],
                    State = data$State[match(unique(NameID),
                                             data$FullID)],
                    row.names = NULL,
                    stringsAsFactors = FALSE)

vdata <- data.frame(VoteNumber = unique(data$VoteNumber),
                    VoteType = data$VoteType[match(unique(data$VoteNumber),
                                                   data$VoteNumber)],
                    SenNumber = data$SenNumber[match(unique(data$VoteNumber),
                                                     data$VoteNumber)],
                    Origin = data$Origin[match(unique(data$VoteNumber),
                                               data$VoteNumber)],
                    Contentious = data$Contentious[match(unique(data$VoteNumber),
                                                         data$VoteNumber)],
                    IndGov = data$IndGov[match(unique(data$VoteNumber),
                                               data$VoteNumber)],
                    stringsAsFactors = F)

Stan is not like JAGS and BUGS in that NA is unwieldy to incorporate. The best thing to do is to delete missing data out, as can be seen in Barberá’s script linked to earlier, which I’ll copy here.

N <- length(y)
j <- rep(1:J, times=M)
m <- rep(1:M, each=J)

miss <- which(is.na(y))
N <- N - length(miss)
j <- j[-miss]
m <- m[-miss]
y <- y[-miss]

Next, we’ll set our initial values. There are various ways to do this, ranging from leaving it up to Stan (i.e. not setting any values) to creating lists with specific starting values for each parameter. What we’ll do here is use the starting values as a way to start the parties off in separate places. This has several advantages: we already know that these parties don’t vote together very often (i.e., they are parties of the government and the opposition) and so we can speed up the model by starting the legislators off where we already know they’ll be (i.e. right-wing parties on the right etc.). This also has the benefit of making it less likely that we’ll end up with ‘sign-flips’, where a legislator with a bi-modal posterior distribution has an ideal point from the ‘wrong’ mode.² For the discrimination and difficulty paramters, we’ll use a random sample from normal distributions. We’ll also save all this information as stan.data, which is the list of data we’ll use with Stan.

ldata$ThetaStart <- rnorm(J, 0, 1)
ldata$ThetaStart[ldata$Party=="PFL" | ldata$Party=="PTB" | ldata$Party=="PSDB" | ldata$Party=="PPB"] <- 2
ldata$ThetaStart[ldata$Party=="PT" | ldata$Party=="PSOL" | ldata$Party=="PCdoB"] <- -2
ThetaStart <- ldata$ThetaStart

initF <- function() {
  list(theta=ThetaStart, beta=rnorm(M, 0, 2), alpha=rnorm(M, 0, 2))
}

stan.data <- list(J=J, M=M, N=N, j=j, m=m, y=y, ThetaStart=ThetaStart)

Stan model code differs from those mentioned above in a few aspects. Firstly, variables need to be declared, along with their type. For example, J, which is our index for the number of senators, is declared in the following code as an integer. The parameters are likewise declared, as real numbers. The model code has three blocks: data, parameters and the model itself (there are other blocks possible, such asgenerated data, see the Stan manual. Stan code is also imperative – the order of the blocks matters.

stan.code <- "
    data {
    int<lower=1> J; //Senators
    int<lower=1> M; //Proposals
    int<lower=1> N; //no. of observations
    int<lower=1, upper=J> j[N]; //Senator for observation n
    int<lower=1, upper=M> m[N]; //Proposal for observation n
    int<lower=0, upper=1> y[N]; //vote of observation n
    }
    parameters {
    real alpha[M];
    real beta[M];
    real theta[J];
    }
    model {
    alpha ~ normal(0,5);
    beta ~ normal(0,5);
    theta ~ normal(0,1);
    theta[1] ~ normal(1, .01);
    theta[2] ~ normal(-1, .01);
    for (n in 1:N)
    y[n] ~ bernoulli_logit(theta[j[n]] * beta[m[n]] - alpha[m[n]]);
    }"

This IRT model can be run using either the logistic or probit link function, however, since Stan has a built in bernoulli_logit, we’ll use that. You can see from the model block above that we have specified specific prior distributions for theta[1] and theta[2]. These are our constrained legislators – Agripino and Suplicy. We can do this using truncated normal distributions in Stan (i.e. theta[1] ~ normal(1, .01)T[0,], for example), but in my experience this makes things slower and increases the number of divergent transitions reported by Stan. We then use the stan() command to run our model in Stan. Here, I’m using 1000 iterations just to show (as it doesn’t take too long); these IRT models generally need more iterations than other models, for good estimates from this data, I run 5000 iterations with 2500 burn-in. A couple of hundred iterations usually suffices in Stan, depending on the model. The number of chains and cores are linked to what I have available on my computer. You can check this with the parallel package using detectCores(). A quick way to check convergence of the chains is with a graph of Rhat, shown below.

stan.fit <- stan(model_code=stan.code, data=stan.data, iter=3000,
                 warmup=1500, chains=4, thin=5, init=initF,
                 verbose=TRUE, cores=4, seed=1234)

stan_rhat(stan.fit, bins=60)

Values of Rhat should be 1.03 or lower. As you can see, even from 1000 iterations, we can be confident these chains are converging.

Graphing Ideal Points

I find the best way to plot ideal points is by using ggplot2. It’s automatically loaded as part of rstan. I also prefer to use an mcmc.list object, simply because I’m more used to it. But you can use the stan.fit object directly if you prefer.

MS <- As.mcmc.list(stan.fit)
sMS <- summary(MS)

There are various things we can plot from the summary above. Of main interest is usually the ideal points, so we’ll start with those first. First, let’s extract the ideal points (“theta”) from the summary, along with the lower and upper ends of the 95% credible interval:

Theta <- sMS$statistics[grep("theta", row.names(sMS$statistics)),1]
ThetaQ <- sMS$quantiles[grep("theta", row.names(sMS$statistics)),c(1,5)]
Theta <- as.data.frame(cbind(Theta, ThetaQ))
rm(ThetaQ)
Theta$FullID <- ldata$FullID
row.names(Theta) <- NULL
colnames(Theta)[1:3] <- c("Mean", "Lower", "Upper")
Theta <- merge(Theta, ldata, by="FullID")
Theta <- Theta[order(Theta$Mean),]

Now we have a dataframe of legislator characteristics alng with their ideal points. Since we’re dealing with a one-dimensional model here, the most straight-forward way to plot is along a scale ranging from the lowest ideal point to the highest. Here, I’ll colour the ideal points and their intervals by membership of the government coalition. I’ve used some other plotting options to make this plot the way I like it, but it’s easy to change things to your taste in ggplot2.

Y <- seq(from=1, to=length(Theta$Mean), by=1)

ggplot(Theta, aes(x=Mean, y=Y)) +
  geom_point(aes(colour=GovCoalition),
             shape=19, size=3) +
  geom_errorbarh(aes(xmin = Lower, xmax = Upper,colour = GovCoalition),
                 height = 0) +
  geom_text(aes(x = Upper, label = FullID, colour = GovCoalition),
            size = 2.5, hjust = -.05) +
  scale_colour_manual(values = c("red", "blue")) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title = element_blank(),
        legend.position = "none",
        panel.grid.major.y = element_blank(),
        panel.grid.major.x = element_line(linetype = 1,
                                          colour = "grey"),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "white"),
        panel.border = element_rect(colour = "black", fill = NA,
                                    size = .4)) +
  scale_x_continuous(limits = c(-2.7, 4))

Of course, that’s not all the information we have in our ldata dataframe. We could plot things by party or by state. Let’s plot something by region (since there are a lot of states):

St <- Theta[is.na(Theta$State)==FALSE,]  # take out president
St$Region <- NA
SE <- c("SP", "RJ", "ES", "MG")
S <- c("RS", "PR", "SC")
N <- c("AM", "RO", "RR", "TO", "PA", "AC", "AP")
CW <- c("DF", "GO", "MT", "MS")
NE <- c("CE", "MA", "AL", "RN", "PB", "SE", "PI", "BA", "PE")
St$Region[St$State %in% SE] <- "South-East"
St$Region[St$State %in% S] <- "South"
St$Region[St$State %in% NE] <- "North-East"
St$Region[St$State %in% CW] <- "Centre-West"
St$Region[St$State %in% N] <- "North"

nameorder <- St$FullID[order(St$Region, St$Mean)]
St$FullID <- factor(St$FullID, levels=nameorder)

ggplot(St, aes(x=Mean, y=FullID)) +
  geom_point(size = 3, aes(colour = Region)) +
  geom_errorbarh(aes(xmin = Lower, xmax = Upper, colour = Region),
                 height = 0) +
  facet_grid(Region ~ ., scales = "free_y") +
  scale_colour_manual(values = c("orange", "black", "red",
                                 "blue", "darkgreen")) +
  theme_bw()

We can also analyse the other parameters of the model, and run multidimensional models too. See the RPub for the code for these.

Footnotes

There are many discussions on this topic, but Clinton & Jackman (2009) is a good place to start. An earlier paper by Clinton, Jackman & Rivers makes the point somewhat more forcefully.↩︎
For more on this point, see Jackman 2001, Rivers 2003 paper cited in the main text, or the Appendix of my PhD thesis.↩︎