Counting "Select All That Apply" Questions in Qualtrics
Qualtrics Messy Data
My friend Devon Cantwell reached out with an interesting messy-data problem caused by how Qualtrics exports “select all that apply” questions. For example, in her (mock) survey, she asks students to select all the colors that they personally find attractive from a list. When downloaded from Qualtrics, the data frame looks like this:
glimpse(dat)
## Rows: 940
## Columns: 4
## $ color_1 <fct> Sparkle, Blue, Blue, Sparkle, Blue, Sparkle, Sparkle, Green, B~
## $ color_2 <fct> NA, Moldy Book, NA, Moldy Book, Moldy Book, Honey Bee, Moldy B~
## $ color_3 <fct> NA, Apple Core Brown, NA, Apple Core Brown, NA, NA, NA, NA, NA~
## $ color_4 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
So all students pick at least one color, some pick two, but relatively few pick three or four. One thing we might want to know is the first color selected by each respondent. That’s relatively easy:
dat %>% count(color_1)
## # A tibble: 8 x 2
## color_1 n
## <fct> <int>
## 1 Blue 233
## 2 Green 134
## 3 Yellow 14
## 4 Sparkle 189
## 5 Apple Core Brown 6
## 6 Honey Bee 13
## 7 Moldy Book 42
## 8 <NA> 309
But this only tells us the first color selected, not how many times each color was selected. What if we want to count every instance where “Moldy Book” was selected, across all columns? Or, more succinctly, get that count for every color at once? Because the columns are not ordered in any way, and respondents weren’t asked to rank their preferences, we need to count across the variables.
We can use tidyr for a quick solution:
library(tidyr)
# Stack every color column into a single "value" column, dropping NAs, then tally
dat %>%
  gather(key, value, na.rm = TRUE) %>%
  count(value)
## Warning: attributes are not identical across measure variables;
## they will be dropped
## # A tibble: 7 x 2
## value n
## <chr> <int>
## 1 Apple Core Brown 78
## 2 Blue 233
## 3 Green 134
## 4 Honey Bee 32
## 5 Moldy Book 222
## 6 Sparkle 230
## 7 Yellow 38
Good thing we checked! It turns out that Sparkle (230) and Moldy Book (222) are basically just as popular as Blue (233)! If we had stopped after checking only the first color picked, where Moldy Book shows up just 42 times, our inference about color preference would have been way off.
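In current versions of tidyr, gather() is superseded by pivot_longer(), so here is a minimal sketch of the same count written that way. The `toy` data frame is a made-up stand-in for the Qualtrics export (its values are invented for illustration, not taken from the survey), and the sketch assumes dplyr 1.0 or later for across(). Converting the factor columns to character first is one way to sidestep the mismatched-attributes warning shown above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data shaped like the Qualtrics export (not the real survey)
toy <- tibble(
  color_1 = factor(c("Blue", "Sparkle", "Blue", NA)),
  color_2 = factor(c("Moldy Book", NA, NA, "Moldy Book")),
  color_3 = factor(c(NA, "Honey Bee", NA, NA))
)

counts <- toy %>%
  # Factors with different levels are converted to plain character before stacking
  mutate(across(starts_with("color_"), as.character)) %>%
  # One row per (respondent, selected color); unselected (NA) cells are dropped
  pivot_longer(
    cols = starts_with("color_"),
    names_to = "key",
    values_to = "value",
    values_drop_na = TRUE
  ) %>%
  # Tally how often each color was selected across all columns
  count(value)

counts
```

To try this on the real export, swapping `toy` for `dat` should be enough, provided the select-all columns all share the `color_` prefix as they do in the glimpse() output above.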