Tidying STM with tidytext

Last updated on Jun 23, 2020 3 min read R, Research

Libraries

library(tidytext)
library(ggthemes)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(scales)

Load Previous STM Objects

I have previously run stm models for topics ranging from 3 to 25. Based on the fit indices, a six-topic model was selected. I am not showing that analysis here, but instead loading the saved stm objects (the model and the ‘out’ object).

Tidy the stm model

I am using the tidytext package from Julia Silge, which she demonstrates in a blog post here.

td_beta <- tidy(modelfit6)
td_beta

## # A tibble: 1,416 x 3
##    topic term        beta
##    <int> <chr>      <dbl>
##  1     1 abil    9.13e-61
##  2     2 abil    6.07e-55
##  3     3 abil    1.51e- 2
##  4     4 abil    2.98e-54
##  5     5 abil    3.66e-63
##  6     6 abil    2.39e-58
##  7     1 account 1.57e-91
##  8     2 account 1.37e-25
##  9     3 account 4.28e-66
## 10     4 account 6.47e-63
## # ... with 1,406 more rows

td_gamma <- tidy(modelfit6, matrix = "gamma",
                 document_names = rownames(out$meta))

td_gamma

## # A tibble: 1,266 x 3
##    document topic   gamma
##    <chr>    <int>   <dbl>
##  1 1            1 0.00343
##  2 2            1 0.00462
##  3 3            1 0.0861 
##  4 4            1 0.0467 
##  5 5            1 0.0106 
##  6 6            1 0.00306
##  7 7            1 0.00419
##  8 8            1 0.0511 
##  9 9            1 0.00356
## 10 10           1 0.144  
## # ... with 1,256 more rows

Plot Topic Relevance

top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest()

gamma_terms <- td_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

gamma_terms %>%
  top_n(6, gamma) %>%
  ggplot(aes(topic, gamma, label = terms, fill = topic)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0.85, nudge_y = 0.0005, size = 3) +
  coord_flip() +
  theme_hc() +
  theme(plot.title = element_text(size = 14)) +
  labs(x = NULL, y = expression(gamma),
       title = "Top Six Topics by Prevalence in the Officer Responses",
       subtitle = "With the top words that contribute to each topic")

Table of Topic Proportions with Top Terms

require(knitr)
gamma_terms %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3, 
        col.names = c("Topic", "Expected topic proportion", "Top 6 terms"))

Topic	Expected topic proportion	Top 6 terms
Topic 6	0.240	complaint, protect, public, fals, make, tool, time
Topic 5	0.214	affect, way, chang, made, turn, situat, also
Topic 3	0.192	camera, job, help, bodi, wear, captur, action
Topic 4	0.149	offic, peopl, work, worri, one, say, good
Topic 2	0.117	use, video, will, call, without, review, someth
Topic 1	0.088	know, person, feel, wear, like, supervisor, think

police R Markdown STM survey

Ian T. Adams, Ph.D.

Assistant Professor, Department of Criminology & Criminal Justice

My research interests center around policing policy, people, behavior, and technology.