Developing Race and Gender Estimates for US Law Enforcement Leadership

A Use-Case for the predictrace Package


Researchers might be interested in developing a descriptive understanding of the gender and race composition of a particular industry, organization, or other institution. Oftentimes this is done with sampling from a population. This is the case in law enforcement. With approximately 18,000 sub-federal law enforcement agencies in the United States, and somewhere around 800,000 officers, it can be a challenging environment for researchers. Given the huge variation in agency type, size, composition, etc., generalizing across “law enforcement” is tricky at best.

In this preliminary analysis, I attempt a population-level inference for US law enforcement agencies, to develop estimates of race and gender proportions in the “chief executive” spot. The chief executive of a sheriff’s office is the Sheriff (often elected), while in a state-level agency it might be an Executive Director; there is a lot of variation.

Gender and Race in US Law Enforcement

John Shjarback and Natalie Todak (2019) use data from the 2013 Law Enforcement Management and Administrative Statistics (LEMAS) survey to analyze correlates of women in supervisory, mid-level, and chief executive roles in 2,826 municipal police departments, sheriff’s offices, and primary state law enforcement agencies. The 2013 LEMAS was the first national survey to report data at this level, and just 2.7% of the agencies were led by women. My goal here is to see how estimates drawn from a commercial database covering a much larger set of agencies, combined with a probabilistic inference of gender and race, compare to the LEMAS estimates.

The 2016 LEMAS estimates that among chiefs across all sizes of local agencies, 89.6% were White, 4% Black, 3.1% Hispanic, and 2.4% other. It also estimates that in those same agencies, just 2.6% of chiefs were female. However, the 2016 sample design results in 2,612 local agencies (rather than the larger sample of all agencies), and uses stratified sampling that intentionally oversamples the largest agencies (100+ full-time officers).

Another approach is to obtain population-level information and infer race and gender for the individuals from that information. Jacob Kaplan has developed the predictrace package to do just that. The package estimates probabilities of race and gender from a subject’s first name (and, for race, from a surname as well). This is from the package’s introduction:

The goal of predictrace is to predict the race of a surname or first name and the gender of a first name. This package uses U.S. Census data which says how many people of each race has a certain surname. For first name data, this package uses data from Tzioumis (2018). From this we can predict which race is mostly likely to have that surname or first name. The possible races are American Indian, Asian, Black, Hispanic, White, or two or more races. For the gender of first names, this package uses data from the United States Social Security Administration (SSA) that tells how many people of a given name are female and how many are male (no other genders are included). I use this to determine the proportion of each gender a name is, and use the gender with the higher proportion as the most likely gender for that name. Please note that the Census data on the race of first names is far smaller than the SSA data on the gender of first names, so you will match far fewer first names to race than to gender.


In this short demonstration, I will attempt to develop race and gender estimates for individuals who lead US law enforcement agencies. To do so, I will rely on a commercial dataset from the National Directory of Law Enforcement Administrators (NDLEA). The dataset contains just over 37,000 listings for the chief administrator of law enforcement organizations at every level of the US system - from municipal police to heads of major federal agencies like the FBI, and everything in-between. The company that puts this database together commits to contacting every agency on the list at least once a year, and the company representative I spoke to said they are closer to once every three months. In my experience the dataset has been very reliable when I need to contact a head administrator directly.

However, in order to constrain the analysis, I will just be looking at Campus Law Enforcement, County Sheriffs, and Municipal Law Enforcement agencies (n = 17,104). Because I later look at some correlations with population, I drop any observations missing that information (n = 204), leaving a total of 16,900 observations. I’ll also reduce this to a simpler dataset by retaining only the department type, first name of administrator, state, and population served.
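
The subsetting described above can be sketched in base R. This is only an illustration under stated assumptions: the `ndlea` data frame, its rows, the `LastName` column, and the "State Law Enforcement" label are hypothetical stand-ins, since the commercial file itself is not shown here. Only the four retained column names come from the output in this post.

```r
# Hypothetical stand-in for the NDLEA listings (made-up rows; the real
# file and its full set of columns are not shown in this post).
ndlea <- data.frame(
  DeptType     = c("Municipal Law Enforcement", "County Sheriffs",
                   "Campus Law Enforcement", "State Law Enforcement",
                   "Municipal Law Enforcement"),
  FirstName    = c("Paul", "Kelly", "Dave", "Hector", "Ron"),
  LastName     = c("A", "B", "C", "D", "E"),
  MailingState = c("VA", "OK", "MA", "CA", "MI"),
  Population   = c("<25,000", "25k-50k", "<25,000", "1M-10M", NA),
  stringsAsFactors = FALSE
)

keep_types <- c("Campus Law Enforcement", "County Sheriffs",
                "Municipal Law Enforcement")

# Keep the three agency types, drop rows with no population band,
# and retain only the four columns used in the rest of the analysis.
chiefs <- subset(
  ndlea,
  DeptType %in% keep_types & !is.na(Population),
  select = c(DeptType, FirstName, MailingState, Population)
)

print(chiefs)
```

In this toy example the state-level row and the row with a missing population band are dropped, leaving three rows.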

Let’s check and see if that looks right.

DeptType                   FirstName  MailingState  Population
Campus Law Enforcement     Dave       MA            <25,000
Municipal Law Enforcement  Paul       VA            <25,000
Municipal Law Enforcement  Justin     AK            100k-1M
Municipal Law Enforcement  Thomas     MI            25k-50k
Municipal Law Enforcement  Troy       MO            <25,000
Municipal Law Enforcement  Ronald     MA            <25,000
Municipal Law Enforcement  Julian     CA            100k-1M
Municipal Law Enforcement  John       OK            <25,000
Municipal Law Enforcement  David      MI            <25,000
Municipal Law Enforcement  Matt       PA            <25,000

The population data is pretty spotty (one outlier, caused by a typo, put the population of Shelby County, TN, at over 93 million! I fixed it behind the scenes), but that’s not our main focus today. Overall, it’s looking pretty good!

Inferring Race and Gender from First Name Data

Kaplan’s predictrace package will derive a gender and race classification for the first names contained in our dataset. First we’ll call predict_gender, and then predict_race, to build the initial lists.
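
A minimal sketch of those two calls is below. It assumes the predictrace package is installed. The `surname = FALSE` argument reflects my reading of the package documentation, so check `?predict_race` before relying on it. The four names are taken from the sample output in this post.

```r
library(predictrace)

first_names <- c("Steve", "Hector", "Kelly", "Eliezer")

# These are first names, so surname = FALSE; with the default
# (surname = TRUE) the input would be matched against Census surnames.
race_guess   <- predict_race(first_names, surname = FALSE)
gender_guess <- predict_gender(first_names)

# Each result has one row per input name, with the best guess and the
# underlying probabilities (NA where the name has no match).
race_guess[, c("name", "likely_race", "probability_white")]
gender_guess[, c("name", "likely_gender", "probability_female")]
```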

As you can see, the package reports probabilities for each entry, and gives a best-guess (likely_gender and likely_race) given those probabilities.

name         match_name   likely_race  probability_american_indian  probability_asian  probability_black  probability_hispanic  probability_white  probability_2races
Steve        steve        white        0.0024                       0.0721             0.0221             0.0483                0.8540             0.0010
Eliezer      eliezer      NA           NA                           NA                 NA                 NA                    NA                 NA
Hector       hector       hispanic     0.0000                       0.0135             0.0045             0.9270                0.0550             0.0000
Ron          ron          white        0.0034                       0.0469             0.0402             0.0235                0.8844             0.0017
James        james        white        0.0012                       0.0147             0.0328             0.0100                0.9402             0.0012
Desiree      desiree      white        0.0030                       0.0334             0.1246             0.1155                0.7143             0.0091
Kevin        kevin        white        0.0006                       0.0324             0.0284             0.0082                0.9296             0.0009
Rick         rick         white        0.0029                       0.0284             0.0073             0.0277                0.9314             0.0022
Christopher  christopher  white        0.0013                       0.0140             0.0200             0.0179                0.9454             0.0014
Karl         karl         white        0.0007                       0.0260             0.0281             0.0070                0.9374             0.0007

name         match_name   likely_gender  probability_female  probability_male
Michael      michael      male           0.0049518           0.9950482
Berkley      berkley      female         0.6417722           0.3582278
Kelly        kelly        female         0.8523312           0.1476688
Donald       donald       male           0.0039238           0.9960762
Alfonzo      alfonzo      male           0.0000000           1.0000000
Joseph       joseph       male           0.0040515           0.9959485
Scott        scott        male           0.0033662           0.9966338
Christopher  christopher  male           0.0046306           0.9953694
Dennis       dennis       male           0.0042935           0.9957065
Donald       donald       male           0.0039238           0.9960762

So now let’s add the best guess from the predictrace package back to our original data and get a quick feel for the overall distribution of gender and race.

Table 1: Summary Statistics
Variable                     N      Percent
DeptType                     16900
  Municipal Law Enforcement  11697  69.2%
  Campus Law Enforcement     2038   12.1%
  County Sheriffs            3165   18.7%
Population                   16900
  <25,000                    13614  80.6%
  25k-50k                    1562   9.2%
  50k-100k                   868    5.1%
  100k-1M                    797    4.7%
  1M-10M                     59     0.3%
gender                       16619
  male                       15583  93.8%
  female                     1036   6.2%
race                         16175
  white                      15844  98%
  black                      67     0.4%
  hispanic                   234    1.4%
  hispanic, white            2      0%
  asian                      27     0.2%
  asian, white               1      0%


Let’s break down the race and gender estimates by the population of the area served by the agency. Because of the very low counts in Hispanic/White and Asian/White, I’m going to collapse those into the Hispanic and Asian categories respectively. As population data for very small areas (<1,000 pop.) can be spotty in the NDLEA, we lose some observations.
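
The collapsing step can be sketched in base R as follows. The `likely_race` vector is illustrative rather than the real data, and the approach shown is just one simple way to recode the tied labels. In the actual counts this moves 2 names into Hispanic (234 to 236) and 1 into Asian (27 to 28).

```r
# Illustrative best-guess labels in the shape reported by predict_race,
# including the two tied categories and an unmatched name (NA).
likely_race <- c("white", "hispanic, white", "asian", "asian, white",
                 "black", NA, "hispanic")

# Fold each tied category into its non-White label. %in% is used
# instead of == so that NA entries are simply left untouched.
race_collapsed <- likely_race
race_collapsed[likely_race %in% "hispanic, white"] <- "hispanic"
race_collapsed[likely_race %in% "asian, white"]    <- "asian"

# Tabulate, keeping unmatched names visible as their own category.
table(race_collapsed, useNA = "ifany")
```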

Table 2: Race and Gender of Chief Administrator, by Population Served
Variable  Overall, N = 16,900¹  <25,000, N = 13,614¹  25k-50k, N = 1,562¹  50k-100k, N = 868¹  100k-1M, N = 797¹  1M-10M, N = 59¹
White     15,844 (97.95%)       12,796 (98.13%)       1,483 (97.95%)       793 (97.42%)        720 (95.87%)       52 (92.86%)
Black     67 (0.41%)            50 (0.38%)            6 (0.40%)            6 (0.74%)           4 (0.53%)          1 (1.79%)
Hispanic  236 (1.46%)           175 (1.34%)           23 (1.52%)           14 (1.72%)          21 (2.80%)         3 (5.36%)
Asian     28 (0.17%)            19 (0.15%)            2 (0.13%)            1 (0.12%)           6 (0.80%)          0 (0.00%)
Unknown   725                   574                   48                   54                  46                 3
male      15,583 (93.77%)       12,569 (93.85%)       1,456 (94.12%)       791 (93.94%)        718 (92.17%)       49 (83.05%)
female    1,036 (6.23%)         823 (6.15%)           91 (5.88%)           51 (6.06%)          61 (7.83%)         10 (16.95%)
Unknown   281                   222                   15                   26                  18                 0

¹ n (%)

Perhaps unsurprisingly, law enforcement agencies are predominantly led by men. However, there may have been progress over the past decade or so. Compared to the 2013 LEMAS data, which estimated that just 2.7% of agencies were led by women, my analysis estimates that 6.2% of agencies overall are led by women. The proportion of women-led agencies is stable at around 6% until we get to the larger population centers, and in the largest (between 1M and 10M pop.), 17% of agencies are led by women (10 of 59). This is much larger than the 8.5% suggested by the 2016 LEMAS, though the largest category there is 250,000+ population.

In terms of racial characteristics, this analysis suggests that, overall, 98% of agencies are led by White chief executives. This percentage is negatively correlated with population. In other words, the percentage of White chief executives tends to decrease as the size of population served increases. Even at the top-end of population size, however, these positions are heavily skewed, as seen in the largest (1M to 10M) areas, where 93% of chief executives are estimated to be White.
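
As a quick check on that pattern, the White percentages in Table 2 can be recomputed from the counts. The sketch below uses only numbers from the table. The one thing to note is the denominator: the percentages are taken over names that matched a race, so the Unknown row is excluded.

```r
# Counts copied from Table 2, ordered from smallest to largest band.
bands   <- c("<25,000", "25k-50k", "50k-100k", "100k-1M", "1M-10M")
white   <- c(12796, 1483, 793, 720, 52)
n_total <- c(13614, 1562, 868, 797, 59)
unknown <- c(574, 48, 54, 46, 3)

# Share of White best guesses among names with a race match.
white_share <- white / (n_total - unknown)
names(white_share) <- bands

round(100 * white_share, 2)
# 98.13 97.95 97.42 95.87 92.86

# TRUE when the share falls at every step up in population band.
all(diff(white_share) < 0)
```

The share drops at every step, though the largest band rests on only 56 matched names.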

Let’s see if the proportions hold across agency types as well.

Table 3: Race and Gender of Chief Administrator, by Department Type
Variable  Overall, N = 16,900¹  Municipal Law Enforcement, N = 11,697¹  Campus Law Enforcement, N = 2,038¹  County Sheriffs, N = 3,165¹
White     15,844 (97.95%)       11,035 (98.12%)                         1,866 (96.73%)                      2,943 (98.10%)
Black     67 (0.41%)            40 (0.36%)                              15 (0.78%)                          12 (0.40%)
Hispanic  236 (1.46%)           155 (1.38%)                             45 (2.33%)                          36 (1.20%)
Asian     28 (0.17%)            16 (0.14%)                              3 (0.16%)                           9 (0.30%)
Unknown   725                   451                                     109                                 165
male      15,583 (93.77%)       10,910 (94.56%)                         1,729 (86.71%)                      2,944 (95.37%)
female    1,036 (6.23%)         628 (5.44%)                             265 (13.29%)                        143 (4.63%)
Unknown   281                   159                                     44                                  78

¹ n (%)

As you can see, based on these results, agency type does not seem to be correlated with higher percentages of non-White chief executives. However, campus law enforcement agencies are much more likely than other agency types to be led by women: over 13%, compared to the average of 6.2% overall.


A lot of investigation is needed before relying on these estimates, as they are even more overwhelmingly White than previous reporting would suggest. Recall that the 2016 LEMAS estimated that among local agency chiefs, 89.6% were White, 4% Black, 3.1% Hispanic, and 2.4% other race. The differences here suggest more analysis is needed, but several obvious explanations present themselves. It may be that there are substantial gaps between the sampling in the LEMAS and a population-level estimate. Alternatively, the probabilities themselves may be skewing towards White likelihoods. The inclusion of more than just local agencies in this analysis also deserves some thought, as there may be agency characteristics that lead to higher proportions of non-Whites being selected for the top job.

Some of the gaps are too large to comfortably chalk up to sampling or research design. The 2016 LEMAS estimated that in agencies serving over 250,000 people, just 65% of chiefs were White, while the current analysis would suggest this number is between 92% and 96%. A gap that large strongly suggests that the inference of race for this population is questionable. On the other hand, the gender inferences seem much more stable across this analysis and previous ones.
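
One possible contributor, offered only as a hypothesis and not as something tested in this analysis, is that counting each name under its single most likely race discards the remaining probability mass. The sketch below uses the nine matched names from the sample race output shown earlier. Those rows are a small illustrative slice, not a representative sample, so the numbers demonstrate the mechanism and nothing more.

```r
# probability_white for the nine matched names in the sample output
# (Eliezer had no match and is left out).
p_white <- c(Steve = 0.8540, Hector = 0.0550, Ron = 0.8844,
             James = 0.9402, Desiree = 0.7143, Kevin = 0.9296,
             Rick = 0.9314, Christopher = 0.9454, Karl = 0.9374)

# Best-guess labels for the same names, in the same order.
likely_race <- c("white", "hispanic", rep("white", 7))

# Best-guess counting treats every "white" name as 100% White.
share_best_guess <- mean(likely_race == "white")

# Averaging the probabilities keeps the uncertainty in each name.
share_expected <- mean(p_white)

round(c(best_guess = share_best_guess, expected = share_expected), 3)
# best_guess 0.889, expected 0.799
```

For these nine names the best-guess share is about nine points higher than the probability-weighted share, so summing probabilities instead of counting best guesses would be one inexpensive check to run on the full data.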

As always, these estimates come with plenty of caveats about how seriously we should take them. They are, after all, based on probabilistic inferences about race and gender given only a first name, and that approach has many weaknesses. On the other hand, this gives a much broader look at nearly the entire population of US law enforcement agencies in the categories examined here (municipal, county sheriff, and campus law enforcement).

Many thanks to Jacob Kaplan, who developed the predictrace package for R, as this quick analysis would not be possible without his hard work.

Ian T. Adams
Ph.D. Candidate and Instructor

My research interests include public workplace surveillance, policing, and emotional labor.