Developing Race and Gender Estimates for US Law Enforcement Leadership

A Use-Case for the predictrace Package

Last updated on Jul 3, 2021 16 min read police, R, Research, stats

Introduction

Researchers might be interested in developing a descriptive understanding of the gender and race composition of a particular industry, organization, or other institution. Oftentimes this is done with sampling from a population. This is the case in law enforcement. With approximately 18,000 sub-federal law enforcement agencies in the United States, and somewhere around 800,000 officers, it can be a challenging environment for researchers. Given the huge variation in agency type, size, composition, etc., generalizing across “law enforcement” is tricky at best.

In this preliminary analysis, I attempt a population-level inference for US law enforcement agencies, to develop estimates of race and gender proportions in the “chief executive” spot. The chief executive of a sheriff’s office is the Sheriff (often elected), while in a state-level agency it might be an Executive Director; there is a lot of variation.

Gender and Race in US Law Enforcement

John Shjarback and Natalie Todak (2019) use data from the 2013 Law Enforcement Management and Administrative Statistics (LEMAS) survey to analyze correlates of women in supervisory, mid-level, and chief executive roles in 2,826 municipal police, sheriff’s offices, and primary state law enforcement agencies. The 2013 LEMAS was the first national survey to report data at this level, and it found that just 2.7% of the agencies were led by women. My goal here is to see how estimates built from a commercial database covering a much larger set of agencies, combined with a probabilistic estimate of gender and race, compare to the estimates from the 2013 LEMAS.

The 2016 LEMAS estimates that for chiefs across all sizes of local agencies, 89.6% were White, 4% Black, 3.1% Hispanic, and 2.4% other. It also estimates that in those same agencies, just 2.6% of chiefs were female. However, the 2016 sample design yields 2,612 local agencies (rather than the full set of agencies) and uses stratified sampling that intentionally oversamples the largest agencies (100+ full-time officers).

But another method might be obtaining population-level information and inferring race and gender for the individuals based on that information. Jacob Kaplan has developed the predictrace package to do just that. The package develops a probability of race and gender based on the first name of a subject. This is from the package’s introduction:

The goal of predictrace is to predict the race of a surname or first name and the gender of a first name. This package uses U.S. Census data which says how many people of each race has a certain surname. For first name data, this package uses data from Tzioumis (2018). From this we can predict which race is mostly likely to have that surname or first name. The possible races are American Indian, Asian, Black, Hispanic, White, or two or more races. For the gender of first names, this package uses data from the United States Social Security Administration (SSA) that tells how many people of a given name are female and how many are male (no other genders are included). I use this to determine the proportion of each gender a name is, and use the gender with the higher proportion as the most likely gender for that name. Please note that the Census data on the race of first names is far smaller than the SSA data on the gender of first names, so you will match far fewer first names to race than to gender.

Data

In this short demonstration, I will attempt to develop race and gender estimates for individuals who lead US law enforcement agencies. To do so, I will rely on a commercial dataset from the National Directory of Law Enforcement Administrators (NDLEA). The dataset contains just over 37,000 listings for the chief administrator of law enforcement organizations at every level of the US system - from municipal police to heads of major federal agencies like the FBI, and everything in-between. The company that puts this database together commits to contacting every agency on the list at least once a year, and the company representative I spoke to said they are closer to once every three months. In my experience the dataset has been very reliable when I need to contact a head administrator directly.

However, to constrain the analysis, I will look only at Campus Law Enforcement, County Sheriffs, and Municipal Law Enforcement agencies (n = 17,104). Because I later look at some correlations with population, I drop any observations missing that information (n = 204), leaving a total of 16,900 observations. I’ll also reduce this to a simpler dataset by retaining only the department type, first name of administrator, state, and population served.
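The three steps above (restrict agency types, drop missing population, keep four columns) can be sketched in base R. This is only an illustration on a toy data frame: the NDLEA file itself is not shown in this post, so the object name ndlea, the LastName column, and the “State Law Enforcement” label are assumptions made for the example.

```r
# Toy stand-in for the NDLEA extract (hypothetical rows and extra columns).
ndlea <- data.frame(
  DeptType     = c("Municipal Law Enforcement", "County Sheriffs",
                   "Campus Law Enforcement", "State Law Enforcement",
                   "Municipal Law Enforcement"),
  FirstName    = c("Paul", "Troy", "Dave", "Kelly", "Matt"),
  LastName     = c("A", "B", "C", "D", "E"),
  MailingState = c("VA", "MO", "MA", "UT", "PA"),
  Population   = c("<25,000", "<25,000", NA, "1M-10M", "25k-50k"),
  stringsAsFactors = FALSE
)

keep_types <- c("Campus Law Enforcement", "County Sheriffs",
                "Municipal Law Enforcement")

# 1. Restrict to the three agency types of interest.
chiefs <- ndlea[ndlea$DeptType %in% keep_types, ]

# 2. Drop rows with no population information.
chiefs <- chiefs[!is.na(chiefs$Population), ]

# 3. Retain only the four columns used downstream.
chiefs <- chiefs[, c("DeptType", "FirstName", "MailingState", "Population")]

nrow(chiefs)
```

In the toy data, one row is dropped for its agency type and one for missing population, which mirrors how the real data goes from 17,104 to 16,900 observations.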

Let’s check and see if that looks right.

DeptType	FirstName	MailingState	Population
Campus Law Enforcement	Dave	MA	<25,000
Municipal Law Enforcement	Paul	VA	<25,000
Municipal Law Enforcement	Justin	AK	100k-1M
Municipal Law Enforcement	Thomas	MI	25k-50k
Municipal Law Enforcement	Troy	MO	<25,000
Municipal Law Enforcement	Ronald	MA	<25,000
Municipal Law Enforcement	Julian	CA	100k-1M
Municipal Law Enforcement	John	OK	<25,000
Municipal Law Enforcement	David	MI	<25,000
Municipal Law Enforcement	Matt	PA	<25,000

Looks like population data is pretty spotty (there’s an outlier from a typo that had the population of Shelby County, TN, at over 93 million! I fixed it behind the scenes here), but that’s not our main focus here today. Overall, it’s looking pretty good!

Inferring Race and Gender from First Name Data

Kaplan’s predictrace package will derive a gender and race classification for the first names in our dataset. We’ll use the predict_gender and predict_race functions to build the initial lists.
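A minimal sketch of those two calls, assuming the predictrace package is installed. The vector first_names is an illustrative stand-in for the FirstName column of the dataset. As far as I understand the package, predict_race() matches on surnames by default, so first-name matching has to be requested with surname = FALSE.

```r
library(predictrace)

# Illustrative names only; in the analysis this would be the FirstName column.
first_names <- c("Steve", "Hector", "Kelly", "Eliezer")

# Gender is inferred from SSA first-name counts.
gender_guess <- predict_gender(first_names)

# Race is inferred from first names here, not surnames.
race_guess <- predict_race(first_names, surname = FALSE)

gender_guess[, c("name", "likely_gender")]
race_guess[, c("name", "likely_race")]
```

Names with no match in the underlying data come back as NA, which is why the matched counts for race and gender end up smaller than the full 16,900.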

As you can see, the package reports probabilities for each entry, and gives a best-guess (likely_gender and likely_race) given those probabilities.

name	match_name	likely_race	probability_american_indian	probability_asian	probability_black	probability_hispanic	probability_white	probability_2races
Steve	steve	white	0.0024	0.0721	0.0221	0.0483	0.8540	0.0010
Eliezer	eliezer	NA	NA	NA	NA	NA	NA	NA
Hector	hector	hispanic	0.0000	0.0135	0.0045	0.9270	0.0550	0.0000
Ron	ron	white	0.0034	0.0469	0.0402	0.0235	0.8844	0.0017
James	james	white	0.0012	0.0147	0.0328	0.0100	0.9402	0.0012
Desiree	desiree	white	0.0030	0.0334	0.1246	0.1155	0.7143	0.0091
Kevin	kevin	white	0.0006	0.0324	0.0284	0.0082	0.9296	0.0009
Rick	rick	white	0.0029	0.0284	0.0073	0.0277	0.9314	0.0022
Christopher	christopher	white	0.0013	0.0140	0.0200	0.0179	0.9454	0.0014
Karl	karl	white	0.0007	0.0260	0.0281	0.0070	0.9374	0.0007

name	match_name	likely_gender	probability_female	probability_male
Michael	michael	male	0.0049518	0.9950482
Berkley	berkley	female	0.6417722	0.3582278
Kelly	kelly	female	0.8523312	0.1476688
Donald	donald	male	0.0039238	0.9960762
Alfonzo	alfonzo	male	0.0000000	1.0000000
Joseph	joseph	male	0.0040515	0.9959485
Scott	scott	male	0.0033662	0.9966338
Christopher	christopher	male	0.0046306	0.9953694
Dennis	dennis	male	0.0042935	0.9957065
Donald	donald	male	0.0039238	0.9960762

So now let’s add the best guess from the predictrace package back to our original data and get a quick feel for the overall distribution of gender and race.
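One way this step might look in base R is sketched below. The gender and race vectors are hand-entered, illustrative stand-ins for the likely_gender and likely_race columns rather than real package output, and summarise_guess is a hypothetical helper written for this example.

```r
chiefs <- data.frame(
  FirstName = c("Michael", "Kelly", "Hector", "Eliezer", "Christopher"),
  stringsAsFactors = FALSE
)

# Illustrative best guesses; NA marks a name the package could not match.
chiefs$gender <- c("male", "female", "male", NA, "male")
chiefs$race   <- c("white", "white", "hispanic", NA, "white")

# Count and percent for one variable. Unmatched (NA) names are left out of
# the denominator because table() drops NA by default.
summarise_guess <- function(x) {
  counts <- table(x)
  data.frame(
    level   = names(counts),
    n       = as.integer(counts),
    percent = round(100 * as.integer(counts) / sum(counts), 1),
    stringsAsFactors = FALSE
  )
}

gender_summary <- summarise_guess(chiefs$gender)
race_summary   <- summarise_guess(chiefs$race)
gender_summary
race_summary
```

Leaving unmatched names out of the denominator is consistent with the summary table, where the gender percentages are computed on 16,619 matched names and the race percentages on 16,175.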

Table 1: Summary Statistics
Variable	N	Percent
DeptType	16900
… Municipal Law Enforcement	11697	69.2%
… Campus Law Enforcement	2038	12.1%
… County Sheriffs	3165	18.7%
Population	16900
… <25,000	13614	80.6%
… 25k-50k	1562	9.2%
… 50k-100k	868	5.1%
… 100k-1M	797	4.7%
… 1M-10M	59	0.3%
gender	16619
… male	15583	93.8%
… female	1036	6.2%
race	16175
… white	15844	98%
… black	67	0.4%
… hispanic	234	1.4%
… hispanic, white	2	0%
… asian	27	0.2%
… asian, white	1	0%

Results

Let’s break down race and gender estimates by the population of the area served by the agency. Because of the very low counts in Hispanic/White and Asian/White, I’m going to collapse those into the Hispanic and Asian categories, respectively. As population data for very small areas (<1,000 pop.) can be spotty in the NDLEA, we lose some observations.
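The collapsing step can be sketched with a small lookup vector. The race vector below is toy data, and collapse_map is a name invented for this example.

```r
# Toy vector of best-guess labels, including the two tied categories and an NA.
race <- c("white", "hispanic, white", "asian", "asian, white",
          "black", NA, "hispanic")

# Tied labels on the left are folded into the single category on the right.
collapse_map <- c("hispanic, white" = "hispanic",
                  "asian, white"    = "asian")

race_collapsed <- unname(ifelse(race %in% names(collapse_map),
                                collapse_map[race],
                                race))

table(race_collapsed, useNA = "ifany")
```

Applied to the counts in Table 1, this moves the 2 Hispanic/White cases into Hispanic (234 + 2 = 236) and the 1 Asian/White case into Asian (27 + 1 = 28), which matches the race rows of Table 2.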

Table 2: Race and Gender of Chief Administrator, by Population Served
Variable	Overall, N = 16,900¹	<25,000, N = 13,614¹	25k-50k, N = 1,562¹	50k-100k, N = 868¹	100k-1M, N = 797¹	1M-10M, N = 59¹
race
White	15,844 (97.95%)	12,796 (98.13%)	1,483 (97.95%)	793 (97.42%)	720 (95.87%)	52 (92.86%)
Black	67 (0.41%)	50 (0.38%)	6 (0.40%)	6 (0.74%)	4 (0.53%)	1 (1.79%)
Hispanic	236 (1.46%)	175 (1.34%)	23 (1.52%)	14 (1.72%)	21 (2.80%)	3 (5.36%)
Asian	28 (0.17%)	19 (0.15%)	2 (0.13%)	1 (0.12%)	6 (0.80%)	0 (0.00%)
Unknown	725	574	48	54	46	3
gender
male	15,583 (93.77%)	12,569 (93.85%)	1,456 (94.12%)	791 (93.94%)	718 (92.17%)	49 (83.05%)
female	1,036 (6.23%)	823 (6.15%)	91 (5.88%)	51 (6.06%)	61 (7.83%)	10 (16.95%)
Unknown	281	222	15	26	18	0
¹ n (%)

Perhaps unsurprisingly, law enforcement agencies are predominantly led by men. However, there may have been progress over the past decade or so. Compared to the 2013 LEMAS data, which estimated that just 2.7% of agencies were led by women, my analysis estimates that 6.2% of agencies overall are led by women. The proportion of women-led agencies is stable at around 6% until we get to the larger population centers; in the largest (between 1M and 10M pop.), 17% of agencies (10 of 59) are led by women. This is much larger than the 8.5% suggested by the 2016 LEMAS, though the largest category there is 250,000+ population.

In terms of racial characteristics, this analysis suggests that, overall, 98% of agencies are led by White chief executives. This percentage is negatively correlated with population. In other words, the percentage of White chief executives tends to decrease as the size of population served increases. Even at the top-end of population size, however, these positions are heavily skewed, as seen in the largest (1M to 10M) areas, where 93% of chief executives are estimated to be White.
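The downward trend can be checked directly from the counts in Table 2. The sketch below recomputes the White share in each population bin, using matched names (total minus Unknown) as the denominator; the variable names are my own.

```r
pop_bins  <- c("<25,000", "25k-50k", "50k-100k", "100k-1M", "1M-10M")
n_total   <- c(13614, 1562, 868, 797, 59)   # agencies per bin (Table 2 header)
n_unknown <- c(574, 48, 54, 46, 3)          # names with no race match
n_white   <- c(12796, 1483, 793, 720, 52)   # best guess is White

# Percentages in Table 2 are computed on matched names only.
n_matched <- n_total - n_unknown
pct_white <- round(100 * n_white / n_matched, 2)
names(pct_white) <- pop_bins
pct_white

# TRUE when every step up in population bin lowers the White share.
all(diff(pct_white) < 0)
```

The share falls at every step, from 98.13% in the smallest bin to 92.86% in the largest, though the largest bin rests on only 56 matched names.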

Let’s see if the proportions hold across agency types as well.

Table 3: Race and Gender of Chief Administrator, by Department Type
Variable	Overall, N = 16,900¹	Municipal Law Enforcement, N = 11,697¹	Campus Law Enforcement, N = 2,038¹	County Sheriffs, N = 3,165¹
race
White	15,844 (97.95%)	11,035 (98.12%)	1,866 (96.73%)	2,943 (98.10%)
Black	67 (0.41%)	40 (0.36%)	15 (0.78%)	12 (0.40%)
Hispanic	236 (1.46%)	155 (1.38%)	45 (2.33%)	36 (1.20%)
Asian	28 (0.17%)	16 (0.14%)	3 (0.16%)	9 (0.30%)
Unknown	725	451	109	165
gender
male	15,583 (93.77%)	10,910 (94.56%)	1,729 (86.71%)	2,944 (95.37%)
female	1,036 (6.23%)	628 (5.44%)	265 (13.29%)	143 (4.63%)
Unknown	281	159	44	78
¹ n (%)

As you can see, based on these results, agency type does not seem to be correlated with higher percentages of non-White chief executives. However, campus law enforcement agencies are much more likely than other agency types to be led by women: over 13%, compared to 6.2% overall.
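As a rough check on the campus difference, the counts in Table 3 can be fed to a two-sample test of proportions. This is a sketch, not part of the original analysis: it treats the name-based gender guesses as if they were observed values, so it says nothing about classification error.

```r
# Female-classified chiefs and matched names (total minus Unknown), from Table 3.
female  <- c(campus = 265, other = 628 + 143)
matched <- c(campus = 2038 - 44,
             other  = (11697 - 159) + (3165 - 78))

share_female <- round(100 * female / matched, 2)
share_female

# Two-sample test of equal proportions (base R, stats package).
campus_test <- prop.test(x = female, n = matched)
campus_test$p.value
```

The campus share is 13.29% against 5.27% for municipal and sheriff’s agencies combined, a gap far larger than sampling noise alone would explain at these sample sizes.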

Conclusion

Much more investigation is needed before relying on these estimates, as they are even more overwhelmingly White than previous reporting would suggest. Recall that the 2016 LEMAS estimated that among local agency chiefs, 89.6% were White, 4% Black, 3.1% Hispanic, and 2.4% other race. The differences here suggest more analysis is needed, but several obvious explanations present themselves. There may be substantial gaps between the LEMAS sample and a population-level estimate. Alternatively, the name-based probabilities themselves may skew toward White classifications. The inclusion of more than just local agencies in this analysis also deserves some thought, as there may be agency characteristics that lead to higher proportions of non-White candidates being selected for the top job.

Some of the gaps are too large to comfortably chalk up to sampling or research design. The 2016 LEMAS estimated that in agencies serving over 250,000 people, just 65% of chiefs were White, while the current analysis would put this number between 92% and 96%. A gap that large strongly suggests that the inference of race for this population is questionable. On the other hand, the gender inferences seem much more stable across this analysis and previous ones.
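One mechanism that could push the race estimates toward White is the best-guess rule itself: a name that is 85% likely to be White is counted as 100% White. The sketch below uses the nine matched names from the sample race output earlier in the post to contrast that rule with simply averaging the White probabilities. It is an illustration of the mechanism under those ten sample rows, not a re-analysis of the full data.

```r
# probability_white and likely_race for the nine matched sample names
# (Eliezer is omitted because it had no match).
p_white <- c(Steve = 0.8540, Hector = 0.0550, Ron = 0.8844, James = 0.9402,
             Desiree = 0.7143, Kevin = 0.9296, Rick = 0.9314,
             Christopher = 0.9454, Karl = 0.9374)
likely_race <- c("white", "hispanic", "white", "white", "white",
                 "white", "white", "white", "white")

# Best-guess rule: share of names whose most likely race is White.
hard_share <- mean(likely_race == "white")

# Probability-weighted alternative: average White probability across names.
soft_share <- mean(p_white)

round(100 * c(best_guess = hard_share, probability_weighted = soft_share), 1)
```

For these nine names the best-guess rule gives 88.9% White, while the probability-weighted share is 79.9%. Whether a similar gap appears in the full dataset would need to be checked against the complete set of predictions.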

As always, these estimates come with plenty of caveats. They are, after all, based on probabilistic inferences about race and gender given only a first name, and that approach has many weaknesses. On the other hand, it gives a much broader look at nearly the entire population of US law enforcement agencies in the categories analyzed here (municipal, sheriff’s, and campus law enforcement).

Many thanks to Jacob Kaplan, who developed the predictrace package for R, as this quick analysis would not be possible without his hard work.


Ian T. Adams, Ph.D.

Assistant Professor, Department of Criminology & Criminal Justice

My research interests center around policing policy, people, behavior, and technology.