# Eliminating the bias of segregation indices

It is well known that most standard estimators of segregation indices are biased. The `segregation` package provides a few tools to assess this bias. This post discusses the problem with some simple examples and shows under what conditions bootstrapping and simulation can help to remove the bias. The post relies on tools that were only recently added to the package, so install the most recent version from GitHub to follow along:

`remotes::install_github("elbersb/segregation")`

## Bias in small and large samples

To illustrate the problem, let’s use R’s `stats::r2dtable` function to
simulate a random contingency table. To make the following more
concrete, let’s assume that we observe racial segregation in schools.
Each school has an equal number of students of each of the two racial
groups, but we only observe a sample. If the sample is small, we do not
expect to sample exactly an even number of students of each of the two
groups, so the segregation index is likely to be biased upwards.

One hypothetical sample could look like this:

```
(mat = stats::r2dtable(1, rep(10, 5), c(25, 25))[[1]])
## [,1] [,2]
## [1,] 5 5
## [2,] 4 6
## [3,] 3 7
## [4,] 7 3
## [5,] 6 4
```

Now we can compute the Mutual Information index (M) and its normalized version, the H index:

```
library("segregation")
dat = matrix_to_long(mat) # convert to long format
mutual_total(dat, "group", "unit", weight = "n")
## stat est
## <char> <num>
## 1: M 0.0410
## 2: H 0.0591
```

Clearly, both indices are non-zero, even though the students were allocated to the schools at random. For the index of dissimilarity, the bias is even stronger:

```
dissimilarity(dat, "group", "unit", weight = "n")
## stat est
## <char> <num>
## 1: D 0.24
```

An index value of 0.3 is often interpreted as “moderate segregation”, so this bias is clearly a problem. Generally, the index of dissimilarity suffers more from small-sample bias than the information-theoretic indices.

Importantly, the bias is not simply a function of sample size. For instance, if we increase the number of schools to 10,000, but still expect 5 students of each racial group in each school, the bias is pretty much the same:

```
mat_large = stats::r2dtable(1, rep(10, 10000), c(50000, 50000))[[1]]
dat_large = matrix_to_long(mat_large) # convert to long format
mutual_total(dat_large, "group", "unit", weight = "n")
## stat est
## <char> <num>
## 1: M 0.0540
## 2: H 0.0778
dissimilarity(dat_large, "group", "unit", weight = "n")
## stat est
## <char> <num>
## 1: D 0.248
```

This is despite the fact that in the first case, our sample size is 50, and in the second case it’s 100,000! For the index of dissimilarity, Winship (1977) has described this bias in detail.
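The role of school size can be checked directly: if we hold the number of schools fixed but increase the number of students per school, the bias should shrink, even though increasing the number of schools did not help. A quick sketch (the exact numbers will vary, as the tables are drawn at random):

```r
library("segregation")

# hold the number of schools fixed at 100, but increase
# the number of students sampled per school
set.seed(93)
for (per_school in c(10, 50, 250)) {
    mat_sim <- stats::r2dtable(1, rep(per_school, 100),
                               rep(per_school * 100 / 2, 2))[[1]]
    dat_sim <- matrix_to_long(mat_sim)
    d <- dissimilarity(dat_sim, "group", "unit", weight = "n")
    cat("students per school:", per_school,
        "D:", round(d$est, 3), "\n")
}
```

As the number of students per school grows, D moves toward its true value of zero.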

## Solution 1: Bootstrapping

In many circumstances, it helps to enable bootstrapping to estimate the
bias. When bootstrapping is enabled, the `segregation` package reports
bias-adjusted estimates. Let’s try this for both datasets from above:

```
mutual_total(dat, "group", "unit", weight = "n", se = TRUE)
## stat est se CI bias
## <char> <num> <num> <list> <num>
## 1: M -0.00933 0.0465 -0.0956, 0.0647 0.0503
## 2: H -0.01520 0.0683 -0.1400, 0.0932 0.0743
mutual_total(dat_large, "group", "unit", weight = "n", se = TRUE)
## stat est se CI bias
## <char> <num> <num> <list> <num>
## 1: M -0.000629 0.00118 -0.00296, 0.00159 0.0546
## 2: H -0.000908 0.00170 -0.00427, 0.00230 0.0788
```

In this case, the bootstrap estimates the bias pretty well. Because the bias (last column) is subtracted from the segregation estimates, the bootstrap-adjusted estimate may become slightly negative.
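The logic behind the bias column can be reproduced by hand: resample students with replacement, recompute the index for each bootstrap sample, and take the difference between the bootstrap mean and the point estimate. The following is a minimal sketch of that idea using `dat` from above (the package’s internal bootstrap may differ in its details):

```r
library("segregation")

# expand the weighted data to one row per student
set.seed(517)
students <- dat[rep(seq_len(nrow(dat)), dat$n), ]
students$n <- 1

point <- mutual_total(students, "group", "unit", weight = "n")[stat == "M"]$est
boot <- replicate(500, {
    # resample students with replacement, recompute M
    resample <- students[sample(nrow(students), replace = TRUE), ]
    mutual_total(resample, "group", "unit", weight = "n")[stat == "M"]$est
})
bias <- mean(boot) - point  # bootstrap bias estimate
point - bias                # bias-adjusted estimate
```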

For the index of dissimilarity, this procedure does not work as well:

```
dissimilarity(dat, "group", "unit", weight = "n", se = TRUE)
## stat est se CI bias
## <char> <num> <num> <list> <num>
## 1: D 0.171 0.0999 -0.0632, 0.3596 0.0689
dissimilarity(dat_large, "group", "unit", weight = "n", se = TRUE)
## stat est se CI bias
## <char> <num> <num> <list> <num>
## 1: D 0.14 0.00201 0.137,0.145 0.108
```

Although the bias estimate is fairly large, a substantial bias remains.

## Solution 2: Compute the expected value under independence

The bootstrap may sometimes work to estimate the bias, but two major problems remain. The first, as we have seen, is that the bias estimation does not work well for the index of dissimilarity. The second is that the bootstrap performs badly when the contingency table is very sparse and contains many zero entries. I’ll come back to that in the example at the end of the post.

A more direct approach to estimating the bias is the following: using the observed marginal distributions, simulate a contingency table under the assumption that true segregation is zero. Repeat this process a number of times, and record the average of the resulting index values. This quantity is the expected value of the segregation index when students are randomly allocated to schools, conditional on the marginal distributions. In economics, this quantity is also sometimes called “random segregation” (Carrington and Troske 1998).
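This simulation is easy to sketch by hand, again using `stats::r2dtable` with the margins of the observed table `mat` from the first example:

```r
library("segregation")

# simulate 100 tables under independence with the observed
# margins of `mat`, compute M for each, and average
set.seed(172)
sims <- sapply(stats::r2dtable(100, rowSums(mat), colSums(mat)), function(m) {
    sim_dat <- matrix_to_long(m)
    mutual_total(sim_dat, "group", "unit", weight = "n")[stat == "M"]$est
})
mean(sims)  # expected M under independence
```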

The `segregation` package implements this algorithm in the following two
functions:

```
mutual_expected(dat, "group", "unit", weight = "n")
## stat est se
## <char> <num> <num>
## 1: M under 0 0.0443 0.0290
## 2: H under 0 0.0639 0.0418
dissimilarity_expected(dat, "group", "unit", weight = "n")
## stat est se
## <char> <num> <num>
## 1: D under 0 0.226 0.0945
```

In both cases, calculating the expected value of the index gives a good estimate of the bias. When reporting the final results, we could simply subtract the bias from the segregation estimates.
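For instance, the subtraction can be done directly on the two result tables, which list the M and H rows in the same order:

```r
library("segregation")

# subtract the expected value under independence from the
# point estimates (rows are in the same order: M, then H)
est <- mutual_total(dat, "group", "unit", weight = "n")
under0 <- mutual_expected(dat, "group", "unit", weight = "n")
data.frame(stat = est$stat, debiased = est$est - under0$est)
```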

## An example with sparse data

As a final point, the example in this section demonstrates some circumstances under which the information-theoretic indices, too, may be highly biased.

The `segregation` package contains an example dataset, `school_ses`,
with artificial data. Each row of this dataset describes a student, with
information on the school the student attends (`school_id`), the
student’s ethnic group (one of A, B, or C; `ethnic_group`), and the
student’s socio-economic status (provided in quintiles; `ses_quintile`).
Because there are three ethnic groups, we will only compute the
multigroup M and H indices.

The `school_ses` dataset is sparse: there are 149 schools in total, but
only 46 of those contain students of all three ethnic groups, and 26
schools contain only students of a single ethnic group.
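These counts can be verified directly; a short sketch using `data.table` (which the package already builds on):

```r
library("data.table")

# count how many distinct ethnic groups each school contains
school_ses <- as.data.table(segregation::school_ses)
per_school <- school_ses[, .(n_groups = uniqueN(ethnic_group)),
                         by = school_id]
table(per_school$n_groups)
```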

Ethnic segregation in this dataset is fairly high, but we may expect this estimate to be biased upwards:

```
mutual_total(school_ses, "ethnic_group", "school_id")
## stat est
## <char> <num>
## 1: M 0.544
## 2: H 0.577
```

For this dataset, the two approaches of estimating the bias differ somewhat:

```
mutual_total(school_ses, "ethnic_group", "school_id", se = TRUE)
## stat est se CI bias
## <char> <num> <num> <list> <num>
## 1: M 0.529 0.01000 0.512,0.545 0.0160
## 2: H 0.559 0.00921 0.542,0.576 0.0181
mutual_expected(school_ses, "ethnic_group", "school_id")
## stat est se
## <char> <num> <num>
## 1: M under 0 0.0304 0.00240
## 2: H under 0 0.0322 0.00254
```

This difference is still rather small, and will not be consequential in
many situations. However, the advantage of using information-theoretic
measures lies in their decomposability, and there the bias may be much
larger. For instance, assume that we are interested in computing ethnic
segregation conditional on SES status. We can use the `within` argument
to calculate this:

```
mutual_total(school_ses, "ethnic_group", "school_id", within = "ses_quintile")
## stat est
## <char> <num>
## 1: M 0.463
## 2: H 0.490
```

Estimating the bias of this conditional index using bootstrapping yields a bias estimate of around 0.04:

```
mutual_total(school_ses, "ethnic_group", "school_id", within = "ses_quintile",
se = TRUE)
## stat est se CI bias
## <char> <num> <num> <list> <num>
## 1: M 0.424 0.00853 0.410,0.439 0.0389
## 2: H 0.450 0.00909 0.433,0.465 0.0408
```

However, if we compute the expected value conditional on SES, the result looks very different:

```
mutual_expected(school_ses, "ethnic_group", "school_id", within = "ses_quintile")
## stat est se
## <char> <num> <num>
## 1: M under 0 0.105 0.00848
## 2: H under 0 0.132 0.01113
```

The bias is estimated to be very large – around 0.1 for the M index and around 0.13 for the H index! The reason for this discrepancy is that the indices are computed within each group defined by the SES quintiles. These “conditional” contingency tables are much smaller and even sparser than the overall dataset, so the bias is larger as well. One therefore has to be very careful when decomposing segregation measures for small or sparse samples.
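The sparsity of these conditional tables can be inspected directly, for instance by asking what share of schools contains only a single ethnic group within each SES quintile:

```r
library("data.table")

# within each SES quintile, what share of schools contains
# students from only a single ethnic group?
school_ses <- as.data.table(segregation::school_ses)
cond <- school_ses[, .(n_groups = uniqueN(ethnic_group)),
                   by = .(ses_quintile, school_id)]
cond[, .(share_single_group = mean(n_groups == 1)), by = ses_quintile]
```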

## Conclusion

When working with segregation indices, it is important to be aware that
almost all “naive” estimators of these indices are upwardly biased. In many situations, this
bias will be small. However, if the overall sample size is small, or some of the
groups or units are small, the bias can be substantial. Importantly, it
is not always the case that the bias is small in large samples. My
recommendation is to always check the sensitivity of your results using
*both* bootstrapping and by calculating “random segregation”. Special attention
needs to be paid when decomposing segregation measures for small or sparse
samples, as the decompositions will be based on even smaller/sparser samples.

## References

Winship, Christopher. 1977. A Revaluation of Indexes of Residential
Segregation. *Social Forces* 55(4): 1058-1066.

Carrington, William J. and Kenneth R. Troske. 1998. Interfirm
Segregation and the Black/White Wage Gap. *Journal of Labor Economics*
16(2): 231-260.