# Inspection Paradox in Dartmouth Class Sizes

The Inspection Paradox creeps in when the probability of you recording a data point depends on the value of the data point. Jake VanderPlas and Allen Downey have excellent examples and explanations of places where the inspection paradox shows up.

A straightforward example involves a school with 100 students who are each two classes: one containing all of the students and one private lesson with a professor. At this school, the average class size that any student encounters is about 50 but the average class actually has about 2 students.

```
(
100 * 1 # one class with everyone
+ 1 * 100 # 100 classes that are 1:1
) / 101 # total number of classes
```

```
1.9801980198019802
```

This isn’t just a quirk of my toy example. If we look at Dartmouth’s class enrollment for winter 2019, we can see that the average class size and the observed class size for a student are substantially different.

```
import pandas
courses = pandas.read_html(
"../assets/dartmouth-winter-2019-enrollment.html",
"Instructor")[2]
class_sizes = courses.Enrl
```

```
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
clipped_class_sizes = class_sizes.clip(0, 100)
plt.hist(clipped_class_sizes,
bins=clipped_class_sizes.max())
plt.title("Class Size Distribution")
plt.xlabel("class size")
plt.ylabel("count of classes at size");
```

To calculate the average class size as observed by a student, we can draw random classes from the list of classes weighted by their enrollment, kind of like picking a course-load by randomly selecting seats in distinct classes.

```
import numpy
# I generate each student's classes
# independently in a loop so that I
# can use replace=False and prevent
# anyone from attending the same class
# multiple times
SIMULATED_STUDENTS = 1000
average_observed_class_size = numpy.vstack([
numpy.random.choice(
class_sizes,
p=class_sizes / class_sizes.sum(),
size=3, # typical courseload
replace=False) # can't attend same class twice
for _ in range(SIMULATED_STUDENTS)
]).mean(axis=1)
plt.hist(average_observed_class_size, bins=100)
plt.title("Estimated observed class size")
plt.xlabel("class size")
plt.ylabel("students who observed");
```

The simulation above assigns each student a list of class sizes like
`(30, 14, 61)`

that they observed. In a full simulation, with each class
filling up, there would be 30 data-points for a class with size 30, 5 for a
class with size 5, … which makes the calculation much simpler. Instead of
averaging `sum(class sizes) / 3`

over all students, we can combine the
numerators to get

```
average_by_student = (class_sizes ** 2).sum() \
/ class_sizes.sum()
average_by_student
```

```
32.44172932330827
```

If we take the average by class instead, it appears much better

```
average_by_class = class_sizes.replace(0, numpy.nan).mean()
average_by_class
```

```
15.896414342629482
```

I couldn’t find an average class size figure, but I did find a breakdown of classes by size which indicates:

- 64.5% of classes are < 20 students
- 28.6% of classes are 20-49 students
- 6.9% of classes are >= 50 students

```
bucketed = pandas.cut(class_sizes,
[-numpy.inf, 19, 49, numpy.inf])
class_buckets = (
bucketed.value_counts() / len(bucketed)
).to_frame(name="based on winter 2019")
class_buckets.index.name = "class size bucket"
class_buckets['official count'] = [0.645, 0.286, 0.069]
class_buckets.style.format("{:.0%}")
```

based on winter 2019 | official count | |
---|---|---|

class size bucket | ||

(-inf, 19.0] | 77% | 64% |

(19.0, 49.0] | 20% | 29% |

(49.0, inf] | 3% | 7% |

```
clipped_class_sizes = class_sizes.clip(0, 100)
plt.hist(clipped_class_sizes,
clipped_class_sizes.max(),
alpha=0.4)
top = clipped_class_sizes.value_counts().max()
plt.axvline(average_by_student, color="r", linestyle="--")
plt.text(average_by_student + 3,
top,
"average by student",
rotation=90,
color="r");
plt.axvline(average_by_class, color="r", linestyle="--")
plt.text(average_by_class + 3,
top,
"average by class",
rotation=90,
color="r");
plt.title("Average class size at Dartmouth (Winter 2019)")
plt.xlabel("class size")
plt.ylabel("count of classes at size");
```