# Inspection Paradox in Dartmouth Class Sizes

The inspection paradox creeps in when the probability of recording a data point depends on the value of that data point. Jake VanderPlas and Allen Downey have excellent examples and explanations of places where the inspection paradox shows up.

A straightforward example involves a school with 100 students who are each in two classes: one containing all of the students and one private lesson with a professor. At this school, the average class size that any student encounters is about 50, but the average class actually has about 2 students.

``````
(
    100 * 1    # one class with everyone
    + 1 * 100  # 100 classes that are 1:1
) / 101  # total number of classes
``````

``````
1.9801980198019802
``````
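
The calculation above covers only the by-class average. As a rough sketch of both sides of the comparison, the snippet below builds the toy school's class list and computes the average per class and the average per student seat. The names `toy_class_sizes`, `mean_by_class`, and `mean_by_student` are made up for illustration.

```python
# Toy school from the example: one lecture with all 100 students,
# plus 100 one-on-one lessons (each with a single student).
toy_class_sizes = [100] + [1] * 100

# Average over classes: every class counts once.
mean_by_class = sum(toy_class_sizes) / len(toy_class_sizes)

# Average over student seats: a class of size s is observed by s students,
# so it contributes s observations of the value s.
mean_by_student = (
    sum(size * size for size in toy_class_sizes) / sum(toy_class_sizes)
)

print(f"by class:   {mean_by_class:.2f}")
print(f"by student: {mean_by_student:.2f}")
```

Each student sits in one class of 100 and one class of 1, so the by-student figure works out to (100 + 1) / 2 = 50.5, which is the "about 50" quoted above.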

This isn’t just a quirk of my toy example. If we look at Dartmouth’s class enrollment for winter 2019, we can see that the average class size and the observed class size for a student are substantially different.

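``````
import pandas

# Assumed reconstruction: the start of this call was missing. read_html
# returns a list of tables matching "Instructor"; take the first.
courses = pandas.read_html(
    "../assets/dartmouth-winter-2019-enrollment.html",
    match="Instructor")[0]
class_sizes = courses.Enrl
``````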
``````
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

%matplotlib inline

# cap class sizes at 100 before plotting
clipped_class_sizes = class_sizes.clip(0, 100)
plt.hist(clipped_class_sizes,
         bins=clipped_class_sizes.max())
plt.title("Class Size Distribution")
plt.xlabel("class size")
plt.ylabel("count of classes at size");
``````

To calculate the average class size as observed by a student, we can draw random classes from the list of classes, weighted by their enrollment. This is kind of like picking a course load by randomly selecting seats in distinct classes.

``````
import numpy

# I generate each student's classes
# independently in a loop so that I
# can use replace=False and prevent
# anyone from attending the same class
# multiple times

SIMULATED_STUDENTS = 1000
CLASSES_PER_STUDENT = 3  # three classes each, as in the (30, 14, 61) example

average_observed_class_size = numpy.vstack([
    numpy.random.choice(
        class_sizes,
        size=CLASSES_PER_STUDENT,
        p=class_sizes / class_sizes.sum(),
        replace=False)  # can't attend same class twice
    for _ in range(SIMULATED_STUDENTS)
]).mean(axis=1)

plt.hist(average_observed_class_size, bins=100)
plt.title("Estimated observed class size")
plt.xlabel("class size")
plt.ylabel("students who observed");
``````

The simulation above assigns each student a list of class sizes like `(30, 14, 61)` that they observed. In a full simulation, with each class filling up, there would be 30 data points for a class of size 30, 5 for a class of size 5, and so on, which makes the calculation much simpler. Instead of averaging `sum(class sizes) / 3` over all students, we can combine the numerators: every class of size `s` contributes `s` observations of the value `s`, so the student-weighted average is the sum of squared class sizes divided by the total number of seats:

``````
average_by_student = (
    (class_sizes ** 2).sum()
    / class_sizes.sum()
)
average_by_student
``````

``````
32.44172932330827
``````
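
To see why combining the numerators works, here is a small sketch on made-up data. It expands three hypothetical classes into one observation per seat, averages those observations, and compares the result with the closed form `sum(s ** 2) / sum(s)`. The sizes reuse the `(30, 14, 61)` tuple purely as an illustration; they are not taken from the Dartmouth data.

```python
import math

# Hypothetical class sizes, for illustration only.
example_sizes = [30, 14, 61]

# "Full simulation": every seat in a class records that class's size.
seat_observations = [size for size in example_sizes for _ in range(size)]
mean_over_seats = sum(seat_observations) / len(seat_observations)

# Closed form: sum of squared sizes over the total number of seats.
closed_form = sum(size ** 2 for size in example_sizes) / sum(example_sizes)

# Plain per-class mean of the same three classes, for comparison.
mean_over_classes = sum(example_sizes) / len(example_sizes)

assert math.isclose(mean_over_seats, closed_form)
print(f"per seat:  {mean_over_seats:.2f}")
print(f"per class: {mean_over_classes:.2f}")
```

The seat-weighted mean (about 45.9) exceeds the plain per-class mean of 35 for the same three classes, because larger classes are counted once per student in them.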

If we take the average by class instead, the result looks much better, at roughly half the size:

``````
# treat classes with zero enrollment as missing so mean() skips them
average_by_class = class_sizes.replace(0, numpy.nan).mean()
average_by_class
``````

``````
15.896414342629482
``````

I couldn’t find an official average class size figure, but I did find a breakdown of classes by size, which indicates:

• 64.5% of classes are < 20 students
• 28.6% of classes are 20-49 students
• 6.9% of classes are >= 50 students

``````
bucketed = pandas.cut(class_sizes,
                      [-numpy.inf, 19, 49, numpy.inf])

# sort_index() keeps the buckets in ascending size order so the
# official shares assigned below line up with the right rows
class_buckets = (
    bucketed.value_counts() / len(bucketed)
).sort_index().to_frame(name="based on winter 2019")
class_buckets.index.name = "class size bucket"
class_buckets['official count'] = [0.645, 0.286, 0.069]
class_buckets.style.format("{:.0%}")
``````

| class size bucket | based on winter 2019 | official count |
|---|---|---|
| (-inf, 19.0] | 77% | 64% |
| (19.0, 49.0] | 20% | 29% |
| (49.0, inf] | 3% | 7% |
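
The bin edges `19` and `49` may look off by one against the official cut-offs of 20 and 50. As a quick check, assuming `pandas.cut` is left at its default of right-closed intervals, the sketch below buckets a few made-up boundary sizes and prints the bucket index each one lands in. `boundary_sizes` and `bucket_index` are illustrative names, not part of the analysis above.

```python
import numpy
import pandas

# Made-up sizes sitting on either side of each bucket boundary.
boundary_sizes = pandas.Series([0, 19, 20, 49, 50, 120])
edges = [-numpy.inf, 19, 49, numpy.inf]

# labels=False returns the integer position of each bucket instead of
# an Interval, which makes the assignment easy to read.
bucket_index = pandas.cut(boundary_sizes, edges, labels=False)

for size, index in zip(boundary_sizes, bucket_index):
    print(size, "-> bucket", index)
```

For whole-number enrollments, the three buckets therefore correspond to fewer than 20, 20 to 49, and 50 or more students.
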

``````
clipped_class_sizes = class_sizes.clip(0, 100)

plt.hist(clipped_class_sizes,
         clipped_class_sizes.max(),
         alpha=0.4)

# count of the most common class size, used as the y-position for the labels
top = clipped_class_sizes.value_counts().max()

plt.axvline(average_by_student, color="r", linestyle="--")
plt.text(average_by_student + 3,
         top,
         "average by student",
         rotation=90,
         color="r");

plt.axvline(average_by_class, color="r", linestyle="--")
plt.text(average_by_class + 3,
         top,
         "average by class",
         rotation=90,
         color="r");

plt.title("Average class size at Dartmouth (Winter 2019)")
plt.xlabel("class size")
plt.ylabel("count of classes at size");
``````