Inspection Paradox in Dartmouth Class Sizes

The inspection paradox creeps in when the probability of recording a data point depends on the value of that data point. Jake VanderPlas and Allen Downey have excellent examples and explanations of places where the inspection paradox shows up.

A straightforward example involves a school with 100 students who each take two classes: one containing all of the students and one private lesson with a professor. At this school, the average class size that any student encounters is about 50, but the average class actually has about 2 students.

(
    100 * 1  # one class with everyone
    + 1 * 100  # 100 classes that are 1:1
) / 101  # total number of classes
1.9801980198019802
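
For comparison, the student's-eye figure of "about 50" can be computed the same way. This is only a sketch of the toy school, assuming every student records the size of both classes they attend, so there are 200 observations in total:

```python
# Toy school: every student sits in the 100-person class
# and in one 1:1 lesson, so each student records two sizes.
students = 100

observed_sizes_total = (
    students * 100  # each student sees the big class (size 100)
    + students * 1  # each student sees their own 1:1 lesson (size 1)
)
observations = students * 2  # two classes per student

average_seen_by_student = observed_sizes_total / observations
print(average_seen_by_student)
```

The numerator counts the big class once per student who sits in it, which is what pulls the student's view so far above the per-class average.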

This isn’t just a quirk of my toy example. If we look at Dartmouth’s class enrollment for winter 2019, we can see that the average class size and the observed class size for a student are substantially different.

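import pandas

# Parse the saved enrollment page; `match` keeps only tables
# containing the text "Instructor". Recent pandas versions
# require `match` to be passed by keyword.
courses = pandas.read_html(
    "../assets/dartmouth-winter-2019-enrollment.html",
    match="Instructor")[2]
class_sizes = courses.Enrl

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()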

%matplotlib inline

clipped_class_sizes = class_sizes.clip(0, 100)
plt.hist(clipped_class_sizes,
         bins=clipped_class_sizes.max())
plt.title("Class Size Distribution")
plt.xlabel("class size")
plt.ylabel("count of classes at size");


To calculate the average class size as observed by a student, we can draw random classes from the list of classes, weighted by their enrollment. This is a bit like picking a course load by randomly selecting seats in distinct classes.

import numpy

# I generate each student's classes
# independently in a loop so that I
# can use replace=False and prevent
# anyone from attending the same class
# multiple times

SIMULATED_STUDENTS = 1000

average_observed_class_size = numpy.vstack([
    numpy.random.choice(
        class_sizes,
        p=class_sizes / class_sizes.sum(),
        size=3,  # typical courseload
        replace=False)  # can't attend same class twice
    for _ in range(SIMULATED_STUDENTS)
]).mean(axis=1)

plt.hist(average_observed_class_size, bins=100)
plt.title("Estimated observed class size")
plt.xlabel("class size")
plt.ylabel("students who observed");


The simulation above assigns each student a list of observed class sizes, like (30, 14, 61). In a full simulation, with every class filling up, there would be 30 data points for a class of size 30, 5 for a class of size 5, and so on, which makes the calculation much simpler. Instead of averaging sum(class sizes) / 3 over all students, we can combine the numerators: a class of size n contributes n observations, each with value n, so the average is the sum of squared class sizes divided by total enrollment:

average_by_student = (class_sizes ** 2).sum() \
                     / class_sizes.sum()
average_by_student
32.44172932330827
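
The shortcut can be checked on a small example. The sketch below uses three made-up class sizes (not Dartmouth data) and compares the explicit one-observation-per-seat average with the sum-of-squares formula:

```python
# Hypothetical class sizes, chosen only for illustration.
sizes = [30, 5, 14]

# Full simulation: every enrolled student records the size of
# their class, so a class of size n contributes n observations,
# each equal to n.
observations = [n for n in sizes for _ in range(n)]
explicit_average = sum(observations) / len(observations)

# Shortcut: sum of squared sizes over total enrollment.
shortcut_average = sum(n ** 2 for n in sizes) / sum(sizes)

# Plain per-class mean, for contrast.
per_class_average = sum(sizes) / len(sizes)

print(explicit_average, shortcut_average, per_class_average)
```

The first two numbers agree, and both sit above the per-class mean because the largest class contributes the most observations.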

If we instead take the average by class (ignoring classes with zero enrollment), the figure looks much better:

average_by_class = class_sizes.replace(0, numpy.nan).mean()
average_by_class
15.896414342629482

I couldn’t find an official average class size figure, but I did find a breakdown of classes by size, which indicates:

  • 64.5% of classes are < 20 students
  • 28.6% of classes are 20-49 students
  • 6.9% of classes are >= 50 students

bucketed = pandas.cut(class_sizes,
                      [-numpy.inf, 19, 49, numpy.inf])

class_buckets = (
  # sort_index() keeps the buckets in ascending order so they
  # line up with the official figures assigned below
  bucketed.value_counts().sort_index() / len(bucketed)
).to_frame(name="based on winter 2019")
class_buckets.index.name = "class size bucket"
class_buckets['official count'] = [0.645, 0.286, 0.069]
class_buckets.style.format("{:.0%}")

class size bucket   based on winter 2019   official count
(-inf, 19.0]        77%                    64%
(19.0, 49.0]        20%                    29%
(49.0, inf]         3%                     7%

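The same weighting applies to the buckets: the share of classes under 20 students is not the share of seats that are in such classes. Here is a rough sketch with made-up sizes (not Dartmouth data); `bucket_of` is a hypothetical helper that uses the same cut points as the breakdown above:

```python
from collections import Counter

# Hypothetical class sizes, for illustration only.
sizes = [5, 8, 12, 15, 25, 40, 120]

def bucket_of(size):
    """Label a class size using the <20 / 20-49 / >=50 cut points."""
    if size < 20:
        return "<20"
    if size < 50:
        return "20-49"
    return ">=50"

by_class = Counter()  # each class counts once
by_seat = Counter()   # each class counts once per enrolled student
for n in sizes:
    by_class[bucket_of(n)] += 1
    by_seat[bucket_of(n)] += n

share_of_classes = {b: c / len(sizes) for b, c in by_class.items()}
share_of_seats = {b: c / sum(sizes) for b, c in by_seat.items()}
print(share_of_classes)
print(share_of_seats)
```

In this made-up data most classes are small, yet more than half of the seats are in the single large class.
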
clipped_class_sizes = class_sizes.clip(0, 100)

plt.hist(clipped_class_sizes,
         clipped_class_sizes.max(),
         alpha=0.4)

top = clipped_class_sizes.value_counts().max()

plt.axvline(average_by_student, color="r", linestyle="--")
plt.text(average_by_student + 3,
         top,
         "average by student",
         rotation=90,
         color="r");

plt.axvline(average_by_class, color="r", linestyle="--")
plt.text(average_by_class + 3,
         top,
         "average by class",
         rotation=90,
         color="r");

plt.title("Average class size at Dartmouth (Winter 2019)")
plt.xlabel("class size")
plt.ylabel("count of classes at size");

