Inspection Paradox in Dartmouth Class Sizes

The inspection paradox creeps in when the probability of recording a data point depends on the value of that data point. Jake VanderPlas and Allen Downey have excellent examples and explanations of places where the inspection paradox shows up.

A straightforward example involves a school with 100 students who are each in two classes: one containing all of the students and one private lesson with a professor. At this school, the average class size that a student encounters is about 50 (the mean of 100 and 1), but the average class actually has about 2 students.

(
    100 * 1  # one class with everyone
    + 1 * 100  # 100 classes that are 1:1
) / 101  # total number of classes

1.9801980198019802
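The student's-eye figure can be checked the same way. Here is a minimal sketch for the toy school; the name `class_sizes_toy` is made up for this illustration and is not used in the analysis below.

```python
# Toy school: one lecture with all 100 students plus 100 private lessons.
# `class_sizes_toy` is an illustrative name, not part of the Dartmouth analysis.
class_sizes_toy = [100] + [1] * 100

# Average over classes: every class counts once.
average_by_class_toy = sum(class_sizes_toy) / len(class_sizes_toy)

# Average over student experiences: a class of size s is seen by s students,
# so it contributes s observations of the value s.
average_by_student_toy = (
    sum(s * s for s in class_sizes_toy) / sum(class_sizes_toy)
)

print(average_by_class_toy)    # about 1.98
print(average_by_student_toy)  # 50.5
```

Both numbers come from the same list of classes; the only difference is whether each class or each occupied seat gets one vote.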

This isn’t just a quirk of my toy example. If we look at Dartmouth’s class enrollment for winter 2019, we can see that the average class size and the observed class size for a student are substantially different.

import pandas

courses = pandas.read_html(
    "../assets/dartmouth-winter-2019-enrollment.html",
    match="Instructor")[2]  # keep only tables containing "Instructor"
class_sizes = courses.Enrl

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

%matplotlib inline

clipped_class_sizes = class_sizes.clip(0, 100)
plt.hist(clipped_class_sizes,
         bins=clipped_class_sizes.max())
plt.title("Class Size Distribution")
plt.xlabel("class size")
plt.ylabel("count of classes at size");

[Figure: histogram "Class Size Distribution" (x: class size, y: count of classes at size)]

To calculate the average class size as observed by a student, we can draw random classes from the list of classes, weighted by their enrollment. This is kind of like picking a course load by randomly selecting seats in distinct classes.

import numpy

# I generate each student's classes
# independently in a loop so that I
# can use replace=False and prevent
# anyone from attending the same class
# multiple times

SIMULATED_STUDENTS = 1000

average_observed_class_size = numpy.vstack([
    numpy.random.choice(
        class_sizes,
        p=class_sizes / class_sizes.sum(),
        size=3,  # typical courseload
        replace=False)  # can't attend same class twice
    for _ in range(SIMULATED_STUDENTS)
]).mean(axis=1)

plt.hist(average_observed_class_size, bins=100)
plt.title("Estimated observed class size")
plt.xlabel("class size")
plt.ylabel("students who observed");

[Figure: histogram "Estimated observed class size" (x: class size, y: students who observed)]

The simulation above assigns each student a list of observed class sizes like (30, 14, 61). In a full simulation, with every class filling up, there would be 30 data points for a class of size 30, 5 for a class of size 5, and so on, which makes the calculation much simpler. Instead of averaging sum(class sizes) / 3 over all students, we can pool every observation: a class of size s contributes s observations of the value s, so the average is sum(s^2) / sum(s):

average_by_student = (class_sizes ** 2).sum() \
                     / class_sizes.sum()
average_by_student

32.44172932330827
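As a sanity check on that ratio, size-weighted draws should average close to it. The sketch below uses made-up sizes (`toy_sizes`, not the Dartmouth data) and, unlike the simulation above, draws with replacement so that each draw is an independent "random occupied seat".

```python
import numpy

rng = numpy.random.default_rng(0)  # fixed seed so the sketch is repeatable
toy_sizes = numpy.array([5, 14, 30, 61, 120])  # made-up class sizes

# Pick classes with probability proportional to enrollment, i.e. pick a
# random occupied seat and record the size of the class it belongs to.
draws = rng.choice(toy_sizes, p=toy_sizes / toy_sizes.sum(), size=100_000)

simulated = draws.mean()
closed_form = (toy_sizes ** 2).sum() / toy_sizes.sum()

print(simulated, closed_form)  # the two should agree to within sampling noise
```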

If we take the average by class instead (ignoring classes with no enrollment), the number looks much better:

average_by_class = class_sizes.replace(0, numpy.nan).mean()
average_by_class

15.896414342629482
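The gap between the two averages is not specific to this data set. For any list of non-negative sizes with a positive mean, sum(s^2) / sum(s) equals the per-class mean plus the variance divided by the mean, so the student-weighted average can never be smaller. A small sketch on made-up sizes (the `sizes` array is hypothetical):

```python
import numpy

sizes = numpy.array([5, 14, 30, 61, 120])  # made-up class sizes

mean_by_class = sizes.mean()
size_biased_mean = (sizes ** 2).sum() / sizes.sum()

# numpy.var defaults to the population variance (ddof=0),
# which is the version this identity needs.
identity = mean_by_class + sizes.var() / mean_by_class

print(mean_by_class, size_biased_mean, identity)
```

The more the class sizes vary, the larger the gap between what the school reports and what students experience.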

I couldn’t find an official average class size figure, but I did find a breakdown of classes by size, which indicates:

64.5% of classes are < 20 students
28.6% of classes are 20-49 students
6.9% of classes are >= 50 students

bucketed = pandas.cut(class_sizes,
                      [-numpy.inf, 19, 49, numpy.inf])

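class_buckets = (
  # value_counts sorts by frequency; sort_index restores bucket
  # order so the official figures below line up with the right rows
  bucketed.value_counts().sort_index() / len(bucketed)
).to_frame(name="based on winter 2019")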
class_buckets.index.name = "class size bucket"
class_buckets['official count'] = [0.645, 0.286, 0.069]
class_buckets.style.format("{:.0%}")

class size bucket    based on winter 2019    official count
(-inf, 19.0]         77%                     64%
(19.0, 49.0]         20%                     29%
(49.0, inf]          3%                      7%
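The bin edges 19 and 49 work because `pandas.cut` uses right-inclusive intervals by default, so `(-inf, 19]` matches "fewer than 20 students" for whole-number enrollments. A quick check on made-up boundary cases (`example_sizes` is hypothetical):

```python
import numpy
import pandas

example_sizes = pandas.Series([0, 19, 20, 49, 50, 300])  # boundary cases

buckets = pandas.cut(example_sizes, [-numpy.inf, 19, 49, numpy.inf])

# Each interval is closed on the right, so 19 lands in the first bucket,
# 20 and 49 in the second, and 50 in the third.
print(buckets.cat.codes.tolist())
```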

clipped_class_sizes = class_sizes.clip(0, 100)

plt.hist(clipped_class_sizes,
         clipped_class_sizes.max(),
         alpha=0.4)

top = clipped_class_sizes.value_counts().max()

plt.axvline(average_by_student, color="r", linestyle="--")
plt.text(average_by_student + 3,
         top,
         "average by student",
         rotation=90,
         color="r");

plt.axvline(average_by_class, color="r", linestyle="--")
plt.text(average_by_class + 3,
         top,
         "average by class",
         rotation=90,
         color="r");

plt.title("Average class size at Dartmouth (Winter 2019)")
plt.xlabel("class size")
plt.ylabel("count of classes at size");

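[Figure: histogram "Average class size at Dartmouth (Winter 2019)" with dashed vertical lines marking the average by student and the average by class]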

Alex Riina
