Finding Senators with Gap Terms on Wikidata using SPARQL

A few days ago, I went in search of United States senators who had pulled a [Grover Cleaveland]( and served multiple non-consecutive terms. I couldn't think of any examples, but it's happened a lot of times including recently with senator Dan Coats (R-IN) who was a senator from 1989 to 1999 and from 2011 to 2017. This post isn't really about senators and non-consecutive terms, but is a quick walkthrough on how ask Wikidata questions using SPARQL. As a [bonus](#bonus1) though, I found a senator who was succeeded twice by the same man and a senator who represented three states. # How to ask Wikipedia a question Let's start with my question > What senators served non-consecutive terms? To start, go to the Wikipedia page for a current senator like [Elizabeth Warren]( On the sidebar, follow the link to "Wikidata item." As you scroll through that page, you'll notice a section titled "positions held" with "United States Senator." The page you're currently on is for the "Elizabeth Warren" _item_ ([Q595844][wikidata warren]) and lists _properties_ that relate to her. Each _property_ is related to the original _item_ in the form of a _statement_ containing a _subject_ - _property_ - _value_ like _Elizabeth Warren_ ([Q595844][wikidata warren]) - _Position Held_ ([P39][wikidata position]) - _United States Senator_ ([Q13217683][wikidata senator]). This "connection" itself can have properties as well, like "start time" ([P580][wikidata start time]) which we can use in a filter later. [wikidata warren]: [wikidata senator]: [wikidata position]: [wikidata start time]: That's basically how the data is structured. In the next section, we'll use SPARQL to actually start asking questions with this data model in mind. [wikiquery all senators]: xxx [wikiquery human senators]: xxx [wikiquery current senators]: xxx ## Building the Query Let's start with a simple question and build it up until it answers our original question. > List the United States Senators I've written that question deliberately in an "imperative" mood to draw a distinction between programming languages like python and languages like SPARQL, which we'll use to satisfy our query. In python, you might try to answer a question like this with code like ```python for person in people: if person.occupation == "United States Senator": print(person) ``` SPARQL is in a differnt family of programming languages called "query languages" which is related to the family of logical programming languages. Instead of writing _instructions_ to find information, you write situations with variables and let the program figure out what values the variables should take. You would instead write "find me all _items_ with a _property_ of "position held" with the value "United States Senator." ```sparql SELECT ?senator WHERE { ?senator wdt:P39 # "Position Held" wd:Q13217683. # "United States Senator" } ``` ([run query]( `?senator` is an unspecified variable (the particular `?name` doesn't matter) and the database engine tries to "satisfy" the query by listing all values of `?senator` for which the query is accurate. The `?senator` values aren't particularly readable, so we'll paste in a little boilerplate so we can use `?senatorLabel` in future queries which is a nice representation of the corresponding `?senator` item. ```sparql SELECT ?senator ?senatorLabel WHERE { ?senator wdt:P39 # "Position Held" wd:Q13217683. # "United States Senator" } ``` > Who are the current United States Senators? As a simple modification of the previous query, we'll ask Wikidata to find senators whose property "Position Held" with value "United States Senator" has no end date. In the previous query, we were able to ask about a statement directly, but because we're asking about _qualifiers_ of the statement, we'll have to break down the query to build it back up. Instead of writing a query like `?subject wdt:?property wd:?value` we'll write one statement that assigns the "Position Held" property to its own variable using the `p:` prefix and write another a new statement about the _value_ of of the "Position Held" using the `ps:` prefix. ([official docs][wikiquery syntax]) ```sparql SELECT ?senator ?senatorLabel ?end_time WHERE { ?senator p:P39 ?term. ?term ps:P39 wd:Q13217683. ?term pq:P580 ?end_time. SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } } ``` ([run query]( I've gone ahead and added a third statement to assign the "end time" [P80][wikidata end_time] to its own variable by extracting it as a qualifier with the `pq:` prefix, but the results that "satisfy" the query always have end dates. So we'll add an `OPTIONAL` block and put the `pq:` statement in that block. ```sparql SELECT ?senator ?senatorLabel ?end_time WHERE { ?senator p:P39 ?term. ?term ps:P39 wd:Q13217683. OPTIONAL { ?term pq:P580 ?end_time. } SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } } ``` ([run query]( We could filter for only the `?senator` values without an "end time" here, but that wont be relevant in the final query so I'd recommend doing that in excel, python, or your other tool of choice. [wikidata end time]: [wikiquery syntax]:,_References_and_Ranks This "intermediate" query might have seemed like a distraction, but we've actually gotten all of the data we need to answer the original question in excel or python. In the first page of results, you may already see some senators showing up twice. That's because each line corresponds to one way of assigning the variables in the query that satisfies the requirements. Before we move on, let's take a careful look at the data. # Checking the data Recognize this guy? ![David Palmer]( | senator | senatorLabel | end_time | |---------|--------------|----------| [Q249898]( | David Palmer | | {:.full-width.pull-middle} That's actually Dennis Haysbert's character from [24]( On the [item page][wikidata david palmer] on Wikidata, you'll see that he is an "instance of" ([P31][wikidata instance of]) "fictional human" ([Q15632617][wikidata fictional human]). Back on Elizabeth Warren's Wikidata page, you'll see she's an instance of "human" ([Q5][wikidata human]) so before we go any farther, let's limit ourselves to only real humans. [wikidata david palmer]: [wikidata instance of]: [wikidata fictional human]: [wikidata human]: ```sparql SELECT ?senatorLabel ?start_time ?end_time ?stateLabel WHERE { ?senator wdt:P31 wd:Q5. ?senator p:P39 ?term. ?term ps:P39 wd:Q13217683. OPTIONAL { ?term pq:P580 ?start_time. ?term pq:P582 ?end_time. ?term pq:P768 ?state. } SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } } ```
senatorLabel start_time end_time stateLabel
Brock Adams 1987-01-03T00:00:00Z 1993-01-03T00:00:00Z Washington
David C. Broderick 1857-03-04T00:00:00Z 1859-09-16T00:00:00Z California
David Settle Reid 1854-12-06T00:00:00Z 1859-03-04T00:00:00Z North Carolina
Arlen Specter 1981-01-03T00:00:00Z 2011-01-03T00:00:00Z Pennsylvania
Allen Joseph Ellender 1937-01-03T00:00:00Z 1972-01-27T00:00:00Z Louisiana
download 2014 rows or run query
I stumbled on Wikidata a few months ago while searching for data about the Massachusetts legislature. Besides some gotchas in the data like "fictional humans" creeping in, the data in that case wasn't good enough to use: most candidates--even incumbents--didn't have Wikidata items and the data that did existe was incomplete or implemented differently for some candidates. It's likely that this data is also incomplete or sometimes inaccurate. Nonetheless, it's a really powerful way to start answering my question and if you find inaccurate data, you can submit a correction easily through the Wikidata interface. ---- # Senators with Non-Consecutive Terms We're finally here. We have a list of senators with separate entries for each term they've served and we have the start and end dates of the terms are included to help catch duplicate data. This part is pretty straightforward in excel or python ([see addendum](#findDuplicates)) so here's you're reward:
senator term start term end notes
Alben W. Barkley 1927-03-04 1949-01-19
1955-01-03 1956-04-30
Angus Cameron 1875-03-04 1881-03-03
1881-03-14 1885-03-03
Dan Coats 1989-01-03 1999-01-03
download 57 rows
# Bonus #1: The senator who was succeeded twice by the same man {:#bonus1} Among the plenty of cases when that happened, I stumbled found a case where two senators served four alternating terms in 17 years. In 1869, Matthew Hale Carpenter (R-WI) began his first term in the United States Senate. Although he campaigned for re-election in 1875, he ultimately lost to Angus Cameron (D-WI). After just one term, Cameron decided to call it a quits and Carpenter was reelected by the state legislature (pre-17th Ammendment). You're probably guessing where this is going, but Carpenter was already in poor health when he took office in 1881 and died just months into his disconnected, second term. Cameron was fortunately still around and was quickly re-elected and began his own disconnected, second term in March 1881. # Bonus #2: James Shields (D-IL/MN/MO) After almost dueling Lincoln and then almost becoming Governor of Oregon Territory, James Shields was appointed Senator of Illinois, resigned because he hadn't been a citizen long enough, and then was elected again before the end of the year. After defeat in 1855, Shields moved to the territory of Minnesota and was elected senator when Minnesota became a state in 1858. Shields drew the short straw (literally) and his term ended in 1859 so the other senator from Minnesota could serve until 1863. In 1879, after moving to California and back to the South and fighting for the Union in the Civil War, Shields was eventually elected to replace a Missouri senator who had died near the end of his term. Shields finally called it a quits, presumably satisfied his record of serving as senator from three states would never be matched. I've left out a lot of details and would definitely recommend reading his wikipedia article . # Addendum: Getting duplicates from python {:#findDuplicates} Here's how I extrated the duplicates using python. ```python import pandas df = pandas.read_csv("query.csv", parse_dates=['start_time', 'end_time']) df['start_time'] = pandas.to_datetime(df['start_time'] df['end_time'] = pandas.to_datetime(df['end_time'] # Data comes from multiple sources and includes a lot of exact duplicates df.drop_duplicates(inplace=True) multi_terms = df.duplicates('senatorLabel', keep=False) df[multi_terms].set_index(['senatorLabel', 'start_time']).sort_index() ``` The data at this point can be scanned through for anomolies or used however. # Recommended Reading - - [wikiquery syntax]