Finding Senators with Gap Terms on Wikidata using SPARQL
A few days ago, I went in search of United States senators who had pulled a
[Grover Cleaveland](https://en.wikipedia.org/wiki/Grover_Cleveland) and served
multiple non-consecutive terms. I couldn't think of any examples, but it's
happened a lot of times including recently with senator Dan Coats (R-IN) who
was a senator from 1989 to 1999 and from 2011 to 2017.
This post isn't really about senators and non-consecutive terms, but is a quick
walkthrough on how ask Wikidata questions using SPARQL. As a [bonus](#bonus1)
though, I found a senator who was succeeded twice by the same man and a senator
who represented three states.
# How to ask Wikipedia a question
Let's start with my question
> What senators served non-consecutive terms?
To start, go to the Wikipedia page for a current senator like [Elizabeth
Warren](https://en.wikipedia.org/wiki/Elizabeth_Warren). On the sidebar,
follow the link to "Wikidata item." As you scroll through that page, you'll
notice a section titled "positions held" with "United States Senator."
The page you're currently on is for the "Elizabeth Warren" _item_
([Q595844][wikidata warren]) and lists _properties_ that relate to her. Each
_property_ is related to the original _item_ in the form of a _statement_
containing a _subject_ - _property_ - _value_ like
_Elizabeth Warren_ ([Q595844][wikidata warren]) -
_Position Held_ ([P39][wikidata position]) -
_United States Senator_ ([Q13217683][wikidata senator]).
This "connection" itself can have properties as well, like
"start time" ([P580][wikidata start time])
which we can use in a filter later.
[wikidata warren]: https://www.wikidata.org/wiki/Q595844
[wikidata senator]: https://www.wikidata.org/wiki/Q13217683
[wikidata position]: https://www.wikidata.org/wiki/Property:P39
[wikidata start time]: https://www.wikidata.org/wiki/Property:P580
That's basically how the data is structured. In the next section, we'll use
SPARQL to actually start asking questions with this data model in mind.
[wikiquery all senators]: xxx
[wikiquery human senators]: xxx
[wikiquery current senators]: xxx
## Building the Query
Let's start with a simple question and build it up until it answers our
original question.
> List the United States Senators
I've written that question deliberately in an "imperative" mood to draw a
distinction between programming languages like python and languages like
SPARQL, which we'll use to satisfy our query.
In python, you might try to answer a question like this with code like
```python
for person in people:
if person.occupation == "United States Senator":
print(person)
```
SPARQL is in a differnt family of programming languages called "query
languages" which is related to the family of logical programming languages.
Instead of writing _instructions_ to find information, you write situations
with variables and let the program figure out what values the variables should
take.
You would instead write "find me all _items_ with a _property_
of "position held" with the value "United States Senator."
```sparql
SELECT ?senator
WHERE {
?senator
wdt:P39 # "Position Held"
wd:Q13217683. # "United States Senator"
}
```
([run query](https://query.wikidata.org/#SELECT%20%3Fsenator%0AWHERE%20%7B%0A%20%20%3Fsenator%0A%20%20%20%20wdt%3AP39%20%20%23%20%22Position%20Held%22%0A%20%20%20%20wd%3AQ13217683.%20%20%23%20%22United%20States%20Senator%22%0A%7D))
`?senator` is an unspecified variable (the particular `?name` doesn't matter)
and the database engine tries to "satisfy" the query by listing all values of
`?senator` for which the query is accurate.
The `?senator` values aren't particularly readable, so we'll paste in a little
boilerplate so we can use `?senatorLabel` in future queries which is a nice
representation of the corresponding `?senator` item.
```sparql
SELECT ?senator ?senatorLabel
WHERE {
?senator
wdt:P39 # "Position Held"
wd:Q13217683. # "United States Senator"
}
```
> Who are the current United States Senators?
As a simple modification of the previous query, we'll ask Wikidata to find
senators whose property "Position Held" with value "United States Senator" has
no end date.
In the previous query, we were able to ask about a statement directly, but
because we're asking about _qualifiers_ of the statement, we'll have to break
down the query to build it back up.
Instead of writing a query like `?subject wdt:?property wd:?value` we'll write
one statement that assigns the "Position Held" property to its own variable
using the `p:` prefix and write another a new statement about the _value_ of of
the "Position Held" using the `ps:` prefix.
([official docs][wikiquery syntax])
```sparql
SELECT ?senator ?senatorLabel ?end_time
WHERE {
?senator p:P39 ?term.
?term ps:P39 wd:Q13217683.
?term pq:P580 ?end_time.
SERVICE wikibase:label {
bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
}
}
```
([run query](https://query.wikidata.org/#SELECT%20%3Fsenator%20%3FsenatorLabel%20%3Fend_time%0AWHERE%20%7B%0A%20%20%3Fsenator%20p%3AP39%20%3Fterm.%0A%20%20%3Fterm%20ps%3AP39%20wd%3AQ13217683.%0A%20%20%3Fterm%20pq%3AP580%20%3Fend_time.%0A%20%20%20%20%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20%0A%20%20%20%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%0A%20%20%7D%0A%7D))
I've gone ahead and added a third statement to assign the "end time"
[P80][wikidata end_time] to its own variable by extracting it as a qualifier
with the `pq:` prefix, but the results that "satisfy" the query always
have end dates. So we'll add an `OPTIONAL` block and put the `pq:` statement in
that block.
```sparql
SELECT ?senator ?senatorLabel ?end_time
WHERE {
?senator p:P39 ?term.
?term ps:P39 wd:Q13217683.
OPTIONAL {
?term pq:P580 ?end_time.
}
SERVICE wikibase:label {
bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
}
}
```
([run query](https://query.wikidata.org/#SELECT%20%3Fsenator%20%3FsenatorLabel%20%3Fend_time%0AWHERE%20%7B%0A%20%20%3Fsenator%20p%3AP39%20%3Fterm.%0A%20%20%3Fterm%20ps%3AP39%20wd%3AQ13217683.%0A%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3Fterm%20pq%3AP580%20%3Fend_time.%0A%20%20%7D%0A%20%20%20%20%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20%0A%20%20%20%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%0A%20%20%7D%0A%7D))
We could filter for only the `?senator` values without an "end time" here, but
that wont be relevant in the final query so I'd recommend doing that in excel,
python, or your other tool of choice.
[wikidata end time]: https://www.wikidata.org/wiki/Property:P582
[wikiquery syntax]: https://en.wikibooks.org/wiki/SPARQL/WIKIDATA_Qualifiers,_References_and_Ranks
This "intermediate" query might have seemed like a distraction, but we've
actually gotten all of the data we need to answer the original question in
excel or python. In the first page of results, you may already see some
senators showing up twice. That's because each line corresponds to one way of
assigning the variables in the query that satisfies the requirements.
Before we move on, let's take a careful look at the data.
# Checking the data
Recognize this guy?
![David Palmer](https://upload.wikimedia.org/wikipedia/commons/4/4a/Dennis_Haysbert_cropped.jpg)
| senator | senatorLabel | end_time |
|---------|--------------|----------|
[Q249898](http://www.wikidata.org/entity/Q249898) | David Palmer | |
{:.full-width.pull-middle}
That's actually Dennis Haysbert's character from
[24](https://en.wikipedia.org/wiki/24_(TV_series)). On the [item page][wikidata
david palmer] on Wikidata, you'll see that he is an "instance of"
([P31][wikidata instance of])
"fictional human" ([Q15632617][wikidata fictional human]).
Back on Elizabeth Warren's Wikidata page, you'll see she's an instance of
"human" ([Q5][wikidata human]) so before we go any farther, let's limit
ourselves to only real humans.
[wikidata david palmer]: https://www.wikidata.org/wiki/Q249898
[wikidata instance of]: https://www.wikidata.org/wiki/Property:P31
[wikidata fictional human]: https://www.wikidata.org/wiki/Q15632617
[wikidata human]: https://www.wikidata.org/wiki/Q5
```sparql
SELECT ?senatorLabel ?start_time ?end_time ?stateLabel WHERE {
?senator wdt:P31 wd:Q5.
?senator p:P39 ?term.
?term ps:P39 wd:Q13217683.
OPTIONAL {
?term pq:P580 ?start_time.
?term pq:P582 ?end_time.
?term pq:P768 ?state.
}
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en".
}
}
```
senatorLabel |
start_time |
end_time |
stateLabel |
Brock Adams |
1987-01-03T00:00:00Z |
1993-01-03T00:00:00Z |
Washington |
David C. Broderick |
1857-03-04T00:00:00Z |
1859-09-16T00:00:00Z |
California |
David Settle Reid |
1854-12-06T00:00:00Z |
1859-03-04T00:00:00Z |
North Carolina |
Arlen Specter |
1981-01-03T00:00:00Z |
2011-01-03T00:00:00Z |
Pennsylvania |
Allen Joseph Ellender |
1937-01-03T00:00:00Z |
1972-01-27T00:00:00Z |
Louisiana |
download 2014 rows
or
run query
|
I stumbled on Wikidata a few months ago while searching for data about the
Massachusetts legislature. Besides some gotchas in the data like "fictional
humans" creeping in, the data in that case wasn't good enough to use: most
candidates--even incumbents--didn't have Wikidata items and the data that did
existe was incomplete or implemented differently for some candidates.
It's likely that this data is also incomplete or sometimes inaccurate.
Nonetheless, it's a really powerful way to start answering my question and if
you find inaccurate data, you can submit a correction easily through the
Wikidata interface.
----
# Senators with Non-Consecutive Terms
We're finally here. We have a list of senators with separate entries for each
term they've served and we have the start and end dates of the terms are
included to help catch duplicate data. This part is pretty straightforward in
excel or python ([see addendum](#findDuplicates)) so here's you're reward:
senator |
term start |
term end |
notes |
Alben W. Barkley |
1927-03-04 |
1949-01-19 |
|
|
1955-01-03 |
1956-04-30 |
|
Angus Cameron |
1875-03-04 |
1881-03-03 |
|
|
1881-03-14 |
1885-03-03 |
|
Dan Coats |
1989-01-03 |
1999-01-03 |
|
download 57 rows
|
# Bonus #1: The senator who was succeeded twice by the same man
{:#bonus1}
Among the plenty of cases when that
happened, I stumbled found a case where two senators served four alternating
terms in 17 years.
In 1869, Matthew Hale Carpenter (R-WI) began his first term in the United
States Senate. Although he campaigned for re-election in 1875, he ultimately
lost to Angus Cameron (D-WI). After just one term, Cameron decided to call it a
quits and Carpenter was reelected by the state legislature (pre-17th
Ammendment).
You're probably guessing where this is going, but Carpenter was already in poor
health when he took office in 1881 and died just months into his disconnected,
second term. Cameron was fortunately still around and was quickly re-elected
and began his own disconnected, second term in March 1881.
# Bonus #2: James Shields (D-IL/MN/MO)
After almost dueling Lincoln and then almost becoming Governor of Oregon
Territory, James Shields was appointed Senator of Illinois, resigned because he
hadn't been a citizen long enough, and then was elected again before the end of
the year.
After defeat in 1855, Shields moved to the territory of Minnesota and was
elected senator when Minnesota became a state in 1858. Shields drew the short
straw (literally) and his term ended in 1859 so the other senator from
Minnesota could serve until 1863.
In 1879, after moving to California and back to the South and fighting for the
Union in the Civil War, Shields was eventually elected to replace a Missouri
senator who had died near the end of his term. Shields finally called it a
quits, presumably satisfied his record of serving as senator from three states
would never be matched.
I've left out a lot of details and would definitely recommend reading his
wikipedia article .
# Addendum: Getting duplicates from python
{:#findDuplicates}
Here's how I extrated the duplicates using python.
```python
import pandas
df = pandas.read_csv("query.csv", parse_dates=['start_time', 'end_time'])
df['start_time'] = pandas.to_datetime(df['start_time'].dt.date)
df['end_time'] = pandas.to_datetime(df['end_time'].dt.date)
# Data comes from multiple sources and includes a lot of exact duplicates
df.drop_duplicates(inplace=True)
multi_terms = df.duplicates('senatorLabel', keep=False)
df[multi_terms].set_index(['senatorLabel', 'start_time']).sort_index()
```
The data at this point can be scanned through for anomolies or used however.
# Recommended Reading
-
- [wikiquery syntax]