Analysing the arXiv
This post is about what you can learn about scientific articles posted on the arXiv by using Natural Language Processing (NLP). Said differently: I had some questions about papers posted on the arXiv and used it as an excuse to teach myself the basics of NLP. We also look at citation counts and reveal the top cited paper of 2014!
The arXiv makes its data available via a simple API which allows you to download almost everything about an article short of its full text. For each article we can also look up on Inspire who has been citing it. Combined, this is a powerful dataset that can answer some interesting questions: what are the most used words? Can we auto-generate abstracts, or summarise them? Which article was the most cited of 2014?
Let's get going!
First some standard imports that we will need later. Some of them you might need to install but nothing too obscure:
import time
import urllib2
import datetime
from itertools import ifilter
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup
import matplotlib.pylab as plt
import pandas as pd
import numpy as np
import bibtexparser
pd.set_option('mode.chained_assignment','warn')
%matplotlib inline
The harvest function will query the arXiv API for all articles modified between 1 January 2010 and the end of 2014. This is a subtlety worth noting: with this query you will also get articles created before 2010 if their entry was modified in 2010 or later.
The arXiv itself covers so many topics that it is organised into separate arxivs (I know, an unfortunate double use of the name arxiv), one for each topic. By default harvest will collect articles from the physics:hep-ex arxiv. This is because I am an experimental particle physicist. If you are into theory try physics:hep-th, or stat if you are a stats guru. The API gives you a full list of all sets of topics to explore.
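To make the "full list of sets" concrete: OAI-PMH, the protocol behind this API, has a standard ListSets verb. The sketch below only builds request URLs and does not contact the arXiv. It is written for Python 3 (unlike the Python 2 code in the rest of this post), the helper name oai_url is made up for illustration, and it assumes the endpoint used in this post still answers at the same address.

```python
from urllib.parse import urlencode

# Endpoint used by the harvest function in this post (assumed still valid).
BASE_URL = "http://export.arxiv.org/oai2"


def oai_url(verb, params=None):
    """Build an OAI-PMH request URL for a verb plus optional parameters.

    Hypothetical helper for illustration; it only assembles a string.
    """
    query = [("verb", verb)] + sorted((params or {}).items())
    return BASE_URL + "?" + urlencode(query)


# ListSets is the standard OAI-PMH verb for listing the available sets.
list_sets_url = oai_url("ListSets")

# The same kind of query that harvest sends for its first page.
records_url = oai_url("ListRecords", {
    "from": "2010-01-01",
    "until": "2014-12-31",
    "metadataPrefix": "arXiv",
    "set": "physics:hep-ex",
})

print(list_sets_url)
print(records_url)
```

Note that urlencode percent-encodes the colon in the set name, which is equivalent to writing it literally in the URL.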
If you do not care for the technicalities of how to scrape the data, skip right ahead to the first factoid.
Harvesting articles
Most of harvest is pretty straightforward. The API returns a big XML document containing information about at most 1000 articles, which we can parse with ElementTree and store. If there are more than 1000 articles for a particular query we can get the rest using the resumptionToken in the XML. API access can be throttled, so on occasion the arXiv will reply with a 503 error asking us to retry later. The information we harvest is stored in a pandas dataframe.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV = "{http://arxiv.org/OAI/arXiv/}"

def harvest(arxiv="physics:hep-ex"):
    df = pd.DataFrame(columns=("title", "abstract", "categories", "created", "id", "doi"))
    base_url = "http://export.arxiv.org/oai2?verb=ListRecords&"
    url = (base_url +
           "from=2010-01-01&until=2014-12-31&" +
           "metadataPrefix=arXiv&set=%s"%arxiv)

    while True:
        print "fetching", url
        try:
            response = urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            if e.code == 503:
                to = int(e.hdrs.get("retry-after", 30))
                print "Got 503. Retrying after {0:d} seconds.".format(to)
                time.sleep(to)
                continue
            else:
                raise

        xml = response.read()
        root = ET.fromstring(xml)

        for record in root.find(OAI+'ListRecords').findall(OAI+"record"):
            arxiv_id = record.find(OAI+'header').find(OAI+'identifier')
            meta = record.find(OAI+'metadata')
            info = meta.find(ARXIV+"arXiv")
            created = info.find(ARXIV+"created").text
            created = datetime.datetime.strptime(created, "%Y-%m-%d")
            categories = info.find(ARXIV+"categories").text

            # if there is more than one DOI use the first one
            # often the second one (if it exists at all) refers
            # to an erratum or similar
            doi = info.find(ARXIV+"doi")
            if doi is not None:
                doi = doi.text.split()[0]

            contents = {'title': info.find(ARXIV+"title").text,
                        'id': info.find(ARXIV+"id").text,#arxiv_id.text[4:],
                        'abstract': info.find(ARXIV+"abstract").text.strip(),
                        'created': created,
                        'categories': categories.split(),
                        'doi': doi,
                        }

            df = df.append(contents, ignore_index=True)

        # The list of articles returned by the API comes in chunks of
        # 1000 articles. The presence of a resumptionToken tells us that
        # there is more to be fetched.
        token = root.find(OAI+'ListRecords').find(OAI+"resumptionToken")
        if token is None or token.text is None:
            break
        else:
            url = base_url + "resumptionToken=%s"%(token.text)

    return df
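The parsing and paging logic can be tried offline on a small document. The following is a minimal sketch in Python 3 (the post's own code is Python 2). SAMPLE is a hand-written stand-in with invented values, not real arXiv output, and parse_page is a hypothetical helper that mirrors how harvest reads the id of each record and the resumptionToken.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV = "{http://arxiv.org/OAI/arXiv/}"

# Hand-written sample with invented values, shaped like one page of results.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:arXiv.org:0000.0000</identifier></header>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <id>0000.0000</id>
          <created>2012-07-31</created>
          <title>A made-up title</title>
        </arXiv>
      </metadata>
    </record>
    <resumptionToken>made-up-token|1001</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_page(xml_text):
    """Return (list of article ids, next token or None) for one page."""
    root = ET.fromstring(xml_text)
    records = root.find(OAI + "ListRecords")
    ids = []
    for record in records.findall(OAI + "record"):
        info = record.find(OAI + "metadata").find(ARXIV + "arXiv")
        ids.append(info.find(ARXIV + "id").text)
    token = records.find(OAI + "resumptionToken")
    # A missing or empty token means there are no further pages.
    next_token = token.text if token is not None and token.text else None
    return ids, next_token


ids, next_token = parse_page(SAMPLE)
print(ids, next_token)

# Simulate the last page by emptying the token element.
last_ids, last_token = parse_page(SAMPLE.replace("made-up-token|1001", ""))
print(last_ids, last_token)
```

The curly-brace prefixes are how ElementTree spells XML namespaces, which is why every tag lookup is prefixed with OAI or ARXIV.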
Set harvest running and go chat with someone for a few minutes while it gathers the information about your articles.
df = harvest()
What does all that stuff we just downloaded look like? Here are the first five entries in the dataframe:
df.head()
First factoid
def bar_chart(items):
    """Make a bar chart showing the count associated with each key

    `items` is a list of (key, count) pairs.
    """
    width = 0.5
    ind = np.arange(len(items))

    fig, ax = plt.subplots(figsize=(8,8))
    rects1 = ax.bar(ind, zip(*items)[1], width, color='r')
    ax.set_xticks(ind+width)
    ax.set_xticklabels(zip(*items)[0])
    fig.autofmt_xdate()
    plt.show()
edits_per_year = Counter(df.created.map(lambda x: x.year))
bar_chart(edits_per_year.items())
new_articles = sum(edits_per_year[year] for year in (2010,2011,2012,2013,2014))
print "Unique arXiv IDs edited between 2010 and 2014:", len(df.id.unique())
print "of which %i entries were created in that time period."%(new_articles)
Here is our first factoid about the arXiv: there are about 16000 articles in hep-ex which were edited between the beginning of 2010 and the end of 2014, including about 12000 newly created articles. The other 4000 papers were created before 2010 and updated during this period. Amazing to see that papers created in 1994 were still being edited more than fifteen years later!
Let's take a look at those, maybe there is something interesting:
df[df.created<datetime.datetime(1995,1,1)]
After scrolling through the list I cannot spot a particular pattern to the edits, though the list does contain articles on interesting topics such as evidence for the top quark (index 15148), PeV $\tau$ neutrinos (index 16113) and determining the weak mixing angle at SLD (index 15149).
Collecting citations
Unfortunately Inspire does not provide a real API, so we have to scrape their webpages to get what we want. The get_cites function will look up the citations of an article by its arxiv_id. Having to make one HTTP request per article means this takes quite a while, so set it going and come back after a few hours. We process articles in chunks of 1000, both to get some feedback and to be able to resume if something goes wrong:
def get_cites(arxiv_id):
    cites = []
    base_url = "http://inspirehep.net/search?p=refersto:%s&of=hx&rg=250&jrec=%i"
    offset = 1

    while True:
        print base_url%(arxiv_id, offset)
        response = urllib2.urlopen(base_url%(arxiv_id, offset))
        xml = response.read()
        soup = BeautifulSoup(xml)

        refs = "\n".join(cite.get_text() for cite in soup.findAll("pre"))

        bib_database = bibtexparser.loads(refs)
        if bib_database.entries:
            cites += bib_database.entries
            offset += 250
        else:
            break

    return cites

step = 1000
for N in range(0,17):
    print N
    cites = df['id'][N*step:(N+1)*step].map(get_cites)
    df.ix[N*step:(N+1)*step -1,'cited_by'] = cites
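The offset-based paging inside get_cites can be tried without touching Inspire. This is a minimal sketch in Python 3 (the post's own code is Python 2); collect_all, fake_fetch and FAKE_CITATIONS are invented stand-ins for the real HTTP request and its results.

```python
def collect_all(fetch_page, page_size=250):
    """Call fetch_page(offset) with 1-based offsets until a page is empty."""
    results = []
    offset = 1
    while True:
        page = fetch_page(offset)
        if not page:
            break
        results.extend(page)
        offset += page_size
    return results


# 600 invented entries standing in for the citations of one article.
FAKE_CITATIONS = ["cite-%d" % i for i in range(1, 601)]


def fake_fetch(offset, page_size=250):
    """Pretend to be the server: return up to page_size entries from offset."""
    start = offset - 1  # the jrec parameter counts from 1
    return FAKE_CITATIONS[start:start + page_size]


all_cites = collect_all(fake_fetch)
print(len(all_cites))  # 600, fetched as pages of 250, 250, 100 and one empty page
```

As in get_cites, the loop only learns that it is done by asking for one page too many and getting nothing back, so every article costs at least one request even if it has no citations.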
Data preservation
After investing so much time to gather the raw data it is a good idea to store it locally so we do not have to scrape it all again later:
store = pd.HDFStore("/Users/thead/git/arxiv-experiments/hep-ex.h5")
#store['df'] = df
#df = store['df']
store.close()
Word counts
Let's get to answering some questions. What are the ten most used words in hep-ex abstracts?
word_bag = " ".join(df.abstract.apply(lambda t: t.lower()))
Counter(word_bag.split()).most_common(n=10)
Not too enlightening, boring little words close out the top ten. These words are known as stopwords and the NLTK library provides a list of them. So let's remove them, as well as basic mathematical symbols:
from nltk.corpus import stopwords
stops = [word for word in stopwords.words('english')]
stops += ["=", "->"]
words = filter(lambda w: w not in stops,
               word_bag.split())
top_twenty = Counter(words).most_common(n=20)
bar_chart(top_twenty)
Experimental physics is all about data after all! A shame that "model" beats "detector", but that is probably inevitable as there are many more theoretical models than experimental detectors. The "higgs" boson beats the "neutrino", but together they reign supreme over all the other particles.
Towards the bottom we have "measurements" and "measured". These should probably be counted as one entry, together with "measurement", "measuring", etc. The easiest way to achieve this is to stem the words before counting. Stemming is the process of reducing derived words to their stem, for example:
import nltk.stem as stem
porter = stem.PorterStemmer()
for w in ("measurement", "measurements", "measured", "measure"):
    print w, "->", porter.stem(w)
As in this case, the stem does not have to be a real word itself. By stemming words before counting how often they occur, the entries for "measurements" and "measured" get added together. Using the stem of every word we get the following ranking:
word_stems = map(lambda w: (porter.stem(w),w), words)
stem2words = defaultdict(set)
for stem, word in word_stems:
    stem2words[stem].add(word)

top_twenty = Counter(w[0] for w in word_stems).most_common(n=20)
bar_chart(top_twenty)

# list all words which correspond to each top twenty stem
for stem,count in top_twenty:
    print stem, "<-", ", ".join(stem2words[stem])
Turns out experimental physics is all about measuring things. The stemming is not perfect, but good enough for now.
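The whole pipeline (drop stopwords, stem, count, remember which words fed each stem) can be seen on a single sentence using only the standard library. This is a rough sketch in Python 3 (the post's own code is Python 2). STOPWORDS is a tiny hand-picked list and crude_stem is a made-up suffix stripper, so neither is a substitute for the NLTK stopword list or the Porter stemmer.

```python
from collections import Counter, defaultdict

# Tiny hand-picked stopword list, far shorter than the one NLTK provides.
STOPWORDS = {"the", "of", "a", "is", "and", "we", "in", "with"}


def crude_stem(word):
    """Strip a few common suffixes. A rough stand-in, NOT the Porter algorithm."""
    for suffix in ("ements", "ement", "ing", "ed", "es", "s", "e"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Invented sentence in the style of a hep-ex abstract.
text = ("we present a measurement of the top quark width and the measured "
        "width is compared with earlier measurements")

words = [w for w in text.lower().split() if w not in STOPWORDS]

stem_counts = Counter(crude_stem(w) for w in words)
stem_to_words = defaultdict(set)
for w in words:
    stem_to_words[crude_stem(w)].add(w)

for stem_, count in stem_counts.most_common(3):
    print(stem_, count, "<-", ", ".join(sorted(stem_to_words[stem_])))
```

Here the three forms of "measure" collapse onto the single stem "measur", which then outranks "width" even though no single surface form does.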
Cite me, cite me, no cite me!
In science citations are the currency used to measure the success of a paper. What does the distribution of citations look like then?
A simple question to ask is: how often are articles cited? As articles have to be read and understood before they can be cited we only look at articles created before the beginning of 2014.
before_2014 = datetime.datetime(2014,1,1)
plt.hist(df[df.created<before_2014].cited_by.map(len),
         bins=200, normed=True, range=(0,200))
plt.xlabel("Number of citations")
plt.ylabel("Fraction")
This plot shows the fraction of articles cited zero, one, two, three, ... times. The single most likely number of citations for an article on hep-ex is zero! A whopping 13% of articles never get cited and nearly a third of articles are cited fewer than four times.
df['citation_count'] = df.cited_by.map(len)
df[df.created<before_2014]['citation_count'].describe()
The average number of citations is about 33. The average is misleading for a steeply falling distribution like this; after all, we reach the 50th percentile at only 10 citations!
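A toy example shows why the mean and the median drift so far apart for a long-tailed distribution. The numbers below are invented for illustration and are not taken from the hep-ex data; the sketch is Python 3 and uses only the standard library.

```python
from statistics import mean, median

# Invented citation counts with a long tail, NOT the real hep-ex numbers.
citations = [0, 0, 1, 2, 3, 5, 8, 10, 12, 20, 35, 60, 150, 400, 4000]

avg = mean(citations)
mid = median(citations)

# Share of articles cited fewer than four times.
share_below_four = sum(1 for c in citations if c < 4) / len(citations)

print("mean:", round(avg, 1))
print("median:", mid)
print("cited fewer than four times:", round(share_below_four, 2))
```

A single paper with thousands of citations pulls the mean far above what a typical article collects, while the median barely notices it.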
Citations top ten
The prize for most cited paper with a whopping 4138 citations goes to:
df.loc[df.citation_count.idxmax()]
Read the paper yourself: New Generation of Parton Distributions with Uncertainties from Global QCD Analysis and see if you agree.
We can also easily compute the top ten papers. This is an interesting mix of articles. Numbers two and three are the papers by the ATLAS and CMS experiments reporting the discovery of the Higgs boson. While most of the papers in the top ten are older, these two were only published in 2012 and have already overtaken the top quark discovery, which was published in 1995! Curious fact: the ATLAS paper has ever so slightly more citations than the CMS one.
df.sort('citation_count', ascending=False).head(10)
There are many more interesting things to be done with analysing the words used in abstracts as well as analysing who cites whom. This will be covered in the second part of this post as this one is already fairly lengthy.
Just one more thing, the top cited paper of 2014: First combination of Tevatron and LHC measurements of the top-quark mass, celebrating collaboration across the globe.
This post started life as an IPython notebook; download it or view it online.