Analysing the arXiv

16 December 2014

This post is about what you can learn about scientific articles posted on the arXiv by using Natural Language Processing (NLP). Said differently: I had some questions about papers posted on the arXiv and used it as an excuse to teach myself the basics of NLP. We also look at citation counts and reveal the top cited paper of 2014!

The arXiv makes its data available via a simple API which lets you download almost everything about an article short of its full text. For each article we can also look up who has been citing it on Inspire. Combined, this is a powerful dataset that can answer some interesting questions: what are the most used words, can we auto-generate abstracts, can we summarise abstracts, and what is the most cited article of 2014?

Let's get going!

First some standard imports that we will need later. Some of them you might need to install but nothing too obscure:

In [175]:
import time
import urllib2
import datetime
from itertools import ifilter
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET

from bs4 import BeautifulSoup
import matplotlib.pylab as plt
import pandas as pd
import numpy as np
import bibtexparser

pd.set_option('mode.chained_assignment','warn')
In [2]:
%matplotlib inline

The harvest function will query the arXiv API for all articles modified between 1 January 2010 and the end of 2014. This is a subtlety worth noting: with this query you will also get articles created before 2010 if their entry was modified in 2010 or later.

The arXiv itself covers so many topics that it is organised into separate arxivs (I know, an unfortunate double use of the name arxiv), one for each topic. By default harvest will collect articles from the physics:hep-ex arxiv. This is because I am an experimental particle physicist. If you are into theory try physics:hep-th, or stats if you are a stats guru. The API gives you a full list of all sets of topics to explore.

If you do not care for the technicalities of how to scrape the data skip right ahead to the first factoid.

Harvesting articles

Most of harvest is pretty straightforward. The API returns a big XML document containing information about at most 1000 articles, which we can parse with ElementTree and store. If there are more than 1000 articles for a particular query we can get the rest using the resumptionToken in the XML. API access can be throttled, so on occasion the arXiv will reply with a 503 error asking us to retry later. The information we harvest is stored in a pandas dataframe.

In [4]:
OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV = "{http://arxiv.org/OAI/arXiv/}"

def harvest(arxiv="physics:hep-ex"):
    df = pd.DataFrame(columns=("title", "abstract", "categories", "created", "id", "doi"))
    base_url = "http://export.arxiv.org/oai2?verb=ListRecords&"
    url = (base_url +
           "from=2010-01-01&until=2014-12-31&" +
           "metadataPrefix=arXiv&set=%s"%arxiv)
    
    while True:
        print "fetching", url
        try:
            response = urllib2.urlopen(url)
            
        except urllib2.HTTPError, e:
            if e.code == 503:
                to = int(e.hdrs.get("retry-after", 30))
                print "Got 503. Retrying after {0:d} seconds.".format(to)

                time.sleep(to)
                continue
                
            else:
                raise
            
        xml = response.read()

        root = ET.fromstring(xml)

        for record in root.find(OAI+'ListRecords').findall(OAI+"record"):
            arxiv_id = record.find(OAI+'header').find(OAI+'identifier')
            meta = record.find(OAI+'metadata')
            info = meta.find(ARXIV+"arXiv")
            created = info.find(ARXIV+"created").text
            created = datetime.datetime.strptime(created, "%Y-%m-%d")
            categories = info.find(ARXIV+"categories").text

            # if there is more than one DOI use the first one
            # often the second one (if it exists at all) refers
            # to an erratum or similar
            doi = info.find(ARXIV+"doi")
            if doi is not None:
                doi = doi.text.split()[0]
                
            contents = {'title': info.find(ARXIV+"title").text,
                        'id': info.find(ARXIV+"id").text,#arxiv_id.text[4:],
                        'abstract': info.find(ARXIV+"abstract").text.strip(),
                        'created': created,
                        'categories': categories.split(),
                        'doi': doi,
                        }

            df = df.append(contents, ignore_index=True)

        # The list of articles returned by the API comes in chunks of
        # 1000 articles. The presence of a resumptionToken tells us that
        # there is more to be fetched.
        token = root.find(OAI+'ListRecords').find(OAI+"resumptionToken")
        if token is None or token.text is None:
            break

        else:
            url = base_url + "resumptionToken=%s"%(token.text)
            
    return df
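
The XML handling inside harvest can be tried out offline before hitting the API. What follows is a minimal sketch, not part of the original notebook: the sample document is hand-written and trimmed to the fields used here, the IDs, DOIs and token in it are made up, and parse_page is a hypothetical helper name. It is written for Python 3, unlike the Python 2 cells in the rest of this post.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV = "{http://arxiv.org/OAI/arXiv/}"

# Hand-written sample (NOT real API output); all values are made up.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:arXiv.org:0000.0001</identifier></header>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <id>0000.0001</id>
          <created>2012-07-31</created>
          <title>A made-up title</title>
          <categories>hep-ex hep-ph</categories>
          <doi>10.0000/first 10.0000/erratum</doi>
        </arXiv>
      </metadata>
    </record>
    <resumptionToken>made-up-token|1001</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""


def parse_page(xml_text):
    """Return (records, token) for one page of a ListRecords response.

    `token` is None when there is nothing more to fetch.
    """
    root = ET.fromstring(xml_text)
    list_records = root.find(OAI + "ListRecords")

    records = []
    for record in list_records.findall(OAI + "record"):
        info = record.find(OAI + "metadata").find(ARXIV + "arXiv")

        # Keep only the first DOI if several are listed.
        doi = info.find(ARXIV + "doi")
        if doi is not None:
            doi = doi.text.split()[0]

        records.append({
            "id": info.find(ARXIV + "id").text,
            "title": info.find(ARXIV + "title").text,
            "created": info.find(ARXIV + "created").text,
            "categories": info.find(ARXIV + "categories").text.split(),
            "doi": doi,
        })

    # A missing or empty resumptionToken means this was the last page.
    token = list_records.find(OAI + "resumptionToken")
    token = token.text if token is not None and token.text else None
    return records, token


records, token = parse_page(SAMPLE)
print(records[0]["id"], records[0]["categories"], token)
```

Collecting the records in a plain list and building the dataframe once at the end would also avoid growing a dataframe row by row, which gets slow for tens of thousands of articles.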
    

Set harvest running and go chat with someone for a few minutes while it gathers the information about your articles.

In [ ]:
df = harvest()

What does all that stuff we just downloaded look like? Here are the first five entries in the dataframe:

In [73]:
df.head()
Out[73]:
title abstract categories created id doi cited_by
0 Measurement of the Hadronic Form Factor in D0 ... The shape of the hadronic form factor f+(q2) i... [hep-ex] 2007-03-31 0704.0020 doi:10.1103/PhysRevD.76.052005 [{u'slaccitation': u'%%CITATION = ARXIV:1411.3...
1 Measurement of B(D_S^+ --> ell^+ nu) and the D... We examine e+e- --> Ds- Ds*+ and Ds*- Ds+ inte... [hep-ex, hep-lat, hep-ph] 2007-04-03 0704.0437 10.1103/PhysRevD.76.072002 [{u'author': u'Yelton, John M.', u'journal': u...
2 A unified analysis of the reactor neutrino pro... We present in this article a detailed quantita... [hep-ex] 2007-04-04 0704.0498 10.1088/1742-6596/110/8/082013 [{u'author': u'Queval, Rachel', u'title': u'Ch...
3 Measurement of Decay Amplitudes of B -->(c cba... We perform the first three-dimensional measure... [hep-ex] 2007-04-04 0704.0522 10.1103/PhysRevD.76.031102 [{u'author': u'Giurgiu, Gavril', u'journal': u...
4 Measurement of the Decay Constant $f_D{_S^+}$ ... We measure the decay constant fDs using the Ds... [hep-ex, hep-lat, hep-ph] 2007-04-04 0704.0629 10.1103/PhysRevLett.99.071802 [{u'author': u'Jackson, Graham', u'type': u'ar...

First factoid

In [71]:
def bar_chart(items):
    """Make a bar chart showing the count associated with each key
    
    `items` is a list of (key, count) pairs.
    """
    width = 0.5
    ind = np.arange(len(items))
    fig, ax = plt.subplots(figsize=(8,8))
    rects1 = ax.bar(ind, zip(*items)[1], width, color='r')
    ax.set_xticks(ind+width)
    ax.set_xticklabels(zip(*items)[0])
    fig.autofmt_xdate()
    plt.show()

edits_per_year = Counter(df.created.map(lambda x: x.year))
bar_chart(edits_per_year.items())
new_articles = sum(edits_per_year[year] for year in (2010,2011,2012,2013,2014))
print "Unique arXiv IDs edited between 2010 and 2014:", len(df.id.unique())
print "of which %i entries were created in that time period."%(new_articles)
Unique arXiv IDs edited between 2010 and 2014: 16321
of which 11958 entries were created in that time period.

Here is our first factoid about the arXiv: there are about 16000 articles in hep-ex which were edited between the beginning of 2010 and the end of 2014, including about 12000 newly created articles. The remaining 4000 or so papers were created before 2010 and updated later. Amazing to see that papers created in 1994 were still being edited more than fifteen years later!

Let's take a look at those, maybe there is something interesting:

In [72]:
df[df.created<datetime.date(1995,1,1)]
Out[72]:
title abstract categories created id doi cited_by
13366 DUMAND and AMANDA: High Energy Neutrino Astrop... The field of high energy neutrino astrophysics... [astro-ph, hep-ex] 1994-12-06 astro-ph/9412019 None [{u'author': u'Al Samarai, Imen', u'type': u'a...
13375 Detection of nuclear recoils in prototype dark... This work is part of an ongoing project to dev... [cond-mat, hep-ex] 1994-11-17 cond-mat/9411072 10.1016/0168-9002(95)00036-4 [{u'doi': u'10.1016/j.astropartphys.2004.06.00...
15144 Precise Measurement of the Left-Right Cross Se... We present a precise measurement of the left-r... [hep-ex, hep-ph] 1994-04-27 hep-ex/9404001 10.1103/PhysRevLett.73.25 [{u'doi': u'10.1088/1742-6596/335/1/012078', u...
15145 An optimal method of moments to measure the ch... Parity violation at LEP or SLC can be measured... [hep-ex] 1994-05-11 hep-ex/9405002 10.1016/0168-9002(94)90847-8 []
15146 Observation of Anisotropic Event Shapes and Tr... Event shapes for Au + Au collisions at 11.4 Ge... [hep-ex] 1994-05-13 hep-ex/9405003 10.1103/PhysRevLett.73.2532 [{u'primaryclass': u'nucl-th', u'author': u'Wa...
15147 Measurement of the Charged Multiplicity of $Z ... Using an impact parameter tag to select an enr... [hep-ex, hep-ph] 1994-05-13 hep-ex/9405004 10.1103/PhysRevLett.72.3145 [{u'doi': u'10.1140/epjc/s2005-02424-5', u'pri...
15148 Evidence for Top Quark Production in $\bar{p}p... We summarize a search for the top quark with t... [hep-ex, hep-ph] 1994-05-16 hep-ex/9405005 10.1103/PhysRevLett.73.225 [{u'primaryclass': u'hep-ex', u'author': u'Ger...
15149 Precise Determination of the Weak Mixing Angle... In the 1993 SLC/SLD run, the SLD recorded 50,0... [hep-ex] 1994-05-20 hep-ex/9405011 None [{u'doi': u'10.1016/0370-1573(95)00072-0', u'a...
15150 A Neural Network for Locating the Primary Vert... Using simulated collider data for $p+p\rightar... [hep-ex] 1994-06-21 hep-ex/9406003 10.1016/0168-9002(94)01133-8 []
15151 Semileptonic Branching Fraction of Charged and... An examination of leptons in ${\Upsilon (4S)}$... [hep-ex] 1994-06-23 hep-ex/9406004 10.1103/PhysRevLett.73.3503 [{u'author': u'Ivarsson, Jenny', u'title': u'P...
15152 Measurement of the B -> D^* l nu Branching Fra... We study the exclusive semileptonic B meson de... [hep-ex] 1994-06-24 hep-ex/9406005 10.1103/PhysRevD.51.1014 [{u'author': u'Borean, Cristiano', u'type': u'...
15153 Spin Asymmetry in Muon--Proton Deep Inelastic ... We measured the spin asymmetry in the scatteri... [hep-ex] 1994-08-06 hep-ex/9408001 10.1016/0370-2693(94)00968-6 [{u'primaryclass': u'nucl-ex', u'author': u'Pa...
15154 Measurement of the polarization of Lambda0 Ant... The polarization of Lambda0, AntiLambda0, Sigm... [hep-ex] 1994-09-16 hep-ex/9409001 10.1007/BF01291194 [{u'doi': u'10.1088/1742-6596/509/1/012056', u...
15155 Search for slowly moving magnetic monopoles We report a search for slowly moving magnetic ... [hep-ex] 1994-10-05 hep-ex/9410006 10.1016/0920-5632(94)90257-7 []
15156 Polarized Bhabha Scattering and a Precision Me... We present the first measurement of the left-r... [hep-ex] 1994-10-10 hep-ex/9410009 10.1103/PhysRevLett.74.2880 [{u'author': u'Quast, Gunther', u'title': u'Me...
15157 A Measurement of the $D^{*\pm}$ Cross Section ... We have measured the inclusive $D^{*\pm}$ prod... [hep-ex] 1994-11-29 hep-ex/9411002 10.1103/PhysRevD.50.1879 [{u'author': u'Ngac, An Bang', u'title': u'Mea...
15158 Measurement of the $D^{*\pm}$ Cross Section us... The differential cross section of $d\sigma(e^+... [hep-ex] 1994-12-01 hep-ex/9412001 10.1016/0370-2693(94)91515-6 [{u'author': u'Ngac, An Bang', u'title': u'Mea...
15159 $K^0(\bar{K^0})$ Production in Two-Photon Proc... We have carried out an inclusive measurement o... [hep-ex] 1994-12-05 hep-ex/9412003 10.1016/0370-2693(94)90315-8 [{u'doi': u'10.1016/S0370-2693(02)01769-0', u'...
15160 New Tagging Method of B Flavor of Neutral B Me... In CP violation measurements in asymmetric B-f... [hep-ex] 1994-12-08 hep-ex/9412005 10.1143/JPSJ.63.3542 [{u'author': u'Foland, Andrew Dean', u'title':...
15161 Kinematic Evidence for Top Quark Pair Producti... We present a study of $W+$multijet events that... [hep-ex] 1994-12-13 hep-ex/9412009 10.1103/PhysRevD.51.4623 [{u'author': u'Hinchliffe, Ian and Paige, FE a...
15162 Feasibility Study of Single-Photon Counting Us... The fine-mesh phototube is one type of photode... [hep-ex] 1994-12-13 hep-ex/9412010 10.1016/0168-9002(93)90749-8 [{u'doi': u'10.1140/epjc/s10052-014-3026-9', u...
15163 Measurement of inclusive electron cross sectio... We have studied open charm production in $\gam... [hep-ex] 1994-12-16 hep-ex/9412011 10.1016/0370-2693(94)01349-7 [{u'doi': u'10.1016/S0370-2693(02)01769-0', u'...
15164 Measurement of the forward-backward asymmetrie... We have measured, with electron tagging, the f... [hep-ex] 1994-12-18 hep-ex/9412012 10.1016/0370-2693(94)91310-2 [{u'doi': u'10.1103/PhysRevD.65.053002', u'pri...
15165 J/psi,psi(2S) to mu+ mu- and B to J/psi,psi(2S... This paper presents a measurement of J/psi,psi... [hep-ex] 1994-12-23 hep-ex/9412013 None [{u'slaccitation': u'%%CITATION = ARXIV:1411.3...
15166 Measurement of inclusive particle spectra and ... Inclusive momentum spectra are measured for al... [hep-ex] 1994-12-27 hep-ex/9412015 10.1016/0370-2693(94)01685-6 [{u'slaccitation': u'%%CITATION = ARXIV:1412.2...
15167 Measurement of the Bs Meson Lifetime The lifetime of the $B_s$ meson is measured us... [hep-ex] 1994-12-27 hep-ex/9412017 10.1103/PhysRevLett.74.4988 [{u'doi': u'10.1007/s00601-014-0871-x', u'prim...
16101 Exclusive Hadronic B Decays to Charm and Charm... We have fully reconstructed decays of both B0 ... [hep-ph, hep-ex] 1994-03-15 hep-ph/9403295 10.1103/PhysRevD.50.43 [{u'author': u'Sabelli, Chiara', u'title': u'T...
16102 Observation of a New Charmed Strange Meson Using the CLEO-II detector, we have obtained e... [hep-ph, hep-ex] 1994-03-21 hep-ph/9403325 10.1103/PhysRevLett.72.1972 [{u'slaccitation': u'%%CITATION = ARXIV:1410.5...
16103 Study of the Decay $\Lambda_c \to \Lambda l^+ ... Using the CLEO II detector at CESR, we observe... [hep-ph, hep-ex] 1994-03-21 hep-ph/9403326 10.1016/0370-2693(94)90295-X [{u'doi': u'10.1140/epjc/s10052-014-3194-7', u...
16104 Precision Measurement of the $D_s^{*+}- D_s^+$... We have measured the vector-pseudoscalar mass ... [hep-ph, hep-ex] 1994-03-21 hep-ph/9403327 10.1103/PhysRevD.50.1884 [{u'doi': u'10.1007/JHEP06(2013)065', u'primar...
16105 A Measurement of ${\cal B}(D_s \to \phi l^+ \n... Using the CLEO~II detector at CESR, we have me... [hep-ph, hep-ex] 1994-03-21 hep-ph/9403328 10.1016/0370-2693(94)90416-2 [{u'doi': u'10.1103/RevModPhys.84.65', u'prima...
16106 Measurement of Cabibbo Suppressed Decays of th... Branching ratios for the dominant Cabibbo-supp... [hep-ph, hep-ex] 1994-03-21 hep-ph/9403329 10.1103/PhysRevLett.73.1079 [{u'doi': u'10.1103/PhysRevD.87.073016', u'pri...
16107 Production and Decay of D_1(2420)^0 and D_2^*(... We have investigated $D^{+}\pi^{-}$ and $D^{*+... [hep-ph, hep-ex] 1994-03-24 hep-ph/9403359 10.1016/0370-2693(94)90968-7 [{u'slaccitation': u'%%CITATION = ARXIV:1410.5...
16108 Two-Photon Production of Charged Pion and Kaon... A measurement of the cross section for the com... [hep-ph, hep-ex] 1994-03-28 hep-ph/9403379 10.1103/PhysRevD.50.3027 [{u'slaccitation': u'%%CITATION = ARXIV:1307.0...
16109 Measurement of the Branching Fraction for D^+ ... Using the CLEO-II detector at CESR we have mea... [hep-ph, hep-ex] 1994-03-28 hep-ph/9403382 10.1103/PhysRevLett.72.2328 [{u'doi': u'10.1103/RevModPhys.84.65', u'prima...
16110 Measurement of the Spin-Dependent Structure Fu... We have measured the spin-dependent structure ... [hep-ph, hep-ex] 1994-04-15 hep-ph/9404270 10.1016/0370-2693(94)90793-5 [{u'doi': u'10.1103/PhysRevD.90.012009', u'pri...
16111 A Measurement of the Branching Fraction ${\cal... Using data from the CLEO II detector at CESR, ... [hep-ph, hep-ex] 1994-04-20 hep-ph/9404310 10.1103/PhysRevLett.72.3762 [{u'primaryclass': u'hep-ex', u'author': u'Amh...
16112 Supersymmetry at the DiTevatron We study the signals for supersymmetry at the ... [hep-ph, hep-ex] 1994-06-07 hep-ph/9406248 10.1103/PhysRevD.50.5676 [{u'doi': u'10.1103/PhysRevD.82.035009', u'pri...
16113 Detecting Tau Neutrino Oscillations at PeV Ene... It is suggested that a large deep underocean (... [hep-ph, astro-ph, hep-ex] 1994-08-15 hep-ph/9408296 10.1016/0927-6505(94)00043-3 [{u'slaccitation': u'%%CITATION = ARXIV:1412.1...

After scrolling through the list I cannot spot a particular pattern to the edits. It does seem like the list contains articles on interesting topics, though, like evidence for the top quark (index 15148), PeV $\tau$ neutrinos (index 16113) and an article about determining the weak mixing angle at SLD (index 15149).

Collecting citations

Unfortunately Inspire does not provide a real API, so we have to scrape their webpages to get what we want. The get_cites function will look up the citations of an article by its arxiv_id, fetching them in pages of 250. Having to make at least one HTTP request per article means this takes quite a while, so set it going and come back after a few hours. We process articles in chunks of 1000 so that we get some feedback and can resume if something goes wrong:

In [226]:
def get_cites(arxiv_id):
    cites = []
    base_url = "http://inspirehep.net/search?p=refersto:%s&of=hx&rg=250&jrec=%i"
    offset = 1
    
    while True:
        print base_url%(arxiv_id, offset)
        response = urllib2.urlopen(base_url%(arxiv_id, offset))
        xml = response.read()
        soup = BeautifulSoup(xml)

        refs = "\n".join(cite.get_text() for cite in soup.findAll("pre"))

        bib_database = bibtexparser.loads(refs)
        if bib_database.entries:
            cites += bib_database.entries
            offset += 250
            
        else:
            break

    return cites

step = 1000
for N in range(0,17):
    print N
    cites = df['id'][N*step:(N+1)*step].map(get_cites)
    df.ix[N*step:(N+1)*step -1,'cited_by'] = cites
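
The paging logic in get_cites (ask for 250 entries at a time, stop at the first empty page) can be checked without touching the network. This is a minimal Python 3 sketch, not the code used for this post: collect_pages and make_fake_fetch are hypothetical names, and the fake fetcher simply pretends an article has a fixed number of citations.

```python
def collect_pages(fetch_page, page_size=250):
    """Gather entries page by page until an empty page comes back.

    `fetch_page(offset)` must return the list of entries starting at the
    1-based `offset`, and an empty list once nothing is left.
    """
    entries = []
    offset = 1
    while True:
        page = fetch_page(offset)
        if not page:
            break
        entries.extend(page)
        offset += page_size
    return entries


def make_fake_fetch(total, page_size=250):
    """Stand-in for the HTTP request: pretends there are `total` citations."""
    def fetch_page(offset):
        last = min(offset + page_size - 1, total)
        return ["cite-%d" % i for i in range(offset, last + 1)]
    return fetch_page


cites = collect_pages(make_fake_fetch(600))
print(len(cites))
```

Note that this scheme always costs one extra request per article, the one that returns the empty page. For a few hundred articles that does not matter; for 16000 it adds up.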

Data preservation

After investing so much time to gather the raw data it is a good idea to store it locally so we do not have to scrape it all again later:

In [236]:
store = pd.HDFStore("/Users/thead/git/arxiv-experiments/hep-ex.h5")
#store['df'] = df
#df = store['df']
store.close()

Word counts

Let's get to answering some questions. What are the ten most used words in hep-ex abstracts?

In [77]:
word_bag = " ".join(df.abstract.apply(lambda t: t.lower()))

Counter(word_bag.split()).most_common(n=10)
Out[77]:
[('the', 161889),
 ('of', 79075),
 ('and', 54236),
 ('in', 41418),
 ('a', 38591),
 ('to', 36425),
 ('for', 25128),
 ('is', 23440),
 ('with', 23015),
 ('we', 22510)]

Not too enlightening: boring little words fill the entire top ten. These words are known as stopwords and the NLTK library provides a list of them. So let's remove them as well as some basic mathematical symbols:

In [113]:
from nltk.corpus import stopwords

stops = [word for word in stopwords.words('english')]
stops += ["=", "->"]
words = filter(lambda w: w not in stops,
               word_bag.split())
top_twenty = Counter(words).most_common(n=20)

bar_chart(top_twenty)

Experimental physics is all about data after all! A shame that model beats detector, but that is probably inevitable as there are many more theoretical models than experimental detectors. The Higgs boson beats the neutrino, but together they reign supreme over all the other particles.
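
If you want to play with the filtering step without installing NLTK, here is a minimal Python 3 sketch. The stopword set is hand-picked for this toy sentence (an assumption, it is not NLTK's list) and top_words is a hypothetical helper name. Using a set instead of a list for the stopwords makes each membership test cheap, which is noticeable on a word bag with millions of entries.

```python
from collections import Counter

# Hand-picked stand-in for NLTK's English stopword list, plus two symbols.
STOPS = {"the", "of", "and", "in", "a", "to", "for", "is", "with", "we",
         "=", "->"}


def top_words(text, stops, n=3):
    """Lower-case, split on whitespace, drop stopwords, count the rest."""
    words = (w for w in text.lower().split() if w not in stops)
    return Counter(words).most_common(n)


text = ("We present a measurement of the mass of the Higgs boson "
        "and the mass of the top quark")
print(top_words(text, STOPS, n=1))
```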

Towards the bottom we have measurements and measured. These should probably be counted as one entry, together with measurement, measuring, etc. The easiest way to achieve this is to stem the words before counting. Stemming is the process of reducing derived words to their stem, for example:

In [108]:
import nltk.stem as stem

porter = stem.PorterStemmer()
for w in ("measurement", "measurements", "measured", "measure"):
    print w, "->", porter.stem(w)
measurement -> measur
measurements -> measur
measured -> measur
measure -> measur

As in this case, the stem does not have to be a real word itself. By stemming words before counting how often they occur, the entries for measurements and measured get added together. Using the stem of every word we get the following ranking:

In [115]:
word_stems = map(lambda w: (porter.stem(w),w), words)
stem2words = defaultdict(set)
# use `stemmed` so the loop does not shadow the `stem` module imported above
for stemmed, word in word_stems:
    stem2words[stemmed].add(word)

top_twenty = Counter(w[0] for w in word_stems).most_common(n=20)

bar_chart(top_twenty)

# list all words which correspond to each top twenty stem
for stemmed, count in top_twenty:
    print stemmed, "<-", ", ".join(stem2words[stemmed])
measur <- measuring, measures, measurment, measurements, measure, measurable, measurably, measureable, measurability, measurement, measured
decay <- decayed, decays, decaying, decay
use <- use, used, useful, uses, usefulness, using
data <- data
mass <- masses, mass
model <- models, modeled, modelled, modelling, modeling, model
result <- resulted, resultant, resulting, results, result
energi <- energies, energy
product <- product, productive, productivity, productions, production, products
detector <- detector, detectors
neutrino <- neutrino, neutrinos
present <- presently, presented, presentations, presents, presentational, presenting, presentation, present
search <- searches, searchs, search, searching, searched
new <- new
studi <- studying, study, studied, studies
observ <- observational, observable, observation, observer, observes, observed, observe, observations, observables, observability, observing
standard <- standards, standardize, standardized, standard
higg <- higgs
cross <- crossing, crossings, crosses, crossed, cross
experi <- experiement, experiments, experiences, experiment, experience

Turns out experimental physics is all about measuring things. The stemming is not perfect, but good enough for now.
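
The bookkeeping behind that ranking (count each stem, and remember which words collapsed onto it) fits in one small function. This is a minimal Python 3 sketch under stated assumptions: count_stems is a hypothetical helper, and toy_stem is a deliberately crude suffix stripper that stands in for the Porter stemmer so the example runs without NLTK. It is not the Porter algorithm.

```python
from collections import Counter, defaultdict


def count_stems(words, stem_fn):
    """Count stems and record which original words map to each stem."""
    stem_counts = Counter()
    stem_to_words = defaultdict(set)
    for word in words:
        stemmed = stem_fn(word)
        stem_counts[stemmed] += 1
        stem_to_words[stemmed].add(word)
    return stem_counts, stem_to_words


def toy_stem(word):
    """Crude stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ements", "ement", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


words = ["measurement", "measurements", "measured", "measure",
         "data", "decays", "decay"]
stem_counts, stem_to_words = count_stems(words, toy_stem)
print(stem_counts.most_common(2))
```

With NLTK installed, passing porter.stem as stem_fn instead of toy_stem should give the real thing.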

Cite me, cite me, no cite me!

In science, citations are the currency used to measure the success of a paper. What does the distribution of citations look like then?

A simple question to ask is: how often are articles cited? As articles have to be read and understood before they can be cited we only look at articles created before the beginning of 2014.

In [235]:
before_2014 = datetime.datetime(2014,1,1)
plt.hist(df[df.created<before_2014].cited_by.map(len),
         bins=200, normed=True, range=(0,200))
plt.xlabel("Number of citations")
plt.ylabel("Fraction")
Out[235]:
<matplotlib.text.Text at 0x13a304c50>

This plot shows the fraction of articles cited zero, one, two, three, ... times. The single most likely number of citations for an article in hep-ex is zero! A whopping 13% of articles have never been cited, and nearly a third of articles are cited fewer than four times.

In [233]:
df['citation_count'] = df.cited_by.map(len)
df[df.created<before_2014]['citation_count'].describe()
Out[233]:
count    14059.000000
mean        32.630486
std        100.773436
min          0.000000
25%          2.000000
50%         10.000000
75%         31.000000
max       4138.000000
Name: citation_count, dtype: float64

The average number of citations is about 33. The average is misleading for a steeply falling distribution like this: after all, the median (the 50th percentile) is only 10 citations!
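
To see how a handful of heavily cited papers drags the mean away from the median, here is a minimal Python 3 sketch using only the standard library. The citation counts are made up for illustration (they are not taken from the dataset) and summarise is a hypothetical helper name.

```python
import statistics


def summarise(counts):
    """Mean, median and the share of articles with few or no citations."""
    n = len(counts)
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "uncited": sum(1 for c in counts if c == 0) / n,
        "fewer_than_four": sum(1 for c in counts if c < 4) / n,
    }


# Made-up counts: mostly small numbers plus one blockbuster paper.
toy_counts = [0, 0, 1, 2, 3, 5, 8, 10, 12, 30, 45, 400]
summary = summarise(toy_counts)
print(summary["mean"], summary["median"])
```

One paper with 400 citations is enough to pull the mean far above what a typical article in the list receives, while the median barely notices it.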

Citations top ten

The prize for most cited paper with a whopping 4138 citations goes to:

In [234]:
# idxmax returns an index label, so look the row up with .loc
df.loc[df.citation_count.idxmax()]
Out[234]:
title             New Generation of Parton Distributions with Un...
abstract          A new generation of parton distribution functi...
categories                                         [hep-ph, hep-ex]
created                                         2002-01-21 00:00:00
id                                                   hep-ph/0201195
doi                                   10.1088/1126-6708/2002/07/012
cited_by          [{u'slaccitation': u'%%CITATION = ARXIV:1412.4...
citation_count                                                 4138
Name: 15779, dtype: object

Read the paper yourself: New Generation of Parton Distributions with Uncertainties from Global QCD Analysis and see if you agree.

We can also easily compute the top ten papers. This is an interesting mix of articles. Numbers two and three are the papers by the ATLAS and CMS experiments reporting the discovery of the Higgs boson. While most of the papers in the top ten are older, these two were only published in 2012 and have already overtaken the top quark discovery, which was published in 1995! Curious fact: the ATLAS paper has ever so slightly more citations than the CMS one.

In [241]:
df.sort('citation_count', ascending=False).head(10)
Out[241]:
title abstract categories created id doi cited_by citation_count
15779 New Generation of Parton Distributions with Un... A new generation of parton distribution functi... [hep-ph, hep-ex] 2002-01-21 hep-ph/0201195 10.1088/1126-6708/2002/07/012 [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... 4138
7371 Observation of a new particle in the search fo... A search for the Standard Model Higgs boson in... [hep-ex] 2012-07-31 1207.7214 10.1016/j.physletb.2012.08.020 [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... 3653
7372 Observation of a new boson at a mass of 125 Ge... Results are presented from searches for the st... [hep-ex] 2012-07-31 1207.7235 10.1016/j.physletb.2012.08.021 [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... 3592
15181 Observation of Top Quark Production in Pbar-P ... We establish the existence of the top quark us... [hep-ex, hep-ph] 1995-03-02 hep-ex/9503002 10.1103/PhysRevLett.74.2626 [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... 2519
16241 Direct Evidence for Neutrino Flavor Transforma... Observations of neutral current neutrino inter... [nucl-ex, hep-ex] 2002-04-21 nucl-ex/0204008 10.1103/PhysRevLett.89.011301 [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... 2425
15690 HERWIG 6.5: an event generator for Hadron Emis... HERWIG is a general-purpose Monte Carlo event ... [hep-ph, hep-ex] 2000-11-29 hep-ph/0011363 10.1088/1126-6708/2001/01/010 [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... 2324
14026 First Results from KamLAND: Evidence for React... KamLAND has been used to measure the flux of $... [hep-ex, nucl-ex] 2002-12-09 hep-ex/0212021 10.1103/PhysRevLett.90.021802 [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... 2235
16152 A Supersymmetry Primer I provide a pedagogical introduction to supers... [hep-ph, hep-ex, hep-th] 1997-09-15 hep-ph/9709356 None [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... 2160
16235 Measurement of the rate of nu_e + d --> p + p ... Solar neutrinos from the decay of $^8$B have b... [nucl-ex, hep-ex] 2001-06-18 nucl-ex/0106015 10.1103/PhysRevLett.87.071301 [{u'slaccitation': u'%%CITATION = ARXIV:1412.4... 2030
13678 The BABAR Detector BABAR, the detector for the SLAC PEP-II asymme... [hep-ex] 2001-05-16 hep-ex/0105044 10.1016/S0168-9002(01)02012-5 [{u'title': u'Measurement of the Partial Branc... 1828

There are many more interesting things to be done with analysing the words used in abstracts as well as analysing who cites whom. This will be covered in the second part of this post as this one is already fairly lengthy.

Just one more thing: the top cited paper of 2014 is "First combination of Tevatron and LHC measurements of the top-quark mass", celebrating collaboration across the globe.

This post started life as an IPython notebook, download it or view it online.