This post is about what you can learn about scientific articles posted on the arXiv by using Natural Language Processing (NLP). Said differently: I had some questions about papers posted on the arXiv and used it as an excuse to teach myself the basics of NLP. We also look at citation counts and reveal the top cited paper of 2014!

The arXiv makes its data available via a simple API which allows you to download almost everything about an article short of its full text. For each article we can also look up who has been citing it on Inspire. Combined, this is a powerful dataset that can answer some interesting questions: what are the most used words, can we auto-generate abstracts, can we summarise abstracts, and which article is the most cited of 2014?

Let's get going!

First some standard imports that we will need later. Some of them you might need to install but nothing too obscure:

In [175]:

import time
import urllib2
import datetime
from itertools import ifilter
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET

from bs4 import BeautifulSoup
import matplotlib.pylab as plt
import pandas as pd
import numpy as np
import bibtexparser

pd.set_option('mode.chained_assignment','warn')

In [2]:

%matplotlib inline

The harvest function will query the arXiv API for all articles modified between January 1st, 2010 and the end of 2014. This is a subtlety worth noting: with this query you will also get articles created before 2010 if their entry was modified in 2010 or later.
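
To make the difference between "created" and "modified" concrete, here is a minimal sketch using only the standard library. The two IDs and creation dates appear in the tables further down; the `modified` dates are invented purely for illustration, and the `records` structure is an assumption of this sketch, not what the API returns.

```python
import datetime

# Illustrative records. 'created' is the submission date; the
# 'modified' dates are made up for this example.
records = [
    {"id": "hep-ex/9405005", "created": datetime.date(1994, 5, 16),
     "modified": datetime.date(2011, 3, 2)},
    {"id": "1207.7214", "created": datetime.date(2012, 7, 31),
     "modified": datetime.date(2012, 8, 31)},
]

start = datetime.date(2010, 1, 1)
end = datetime.date(2014, 12, 31)

# Both records match a query on the modification date ...
modified_in_window = [r["id"] for r in records
                      if start <= r["modified"] <= end]
# ... but only one of them was actually created in that period.
created_in_window = [r["id"] for r in records
                     if start <= r["created"] <= end]

print(modified_in_window)
print(created_in_window)
```

Filtering on the creation date afterwards is how the 1994 paper gets separated from the genuinely new one.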

The arXiv itself covers so many topics that it is organised into separate arxivs (I know, an unfortunate double use of the name arxiv), one for each topic. By default harvest will collect articles from the physics:hep-ex arxiv. This is because I am an experimental particle physicist. If you are into theory try physics:hep-th, or stats if you are a stats guru. The API gives you a full list of all sets of topics to explore.

If you do not care about the technicalities of scraping the data, skip right ahead to the first factoid.

Harvesting articles

Most of harvest is pretty straightforward. The API returns a big XML document containing information about at most 1000 articles, which we can parse with ElementTree and store. If there are more than 1000 articles for a particular query we can get the rest using the resumptionToken in the XML. API access can be throttled, so on occasion the arXiv will reply with a 503 error asking us to retry later. The information we harvest is stored in a pandas dataframe.

In [4]:

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV = "{http://arxiv.org/OAI/arXiv/}"

def harvest(arxiv="physics:hep-ex"):
    df = pd.DataFrame(columns=("title", "abstract", "categories", "created", "id", "doi"))
    base_url = "http://export.arxiv.org/oai2?verb=ListRecords&"
    url = (base_url +
           "from=2010-01-01&until=2014-12-31&" +
           "metadataPrefix=arXiv&set=%s"%arxiv)
    
    while True:
        print "fetching", url
        try:
            response = urllib2.urlopen(url)
            
        except urllib2.HTTPError, e:
            if e.code == 503:
                to = int(e.hdrs.get("retry-after", 30))
                print "Got 503. Retrying after {0:d} seconds.".format(to)

                time.sleep(to)
                continue
                
            else:
                raise
            
        xml = response.read()

        root = ET.fromstring(xml)

        for record in root.find(OAI+'ListRecords').findall(OAI+"record"):
            arxiv_id = record.find(OAI+'header').find(OAI+'identifier')
            meta = record.find(OAI+'metadata')
            info = meta.find(ARXIV+"arXiv")
            created = info.find(ARXIV+"created").text
            created = datetime.datetime.strptime(created, "%Y-%m-%d")
            categories = info.find(ARXIV+"categories").text

            # if there is more than one DOI use the first one
            # often the second one (if it exists at all) refers
            # to an erratum or similar
            doi = info.find(ARXIV+"doi")
            if doi is not None:
                doi = doi.text.split()[0]
                
            contents = {'title': info.find(ARXIV+"title").text,
                        'id': info.find(ARXIV+"id").text,#arxiv_id.text[4:],
                        'abstract': info.find(ARXIV+"abstract").text.strip(),
                        'created': created,
                        'categories': categories.split(),
                        'doi': doi,
                        }

            df = df.append(contents, ignore_index=True)

        # The list of articles returned by the API comes in chunks of
        # 1000 articles. The presence of a resumptionToken tells us that
        # there is more to be fetched.
        token = root.find(OAI+'ListRecords').find(OAI+"resumptionToken")
        if token is None or token.text is None:
            break

        else:
            url = base_url + "resumptionToken=%s"%(token.text)
            
    return df
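
If you want to try the parsing step without hitting the network, the sketch below runs the same ElementTree lookups on a small hand-written XML sample. The sample is trimmed to a few fields and its resumption token is made up; `parse_page` is a hypothetical helper for this illustration and not part of harvest. It is written for Python 3.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV = "{http://arxiv.org/OAI/arXiv/}"

# Hand-written sample, trimmed to the fields used here. A real
# response carries more elements per record, and the token value
# below is invented.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:arXiv.org:0704.0020</identifier></header>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <id>0704.0020</id>
          <created>2007-03-31</created>
          <categories>hep-ex</categories>
        </arXiv>
      </metadata>
    </record>
    <resumptionToken>made-up-token|1001</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""


def parse_page(xml_text):
    """Return (list of record dicts, resumption token text or None)."""
    root = ET.fromstring(xml_text.strip())
    list_records = root.find(OAI + "ListRecords")
    records = []
    for record in list_records.findall(OAI + "record"):
        info = record.find(OAI + "metadata").find(ARXIV + "arXiv")
        records.append({
            "id": info.find(ARXIV + "id").text,
            "created": info.find(ARXIV + "created").text,
            "categories": info.find(ARXIV + "categories").text.split(),
        })
    # A missing or empty token means this was the last page.
    token = list_records.find(OAI + "resumptionToken")
    token_text = token.text if token is not None else None
    return records, token_text


records, token = parse_page(SAMPLE)
print(records)
print(token)
```

The namespace prefixes in curly braces are what make `find` work here: every element in the response lives in either the OAI or the arXiv namespace, so a bare tag name would not match.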

Set harvest running and go chat with someone for a few minutes while it gathers the information about your articles.

In [ ]:

df = harvest()

What does all that stuff we just downloaded look like? Here are the first five entries in the dataframe:

In [73]:

df.head()

Out[73]:

	title	abstract	categories	created	id	doi	cited_by
0	Measurement of the Hadronic Form Factor in D0 ...	The shape of the hadronic form factor f+(q2) i...	[hep-ex]	2007-03-31	0704.0020	doi:10.1103/PhysRevD.76.052005	[{u'slaccitation': u'%%CITATION = ARXIV:1411.3...
1	Measurement of B(D_S^+ --> ell^+ nu) and the D...	We examine e+e- --> Ds- Ds+ and Ds- Ds+ inte...	[hep-ex, hep-lat, hep-ph]	2007-04-03	0704.0437	10.1103/PhysRevD.76.072002	[{u'author': u'Yelton, John M.', u'journal': u...
2	A unified analysis of the reactor neutrino pro...	We present in this article a detailed quantita...	[hep-ex]	2007-04-04	0704.0498	10.1088/1742-6596/110/8/082013	[{u'author': u'Queval, Rachel', u'title': u'Ch...
3	Measurement of Decay Amplitudes of B -->(c cba...	We perform the first three-dimensional measure...	[hep-ex]	2007-04-04	0704.0522	10.1103/PhysRevD.76.031102	[{u'author': u'Giurgiu, Gavril', u'journal': u...
4	Measurement of the Decay Constant $f_D{_S^+}$ ...	We measure the decay constant fDs using the Ds...	[hep-ex, hep-lat, hep-ph]	2007-04-04	0704.0629	10.1103/PhysRevLett.99.071802	[{u'author': u'Jackson, Graham', u'type': u'ar...

First factoid

In [71]:

def bar_chart(items):
    """Make a bar chart showing the count associated with each key
    
    `items` is a list of (key, count) pairs.
    """
    width = 0.5
    ind = np.arange(len(items))
    fig, ax = plt.subplots(figsize=(8,8))
    rects1 = ax.bar(ind, zip(*items)[1], width, color='r')
    ax.set_xticks(ind+width)
    ax.set_xticklabels(zip(*items)[0])
    fig.autofmt_xdate()
    plt.show()

edits_per_year = Counter(df.created.map(lambda x: x.year))
# sort by year so the bars appear in chronological order
bar_chart(sorted(edits_per_year.items()))
new_articles = sum(edits_per_year[year] for year in (2010,2011,2012,2013,2014))
print "Unique arXiv IDs edited between 2010 and 2014:", len(df.id.unique())
print "of which %i entries were created in that time period."%(new_articles)

Unique arXiv IDs edited between 2010 and 2014: 16321
of which 11958 entries were created in that time period.

Here is our first factoid about the arXiv: about 16000 articles in hep-ex were edited between the beginning of 2010 and the end of 2014, of which about 12000 were newly created in that period. The other 4400 or so were created before 2010 and updated later. Amazing to see that papers created in 1994 were still being edited more than fifteen years later!

Let's take a look at those, maybe there is something interesting:

In [72]:

df[df.created<datetime.date(1995,1,1)]

Out[72]:

	title	abstract	categories	created	id	doi	cited_by
13366	DUMAND and AMANDA: High Energy Neutrino Astrop...	The field of high energy neutrino astrophysics...	[astro-ph, hep-ex]	1994-12-06	astro-ph/9412019	None	[{u'author': u'Al Samarai, Imen', u'type': u'a...
13375	Detection of nuclear recoils in prototype dark...	This work is part of an ongoing project to dev...	[cond-mat, hep-ex]	1994-11-17	cond-mat/9411072	10.1016/0168-9002(95)00036-4	[{u'doi': u'10.1016/j.astropartphys.2004.06.00...
15144	Precise Measurement of the Left-Right Cross Se...	We present a precise measurement of the left-r...	[hep-ex, hep-ph]	1994-04-27	hep-ex/9404001	10.1103/PhysRevLett.73.25	[{u'doi': u'10.1088/1742-6596/335/1/012078', u...
15145	An optimal method of moments to measure the ch...	Parity violation at LEP or SLC can be measured...	[hep-ex]	1994-05-11	hep-ex/9405002	10.1016/0168-9002(94)90847-8	[]
15146	Observation of Anisotropic Event Shapes and Tr...	Event shapes for Au + Au collisions at 11.4 Ge...	[hep-ex]	1994-05-13	hep-ex/9405003	10.1103/PhysRevLett.73.2532	[{u'primaryclass': u'nucl-th', u'author': u'Wa...
15147	Measurement of the Charged Multiplicity of $Z ...	Using an impact parameter tag to select an enr...	[hep-ex, hep-ph]	1994-05-13	hep-ex/9405004	10.1103/PhysRevLett.72.3145	[{u'doi': u'10.1140/epjc/s2005-02424-5', u'pri...
15148	Evidence for Top Quark Production in $\bar{p}p...	We summarize a search for the top quark with t...	[hep-ex, hep-ph]	1994-05-16	hep-ex/9405005	10.1103/PhysRevLett.73.225	[{u'primaryclass': u'hep-ex', u'author': u'Ger...
15149	Precise Determination of the Weak Mixing Angle...	In the 1993 SLC/SLD run, the SLD recorded 50,0...	[hep-ex]	1994-05-20	hep-ex/9405011	None	[{u'doi': u'10.1016/0370-1573(95)00072-0', u'a...
15150	A Neural Network for Locating the Primary Vert...	Using simulated collider data for $p+p\rightar...	[hep-ex]	1994-06-21	hep-ex/9406003	10.1016/0168-9002(94)01133-8	[]
15151	Semileptonic Branching Fraction of Charged and...	An examination of leptons in ${\Upsilon (4S)}$...	[hep-ex]	1994-06-23	hep-ex/9406004	10.1103/PhysRevLett.73.3503	[{u'author': u'Ivarsson, Jenny', u'title': u'P...
15152	Measurement of the B -> D^* l nu Branching Fra...	We study the exclusive semileptonic B meson de...	[hep-ex]	1994-06-24	hep-ex/9406005	10.1103/PhysRevD.51.1014	[{u'author': u'Borean, Cristiano', u'type': u'...
15153	Spin Asymmetry in Muon--Proton Deep Inelastic ...	We measured the spin asymmetry in the scatteri...	[hep-ex]	1994-08-06	hep-ex/9408001	10.1016/0370-2693(94)00968-6	[{u'primaryclass': u'nucl-ex', u'author': u'Pa...
15154	Measurement of the polarization of Lambda0 Ant...	The polarization of Lambda0, AntiLambda0, Sigm...	[hep-ex]	1994-09-16	hep-ex/9409001	10.1007/BF01291194	[{u'doi': u'10.1088/1742-6596/509/1/012056', u...
15155	Search for slowly moving magnetic monopoles	We report a search for slowly moving magnetic ...	[hep-ex]	1994-10-05	hep-ex/9410006	10.1016/0920-5632(94)90257-7	[]
15156	Polarized Bhabha Scattering and a Precision Me...	We present the first measurement of the left-r...	[hep-ex]	1994-10-10	hep-ex/9410009	10.1103/PhysRevLett.74.2880	[{u'author': u'Quast, Gunther', u'title': u'Me...
15157	A Measurement of the $D^{*\pm}$ Cross Section ...	We have measured the inclusive $D^{*\pm}$ prod...	[hep-ex]	1994-11-29	hep-ex/9411002	10.1103/PhysRevD.50.1879	[{u'author': u'Ngac, An Bang', u'title': u'Mea...
15158	Measurement of the $D^{*\pm}$ Cross Section us...	The differential cross section of $d\sigma(e^+...	[hep-ex]	1994-12-01	hep-ex/9412001	10.1016/0370-2693(94)91515-6	[{u'author': u'Ngac, An Bang', u'title': u'Mea...
15159	$K^0(\bar{K^0})$ Production in Two-Photon Proc...	We have carried out an inclusive measurement o...	[hep-ex]	1994-12-05	hep-ex/9412003	10.1016/0370-2693(94)90315-8	[{u'doi': u'10.1016/S0370-2693(02)01769-0', u'...
15160	New Tagging Method of B Flavor of Neutral B Me...	In CP violation measurements in asymmetric B-f...	[hep-ex]	1994-12-08	hep-ex/9412005	10.1143/JPSJ.63.3542	[{u'author': u'Foland, Andrew Dean', u'title':...
15161	Kinematic Evidence for Top Quark Pair Producti...	We present a study of $W+$multijet events that...	[hep-ex]	1994-12-13	hep-ex/9412009	10.1103/PhysRevD.51.4623	[{u'author': u'Hinchliffe, Ian and Paige, FE a...
15162	Feasibility Study of Single-Photon Counting Us...	The fine-mesh phototube is one type of photode...	[hep-ex]	1994-12-13	hep-ex/9412010	10.1016/0168-9002(93)90749-8	[{u'doi': u'10.1140/epjc/s10052-014-3026-9', u...
15163	Measurement of inclusive electron cross sectio...	We have studied open charm production in $\gam...	[hep-ex]	1994-12-16	hep-ex/9412011	10.1016/0370-2693(94)01349-7	[{u'doi': u'10.1016/S0370-2693(02)01769-0', u'...
15164	Measurement of the forward-backward asymmetrie...	We have measured, with electron tagging, the f...	[hep-ex]	1994-12-18	hep-ex/9412012	10.1016/0370-2693(94)91310-2	[{u'doi': u'10.1103/PhysRevD.65.053002', u'pri...
15165	J/psi,psi(2S) to mu+ mu- and B to J/psi,psi(2S...	This paper presents a measurement of J/psi,psi...	[hep-ex]	1994-12-23	hep-ex/9412013	None	[{u'slaccitation': u'%%CITATION = ARXIV:1411.3...
15166	Measurement of inclusive particle spectra and ...	Inclusive momentum spectra are measured for al...	[hep-ex]	1994-12-27	hep-ex/9412015	10.1016/0370-2693(94)01685-6	[{u'slaccitation': u'%%CITATION = ARXIV:1412.2...
15167	Measurement of the Bs Meson Lifetime	The lifetime of the $B_s$ meson is measured us...	[hep-ex]	1994-12-27	hep-ex/9412017	10.1103/PhysRevLett.74.4988	[{u'doi': u'10.1007/s00601-014-0871-x', u'prim...
16101	Exclusive Hadronic B Decays to Charm and Charm...	We have fully reconstructed decays of both B0 ...	[hep-ph, hep-ex]	1994-03-15	hep-ph/9403295	10.1103/PhysRevD.50.43	[{u'author': u'Sabelli, Chiara', u'title': u'T...
16102	Observation of a New Charmed Strange Meson	Using the CLEO-II detector, we have obtained e...	[hep-ph, hep-ex]	1994-03-21	hep-ph/9403325	10.1103/PhysRevLett.72.1972	[{u'slaccitation': u'%%CITATION = ARXIV:1410.5...
16103	Study of the Decay $\Lambda_c \to \Lambda l^+ ...	Using the CLEO II detector at CESR, we observe...	[hep-ph, hep-ex]	1994-03-21	hep-ph/9403326	10.1016/0370-2693(94)90295-X	[{u'doi': u'10.1140/epjc/s10052-014-3194-7', u...
16104	Precision Measurement of the $D_s^{*+}- D_s^+$...	We have measured the vector-pseudoscalar mass ...	[hep-ph, hep-ex]	1994-03-21	hep-ph/9403327	10.1103/PhysRevD.50.1884	[{u'doi': u'10.1007/JHEP06(2013)065', u'primar...
16105	A Measurement of ${\cal B}(D_s \to \phi l^+ \n...	Using the CLEO~II detector at CESR, we have me...	[hep-ph, hep-ex]	1994-03-21	hep-ph/9403328	10.1016/0370-2693(94)90416-2	[{u'doi': u'10.1103/RevModPhys.84.65', u'prima...
16106	Measurement of Cabibbo Suppressed Decays of th...	Branching ratios for the dominant Cabibbo-supp...	[hep-ph, hep-ex]	1994-03-21	hep-ph/9403329	10.1103/PhysRevLett.73.1079	[{u'doi': u'10.1103/PhysRevD.87.073016', u'pri...
16107	Production and Decay of D_1(2420)^0 and D_2^*(...	We have investigated $D^{+}\pi^{-}$ and $D^{*+...	[hep-ph, hep-ex]	1994-03-24	hep-ph/9403359	10.1016/0370-2693(94)90968-7	[{u'slaccitation': u'%%CITATION = ARXIV:1410.5...
16108	Two-Photon Production of Charged Pion and Kaon...	A measurement of the cross section for the com...	[hep-ph, hep-ex]	1994-03-28	hep-ph/9403379	10.1103/PhysRevD.50.3027	[{u'slaccitation': u'%%CITATION = ARXIV:1307.0...
16109	Measurement of the Branching Fraction for D^+ ...	Using the CLEO-II detector at CESR we have mea...	[hep-ph, hep-ex]	1994-03-28	hep-ph/9403382	10.1103/PhysRevLett.72.2328	[{u'doi': u'10.1103/RevModPhys.84.65', u'prima...
16110	Measurement of the Spin-Dependent Structure Fu...	We have measured the spin-dependent structure ...	[hep-ph, hep-ex]	1994-04-15	hep-ph/9404270	10.1016/0370-2693(94)90793-5	[{u'doi': u'10.1103/PhysRevD.90.012009', u'pri...
16111	A Measurement of the Branching Fraction ${\cal...	Using data from the CLEO II detector at CESR, ...	[hep-ph, hep-ex]	1994-04-20	hep-ph/9404310	10.1103/PhysRevLett.72.3762	[{u'primaryclass': u'hep-ex', u'author': u'Amh...
16112	Supersymmetry at the DiTevatron	We study the signals for supersymmetry at the ...	[hep-ph, hep-ex]	1994-06-07	hep-ph/9406248	10.1103/PhysRevD.50.5676	[{u'doi': u'10.1103/PhysRevD.82.035009', u'pri...
16113	Detecting Tau Neutrino Oscillations at PeV Ene...	It is suggested that a large deep underocean (...	[hep-ph, astro-ph, hep-ex]	1994-08-15	hep-ph/9408296	10.1016/0927-6505(94)00043-3	[{u'slaccitation': u'%%CITATION = ARXIV:1412.1...

After scrolling through the list I cannot spot a particular pattern to the edits. It does seem, though, that the list contains articles on interesting topics like evidence for the top quark (index 15148), PeV $\tau$ neutrinos (index 16113) and an article about determining the weak mixing angle at SLD (index 15149).

Collecting citations

Unfortunately Inspire does not provide a real API, so we have to scrape their webpages to get what we want. The get_cites function will look up the citations of an article by its arxiv_id. Having to make at least one HTTP request per article means this takes quite a while. So set it going and come back after a few hours. We process articles in chunks of 1000, which gives us some feedback on progress and lets us resume if something goes wrong:

In [226]:

def get_cites(arxiv_id):
    cites = []
    base_url = "http://inspirehep.net/search?p=refersto:%s&of=hx&rg=250&jrec=%i"
    offset = 1
    
    while True:
        print base_url%(arxiv_id, offset)
        response = urllib2.urlopen(base_url%(arxiv_id, offset))
        xml = response.read()
        soup = BeautifulSoup(xml)

        refs = "\n".join(cite.get_text() for cite in soup.findAll("pre"))

        bib_database = bibtexparser.loads(refs)
        if bib_database.entries:
            cites += bib_database.entries
            offset += 250
            
        else:
            break

    return cites

step = 1000
for N in range(0,17):
    print N
    cites = df['id'][N*step:(N+1)*step].map(get_cites)
    df.ix[N*step:(N+1)*step -1,'cited_by'] = cites
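
The paging logic inside get_cites can be tried in isolation. The sketch below replaces the HTTP request with a fake function serving 600 made-up records; `collect_all`, `fake_fetch` and `FAKE_CITES` are hypothetical names used only for this illustration.

```python
def collect_all(fetch_page, page_size=250):
    """Call fetch_page(offset) until it returns an empty page.

    Offsets start at 1, matching the `jrec` parameter used above.
    """
    results = []
    offset = 1
    while True:
        page = fetch_page(offset)
        if not page:
            break
        results.extend(page)
        offset += page_size
    return results


# Stand-in for the HTTP request: 600 fake citation records.
FAKE_CITES = ["cite-%d" % i for i in range(600)]


def fake_fetch(offset, page_size=250):
    start = offset - 1  # jrec is 1-based, list indices are 0-based
    return FAKE_CITES[start:start + page_size]


all_cites = collect_all(fake_fetch)
print(len(all_cites))
```

Note that a loop of this shape always ends with one request that comes back empty. Stopping as soon as a page is shorter than the page size would probably save that request, at the cost of trusting the server to always fill its pages.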

Data preservation

After investing so much time to gather the raw data it is a good idea to store it locally so we do not have to scrape it all again later:

In [236]:

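store = pd.HDFStore("/Users/thead/git/arxiv-experiments/hep-ex.h5")
# Uncomment one of the following lines to save or reload the dataframe:
#store['df'] = df   # write the dataframe to disk
#df = store['df']   # read it back in a later session
store.close()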

Word counts

Let's get to answering some questions. What are the ten most used words in hep-ex abstracts?

In [77]:

word_bag = " ".join(df.abstract.apply(lambda t: t.lower()))

Counter(word_bag.split()).most_common(n=10)

Out[77]:

[('the', 161889),
 ('of', 79075),
 ('and', 54236),
 ('in', 41418),
 ('a', 38591),
 ('to', 36425),
 ('for', 25128),
 ('is', 23440),
 ('with', 23015),
 ('we', 22510)]

Not too enlightening: boring little words fill the entire top ten. These words are known as stopwords, and the NLTK library provides a list of them for English. So let's remove them, as well as some basic mathematical symbols:

In [113]:

from nltk.corpus import stopwords

stops = [word for word in stopwords.words('english')]
stops += ["=", "->"]
words = filter(lambda w: w not in stops,
               word_bag.split())
top_twenty = Counter(words).most_common(n=20)

bar_chart(top_twenty)
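
To see the effect of stopword removal on a small scale, here is a minimal sketch on a single made-up sentence. The `STOPS` set below is a tiny hand-picked stand-in, not NLTK's list, which is considerably longer.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption of this sketch);
# the two symbols mirror the ones removed above.
STOPS = set(["the", "of", "and", "in", "a", "to", "for", "is", "with",
             "we", "=", "->"])

text = "We present a measurement of the mass of the Higgs boson in the data"
tokens = text.lower().split()

# Keep only the tokens that are not stopwords.
content_words = [w for w in tokens if w not in STOPS]

print(Counter(tokens).most_common(1))
print(content_words)
```

A set is used here because membership tests on a set do not slow down as the list of stopwords grows, which starts to matter when filtering a few million tokens.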

Experimental physics is all about data after all! A shame that model beats detector, but that is probably inevitable as there are many more theoretical models than experimental detectors. The Higgs boson beats the neutrino, but both reign supreme over all the other particles.

Towards the bottom we have measurements and measured. These should probably be counted as one entry, together with measurement, measuring, etc. The easiest way to achieve this is to stem the words before counting. Stemming is the process of reducing derived words to their stem, for example:

In [108]:

import nltk.stem as stem

porter = stem.PorterStemmer()
for w in ("measurement", "measurements", "measured", "measure"):
    print w, "->", porter.stem(w)

measurement -> measur
measurements -> measur
measured -> measur
measure -> measur

As in this case, the stem does not have to be a real word itself. By stemming words before counting how often they occur, the entries for measurements and measured get added together. Using the stem of every word we get the following ranking:

In [115]:

word_stems = map(lambda w: (porter.stem(w),w), words)
stem2words = defaultdict(set)
# use `s` rather than `stem` so the imported nltk.stem module is not shadowed
for s, word in word_stems:
    stem2words[s].add(word)

top_twenty = Counter(w[0] for w in word_stems).most_common(n=20)

bar_chart(top_twenty)

# list all words which correspond to each top twenty stem
for s, count in top_twenty:
    print s, "<-", ", ".join(stem2words[s])

measur <- measuring, measures, measurment, measurements, measure, measurable, measurably, measureable, measurability, measurement, measured
decay <- decayed, decays, decaying, decay
use <- use, used, useful, uses, usefulness, using
data <- data
mass <- masses, mass
model <- models, modeled, modelled, modelling, modeling, model
result <- resulted, resultant, resulting, results, result
energi <- energies, energy
product <- product, productive, productivity, productions, production, products
detector <- detector, detectors
neutrino <- neutrino, neutrinos
present <- presently, presented, presentations, presents, presentational, presenting, presentation, present
search <- searches, searchs, search, searching, searched
new <- new
studi <- studying, study, studied, studies
observ <- observational, observable, observation, observer, observes, observed, observe, observations, observables, observability, observing
standard <- standards, standardize, standardized, standard
higg <- higgs
cross <- crossing, crossings, crosses, crossed, cross
experi <- experiement, experiments, experiences, experiment, experience

Turns out experimental physics is all about measuring things. The stemming is not perfect, but good enough for now.

Cite me, cite me, no cite me!

In science, citations are the currency used to measure the success of a paper. What does the distribution of citations look like then?

A simple question to ask is: how often are articles cited? As articles have to be read and understood before they can be cited, we only look at articles created before the beginning of 2014.

In [235]:

before_2014 = datetime.datetime(2014,1,1)
plt.hist(df[df.created<before_2014].cited_by.map(len),
         bins=200, normed=True, range=(0,200))
plt.xlabel("Number of citations")
plt.ylabel("Fraction")

Out[235]:

<matplotlib.text.Text at 0x13a304c50>

This plot shows the fraction of articles cited zero, one, two, three, ... times. The single most likely number of citations for an article on hep-ex is zero! A whopping 13% of articles never get cited and nearly a third of articles are cited fewer than four times.
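
Fractions like these can be computed directly from the list of citation counts. The sketch below uses ten made-up counts, not the real data, so the resulting numbers only illustrate the arithmetic.

```python
from collections import Counter

# Made-up citation counts for ten articles (not the real data).
counts = [0, 0, 1, 3, 5, 8, 10, 12, 40, 150]

histogram = Counter(counts)
n = float(len(counts))

# Fraction of articles that were never cited.
fraction_uncited = histogram[0] / n
# Fraction of articles cited fewer than four times.
fraction_below_four = sum(c for k, c in histogram.items() if k < 4) / n

print(fraction_uncited)
print(fraction_below_four)
```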

In [233]:

df['citation_count'] = df.cited_by.map(len)
df[df.created<before_2014]['citation_count'].describe()

Out[233]:

count    14059.000000
mean        32.630486
std        100.773436
min          0.000000
25%          2.000000
50%         10.000000
75%         31.000000
max       4138.000000
Name: citation_count, dtype: float64

The average number of citations is about 33. The average is misleading for a steeply falling distribution like this; after all, we reach the 50th percentile (the median) at only 10 citations!
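
A toy example shows how a single heavily cited paper drags the mean away from the median. The counts below are made up, and the `statistics` module assumes Python 3.4 or newer.

```python
import statistics

# Made-up citation counts: nine modest papers and one blockbuster.
counts = [0, 0, 1, 2, 3, 5, 8, 10, 20, 4000]

mean = statistics.mean(counts)
median = statistics.median(counts)

print(mean)
print(median)
```

Here the mean lands far above what any of the nine ordinary papers achieved, while the median stays close to the typical paper.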

Citations top ten

The prize for most cited paper with a whopping 4138 citations goes to:

In [234]:

# idxmax returns an index label, so look the row up by label
df.loc[df.citation_count.idxmax()]

Out[234]:

title             New Generation of Parton Distributions with Un...
abstract          A new generation of parton distribution functi...
categories                                         [hep-ph, hep-ex]
created                                         2002-01-21 00:00:00
id                                                   hep-ph/0201195
doi                                   10.1088/1126-6708/2002/07/012
cited_by          [{u'slaccitation': u'%%CITATION = ARXIV:1412.4...
citation_count                                                 4138
Name: 15779, dtype: object

Read the paper yourself: New Generation of Parton Distributions with Uncertainties from Global QCD Analysis and see if you agree.

We can also easily compute the top ten papers. This is an interesting mix of articles. Numbers two and three are the papers by the ATLAS and CMS experiments reporting the discovery of the Higgs boson. While most of the papers in the top ten are older, these two were only published in 2012 and have already overtaken the top quark discovery, which was published in 1995! Curious fact: the ATLAS paper has ever so slightly more citations than the CMS one.

In [241]:

df.sort('citation_count', ascending=False).head(10)

Out[241]:

	title	abstract	categories	created	id	doi	cited_by	citation_count
15779	New Generation of Parton Distributions with Un...	A new generation of parton distribution functi...	[hep-ph, hep-ex]	2002-01-21	hep-ph/0201195	10.1088/1126-6708/2002/07/012	[{u'slaccitation': u'%%CITATION = ARXIV:1412.4...	4138
7371	Observation of a new particle in the search fo...	A search for the Standard Model Higgs boson in...	[hep-ex]	2012-07-31	1207.7214	10.1016/j.physletb.2012.08.020	[{u'slaccitation': u'%%CITATION = ARXIV:1412.4...	3653
7372	Observation of a new boson at a mass of 125 Ge...	Results are presented from searches for the st...	[hep-ex]	2012-07-31	1207.7235	10.1016/j.physletb.2012.08.021	[{u'slaccitation': u'%%CITATION = ARXIV:1412.4...	3592
15181	Observation of Top Quark Production in Pbar-P ...	We establish the existence of the top quark us...	[hep-ex, hep-ph]	1995-03-02	hep-ex/9503002	10.1103/PhysRevLett.74.2626	[{u'slaccitation': u'%%CITATION = ARXIV:1412.4...	2519
16241	Direct Evidence for Neutrino Flavor Transforma...	Observations of neutral current neutrino inter...	[nucl-ex, hep-ex]	2002-04-21	nucl-ex/0204008	10.1103/PhysRevLett.89.011301	[{u'slaccitation': u'%%CITATION = ARXIV:1412.4...	2425
15690	HERWIG 6.5: an event generator for Hadron Emis...	HERWIG is a general-purpose Monte Carlo event ...	[hep-ph, hep-ex]	2000-11-29	hep-ph/0011363	10.1088/1126-6708/2001/01/010	[{u'slaccitation': u'%%CITATION = ARXIV:1412.4...	2324
14026	First Results from KamLAND: Evidence for React...	KamLAND has been used to measure the flux of $...	[hep-ex, nucl-ex]	2002-12-09	hep-ex/0212021	10.1103/PhysRevLett.90.021802	[{u'slaccitation': u'%%CITATION = ARXIV:1412.4...	2235
16152	A Supersymmetry Primer	I provide a pedagogical introduction to supers...	[hep-ph, hep-ex, hep-th]	1997-09-15	hep-ph/9709356	None	[{u'slaccitation': u'%%CITATION = ARXIV:1412.4...	2160
16235	Measurement of the rate of nu_e + d --> p + p ...	Solar neutrinos from the decay of $^8$B have b...	[nucl-ex, hep-ex]	2001-06-18	nucl-ex/0106015	10.1103/PhysRevLett.87.071301	[{u'slaccitation': u'%%CITATION = ARXIV:1412.4...	2030
13678	The BABAR Detector	BABAR, the detector for the SLAC PEP-II asymme...	[hep-ex]	2001-05-16	hep-ex/0105044	10.1016/S0168-9002(01)02012-5	[{u'title': u'Measurement of the Partial Branc...	1828

There are many more interesting things to be done by analysing the words used in abstracts as well as analysing who cites whom. This will be covered in the second part of this post, as this one is already fairly lengthy.

Just one more thing, the top cited paper of 2014: First combination of Tevatron and LHC measurements of the top-quark mass celebrating collaboration across the globe: