Can we predict the citations of preprints?… And should we?

It’s interesting to compare preprints with journals. Let’s take a look at all of the articles uploaded to ArXiv in 2010, using data from the ArXiv OAI-PMH API. Then we can match them up with the journal versions of those same articles using the CrossRef API.
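To make that concrete, here’s a minimal sketch of the matching step in Python. It assumes the standard ArXiv OAI-PMH endpoint and the public CrossRef REST API; the field names should be checked against the current API documentation, and a real harvest would need to follow OAI-PMH resumption tokens and respect rate limits.

```python
# Sketch: fetch one page of 2010 ArXiv records via OAI-PMH and, where a record
# carries a DOI, look up its CrossRef registration date.
import requests
import xml.etree.ElementTree as ET

OAI_URL = "http://export.arxiv.org/oai2"
ARXIV_NS = "{http://arxiv.org/OAI/arXiv/}"

resp = requests.get(OAI_URL, params={
    "verb": "ListRecords",
    "metadataPrefix": "arXiv",   # ArXiv-specific format that includes the DOI
    "from": "2010-01-01",
    "until": "2010-12-31",
})
root = ET.fromstring(resp.content)

# Only the first page is handled here; a full harvest would loop on the
# resumptionToken. Look up just a few DOIs to keep the example polite.
for meta in list(root.iter(ARXIV_NS + "arXiv"))[:20]:
    doi = meta.findtext(ARXIV_NS + "doi")
    created = meta.findtext(ARXIV_NS + "created")   # date of upload to ArXiv
    if not doi:
        continue  # no journal version registered (yet)
    work = requests.get(f"https://api.crossref.org/works/{doi}").json()
    registered = work["message"]["created"]["date-time"]  # CrossRef registration
    print(doi, created, registered)
```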

Why are we looking at 2010? It might seem like a long time ago, but it may surprise you to learn that some articles uploaded to ArXiv in 2010 are still being published now (the most recent journal version appeared in late December 2018).

For the most part, content appears first on ArXiv and then later in a journal.

[Figure: histogram of time from ArXiv upload to CrossRef registration]
The x-axis shows the date of CrossRef registration minus the date of upload to ArXiv, so negative values represent articles which were published in journals before being uploaded to ArXiv. Because the y-axis is log-scaled, it may not be obvious that only 10% of articles fall into this category.

The cost of peer-review

There’s something quite important here. Millions of articles are peer-reviewed every year. Each of those reviews costs the time of several people and results in the manuscript being accepted or rejected. Rejected articles are assumed to go on to be peer-reviewed elsewhere, again and again, until they are eventually accepted (or never are).

The total cost of peer-review is not well understood, and neither is the process by which articles get from being written to being published. The irony is that the papers which cost the most to peer-review (because they are rejected and re-reviewed repeatedly) are often the ones least valuable to readers.

The process described above seems to show up in the data. If low-quality papers are rejected by journals, they will take longer to be published, and low-quality papers are generally not well-cited. And, indeed, we see a negative correlation between time-on-ArXiv and citation counts.
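As a rough illustration, that correlation could be checked with something like the sketch below. The file and column names are hypothetical placeholders for the matched dataset built earlier; Spearman’s rank correlation is used because citation counts are heavily skewed.

```python
# Hypothetical sketch: correlate time-on-ArXiv with citation counts.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("arxiv_2010_matched.csv")   # hypothetical output of the matching step

# Days between ArXiv upload and CrossRef registration of the journal version.
df["days_on_arxiv"] = (
    pd.to_datetime(df["crossref_registered"]) - pd.to_datetime(df["arxiv_created"])
).dt.days

rho, p = spearmanr(df["days_on_arxiv"], df["citation_count"])
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")
```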

[Figure: citation count versus time from ArXiv upload to CrossRef registration]
Note that the spike between 1100 and 1200 on the x-axis is caused by a small number of review articles. Ideally, review articles would be filtered out of this dataset.

Again, this is noisy data and there are a number of other reasons why we might see this trend:

  • Articles published more recently have had less time to accrue citations.
  • Citations to preprints may not be counted by CrossRef.

However, it certainly does seem that time-on-ArXiv gives us a weak predictor of the citation potential of preprints.

Citation prediction

I’ve said before that machines are learning from your papers. One of the many reasons for doing so is to help predict citation performance. Citation prediction can actually be done with quite high accuracy by combining a number of weak predictors like the one described above.
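To give a flavour of what combining weak predictors might look like, here is an illustrative sketch (not the actual model; the feature names are assumptions) that feeds several signals into a standard regression model:

```python
# Illustrative sketch: combine several weak signals into a citation predictor.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("arxiv_2010_matched.csv")   # hypothetical matched dataset
features = ["days_on_arxiv", "n_authors", "abstract_length", "n_references"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["citation_count"], test_size=0.2, random_state=0
)
model = GradientBoostingRegressor().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```

Inspecting the fitted model’s feature importances would also show how much weight a signal like time-on-ArXiv carries relative to the other predictors.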

There are good things about citation predictions:

  • Articles with high citation potential can be promoted so that they reach their ideal audience. This saves readers browsing time and ensures that authors get recognition for their highly citable work.
  • Articles with lower citation potential can be identified and offered publication in a suitable venue sooner, so that less reviewing time is spent on them. Earlier publication also means more time to accrue citations.

But there are potential downsides, too:

  • Does citation prediction prejudice peer-review?
  • Is it fair to promote some authors’ work and not others on the basis of a prediction?
  • Citations are themselves a controversial measure of research impact and quality. An excellent but highly specialised paper is likely to have a low citation count simply because its audience is naturally small, so citations might not truly reflect the value of the work.

It’s clear that we can use machine learning to improve efficiency in the research communication system. However, bias in machine learning is a serious issue. As the technology for processing and understanding the research literature matures, ensuring that we handle it responsibly should be a key concern.

