Rejected article tracking with the CrossRef API

Nothing weighs more heavily on the heart of a journal editor than rejecting an article. Partly this is because you know you’re giving someone, somewhere, some bad news, and partly because there are no prizes for peer review. No one’s watching; no one’s measuring your ability to identify and reject bad content. In fact, you might as well just accept the paper and take the money, right? Because no one’s watching. If your Impact Factor’s high enough, a bad paper that doesn’t get cited won’t even make a dent, right? No one’s watching!

Well, that’s not so… There are a number of ways to track peer review indirectly, including numerous commercial rejected article trackers. I recently shared a quick and dirty Rejected Article Tracker (or ‘RAT’, if you want a name connoting quickness and dirtiness). The application is very simple: input a list of titles and authors for your rejected articles, and the RAT will search the CrossRef API for papers with those titles (and those authors). Once you receive the results, you can start to analyse where your rejected articles are going.
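To make the lookup concrete, here is a minimal sketch of the kind of query the RAT sends, using CrossRef’s public `/works` endpoint. The `query.bibliographic` and `query.author` parameters are real CrossRef REST API parameters; the function names and structure here are illustrative, not the RAT’s actual code.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"

def build_crossref_query(title, author, rows=5):
    """Build a CrossRef /works query URL for one rejected article."""
    params = urlencode({
        "query.bibliographic": title,  # free-text match on title/metadata
        "query.author": author,        # free-text match on author names
        "rows": rows,                  # number of candidate matches to return
    })
    return f"{CROSSREF_WORKS}?{params}"

def search_crossref(title, author, rows=5):
    """Fetch candidate matches; returns the list of result items."""
    with urlopen(build_crossref_query(title, author, rows)) as resp:
        data = json.load(resp)
    return data["message"]["items"]
```

Each returned item carries a DOI, title, author list and a CrossRef relevance score, which you can then compare against your rejected-article metadata.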

As a journal editor, this shows you:

  • Which of your competitors are publishing your rejected articles. This can be a great source of mirthful schadenfreude — something which might raise your spirits after all those depressing rejections.
  • Whether you rejected any good papers that went on to be published and cited elsewhere (although sometimes it will show you that bad papers do indeed get published and cited). This might be detrimental to your mood once again, but it could also be valuable feedback for your peer-review processes.
  • You may also spot the occasional case of dual submission (a form of author misconduct where an article is submitted to two or more journals at once). Some of these cases are obvious if we allow results where the publication date recorded by CrossRef precedes your rejection date. It looks like someone else thought that no one was watching, too!
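The dual-submission heuristic in that last point is just a date comparison: if CrossRef records the matched article as published before you even rejected it, it must have been under consideration at two journals at once. A hedged sketch, assuming you hold both dates as ISO strings (the function name is mine, not the RAT’s):

```python
from datetime import date

def possible_dual_submission(crossref_published, rejected_on):
    """Flag results where the matched article's recorded publication
    date precedes the rejection date -- i.e. it cannot have been
    submitted elsewhere only after your rejection.
    Both arguments are ISO date strings, e.g. '2012-03-01'."""
    published = date.fromisoformat(crossref_published)
    rejected = date.fromisoformat(rejected_on)
    return published < rejected
```

In practice you would also want a fuzzier check (publication only weeks after rejection is suspicious too, given typical production times), but the strict version above matches the obvious cases described here.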

This data is interesting. If you could track this data for all publishers, then you would know the path of all articles through the peer-review system. This might provide scope for much needed evaluation of peer-review services. It would also give us some more insight into the cost and value of peer-review. When we have a scholarly ecosystem which is expanding at a geometric rate, this is important.

CrossRef RAT evaluation

I recently tested a set of ~47,000 articles published on arXiv in 2012. The set was limited to articles where arXiv displays an author-provided DOI showing where the article was eventually published.

This allowed me to check the results. If the RAT finds an article in CrossRef, I can compare its DOI with the author-provided DOI on arXiv and confirm whether the RAT is performing properly.
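Comparing DOIs is slightly less trivial than string equality: DOIs are case-insensitive and turn up with various prefixes (`doi:`, `https://doi.org/`, `http://dx.doi.org/`). A small normalisation step (illustrative; not necessarily what the RAT does internally) avoids false mismatches:

```python
import re

def normalize_doi(doi):
    """Strip common resolver prefixes and lowercase
    (DOIs are case-insensitive by specification)."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)

def same_doi(a, b):
    """True if two DOI strings refer to the same DOI."""
    return normalize_doi(a) == normalize_doi(b)
```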

Here are some headline stats:

  • The RAT found results for around 78% of the articles. There are a few things we can do to improve this, but if the title (or authorship) of an article changes substantially between rejection and publication, the RAT won’t find it. Around 72% of the articles had a correct DOI match (but they also had some incorrect matches).
  • A small percentage (<1%) of articles have the wrong DOI recorded on arXiv. This seems to have two causes: typos, and articles where the DOI does actually change. For any authors reading this: I recommend checking! Machine-readability is important. Computers read your work more than humans do.
  • Results were limited to those with a Levenshtein-based title-similarity score (t_sim) of at least 70. (t_sim measures the textual similarity between the title of an input article and the title of each match found by the RAT; higher means more similar.) For the 78% of articles with matches, this gave a baseline accuracy of 81%. That is, if I classified all of the results as correct, then 81% of those classifications would indeed be correct.
  • The best predictor I could find of the accuracy of results was this title-similarity measure (t_sim).
Distributions of Levenshtein distance (t_sim) for correct and incorrect results.
  • There are a few numerical columns of data. Here is the r² correlation matrix:
correct_yn is the quantity we are trying to predict: 1.0 for a correct result, 0.0 for an incorrect one. The ‘correct_yn’ row of the matrix shows how it correlates with the other variables. match_all = 1.0 if all author names match on a result, 0.0 if not. cr_score = the score provided by CrossRef with each result. rank = the rank of the result among all results returned by CrossRef. n_days = number of days on arXiv before journal publication.
  • You can improve results by simply setting the t_sim threshold a bit higher or ignoring every result with rank >1.
  • The best results I found came from taking the available numerical data and training a simple logistic regression classifier, which boosts that 81% accuracy figure as high as 97%.
  • That’s a lot of percentages and percentages of percentages. Put simply, this adds up to correctly tracking just over 70% of all of the input articles.
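The thresholding idea above (raise the t_sim cut-off, keep only rank-1 results) is easy to sketch in pure Python. This is my guess at how t_sim might be computed, as a Levenshtein edit distance rescaled to 0–100; the real RAT may define it differently, and the logistic regression step is not shown here.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def t_sim(title_a, title_b):
    """Title similarity on a 0-100 scale: 100 = identical titles."""
    a, b = title_a.lower(), title_b.lower()
    longest = max(len(a), len(b)) or 1
    return 100 * (1 - levenshtein(a, b) / longest)

def keep_result(result, threshold=70):
    """The simple filter described above: keep only top-ranked
    results whose title similarity meets the threshold."""
    return result["rank"] == 1 and result["t_sim"] >= threshold
```

Raising `threshold` trades recall for precision, which is exactly the adjustment described in the bullet points above.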

The caveats…

There’s more to do to track down the missing 22% and see if they can be retrieved in an automated way.

arXiv is also quite focused on physics, maths, computing and so on, and it’s likely that other areas of science will exhibit different behaviour.

I hope this rejected article tracker is useful. Please do get in touch with any questions or comments.

Written by

Data scientist working in research communication. #webapps #python #machinelearning #ai
