How I built a list of coronavirus-related research papers using the Microsoft Academic Graph

EDIT: since writing this post, AllenAI’s Semantic Scholar have posted a customizable feed of COVID-19 research as well as a datadump of past research. This is essentially the ‘ideal thing’ I’m describing below.

Do you know what rhabdomyolysis is? … No? Neither do I…

Many publishers are making literature relating to the ongoing coronavirus outbreak free-to-read. My employer, SAGE, is no exception and we’re actively promoting related content from our journals.

But which literature is related? We can’t just search for papers on ‘coronavirus’, since many relevant papers don’t use that word. Take papers on past pandemics such as the recent MERS outbreak or the Spanish Flu pandemic of 1918–1919. Are those relevant? How about high-level papers on health policy and disaster response — where should we draw the line?

The ideal thing would be to build a machine-learning classifier of related papers so that all publishers could check an individual paper for relevance to the topic. Such a classifier could also be the basis of a feed of new papers and preprints for researchers. Unfortunately, building it is a challenge because we don’t have any labelled data to train on. Furthermore, the datasets we might use are quite disparate: PubMed, preprint servers, CrossRef and so on.

So, if we can’t do that, let’s try something simpler and more practical using the Microsoft Academic Graph (MAG), a dataset which Microsoft kindly makes freely available.

Here’s the process:

The distribution of scores assigned to keywords by MAG. Where do you draw the line?

The interesting thing about doing this is that you find a lot of keywords that are relevant but that you hadn’t thought to look for. Take ‘rhabdomyolysis’. I’d never heard that word before and I had to look it up. It’s a condition that can be caused by a serious viral infection. It doesn’t look hugely relevant to the coronavirus outbreak, but without knowing more, it might make sense to classify papers about rhabdomyolysis as being related anyway.
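To make the keyword-discovery idea concrete, here is a minimal Python sketch. It is not the script I used: the data is a toy dictionary invented for illustration, and the function names (`related_keywords`, `papers_with_any`) and the score cut-off are assumptions. It only shows the general shape: start from a seed keyword, collect the other keywords that appear on the same papers above a score threshold, then pull every paper tagged with any of them.

```python
from collections import Counter

# Toy data, invented for illustration: paper id -> [(keyword, score), ...]
paper_keywords = {
    "p1": [("coronavirus", 0.62), ("virology", 0.48), ("rhabdomyolysis", 0.41)],
    "p2": [("coronavirus", 0.55), ("pandemic", 0.52)],
    "p3": [("pandemic", 0.47), ("health policy", 0.44), ("virology", 0.30)],
    "p4": [("rhabdomyolysis", 0.58), ("nephrology", 0.50)],
}


def related_keywords(paper_keywords, seed, min_score):
    """Count keywords that co-occur with `seed`, ignoring tags below `min_score`."""
    counts = Counter()
    for tags in paper_keywords.values():
        strong = {kw for kw, score in tags if score >= min_score}
        if seed in strong:
            counts.update(strong - {seed})
    return counts


def papers_with_any(paper_keywords, keywords, min_score):
    """Return sorted ids of papers tagged with any keyword at or above `min_score`."""
    return sorted(
        pid
        for pid, tags in paper_keywords.items()
        if any(kw in keywords and score >= min_score for kw, score in tags)
    )


# Hypothetical cut-off; where to draw the line is a judgment call.
found = related_keywords(paper_keywords, "coronavirus", min_score=0.4)
expanded = {"coronavirus"} | set(found)
related_papers = papers_with_any(paper_keywords, expanded, min_score=0.4)
```

Raising `min_score` shrinks both the keyword list and the final set of papers, which is exactly the trade-off in deciding where to draw the line.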

The list we got from this process is far from perfect. It includes some articles which are not relevant to coronavirus and no doubt it also fails to include articles which are relevant. However, it was fast and easy to implement and, most importantly, it unearthed much more relevant literature than we found with a simple keyword search, as well as a lot of keywords we hadn’t thought to search for.

Edit: Digital Science’s Dimensions have a list of related research resources including publications, datasets and clinical trials. I can also recommend Microsoft Academic as a good place to search for research on this topic and I recommend the Unpaywall browser extension in case you are still unable to access research papers directly via the publisher.

Here’s the script I used to pull my list.

If you’re just getting started with the MAG and you want to try this process, it’s worth thinking first about how you will store and query the data. In this case, we downloaded a partial copy of the MAG to a SQL database and queried that database to find related papers. SQL is only one of the database systems you could use and Microsoft recommend various tools for working with the data.
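As a rough illustration of the SQL approach, here is a small self-contained sketch using Python’s built-in `sqlite3` module with an in-memory database. The table and column names are modelled on my recollection of the MAG schema and should be treated as assumptions, as should the toy rows, the `find_related` helper and the score threshold. SQLite stands in for whichever SQL database you actually load the data into.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed, simplified schema; check the names against your own copy of the MAG.
conn.executescript("""
CREATE TABLE Papers (PaperId INTEGER PRIMARY KEY, PaperTitle TEXT);
CREATE TABLE FieldsOfStudy (FieldOfStudyId INTEGER PRIMARY KEY, NormalizedName TEXT);
CREATE TABLE PaperFieldsOfStudy (PaperId INTEGER, FieldOfStudyId INTEGER, Score REAL);
""")

# Toy rows, invented for illustration.
conn.executemany("INSERT INTO Papers VALUES (?, ?)", [
    (1, "a toy paper on mers"),
    (2, "a toy paper on muscle injury"),
    (3, "a toy paper on graph theory"),
])
conn.executemany("INSERT INTO FieldsOfStudy VALUES (?, ?)", [
    (10, "coronavirus"),
    (11, "rhabdomyolysis"),
    (12, "graph theory"),
])
conn.executemany("INSERT INTO PaperFieldsOfStudy VALUES (?, ?, ?)", [
    (1, 10, 0.60),
    (2, 11, 0.45),
    (3, 12, 0.70),
    (3, 10, 0.10),  # weak tag, should fall below the threshold
])


def find_related(conn, keywords, min_score):
    """Return (PaperId, PaperTitle) rows tagged with any keyword at or above min_score."""
    placeholders = ",".join("?" for _ in keywords)
    sql = f"""
        SELECT DISTINCT p.PaperId, p.PaperTitle
        FROM Papers AS p
        JOIN PaperFieldsOfStudy AS pf ON pf.PaperId = p.PaperId
        JOIN FieldsOfStudy AS f ON f.FieldOfStudyId = pf.FieldOfStudyId
        WHERE f.NormalizedName IN ({placeholders}) AND pf.Score >= ?
        ORDER BY p.PaperId
    """
    return conn.execute(sql, [*keywords, min_score]).fetchall()


rows = find_related(conn, ["coronavirus", "rhabdomyolysis"], min_score=0.4)
```

Only the placeholders are interpolated into the query string; the keywords and the threshold are passed as bound parameters, so the keyword list can come from anywhere without quoting problems.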

