A useful first step when exploring a dataset is to visualise it and look for clusters and other patterns. This is relatively easy to do.
Unfortunately, visualising the CORD-19 data on its own does not put the data into context. For that, we need to compare it with a much larger dataset. Most of the world’s medical science journals (at least, the good ones) are indexed by a service called PubMed. So, I downloaded its content and overlaid the CORD-19 dataset on top of that.
This is much better:
Now you can see the green regions, which represent medical science, and the red regions, which represent CORD-19.
It’s clear that there are several key clusters of papers here.
- Zooming in on the big red region, we find several subclusters around virology and epidemiology.
- Up in the top right is a cluster which appears to be specifically to do with health policy around pandemics.
- There are also large areas with hardly any red dots, e.g. oncology, sports science and materials science.
Hopefully visualisations like this can help publishers to work out which content they have which is relevant to the pandemic. It may also help researchers to find relevant work and put it into context.
That said, I still recommend using discovery tools such as Semantic Scholar or Microsoft Academic to find COVID-19-related research. Visual representations of data like this only get us so far.
How it’s done
My code will be uploaded here — at the time of writing I haven’t got round to it yet. You can also find similar visualisations and code in some recent Kaggle entries.
I’ll also outline the basic process for non-Python folks:
- To begin, what you need is an ordered iterable of text data. In my case, this was a large pandas dataframe with a column containing every title+abstract from (a) PubMed records published in 2019–2020 and (b) the Semantic Scholar CORD-19 data dump. We want to compare (a) with (b).
- Then you need to turn the text into numbers (this is called ‘vectorisation’). A good choice for this is scispaCy, a package released by AllenAI (the same people who make Semantic Scholar). You can also use any one of a number of algorithms made available by Gensim or HuggingFace.
- Once you have turned all of your documents into vectors, you need to compress those vectors into simple x and y coordinates so that they can be plotted on a map. There is a very easy-to-use package called umap-learn which does this in one line.
- Finally, you can append those x and y coordinates to the original dataframe and plot the data using any one of a number of packages. I used the Bokeh package for plotting given that it has a lot of options for interactive features.
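The four steps above can be sketched roughly as follows. This is a minimal illustration rather than the code behind the map shown earlier: the function names, the column names, the `en_core_sci_md` model choice and the output file name are all assumptions made for the example. Third-party packages are imported inside the functions that use them, so each step can be tried without installing everything at once.

```python
"""Sketch of the four steps: text -> vectors -> 2-D coordinates -> plot.

Assumptions (not from the original post): all function and column names,
the `en_core_sci_md` scispaCy model and the output path are illustrative.
"""


def make_text(title, abstract):
    """Join a title and an abstract into one string, tolerating missing values."""
    parts = [p.strip() for p in (title, abstract) if isinstance(p, str) and p.strip()]
    return " ".join(parts)


def build_records(pubmed, cord19):
    """Step 1: one ordered list of records, each tagged with its source."""
    records = []
    for source, papers in (("PubMed", pubmed), ("CORD-19", cord19)):
        for paper in papers:
            records.append({
                "source": source,
                "title": paper.get("title") or "",
                "text": make_text(paper.get("title"), paper.get("abstract")),
            })
    return records


def vectorise(texts, model_name="en_core_sci_md"):
    """Step 2: one vector per document, using a scispaCy model."""
    import numpy as np
    import spacy  # the scispaCy model itself has to be installed separately

    nlp = spacy.load(model_name)
    # doc.vector is the average of the token vectors in the document
    return np.vstack([doc.vector for doc in nlp.pipe(texts, batch_size=256)])


def reduce_to_2d(vectors, random_state=42):
    """Step 3: compress the vectors to x and y coordinates with UMAP."""
    import umap

    reducer = umap.UMAP(n_components=2, random_state=random_state)
    return reducer.fit_transform(vectors)


def plot_map(df, output_path="map.html"):
    """Step 4: interactive scatter plot, green for PubMed and red for CORD-19."""
    from bokeh.models import ColumnDataSource
    from bokeh.plotting import figure, output_file, save

    df = df.assign(colour=df["source"].map({"PubMed": "green", "CORD-19": "red"}))
    fig = figure(tooltips=[("title", "@title"), ("source", "@source")])
    fig.scatter(x="x", y="y", color="colour", legend_field="source",
                size=3, alpha=0.5, source=ColumnDataSource(df))
    output_file(output_path)
    save(fig)


def build_map(pubmed, cord19):
    """Run all four steps and return the dataframe with x and y appended."""
    import pandas as pd

    df = pd.DataFrame(build_records(pubmed, cord19))
    coords = reduce_to_2d(vectorise(df["text"].tolist()))
    df["x"], df["y"] = coords[:, 0], coords[:, 1]
    plot_map(df)
    return df
```

Here `pubmed` and `cord19` are assumed to be lists of dictionaries with `title` and `abstract` keys; calling `build_map(pubmed, cord19)` would write the interactive plot to `map.html`. PubMed rows are placed first so that the CORD-19 points are drawn on top of them.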
The result is essentially a map of medical science with the subfields related to COVID-19 highlighted. We can explore this map to find related literature and also see COVID-19-related research in context.
You can develop this process by adding clustering techniques to automatically detect subfields in the literature.
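As one possible way of doing that, the sketch below runs k-means over the x and y coordinates and then labels each cluster with its most frequent title words. Both choices are assumptions made for illustration (a density-based method may suit UMAP output better), and the function names, the stopword list and the number of clusters are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def cluster_points(coords, n_clusters=20, random_state=0):
    """Assign each (x, y) point to one of n_clusters groups using k-means."""
    from sklearn.cluster import KMeans

    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return model.fit_predict(coords)


def describe_clusters(labels, titles, top_n=3):
    """Return the top_n most frequent title words for each cluster label."""
    counts = defaultdict(Counter)
    for label, title in zip(labels, titles):
        words = re.findall(r"[a-z][a-z0-9-]+", title.lower())
        counts[int(label)].update(w for w in words if w not in STOPWORDS)
    return {label: [word for word, _ in counter.most_common(top_n)]
            for label, counter in counts.items()}
```

With a dataframe `df` that already holds `x`, `y` and `title` columns, `labels = cluster_points(df[["x", "y"]].to_numpy())` followed by `describe_clusters(labels, df["title"])` would give a rough word-based name for each detected region of the map.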