In a recent blog post, we saw how we can use data to define a semantic space for covid-19-related papers and get a nice visualisation of that space, like this:
You can see that the coronavirus papers (which come from the CORD-19 dataset by Semantic Scholar) and the other papers (a random sample of PubMed) occupy quite different regions of the space. That’s good, because it means that we can build a machine-learning classifier to discriminate between those two datasets.
That classifier is essentially a tool which can assign a probability that any document in this space is relevant to coronavirus.
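As a rough sketch, a classifier like that could be built with a standard text-classification pipeline. To be clear, the tiny corpus and the TF-IDF + logistic regression choice below are illustrative assumptions, not the actual CORD-19/PubMed data or the exact model from the repo:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the two datasets (invented for illustration)
coronavirus_docs = [
    "coronavirus transmission dynamics during the covid-19 outbreak",
    "sars-cov-2 spike protein structure and candidate vaccine targets",
    "clinical features of patients infected with novel coronavirus",
]
other_docs = [
    "randomised trial of statins in cardiovascular disease",
    "gut microbiome composition in inflammatory bowel disease",
    "long term outcomes of hip replacement surgery",
]
docs = coronavirus_docs + other_docs
labels = [1] * len(coronavirus_docs) + [0] * len(other_docs)  # 1 = coronavirus-related

# Vectorise the text and fit a simple probabilistic classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)

# predict_proba returns one probability per class; column 1 is
# the probability of being coronavirus-related
prob = model.predict_proba(["spike protein of the novel coronavirus"])[0, 1]
```

Any document mapped into the same vector space can then be scored with `predict_proba`, which is exactly the kind of probability the rest of this post talks about.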
You can see that the classifier is good at assigning articles to one class or the other: simply by drawing a line at the 50% probability mark, articles are classified correctly around 93% of the time. But there is some uncertainty. Unfortunately, there is no objective definition of ‘relevant’, and there is no such thing as a perfect classifier.
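The 50% cut-off is just a thresholding rule on the predicted probabilities. With some made-up numbers (not the real evaluation data), it looks like this:

```python
# Hypothetical (true_label, predicted_probability) pairs for a held-out set;
# the numbers are invented for illustration, not the real evaluation results.
eval_set = [
    (1, 0.97), (1, 0.84), (1, 0.62), (1, 0.45),  # coronavirus papers
    (0, 0.03), (0, 0.12), (0, 0.31), (0, 0.58),  # other papers
]

# "Drawing a line at the 50% mark": predict coronavirus (1) when p >= 0.5
predictions = [1 if p >= 0.5 else 0 for _, p in eval_set]
accuracy = sum(
    pred == true for (true, _), pred in zip(eval_set, predictions)
) / len(eval_set)
print(accuracy)  # 6 of 8 correct here -> 0.75
```

The two mistakes in this toy example are the borderline cases near 0.5, which is where a real classifier's errors tend to cluster too.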
Here’s a visualisation of the same space, but this time, instead of coronavirus papers in red and other literature in green, we are assigning a colour based on the probability of being about coronavirus: red = low probability and yellow = high. (There are actually several shades of orange in between, but very few articles fit those categories, so they are not easy to see here!)
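A red-to-yellow colouring like that can be sketched as a simple interpolation over the probability. This helper is hypothetical (the actual visualisation code lives in the repo), but it shows the idea:

```python
def prob_to_colour(p: float) -> str:
    """Map a probability p in [0, 1] to a hex colour: red (low) -> yellow (high).

    Red (#ff0000) and yellow (#ffff00) share a full red channel and zero
    blue, so only the green channel needs to ramp up with the probability,
    passing through the shades of orange in between.
    """
    green = round(255 * p)
    return f"#ff{green:02x}00"

print(prob_to_colour(0.0))  # "#ff0000" (red: low probability)
print(prob_to_colour(1.0))  # "#ffff00" (yellow: high probability)
```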
Edit: I removed some links to a downloadable version of the visualisation. It’s a large file, so sharing it through S3 was potentially costly. However, you can still make your own using the code in the GitHub repo.
It’s important not to overstate what this classifier can do. ‘Relevance’ is a subjective concept: what constitutes a relevant article depends on your perspective.
As I’ve said in past blog posts, if you are following research on coronavirus, I recommend using a classifier that will learn what you are interested in (instead of a classifier like this, which just tries to broadly classify everything). Semantic Scholar’s custom feed might be a good choice. It will learn what you like based on how you interact with it.
What our classifier is really doing is telling us how much an article looks like the CORD-19 dataset COMPARED to a small sample of PubMed articles. With that in mind, it should be clear that if the CORD-19 dataset doesn’t include all relevant articles, and PubMed DOES include relevant articles, then the classifier is going to have some biases.
It would take successive iterations with more refined lists of related and non-related articles, but I think there’s scope to build a much better classifier that way. New articles about covid-19 appearing on preprint servers may also be a good source of data to teach our classifier which topics are associated with the virus.
This is also a first pass. Machine learning is an inherently iterative process, and besides improving our data there are a few other things we can try to get better results: different classification algorithms, different preprocessing and vectorisation, and so on.
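That kind of comparison can be sketched with cross-validation. The candidate models and the toy corpus below are my own assumptions for illustration, not what the repo actually uses:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the real datasets (invented for illustration)
docs = [
    "coronavirus outbreak in wuhan china",
    "covid-19 pneumonia clinical features",
    "sars-cov-2 genome sequencing results",
    "coronavirus vaccine candidate trial",
    "statins and cardiovascular outcomes",
    "hip replacement surgery recovery",
    "gut microbiome and bowel disease",
    "statin therapy randomised trial",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = coronavirus-related, 0 = other

# Two different vectorisation + algorithm combinations to compare
candidates = {
    "tfidf + logistic regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression()
    ),
    "counts + naive bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
}
results = {}
for name, model in candidates.items():
    # 2-fold cross-validation keeps both classes in every fold here
    scores = cross_val_score(model, docs, labels, cv=2)
    results[name] = scores.mean()
```

On a real dataset you would compare the mean scores (and their spread) to decide which combination is worth iterating on.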