It is probably fair to say that the artificial intelligence (AI) community is absolutely obsessed with deep learning (DL) nowadays. As a researcher active in this field, I haven’t seen an AI-related paper that did not incorporate DL in some way in a long time. From an industry perspective, I believe even the general public is aware that most of the “fancy” things tech companies do with AI (such as curating the content users see on social media platforms and developing smart assistants that understand natural language) are done using DL techniques, even if they may not know exactly what DL is. Companies love to use it as a buzzword, though, and try to incorporate it into all their products.
Given all this hype, one would expect DL to outperform most or all AI techniques that came before it, right? The question of whether DL is really better than other approaches is especially important today, since most of the DL systems companies deploy in practice require considerable resources to develop and maintain. For one, the sheer amount of hardware infrastructure necessary to accommodate state-of-the-art DL systems presents important environmental and economic challenges (Bender et al. (2021); Strubell et al. (2019)). Deep learning can have an atrocious carbon footprint: by the estimates of Strubell et al. (2019), training a single large DL model can emit as much CO2 as five cars do over their entire lifetimes. One might object that DL models need to be trained only once and that the damage is therefore not so bad, but I do not agree for several reasons:
Aside from heavy computational requirements, most DL systems also need tons and tons of data. Since more data generally means a better-performing model, companies and academic researchers tend to cut a lot of corners in the data collection process. This raises a number of problems, the most significant of which are the ethical and privacy implications of scraping potentially sensitive data from the internet without any form of consent (Gebru et al. (2018)). In the case of the well-known ImageNet data set, public outcry regarding the unauthorized inclusion of people’s faces forced the creators to censor large portions of the data set. However, there exist many other data sets similar to ImageNet that contain lots of personally identifiable information that was never taken out, such as the CASIA WebFace and VGGFace data sets. On the whole, the field of ML is still largely built on large amounts of personal information that was obtained illicitly.
So we put all this effort into developing DL systems. We collect enormous data sets containing hundreds of millions of samples and we build large GPU clusters to train our models on. But is it worth it? In this post, I show that the answer to this question is negative for at least one highly popular domain of application for deep learning: neural news recommendation, that is, recommending news articles to users via deep learning models.
The basic objective of news recommendation is simple: given a news article and a user profile, predict how likely it is for a user with that profile to be interested in the article. Usually, this interest is interpreted as a “click probability,” i.e., the probability that the user would click on (and presumably read) the article if it were presented to them. A user profile can contain any information the company has on the user in question. In its most basic form, it simply consists of a set of news articles the user has viewed in the past. The news recommendation system then essentially has to discern the tastes and interests of the user based on their browsing history in order to determine if they would be interested in some new article they haven’t yet seen.
A typical approach towards neural news recommendation is depicted in the figure below, taken from the Neural News Recommendation with Attentive Multi-View Learning (NAML) paper by Wu et al. (2019).
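Basically, the model takes two pieces of information as input: the set of news articles the user has previously browsed and a candidate news article the user has not yet seen.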
The goal of the model is therefore to take the set of articles previously viewed by a user and to use this information to intelligently determine the likelihood that this user would be interested in another given article they have not yet seen. The NAML model computes this likelihood by encoding both the user history and the candidate news as vectors and taking their inner product. The larger this inner product is, the more interesting the candidate news is supposed to be for the user. These encodings are created using several word embeddings and deep convolutional neural networks, making the NAML model rather complex and sophisticated.
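The scoring step described above can be sketched in a few lines. This is only an illustration of the inner-product idea, not NAML itself: the vectors are made-up stand-ins for encoder outputs, and averaging the browsed-article vectors is a simplifying assumption that takes the place of NAML’s learned user encoder.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Element-wise mean; a crude stand-in for a learned user encoder."""
    if not vectors:
        raise ValueError("need at least one browsed-article vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def click_score(browsed: List[Vector], candidate: Vector) -> float:
    """Higher score means the candidate is assumed to be more interesting."""
    return dot(mean_vector(browsed), candidate)


# Hypothetical 3-dimensional article encodings (made up for illustration).
history = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.1]]
similar_article = [1.0, 0.0, 0.0]
unrelated_article = [0.0, 1.0, 0.0]

scores = {
    "similar": click_score(history, similar_article),
    "unrelated": click_score(history, unrelated_article),
}
```

In this toy example the candidate that points in the same direction as the user’s history receives the larger score, which is all the inner product is meant to capture.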
At the time of its publication at IJCAI 2019, the NAML model was considered state of the art and outperformed most existing alternatives. The figure below is another excerpt from the NAML paper, in which the method is compared to seven other approaches according to various metrics. For each metric, a higher number is more desirable.
We can see that the improvement of NAML over prior work is impressive across all metrics. So when I started playing around with it, I was pleased to see that I could replicate their results almost exactly. The results were reported for the MIND data set, which is freely available online. There are open-source implementations of NAML as well, so it’s pretty easy to get started.
However, when experimenting with data poisoning strategies, I made the surprising observation that performance of NAML was almost entirely unaffected by manipulations of the title and body of the news articles. In fact, you can completely erase all titles and bodies of the news items in the MIND data set and the performance of the model will barely change across all of the metrics. It doesn’t even matter whether the model is retrained without this information or not, since it appears to always completely ignore the title and body features of the data. Without this information, the only features left in the data set are the category and subcategory classifications. These features only tell you what the broad topic of the article might be, such as politics or sports, but tell you nothing about the actual content.
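The ablation described above amounts to blanking the text fields while leaving the category labels untouched. A minimal sketch follows; the record layout (dictionaries with `category`, `subcategory`, `title`, and `body` keys) is a hypothetical simplification chosen for illustration and is not the actual MIND file format.

```python
from typing import Dict, Iterable, List

# Hypothetical field names; adapt them to however the articles are stored.
TEXT_FIELDS = ("title", "body")


def erase_text(articles: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return copies of the articles with title and body blanked out."""
    stripped = []
    for article in articles:
        copy = dict(article)  # leave the original record untouched
        for field in TEXT_FIELDS:
            if field in copy:
                copy[field] = ""
        stripped.append(copy)
    return stripped


# Made-up example record.
articles = [
    {"category": "sports", "subcategory": "soccer",
     "title": "Hypothetical headline", "body": "Hypothetical article text."},
]
ablated = erase_text(articles)
```

Evaluating a model on both `articles` and `ablated` and comparing the metrics is one way to check how much it relies on the text.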
This finding prompted me to investigate whether deep learning is necessary for this task at all. After all, each item in the MIND data set only receives a single category and subcategory marker, so there really isn’t a whole lot of information you can extract from these features.
My first idea for a baseline for the MIND data set was extremely simple: we just take the category that the candidate news item belongs to and compute the fraction of items in the browsed news set that also belong to this same category. This fraction is our predicted click probability. With this completely trivial model, lo and behold, you can obtain these results:
AUC    : 61.70
MRR    : 29.42
nDCG@5 : 31.84
nDCG@10: 37.91
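A minimal sketch of the category-only baseline described above. The function name and the example categories are made up for illustration, and returning 0.0 for a user with an empty history is an assumption, since that case is not specified here.

```python
from typing import List


def category_click_probability(browsed_categories: List[str],
                               candidate_category: str) -> float:
    """Fraction of browsed articles that share the candidate's category."""
    if not browsed_categories:
        # Assumption: handling of empty histories is not specified.
        return 0.0
    matches = sum(1 for c in browsed_categories if c == candidate_category)
    return matches / len(browsed_categories)


# Made-up browsing history for illustration.
history = ["sports", "sports", "politics", "finance"]
p_sports = category_click_probability(history, "sports")  # 2 of 4 match
p_travel = category_click_probability(history, "travel")  # 0 of 4 match
```

There is nothing to train here: the prediction is a single pass over the user’s history.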
Note that the AUC is better than that of all the models NAML was compared against. In some sense, the AUC is the most important metric of all, since it measures the probability of ranking an item that the user would be interested in more highly than an item the user would not want to see. A higher AUC value means we are more likely to recommend items the user wants to see than items the user does not want to see. The nDCG@5 score is also better than that of the Wide&Deep model, which utilizes deep learning. The other two metrics, MRR and nDCG@10, are worse for our model than for the alternatives. We can still improve these numbers, however, by also taking into account the subcategory to which the news item belongs. Specifically, for both the category and subcategory features, we compute the fraction of browsed news articles that have the same feature value. These two fractions are then averaged to obtain the final click probability. With this model, we have the following results:
AUC    : 62.15
MRR    : 29.74
nDCG@5 : 32.30
nDCG@10: 38.37
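A minimal sketch of the category-plus-subcategory baseline described above. Articles are represented as hypothetical `(category, subcategory)` pairs, the example values are made up, and returning 0.0 for an empty history is again an assumption.

```python
from typing import List, Tuple

Article = Tuple[str, str]  # (category, subcategory)


def match_fraction(values: List[str], target: str) -> float:
    """Fraction of values equal to target (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(1 for v in values if v == target) / len(values)


def click_probability(browsed: List[Article], candidate: Article) -> float:
    """Average of the category and subcategory match fractions."""
    category_fraction = match_fraction([a[0] for a in browsed], candidate[0])
    subcategory_fraction = match_fraction([a[1] for a in browsed], candidate[1])
    return (category_fraction + subcategory_fraction) / 2.0


# Made-up browsing history for illustration.
history = [
    ("sports", "football"),
    ("sports", "basketball"),
    ("news", "politics"),
    ("finance", "markets"),
]
# Category matches 2 of 4 articles, subcategory matches 1 of 4.
p = click_probability(history, ("sports", "football"))
```

For this history the two fractions are 0.5 and 0.25, so the predicted click probability is their average, 0.375.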
We still beat every alternative except NAML itself when it comes to AUC. For nDCG@5, we now beat four out of seven models as opposed to one out of seven. The MRR and nDCG@10 scores have also improved, but not enough to surpass any of the other models.
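The interpretation of AUC given earlier, the probability that a clicked item is ranked above a non-clicked one, can be computed directly by counting pairs. The sketch below uses that textbook pairwise definition and counts ties as half a win; it is not the evaluation code behind the numbers reported here, and the labels and scores are made up.

```python
from typing import List


def pair_outcome(clicked_score: float, other_score: float) -> float:
    """1.0 if the clicked item is ranked higher, 0.5 on a tie, else 0.0."""
    if clicked_score > other_score:
        return 1.0
    if clicked_score == other_score:
        return 0.5
    return 0.0


def auc(labels: List[int], scores: List[float]) -> float:
    """Fraction of (clicked, non-clicked) pairs that are ordered correctly."""
    clicked = [s for y, s in zip(labels, scores) if y == 1]
    skipped = [s for y, s in zip(labels, scores) if y == 0]
    if not clicked or not skipped:
        raise ValueError("need at least one clicked and one non-clicked item")
    wins = sum(pair_outcome(c, s) for c in clicked for s in skipped)
    return wins / (len(clicked) * len(skipped))


# Made-up impression: 1 = clicked, 0 = not clicked.
labels = [1, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.5]
result = auc(labels, scores)  # 3 of the 4 pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ordering, which puts the scores in the low 60s (on a 0 to 100 scale) into perspective.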
The simple baselines I have proposed here do not universally outperform all DL approaches across all relevant metrics, but they do outperform many of them on a number of important metrics. Considering that the DL approaches are generally highly complex and require a lot of effort to get right, the fact that we can match and even outperform these methods using such an incredibly trivial model is perplexing. The actual improvements obtained over the simple baselines using DL are vanishingly small compared to the overhead DL introduces. I would even go so far as to say that an average end-user would not notice whether a company employs the NAML model or my simple baseline, except that the system would probably be much more responsive with the baseline because of its simplicity. I certainly don’t think there would be enough of a qualitative difference in practice to warrant the additional effort of training and deploying such a complicated deep learning model.
So where does this leave us? As far as I see it, there are a few lessons that can be drawn from this particular case study. The first and most obvious takeaway is that we shouldn’t tacitly assume that DL is the best solution for every problem to which it can be applied. This assumption creates a sort of “deep learning filter bubble,” where researchers only pay attention to methods that incorporate DL and completely neglect any potential alternatives that do not. This is a major blind spot in the ML community, because DL is expensive: it takes a lot of physical infrastructure to develop and deploy, which comes with environmental and economic concerns; it also requires a lot of data, which incentivizes researchers to collect as much data as quickly as possible, leading to privacy concerns and poor data collection practices that result in low-quality data sets. If DL can be avoided, it should be.
A second, more optimistic lesson we could learn from this is that DL may be used as a precursor to simpler and more specialized models. For a problem like news recommendation, it may be very difficult to design simple, specific models that work well enough for our purposes; we may simply have no idea where to start. In such cases, one could employ DL as a very broad, general-purpose technique for designing models that work well. Once these models produce satisfactory results, we can analyze them and investigate what they have learned, with the aim of extracting simpler relationships that we can implement in more lightweight models. In this particular case, I noticed after fitting a complicated DL model that the network only paid attention to category and subcategory information. This prompted me to experiment with a few simple models based on just these features, but there is no guarantee that NAML actually computes the click probability in this way. Indeed, the resulting baselines I created perform worse than NAML, but they do outperform some of the DL approaches. We would have to dig deeper to find out exactly where the additional performance improvements come from, but this shouldn’t be too difficult in this case, since there is not that much that can be done with just the category and subcategory features. In general, such an analysis can require a lot of effort, but there exist many techniques to help with it. In fact, there is an entire subfield of ML dedicated to this problem, variously known as “interpretable” or “explainable” AI. These techniques can help simplify a complicated DL network to a much more manageable scale, such as a linear or quadratic function of a limited number of features.
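As a toy illustration of reducing a model to a linear function of a single feature, the sketch below probes a black-box scorer at a few inputs and fits a line to its outputs by ordinary least squares. Everything in it is hypothetical: `black_box_score` is a made-up stand-in for a trained network, and it is deliberately linear so that the fit recovers it exactly. Real interpretability work is considerably more involved.

```python
from typing import Callable, List, Tuple


def fit_linear_surrogate(model: Callable[[float], float],
                         inputs: List[float]) -> Tuple[float, float]:
    """Least-squares fit of model(x) ~ slope * x + intercept."""
    if len(inputs) < 2:
        raise ValueError("need at least two probe points")
    outputs = [model(x) for x in inputs]
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(outputs) / n
    var_x = sum((x - mean_x) ** 2 for x in inputs)
    if var_x == 0.0:
        raise ValueError("probe points must not all be identical")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def black_box_score(category_fraction: float) -> float:
    """Hypothetical stand-in for a trained network's output."""
    return 0.1 + 0.8 * category_fraction


# Probe the black box at evenly spaced category-match fractions in [0, 1].
probe_points = [i / 10 for i in range(11)]
slope, intercept = fit_linear_surrogate(black_box_score, probe_points)
```

If the fitted line tracks the black box closely on held-out inputs, the lightweight surrogate may be a reasonable replacement; if not, the residuals indicate where the network is doing something the simple model misses.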
Of course, there will be times when deep networks provide the best available solution, and no effective simplifications can be found. Computer vision is a good example of this: ML models used for image processing learn such complicated functions that simpler methods tend not to be able to compete at all. Sometimes the added complexity introduced by DL is actually necessary, but not always. We should keep this in mind when doing machine learning.