The ML Doctor Says: Don’t Build Fancy Models Before You Set a Simple Baseline
Editor’s note: In this new series, The ML Doctor diagnoses common problems that affect ML development, prescribes a treatment, and shares stories of recovery from former sufferers.
So you’ve taken a few online courses in machine learning and landed a data scientist role in your first industry job. Or maybe you’re an ML engineer with a few years of industry experience. Either way, there’s a chance that you’re itching to apply the new and shiny ML techniques you’ve been reading about in the news to take on big challenges: predicting the 3D structure of proteins, creating open dialogue chatbots, and generating beautiful images from text.
The trickier and more complex the problem, the more alluring sophisticated model architectures become.
However, sometimes, the impulse to train fancy models that you’ve seen in the headlines can get ahead of the basic foundational work required to ship models that actually make a difference to the business. And if you’ve neglected that foundation, the whole effort is at risk of collapsing under its own weight.
Let’s examine this syndrome and talk about treatments.
You're a data scientist at an edtech company, “Wedemy”¹ (“we-demy”), and you've been tasked with creating an ML system that recommends new courses to students based on their past interactions with the website, along with static information about them like user-provided topical interests.
- It's been 3 months since the project kickoff, and you're still iterating on the SOTA deep-learning architecture that you got really excited about in your literature review.
- You don’t have an end-to-end software system that plugs into the company's main website to fulfill the product requirements.
- The main metric you have in mind is a model metric (e.g. loss, precision, recall), and you haven’t defined a solid business KPI that your model is supposed to improve (e.g. course sign-ups).
FOMOSOTA-itis (fear of missing out on state of the art)
- Learn about product thinking and user-centered design.
- Read "Machine Learning: The High Interest Credit Card of Technical Debt".
- Understand your data deeply by exploring it, analyzing it, and talking to your data engineers.
- Implement simple heuristics based on historical data or use industry-standard techniques (e.g. collaborative filtering).
- After you’ve established a baseline, explore fancier models to improve performance.
- Timebox your model experimentation so that the exploration of higher-performance models doesn’t go on indefinitely.
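To make the "simple heuristics" step concrete, here is a minimal sketch of a popularity baseline: recommend the most-enrolled courses that a student has not taken yet. The function name, the `(user_id, course_id)` log format, and the sample data are all hypothetical, invented for illustration; a real baseline would read from your own interaction tables.

```python
from collections import Counter

def popularity_baseline(interactions, user_id, k=5):
    """Recommend the k most-enrolled courses the user has not taken yet."""
    # Count enrollments per course across all users.
    counts = Counter(course for _, course in interactions)
    # Courses this user has already interacted with.
    seen = {course for user, course in interactions if user == user_id}
    # Sort by descending popularity; break ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return [c for c in ranked if c not in seen][:k]

# Hypothetical (user_id, course_id) enrollment log, invented for illustration.
interactions = [
    ("u1", "python-101"), ("u2", "python-101"), ("u3", "python-101"),
    ("u1", "sql-basics"), ("u2", "sql-basics"),
    ("u3", "stats-intro"),
]

print(popularity_baseline(interactions, "u3", k=2))
```

A heuristic like this takes minutes to write, has no training loop to debug, and gives every later model a concrete number to beat.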
At the beginning of the project, the modeling team at Wedemy, made up of two data scientists (Bob and Alice), laid out what they needed to do: create a recommendation system that analyzes a student’s interaction history on the website and recommends new courses that might interest them. Their business objective was to increase course sign-ups.
Bob: Hey, so last night I came across this paper called “Transformer Embeddings Are All You Need for Effective Recommenders.” I think we can use it to get state-of-the-art results on our course recommendation project!
Alice: Yeah — I skimmed through it, and it looks really promising! I guess we could try it out … Does the paper come with code?
Bob: No, but there are a few GitHub repos from various folks trying to reproduce the work. Perhaps we can start off there?
Alice: Maybe … Although I think we should look at our own data first and get a sense of its quirks. The data engineering team mentioned that there are some gotchas with it and we’ll need to filter out some of the interaction data at specific time windows.
Bob: Cool, do you want to get started on that? I can try prototyping a model using one of those third-party repos.
Alice: … I could really use your help understanding our data better — but sure, that sounds good!
A week went by. While Bob was still debugging his PyTorch model on toy data, Alice completed a knowledge transfer with the data engineering team and built a clean dataset. On top of that, she took an off-the-shelf implementation of the Netflix collaborative filtering algorithm and managed to get decent results on the test set for their metric of choice, MAP@5 (mean average precision for the top 5 recommendations).
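For readers unfamiliar with MAP@5, the sketch below shows one common way to compute it. Definitions vary slightly between libraries (especially in how the score is normalized), so treat this as an illustrative assumption rather than the exact formula Alice used; the function names and sample data are hypothetical.

```python
def average_precision_at_k(recommended, relevant, k=5):
    """AP@k for one user: mean of precision values at each hit in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, item in enumerate(recommended[:k], start=1):
        # Count each relevant item once, even if it is recommended twice.
        if item in relevant and item not in seen:
            hits += 1
            score += hits / rank  # precision at this rank
        seen.add(item)
    # Normalize by the best achievable number of hits in k slots.
    return score / min(k, len(relevant))

def map_at_k(all_recommended, all_relevant, k=5):
    """MAP@k: average of AP@k over all users."""
    scores = [
        average_precision_at_k(recs, rel, k)
        for recs, rel in zip(all_recommended, all_relevant)
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical top-5 recommendations and actual sign-ups for two students.
recs = [["a", "b", "c", "d", "e"], ["x", "y", "z", "w", "v"]]
signups = [{"a", "c"}, {"y"}]

print(round(map_at_k(recs, signups, k=5), 4))
```

Because the metric rewards relevant courses that appear near the top of the list, it lines up reasonably well with a product surface that only shows a handful of recommendations.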
Alice: Hey, how’s the model-training going?
Bob: I finally have something working!
Alice: How do you know?
Bob: The loss is going down. 🙃
Alice: That’s a good sign. … I’ve already established a baseline using SVD, which got 73% MAP@5 on the test set, so that’s the score to beat!
At the tail end of the model development phase of the project, Bob was able to achieve a MAP@5 score of 77%, which was 4 percentage points above the baseline. Even with the higher performance of the transformer, Alice and Bob decided to use the baseline model because they wanted to simplify the deployment and model monitoring process. They figured that a 4-percentage-point improvement wasn’t quite worth the added complexity of dealing with a 1+ million-parameter transformer model, at least with their current MLOps infrastructure. Even so, they ended up increasing the sign-up rate for recommended courses by 27%.
¹ A company in an alternate universe where WeWork expanded into education.