K#7. Why Tech Data Scientists don't do Machine Learning
Lessons from a one week ML sprint at Spotify
Heyy 👋
Welcome back to K’s DataLadder ✨! Each week, I share a story from my life as a Tech Data Scientist to help you level up your career.
We’ve grown to a wonderful community of 906 curious and driven data lovers. Thank you for being part of this journey ❤️
This week’s edition is more special than usual because this week was Hack Week at Spotify! And for the first time in a year, I jumped into an ML project. Things didn’t go as expected though, and I want to tell you what I learned from this experience.
Reading Time: 5 minutes
Agenda
This week’s story
ML lessons learned from Hack Week
Why you should care
But first, if you haven’t done that already, you can:
Subscribe to my YouTube channel. New video coming up very soon!
🎉 Big celebration 🎉
I launched my YouTube channel 4 weeks ago and this week it hit the 1000 subscribers mark!! I couldn’t be more grateful, so thank you all for supporting me and making the +40h I spent doing weirdass editing worth the time-investment 🫶
This Week’s Story
Spotify’s Hack Week is a time we look forward to every year.
It’s a unique opportunity where we're all free to explore ideas and engage in any creative project to improve Spotify. Many of the cool features you see today, like the Discover Weekly playlist, were born from these personal initiatives.
Hack Week helps us connect more deeply with our love for the product by giving us the space to contribute with our own creativity and passion.
In the past, I’ve used this week to learn new skills rather than explore personal ideas. For instance, last year I used the opportunity to study A/B testing because I’d never found the time to explore the topic thoroughly before.
This year, I felt my machine learning (ML) skills – which I bled myself to learn – were getting… rusty. So I decided to reconnect with the "ghost of ML past" by working on an ML idea related to my current project on music videos at Spotify.
I mean what better way to kill two birds with one stone, right?
Right… I jumped into the project on Monday, but by Friday, I still hadn't developed a single ML model. What could possibly have gone wrong?
ML lessons learned from Hack Week
Lesson #1: ML projects take time
ML projects take weeks, if not months, to complete. They also often take a whole team rather than one person.
So one week is certainly never enough. I don’t know how I thought I could do this? I mean my project idea wasn’t super ambitious, so why wasn’t it enough?
Well, it’s because the most time-consuming part of any ML project is not modelling but preparing the ground for it. It’s collecting the data.
No matter how optimised it is, any ML model will spout nonsense if the data it learns from is shit (noisy, small, not enough features, etc), but I already knew that.
That’s why this week, I spent five days trying to assemble a small sample of 10% from huge datasets of hundreds of terabytes, running processes for hours, only to realise I needed to start with a much smaller sample of 1% to test.
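Here is a minimal sketch of the "start with 1%" idea, not the actual pipeline used during Hack Week. It assumes each row has some stable ID (the `user_id` values and the `in_sample` helper are made up for illustration) and uses a hash of that ID to decide whether a row is in the sample. The nice side effect is that the 1% sample is always a subset of the 10% sample, so whatever works on the small one carries over when you scale up.

```python
import hashlib


def in_sample(user_id: str, percent: float) -> bool:
    """Return True if user_id falls in a deterministic `percent`% sample.

    Hypothetical helper: hashes the ID into one of 10,000 buckets and
    keeps the lowest `percent`% of them.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000  # bucket in 0..9999
    return bucket < percent * 100


# Made-up IDs standing in for a much larger table.
user_ids = [f"user_{i}" for i in range(100_000)]

sample_1 = [u for u in user_ids if in_sample(u, 1)]    # test queries on this first
sample_10 = [u for u in user_ids if in_sample(u, 10)]  # scale up once they work

print(len(sample_1), len(sample_10))
```

Because the sample depends only on the ID, re-running the job gives the same rows every time, which makes debugging much less painful than a fresh random sample on each run.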
Faulty joins can also mess up the system. I spent a long time troubleshooting why I was getting the output “no data available”, only to realise I had forgotten a feature in one of my table joins.
So I restructured my query, breaking the joins down into separate CTEs to track down where things broke in the first place. Remember: if the data collection and engineering part is flawed, everything else will be too, so this step is the most crucial of all!
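To make the CTE trick concrete, here is a small sketch using Python's built-in `sqlite3` module. The tables (`tracks`, `videos`, `plays`) and their columns are invented for illustration and have nothing to do with Spotify's real schemas. Each join lives in its own CTE, and the final query counts the rows that survive each stage, so you can see exactly where the data disappears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tracks (track_id INTEGER, market TEXT);
    CREATE TABLE videos (track_id INTEGER, market TEXT, video_id INTEGER);
    CREATE TABLE plays  (video_id INTEGER, plays INTEGER);
    INSERT INTO tracks VALUES (1, 'SE'), (2, 'US'), (3, 'FR');
    INSERT INTO videos VALUES (1, 'SE', 10), (2, 'US', 20);
    -- Deliberately broken data: no video_id here matches the videos table.
    INSERT INTO plays  VALUES (99, 500);
""")

query = """
WITH base AS (
    SELECT track_id, market FROM tracks
),
with_videos AS (
    SELECT b.track_id, b.market, v.video_id
    FROM base b
    JOIN videos v ON v.track_id = b.track_id AND v.market = b.market
),
with_plays AS (
    SELECT w.track_id, w.video_id, p.plays
    FROM with_videos w
    JOIN plays p ON p.video_id = w.video_id
)
SELECT 'base' AS stage, COUNT(*) AS n FROM base
UNION ALL SELECT 'with_videos', COUNT(*) FROM with_videos
UNION ALL SELECT 'with_plays', COUNT(*) FROM with_plays
"""

# Map each stage name to the number of rows that survived it.
stage_counts = dict(conn.execute(query).fetchall())

for stage in ("base", "with_videos", "with_plays"):
    print(stage, stage_counts[stage])
```

In this toy data, 3 rows go in, 2 survive the first join and 0 survive the second, so the second join is the one to inspect. With a single giant query, all you would see is the empty result at the end.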
Lesson #2: ML is risky & unpredictable
It’s a sad but true reality and it’s because in tech companies, we need results fast.
The tech scene is competitive and moves extremely quickly, so we can't afford to spend months developing models only to end up with something subpar.
Most often, it's better to rely on data analysis and statistical methods such as causal inference, correlation, and visualising distributions because they guarantee results.
ML is uncertain & time-expensive, and I got a clear reminder of it this week.
I spent 5 days gathering data and could have easily spent another two weeks optimising this process. Then comes building and optimising the model, again with zero guarantee that the end result will be usable or conclusive.
ML is exciting and attractive, but just like having a crush on the bad boy, it’s also unpredictable and risky so it’s often not worth the ride.
In many tech companies, ML is not always the best option for time-sensitive projects that require quick insights, which is the case for most projects. Statistical analysis is often enough because it provides sure & safe results.
Why you should care
I meet a lot of aspiring data scientists who are so hyped to become data scientists because they think they’ll be doing ML all day.
I’m sorry to break it to you, but Machine Learning ≠ Data Science.
In the best tech companies, ML-focused roles are most often, if not always, research-focused, as opposed to data science roles, which are more business-driven.
They are inherently different.
What this means is that opportunities for ML roles are scarcer: they fall in the realm of pure research, which is uncertain, so companies are more cautious about investing in it.
Data Scientists vs. Machine Learning Engineers
It doesn’t mean your ML skills are useless.
Having ML skills is certainly valuable today more than ever, especially in environments where you can afford to use them.
Just be aware that opportunities to play in such environments are limited.
I actually met with a staff Machine Learning Engineer (MLE) from Spotify this week to learn more about his role since I’m considering transitioning to it in a year or so. The conversation clarified to me why data science has split into Data Scientists and MLEs.
Data scientists focus on exploration that yields guaranteed results now
MLEs venture into the unknown, not always sure what they’ll find or even if they’ll find anything
It’s a costly adventure, which is why there are far fewer roles for MLEs. For example, there are almost twice as many Data Scientists at Spotify as MLEs. Tech companies prefer investing more heavily in certainty.
So if you’re planning to venture into a data science role, it’s better to be aware of these contextual differences when applying for roles.
If you prefer spending time doing cool ML wrangling, even with the possibility of finding nothing, then the MLE route is the one to go for.
But if you prefer to investigate with the certainty you’ll find a light at the end of the tunnel, then data science might be more your speed.
You don’t have to choose
Both routes are great for your career, and it doesn’t mean you can’t do both at different times.
Like I said, I’m considering switching in the near future to learn different skills.
The MLE told me I could be a great fit for the role because most of the time MLEs focus on building the model and optimising it, and forget the purpose behind the model.
This is why it’s important to cultivate a product-driven mindset, which is best acquired as a data scientist, since our job focuses a lot on being business-driven.
In smaller companies, the borders are more blurred. Not all companies can afford to hire two types of data science experts, so a lot of the time one person does it all.
Most of the time, data scientists don’t do ML because:
ML is over-hyped: it’s not the core of the data science job, statistics is
ML projects take time, from the data collection step to the model deployment
ML does not guarantee results: it’s risky and unpredictable, so companies are cautious about how they invest in ML, and even more cautious about who gets to contribute to that investment.
Please keep that in mind to adjust your expectations of the job.
I hope you liked this edition! See you next week for more data stories 🫶
If you’re enjoying these insights, don’t forget to subscribe and follow along on YouTube, Instagram & LinkedIn for more updates and stories.
Thanks K for this article! I'm really glad it came out while I was spending my Sunday watching videos to understand what an MLE actually does, haha. It would be fantastic if you could share more articles like this, explaining various data job families you worked in/with, what they do, and their skill set. Data is a fascinating field, but there are clearly many roles out there
Very interesting. At some point, I was stuck deciding whether to invest more in the data science side of things or the MLE side when preparing for data roles, only to discover that MLEs are fundamentally software engineers. Boris Meinardus made it very clear when I talked to him. So before diving into such a role, one needs to love coding more than delivering insights. Add to that what you just said about ML not always delivering immediate results compared to data science. This clears things up a little bit more for me