Debunking the myth of the data science bubble
We’ve all read articles predicting the looming decline of data science. Some coined the term ‘data science bubble’; some even went so far as to set a date for the ‘death of data science’ (they give it five years before the bubble implodes). It reached the point where anyone working in the field had to start paying attention to these signals. I investigated the arguments behind this ‘imminent death’ diagnosis, detected some biases, and drafted an early answer on LinkedIn. The Zalando communication team picked up on it and, following their encouragement, I prepared a revised version for the Zalando Blog.

This post doesn’t aim to make bold predictions about the future without proper evidence; I have always found those relatively pointless. It simply aims to point out that, for all the noise, there is no solid reason to believe any of us should worry about our jobs in the years to come. In fact, the very arguments used to prognosticate a ‘data science bubble’ can be turned around into reasons not to worry.
The arguments used by proponents of the data science bubble are generally of three sorts:
1- Increased commoditization
2- Data scientists should not become software engineers
3- Full automation
Increased commoditization

It is clear that data science work is becoming increasingly commoditized: almost every ML framework now ships with libraries of off-the-shelf models that are pre-architected, pre-trained and pre-tuned. Want to do image classification? Download a pre-trained ResNet for your favorite deep-learning framework and you are almost ready to go. The net effect is that a single well-rounded data scientist can now solve in a week what a full team couldn’t solve in six months ten years ago.
Does that mean less demand for data scientists? Certainly not. It only means that investing in data science is now viable in many domains for which it was previously too expensive or too complex. Hence the rising demand for data science and data scientists. Software engineering is a useful comparison here. Over the years, most of the complexity around programming has been abstracted away and commoditized: few could build anything in assembly, C made it much easier to develop complex projects, Java commoditized memory management, and so on. Did that make the demand for software engineers vanish? On the contrary, it increased their productivity and hence their net value to any organisation.
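To make the commoditization point concrete, here is a minimal sketch. The library and dataset are stand-ins chosen for illustration (the post doesn’t prescribe a stack): scikit-learn’s pre-tuned defaults on its bundled digits dataset yield a competitive classifier in a handful of lines — the kind of task that once required a dedicated team.

```python
# Sketch: commoditized ML in a handful of lines.
# scikit-learn and the digits dataset are illustrative choices,
# not the post's prescription.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An off-the-shelf model with pre-tuned defaults is already competitive.
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

That is the whole point: none of these lines required expertise that was rare ten years ago, because the expertise has been packaged into the defaults.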
Data scientists should not become software engineers
I strongly disagree with this assessment: one wouldn’t believe the number of data science projects that end up as a PowerPoint presentation with pretty graphs, followed by an ignominious death. Why? Because data scientists often lack the skills to make their projects deliver continuous value in a well-maintained and monitored production environment. 95% of the data science projects I see do not make it past the POC stage. Going beyond the POC requires a software engineering mindset.
It is still rare to find data scientists actually capable of (1) putting a model into a production environment and then (2) guaranteeing that machine-learning-based value is continuously delivered, monitored and maintained in the long run. Sadly, that is precisely where the ROI of any data science investment lies. I am not sure pushing data scientists towards management would help: chronic over-powerpointing and the urge for serial POCs that never make it beyond the MVP stage are very much management-induced sicknesses. I am not saying data scientists should become software engineers, but if anything, data scientists need stronger engineering and software architecture skills, not weaker ones.
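What that monitoring might look like in its simplest form can be sketched with nothing but the standard library. All names and thresholds below are hypothetical, not a reference to any particular serving stack: a rolling window of live prediction scores is compared against a training-time baseline, so a drifting model gets flagged instead of silently rotting.

```python
# Hypothetical sketch of production model monitoring: compare the live
# mean prediction score against a training-time baseline and flag drift.
# Class name, window size and tolerance are illustrative only.
from collections import deque

class DriftMonitor:
    def __init__(self, baseline_mean, window=1000, tolerance=0.1):
        self.baseline_mean = baseline_mean   # mean score observed at training time
        self.tolerance = tolerance           # max acceptable absolute deviation
        self.scores = deque(maxlen=window)   # rolling window of live scores

    def record(self, score):
        """Record one live prediction score; return True if drift is detected."""
        self.scores.append(score)
        live_mean = sum(self.scores) / len(self.scores)
        return abs(live_mean - self.baseline_mean) > self.tolerance

# Usage: training-time scores averaged 0.70; live traffic then degrades.
monitor = DriftMonitor(baseline_mean=0.70, window=100, tolerance=0.1)
healthy = [monitor.record(0.70) for _ in range(50)]   # no drift
degraded = [monitor.record(0.40) for _ in range(50)]  # drift eventually flagged
print("drift detected:", degraded[-1])
```

A real setup would feed such alerts into the same on-call tooling as any other service — which is exactly the software engineering mindset the paragraph above is arguing for.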
The risk of automation
Full automation is very unlikely because, in many regards, data science is still more an art than a technique. There is a huge gap between the ‘hello MNIST’ TensorFlow example and applying ML to a new domain for which no gold-standard dataset or known model archetype exists. Ever had to use crowdsourcing to gather labels? Ever ventured into the uncharted territories of ML? Ever had to solve a problem for which you couldn’t piggyback on an existing Git repo? Then you know what I am talking about.
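Take the crowdsourced-labels question as one small example of work with no off-the-shelf answer. Even the simplest baseline — majority vote over noisy worker annotations — is something you end up writing yourself; a minimal sketch (item names and labels are hypothetical; real pipelines would also model annotator reliability, e.g. with Dawid–Skene):

```python
# Illustrative sketch: aggregating crowdsourced labels by majority vote.
# Real pipelines typically weigh annotator reliability (e.g. Dawid-Skene);
# this is the simplest possible baseline. All data here is hypothetical.
from collections import Counter

def majority_vote(annotations):
    """Map item -> list of crowd labels to item -> consensus label."""
    return {
        item: Counter(labels).most_common(1)[0][0]
        for item, labels in annotations.items()
    }

annotations = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "bird", "cat"],
}
consensus = majority_vote(annotations)
print(consensus)
```

No framework automates away the judgment calls around it: how many workers per item, what to do with ties, when to trust the consensus at all.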
And here we reach the real discussion: data scientists who cannot go beyond the TensorFlow MNIST CNN example, the ResNet boilerplate or the vanilla word2vec + LSTM archetype are indeed going to become extinct, in the same way that no programmer can make a living from the ‘Hello World’ they wrote in their first year of college. But those who know how to go beyond that and make ML actually work in a continuous-delivery environment have a bright future ahead of them, and there are good reasons to think it will last much longer than the next five years.