Continuous Delivery Lessons for Machine Learning Projects

Continuous Delivery (CD) is the practice of getting changes in a codebase into the hands of our users in a frequent and predictable way. Where CD is implemented, developers push their code as frequently as possible and software is built, tested, and released with great speed and frequency.

The benefits of implementing a CD system in your development infrastructure are well demonstrated. I’ll just highlight the ones that, to me, are the most important:

  • Lower risk. If we deploy changes frequently, each deployed change is small, so there is less code to audit when a problem appears. There’s an extra benefit too: smaller changesets are easier to review, which results in higher quality.
  • Faster feedback. The sooner the customer starts using the new feature, the sooner you will hear what is working, what doesn’t work, and what improvements are needed.
  • Faster Time to Market. The faster a feature reaches the customer, the better. Having a feature ready for production, but not deploying it, is wasteful.

Traditionally, Continuous Delivery has been thought of as a practice for deploying application code. Nevertheless, it should be applied to every aspect of software delivery: Application Code, Configuration Management and, of course, Data Pipelines and Machine Learning.

I’ve always seen CD practices as an Agile enabler, in the sense that they give us the ability to:

  • Iterate faster
  • See changes quickly
  • Make informed decisions fast

Now let’s see what this means in the Machine Learning world.

A real-world ML project

Let’s put ourselves in the shoes of an engineer on the monetisation team at a casual game company. The team wants to maximise the revenue from a new promotion the company is launching. To do that, they want to produce a finely tailored campaign, where each player receives a personalised promotion with different products and different text messages. They think they can create an ML model that picks the set of products and the message that maximise the revenue from each player.

This is a classic revenue prediction problem, and considering the amount of data the company has about the players, they’re confident they’ll succeed.

So now, let’s execute the project. This is roughly the process everyone has in mind when thinking about a project of this kind:

  1. Generate training data. No one has received a promotion, so we don’t have any data to build the model. We’ll start by showing random promotions for a few days/weeks.
  2. Data preparation to obtain the features of our model.
  3. Train several models, until we’re happy with the results.
  4. Deploy the model
  5. Retrain with new data and deploy new model
  6. Profit!

The plan is sound. Our engineer starts executing it:

  1. Training data is generated by showing different promotions randomly to different users. After a week, we have some data to start training models.
  2. Data preparation. User data is extracted using SQL from different databases. An ETL job is also written to extract aggregated player data from the event logs.
  3. The first models are trained, but the results are inconclusive. It seems we don’t have enough data, so we wait a couple of weeks.
  4. Model training happens again. Finally, some results are conclusive: we have a model to deploy!
  5. We contact IT to deploy our model. After a week and some back and forth, the model is finally running!

It’s taken longer than expected, but we have the model ready to start serving personalised promotions. We’re ready to include it in our app!

However, problems start to arise. Our engineer realises that, in order to query the model, we need to supply the same features used during training. Some of those features come from existing DBs, so we can use the same SQL queries to retrieve them for the player. But what about the aggregated player data extracted by ETL jobs? Model queries should be blazing fast, and we can’t run jobs that take hours for every query.

Fortunately, there’s a solution for that: write new ETL jobs that store aggregated data in a Data Warehouse. After a few days of work, there’s an ETL job running periodically storing aggregated data that can be queried using SQL. We’re back in the game (no pun intended)!

The app is updated and the model starts serving predictions. The team analyses the performance of the campaign after a few days. Tragically, the model is not performing as expected: the results are poorer than those seen during training. The engineer spots the problem after some investigation: the features for aggregated player data are too old to produce a good prediction. Since ETL jobs run every 24 hours, the features used to query the model are stale. The team manages to reach a compromise with IT: jobs will run every 6 hours. Not ideal, but the quality of predictions starts to improve.
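The staleness problem above is the kind of thing a small guard at query time can surface early. Here is a minimal sketch, assuming each aggregated feature row carries the timestamp at which the ETL job computed it. The function name and the idea of a `max_age` threshold are assumptions made for illustration, not part of the team's actual system.

```python
from datetime import datetime, timedelta, timezone


def is_stale(computed_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when a feature row is older than the age we tolerate."""
    return now - computed_at > max_age


# Illustrative usage: a row computed 25 hours ago against a 24-hour limit.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
row_computed_at = now - timedelta(hours=25)
if is_stale(row_computed_at, now, max_age=timedelta(hours=24)):
    print("aggregated features are stale; prediction quality may suffer")
```

Counting how often this check fires would have told the team that a 24-hour ETL schedule was a problem before the campaign numbers did.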

In the meantime, something bad happens. The model is showing high latency: some queries take 30 seconds to return a prediction. Apparently the problem lies in the Data Warehouse queries, which are underperforming. This is more problematic, and may even require working closely with IT if the capacity of the Data Warehouse has to be increased.

After a lot of effort, and several weeks later than expected, the model is up and running. Revenue numbers go up and everyone on the team smiles, relieved.

The problems I’ve described above are not uncommon in ML projects; quite the opposite. And all too often the final outcome is not a smile. The good news is that we can learn from experience, and the team will likely be able to foresee these problems before jumping into a new project.

Lessons from Continuous Delivery

Which lessons learned from CD can we apply to ML projects of this kind?

Let’s say something obvious: an ok-ish model running in production is more valuable than the most accurate model sitting on a shelf. CD practices say that the best time to first deploy your software is with your first commit. We should start with that. Build a simple model. Why not start with linear regression or, even simpler, a rule-based system? Deploy it and make sure your app can use it. This is key: under continuous delivery, we decouple deployment and release. We’re not releasing our model to the wild, because it’s not ready yet.
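To make the "deploy first, release later" idea concrete, here is a minimal sketch. It assumes a rule-based baseline and a boolean release switch. Every name in it (`Player`, `baseline_model`, `pick_promotion`, the promotion identifiers and the purchase threshold) is hypothetical and only there for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Player:
    player_id: str
    purchases_last_30d: int


# Hypothetical promotion identifiers, for illustration only.
DEFAULT_PROMOTION = "starter_pack"
BASELINE_PROMOTION = "premium_pack"


def baseline_model(player: Player) -> str:
    """Rule-based stand-in for the future ML model (threshold is arbitrary)."""
    if player.purchases_last_30d >= 3:
        return BASELINE_PROMOTION
    return DEFAULT_PROMOTION


def pick_promotion(
    player: Player, model_released: bool, shadow_log: List[Tuple[str, str]]
) -> str:
    """Always call the deployed model, but only act on it once released."""
    prediction = baseline_model(player)
    # Record what the model would have done, so we can evaluate it offline.
    shadow_log.append((player.player_id, prediction))
    return prediction if model_released else DEFAULT_PROMOTION
```

With `model_released` set to `False`, the app exercises the whole path (feature lookup, model call, logging) while players keep seeing the default promotion. Flipping the switch is the release.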

Once we have our first model deployed, we can spot initial problems: is feature lookup fast enough? Does the model perform as predicted? Those are questions we can start answering very quickly. And the less data we include in the feature set, the easier it is to address the problem.
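The first of those questions can be answered with very little code. Here is a rough sketch, assuming the feature lookup is a plain function from a player id to a dictionary of features. The helper names and the latency budget are assumptions for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


def timed_lookup(
    lookup: Callable[[str], Dict[str, float]], player_id: str
) -> Tuple[Dict[str, float], float]:
    """Run a feature lookup and return its result with the elapsed milliseconds."""
    start = time.perf_counter()
    features = lookup(player_id)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return features, elapsed_ms


def share_over_budget(latencies_ms: List[float], budget_ms: float) -> float:
    """Fraction of lookups that took longer than the latency budget."""
    if not latencies_ms:
        return 0.0
    slow = sum(1 for latency in latencies_ms if latency > budget_ms)
    return slow / len(latencies_ms)
```

Collecting these timings from the first deployment onwards would make a slow feature source visible as soon as it is added, rather than after launch.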

Progressively, we’ll start adding features, testing new models and retraining existing ones. We can start with relational databases, and only when the accuracy of the model is not good enough do we start adding data from more complex sources like data warehouses and data lakes.

The Continuous Delivery Pipeline should also take care of re-training and deploying models periodically. Again, the recommendation is to set up this infrastructure at the very beginning of the project.
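One way such a pipeline step might decide whether a retrained model replaces the current one is a simple gate on held-out data. The sketch below assumes a model is just a function from a feature value to a predicted revenue, and uses mean absolute error as the metric. Both are simplifications chosen for illustration, and the function names are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[float], float]
Example = Tuple[float, float]  # (feature value, observed revenue)


def mean_absolute_error(model: Model, examples: Sequence[Example]) -> float:
    """Average absolute gap between predictions and observed values."""
    if not examples:
        raise ValueError("need at least one example to evaluate a model")
    return sum(abs(model(x) - y) for x, y in examples) / len(examples)


def promote_if_better(
    current: Model,
    candidate: Model,
    holdout: Sequence[Example],
    min_improvement: float = 0.0,
) -> Model:
    """Return the candidate only if it beats the current model on the holdout."""
    current_error = mean_absolute_error(current, holdout)
    candidate_error = mean_absolute_error(candidate, holdout)
    if candidate_error < current_error - min_improvement:
        return candidate
    return current
```

Running a gate like this on every scheduled retrain keeps deployment automatic while avoiding the silent replacement of a good model with a worse one.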

The same goes for monitoring and metric gathering. Evaluating the performance of the model will drive the evolution of the project, so this should be ready from day one.
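As an illustration of what "ready from day one" could look like, here is a minimal sketch of a monitor that compares the live prediction error against the error measured offline. The class name, the rolling window, and the tolerance are all assumptions. A real setup would more likely push these numbers to whatever metrics system the team already runs.

```python
from collections import deque


class ErrorMonitor:
    """Tracks recent absolute prediction errors over a rolling window."""

    def __init__(self, offline_error: float, tolerance: float, window: int = 100):
        self.offline_error = offline_error  # error measured during training
        self.tolerance = tolerance          # how much degradation we accept
        self.errors = deque(maxlen=window)  # oldest entries drop automatically

    def record(self, predicted: float, actual: float) -> None:
        self.errors.append(abs(predicted - actual))

    def live_error(self) -> float:
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def degraded(self) -> bool:
        """True when live error exceeds the offline error plus the tolerance."""
        return bool(self.errors) and (
            self.live_error() > self.offline_error + self.tolerance
        )
```

A check like this would have flagged the gap between training results and production results in the story above within days, without a manual investigation.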

Only by starting small and applying the principles behind Continuous Delivery will we be able to address ML project complexity in a truly Agile fashion.