Machine Learning

Many real-life Machine Learning (ML) use cases involve a multi-tenant architecture and require training a separate model for every user.

Let’s take ML-driven sales forecasting. If you manage a retail store chain, you might want to use purchase history data to predict revenue on a per-shop basis. The same goes for tax calculations, where it is illegal to combine the data obtained from different clients to train a single ML model; instead, we have to make predictions on a case-by-case basis.

A common way to deploy Machine Learning models is to write a Flask service with a /predict endpoint and wrap it into a Docker container. There are a lot of examples of single-model ML servers; platforms like MLflow help deploy a single-model server in just one line of code. When it comes to creating ML servers supporting multiple models though, developers don’t have that many tools in their toolbox.

In multi-tenant applications, the number of tenants is not known in advance and is essentially unlimited. On day one, you may have just a single client; within a year, you might be serving a separate model per user to thousands of users. But here’s where the limitations of the traditional deployment approach begin to show up:

  • If we spin up a Docker container for every tenant, we’ll get a bulky app that is also expensive to manage
  • A single container that has all the models inside its image does not work for us either: as we mentioned earlier, there could be thousands of models running on a server with new models being added at runtime

Then how should we approach this problem?


This solution assumes that model training is handled separately from model serving. For instance, an Airflow job trains the model and saves it to S3, so the ML server’s only responsibility is serving predictions.

A trained ML model is just a file on the disk, so we need to store the file and a mapping: user id -> model id.

Solution components

To keep the server agnostic of the model storage implementation and the underlying ML framework, several abstractions are used:

  • Model: an abstract model that exposes a predict() API; its implementation may be SklearnModel, TensorFlowModel, MyCustomModel, etc.
  • ModelInfoRepository: an abstract repository that provides user_id -> model_id mappings. For instance, it can be implemented as SQLAlchemyModelInfoRepository.
  • ModelRepository: an abstract repository that returns a model by its ID. It can be FileSystemRepository, S3Repository, or any other repository implementation.
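
The three abstractions can be sketched as Python abstract base classes. This is a minimal illustration rather than the original implementation: the method names get_model_id() and get_model(), the in-memory repositories, ConstantModel, and predict_for_user() are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Sequence


class Model(ABC):
    """Framework-agnostic model interface."""

    @abstractmethod
    def predict(self, features: Sequence[Any]) -> List[Any]:
        """Return one prediction per input row."""


class ModelInfoRepository(ABC):
    """Knows which model belongs to which user."""

    @abstractmethod
    def get_model_id(self, user_id: str) -> str:
        """Return the ID of the model that belongs to the given user."""


class ModelRepository(ABC):
    """Knows how to load a model from some storage backend."""

    @abstractmethod
    def get_model(self, model_id: str) -> Model:
        """Load and return the model stored under the given ID."""


# --- Hypothetical stand-ins, used only to exercise the interfaces ---

class ConstantModel(Model):
    """Toy model that predicts the same value for every row."""

    def __init__(self, value: float) -> None:
        self._value = value

    def predict(self, features: Sequence[Any]) -> List[Any]:
        return [self._value for _ in features]


class InMemoryModelInfoRepository(ModelInfoRepository):
    def __init__(self, mapping: Dict[str, str]) -> None:
        self._mapping = dict(mapping)

    def get_model_id(self, user_id: str) -> str:
        return self._mapping[user_id]  # raises KeyError for unknown users


class InMemoryModelRepository(ModelRepository):
    def __init__(self, models: Dict[str, Model]) -> None:
        self._models = dict(models)

    def get_model(self, model_id: str) -> Model:
        return self._models[model_id]  # raises KeyError for unknown models


def predict_for_user(
    user_id: str,
    features: Sequence[Any],
    info_repo: ModelInfoRepository,
    model_repo: ModelRepository,
) -> List[Any]:
    """Resolve user -> model ID -> model, then delegate to the model."""
    model_id = info_repo.get_model_id(user_id)
    model = model_repo.get_model(model_id)
    return model.predict(features)
```

Because predict_for_user() depends only on the abstract interfaces, a file-system or S3-backed repository can replace the in-memory one without touching the calling code.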


Now let’s assume that we have trained a sklearn model and stored it in Amazon S3, with the user_id -> model_id mappings defined in a Postgres database.

This makes the server implementation extremely simple:
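
A minimal sketch of such a server, under several assumptions: the request body is JSON with user_id and features fields, the repositories expose get_model_id() and get_model(), and predict() returns a plain list. The Dict* repositories and MeanModel are hypothetical stand-ins for the Postgres, S3, and sklearn implementations.

```python
from typing import Any, Dict, List, Protocol

from flask import Flask, jsonify, request


class Model(Protocol):
    def predict(self, features: List[Any]) -> List[Any]: ...


class ModelInfoRepository(Protocol):
    def get_model_id(self, user_id: str) -> str: ...


class ModelRepository(Protocol):
    def get_model(self, model_id: str) -> Model: ...


def create_app(info_repo: ModelInfoRepository, model_repo: ModelRepository) -> Flask:
    """Build a Flask app that knows nothing about frameworks or storage."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}
        user_id = payload.get("user_id")
        features = payload.get("features")
        if user_id is None or features is None:
            return jsonify(error="user_id and features are required"), 400
        try:
            model_id = info_repo.get_model_id(user_id)
        except KeyError:
            return jsonify(error="no model registered for this user"), 404
        model = model_repo.get_model(model_id)
        # Assumes predict() returns JSON-serializable values; a sklearn
        # wrapper would need to convert NumPy arrays to lists first.
        return jsonify(predictions=model.predict(features))

    return app


# --- Hypothetical stand-ins for the Postgres, S3, and sklearn pieces ---

class DictModelInfoRepository:
    def __init__(self, mapping: Dict[str, str]) -> None:
        self._mapping = mapping

    def get_model_id(self, user_id: str) -> str:
        return self._mapping[user_id]


class DictModelRepository:
    def __init__(self, models: Dict[str, Model]) -> None:
        self._models = models

    def get_model(self, model_id: str) -> Model:
        return self._models[model_id]


class MeanModel:
    """Toy model: predicts the mean of each feature row."""

    def predict(self, features: List[Any]) -> List[Any]:
        return [sum(row) / len(row) for row in features]


app = create_app(
    DictModelInfoRepository({"shop-1": "model-a"}),
    DictModelRepository({"model-a": MeanModel()}),
)
```

The repositories are passed into create_app() instead of being imported inside the handler, which is what keeps the endpoint unaware of where models and mappings actually live.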

Note that thanks to these abstractions, the Flask server is independent of any particular model and storage implementation; we can replace sklearn with TensorFlow and S3 with a local folder without changing a single line of the Flask server code.

A note about caching

Since some models are queried more often than others, loading them from storage on every request can be costly. To solve this problem, we can use caching. Here is how it can be implemented by composing an existing model repository with the cachetools caching library:
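
A minimal sketch of this composition, assuming the wrapped repository exposes a get_model(model_id) method (an assumed name). CachedModelRepository and its constructor parameters are illustrative; LRUCache comes from cachetools. Since cachetools caches are not thread-safe on their own, the sketch guards access with a lock.

```python
import threading
from typing import Any, MutableMapping, Optional

from cachetools import LRUCache


class CachedModelRepository:
    """Wraps any repository with get_model() and keeps recently used models in memory."""

    def __init__(self, inner: Any, cache: Optional[MutableMapping] = None) -> None:
        self._inner = inner
        # Compare against None explicitly: an empty cache is falsy.
        self._cache = cache if cache is not None else LRUCache(maxsize=100)
        self._lock = threading.Lock()

    def get_model(self, model_id: str) -> Any:
        with self._lock:
            try:
                return self._cache[model_id]  # cache hit
            except KeyError:
                pass  # cache miss: fall through to the slow path
        # Load outside the lock so one slow download does not block other
        # tenants; two concurrent misses may both load the same model.
        model = self._inner.get_model(model_id)
        with self._lock:
            self._cache[model_id] = model
        return model
```

The cache is injected rather than hard-coded, so an LRU policy can be swapped for a time-based one (for example, TTLCache from the same library) when retrained models need to be picked up after a while.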

Usage example:
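
A brief usage sketch under stated assumptions: SlowModelRepository and with_cache() are hypothetical names, and the slow repository stands in for a remote one such as S3Repository. To keep the snippet runnable on its own, caching is applied to get_model() with the cached decorator from cachetools.

```python
import threading
from types import SimpleNamespace

from cachetools import LRUCache, cached


class SlowModelRepository:
    """Hypothetical stand-in for a remote repository; counts how often it is hit."""

    def __init__(self) -> None:
        self.load_count = 0

    def get_model(self, model_id: str) -> dict:
        self.load_count += 1  # real code would download and deserialize here
        return {"model_id": model_id}


def with_cache(inner, maxsize: int = 100) -> SimpleNamespace:
    """Return a repository-like object whose get_model() is served from an LRU cache."""
    cached_get_model = cached(
        cache=LRUCache(maxsize=maxsize), lock=threading.Lock()
    )(inner.get_model)
    return SimpleNamespace(get_model=cached_get_model)


storage = SlowModelRepository()
model_repo = with_cache(storage, maxsize=2)

first = model_repo.get_model("model-a")   # loaded from storage
second = model_repo.get_model("model-a")  # served from the cache
```

The server only ever sees an object with get_model(), so it does not need to know whether a cache sits in front of the storage.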

Before going to production

Such a multi-model server is only one of many parts needed to run production-grade applications with ML capabilities. Enterprise-level security, scalability, MLOps, and similar concerns might matter even more for project success and reliability than a slightly more accurate ML model. Always keep in mind Rule #4 of Google’s Rules of Machine Learning: keep the first model simple and get the infrastructure right.

By the way, here at Exadel we have hands-on experience building complex applications across different industries for world-famous companies. Check our open positions if you are ready for challenging and exciting projects.
