Streaming-First Infrastructure for Real-Time Machine Learning





Key Takeaways

  • A streaming infrastructure can improve ML prediction latency and continual learning
  • Batch processing on static data is a subset of processing streaming data, so a streaming system can be used for both cases
  • Avoid common production problems by using a single streaming pipeline for online prediction and continual learning
  • An event-driven microservices architecture is a better choice for continual learning than a REST-based architecture
  • To know if continual learning is right for you, you should quantify the value of data freshness and fast iteration

Many companies have begun using machine learning (ML) models to improve their customer experience. In this article, I will talk about the benefits of streaming-first infrastructure for real-time ML. There are two scenarios of real-time ML that I want to cover. The first is online prediction, where a model can receive a request and make predictions as soon as the request arrives. The other is continual learning. Continual learning is when machine learning models are capable of continually adapting to change in data distributions in production.

Online Prediction

Online prediction is pretty straightforward to deploy. If you have developed a model, the easiest way to deploy is to containerize it, then upload it to a platform like AWS or GCP to create an online prediction web service endpoint. If you send data to that endpoint, you can get predictions for this data.

The problem with online prediction is latency. Research shows that no matter how good your model predictions are, if it takes even a few milliseconds too long to return results, users will leave your site or click on something else. A common trend in ML is toward bigger models. Bigger models give better accuracy, but they generally also take longer to run inference, and users don’t want to wait.

How do you make online prediction work? You actually need two components. The first is a model that is capable of returning fast inference. One solution is to use model compression techniques such as quantization and distillation. You could also use more powerful hardware, which allows models to do computation faster.

However, the solution I recommend is to use a real-time pipeline. What does that mean? It is a pipeline that can process data, feed the data into the model, generate predictions, and return those predictions to users in real time.

To illustrate a real-time pipeline, imagine you’re building a fraud detection model for a ride-sharing service like Uber or Lyft. To detect whether a transaction is fraudulent, you want information about that transaction specifically as well as the user’s other recent transactions. You also need to know about the specific credit card’s recent transactions, because when a credit card is stolen, the thief wants to make the most out of that credit card by using it for multiple transactions at the same time. You also want to look into recent in-app fraud, because there might be a trend, and maybe this specific transaction is related to those other fraudulent transactions.

A lot of this is recent information, and the question is: how do you quickly access these recent features? You don’t want to move the data in and out of your permanent storage because it might take too long and users are impatient.

Real-Time Transport and Stream Processing

The solution is to use in-memory storage. When you have incoming events – a user books a trip, picks a location, cancels a trip, contacts the driver – you put all the events into in-memory storage and keep them there for as long as those events are useful for real-time purposes. At some point, say after a few days, you can either discard those events or move them to permanent storage, such as AWS S3.
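To make the idea concrete, here is a minimal sketch of keeping recent events in memory and computing a windowed feature from them. It is not how Kafka, Kinesis, or Pulsar work internally; the class name RecentEventStore, the key "card-123", and the retention and window values are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class RecentEventStore:
    """Keeps events per key in memory and drops them once they are too old."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._events = defaultdict(deque)  # key -> deque of (timestamp, amount)

    def add(self, key, timestamp, amount):
        # Assumption: events for a given key arrive in timestamp order.
        self._events[key].append((timestamp, amount))
        self._evict(key, now=timestamp)

    def _evict(self, key, now):
        # Drop events that have outlived the retention period.
        events = self._events[key]
        while events and events[0][0] <= now - self.retention_seconds:
            events.popleft()

    def features(self, key, now, window_seconds):
        """Count and total amount of the key's events in the last window."""
        events = self._events.get(key, ())
        recent = [amt for ts, amt in events if now - window_seconds < ts <= now]
        return {"count": len(recent), "total": sum(recent)}


# Hypothetical card transactions, timestamps in seconds.
store = RecentEventStore(retention_seconds=3600)
store.add("card-123", timestamp=0, amount=12.0)
store.add("card-123", timestamp=500, amount=30.0)
store.add("card-123", timestamp=560, amount=45.0)
recent = store.features("card-123", now=600, window_seconds=120)
```

A fraud model could read features like these at request time without touching permanent storage. In a real deployment the expired events would be discarded or archived rather than simply dropped from a deque.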

The in-memory storage is generally what is called real-time transport, based on a system such as Kafka, Kinesis, or Pulsar. Because these platforms are event-based, this kind of processing is called event-driven processing.

Now, I want to differentiate between static data and streaming data. Static data is a fixed dataset, which contains features that don’t change, or else change very slowly: things like a user’s age or when an account was created. Also, static data is bounded: you know exactly how many data samples there are.

Streaming data, on the other hand, is continually being generated: it is unbounded. Streaming data includes information that is very recent, about features that can change very quickly. For example: a user’s location in the last 10 minutes, or the web pages a user has visited in the last few minutes.

Static data is often stored in a file format like comma-separated values (CSV) or Parquet and processed using a batch-processing system such as Hadoop. Because static data is bounded, when each data sample has been processed, you know the job is complete. By contrast, streaming data is usually accessed through a real-time transport platform like Kafka or Kinesis and handled using a stream-processing tool such as Flink or Samza. Because the data is unbounded, processing is never complete!

One Model, Two Pipelines

The problem with separating data into batch processing and stream processing is that now you have two different pipelines for one ML model. First, training a model uses static data and batch processing to generate features. During inference, however, you do online predictions with streaming data and stream processing to extract features.

This mismatch is a very common source of errors in production: a change in one pipeline isn’t replicated in the other pipeline. I have personally encountered that a few times. In one case, I had models that performed really well during development. When I deployed the models to production, however, the performance was poor.

To troubleshoot this, I took a sample of data and ran it through the prediction function in the training pipeline, and then through the prediction function in the inference pipeline. The two pipelines produced different results, and I eventually realized there was a mismatch between them.
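This kind of check can be automated. The sketch below runs the same samples through two callables and reports every output that differs. The two feature functions are hypothetical stand-ins for a training pipeline and a serving pipeline, and the cents-versus-dollars bug is invented purely to show what a mismatch looks like.

```python
import math


def find_mismatches(samples, training_pipeline, serving_pipeline, tolerance=1e-9):
    """Return (sample index, output name, training value, serving value) for each difference."""
    mismatches = []
    for index, sample in enumerate(samples):
        expected = training_pipeline(sample)
        actual = serving_pipeline(sample)
        for name in sorted(set(expected) | set(actual)):
            a, b = expected.get(name), actual.get(name)
            same = (
                a is not None
                and b is not None
                and math.isclose(a, b, abs_tol=tolerance)
            )
            if not same:
                mismatches.append((index, name, a, b))
    return mismatches


# Hypothetical pipelines: the serving side forgot to convert cents to dollars.
def training_features(txn):
    return {"fare_usd": txn["fare_cents"] / 100, "is_night": float(txn["hour"] >= 22)}


def serving_features(txn):
    return {"fare_usd": float(txn["fare_cents"]), "is_night": float(txn["hour"] >= 22)}


samples = [{"fare_cents": 1250, "hour": 23}, {"fare_cents": 0, "hour": 9}]
mismatches = find_mismatches(samples, training_features, serving_features)
```

Running a comparison like this on a sample of production data before each deployment is one way to catch the divergence early, although it treats the symptom rather than removing the second pipeline.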

The solution is to unify the batch and stream processing by using a streaming-first infrastructure. The key insight is that batch processing is a special case of stream processing, because a bounded dataset is a special case of the unbounded data from streaming: if a system can deal with an unbounded data stream, it can work with a bounded dataset. On the other hand, if a system is designed to process a bounded dataset, it’s very hard to make it work with an unbounded data stream.
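A small sketch of that insight: a feature function written against a stream, processing one event at a time, handles a bounded list and a never-ending generator with the same code. The function name rolling_mean and the fare values are assumptions for illustration only.

```python
from collections import deque
import itertools


def rolling_mean(events, window_size):
    """Yield the mean of the last `window_size` values after each incoming value."""
    window = deque(maxlen=window_size)
    for value in events:
        window.append(value)
        yield sum(window) / len(window)


# Bounded dataset: the generator finishes when the list is exhausted.
batch_result = list(rolling_mean([10, 20, 30, 40], window_size=2))


# Unbounded stream: this generator never ends, so we only take the first 4 outputs.
def endless_fares():
    for fare in itertools.cycle([10, 20, 30, 40]):
        yield fare


stream_result = list(itertools.islice(rolling_mean(endless_fares(), window_size=2), 4))
```

The reverse does not hold: a function that needs the full dataset up front, for example one that sorts all rows before computing anything, cannot be pointed at endless_fares() without being redesigned.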

Request-Driven to Event-Driven Architecture

In the domain of microservices, a concept related to event-driven processing is event-driven architecture, as opposed to request-driven architecture. In the last decade, the rise of microservices has been very tightly coupled with the rise of the REST API. A REST API is request driven, meaning that there is an interaction between a client and a server. The client sends a request, such as a POST or GET request, to the server, which returns a response. This is a synchronous operation, and the server has to be listening for requests continuously. If the server is down, the client will keep resending new requests until it gets a response, or until it times out.

One problem that can arise when you have a lot of microservices is inter-service communications, because different services will have to send requests to each other and get information from each other. In the figure below, there are three microservices, and there are many arrows showing the flow of information back and forth. If there are hundreds or thousands of microservices, it can be extremely complex and slow.

Another problem is how to map data transformation through the entire system. I have already mentioned how difficult it can be to understand machine learning models in production. To add to this complexity, you often don’t have a full view of the data flow through the system, which makes monitoring and observability very hard.

Instead of having request-driven communications, an alternative is an event-driven architecture. Instead of services communicating directly with each other, there is a central event stream. Whenever a service wants to publish something, it pushes that information onto the stream. Other services listen to the stream, and if an event is relevant to them, they can consume it and produce some result, which may also be published to the stream.

It’s possible for all services to publish to the same stream, and all services can also subscribe to the stream to get the information they need. The stream can be segregated into different topics, so that it’s easier to find the information relevant to a service.
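The pattern can be sketched with a toy in-memory stream. This is an illustration only: a real transport such as Kafka delivers events asynchronously and durably, whereas this sketch calls subscribers synchronously in the same process. The class EventStream, the topic names, and the scoring rule are all hypothetical.

```python
from collections import defaultdict


class EventStream:
    """Toy central stream: an append-only log plus per-topic subscribers."""

    def __init__(self):
        self.log = []  # every (topic, event) ever published, in order
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        self.log.append((topic, event))
        for handler in list(self._subscribers[topic]):
            handler(event)

    def history(self, topic):
        """Replay everything published on one topic."""
        return [event for t, event in self.log if t == topic]


stream = EventStream()
alerts = []


def fraud_service(event):
    # Hypothetical rule standing in for a model: large payments look risky.
    score = 0.9 if event["amount"] > 500 else 0.1
    stream.publish("fraud_scores", {"trip_id": event["trip_id"], "score": score})


def alert_service(event):
    if event["score"] > 0.5:
        alerts.append(event["trip_id"])


stream.subscribe("payments", fraud_service)
stream.subscribe("fraud_scores", alert_service)
stream.publish("payments", {"trip_id": "t1", "amount": 42})
stream.publish("payments", {"trip_id": "t2", "amount": 900})
```

Neither service knows the other exists; each only knows its topics. Because every intermediate result lands in the log, stream.history("fraud_scores") shows what the fraud service produced for each payment.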

There are several advantages to this event-driven architecture. First, it reduces the need for inter-service communications. Another is that because all the data transformation is now in the stream, you can just query the stream and understand how a piece of data is transformed by different services through the entire system. It’s really a nice property for monitoring.

From Monitoring to Continual Learning

It’s no secret that model performance degrades in production. There are many different reasons, but one key reason is data distribution shifts. Things change in the real world. The changes can be sudden – due to a pandemic, perhaps – or they can be cyclical. For example, ride sharing demand is probably different on the weekend compared to workdays. The change can also be gradual; for example, the way people talk slowly changes over time.

Monitoring helps you detect changing data distributions, but it’s a very shallow solution, because you detect the changes…and then what? What you really want is continual learning. You want to continually adapt models to changing data distributions.

When people hear continual learning, they think about the case where you have to update the models with every incoming sample. This has several drawbacks. For one thing, models could suffer from catastrophic forgetting. Another is that it can get unnecessarily expensive. A lot of hardware backends today are built to process a lot of data at the same time, so using that to process one sample at a time would be very wasteful. Instead, a better strategy is to update models with micro-batches of 500 or 1000 samples.
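A minimal sketch of the micro-batch strategy, assuming a toy one-feature linear model trained by gradient descent: samples are buffered as they arrive, and the model is updated only once a full batch has accumulated. The class names, learning rate, and synthetic data are assumptions; only the batch size of 500 comes from the text.

```python
class OnlineLinearModel:
    """Toy one-feature linear regression updated by batch gradient descent."""

    def __init__(self, learning_rate=0.05):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.weight * x + self.bias

    def update(self, batch):
        # One gradient step on mean squared error over the whole batch.
        n = len(batch)
        grad_w = sum((self.predict(x) - y) * x for x, y in batch) / n
        grad_b = sum((self.predict(x) - y) for x, y in batch) / n
        self.weight -= self.learning_rate * grad_w
        self.bias -= self.learning_rate * grad_b


class MicroBatchTrainer:
    """Buffers incoming samples and updates the model once per full micro-batch."""

    def __init__(self, model, batch_size):
        self.model = model
        self.batch_size = batch_size
        self.buffer = []
        self.updates = 0

    def observe(self, x, y):
        self.buffer.append((x, y))
        if len(self.buffer) >= self.batch_size:
            self.model.update(self.buffer)
            self.buffer = []
            self.updates += 1


trainer = MicroBatchTrainer(OnlineLinearModel(), batch_size=500)
for i in range(1200):  # synthetic stream where y = 2x
    x = (i % 10) / 10
    trainer.observe(x, 2.0 * x)
```

After 1200 samples the model has been updated twice, and 200 samples are still waiting in the buffer for the next batch.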

Iteration Cycle

You’ve made an update to the model, but you shouldn’t deploy the update until you have evaluated it. In fact, with continual learning, you don’t update the production model directly. Instead, you create a replica of that model, and then update that replica, which now becomes a candidate model. You only want to deploy that candidate model to production after it has been evaluated.

First, you use a static data test set to do offline evaluation, to ensure that the model isn’t doing something completely unexpected; think of this as a “smoke test.” You also need to do online evaluation, because the whole point of continual learning is to adapt a model to change in distributions, so it doesn’t make sense to test this on a stationary test set. The only way to be sure that the model is going to work is to do online evaluations.
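The replica-then-evaluate flow can be sketched as follows. The model here is a deliberately trivial one that predicts the running mean of its labels; MeanModel, train_candidate, passes_smoke_test, and the error threshold are hypothetical names and values. Only the offline smoke test is shown; the online evaluation would follow as a separate step.

```python
import copy


class MeanModel:
    """Toy model that predicts the running mean of the labels it has seen."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, labels):
        self.total += sum(labels)
        self.count += len(labels)

    def predict(self):
        return self.total / self.count if self.count else 0.0


def train_candidate(production_model, fresh_labels):
    # Update a replica so the production model is never modified in place.
    candidate = copy.deepcopy(production_model)
    candidate.update(fresh_labels)
    return candidate


def passes_smoke_test(model, test_labels, max_error):
    # Offline check on a static test set: mean absolute error under a threshold.
    error = sum(abs(model.predict() - y) for y in test_labels) / len(test_labels)
    return error <= max_error


production = MeanModel()
production.update([10.0, 12.0])
candidate = train_candidate(production, fresh_labels=[14.0, 16.0])
ready_for_online_eval = passes_smoke_test(candidate, [12.0, 14.0], max_error=2.0)
```

The production model still predicts 11.0 after the candidate has moved to 13.0, which is the point of the replica: nothing changes for users until the candidate has been evaluated.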

There are a lot of ways for you to do it safely: through A/B testing, canary analysis, and multi-armed bandits. I’m especially excited about those, because they allow you to test multiple models: you treat each model as an arm of the bandit.
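As an illustration of treating each model as an arm, here is a sketch of an epsilon-greedy bandit that routes traffic between models. Epsilon-greedy is just one simple bandit strategy, picked here for brevity. The class name, the model names, and the fixed rewards in the simulation are assumptions; in practice the reward would come from real user feedback such as clicks.

```python
import random


class EpsilonGreedyRouter:
    """Routes each request to a model, favoring the best average reward so far."""

    def __init__(self, model_names, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {name: 0 for name in model_names}
        self.rewards = {name: 0.0 for name in model_names}

    def mean_reward(self, name):
        return self.rewards[name] / self.pulls[name] if self.pulls[name] else 0.0

    def choose(self):
        # Try every model once before comparing averages.
        untried = [name for name, count in self.pulls.items() if count == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise exploit the best arm.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        return max(self.pulls, key=self.mean_reward)

    def record(self, name, reward):
        self.pulls[name] += 1
        self.rewards[name] += reward


# Toy simulation with made-up, fixed rewards: the candidate always "wins".
simulated_reward = {"production": 0.0, "candidate": 1.0}
router = EpsilonGreedyRouter(["production", "candidate"], epsilon=0.1)
for _ in range(1000):
    chosen = router.choose()
    router.record(chosen, simulated_reward[chosen])
```

Compared with a fixed 50/50 A/B split, the bandit shifts most traffic to the better-performing model while still sending a small share to the others, so fewer users see the weaker model during the test.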

For continual learning, the iteration cycles can be on the order of minutes. For example, Weibo has an iteration cycle of around 10 minutes. You can see similar examples with Alibaba, TikTok, and Shein. This speed is remarkable, given the results of a recent study by Algorithmia which found that 64% of companies have cycles of a month or longer.

Continual Learning: Use Cases

There are a lot of great use cases for continual learning. First, it allows a model to adapt to rare events very quickly. As an example, Black Friday happens only once a year in the U.S., so there’s no way you can have enough historical information to accurately predict how a user is going to behave on Black Friday. For the best performance you would continually train the model during Black Friday. In China, Singles’ Day is a shopping holiday similar to Black Friday in the U.S., and it is one of the use cases where Alibaba is using continual learning.

Continual learning also helps you overcome the continuous cold start problem. This is when you have new users, or users get a new device, so you don’t have enough historical information to make predictions for them. If you can update your model during sessions, then you can overcome the continuous cold start problem. Within a session, you can learn what users want, even though you don’t have historical data, and you can make relevant predictions. As an example, TikTok is very successful because they are able to use continual learning to adapt to users’ preferences within a session.

Continual learning is especially good for tasks with natural labels, for example recommendation systems. If you show users recommendations and they click on one, then it was a good prediction. If after a certain period of time there are no clicks, then it was a bad prediction. It is one short feedback loop, on the order of minutes. This is applicable to much online content like short videos, Reddit posts, or Tweets.
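The labeling rule described above can be sketched in a few lines. The function name, the recommendation ids, and the 10-minute timeout are assumptions for illustration; timestamps are in seconds.

```python
def label_recommendations(impressions, clicks, now, timeout_seconds):
    """
    impressions: dict of recommendation id -> time it was shown
    clicks:      dict of recommendation id -> time it was clicked
    Returns id -> 1 if clicked within the timeout, 0 if the timeout passed
    with no click. Recommendations still inside the window are left unlabeled.
    """
    labels = {}
    for rec_id, shown_at in impressions.items():
        deadline = shown_at + timeout_seconds
        clicked_at = clicks.get(rec_id)
        if clicked_at is not None and clicked_at <= deadline:
            labels[rec_id] = 1  # natural positive label
        elif now > deadline:
            labels[rec_id] = 0  # window expired without a click
    return labels


impressions = {"rec-a": 0, "rec-b": 0, "rec-c": 550}
clicks = {"rec-a": 45}
labels = label_recommendations(impressions, clicks, now=700, timeout_seconds=600)
```

Here rec-a is a positive, rec-b is a negative, and rec-c gets no label yet because its window is still open. The choice of timeout is a trade-off: a shorter window produces labels faster but mislabels slow clicks as negatives.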

However, not all recommendation systems have short feedback loops. For example, a marketplace like Stitchfix could recommend items that users might want, but would have to wait for the items to be shipped and for users to try them on, with a total cycle time of weeks before finding out if the prediction was good.

Is Continual Learning Right for You?

Continual learning sounds great, but is it right for you? First, you have to quantify the value of data freshness. People say that fresh data is better, but how much better? One thing you can do is measure how much model performance changes if you switch from retraining monthly to weekly, to daily, or even to hourly.

For example, back in 2014, Facebook did a study that found if they went from training weekly to daily, they could increase their click-through rate by 1%, which was significant enough for them to change the pipeline to daily.

You also want to understand the value of model iteration and data iteration. Model iteration is when you make significant changes to a model architecture, and data iteration is training the same model on newer data. In theory, you can do both. In practice, the more you do of one, the fewer resources you have to spend on the other. I’ve seen a lot of companies that found out that data iteration actually gave them much higher return than model iteration.

You should also quantify the value of fast iterations. If you can run experiments very quickly and get feedback from the experiment quickly, then how many more experiments can you run? The more experiments you can run, the more likely you are to find models that work better for you and give you better return.

One problem that a lot of people are worried about is the cloud cost. Model training costs money, so you might think that the more often you train the model, the more expensive it’s going to be. That’s actually not always the case for continual learning.

In batch learning, it takes longer to retrain the model, since you have to retrain the model from scratch. In continual learning, however, you train the model more frequently, so you don’t have to retrain it from scratch; you essentially just continue training the model on fresh data. It actually requires less data and fewer compute resources.

There was a really great study from Grubhub. When they switched from monthly training to daily training, they saw a 45x savings on training compute cost. At the same time, they achieved a more than 20% increase in their evaluation metrics.

Barriers to Streaming-first Infrastructure

Streaming-first infrastructure sounds great: you can use it for online prediction and for continual learning. So why doesn’t everyone use it? One reason is that many companies don’t see the benefits of streaming, perhaps because their systems are not at a scale where inter-service communication has become a problem.

Another barrier is that companies simply have never tried it before, and so they don’t see the benefits. It’s a chicken and egg problem, because to see the benefits of streaming-first, you need to implement it.

Also, there’s a high initial investment in infrastructure and many companies worry that they need employees with specialized knowledge. This may have been true in the past, but the good news is that there are so many tools being built that make it extremely easy to switch to streaming-first infrastructure.

Bet on the Future

I think it’s important to make a bet on the future. Many companies are now moving to streaming-first because their metric increases have plateaued, and they know that for big metric wins they will need to try out new technology. By leveraging streaming-first infrastructure, companies can build a platform to make it easier to do real-time machine learning.

 




