Amazon recently announced that SageMaker Serverless Inference is generally available. Designed for workloads with intermittent or infrequent traffic patterns, the new option provisions and scales compute capacity according to the volume of inference requests the model receives.
Like other serverless services on AWS, SageMaker Serverless Inference endpoints automatically start compute resources and scale them in and out depending on traffic, so customers do not need to choose an instance type or manage scaling policies. According to AWS, endpoints can scale from tens to thousands of inferences within seconds. Customers can also specify the memory size of the serverless inference endpoint. Antje Barth, principal developer advocate at AWS, explains the benefits of the new option:
In a lot of conversations with ML practitioners, I’ve picked up the ask for a fully managed ML inference option that lets you focus on developing the inference code while managing all things infrastructure for you. SageMaker Serverless Inference now delivers this ease of deployment.
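To make the memory setting concrete, here is a minimal sketch of how a serverless endpoint configuration might be assembled for the boto3 `create_endpoint_config` call. The helper names, the default values, and the allowed memory sizes and concurrency cap are assumptions for illustration and are not taken from the announcement; check the current SageMaker documentation for the actual limits.

```python
# Assumed limits: memory in 1 GB steps from 1024 MB to 6144 MB and a
# concurrency cap of 200. Verify these against the SageMaker documentation.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)
MAX_CONCURRENCY_LIMIT = 200


def build_serverless_endpoint_config(config_name, model_name,
                                     memory_mb=2048, max_concurrency=5):
    """Return keyword arguments for SageMaker's create_endpoint_config.

    Hypothetical helper: no instance type is specified; only the memory
    size and the maximum number of concurrent invocations are set.
    """
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    if not 1 <= max_concurrency <= MAX_CONCURRENCY_LIMIT:
        raise ValueError("max_concurrency is outside the assumed range")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "ServerlessConfig": {
                    "MemorySizeInMB": memory_mb,
                    "MaxConcurrency": max_concurrency,
                },
            }
        ],
    }


def create_serverless_endpoint(endpoint_name, config_name, model_name, **options):
    """Create the endpoint config and the endpoint (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("sagemaker")
    client.create_endpoint_config(
        **build_serverless_endpoint_config(config_name, model_name, **options)
    )
    client.create_endpoint(
        EndpointName=endpoint_name, EndpointConfigName=config_name
    )
```

Keeping the request builder separate from the AWS call means the configuration can be validated and inspected locally before anything is deployed. The model referenced by `model_name` is assumed to be registered in SageMaker already.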
The preview of the serverless option was introduced at re:Invent 2021. Since then, the cloud provider has added support for the Amazon SageMaker Python SDK and for Model Registry, which makes it possible to integrate serverless inference endpoints into an MLOps workflow.
The need for a serverless option and alternatives to SageMaker have been discussed in a Reddit thread. Using container image support in AWS Lambda is another approach to running serverless machine learning workloads, as explained by Luca Bianchi, CTO at Neosperience.
Philipp Schmid, technical lead at Hugging Face, writes:
SageMaker Serverless Inference will 100% help you accelerate your machine learning journey and enables you to build fast and cost-effective proofs-of-concept where cold starts or scalability is not mission-critical, which can quickly be moved to GPUs or more high scale environments.
In a separate article, Schmid and co-authors from AWS explain how to host Hugging Face transformer models using SageMaker Serverless Inference. Barth adds a warning about cold starts:
If the endpoint does not receive traffic for a while, it scales down the compute resources. If the endpoint suddenly receives new requests, you might notice that it takes some time for the endpoint to scale up the compute resources to process the requests. This cold-start time greatly depends on your model size and the start-up time of your container. To optimize cold-start times, you can try to minimize the size of your model, for example, by applying techniques such as knowledge distillation, quantization, or model pruning.
In addition to the new serverless option, Amazon SageMaker has three other model inference options to support different use cases: SageMaker Real-Time Inference, designed for workloads with low latency requirements on the order of milliseconds; SageMaker Asynchronous Inference, suggested for inferences with large payload sizes or long processing times; and SageMaker Batch Transform, to run predictions on batches of data.
Customers can create and update a serverless inference endpoint using the SageMaker console, the AWS SDKs, the SageMaker Python SDK, the AWS CLI, or AWS CloudFormation. Usage is billed by the millisecond, based on the compute time used to run the inference code and the amount of data processed. The free tier includes “150,000 seconds of inference duration” per month for the first two months.
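As a rough illustration of how the two billing components combine, the sketch below estimates a monthly bill from request volume, average inference duration, and data processed. The function name and every rate passed to it are hypothetical placeholders, not published AWS prices, and applying the free tier as a flat deduction of seconds is a simplifying assumption.

```python
def estimate_monthly_cost(requests, avg_duration_ms, price_per_second,
                          data_gb=0.0, price_per_gb=0.0, free_seconds=0.0):
    """Estimate a monthly serverless inference bill.

    All prices are caller-supplied placeholders. Duration is metered per
    millisecond, so the total compute time is converted to seconds before
    the (assumed) free-tier allowance is subtracted.
    """
    billed_seconds = requests * avg_duration_ms / 1000.0
    chargeable_seconds = max(0.0, billed_seconds - free_seconds)
    compute_cost = chargeable_seconds * price_per_second
    data_cost = data_gb * price_per_gb
    return compute_cost + data_cost


# Example with made-up rates: one million requests of 200 ms each is
# 200,000 seconds of compute, of which 50,000 remain chargeable after
# the 150,000 free seconds.
example = estimate_monthly_cost(
    requests=1_000_000,
    avg_duration_ms=200,
    price_per_second=0.0001,   # hypothetical rate
    free_seconds=150_000,
)
```

With these placeholder numbers the estimate comes to roughly 5 units of whatever currency the rate is expressed in. Real rates depend on the memory size chosen and on the region.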