Amazon SageMaker async inference now supports inline request payloads

AWS ML Blog | Written by: 이현민 | Jun 21, 2026

Amazon Web Services has updated SageMaker AI Async Inference to support inline request payloads. Users can now send up to 128,000 bytes of inference data directly in the API request instead of first uploading it to Amazon S3. The change simplifies the workflow for small payloads and reduces operational overhead when deploying models.

Amazon Web Services (AWS) has introduced inline payload support for Amazon SageMaker AI Async Inference. Customers can now embed inference data directly in the request body of the InvokeEndpointAsync API, without first uploading it to Amazon S3. Inline payloads can be up to 128,000 bytes, which targets small inputs. The feature is available in 31 commercial AWS Regions.

Amazon SageMaker AI Async Inference is designed for workloads with large payloads, variable traffic, or tolerance for latency of seconds to minutes. Before this update, every invocation was a two-step process: users uploaded the input payload to an S3 bucket, then invoked the endpoint by passing the S3 object URI. That works well for large inputs such as images or multi-megabyte documents, but it added unnecessary complexity for customers whose inputs were only kilobytes in size yet still needed the longer processing times that asynchronous inference allows. Because the service can automatically scale to zero instances, it was already cost-efficient for bursty or batch-style workloads; the S3 step was the cumbersome part for small inputs.
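The S3-based two-step flow can be sketched in Python roughly as follows. This is a minimal illustration, not AWS sample code: the function name and its arguments are hypothetical, and the clients are passed in as parameters (assumed to behave like boto3.client("s3") and boto3.client("sagemaker-runtime")) so the sketch can be exercised without AWS credentials.

```python
import json


def invoke_via_s3(s3_client, runtime_client, endpoint_name, bucket, key, payload):
    """Two-step async invocation: upload the input to S3, then pass its URI.

    s3_client is assumed to behave like boto3.client("s3") and
    runtime_client like boto3.client("sagemaker-runtime").
    The bucket, key, and endpoint names are caller-supplied placeholders.
    """
    body = json.dumps(payload).encode("utf-8")

    # Step 1: upload the input payload to S3 (an extra network round trip).
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)

    # Step 2: invoke the endpoint, pointing it at the uploaded object.
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{bucket}/{key}",
        ContentType="application/json",
    )

    # The response includes the S3 URI where the result will be written.
    return response["OutputLocation"]
```

Even for a payload of a few hundred bytes, the caller has to pick a bucket and key, perform the upload, and keep S3 write permissions in place before the endpoint is ever invoked.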

Inline payloads simplify client-side code and reduce the operational surface area of asynchronous inference workloads with small inputs. Skipping the S3 upload removes an entire network round trip from each request. This is particularly useful for applications that need small-scale inferences without the overhead of managing an S3 upload for every request. The new Body parameter of the InvokeEndpointAsync API is mutually exclusive with the InputLocation parameter, so each request uses either inline data or S3-based input, never both. Output behavior is unchanged: results are still written to the configured S3 output location. The update is part of AWS's continuing effort to simplify the machine learning deployment pipeline and make its AI services usable for a broader range of use cases.
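The choice between inline and S3-based input can be sketched as a small client-side request builder. This is an illustration under stated assumptions, not an official client: the helper name build_async_request and the constant MAX_INLINE_BYTES are hypothetical, the Body parameter name and the 128,000-byte limit are taken from the announcement, and the exclusivity check here only mirrors on the client what the article says the API requires.

```python
import json

# Inline limit as stated in the announcement (assumed to be inclusive).
MAX_INLINE_BYTES = 128_000


def build_async_request(endpoint_name, payload=None, input_location=None,
                        content_type="application/json"):
    """Build keyword arguments for an InvokeEndpointAsync call.

    Exactly one of payload (sent inline as Body) or input_location
    (an S3 URI sent as InputLocation) must be given.
    """
    # Body and InputLocation are mutually exclusive per the announcement.
    if (payload is None) == (input_location is None):
        raise ValueError("Provide exactly one of payload or input_location")

    request = {"EndpointName": endpoint_name, "ContentType": content_type}

    if input_location is not None:
        # S3-based path: the input was uploaded beforehand.
        request["InputLocation"] = input_location
        return request

    # Inline path: serialize non-bytes payloads as JSON and check the size.
    if isinstance(payload, bytes):
        body = payload
    else:
        body = json.dumps(payload).encode("utf-8")
    if len(body) > MAX_INLINE_BYTES:
        raise ValueError(
            f"Inline payload is {len(body)} bytes; the limit is "
            f"{MAX_INLINE_BYTES}. Upload it to S3 and pass input_location."
        )
    request["Body"] = body
    return request
```

With an SDK release that includes the new parameter, the resulting dictionary could presumably be passed as keyword arguments to the runtime client's invoke_endpoint_async call; whether a given boto3 version already accepts Body is something to verify against its documentation. In either case the result is read from the S3 output location returned in the response.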

