Commit 3f5c5ac

feat(sagemaker): add support for serverless inference endpoints (#35557)
Implements SageMaker Serverless Inference endpoints as requested in issue #23148.

- Add ServerlessProductionVariantProps interface with maxConcurrency, memorySizeInMB, and provisionedConcurrency
- Extend EndpointConfig to support serverless variants alongside existing instance variants
- Add comprehensive validation for serverless configuration parameters
- Enforce mutual exclusivity between instance and serverless variants
- Add CloudFormation template generation for ServerlessConfig properties
- Include extensive test coverage for validation scenarios and error cases

### Issue # 23148

Closes #23148.

### Reason for this change

AWS SageMaker Serverless Inference is not supported in the CDK SageMaker L2 constructs. Users can only configure instance-based endpoints, missing the serverless option for intermittent/unpredictable traffic patterns that could benefit from cost-effective serverless inference.

This feature was explicitly planned in the original [SageMaker Endpoint L2 construct RFC](https://github.com/aws/aws-cdk-rfcs/blob/master/text/0431-sagemaker-l2-endpoint.md#feature-additions) with Instance-prefixed classes designed to make room for Serverless-prefixed analogs.

### Description of changes

Implements AWS SageMaker Serverless Inference support in CDK SageMaker L2 constructs, enabling cost-effective serverless endpoints for intermittent workloads:

- **New `ServerlessProductionVariantProps` interface** extending `ProductionVariantProps` with AWS-compliant serverless properties:
  - `maxConcurrency`: 1-200 range (required)
  - `memorySizeInMB`: 1024-6144MB in 1GB increments (required)
  - `provisionedConcurrency`: 1-200 range, optional, must be ≤ maxConcurrency
- **New `addServerlessProductionVariant()` method** with comprehensive input validation
- **Extended `EndpointConfigProps`** with optional `serverlessProductionVariant` property
- **Mutual exclusivity enforcement** between instance and serverless variants per AWS constraints
- **Single serverless variant limit** per endpoint configuration (AWS limitation)
- **Comprehensive synthesis-time validation** with clear, actionable error messages
- **CloudFormation integration** leveraging existing L1 construct `ServerlessConfig` support

**Usage Example**:

```typescript
import * as sagemaker from '@aws-cdk/aws-sagemaker-alpha';

declare const model: sagemaker.IModel;

// Create serverless endpoint configuration
const endpointConfig = new sagemaker.EndpointConfig(this, 'ServerlessEndpointConfig', {
  serverlessProductionVariant: {
    model: model,
    variantName: 'serverlessVariant',
    maxConcurrency: 10,
    memorySizeInMB: 2048,
    provisionedConcurrency: 5, // optional
  },
});
```

### Describe any new or updated permissions being added

N/A - No new IAM permissions required. Leverages existing SageMaker model and endpoint permissions.

### Description of how you validated changes

- **Unit tests**: Added 12 comprehensive serverless variant tests covering all validation scenarios:
  - Memory size validation (1024-6144MB in 1GB increments)
  - Concurrency range validation (1-200 for both max and provisioned)
  - Mutual exclusivity enforcement between instance and serverless variants
  - Single serverless variant limit per AWS constraints
  - Cross-environment model compatibility validation
  - Error condition testing with clear error messages
  - CloudFormation template generation verification
- **Integration tests**: Extended existing integration test with serverless endpoint configuration, verified CloudFormation template generation with correct `ServerlessConfig` properties:

```yaml
ServerlessEndpointConfig:
  Type: AWS::SageMaker::EndpointConfig
  Properties:
    ProductionVariants:
      - ServerlessConfig:
          MaxConcurrency: 10
          MemorySizeInMB: 2048
          ProvisionedConcurrency: 5
        VariantName: serverlessVariant
```

- **Comprehensive testing results**: 63/63 unit tests pass (100% success rate), 4/4 integration tests pass, no regressions detected across 16,024+ CDK tests

### Checklist

- [x] My code adheres to the [CONTRIBUTING GUIDE](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md) and [DESIGN GUIDELINES](https://github.com/aws/aws-cdk/blob/main/docs/DESIGN_GUIDELINES.md)

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
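The numeric rules listed in the description can be restated as a small standalone check. This is a minimal sketch, not the construct's own code: `ServerlessSettings` and `checkServerlessSettings` are hypothetical names, and the function only mirrors the ranges stated above (1-200 concurrency, 1 GB memory steps, provisioned ≤ max).

```typescript
// Hypothetical standalone mirror of the rules described in this commit message.
// It does not import or call the CDK; names here are illustrative only.
interface ServerlessSettings {
  maxConcurrency: number;
  memorySizeInMB: number;
  provisionedConcurrency?: number;
}

const VALID_MEMORY_SIZES_MB: number[] = [1024, 2048, 3072, 4096, 5120, 6144];

function checkServerlessSettings(s: ServerlessSettings): string[] {
  const errors: string[] = [];

  // maxConcurrency is required and must fall in 1-200.
  if (s.maxConcurrency < 1 || s.maxConcurrency > 200) {
    errors.push('maxConcurrency must be between 1 and 200');
  }

  // memorySizeInMB must be one of the 1 GB steps from 1024 to 6144.
  if (VALID_MEMORY_SIZES_MB.indexOf(s.memorySizeInMB) === -1) {
    errors.push('memorySizeInMB must be one of: ' + VALID_MEMORY_SIZES_MB.join(', ') + ' MB');
  }

  // provisionedConcurrency is optional; when set it must be in range
  // and must not exceed maxConcurrency.
  if (s.provisionedConcurrency !== undefined) {
    if (s.provisionedConcurrency < 1 || s.provisionedConcurrency > 200) {
      errors.push('provisionedConcurrency must be between 1 and 200');
    }
    if (s.provisionedConcurrency > s.maxConcurrency) {
      errors.push('provisionedConcurrency cannot be greater than maxConcurrency');
    }
  }

  return errors;
}
```

Collecting every violation in one list, rather than stopping at the first, lets a caller see all of the problems with a configuration in a single pass.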
1 parent 60096ac commit 3f5c5ac

16 files changed: +34385 -1375 lines changed


packages/@aws-cdk/aws-sagemaker-alpha/README.md

Lines changed: 32 additions & 0 deletions
@@ -214,6 +214,38 @@ const endpointConfig = new sagemaker.EndpointConfig(this, 'EndpointConfig', {
 });
 ```
 
+### Serverless Inference
+
+Amazon SageMaker Serverless Inference is a purpose-built inference option that makes it easy for you to deploy and scale ML models. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. For more information, see [SageMaker Serverless Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html).
+
+To create a serverless endpoint configuration, use the `serverlessProductionVariant` property:
+
+```typescript
+import * as sagemaker from '@aws-cdk/aws-sagemaker-alpha';
+
+declare const model: sagemaker.Model;
+
+const endpointConfig = new sagemaker.EndpointConfig(this, 'ServerlessEndpointConfig', {
+  serverlessProductionVariant: {
+    model: model,
+    variantName: 'serverlessVariant',
+    maxConcurrency: 10,
+    memorySizeInMB: 2048,
+    provisionedConcurrency: 5, // optional
+  },
+});
+```
+
+Serverless inference is ideal for workloads with intermittent or unpredictable traffic patterns. You can configure:
+
+- `maxConcurrency`: Maximum concurrent invocations (1-200)
+- `memorySizeInMB`: Memory allocation in 1GB increments (1024, 2048, 3072, 4096, 5120, or 6144 MB)
+- `provisionedConcurrency`: Optional pre-warmed capacity to reduce cold starts
+
+**Note**: Provisioned concurrency incurs charges even when the endpoint is not processing requests. Use it only when you need to minimize cold start latency.
+
+You cannot mix serverless and instance-based variants in the same endpoint configuration.
+
 ### Endpoint
 
 When you create an endpoint from an `EndpointConfig`, Amazon SageMaker launches the ML compute
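Because `memorySizeInMB` only accepts the six values listed in the README text, a caller who knows roughly how much memory a model needs still has to round up to an allowed size. The sketch below shows one way to do that. `pickMemorySizeInMB` is a hypothetical helper and is not part of `@aws-cdk/aws-sagemaker-alpha`.

```typescript
// Hypothetical helper (not a CDK API): round a memory requirement up to the
// nearest size accepted for serverless variants.
const SERVERLESS_MEMORY_SIZES_MB: number[] = [1024, 2048, 3072, 4096, 5120, 6144];

function pickMemorySizeInMB(requiredMB: number): number {
  // The list is sorted ascending, so the first size that fits is the smallest.
  for (const size of SERVERLESS_MEMORY_SIZES_MB) {
    if (size >= requiredMB) {
      return size;
    }
  }
  throw new Error('No serverless memory size can hold ' + requiredMB + ' MB (maximum is 6144 MB)');
}
```

The result could then be passed as `memorySizeInMB` in the `serverlessProductionVariant` props shown in the README example.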

packages/@aws-cdk/aws-sagemaker-alpha/lib/endpoint-config.ts

Lines changed: 188 additions & 5 deletions
@@ -75,6 +75,31 @@ export interface InstanceProductionVariantProps extends ProductionVariantProps {
   readonly instanceType?: InstanceType;
 }
 
+/**
+ * Construction properties for a serverless production variant.
+ */
+export interface ServerlessProductionVariantProps extends ProductionVariantProps {
+  /**
+   * The maximum number of concurrent invocations your serverless endpoint can process.
+   *
+   * Valid range: 1-200
+   */
+  readonly maxConcurrency: number;
+  /**
+   * The memory size of your serverless endpoint. Valid values are in 1 GB increments:
+   * 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB.
+   */
+  readonly memorySizeInMB: number;
+  /**
+   * The number of concurrent invocations that are provisioned and ready to respond to your endpoint.
+   *
+   * Valid range: 1-200, must be less than or equal to maxConcurrency.
+   *
+   * @default - none
+   */
+  readonly provisionedConcurrency?: number;
+}
+
 /**
  * Represents common attributes of all production variant types (e.g., instance, serverless) once
  * associated to an EndpointConfig.
@@ -119,6 +144,26 @@ export interface InstanceProductionVariant extends ProductionVariant {
   readonly instanceType: InstanceType;
 }
 
+/**
+ * Represents a serverless production variant that has been associated with an EndpointConfig.
+ *
+ * @internal
+ */
+interface ServerlessProductionVariant extends ProductionVariant {
+  /**
+   * The maximum number of concurrent invocations your serverless endpoint can process.
+   */
+  readonly maxConcurrency: number;
+  /**
+   * The memory size of your serverless endpoint.
+   */
+  readonly memorySizeInMB: number;
+  /**
+   * The number of concurrent invocations that are provisioned and ready to respond to your endpoint.
+   */
+  readonly provisionedConcurrency?: number;
+}
+
 /**
  * Construction properties for a SageMaker EndpointConfig.
  */
@@ -142,9 +187,21 @@ export interface EndpointConfigProps {
    * A list of instance production variants. You can always add more variants later by calling
    * `EndpointConfig#addInstanceProductionVariant`.
    *
+   * Cannot be specified if `serverlessProductionVariant` is specified.
+   *
    * @default - none
    */
   readonly instanceProductionVariants?: InstanceProductionVariantProps[];
+
+  /**
+   * A serverless production variant. Serverless endpoints automatically launch compute resources
+   * and scale them in and out depending on traffic.
+   *
+   * Cannot be specified if `instanceProductionVariants` is specified.
+   *
+   * @default - none
+   */
+  readonly serverlessProductionVariant?: ServerlessProductionVariantProps;
 }
 
 /**
@@ -207,6 +264,7 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
   public readonly endpointConfigName: string;
 
   private readonly instanceProductionVariantsByName: { [key: string]: InstanceProductionVariant } = {};
+  private serverlessProductionVariant?: ServerlessProductionVariant;
 
   constructor(scope: Construct, id: string, props: EndpointConfigProps = {}) {
     super(scope, id, {
@@ -215,13 +273,22 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
     // Enhanced CDK Analytics Telemetry
     addConstructMetadata(this, props);
 
+    // Validate mutual exclusivity
+    if (props.instanceProductionVariants && props.serverlessProductionVariant) {
+      throw new Error('Cannot specify both instanceProductionVariants and serverlessProductionVariant. Choose one variant type.');
+    }
+
     (props.instanceProductionVariants || []).map(p => this.addInstanceProductionVariant(p));
 
+    if (props.serverlessProductionVariant) {
+      this.addServerlessProductionVariant(props.serverlessProductionVariant);
+    }
+
     // create the endpoint configuration resource
     const endpointConfig = new CfnEndpointConfig(this, 'EndpointConfig', {
       kmsKeyId: (props.encryptionKey) ? props.encryptionKey.keyRef.keyArn : undefined,
       endpointConfigName: this.physicalName,
-      productionVariants: cdk.Lazy.any({ produce: () => this.renderInstanceProductionVariants() }),
+      productionVariants: cdk.Lazy.any({ produce: () => this.renderProductionVariants() }),
     });
     this.endpointConfigName = this.getResourceNameAttribute(endpointConfig.attrEndpointConfigName);
     this.endpointConfigArn = this.getResourceArnAttribute(endpointConfig.ref, {
@@ -238,6 +305,9 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
    */
   @MethodMetadata()
   public addInstanceProductionVariant(props: InstanceProductionVariantProps): void {
+    if (this.serverlessProductionVariant) {
+      throw new Error('Cannot add instance production variant when serverless production variant is already configured');
+    }
     if (props.variantName in this.instanceProductionVariantsByName) {
       throw new Error(`There is already a Production Variant with name '${props.variantName}'`);
     }
@@ -252,6 +322,30 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
     };
   }
 
+  /**
+   * Add serverless production variant to the endpoint configuration.
+   *
+   * @param props The properties of a serverless production variant to add.
+   */
+  @MethodMetadata()
+  public addServerlessProductionVariant(props: ServerlessProductionVariantProps): void {
+    if (Object.keys(this.instanceProductionVariantsByName).length > 0) {
+      throw new Error('Cannot add serverless production variant when instance production variants are already configured');
+    }
+    if (this.serverlessProductionVariant) {
+      throw new Error('Cannot add more than one serverless production variant per endpoint configuration');
+    }
+    this.validateServerlessProductionVariantProps(props);
+    this.serverlessProductionVariant = {
+      initialVariantWeight: props.initialVariantWeight || 1.0,
+      maxConcurrency: props.maxConcurrency,
+      memorySizeInMB: props.memorySizeInMB,
+      modelName: props.model.modelName,
+      provisionedConcurrency: props.provisionedConcurrency,
+      variantName: props.variantName,
+    };
+  }
+
   /**
    * Get instance production variants associated with endpoint configuration.
    *
@@ -276,10 +370,20 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
   }
 
   private validateProductionVariants(): void {
-    // validate number of production variants
-    if (this._instanceProductionVariants.length < 1) {
+    const hasServerlessVariant = this.serverlessProductionVariant !== undefined;
+
+    // validate at least one production variant
+    if (this._instanceProductionVariants.length === 0 && !hasServerlessVariant) {
       throw new Error('Must configure at least 1 production variant');
-    } else if (this._instanceProductionVariants.length > 10) {
+    }
+
+    // validate mutual exclusivity
+    if (this._instanceProductionVariants.length > 0 && hasServerlessVariant) {
+      throw new Error('Cannot configure both instance and serverless production variants');
+    }
+
+    // validate instance variant limits
+    if (this._instanceProductionVariants.length > 10) {
       throw new Error('Can\'t have more than 10 production variants');
     }
   }
@@ -310,11 +414,69 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
     }
   }
 
+  private validateServerlessProductionVariantProps(props: ServerlessProductionVariantProps): void {
+    const errors: string[] = [];
+
+    // check variant weight is not negative
+    if (props.initialVariantWeight && props.initialVariantWeight < 0) {
+      errors.push('Cannot have negative variant weight');
+    }
+
+    // check maxConcurrency range
+    if (props.maxConcurrency < 1 || props.maxConcurrency > 200) {
+      errors.push('maxConcurrency must be between 1 and 200');
+    }
+
+    // check memorySizeInMB valid values (1GB increments from 1024 to 6144)
+    const validMemorySizes = [1024, 2048, 3072, 4096, 5120, 6144];
+    if (!validMemorySizes.includes(props.memorySizeInMB)) {
+      errors.push(`memorySizeInMB must be one of: ${validMemorySizes.join(', ')} MB`);
+    }
+
+    // check provisionedConcurrency range and relationship to maxConcurrency
+    if (props.provisionedConcurrency !== undefined) {
+      if (props.provisionedConcurrency < 1 || props.provisionedConcurrency > 200) {
+        errors.push('provisionedConcurrency must be between 1 and 200');
+      }
+      if (props.provisionedConcurrency > props.maxConcurrency) {
+        errors.push('provisionedConcurrency cannot be greater than maxConcurrency');
+      }
+    }
+
+    // check environment compatibility with model
+    const model = props.model;
+    if (!sameEnv(model.env.account, this.env.account)) {
+      errors.push(`Cannot use model in account ${model.env.account} for endpoint configuration in account ${this.env.account}`);
+    } else if (!sameEnv(model.env.region, this.env.region)) {
+      errors.push(`Cannot use model in region ${model.env.region} for endpoint configuration in region ${this.env.region}`);
+    }
+
+    if (errors.length > 0) {
+      throw new Error(`Invalid Serverless Production Variant Props: ${errors.join(EOL)}`);
+    }
+  }
+
+  /**
+   * Render the list of production variants (instance or serverless).
+   */
+  private renderProductionVariants(): CfnEndpointConfig.ProductionVariantProperty[] {
+    this.validateProductionVariants();
+
+    if (this.serverlessProductionVariant) {
+      return this.renderServerlessProductionVariant();
+    } else {
+      return this.renderInstanceProductionVariants();
+    }
+  }
+
   /**
    * Render the list of instance production variants.
    */
   private renderInstanceProductionVariants(): CfnEndpointConfig.ProductionVariantProperty[] {
-    this.validateProductionVariants();
+    if (this._instanceProductionVariants.length === 0) {
+      throw new Error('renderInstanceProductionVariants called but no instance variants are configured');
+    }
+
     return this._instanceProductionVariants.map( v => ({
       acceleratorType: v.acceleratorType?.toString(),
       initialInstanceCount: v.initialInstanceCount,
@@ -324,4 +486,25 @@ export class EndpointConfig extends cdk.Resource implements IEndpointConfig {
       variantName: v.variantName,
     }) );
   }
+
+  /**
+   * Render the serverless production variant.
+   */
+  private renderServerlessProductionVariant(): CfnEndpointConfig.ProductionVariantProperty[] {
+    if (!this.serverlessProductionVariant) {
+      throw new Error('renderServerlessProductionVariant called but no serverless variant is configured');
+    }
+
+    const variant = this.serverlessProductionVariant;
+    return [{
+      initialVariantWeight: variant.initialVariantWeight,
+      modelName: variant.modelName,
+      variantName: variant.variantName,
+      serverlessConfig: {
+        maxConcurrency: variant.maxConcurrency,
+        memorySizeInMb: variant.memorySizeInMB,
+        provisionedConcurrency: variant.provisionedConcurrency,
+      },
+    }];
+  }
 }
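One detail worth noting in `addServerlessProductionVariant`: the default is applied with `props.initialVariantWeight || 1.0`, so an explicit weight of `0` is treated the same as an omitted weight. The sketch below only illustrates that operator behavior next to the nullish-coalescing alternative. `weightWithOr` and `weightWithNullish` are hypothetical names, and whether a zero weight is meaningful for a single serverless variant is not something this commit states.

```typescript
// Illustration only: two ways to default an optional variant weight.
// Neither function is part of the CDK; both names are hypothetical.
function weightWithOr(weight?: number): number {
  // `||` falls back for every falsy value, which includes 0.
  return weight || 1.0;
}

function weightWithNullish(weight?: number): number {
  // `??` falls back only when the value is undefined or null.
  return weight ?? 1.0;
}
```

With these definitions, `weightWithOr(0)` evaluates to `1` while `weightWithNullish(0)` evaluates to `0`. Both return `1` when the argument is omitted.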
