AWS SageMaker Integration
Monitor training job metrics, endpoint invocation latency, and model accuracy across your SageMaker workloads. Get AI-powered anomaly detection and deployment correlation before model regressions impact production.
How It Works
Create IAM Role for Metric Streams
Provision an IAM role with CloudWatch read permissions and attach the TigerOps-provided policy. The role grants access to SageMaker namespaces in CloudWatch.
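As a sketch, a cross-account role like the following grants the CloudWatch read access described above. The account ID and external ID are placeholders — substitute the values shown in your TigerOps settings; the TigerOps-provided policy may grant additional permissions.

```yaml
# Hypothetical sketch — account ID and external ID are placeholders
TigerOpsReadRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            AWS: arn:aws:iam::111111111111:root  # TigerOps account (placeholder)
          Action: sts:AssumeRole
          Condition:
            StringEquals:
              sts:ExternalId: your-tigerops-external-id
    Policies:
      - PolicyName: CloudWatchRead
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - cloudwatch:GetMetricData
                - cloudwatch:ListMetrics
              Resource: "*"
```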
Enable CloudWatch Metric Streams
Deploy the TigerOps CloudFormation stack to create a Metric Stream targeting the AWS/SageMaker namespace. Metrics begin flowing to your TigerOps Firehose endpoint within minutes.
Tag Endpoints and Training Jobs
Apply resource tags to your SageMaker endpoints and training jobs. TigerOps uses these tags to group metrics by model name, environment, and team for precise alerting.
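For example, tags can be applied with the AWS CLI's `aws sagemaker add-tags` command. The ARNs, tag keys, and values below are illustrative — use the keys your team has standardized on:

```shell
# Tag an endpoint so TigerOps can group its metrics by model, environment, and team
aws sagemaker add-tags \
  --resource-arn arn:aws:sagemaker:us-east-1:123456789012:endpoint/churn-model-prod \
  --tags Key=model,Value=churn-model Key=environment,Value=production Key=team,Value=ml-platform

# Tag a training job the same way
aws sagemaker add-tags \
  --resource-arn arn:aws:sagemaker:us-east-1:123456789012:training-job/churn-model-2024-06-01 \
  --tags Key=model,Value=churn-model Key=environment,Value=staging Key=team,Value=ml-platform
```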
Configure Model Performance Alerts
Set SLOs on invocation latency, error rates, and GPU utilization. TigerOps correlates endpoint degradation with recent model deployments and infrastructure changes automatically.
What You Get Out of the Box
Training Job Metrics
Track training loss, validation accuracy, GPU and CPU utilization, and data download time per training job. Spot underperforming runs before they waste compute budget.
Endpoint Invocation Latency
P50, P90, and P99 invocation latency per endpoint variant. TigerOps alerts when latency deviates from baseline and correlates spikes with concurrent request volume.
Model Accuracy Tracking
Ingest custom model quality metrics from SageMaker Model Monitor and track accuracy drift over time. Receive alerts when data drift exceeds configured thresholds.
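Custom quality metrics can also be published directly to CloudWatch, where the Metric Stream picks them up alongside Model Monitor output. The namespace, metric name, and dimension below are illustrative:

```shell
# Publish a custom model-quality metric (namespace and names are placeholders)
aws cloudwatch put-metric-data \
  --namespace Custom/ModelQuality \
  --metric-name accuracy \
  --dimensions EndpointName=churn-model-prod \
  --value 0.93
```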
Endpoint Auto-Scaling Visibility
Monitor invocation counts, instance counts, and scaling events per endpoint. Understand whether your auto-scaling policy keeps up with traffic demand.
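If an endpoint is not yet auto-scaling, a target-tracking policy on invocations per instance is the common starting point. A sketch using the AWS CLI, with illustrative endpoint and variant names and a target value you should tune to your workload:

```shell
# Register the endpoint variant as a scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace sagemaker \
  --resource-id endpoint/churn-model-prod/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --min-capacity 1 \
  --max-capacity 8

# Track ~100 invocations per instance, scaling out as traffic grows
aws application-autoscaling put-scaling-policy \
  --service-namespace sagemaker \
  --resource-id endpoint/churn-model-prod/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --policy-name churn-model-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 100.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    }
  }'
```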
GPU & Instance Utilization
Real-time GPU memory utilization, GPU compute utilization, and disk I/O for training instances. Ensure your compute spend maps to actual model training throughput.
AI Deployment Correlation
TigerOps automatically links endpoint latency regressions to new model deployments in your CI/CD pipeline, giving instant rollback context when quality degrades.
CloudFormation Stack for SageMaker Metric Streams
Deploy the TigerOps CloudFormation stack to start streaming SageMaker metrics in minutes.
# TigerOps CloudFormation — SageMaker Metric Streams
# Deploy with:
#   aws cloudformation deploy \
#     --template-file tigerops-sagemaker-streams.yaml \
#     --stack-name tigerops-sagemaker \
#     --capabilities CAPABILITY_IAM
Parameters:
  TigerOpsApiKey:
    Type: String
    NoEcho: true
  Environment:
    Type: String
    Default: production

Resources:
  # S3 bucket that receives records Firehose could not deliver
  BackupBucket:
    Type: AWS::S3::Bucket

  # Role Firehose assumes to write failed records to the backup bucket
  TigerOpsFirehoseRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: firehose.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: TigerOpsFirehosePolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - s3:PutObject
                  - s3:GetBucketLocation
                Resource:
                  - !GetAtt BackupBucket.Arn
                  - !Sub "${BackupBucket.Arn}/*"

  # Role CloudWatch Metric Streams assumes to write into Firehose
  MetricStreamRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: streams.metrics.cloudwatch.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: MetricStreamFirehosePolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - firehose:PutRecord
                  - firehose:PutRecordBatch
                Resource: !GetAtt TigerOpsDeliveryStream.Arn

  TigerOpsSageMakerStream:
    Type: AWS::CloudWatch::MetricStream
    Properties:
      Name: tigerops-sagemaker-stream
      FirehoseArn: !GetAtt TigerOpsDeliveryStream.Arn
      RoleArn: !GetAtt MetricStreamRole.Arn
      OutputFormat: opentelemetry0.7
      IncludeFilters:
        - Namespace: AWS/SageMaker
        - Namespace: /aws/sagemaker/TrainingJobs
        - Namespace: /aws/sagemaker/Endpoints
      StatisticsConfigurations:
        - AdditionalStatistics:
            - p50
            - p90
            - p99
          IncludeMetrics:
            - Namespace: AWS/SageMaker
              MetricName: ModelLatency
            - Namespace: AWS/SageMaker
              MetricName: OverheadLatency

  TigerOpsDeliveryStream:
    Type: AWS::KinesisFirehose::DeliveryStream
    Properties:
      DeliveryStreamType: DirectPut
      HttpEndpointDestinationConfiguration:
        EndpointConfiguration:
          Url: https://ingest.atatus.net/api/v1/cloudwatch
          AccessKey: !Ref TigerOpsApiKey
        RequestConfiguration:
          CommonAttributes:
            - AttributeName: environment
              AttributeValue: !Ref Environment
            - AttributeName: service
              AttributeValue: sagemaker
        RetryOptions:
          DurationInSeconds: 60
        S3BackupMode: FailedDataOnly
        S3Configuration:
          BucketARN: !GetAtt BackupBucket.Arn
          RoleARN: !GetAtt TigerOpsFirehoseRole.Arn
Common Questions
Which SageMaker metrics does TigerOps collect?
TigerOps collects all metrics published to the AWS/SageMaker CloudWatch namespace, including Invocations, InvocationsPerInstance, ModelLatency, OverheadLatency, Invocation4XXErrors, and Invocation5XXErrors, plus the CPUUtilization, GPUUtilization, and MemoryUtilization metrics that SageMaker publishes to the /aws/sagemaker/TrainingJobs and /aws/sagemaker/Endpoints namespaces.
Can TigerOps monitor SageMaker Model Monitor quality metrics?
Yes. TigerOps ingests Model Monitor violation reports via an S3 event trigger and displays data quality, model quality, and bias drift metrics alongside your operational endpoint metrics in unified dashboards.
How do I track training job costs in TigerOps?
TigerOps correlates SageMaker training job duration and instance type metrics with AWS Cost and Usage Reports. You can set budget alerts per training pipeline and see cost-per-experiment trends over time.
Does TigerOps support multi-model endpoints?
Yes. Multi-model endpoint metrics are broken down per model variant using the VariantName dimension. TigerOps surfaces per-variant latency, invocation counts, and error rates so you can identify underperforming variants.
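The same per-variant breakdown can be reproduced from CloudWatch directly, which is useful for spot-checking what TigerOps displays. Endpoint, variant, and time-range values below are illustrative:

```shell
# Pull per-variant p99 model latency for one endpoint variant
aws cloudwatch get-metric-statistics \
  --namespace AWS/SageMaker \
  --metric-name ModelLatency \
  --dimensions Name=EndpointName,Value=churn-model-prod Name=VariantName,Value=variant-a \
  --start-time 2024-06-01T00:00:00Z \
  --end-time 2024-06-01T01:00:00Z \
  --period 300 \
  --extended-statistics p99
```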
How quickly do SageMaker metrics appear in TigerOps after setup?
CloudWatch Metric Streams deliver metrics with approximately 2-3 minutes of latency. After deploying the CloudFormation stack, your first SageMaker metrics appear in TigerOps within 5 minutes.
Stop Discovering Model Regressions After the Fact
Training job metrics, endpoint latency monitoring, and AI deployment correlation. Deploy in 5 minutes.