
AWS SageMaker Integration

Monitor training job metrics, endpoint invocation latency, and model accuracy across your SageMaker workloads. AI-powered anomaly detection and deployment correlation catch model regressions before they impact production.

Setup

How It Works

01

Create IAM Role for Metric Streams

Provision an IAM role with CloudWatch read permissions and attach the TigerOps-provided policy. The role grants access to SageMaker namespaces in CloudWatch.
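The role from this step can be sketched in CloudFormation as follows. This is an illustrative fragment, not the actual TigerOps-provided policy: the account ID and external ID are placeholders you would replace with the values shown during TigerOps setup.

```yaml
# Sketch only — the trusted account ID and external ID below are
# placeholders, not real TigerOps values.
TigerOpsReadRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Statement:
        - Effect: Allow
          Principal:
            AWS: "arn:aws:iam::111122223333:root"  # placeholder account
          Action: sts:AssumeRole
          Condition:
            StringEquals:
              sts:ExternalId: "YOUR_EXTERNAL_ID"   # placeholder
    Policies:
      - PolicyName: CloudWatchRead
        PolicyDocument:
          Statement:
            - Effect: Allow
              Action:
                - cloudwatch:GetMetricData
                - cloudwatch:ListMetrics
                - tag:GetResources
              Resource: "*"
```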

02

Enable CloudWatch Metric Streams

Deploy the TigerOps CloudFormation stack to create a Metric Stream targeting the AWS/SageMaker namespace. Metrics flow to your TigerOps Firehose endpoint in under two minutes.

03

Tag Endpoints and Training Jobs

Apply resource tags to your SageMaker endpoints and training jobs. TigerOps uses these tags to group metrics by model name, environment, and team for precise alerting.
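As a sketch, tags like these on a CloudFormation-managed endpoint give TigerOps the grouping dimensions described above. The resource and tag values are examples; existing resources can be tagged in place with `aws sagemaker add-tags`.

```yaml
# Tag keys (model, environment, team) match what TigerOps groups by;
# the endpoint config name and tag values are illustrative.
ChurnEndpoint:
  Type: AWS::SageMaker::Endpoint
  Properties:
    EndpointConfigName: churn-predictor-config  # example
    Tags:
      - Key: model
        Value: churn-predictor
      - Key: environment
        Value: production
      - Key: team
        Value: ml-platform
```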

04

Configure Model Performance Alerts

Set SLOs on invocation latency, error rates, and GPU utilization. TigerOps correlates endpoint degradation with recent model deployments and infrastructure changes automatically.
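TigerOps alerts themselves are configured in its UI, but the equivalent of one such latency SLO expressed as a plain CloudWatch alarm looks like the sketch below. The endpoint name and threshold are examples; note that SageMaker reports ModelLatency in microseconds.

```yaml
# Example only — alarm on p99 model latency for one endpoint variant.
HighModelLatencyAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/SageMaker
    MetricName: ModelLatency
    ExtendedStatistic: p99
    Dimensions:
      - Name: EndpointName
        Value: churn-model-prod   # example endpoint
      - Name: VariantName
        Value: AllTraffic
    Period: 60
    EvaluationPeriods: 5
    Threshold: 250000             # microseconds (250 ms)
    ComparisonOperator: GreaterThanThreshold
```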

Capabilities

What You Get Out of the Box

Training Job Metrics

Track training loss, validation accuracy, GPU and CPU utilization, and data download time per training job. Spot underperforming runs before they waste compute budget.

Endpoint Invocation Latency

P50, P90, and P99 invocation latency per endpoint variant. TigerOps alerts when latency deviates from baseline and correlates spikes with concurrent request volume.

Model Accuracy Tracking

Ingest custom model quality metrics from SageMaker Model Monitor and track accuracy drift over time. Receive alerts when data drift exceeds configured thresholds.

Endpoint Auto-Scaling Visibility

Monitor invocation counts, instance counts, and scaling events per endpoint. Understand whether your auto-scaling policy keeps up with traffic demand.
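The scaling events being monitored typically come from an Application Auto Scaling target-tracking policy like the following sketch (endpoint, variant, and target values are examples):

```yaml
# Example target-tracking setup for a SageMaker endpoint variant.
VariantScalingTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: sagemaker
    ResourceId: endpoint/churn-model-prod/variant/AllTraffic  # example
    ScalableDimension: sagemaker:variant:DesiredInstanceCount
    MinCapacity: 2
    MaxCapacity: 10

VariantScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: invocations-per-instance
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref VariantScalingTarget
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: SageMakerVariantInvocationsPerInstance
      TargetValue: 750.0  # example target invocations per instance
```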

GPU & Instance Utilization

Real-time GPU memory utilization, GPU compute utilization, and disk I/O for training instances. Ensure your compute spend maps to actual model training throughput.

AI Deployment Correlation

TigerOps automatically links endpoint latency regressions to new model deployments in your CI/CD pipeline, giving instant rollback context when quality degrades.

Configuration

CloudFormation Stack for SageMaker Metric Streams

Deploy the TigerOps CloudFormation stack to start streaming SageMaker metrics in minutes.

tigerops-sagemaker-streams.yaml
# TigerOps CloudFormation — SageMaker Metric Streams
# aws cloudformation deploy \
#   --template-file tigerops-sagemaker-streams.yaml \
#   --stack-name tigerops-sagemaker \
#   --capabilities CAPABILITY_IAM

Parameters:
  TigerOpsApiKey:
    Type: String
    NoEcho: true
  Environment:
    Type: String
    Default: production

Resources:
  # S3 bucket for Firehose failed-delivery backup (referenced below)
  BackupBucket:
    Type: AWS::S3::Bucket

  TigerOpsFirehoseRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: firehose.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: TigerOpsFirehosePolicy
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - s3:PutObject
                  - s3:GetBucketLocation
                Resource:
                  - !GetAtt BackupBucket.Arn
                  - !Sub "${BackupBucket.Arn}/*"

  # Role that CloudWatch Metric Streams assumes to write into Firehose
  MetricStreamRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: streams.metrics.cloudwatch.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: TigerOpsMetricStreamPolicy
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - firehose:PutRecord
                  - firehose:PutRecordBatch
                Resource: !GetAtt TigerOpsDeliveryStream.Arn

  TigerOpsSageMakerStream:
    Type: AWS::CloudWatch::MetricStream
    Properties:
      Name: tigerops-sagemaker-stream
      FirehoseArn: !GetAtt TigerOpsDeliveryStream.Arn
      RoleArn: !GetAtt MetricStreamRole.Arn
      OutputFormat: opentelemetry0.7
      IncludeFilters:
        - Namespace: AWS/SageMaker
        - Namespace: /aws/sagemaker/TrainingJobs
        - Namespace: /aws/sagemaker/Endpoints
      StatisticsConfigurations:
        - AdditionalStatistics:
            - p50
            - p90
            - p99
          IncludeMetrics:
            - Namespace: AWS/SageMaker
              MetricName: ModelLatency
            - Namespace: AWS/SageMaker
              MetricName: OverheadLatency

  TigerOpsDeliveryStream:
    Type: AWS::KinesisFirehose::DeliveryStream
    Properties:
      HttpEndpointDestinationConfiguration:
        EndpointConfiguration:
          Url: https://ingest.atatus.net/api/v1/cloudwatch
          AccessKey: !Ref TigerOpsApiKey
        # S3Configuration is required; holds records that fail delivery
        S3Configuration:
          RoleARN: !GetAtt TigerOpsFirehoseRole.Arn
          BucketARN: !GetAtt BackupBucket.Arn
        S3BackupMode: FailedDataOnly
        RequestConfiguration:
          CommonAttributes:
            - AttributeName: environment
              AttributeValue: !Ref Environment
            - AttributeName: service
              AttributeValue: sagemaker
        RetryOptions:
          DurationInSeconds: 60

FAQ

Common Questions

Which SageMaker metrics does TigerOps collect?

TigerOps collects the metrics SageMaker publishes to CloudWatch, including Invocations, ModelLatency, OverheadLatency, Invocation4XXErrors, and Invocation5XXErrors from the AWS/SageMaker namespace, plus CPUUtilization, GPUUtilization, MemoryUtilization, and DiskUtilization from the /aws/sagemaker/Endpoints and /aws/sagemaker/TrainingJobs namespaces, covering both training and inference.

Can TigerOps monitor SageMaker Model Monitor quality metrics?

Yes. TigerOps ingests Model Monitor violation reports via an S3 event trigger and displays data quality, model quality, and bias drift metrics alongside your operational endpoint metrics in unified dashboards.
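A minimal sketch of such an S3 event trigger, assuming a Lambda forwarder (the bucket and function names are placeholders, not TigerOps-provided resources):

```yaml
# Sketch only — fires when Model Monitor writes a violations report;
# ForwardToTigerOps is a hypothetical forwarding Lambda.
ModelMonitorReportsBucket:
  Type: AWS::S3::Bucket
  Properties:
    NotificationConfiguration:
      LambdaConfigurations:
        - Event: s3:ObjectCreated:*
          Filter:
            S3Key:
              Rules:
                - Name: suffix
                  Value: constraint_violations.json
          Function: !GetAtt ForwardToTigerOps.Arn  # placeholder
```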

How do I track training job costs in TigerOps?

TigerOps correlates SageMaker training job duration and instance type metrics with AWS Cost and Usage Reports. You can set budget alerts per training pipeline and see cost-per-experiment trends over time.

Does TigerOps support multi-model endpoints?

Yes. Multi-model endpoint metrics are broken down per model variant using the VariantName dimension. TigerOps surfaces per-variant latency, invocation counts, and error rates so you can identify underperforming variants.

How quickly do SageMaker metrics appear in TigerOps after setup?

CloudWatch Metric Streams deliver metrics with approximately 2-3 minutes of latency. After deploying the CloudFormation stack, your first SageMaker metrics appear in TigerOps within 5 minutes.

Get Started

Stop Discovering Model Regressions After the Fact

Training job metrics, endpoint latency monitoring, and AI deployment correlation. Deploy in 5 minutes.