Cloud | CloudWatch Metric Streams + IAM

AWS MSK Integration

Monitor managed Kafka broker metrics, partition health, and consumer lag across your MSK clusters. Get predictive lag alerts and AI-powered broker correlation before streaming incidents impact consumers.

Setup

How It Works

01

Enable Enhanced MSK Monitoring

In your MSK cluster settings, set the monitoring level to PER_BROKER or PER_TOPIC_PER_BROKER. These levels publish broker-level and topic-level metrics to CloudWatch in addition to the DEFAULT set.
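
As a rough illustration of choosing a level, the sketch below picks the lowest MSK monitoring level that covers the granularity you need and builds the matching `aws kafka update-monitoring` command. The four level names are the ones MSK defines; the `REQUIRED_LEVEL` mapping and both function names are hypothetical simplifications for this example, not part of TigerOps or the AWS CLI.

```python
# MSK monitoring levels, ordered from least to most granular.
LEVELS = ["DEFAULT", "PER_BROKER", "PER_TOPIC_PER_BROKER", "PER_TOPIC_PER_PARTITION"]

# Hypothetical, simplified mapping from the granularity a dashboard needs
# to the lowest level that provides it.
REQUIRED_LEVEL = {
    "cluster": "DEFAULT",
    "broker": "PER_BROKER",
    "topic": "PER_TOPIC_PER_BROKER",
    "partition": "PER_TOPIC_PER_PARTITION",
}


def lowest_sufficient_level(granularities):
    """Return the lowest monitoring level covering every requested granularity."""
    needed = [LEVELS.index(REQUIRED_LEVEL[g]) for g in granularities]
    return LEVELS[max(needed, default=0)]


def update_monitoring_command(cluster_arn, current_version, level):
    """Build the argv for `aws kafka update-monitoring` without running it."""
    if level not in LEVELS:
        raise ValueError(f"unknown monitoring level: {level}")
    return [
        "aws", "kafka", "update-monitoring",
        "--cluster-arn", cluster_arn,
        "--current-version", current_version,
        "--enhanced-monitoring", level,
    ]
```

For example, `lowest_sufficient_level(["broker", "topic"])` selects PER_TOPIC_PER_BROKER, which can then be passed to `update_monitoring_command` together with your cluster ARN and current cluster version.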

02

Deploy CloudWatch Metric Streams

Use the TigerOps CloudFormation template to stream the AWS/Kafka namespace to your TigerOps Firehose endpoint. Broker, topic, and consumer group metrics begin flowing as soon as the stack is deployed.

03

Configure Consumer Lag Alerts

Set consumer group lag thresholds per topic in TigerOps. The platform analyzes how fast lag is growing and warns you before it reaches critical levels.
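
One way to picture growth-rate-based alerting is a simple linear extrapolation: fit a slope to recent lag samples and estimate how long until the threshold is crossed. The sketch below is an assumption about the general technique, not TigerOps's actual model; the function names and the `horizon_seconds` parameter are hypothetical.

```python
def seconds_until_threshold(samples, threshold):
    """Estimate seconds until lag crosses `threshold`.

    `samples` is a list of (timestamp_seconds, lag) pairs, oldest first.
    Returns 0.0 if lag is already at or above the threshold, and None when
    lag is flat or shrinking or there are too few samples to fit a slope.
    """
    if not samples:
        return None
    _, latest_lag = samples[-1]
    if latest_lag >= threshold:
        return 0.0
    if len(samples) < 2:
        return None

    # Least-squares slope of lag over time (messages per second).
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (lag - mean_lag) for t, lag in samples) / var_t
    if slope <= 0:
        return None
    return (threshold - latest_lag) / slope


def should_warn(samples, threshold, horizon_seconds):
    """Warn when the projected threshold crossing falls within the horizon."""
    eta = seconds_until_threshold(samples, threshold)
    return eta is not None and eta <= horizon_seconds
```

With lag samples taken one minute apart, a group growing by 100 messages per minute from a lag of 300 is projected to reach a threshold of 600 in about three minutes, so a five-minute horizon would raise an early warning.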

04

Correlate with Producing Services

TigerOps links MSK broker metrics with traces from producer and consumer microservices, giving full context when a throughput drop or partition imbalance triggers an incident.

Capabilities

What You Get Out of the Box

Broker Health Monitoring

Under-replicated partitions, offline partitions, active controller count, and broker disk usage per node. Get alerted the moment replicas on a broker fall out of sync.
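
The checks behind these signals can be sketched in a few lines. The metric names below (`ActiveControllerCount`, `OfflinePartitionsCount`, `UnderReplicatedPartitions`, `KafkaDataLogsDiskUsed`) are AWS/Kafka CloudWatch metrics; the input layout, the function name, and the 85% disk limit are assumptions made for this example.

```python
def broker_health_issues(cluster, brokers, disk_limit_pct=85.0):
    """Return a list of human-readable health issues.

    `cluster` maps cluster-level metric names to their latest values.
    `brokers` maps broker id -> {metric name: latest value}.
    `disk_limit_pct` is an assumed alerting threshold, not an AWS default.
    """
    issues = []

    # A healthy cluster has exactly one active controller.
    controllers = cluster.get("ActiveControllerCount", 0)
    if controllers != 1:
        issues.append(f"cluster: expected 1 active controller, found {controllers}")

    offline = cluster.get("OfflinePartitionsCount", 0)
    if offline > 0:
        issues.append(f"cluster: {offline} offline partitions")

    for broker_id in sorted(brokers):
        metrics = brokers[broker_id]
        urp = metrics.get("UnderReplicatedPartitions", 0)
        if urp > 0:
            issues.append(f"broker {broker_id}: {urp} under-replicated partitions")
        disk = metrics.get("KafkaDataLogsDiskUsed", 0.0)
        if disk >= disk_limit_pct:
            issues.append(f"broker {broker_id}: data disk {disk:.0f}% used")

    return issues
```

An empty list means none of the four checks fired; each entry otherwise names the cluster or the broker that needs attention.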

Consumer Group Lag

Per-consumer-group and per-topic lag with historical trend analysis. TigerOps forecasts lag growth and fires early warnings before SLO breaches occur.
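
For reference, consumer lag on a partition is the log end offset minus the group's committed offset. The sketch below aggregates that into per-topic sum and max values, similar in spirit to the `SumOffsetLag` and `MaxOffsetLag` metrics MSK publishes. The function name and input layout are hypothetical, and skipping partitions that have no committed offset is a simplifying assumption.

```python
def consumer_group_lag(log_end_offsets, committed_offsets):
    """Aggregate one consumer group's lag per topic.

    Both arguments map (topic, partition) -> offset.
    Returns {topic: {"sum": total_lag, "max": largest_partition_lag}}.
    """
    per_topic = {}
    for (topic, partition), end_offset in log_end_offsets.items():
        committed = committed_offsets.get((topic, partition))
        if committed is None:
            # No committed offset yet: lag is left undefined in this sketch.
            continue
        lag = max(end_offset - committed, 0)
        stats = per_topic.setdefault(topic, {"sum": 0, "max": 0})
        stats["sum"] += lag
        stats["max"] = max(stats["max"], lag)
    return per_topic
```

Tracking both values is useful: the sum shows overall backlog for a topic, while the max points at a single slow or stuck partition.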

Partition-Level Metrics

Bytes in/out per partition, message rates, leader election counts, and log end offset tracking across all topics in your MSK cluster.

Network & Storage I/O

Bytes in/out per broker, replication traffic, and disk write rates. Identify network hotspots and storage saturation risk before they cause broker unavailability.

ZooKeeper & KRaft Health

ZooKeeper request latency and active connections for ZooKeeper-based clusters, plus KRaft controller metrics for clusters running in KRaft mode (supported on MSK from Kafka 3.7) without a ZooKeeper dependency.

AI Partition Imbalance Detection

TigerOps AI detects when partition leadership is skewed across brokers and correlates imbalances with producer throughput anomalies and consumer lag spikes.
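
A basic version of leadership skew detection compares the busiest broker's leader count with the cluster average. The sketch below shows that ratio; it is a plain heuristic used here for illustration, not the TigerOps AI model, and the function names and the 1.5 limit are assumptions.

```python
from collections import Counter


def leader_skew(partition_leaders, broker_ids):
    """Return (leaders per broker, skew ratio).

    `partition_leaders` maps (topic, partition) -> leader broker id.
    `broker_ids` lists every broker, so brokers leading nothing count as 0.
    The skew ratio is max leader count divided by the mean; 1.0 is balanced.
    """
    counts = Counter({broker_id: 0 for broker_id in broker_ids})
    counts.update(partition_leaders.values())
    total = sum(counts.values())
    if total == 0:
        return dict(counts), 1.0
    mean = total / len(counts)
    return dict(counts), max(counts.values()) / mean


def is_imbalanced(skew, limit=1.5):
    """Flag clusters whose busiest broker leads `limit` times the average."""
    return skew >= limit
```

In a three-broker cluster where one broker leads four of six partitions, the ratio is 2.0, which this heuristic would flag for investigation alongside throughput and lag data.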

Configuration

CloudFormation Stack for MSK Metric Streams

Deploy the TigerOps CloudFormation stack to start streaming MSK Kafka metrics in minutes.

tigerops-msk-streams.yaml
# TigerOps CloudFormation — MSK Metric Streams
# aws cloudformation deploy \
#   --template-file tigerops-msk-streams.yaml \
#   --stack-name tigerops-msk \
#   --capabilities CAPABILITY_IAM \
#   --parameter-overrides TigerOpsApiKey=<API_KEY> ClusterName=<CLUSTER_NAME>

Parameters:
  TigerOpsApiKey:
    Type: String
    NoEcho: true
  ClusterName:
    Type: String
    Description: MSK cluster name for tagging

Resources:
  TigerOpsMSKStream:
    Type: AWS::CloudWatch::MetricStream
    Properties:
      Name: tigerops-msk-stream
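      FirehoseArn: !GetAtt TigerOpsDeliveryStream.Arn
      # MetricStreamRole is not defined in this excerpt. The stack needs an IAM
      # role that CloudWatch Metric Streams can assume, with firehose:PutRecord
      # and firehose:PutRecordBatch allowed on the delivery stream.
      RoleArn: !GetAtt MetricStreamRole.Arn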
      OutputFormat: opentelemetry0.7
      IncludeFilters:
        - Namespace: AWS/Kafka
        - Namespace: AWS/KafkaConnect
      StatisticsConfigurations:
        - AdditionalStatistics:
            - p50
            - p90
            - p99
          IncludeMetrics:
            - Namespace: AWS/Kafka
              MetricName: FetchConsumerTotalTimeMsMean
            - Namespace: AWS/Kafka
              MetricName: ProduceTotalTimeMsMean

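  TigerOpsDeliveryStream:
    Type: AWS::KinesisFirehose::DeliveryStream
    Properties:
      # Not shown in this excerpt: an HTTP endpoint destination also requires
      # an S3Configuration (bucket and IAM role) for records that fail delivery.
      HttpEndpointDestinationConfiguration: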
        EndpointConfiguration:
          Url: https://ingest.atatus.net/api/v1/cloudwatch
          AccessKey: !Ref TigerOpsApiKey
        RequestConfiguration:
          CommonAttributes:
            - AttributeName: cluster
              AttributeValue: !Ref ClusterName
            - AttributeName: service
              AttributeValue: msk
        RetryOptions:
          DurationInSeconds: 60

# Enable enhanced monitoring on your MSK cluster:
# aws kafka update-monitoring \
#   --cluster-arn <CLUSTER_ARN> \
#   --current-version <VERSION> \
#   --enhanced-monitoring PER_TOPIC_PER_BROKER
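
To make the data path concrete, the sketch below shows what a Firehose HTTP endpoint delivery request looks like from the receiving side: a JSON envelope with base64-encoded records, plus the `cluster` and `service` values from `CommonAttributes` in the `X-Amz-Firehose-Common-Attributes` header. It illustrates the Firehose request format only; it is not TigerOps's ingestion code, and the function names are hypothetical. The decoded payloads are left as raw bytes because the stream above uses the binary `opentelemetry0.7` output format.

```python
import base64
import json
import time


def parse_firehose_request(headers, body):
    """Return (request_id, common_attributes, raw_record_payloads)."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    attrs_json = lowered.get("x-amz-firehose-common-attributes", "{}")
    common = json.loads(attrs_json).get("commonAttributes", {})

    envelope = json.loads(body)
    payloads = [base64.b64decode(record["data"]) for record in envelope["records"]]
    return envelope["requestId"], common, payloads


def acknowledge(request_id):
    """Build the JSON body Firehose expects in a successful (HTTP 200) response."""
    return json.dumps({"requestId": request_id, "timestamp": int(time.time() * 1000)})
```

The endpoint has to echo the same `requestId` back in its response; otherwise Firehose treats the delivery as failed and retries for up to the configured `RetryOptions` duration.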

FAQ

Common Questions

What MSK monitoring level should I enable for TigerOps?

TigerOps recommends PER_TOPIC_PER_BROKER for production clusters. This level adds per-topic metrics for each broker on top of the broker-level metrics. For cost-sensitive environments, PER_BROKER is sufficient to cover broker health and aggregate throughput monitoring.

Does TigerOps support MSK Serverless?

Yes. MSK Serverless publishes metrics to the AWS/Kafka namespace just like provisioned clusters. TigerOps ingests these metrics and provides the same consumer lag and throughput dashboards, adjusted for the serverless billing model.

Can TigerOps monitor MSK Connect connectors?

Yes. MSK Connect worker and connector metrics are published to the AWS/KafkaConnect namespace. TigerOps includes these in your MSK dashboards, covering worker task count, record throughput, and connector error rates.

How does TigerOps handle multi-AZ MSK clusters?

TigerOps uses the BrokerID and AvailabilityZone CloudWatch dimensions to split metrics per broker and AZ. You can visualize cross-AZ replication traffic and per-AZ partition leader distribution on dedicated dashboard panels.

Can I correlate MSK metrics with my application traces?

Yes. TigerOps correlates MSK consumer lag spikes with distributed traces from the consuming service. When lag grows, TigerOps surfaces the corresponding slow consumers, error rates, and slow database queries in a single incident timeline.

Get Started

Stop Discovering MSK Consumer Lag After the Fact

Broker health monitoring, predictive lag alerts, and AI partition correlation. Deploy in 5 minutes.