AWS MSK Integration
Monitor managed Kafka broker metrics, partition health, and consumer lag across your MSK clusters. Get predictive lag alerts and AI-powered broker correlation before streaming incidents impact consumers.
How It Works
Enable Enhanced MSK Monitoring
In your MSK cluster settings, set the monitoring level to PER_BROKER or PER_TOPIC_PER_BROKER. PER_BROKER adds broker-level metrics to the default set published to CloudWatch; PER_TOPIC_PER_BROKER adds topic-level metrics on top of that.
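As a rough sketch, the same change can be scripted with the AWS SDK for Python instead of the console. The helper name `enable_enhanced_monitoring` and the `ALLOWED_LEVELS` set are illustrative assumptions, not part of TigerOps; `describe_cluster` and `update_monitoring` are the boto3 `kafka` client calls this relies on.

```python
from typing import Any

# Levels this guide mentions; the set itself is an illustrative assumption.
ALLOWED_LEVELS = {"PER_BROKER", "PER_TOPIC_PER_BROKER"}


def enable_enhanced_monitoring(kafka_client: Any, cluster_arn: str,
                               level: str = "PER_TOPIC_PER_BROKER") -> str:
    """Set the MSK monitoring level and return the cluster operation ARN."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported monitoring level: {level}")
    # update_monitoring needs the cluster's current version string,
    # which describe_cluster returns under ClusterInfo.
    info = kafka_client.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]
    response = kafka_client.update_monitoring(
        ClusterArn=cluster_arn,
        CurrentVersion=info["CurrentVersion"],
        EnhancedMonitoring=level,
    )
    return response["ClusterOperationArn"]
```

To run it against a real cluster, pass `boto3.client("kafka")` as `kafka_client`, assuming boto3 is installed and AWS credentials are configured.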
Deploy CloudWatch Metric Streams
Use the TigerOps CloudFormation template to stream the AWS/Kafka namespace to your TigerOps Firehose endpoint. All broker, topic, and consumer group metrics begin flowing immediately.
Configure Consumer Lag Alerts
Set consumer group lag thresholds per topic in TigerOps. The platform analyzes the lag growth rate and warns you before lag reaches critical levels.
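To make the idea of growth-rate-based alerting concrete, here is a minimal sketch that fits a straight line to recent lag samples and estimates when the threshold would be crossed. This is not TigerOps's actual model, which the page does not describe; the function names and the linear-trend assumption are illustrative only.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, lag in messages)


def lag_growth_rate(samples: Sequence[Sample]) -> float:
    """Least-squares slope of lag over time, in messages per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (lag - mean_lag) for t, lag in samples)
    return cov / var_t


def seconds_until_threshold(samples: Sequence[Sample],
                            threshold: float) -> Optional[float]:
    """Estimated seconds until lag hits threshold; None if lag is not growing."""
    latest_lag = samples[-1][1]
    if latest_lag >= threshold:
        return 0.0
    rate = lag_growth_rate(samples)
    if rate <= 0:
        return None
    return (threshold - latest_lag) / rate


def should_warn(samples: Sequence[Sample], threshold: float,
                horizon_s: float) -> bool:
    """Warn when the projected breach falls within the look-ahead horizon."""
    eta = seconds_until_threshold(samples, threshold)
    return eta is not None and eta <= horizon_s
```

For example, with lag growing by 600 messages per minute from 1,000 to 2,800 and a threshold of 10,000, the sketch projects a breach in roughly 12 minutes, so a 15-minute horizon would trigger a warning while a 5-minute horizon would not.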
Correlate with Producing Services
TigerOps links MSK broker metrics with traces from producer and consumer microservices, giving full context when a throughput drop or partition imbalance triggers an incident.
What You Get Out of the Box
Broker Health Monitoring
Under-replicated partitions, offline partitions, active controller count, and broker disk usage per node. Get alerted the moment replicas on a broker fall out of sync with their partition leaders.
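The checks listed above can be sketched as a small rule set over a metrics snapshot. The `BrokerSnapshot` and `ClusterSnapshot` types, the field names, and the 80% disk threshold are assumptions made for illustration; they do not reflect a TigerOps data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BrokerSnapshot:
    broker_id: int
    under_replicated_partitions: int
    data_disk_used_pct: float


@dataclass
class ClusterSnapshot:
    active_controller_count: int
    offline_partitions_count: int
    brokers: List[BrokerSnapshot]


def health_findings(cluster: ClusterSnapshot,
                    disk_warn_pct: float = 80.0) -> List[str]:
    """Return human-readable findings; an empty list means no rule fired."""
    findings: List[str] = []
    for b in cluster.brokers:
        if b.under_replicated_partitions > 0:
            findings.append(
                f"broker {b.broker_id}: "
                f"{b.under_replicated_partitions} under-replicated partitions")
        if b.data_disk_used_pct >= disk_warn_pct:
            findings.append(
                f"broker {b.broker_id}: disk at {b.data_disk_used_pct:.0f}%")
    if cluster.offline_partitions_count > 0:
        findings.append(
            f"cluster: {cluster.offline_partitions_count} offline partitions")
    # A healthy cluster has exactly one active controller.
    if cluster.active_controller_count != 1:
        findings.append(
            "cluster: expected 1 active controller, "
            f"found {cluster.active_controller_count}")
    return findings
```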
Consumer Group Lag
Per-consumer-group and per-topic lag with historical trend analysis. TigerOps forecasts lag growth and fires early warnings before SLO breaches occur.
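For readers unfamiliar with how per-group, per-topic lag is derived, the sketch below rolls partition-level offsets up into totals. Lag per partition is the log end offset minus the group's committed offset. The `PartitionOffsets` record and function name are hypothetical and only illustrate the arithmetic.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class PartitionOffsets(NamedTuple):
    group: str
    topic: str
    partition: int
    log_end_offset: int
    committed_offset: int


def lag_by_group_and_topic(
    rows: Iterable[PartitionOffsets],
) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Aggregate partition lag into sum and max per (group, topic)."""
    totals: Dict[Tuple[str, str], Dict[str, int]] = defaultdict(
        lambda: {"sum_lag": 0, "max_lag": 0})
    for r in rows:
        # Clamp at zero: a commit can briefly appear ahead of a stale end offset.
        lag = max(r.log_end_offset - r.committed_offset, 0)
        entry = totals[(r.group, r.topic)]
        entry["sum_lag"] += lag
        entry["max_lag"] = max(entry["max_lag"], lag)
    return dict(totals)
```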
Partition-Level Metrics
Bytes in/out per partition, message rates, leader election counts, and log end offset tracking across all topics in your MSK cluster.
Network & Storage I/O
Bytes in/out per broker, replication traffic, and disk write rates. Identify network hotspots and storage saturation risk before they cause broker unavailability.
ZooKeeper & KRaft Health
ZooKeeper request latency and active connections for ZooKeeper-based clusters, plus KRaft controller metrics for clusters running in KRaft mode without a ZooKeeper dependency.
AI Partition Imbalance Detection
TigerOps AI detects when partition leadership is skewed across brokers and correlates imbalances with producer throughput anomalies and consumer lag spikes.
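One simple way to picture leadership skew is to count partition leaders per broker and flag any broker well above the even share. The sketch below is a plain threshold check, not the AI model described here; the function names and the 25% tolerance are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

TopicPartition = Tuple[str, int]


def leader_counts(partition_leaders: Dict[TopicPartition, int],
                  broker_ids: Iterable[int]) -> Dict[int, int]:
    """Count led partitions per broker, including brokers that lead none."""
    counts = Counter({broker: 0 for broker in broker_ids})
    counts.update(partition_leaders.values())
    return dict(counts)


def skewed_brokers(counts: Dict[int, int],
                   tolerance: float = 0.25) -> List[int]:
    """Brokers leading more than (1 + tolerance) times the even share."""
    if not counts:
        return []
    even_share = sum(counts.values()) / len(counts)
    limit = even_share * (1 + tolerance)
    return sorted(b for b, c in counts.items() if c > limit)
```

With six partitions across three brokers, the even share is two leaders each, so a broker leading four partitions would be flagged at the default tolerance.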
CloudFormation Stack for MSK Metric Streams
Deploy the TigerOps CloudFormation stack to start streaming MSK Kafka metrics in minutes.
# TigerOps CloudFormation: MSK Metric Streams
# aws cloudformation deploy \
#   --template-file tigerops-msk-streams.yaml \
#   --stack-name tigerops-msk \
#   --capabilities CAPABILITY_IAM

Parameters:
  TigerOpsApiKey:
    Type: String
    NoEcho: true
  ClusterName:
    Type: String
    Description: MSK cluster name for tagging

Resources:
  TigerOpsMSKStream:
    Type: AWS::CloudWatch::MetricStream
    Properties:
      Name: tigerops-msk-stream
      FirehoseArn: !GetAtt TigerOpsDeliveryStream.Arn
      # MetricStreamRole is referenced here but not defined in this excerpt.
      RoleArn: !GetAtt MetricStreamRole.Arn
      OutputFormat: opentelemetry0.7
      IncludeFilters:
        - Namespace: AWS/Kafka
        - Namespace: AWS/KafkaConnect
      StatisticsConfigurations:
        - AdditionalStatistics:
            - p50
            - p90
            - p99
          IncludeMetrics:
            - Namespace: AWS/Kafka
              MetricName: FetchConsumerTotalTimeMsMean
            - Namespace: AWS/Kafka
              MetricName: ProduceTotalTimeMsMean
  TigerOpsDeliveryStream:
    Type: AWS::KinesisFirehose::DeliveryStream
    Properties:
      # The S3 backup configuration for this destination is not shown in this excerpt.
      HttpEndpointDestinationConfiguration:
        EndpointConfiguration:
          Url: https://ingest.atatus.net/api/v1/cloudwatch
          AccessKey: !Ref TigerOpsApiKey
        RequestConfiguration:
          CommonAttributes:
            - AttributeName: cluster
              AttributeValue: !Ref ClusterName
            - AttributeName: service
              AttributeValue: msk
        RetryOptions:
          DurationInSeconds: 60

# Enable enhanced monitoring on your MSK cluster:
# aws kafka update-monitoring \
#   --cluster-arn <CLUSTER_ARN> \
#   --current-version <VERSION> \
#   --enhanced-monitoring PER_TOPIC_PER_BROKER
Common Questions
What MSK monitoring level should I enable for TigerOps?
TigerOps recommends PER_TOPIC_PER_BROKER for production clusters, because it adds per-topic metrics for each broker. For cost-sensitive environments, PER_BROKER is sufficient to cover broker health and aggregate throughput monitoring.
Does TigerOps support MSK Serverless?
Yes. MSK Serverless publishes metrics to the AWS/Kafka namespace just like provisioned clusters. TigerOps ingests these metrics and provides the same consumer lag and throughput dashboards, adjusted for the serverless billing model.
Can TigerOps monitor MSK Connect connectors?
Yes. MSK Connect worker and connector metrics are published to the AWS/KafkaConnect namespace. TigerOps includes these in your MSK dashboards, covering worker task count, record throughput, and connector error rates.
How does TigerOps handle multi-AZ MSK clusters?
TigerOps uses the BrokerID and AvailabilityZone CloudWatch dimensions to split metrics per broker and AZ. You can visualize cross-AZ replication traffic and per-AZ partition leader distribution on dedicated dashboard panels.
Can I correlate MSK metrics with my application traces?
Yes. TigerOps correlates MSK consumer lag spikes with distributed traces from the consuming service. When lag grows, TigerOps surfaces the corresponding slow consumers, error rates, and slow database queries in a single incident timeline.
Stop Discovering MSK Consumer Lag After the Fact
Broker health monitoring, predictive lag alerts, and AI partition correlation. Deploy in 5 minutes.