Observability Quickstart

Get Prometheus metrics and Kafka lag monitoring running in 5 minutes.

Quick Start

1. Enable Metrics in Config

Add to your config.yaml:

observability:
  metrics_enabled: true
  metrics_port: 9090
  lag_monitoring_enabled: true
  lag_monitoring_interval_secs: 30

2. Start Streamforge

CONFIG_FILE=config.yaml ./streamforge

You’ll see:

✅ Metrics registered successfully
🔍 Metrics server listening on http://0.0.0.0:9090
   Metrics endpoint: http://localhost:9090/metrics
   Health endpoint:  http://localhost:9090/health
✅ Consumer lag monitoring started (interval: 30s)

3. View Metrics

Browser:

http://localhost:9090/metrics

curl:

curl http://localhost:9090/metrics

Sample Output:

# HELP streamforge_messages_consumed_total Total messages consumed from source Kafka
# TYPE streamforge_messages_consumed_total counter
streamforge_messages_consumed_total 125000

# HELP streamforge_messages_produced_total Messages successfully produced to destinations
# TYPE streamforge_messages_produced_total counter
streamforge_messages_produced_total{destination="premium-events"} 45000
streamforge_messages_produced_total{destination="standard-events"} 80000

# HELP streamforge_consumer_lag Consumer lag per partition
# TYPE streamforge_consumer_lag gauge
streamforge_consumer_lag{topic="input-topic",partition="0"} 1250
streamforge_consumer_lag{topic="input-topic",partition="1"} 890

# HELP streamforge_processing_duration_seconds End-to-end processing latency per destination
# TYPE streamforge_processing_duration_seconds histogram
streamforge_processing_duration_seconds_bucket{destination="premium-events",le="0.001"} 35000
streamforge_processing_duration_seconds_bucket{destination="premium-events",le="0.005"} 43000
streamforge_processing_duration_seconds_bucket{destination="premium-events",le="0.01"} 44500
streamforge_processing_duration_seconds_bucket{destination="premium-events",le="+Inf"} 45000
streamforge_processing_duration_seconds_sum{destination="premium-events"} 67.5
streamforge_processing_duration_seconds_count{destination="premium-events"} 45000
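
To work with this output in a script rather than by eye, the text format can be parsed line by line. The following is a minimal sketch, not part of Streamforge: the `parse_metrics` helper and its regular expressions are illustrative assumptions that cover only the simple `name{labels} value` lines shown above (lines with trailing timestamps are skipped and label values are not unescaped).

```python
import re

# Matches lines such as: name{label="value",...} 123   or   name 123
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse simple exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# A few lines copied from the sample output above.
SAMPLE = """\
# HELP streamforge_consumer_lag Consumer lag per partition
# TYPE streamforge_consumer_lag gauge
streamforge_consumer_lag{topic="input-topic",partition="0"} 1250
streamforge_consumer_lag{topic="input-topic",partition="1"} 890
streamforge_messages_consumed_total 125000
"""

samples = parse_metrics(SAMPLE)
total_lag = sum(v for name, _, v in samples if name == "streamforge_consumer_lag")
print(total_lag)  # 2140.0
```

For anything beyond a quick check, an established Prometheus client library parser is likely a better choice than hand-written regular expressions.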

Prometheus Setup

Add Scrape Config

Edit prometheus.yml:

scrape_configs:
  - job_name: 'streamforge'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
    scrape_timeout: 10s

Start Prometheus

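docker run -d \
  -p 9091:9090 \
  --add-host=host.docker.internal:host-gateway \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus

Note: inside the container, localhost refers to the Prometheus container itself, not to the host running Streamforge. When Prometheus runs in Docker as shown, set the scrape target in prometheus.yml to host.docker.internal:9090 instead of localhost:9090.

Access Prometheus UI: http://localhost:9091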

Quick Queries

Message Throughput

# Messages per second
rate(streamforge_messages_consumed_total[5m])

# Per destination
sum(rate(streamforge_messages_produced_total[5m])) by (destination)

Error Rate

# Errors per second
rate(streamforge_processing_errors_total[5m])

# Error percentage
sum(rate(streamforge_processing_errors_total[5m])) /
sum(rate(streamforge_messages_consumed_total[5m])) * 100
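
As a rough illustration of what these queries compute, the sketch below derives a per-second rate from two counter readings and turns it into an error percentage. The readings are hypothetical and `per_second_rate` is an illustrative helper. PromQL's `rate()` also extrapolates to the window edges and works over every sample in the window, so real results will differ slightly.

```python
def per_second_rate(first, last):
    """Average per-second increase between two (timestamp_secs, counter_value) samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    increase = v1 - v0
    if increase < 0:
        # A counter that went down was reset (e.g. process restart); count from zero.
        increase = v1
    return increase / (t1 - t0)


# Hypothetical counter readings taken 300 seconds (5 minutes) apart.
consumed_rate = per_second_rate((0, 125_000), (300, 155_000))  # 100.0 messages/sec
error_rate = per_second_rate((0, 40), (300, 190))              # 0.5 errors/sec
error_pct = error_rate / consumed_rate * 100
print(f"{error_pct:.2f}%")  # 0.50%
```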

Consumer Lag

# Total lag
sum(streamforge_consumer_lag)

# Per partition
streamforge_consumer_lag

# Lag increasing (alert!)
delta(streamforge_consumer_lag[5m]) > 1000
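
For intuition about what the lag gauge represents, here is a minimal sketch using the usual Kafka definition: lag is the partition's log end offset minus the consumer group's committed offset. Whether Streamforge computes it exactly this way is an assumption. The offsets and the `partition_lag` and `lag_delta` helpers are hypothetical.

```python
def partition_lag(log_end_offset, committed_offset):
    """Lag for one partition: messages written but not yet committed as consumed."""
    return max(log_end_offset - committed_offset, 0)


def lag_delta(window):
    """Last minus first gauge value in a window (a simplified view of delta())."""
    return window[-1] - window[0]


# Hypothetical (log end offset, committed offset) pairs per partition.
offsets = {"0": (51_250, 50_000), "1": (40_890, 40_000)}
lags = {p: partition_lag(end, committed) for p, (end, committed) in offsets.items()}
total_lag = sum(lags.values())
print(lags, total_lag)  # {'0': 1250, '1': 890} 2140

# Hypothetical lag readings across a 5-minute window for one partition.
window = [1_250, 4_000, 9_800, 12_500]
lag_is_growing = lag_delta(window) > 1000
print(lag_is_growing)  # True
```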

Processing Latency

# P99 latency
histogram_quantile(0.99, 
  rate(streamforge_processing_duration_seconds_bucket[5m])
)

# Average latency
rate(streamforge_processing_duration_seconds_sum[5m]) /
rate(streamforge_processing_duration_seconds_count[5m])
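
To see how these numbers come out of histogram buckets, the sketch below applies linear interpolation within buckets, the approach PromQL documents for classic histograms, to the cumulative counts from the sample output. It is a simplified illustration: the real `histogram_quantile` operates on per-bucket rates rather than raw counts, and this `histogram_quantile` Python function is a stand-in, not a Prometheus API.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative bucket counts for destination="premium-events" from the sample output.
buckets = [(0.001, 35000), (0.005, 43000), (0.01, 44500), (math.inf, 45000)]

p50 = histogram_quantile(0.50, buckets)  # about 0.00064 s, inside the first bucket
p95 = histogram_quantile(0.95, buckets)  # about 0.004875 s, inside the 0.001-0.005 bucket
p99 = histogram_quantile(0.99, buckets)  # 0.01 s, capped at the highest finite bound
average = 67.5 / 45000                   # sum / count = 0.0015 s
print(p50, p95, p99, average)
```

The P99 result shows a practical limit: when the requested rank lands in the `+Inf` bucket, the estimate cannot exceed the largest finite bucket boundary, so bucket boundaries need to extend past the latencies you want to alert on.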

Filter Effectiveness

# Pass rate percentage
sum(rate(streamforge_filter_evaluations_total{result="pass"}[5m])) /
sum(rate(streamforge_filter_evaluations_total[5m])) * 100

# Messages filtered out per destination
rate(streamforge_messages_filtered_total[5m])
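
The pass rate is the passing evaluations divided by evaluations across all `result` values, so the denominator has to aggregate every result. A minimal sketch with hypothetical per-second rates follows. Only the `pass` label value appears in the queries above; `fail` is an assumed name used for illustration.

```python
def pass_rate_percent(rates_by_result):
    """Percentage of filter evaluations that passed, given per-result rates."""
    total = sum(rates_by_result.values())
    if total == 0:
        return None  # no evaluations in the window
    return rates_by_result.get("pass", 0.0) / total * 100


# Hypothetical per-second evaluation rates keyed by the result label.
rates = {"pass": 25.0, "fail": 75.0}
print(pass_rate_percent(rates))  # 25.0
```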

Grafana Dashboard

Quick Dashboard JSON

Create a dashboard with these panels:

Panel 1: Message Throughput

{
  "title": "Message Throughput",
  "targets": [{
    "expr": "rate(streamforge_messages_consumed_total[5m])",
    "legendFormat": "Consumed"
  }, {
    "expr": "sum(rate(streamforge_messages_produced_total[5m]))",
    "legendFormat": "Produced"
  }]
}

Panel 2: Consumer Lag

{
  "title": "Consumer Lag by Partition",
  "targets": [{
    "expr": "streamforge_consumer_lag",
    "legendFormat": "{{topic}}-{{partition}}"
  }]
}

Panel 3: Error Rate

{
  "title": "Error Rate",
  "targets": [{
    "expr": "rate(streamforge_processing_errors_total[5m])",
    "legendFormat": ""
  }]
}

Panel 4: Processing Latency

{
  "title": "Processing Latency (P50, P95, P99)",
  "targets": [
    {
      "expr": "histogram_quantile(0.50, rate(streamforge_processing_duration_seconds_bucket[5m]))",
      "legendFormat": "P50"
    },
    {
      "expr": "histogram_quantile(0.95, rate(streamforge_processing_duration_seconds_bucket[5m]))",
      "legendFormat": "P95"
    },
    {
      "expr": "histogram_quantile(0.99, rate(streamforge_processing_duration_seconds_bucket[5m]))",
      "legendFormat": "P99"
    }
  ]
}
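
The three latency targets differ only in the quantile, so they can be generated rather than written by hand. The sketch below is one way to do that. The `latency_panel` helper is hypothetical, and its output is the same simplified panel fragment shown above, not a complete Grafana dashboard model.

```python
import json


def latency_panel(quantiles,
                  metric="streamforge_processing_duration_seconds_bucket",
                  window="5m"):
    """Build a panel fragment with one histogram_quantile target per quantile."""
    targets = [
        {
            "expr": f"histogram_quantile({q:.2f}, rate({metric}[{window}]))",
            "legendFormat": f"P{round(q * 100)}",
        }
        for q in quantiles
    ]
    legends = ", ".join(t["legendFormat"] for t in targets)
    return {"title": f"Processing Latency ({legends})", "targets": targets}


panel = latency_panel([0.50, 0.95, 0.99])
print(json.dumps(panel, indent=2))
```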

Import to Grafana:

# Coming soon: Pre-built dashboard JSON
# Check examples/grafana-dashboard.json

Alerting Rules

prometheus-alerts.yml

groups:
  - name: streamforge_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: StreamforgeHighErrorRate
        expr: rate(streamforge_processing_errors_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in Streamforge"
          description: "Error rate is {{ $value }} errors/sec"

      # Consumer lag increasing
      - alert: StreamforgeConsumerLagIncreasing
        expr: delta(streamforge_consumer_lag[5m]) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag increasing"
          description: "Lag increased by {{ $value }} in 5 minutes"

      # High latency
      - alert: StreamforgeHighLatency
        expr: |
          histogram_quantile(0.99,
            rate(streamforge_processing_duration_seconds_bucket[5m])
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 1 second"

      # Service down
      - alert: StreamforgeDown
        expr: up{job="streamforge"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Streamforge service is down"
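
The `for` field means the expression must stay true across consecutive evaluations for that long before the alert fires, so a brief spike only leaves the alert pending. The sketch below models that behaviour in simplified form. The `first_firing_time` helper and the evaluation values are hypothetical, and details such as the handling of missed evaluations are not modelled.

```python
def first_firing_time(evaluations, threshold, for_secs):
    """Return the timestamp at which the alert would start firing, or None.

    evaluations: (timestamp_secs, value) pairs in time order, one per rule evaluation.
    """
    pending_since = None
    for ts, value in evaluations:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition became true: alert is pending
            if ts - pending_since >= for_secs:
                return ts  # condition held for the full "for" duration
        else:
            pending_since = None  # condition cleared: pending state resets
    return None


# Hypothetical error rates evaluated every 30s against "> 10" with "for: 2m".
evals = [(0, 4), (30, 12), (60, 15), (90, 8),
         (120, 11), (150, 14), (180, 13), (210, 12), (240, 16)]
print(first_firing_time(evals, threshold=10, for_secs=120))  # 240
```

The spike at 30s to 60s never fires because the rate drops back below the threshold at 90s. The alert fires only once the rate has stayed above 10 from 120s through 240s.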

Testing Locally

1. Generate Load

# Terminal 1: Start Streamforge
CONFIG_FILE=examples/config.with-observability.yaml ./streamforge

# Terminal 2: Produce test messages
kafka-console-producer.sh --topic input-topic --bootstrap-server localhost:9092

2. Watch Metrics

# Watch metrics update
watch -n 2 'curl -s http://localhost:9090/metrics | grep streamforge_messages'

# Check specific metric
curl -s http://localhost:9090/metrics | grep streamforge_consumer_lag

3. Verify Lag Monitoring

# Check lag metrics are updating
curl -s http://localhost:9090/metrics | grep consumer_lag

# Example output:
# streamforge_consumer_lag{topic="input-topic",partition="0"} 0
# streamforge_consumer_lag{topic="input-topic",partition="1"} 0

Troubleshooting

Metrics endpoint not accessible

Check if server started:

netstat -an | grep 9090
# Should show: tcp4  0  0  *.9090  *.*  LISTEN

Check logs:

2026-04-03T10:00:00Z INFO streamforge: Metrics server listening on http://0.0.0.0:9090

No lag metrics

Possible causes:

No partitions assigned yet (consumer just started)
Lag monitoring disabled in config
Consumer group has no committed offsets

Check:

# Wait 30 seconds for first lag check
sleep 30

# Check metrics
curl -s http://localhost:9090/metrics | grep consumer_lag

Metrics not updating

Verify:

Messages are being consumed (check logs)
Metrics are being incremented (check counter values)
Prometheus is scraping (check Prometheus UI → Targets)

Next Steps

Full Design Document - Complete metrics reference
Prometheus Documentation
Grafana Dashboard Tutorial
See examples/config.with-observability.yaml for full config example

Summary

You now have:

✅ Prometheus metrics exposed on :9090/metrics
✅ Kafka consumer lag monitoring
✅ Per-destination metrics (throughput, errors, latency)
✅ Filter and transform operation tracking
✅ Health check endpoint

Total setup time: < 5 minutes 🚀