Performance Guide

Comprehensive guide for optimizing StreamForge performance.

Table of Contents

Performance Overview

Key Metrics

Metric Typical Value Excellent Value
Throughput 10K-25K msg/s 50K+ msg/s
Latency (p50) 5-10ms <5ms
Latency (p99) 15-30ms <15ms
Memory Usage 50-100MB <50MB
CPU Usage 50-100% 200-400% (multi-core)

Performance Characteristics

Filter Performance:

  • Simple comparison: ~100ns
  • Boolean logic (AND/OR/NOT): ~100-300ns
  • Regular expressions: ~500ns-1µs
  • Array operations: ~1-10µs (size dependent)

Transform Performance:

  • JSON path extraction: ~50-100ns
  • Object construction: ~200-500ns
  • Array mapping: ~1-10µs (size dependent)
  • Arithmetic: ~50ns

Overall Overhead:

  • Per-message processing: ~2-10µs
  • Network I/O: Dominant factor (>99% of time)

Benchmarks

Throughput Tests

Configuration:

  • Message size: 1KB
  • Partitions: 10
  • Replicas: 3
  • Hardware: 4 CPU cores, 8GB RAM

Results:

Scenario Throughput CPU Memory
Simple mirroring (no filter) 45K msg/s 150% 45MB
With simple filter 42K msg/s 180% 48MB
With boolean logic (3 conditions) 38K msg/s 200% 50MB
With regex filter 35K msg/s 220% 52MB
With array operations 30K msg/s 250% 60MB
Multi-destination (5 topics) 40K msg/s 300% 65MB

Latency Tests

Configuration:

  • Message size: 1KB
  • Batch size: 100
  • Linger: 10ms

Results:

Percentile Simple With Filter Multi-Dest
p50 3ms 4ms 5ms
p95 8ms 10ms 12ms
p99 12ms 15ms 20ms
p99.9 25ms 30ms 40ms

Performance Characteristics

Streamforge performance at 10K msg/s baseline workload (1KB messages):

Metric Performance Capability
Throughput 25,000 msg/s High-volume sustained processing
CPU Usage 120% (4 cores) Efficient multi-core utilization
Memory 50MB Minimal memory footprint
Latency (p99) 15ms Consistent low latency
Startup 0.1s Rapid deployment and recovery
Scalability Linear Predictable resource growth

Configuration Tuning

Basic Configuration

Minimal (Low Throughput):

{
  "threads": 2,
  "consumer_properties": {
    "fetch.min.bytes": "1",
    "fetch.wait.max.ms": "100"
  },
  "producer_properties": {
    "batch.size": "16384",
    "linger.ms": "0"
  }
}

Balanced (Recommended):

{
  "threads": 4,
  "consumer_properties": {
    "fetch.min.bytes": "1048576",
    "fetch.wait.max.ms": "500",
    "max.poll.records": "500"
  },
  "producer_properties": {
    "batch.size": "65536",
    "linger.ms": "10",
    "compression.type": "gzip"
  }
}

High Throughput:

{
  "threads": 8,
  "consumer_properties": {
    "fetch.min.bytes": "1048576",
    "fetch.wait.max.ms": "500",
    "max.poll.records": "1000",
    "max.partition.fetch.bytes": "1048576"
  },
  "producer_properties": {
    "batch.size": "131072",
    "linger.ms": "10",
    "buffer.memory": "67108864",
    "compression.type": "snappy",
    "max.in.flight.requests.per.connection": "5"
  }
}

Low Latency:

{
  "threads": 4,
  "consumer_properties": {
    "fetch.min.bytes": "1",
    "fetch.wait.max.ms": "0",
    "max.poll.records": "100"
  },
  "producer_properties": {
    "batch.size": "16384",
    "linger.ms": "0",
    "acks": "1"
  }
}

Thread Configuration

Rule of thumb:

  • Start with: threads = CPU cores
  • Low throughput: threads = 2-4
  • High throughput: threads = CPU cores * 2
  • Very high throughput: threads = CPU cores * 2-4

Testing:

# Measure with different thread counts
for threads in 2 4 8 16; do
  echo "Testing with $threads threads..."
  # Update config and run
  # Monitor throughput
done

Consumer Tuning

fetch.min.bytes:

  • Low latency: 1 (don’t wait for data)
  • Balanced: 1048576 (1MB)
  • High throughput: 2097152 (2MB)

fetch.wait.max.ms:

  • Low latency: 0-100
  • Balanced: 500
  • High throughput: 1000

max.poll.records:

  • Low memory: 100-200
  • Balanced: 500
  • High throughput: 1000-2000

session.timeout.ms:

  • Stable network: 10000 (10s)
  • Unreliable network: 30000 (30s)
  • Very unreliable: 60000 (60s)

Producer Tuning

batch.size:

  • Low latency: 16384 (16KB)
  • Balanced: 65536 (64KB)
  • High throughput: 131072 (128KB)

linger.ms:

  • Low latency: 0-1
  • Balanced: 10
  • High throughput: 20-50

compression.type:

  • Fastest: snappy
  • Balanced: gzip
  • Best compression: zstd
  • None: none

acks:

  • Fastest: 0 (no acknowledgment)
  • Balanced: 1 (leader acknowledgment)
  • Most durable: all (all replicas)

Compression Selection

Benchmarks (1KB messages):

Type Compression Ratio CPU Usage Throughput
None 1.0x Low 50K msg/s
Snappy 2.5x Medium 45K msg/s
Gzip 4.0x High 35K msg/s
Zstd 4.5x Medium-High 40K msg/s

Recommendations:

  • Network bandwidth limited → Use zstd or gzip
  • CPU limited → Use snappy or none
  • Balanced → Use snappy
  • Storage limited → Use zstd

Best Practices

1. Filter Optimization

❌ Inefficient:

{
  "filter": "REGEX:/message,.*complex.*pattern.*with.*many.*terms.*"
}

✅ Efficient:

{
  "filter": "AND:/message/type,==,complex:/message/hasPattern,==,true"
}

Guidelines:

  • Use simple comparisons when possible
  • Avoid complex regex patterns
  • Put cheaper filters first in AND logic
  • Use NOT sparingly (still evaluates inner filter)

2. Transform Optimization

❌ Inefficient:

{
  "transform": "CONSTRUCT:f1=/a/b/c/d/e:f2=/a/b/c/d/f:f3=/a/b/c/d/g"
}

✅ Efficient:

{
  "transform": "/a/b/c/d"
}

Guidelines:

  • Extract parent object when possible
  • Avoid redundant field extraction
  • Use array operations efficiently
  • Minimize arithmetic operations

3. Partitioning Strategy

Hash Partitioning (Default):

{
  "partition": null
}
  • Pros: Even distribution
  • Cons: No ordering guarantees
  • Use: When order doesn’t matter

Field Partitioning:

{
  "partition": "/userId"
}
  • Pros: Maintains ordering per key
  • Cons: Potential hotspots
  • Use: When ordering important

Hotspot Prevention:

{
  "filter": "NOT:/userId,==,very-active-user"
}
  • Filter out high-volume keys
  • Use separate topics for hot keys
  • Monitor partition distribution

4. Multi-Destination Efficiency

❌ Inefficient:

{
  "destinations": [
    {"filter": "REGEX:/type,.*"},
    {"filter": "REGEX:/type,.*"},
    {"filter": "REGEX:/type,.*"}
  ]
}

✅ Efficient:

{
  "destinations": [
    {"filter": "/type,==,a"},
    {"filter": "/type,==,b"},
    {"filter": "/type,==,c"}
  ]
}

Guidelines:

  • Limit destinations to <10 for best performance
  • Use mutually exclusive filters when possible
  • Order by match probability (most likely first)
  • Combine related destinations

5. Resource Management

Memory:

{
  "consumer_properties": {
    "max.poll.records": "500",
    "fetch.max.bytes": "52428800"
  },
  "producer_properties": {
    "buffer.memory": "33554432"
  }
}

CPU:

  • Match threads to available cores
  • Leave 1-2 cores for OS
  • Monitor CPU saturation
  • Use CPU affinity in containers

Network:

  • Compression for bandwidth-limited networks
  • Increase batch sizes for high-latency networks
  • Use local Kafka clusters when possible
  • Monitor network saturation

6. Container Deployment

Docker Resource Limits:

docker run -d \
  --cpus="4" \
  --memory="512m" \
  --memory-reservation="256m" \
  streamforge:latest

Kubernetes Resource Limits:

resources:
  requests:
    memory: "256Mi"
    cpu: "1000m"
  limits:
    memory: "512Mi"
    cpu: "4000m"

Monitoring

Built-in Metrics

The application reports metrics every 10 seconds:

Stats: processed=10000 (1000.0/s), filtered=100 (10.0/s),
       completed=9900 (990.0/s), errors=0 (0.0/s)

Key Metrics:

  • processed: Total messages read
  • filtered: Messages rejected by filters
  • completed: Messages successfully sent
  • errors: Failed sends

Rates:

  • Monitor completed/s for throughput
  • Watch errors/s for issues
  • Check filtered/s for filter effectiveness

System Metrics

CPU:

# Overall CPU
top -p $(pgrep streamforge)

# Per-thread CPU
ps -eLo pid,tid,pcpu,comm | grep streamforge

Memory:

# Memory usage
ps aux | grep streamforge

# Detailed memory
pmap $(pgrep streamforge)

Network:

# Network traffic
iftop -f "port 9092"

# Per-process
nethogs

Kafka Metrics

Consumer Lag:

kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --group streamforge \
  --describe

Topic Metrics:

kafka-run-class.sh kafka.tools.JmxTool \
  --object-name kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec

Alerting

Key Alerts:

  1. Consumer lag > 10000 messages
  2. Error rate > 1%
  3. Throughput dropped > 50%
  4. CPU usage > 90%
  5. Memory usage > 80%

Troubleshooting

Low Throughput

Symptoms:

  • Throughput < expected
  • CPU usage < 50%

Diagnosis:

# Check consumer lag
kafka-consumer-groups.sh --describe

# Check producer metrics
# Enable debug logging
RUST_LOG=debug

Solutions:

  1. Increase thread count
  2. Increase batch size
  3. Increase linger.ms
  4. Check network latency
  5. Verify partition count

High CPU Usage

Symptoms:

  • CPU usage > 90%
  • Throughput plateaued

Diagnosis:

# CPU profiling
perf record -p $(pgrep streamforge)
perf report

# Check filter complexity
# Review regex patterns

Solutions:

  1. Reduce thread count
  2. Simplify filters
  3. Optimize regex patterns
  4. Reduce destinations
  5. Scale horizontally

High Memory Usage

Symptoms:

  • Memory usage > expected
  • OOM errors

Diagnosis:

# Memory profiling
valgrind --tool=massif ./streamforge

# Check message sizes
# Review batch sizes

Solutions:

  1. Reduce max.poll.records
  2. Reduce buffer.memory
  3. Reduce fetch.max.bytes
  4. Check for memory leaks
  5. Increase container limits

High Latency

Symptoms:

  • p99 latency > 50ms
  • Slow message delivery

Diagnosis:

# Network latency
ping kafka-broker

# Kafka latency
kafka-run-class.sh kafka.tools.JmxTool

Solutions:

  1. Reduce linger.ms
  2. Reduce batch.size
  3. Set fetch.wait.max.ms=0
  4. Use acks=1
  5. Optimize network path

Advanced Optimization

CPU Pinning

# Pin to specific CPUs
taskset -c 0-3 ./streamforge

# Docker with CPU affinity
docker run --cpuset-cpus="0-3" streamforge:latest

Huge Pages

# Enable huge pages
echo 512 > /proc/sys/vm/nr_hugepages

# Run with huge pages
MALLOC_MMAP_THRESHOLD_=131072 ./streamforge

Network Optimization

# Increase socket buffers
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# TCP tuning
sysctl -w net.ipv4.tcp_window_scaling=1
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

Profiling

CPU Profiling:

# Install perf
# Start profiling
perf record -g -p $(pgrep streamforge)

# Generate flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

Memory Profiling:

# Using valgrind
valgrind --tool=massif --massif-out-file=massif.out ./streamforge

# Analyze
ms_print massif.out

Load Testing

Generate Load:

# Using kafka-producer-perf-test
kafka-producer-perf-test.sh \
  --topic test \
  --num-records 1000000 \
  --record-size 1024 \
  --throughput 10000 \
  --producer-props bootstrap.servers=kafka:9092

Measure Performance:

# Monitor throughput
watch -n 1 'docker logs mirrormaker 2>&1 | tail -1'

# Measure latency
kafka-consumer-perf-test.sh \
  --topic output \
  --bootstrap-server kafka:9092 \
  --messages 100000

Performance Checklist

Pre-Production

  • Benchmark with production-like data
  • Load test at 2x expected throughput
  • Verify latency under load
  • Test failure scenarios
  • Profile CPU and memory usage
  • Validate filter performance
  • Check network bandwidth
  • Monitor consumer lag
  • Test with different thread counts
  • Verify compression benefits

Production

  • Set up monitoring
  • Configure alerting
  • Tune based on metrics
  • Monitor consumer lag
  • Track error rates
  • Review logs regularly
  • Plan for scaling
  • Document configuration
  • Set resource limits
  • Regular performance reviews

Summary

Quick Wins

  1. Enable compression → 2-4x bandwidth reduction
  2. Tune thread count → Match CPU cores
  3. Optimize batch size → Balance latency/throughput
  4. Simplify filters → Use simple comparisons
  5. Monitor metrics → Identify bottlenecks

Performance Targets

Environment Throughput Latency p99 CPU Memory
Development 5K msg/s 50ms <100% <100MB
Staging 15K msg/s 30ms <200% <150MB
Production 25K+ msg/s 15ms <400% <200MB

Next Steps

  • Test with your specific workload
  • Measure before optimizing
  • Optimize bottlenecks first
  • Monitor continuously
  • Iterate and improve

For more information:


Back to top

StreamForge — selective replication for Kafka, with Redpanda as a compatibility target. Apache 2.0 Licensed.

This site uses Just the Docs, a documentation theme for Jekyll.