Performance Guide

Comprehensive guide for optimizing StreamForge performance.

Performance Overview
Benchmarks
Configuration Tuning
Best Practices
Monitoring
Troubleshooting
Advanced Optimization

Performance Overview

Key Metrics

Metric	Typical Value	Excellent Value
Throughput	10K-25K msg/s	50K+ msg/s
Latency (p50)	5-10ms	<5ms
Latency (p99)	15-30ms	<15ms
Memory Usage	50-100MB	<50MB
CPU Usage	50-100%	200-400% (multi-core)

Performance Characteristics

Filter Performance:

Simple comparison: ~100ns
Boolean logic (AND/OR/NOT): ~100-300ns
Regular expressions: ~500ns-1µs
Array operations: ~1-10µs (size dependent)

Transform Performance:

JSON path extraction: ~50-100ns
Object construction: ~200-500ns
Array mapping: ~1-10µs (size dependent)
Arithmetic: ~50ns

Overall Overhead:

Per-message processing: ~2-10µs
Network I/O: Dominant factor (>99% of time)

Benchmarks

Throughput Tests

Configuration:

Message size: 1KB
Partitions: 10
Replicas: 3
Hardware: 4 CPU cores, 8GB RAM

Results:

Scenario	Throughput	CPU	Memory
Simple mirroring (no filter)	45K msg/s	150%	45MB
With simple filter	42K msg/s	180%	48MB
With boolean logic (3 conditions)	38K msg/s	200%	50MB
With regex filter	35K msg/s	220%	52MB
With array operations	30K msg/s	250%	60MB
Multi-destination (5 topics)	40K msg/s	300%	65MB

Latency Tests

Configuration:

Message size: 1KB
Batch size: 100
Linger: 10ms

Results:

Percentile	Simple	With Filter	Multi-Dest
p50	3ms	4ms	5ms
p95	8ms	10ms	12ms
p99	12ms	15ms	20ms
p99.9	25ms	30ms	40ms

Performance Characteristics

Streamforge performance at 10K msg/s baseline workload (1KB messages):

Metric	Performance	Capability
Throughput	25,000 msg/s	High-volume sustained processing
CPU Usage	120% (4 cores)	Efficient multi-core utilization
Memory	50MB	Minimal memory footprint
Latency (p99)	15ms	Consistent low latency
Startup	0.1s	Rapid deployment and recovery
Scalability	Linear	Predictable resource growth

Configuration Tuning

Basic Configuration

Minimal (Low Throughput):

{
  "threads": 2,
  "consumer_properties": {
    "fetch.min.bytes": "1",
    "fetch.wait.max.ms": "100"
  },
  "producer_properties": {
    "batch.size": "16384",
    "linger.ms": "0"
  }
}

Balanced (Recommended):

{
  "threads": 4,
  "consumer_properties": {
    "fetch.min.bytes": "1048576",
    "fetch.wait.max.ms": "500",
    "max.poll.records": "500"
  },
  "producer_properties": {
    "batch.size": "65536",
    "linger.ms": "10",
    "compression.type": "gzip"
  }
}

High Throughput:

{
  "threads": 8,
  "consumer_properties": {
    "fetch.min.bytes": "1048576",
    "fetch.wait.max.ms": "500",
    "max.poll.records": "1000",
    "max.partition.fetch.bytes": "1048576"
  },
  "producer_properties": {
    "batch.size": "131072",
    "linger.ms": "10",
    "buffer.memory": "67108864",
    "compression.type": "snappy",
    "max.in.flight.requests.per.connection": "5"
  }
}

Low Latency:

{
  "threads": 4,
  "consumer_properties": {
    "fetch.min.bytes": "1",
    "fetch.wait.max.ms": "0",
    "max.poll.records": "100"
  },
  "producer_properties": {
    "batch.size": "16384",
    "linger.ms": "0",
    "acks": "1"
  }
}

Thread Configuration

Rule of thumb:

Start with: threads = CPU cores
Low throughput: threads = 2-4
High throughput: threads = CPU cores * 2
Very high throughput: threads = CPU cores * 2-4

Testing:

# Measure with different thread counts
for threads in 2 4 8 16; do
  echo "Testing with $threads threads..."
  # Update config and run
  # Monitor throughput
done

Consumer Tuning

fetch.min.bytes:

Low latency: 1 (don’t wait for data)
Balanced: 1048576 (1MB)
High throughput: 2097152 (2MB)

fetch.wait.max.ms:

Low latency: 0-100
Balanced: 500
High throughput: 1000

max.poll.records:

Low memory: 100-200
Balanced: 500
High throughput: 1000-2000

session.timeout.ms:

Stable network: 10000 (10s)
Unreliable network: 30000 (30s)
Very unreliable: 60000 (60s)

Producer Tuning

batch.size:

Low latency: 16384 (16KB)
Balanced: 65536 (64KB)
High throughput: 131072 (128KB)

linger.ms:

Low latency: 0-1
Balanced: 10
High throughput: 20-50

compression.type:

Fastest: snappy
Balanced: gzip
Best compression: zstd
None: none

acks:

Fastest: 0 (no acknowledgment)
Balanced: 1 (leader acknowledgment)
Most durable: all (all replicas)

Compression Selection

Benchmarks (1KB messages):

Type	Compression Ratio	CPU Usage	Throughput
None	1.0x	Low	50K msg/s
Snappy	2.5x	Medium	45K msg/s
Gzip	4.0x	High	35K msg/s
Zstd	4.5x	Medium-High	40K msg/s

Recommendations:

Network bandwidth limited → Use zstd or gzip
CPU limited → Use snappy or none
Balanced → Use snappy
Storage limited → Use zstd

Best Practices

1. Filter Optimization

❌ Inefficient:

{
  "filter": "REGEX:/message,.*complex.*pattern.*with.*many.*terms.*"
}

✅ Efficient:

{
  "filter": "AND:/message/type,==,complex:/message/hasPattern,==,true"
}

Guidelines:

Use simple comparisons when possible
Avoid complex regex patterns
Put cheaper filters first in AND logic
Use NOT sparingly (still evaluates inner filter)

2. Transform Optimization

❌ Inefficient:

{
  "transform": "CONSTRUCT:f1=/a/b/c/d/e:f2=/a/b/c/d/f:f3=/a/b/c/d/g"
}

✅ Efficient:

{
  "transform": "/a/b/c/d"
}

Guidelines:

Extract parent object when possible
Avoid redundant field extraction
Use array operations efficiently
Minimize arithmetic operations

3. Partitioning Strategy

Hash Partitioning (Default):

{
  "partition": null
}

Pros: Even distribution
Cons: No ordering guarantees
Use: When order doesn’t matter

Field Partitioning:

{
  "partition": "/userId"
}

Pros: Maintains ordering per key
Cons: Potential hotspots
Use: When ordering important

Hotspot Prevention:

{
  "filter": "NOT:/userId,==,very-active-user"
}

Filter out high-volume keys
Use separate topics for hot keys
Monitor partition distribution

4. Multi-Destination Efficiency

❌ Inefficient:

{
  "destinations": [
    {"filter": "REGEX:/type,.*"},
    {"filter": "REGEX:/type,.*"},
    {"filter": "REGEX:/type,.*"}
  ]
}

✅ Efficient:

{
  "destinations": [
    {"filter": "/type,==,a"},
    {"filter": "/type,==,b"},
    {"filter": "/type,==,c"}
  ]
}

Guidelines:

Limit destinations to <10 for best performance
Use mutually exclusive filters when possible
Order by match probability (most likely first)
Combine related destinations

5. Resource Management

Memory:

{
  "consumer_properties": {
    "max.poll.records": "500",
    "fetch.max.bytes": "52428800"
  },
  "producer_properties": {
    "buffer.memory": "33554432"
  }
}

CPU:

Match threads to available cores
Leave 1-2 cores for OS
Monitor CPU saturation
Use CPU affinity in containers

Network:

Compression for bandwidth-limited networks
Increase batch sizes for high-latency networks
Use local Kafka clusters when possible
Monitor network saturation

6. Container Deployment

Docker Resource Limits:

docker run -d \
  --cpus="4" \
  --memory="512m" \
  --memory-reservation="256m" \
  streamforge:latest

Kubernetes Resource Limits:

resources:
  requests:
    memory: "256Mi"
    cpu: "1000m"
  limits:
    memory: "512Mi"
    cpu: "4000m"

Monitoring

Built-in Metrics

The application reports metrics every 10 seconds:

Stats: processed=10000 (1000.0/s), filtered=100 (10.0/s),
       completed=9900 (990.0/s), errors=0 (0.0/s)

Key Metrics:

processed: Total messages read
filtered: Messages rejected by filters
completed: Messages successfully sent
errors: Failed sends

Rates:

Monitor completed/s for throughput
Watch errors/s for issues
Check filtered/s for filter effectiveness

System Metrics

CPU:

# Overall CPU
top -p $(pgrep streamforge)

# Per-thread CPU
ps -eLo pid,tid,pcpu,comm | grep streamforge

Memory:

# Memory usage
ps aux | grep streamforge

# Detailed memory
pmap $(pgrep streamforge)

Network:

# Network traffic
iftop -f "port 9092"

# Per-process
nethogs

Kafka Metrics

Consumer Lag:

kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --group streamforge \
  --describe

Topic Metrics:

kafka-run-class.sh kafka.tools.JmxTool \
  --object-name kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec

Alerting

Key Alerts:

Consumer lag > 10000 messages
Error rate > 1%
Throughput dropped > 50%
CPU usage > 90%
Memory usage > 80%

Troubleshooting

Low Throughput

Symptoms:

Throughput < expected
CPU usage < 50%

Diagnosis:

# Check consumer lag
kafka-consumer-groups.sh --describe

# Check producer metrics
# Enable debug logging
RUST_LOG=debug

Solutions:

Increase thread count
Increase batch size
Increase linger.ms
Check network latency
Verify partition count

High CPU Usage

Symptoms:

CPU usage > 90%
Throughput plateaued

Diagnosis:

# CPU profiling
perf record -p $(pgrep streamforge)
perf report

# Check filter complexity
# Review regex patterns

Solutions:

Reduce thread count
Simplify filters
Optimize regex patterns
Reduce destinations
Scale horizontally

High Memory Usage

Symptoms:

Memory usage > expected
OOM errors

Diagnosis:

# Memory profiling
valgrind --tool=massif ./streamforge

# Check message sizes
# Review batch sizes

Solutions:

Reduce max.poll.records
Reduce buffer.memory
Reduce fetch.max.bytes
Check for memory leaks
Increase container limits

High Latency

Symptoms:

p99 latency > 50ms
Slow message delivery

Diagnosis:

# Network latency
ping kafka-broker

# Kafka latency
kafka-run-class.sh kafka.tools.JmxTool

Solutions:

Reduce linger.ms
Reduce batch.size
Set fetch.wait.max.ms=0
Use acks=1
Optimize network path

Advanced Optimization

CPU Pinning

# Pin to specific CPUs
taskset -c 0-3 ./streamforge

# Docker with CPU affinity
docker run --cpuset-cpus="0-3" streamforge:latest

Huge Pages

# Enable huge pages
echo 512 > /proc/sys/vm/nr_hugepages

# Run with huge pages
MALLOC_MMAP_THRESHOLD_=131072 ./streamforge

Network Optimization

# Increase socket buffers
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# TCP tuning
sysctl -w net.ipv4.tcp_window_scaling=1
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

Profiling

CPU Profiling:

# Install perf
# Start profiling
perf record -g -p $(pgrep streamforge)

# Generate flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

Memory Profiling:

# Using valgrind
valgrind --tool=massif --massif-out-file=massif.out ./streamforge

# Analyze
ms_print massif.out

Load Testing

Generate Load:

# Using kafka-producer-perf-test
kafka-producer-perf-test.sh \
  --topic test \
  --num-records 1000000 \
  --record-size 1024 \
  --throughput 10000 \
  --producer-props bootstrap.servers=kafka:9092

Measure Performance:

# Monitor throughput
watch -n 1 'docker logs mirrormaker 2>&1 | tail -1'

# Measure latency
kafka-consumer-perf-test.sh \
  --topic output \
  --bootstrap-server kafka:9092 \
  --messages 100000

Performance Checklist

Pre-Production

Production

Summary

Quick Wins

Enable compression → 2-4x bandwidth reduction
Tune thread count → Match CPU cores
Optimize batch size → Balance latency/throughput
Simplify filters → Use simple comparisons
Monitor metrics → Identify bottlenecks

Performance Targets

Environment	Throughput	Latency p99	CPU	Memory
Development	5K msg/s	50ms	<100%	<100MB
Staging	15K msg/s	30ms	<200%	<150MB
Production	25K+ msg/s	15ms	<400%	<200MB

Next Steps

Test with your specific workload
Measure before optimizing
Optimize bottlenecks first
Monitor continuously
Iterate and improve

For more information:

USAGE.md - Use cases and patterns
CONTRIBUTING.md - Development setup
ADVANCED_DSL_GUIDE.md - Filter optimization

Performance Guide

Table of Contents

Performance Overview

Key Metrics

Performance Characteristics

Benchmarks

Throughput Tests

Latency Tests

Performance Characteristics

Configuration Tuning

Basic Configuration

Thread Configuration

Consumer Tuning

Producer Tuning

Compression Selection

Best Practices

1. Filter Optimization

2. Transform Optimization

3. Partitioning Strategy

4. Multi-Destination Efficiency

5. Resource Management

6. Container Deployment

Monitoring

Built-in Metrics

System Metrics

Kafka Metrics

Alerting

Troubleshooting

Low Throughput

High CPU Usage

High Memory Usage

High Latency

Advanced Optimization

CPU Pinning

Huge Pages

Network Optimization

Profiling

Load Testing

Performance Checklist

Pre-Production

Production

Summary

Quick Wins

Performance Targets

Next Steps