Performance Guide
Comprehensive guide for optimizing StreamForge performance.
Table of Contents
- Performance Overview
- Benchmarks
- Configuration Tuning
- Best Practices
- Monitoring
- Troubleshooting
- Advanced Optimization
Performance Overview
Key Metrics
| Metric | Typical Value | Excellent Value |
|---|---|---|
| Throughput | 10K-25K msg/s | 50K+ msg/s |
| Latency (p50) | 5-10ms | <5ms |
| Latency (p99) | 15-30ms | <15ms |
| Memory Usage | 50-100MB | <50MB |
| CPU Usage | 50-100% | 200-400% (multi-core) |
Performance Characteristics
Filter Performance:
- Simple comparison: ~100ns
- Boolean logic (AND/OR/NOT): ~100-300ns
- Regular expressions: ~500ns-1µs
- Array operations: ~1-10µs (size dependent)
Transform Performance:
- JSON path extraction: ~50-100ns
- Object construction: ~200-500ns
- Array mapping: ~1-10µs (size dependent)
- Arithmetic: ~50ns
Overall Overhead:
- Per-message processing: ~2-10µs
- Network I/O: Dominant factor (>99% of time)
Benchmarks
Throughput Tests
Configuration:
- Message size: 1KB
- Partitions: 10
- Replicas: 3
- Hardware: 4 CPU cores, 8GB RAM
Results:
| Scenario | Throughput | CPU | Memory |
|---|---|---|---|
| Simple mirroring (no filter) | 45K msg/s | 150% | 45MB |
| With simple filter | 42K msg/s | 180% | 48MB |
| With boolean logic (3 conditions) | 38K msg/s | 200% | 50MB |
| With regex filter | 35K msg/s | 220% | 52MB |
| With array operations | 30K msg/s | 250% | 60MB |
| Multi-destination (5 topics) | 40K msg/s | 300% | 65MB |
Latency Tests
Configuration:
- Message size: 1KB
- Batch size: 100
- Linger: 10ms
Results:
| Percentile | Simple | With Filter | Multi-Dest |
|---|---|---|---|
| p50 | 3ms | 4ms | 5ms |
| p95 | 8ms | 10ms | 12ms |
| p99 | 12ms | 15ms | 20ms |
| p99.9 | 25ms | 30ms | 40ms |
Performance Characteristics
Streamforge performance at 10K msg/s baseline workload (1KB messages):
| Metric | Performance | Capability |
|---|---|---|
| Throughput | 25,000 msg/s | High-volume sustained processing |
| CPU Usage | 120% (4 cores) | Efficient multi-core utilization |
| Memory | 50MB | Minimal memory footprint |
| Latency (p99) | 15ms | Consistent low latency |
| Startup | 0.1s | Rapid deployment and recovery |
| Scalability | Linear | Predictable resource growth |
Configuration Tuning
Basic Configuration
Minimal (Low Throughput):
{
"threads": 2,
"consumer_properties": {
"fetch.min.bytes": "1",
"fetch.wait.max.ms": "100"
},
"producer_properties": {
"batch.size": "16384",
"linger.ms": "0"
}
}
Balanced (Recommended):
{
"threads": 4,
"consumer_properties": {
"fetch.min.bytes": "1048576",
"fetch.wait.max.ms": "500",
"max.poll.records": "500"
},
"producer_properties": {
"batch.size": "65536",
"linger.ms": "10",
"compression.type": "gzip"
}
}
High Throughput:
{
"threads": 8,
"consumer_properties": {
"fetch.min.bytes": "1048576",
"fetch.wait.max.ms": "500",
"max.poll.records": "1000",
"max.partition.fetch.bytes": "1048576"
},
"producer_properties": {
"batch.size": "131072",
"linger.ms": "10",
"buffer.memory": "67108864",
"compression.type": "snappy",
"max.in.flight.requests.per.connection": "5"
}
}
Low Latency:
{
"threads": 4,
"consumer_properties": {
"fetch.min.bytes": "1",
"fetch.wait.max.ms": "0",
"max.poll.records": "100"
},
"producer_properties": {
"batch.size": "16384",
"linger.ms": "0",
"acks": "1"
}
}
Thread Configuration
Rule of thumb:
- Start with:
threads = CPU cores - Low throughput:
threads = 2-4 - High throughput:
threads = CPU cores * 2 - Very high throughput:
threads = CPU cores * 2-4
Testing:
# Measure with different thread counts
for threads in 2 4 8 16; do
echo "Testing with $threads threads..."
# Update config and run
# Monitor throughput
done
Consumer Tuning
fetch.min.bytes:
- Low latency:
1(don’t wait for data) - Balanced:
1048576(1MB) - High throughput:
2097152(2MB)
fetch.wait.max.ms:
- Low latency:
0-100 - Balanced:
500 - High throughput:
1000
max.poll.records:
- Low memory:
100-200 - Balanced:
500 - High throughput:
1000-2000
session.timeout.ms:
- Stable network:
10000(10s) - Unreliable network:
30000(30s) - Very unreliable:
60000(60s)
Producer Tuning
batch.size:
- Low latency:
16384(16KB) - Balanced:
65536(64KB) - High throughput:
131072(128KB)
linger.ms:
- Low latency:
0-1 - Balanced:
10 - High throughput:
20-50
compression.type:
- Fastest:
snappy - Balanced:
gzip - Best compression:
zstd - None:
none
acks:
- Fastest:
0(no acknowledgment) - Balanced:
1(leader acknowledgment) - Most durable:
all(all replicas)
Compression Selection
Benchmarks (1KB messages):
| Type | Compression Ratio | CPU Usage | Throughput |
|---|---|---|---|
| None | 1.0x | Low | 50K msg/s |
| Snappy | 2.5x | Medium | 45K msg/s |
| Gzip | 4.0x | High | 35K msg/s |
| Zstd | 4.5x | Medium-High | 40K msg/s |
Recommendations:
- Network bandwidth limited → Use
zstdorgzip - CPU limited → Use
snappyornone - Balanced → Use
snappy - Storage limited → Use
zstd
Best Practices
1. Filter Optimization
❌ Inefficient:
{
"filter": "REGEX:/message,.*complex.*pattern.*with.*many.*terms.*"
}
✅ Efficient:
{
"filter": "AND:/message/type,==,complex:/message/hasPattern,==,true"
}
Guidelines:
- Use simple comparisons when possible
- Avoid complex regex patterns
- Put cheaper filters first in AND logic
- Use NOT sparingly (still evaluates inner filter)
2. Transform Optimization
❌ Inefficient:
{
"transform": "CONSTRUCT:f1=/a/b/c/d/e:f2=/a/b/c/d/f:f3=/a/b/c/d/g"
}
✅ Efficient:
{
"transform": "/a/b/c/d"
}
Guidelines:
- Extract parent object when possible
- Avoid redundant field extraction
- Use array operations efficiently
- Minimize arithmetic operations
3. Partitioning Strategy
Hash Partitioning (Default):
{
"partition": null
}
- Pros: Even distribution
- Cons: No ordering guarantees
- Use: When order doesn’t matter
Field Partitioning:
{
"partition": "/userId"
}
- Pros: Maintains ordering per key
- Cons: Potential hotspots
- Use: When ordering important
Hotspot Prevention:
{
"filter": "NOT:/userId,==,very-active-user"
}
- Filter out high-volume keys
- Use separate topics for hot keys
- Monitor partition distribution
4. Multi-Destination Efficiency
❌ Inefficient:
{
"destinations": [
{"filter": "REGEX:/type,.*"},
{"filter": "REGEX:/type,.*"},
{"filter": "REGEX:/type,.*"}
]
}
✅ Efficient:
{
"destinations": [
{"filter": "/type,==,a"},
{"filter": "/type,==,b"},
{"filter": "/type,==,c"}
]
}
Guidelines:
- Limit destinations to <10 for best performance
- Use mutually exclusive filters when possible
- Order by match probability (most likely first)
- Combine related destinations
5. Resource Management
Memory:
{
"consumer_properties": {
"max.poll.records": "500",
"fetch.max.bytes": "52428800"
},
"producer_properties": {
"buffer.memory": "33554432"
}
}
CPU:
- Match threads to available cores
- Leave 1-2 cores for OS
- Monitor CPU saturation
- Use CPU affinity in containers
Network:
- Compression for bandwidth-limited networks
- Increase batch sizes for high-latency networks
- Use local Kafka clusters when possible
- Monitor network saturation
6. Container Deployment
Docker Resource Limits:
docker run -d \
--cpus="4" \
--memory="512m" \
--memory-reservation="256m" \
streamforge:latest
Kubernetes Resource Limits:
resources:
requests:
memory: "256Mi"
cpu: "1000m"
limits:
memory: "512Mi"
cpu: "4000m"
Monitoring
Built-in Metrics
The application reports metrics every 10 seconds:
Stats: processed=10000 (1000.0/s), filtered=100 (10.0/s),
completed=9900 (990.0/s), errors=0 (0.0/s)
Key Metrics:
processed: Total messages readfiltered: Messages rejected by filterscompleted: Messages successfully senterrors: Failed sends
Rates:
- Monitor
completed/sfor throughput - Watch
errors/sfor issues - Check
filtered/sfor filter effectiveness
System Metrics
CPU:
# Overall CPU
top -p $(pgrep streamforge)
# Per-thread CPU
ps -eLo pid,tid,pcpu,comm | grep streamforge
Memory:
# Memory usage
ps aux | grep streamforge
# Detailed memory
pmap $(pgrep streamforge)
Network:
# Network traffic
iftop -f "port 9092"
# Per-process
nethogs
Kafka Metrics
Consumer Lag:
kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--group streamforge \
--describe
Topic Metrics:
kafka-run-class.sh kafka.tools.JmxTool \
--object-name kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
Alerting
Key Alerts:
- Consumer lag > 10000 messages
- Error rate > 1%
- Throughput dropped > 50%
- CPU usage > 90%
- Memory usage > 80%
Troubleshooting
Low Throughput
Symptoms:
- Throughput < expected
- CPU usage < 50%
Diagnosis:
# Check consumer lag
kafka-consumer-groups.sh --describe
# Check producer metrics
# Enable debug logging
RUST_LOG=debug
Solutions:
- Increase thread count
- Increase batch size
- Increase linger.ms
- Check network latency
- Verify partition count
High CPU Usage
Symptoms:
- CPU usage > 90%
- Throughput plateaued
Diagnosis:
# CPU profiling
perf record -p $(pgrep streamforge)
perf report
# Check filter complexity
# Review regex patterns
Solutions:
- Reduce thread count
- Simplify filters
- Optimize regex patterns
- Reduce destinations
- Scale horizontally
High Memory Usage
Symptoms:
- Memory usage > expected
- OOM errors
Diagnosis:
# Memory profiling
valgrind --tool=massif ./streamforge
# Check message sizes
# Review batch sizes
Solutions:
- Reduce max.poll.records
- Reduce buffer.memory
- Reduce fetch.max.bytes
- Check for memory leaks
- Increase container limits
High Latency
Symptoms:
- p99 latency > 50ms
- Slow message delivery
Diagnosis:
# Network latency
ping kafka-broker
# Kafka latency
kafka-run-class.sh kafka.tools.JmxTool
Solutions:
- Reduce linger.ms
- Reduce batch.size
- Set fetch.wait.max.ms=0
- Use acks=1
- Optimize network path
Advanced Optimization
CPU Pinning
# Pin to specific CPUs
taskset -c 0-3 ./streamforge
# Docker with CPU affinity
docker run --cpuset-cpus="0-3" streamforge:latest
Huge Pages
# Enable huge pages
echo 512 > /proc/sys/vm/nr_hugepages
# Run with huge pages
MALLOC_MMAP_THRESHOLD_=131072 ./streamforge
Network Optimization
# Increase socket buffers
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
# TCP tuning
sysctl -w net.ipv4.tcp_window_scaling=1
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
Profiling
CPU Profiling:
# Install perf
# Start profiling
perf record -g -p $(pgrep streamforge)
# Generate flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
Memory Profiling:
# Using valgrind
valgrind --tool=massif --massif-out-file=massif.out ./streamforge
# Analyze
ms_print massif.out
Load Testing
Generate Load:
# Using kafka-producer-perf-test
kafka-producer-perf-test.sh \
--topic test \
--num-records 1000000 \
--record-size 1024 \
--throughput 10000 \
--producer-props bootstrap.servers=kafka:9092
Measure Performance:
# Monitor throughput
watch -n 1 'docker logs mirrormaker 2>&1 | tail -1'
# Measure latency
kafka-consumer-perf-test.sh \
--topic output \
--bootstrap-server kafka:9092 \
--messages 100000
Performance Checklist
Pre-Production
- Benchmark with production-like data
- Load test at 2x expected throughput
- Verify latency under load
- Test failure scenarios
- Profile CPU and memory usage
- Validate filter performance
- Check network bandwidth
- Monitor consumer lag
- Test with different thread counts
- Verify compression benefits
Production
- Set up monitoring
- Configure alerting
- Tune based on metrics
- Monitor consumer lag
- Track error rates
- Review logs regularly
- Plan for scaling
- Document configuration
- Set resource limits
- Regular performance reviews
Summary
Quick Wins
- Enable compression → 2-4x bandwidth reduction
- Tune thread count → Match CPU cores
- Optimize batch size → Balance latency/throughput
- Simplify filters → Use simple comparisons
- Monitor metrics → Identify bottlenecks
Performance Targets
| Environment | Throughput | Latency p99 | CPU | Memory |
|---|---|---|---|---|
| Development | 5K msg/s | 50ms | <100% | <100MB |
| Staging | 15K msg/s | 30ms | <200% | <150MB |
| Production | 25K+ msg/s | 15ms | <400% | <200MB |
Next Steps
- Test with your specific workload
- Measure before optimizing
- Optimize bottlenecks first
- Monitor continuously
- Iterate and improve
For more information:
- USAGE.md - Use cases and patterns
- CONTRIBUTING.md - Development setup
- ADVANCED_DSL_GUIDE.md - Filter optimization