Benchmarks
This page measures aiokpl against the existing Python Kinesis ecosystem.
Before the numbers, the framing — because the numbers don't mean the same
thing across rows.
Apples and oranges
Two things called "throughput" are being measured here:
- Bytes-per-second on the wire. How fast a process can serialize and push records to Kinesis. No knowledge of which records succeeded.
- Records-with-confirmation-per-second. How fast a process can submit a record AND receive its sequence_number back from Kinesis.
These are different products. The Python ecosystem has tools for the
first (aws-kinesis-agg + boto3, raw boto3.put_records). aiokpl is the
only thing that ships the second.
| Variant | Per-record outcomes? | Retry on failure? | Shard prediction? | Backpressure? |
|---|---|---|---|---|
| aiokpl async agg=on (confirmed) | ✅ | ✅ | ✅ | ✅ |
| aiokpl async agg=on (fire-and-forget) | ❌ | ✅ | ✅ | ✅ |
| aiokpl async agg=off | ✅ | ✅ | ✅ | ✅ |
| aiokpl sync agg=on | ✅ | ✅ | ✅ | ✅ |
| boto3.put_records (batched) | ❌ (batch-level only) | ❌ | ❌ | ❌ |
| aws-kinesis-agg + boto3 | ❌ | ❌ | ❌ | ❌ |
Read the throughput table with the feature matrix above in view. The rps figures in different rows are not strictly comparable, because the rows do not deliver the same guarantees.
Throughput
20 000 records of 200 bytes each, single shard, end-to-end against
etspaceman/kinesis-mock:0.5.2 in Docker on loopback. The two boto3-based
variants ran with reduced N (10 000 and 2 000 records) because of
kinesis-mock's CPU ceiling, which unaggregated traffic hits above
~200 rps; see caveats.
| Variant | Records | Elapsed | Throughput |
|---|---|---|---|
| aws-kinesis-agg + boto3 (no outcomes, no retries) | 10 000 | 0.35 s | 28 960 rps |
| aiokpl async agg=on (fire-and-forget) | 20 000 | 2.15 s | 9 296 rps |
| aiokpl async agg=on (confirmed) | 20 000 | 2.33 s | 8 572 rps |
| aiokpl sync agg=on (confirmed) | 20 000 | 5.24 s | 3 820 rps |
| aiokpl async agg=off (per-record PutRecords) | 20 000 | 94.81 s | 211 rps |
| boto3.put_records (batched 500) | 2 000 | 15.76 s | 127 rps |
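The headline ratios discussed in the next section can be re-derived from the throughput column. A small sketch, using only the figures in the table; the dictionary keys are labels chosen here for illustration:

```python
# Throughput figures (records per second) copied from the table above.
RPS = {
    "agg_boto3": 28_960,
    "aiokpl_async_fire_and_forget": 9_296,
    "aiokpl_async_confirmed": 8_572,
    "aiokpl_sync_confirmed": 3_820,
    "aiokpl_async_agg_off": 211,
    "boto3_put_records": 127,
}


def ratio(fast: str, slow: str) -> float:
    """How many times faster `fast` is than `slow`."""
    return RPS[fast] / RPS[slow]


# Gap between the codec-only path and confirmed aiokpl.
agg_vs_confirmed = ratio("agg_boto3", "aiokpl_async_confirmed")
# Fraction of throughput given up by waiting for per-record outcomes.
confirm_cost = 1 - RPS["aiokpl_async_confirmed"] / RPS["aiokpl_async_fire_and_forget"]
# Effect of turning aggregation off.
agg_on_vs_off = ratio("aiokpl_async_confirmed", "aiokpl_async_agg_off")
# Sync bridge throughput as a share of async throughput.
sync_share = RPS["aiokpl_sync_confirmed"] / RPS["aiokpl_async_confirmed"]

print(f"{agg_vs_confirmed:.1f}x  {confirm_cost:.1%}  {agg_on_vs_off:.0f}x  {sync_share:.0%}")
```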
Reading the numbers honestly
1 — aws-kinesis-agg + boto3 wins on raw throughput. By 3×.
It's not magic. It encodes the input into 1-4 aggregated records that fit
in a single PutRecords call and ships them. It does not track which
user-records inside each blob actually landed. It does not retry. It does
not predict shards (it assumes all records aggregated together share a
shard — which is wrong in multi-shard streams, but kinesis-mock doesn't
enforce that). For a workload that is genuinely "fire records, let
downstream detect losses", that's the right tool and it will be faster.
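For context on what shard prediction involves: Kinesis routes a record by taking the MD5 of its partition key as a 128-bit integer and finding the shard whose hash key range contains it. The sketch below assumes a stream whose shards split the key space evenly, which is an illustration only; a real producer would read the ranges from ListShards, and none of these helper names come from aiokpl.

```python
import hashlib
from bisect import bisect_right

MAX_HASH = 2**128  # Kinesis hash keys are 128-bit unsigned integers


def hash_key(partition_key: str) -> int:
    """MD5 of the UTF-8 partition key, read as a 128-bit big-endian integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_shard_starts(shard_count: int) -> list:
    """Starting hash keys for shards that split the key space evenly (assumption)."""
    return [i * MAX_HASH // shard_count for i in range(shard_count)]


def predict_shard(partition_key: str, starts: list) -> int:
    """Index of the shard whose range contains the key's hash."""
    return bisect_right(starts, hash_key(partition_key)) - 1


starts = even_shard_starts(4)
shards_used = {predict_shard(f"pk-{i}", starts) for i in range(256)}
```

Records that hash to different shards cannot share one aggregated blob, which is the check the codec-only path skips.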
2 — The aiokpl confirmed-vs-fire-and-forget gap is only 8 %.
aiokpl async agg=on (confirmed) is 8 572 rps. The same configuration
without await outcome.wait() per record is 9 296 rps. The cost of
per-record outcome tracking is small. The cost of aiokpl over
aws-kinesis-agg (~3×) is the pipeline — per-record shard prediction,
per-shard rate limiting, retry classification, backpressure — not the
outcome bookkeeping.
That's an architectural floor, not a bug. If your workload doesn't need those features, you don't need a producer; the codec is enough.
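Of the pipeline features listed above, backpressure is the easiest to picture in isolation. A minimal sketch of capping in-flight records with a semaphore, in the spirit of max_outstanding_records; BoundedSubmitter is a hypothetical stand-in and the sleep replaces the real send, so this is not aiokpl's implementation:

```python
import asyncio


class BoundedSubmitter:
    """Hypothetical producer stand-in that caps the number of in-flight records."""

    def __init__(self, max_outstanding: int) -> None:
        self._slots = asyncio.Semaphore(max_outstanding)
        self.in_flight = 0
        self.peak_in_flight = 0

    async def submit(self, record: bytes) -> None:
        # Callers wait here once max_outstanding records are in flight.
        async with self._slots:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                await asyncio.sleep(0.001)  # placeholder for the real send
            finally:
                self.in_flight -= 1


async def main() -> int:
    submitter = BoundedSubmitter(max_outstanding=8)
    await asyncio.gather(*(submitter.submit(b"x" * 200) for _ in range(100)))
    return submitter.peak_in_flight


peak = asyncio.run(main())
```

Memory use stays bounded because a fast caller is slowed to the pace of the sender instead of queueing without limit.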
3 — Aggregation is the single biggest knob.
aiokpl async agg=on is 8 572 rps. The same code with
aggregation_enabled=False is 211 rps — a 41× drop. Without
aggregation each user record becomes its own Kinesis record and your
throughput is bottlenecked by the HTTPS round-trip per call. This is
exactly why the C++ KPL invented aggregation in 2015 and why aiokpl
exists.
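A back-of-envelope count of Kinesis records with and without aggregation, as a sketch. It assumes a greedy packer with a 1 MiB cap (the figure the methodology quotes for aws-kinesis-agg) and ignores partition-key bytes, the aggregated format's framing, and time-based flushes; count_kinesis_records is a hypothetical helper, not aiokpl code.

```python
from typing import Optional


def count_kinesis_records(user_record_sizes: list, aggregate_cap: Optional[int]) -> int:
    """Greedily pack user records into blobs of at most `aggregate_cap` bytes.

    `aggregate_cap=None` models aggregation disabled: one Kinesis record per
    user record.
    """
    if aggregate_cap is None:
        return len(user_record_sizes)
    blobs, current = 0, 0
    for size in user_record_sizes:
        # Close the current blob when the next record would overflow it.
        if current and current + size > aggregate_cap:
            blobs += 1
            current = 0
        current += size
    return blobs + (1 if current else 0)


sizes = [200] * 20_000                           # the benchmark's workload shape
without_agg = count_kinesis_records(sizes, None)
with_agg = count_kinesis_records(sizes, 1 << 20)  # 1 MiB cap
```

Under these assumptions the 20 000 user records collapse from 20 000 Kinesis records to 4, which is where the reduction in round-trips comes from.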
4 — Naive boto3.put_records is not a competitor.
127 rps single-threaded with 500-record batches is what dumb batching costs you. It's here to make the gap explicit, not because it's a baseline anyone should target.
5 — The sync bridge is ~45 % of async throughput.
SyncProducer bounces every put_record through an anyio portal on a
background thread. Pay it if you must (Flask, Django, Celery, scripts);
reach for the async path if you have an event loop.
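The overhead comes from crossing a thread boundary twice per call. A stdlib sketch of the same pattern, using asyncio.run_coroutine_threadsafe instead of the anyio portal mentioned above; LoopBridge and fake_put_record are hypothetical names and do not reflect SyncProducer's internals:

```python
import asyncio
import threading


class LoopBridge:
    """Runs an event loop on a background thread so sync code can call into it."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def call(self, coro):
        # Hand the coroutine to the loop thread and block until it finishes.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fake_put_record(data: bytes) -> int:
    """Hypothetical async send; returns the payload size instead of a real result."""
    await asyncio.sleep(0)
    return len(data)


bridge = LoopBridge()
sent = bridge.call(fake_put_record(b"x" * 200))
bridge.close()
```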
Latency
1 000 records submitted one at a time with ~1 ms inter-arrival, 2 shards, against the same emulator.
| Variant | P50 | P99 | P99.9 |
|---|---|---|---|
| boto3.put_records (batched 500) | 105 ms | 170 ms | 174 ms |
| aws-kinesis-agg + boto3 | 698 ms | 1 331 ms | 1 342 ms |
| aiokpl async agg=on | 778 ms | 1 473 ms | 1 484 ms |
| aiokpl async agg=off | 798 ms | 1 475 ms | 1 487 ms |
| aiokpl sync agg=on | 959 ms | 1 755 ms | 1 767 ms |
Reading this:
- boto3.put_records is the latency winner here, but on a synthetic metric. Latency for the boto3 row is "elapsed since the batch started", so every record in a 500-record batch sees the same number. That's not a per-record durability time. Compare against its 127 rps throughput: the metric reflects how fast a tight sync loop runs, not how fast records become durable.
- aiokpl and aws-kinesis-agg cluster around 700-1 500 ms because both pay the buffered-time deadline (100 ms in aiokpl) plus kinesis-mock's per-call latency on aggregated batches.
- P99.9 is tight to P99 across the board: there are no long tails on the emulator. Real AWS has different tails (throttle backoffs, split-shard convergence); those don't show up here.
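For readers who want to recompute the percentile columns from raw samples, here is a nearest-rank sketch. The benchmark's actual percentile method is not stated on this page, so this definition is an assumption, and the samples below are synthetic.

```python
import math


def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing to keep the rank exact for values such as 99.9.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [float(i) for i in range(1, 1001)]  # synthetic: 1, 2, ..., 1000 ms
p50, p99, p999 = (percentile(latencies_ms, p) for p in (50, 99, 99.9))
```

With 1 000 samples, P99.9 under this definition is the second-largest observation, so it rests on very few data points.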
Caveats
These numbers are relative, not absolute. Specifically:
- kinesis-mock is not real AWS. It's an in-process Scala reimplementation for testing. It has its own latency characteristics (mostly internal scheduling, no network), no real rate limits, no real shard provisioning, and a single-machine CPU ceiling we hit on non-aggregated variants. The shape (aggregation matters; per-record pipelines cost CPU) carries over to real AWS; the absolute numbers do not.
- The CPU ceiling forced compromises. We initially tried [1, 4, 8]-shard runs at 20 000 records. The unaggregated variants saturate kinesis-mock above ~200 rps regardless of shard count, hanging the emulator. The shipped numbers are single-shard with reduced N (2 000 records for boto3.put_records, 10 000 for aws-kinesis-agg + boto3). Real-AWS multi-shard scaling is linear by design; that part doesn't need a benchmark to claim.
- The fire-and-forget variants drop most of what aiokpl does. They are listed because they answer the natural question "how fast COULD aiokpl push bytes if I disabled the bookkeeping?". The answer is ~9 300 rps, and you give up per-record outcomes, retry classification visibility, and the ability to know which records failed. If the producer's job is "push events I care about", confirmed is the only honest measurement.
- aiokpl measures end-to-end. Throughput for confirmed mode is the wall clock from the first put_record to the last outcome.wait() returning. The fire-and-forget variant ends at outstanding_records == 0 (the pipeline has drained, but the caller didn't observe any per-record result). The aws-kinesis-agg row ends at the boto3 put_records return; there is no equivalent end state because no per-record tracking exists.
- No retries forced. kinesis-mock doesn't throttle naturally. Every record succeeds first try here. Real-world numbers will include retry latency for throttle/transient errors, and those costs land on aiokpl (which does retries), not on aws-kinesis-agg + boto3 (which would just lose the records and never know).
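To give a feel for the retry latency the last caveat refers to, here is a sketch of exponential backoff with full jitter. The base delay, cap, and attempt count are illustrative values chosen here, not aiokpl's configuration.

```python
import random
from typing import Optional


def backoff_delays(attempts: int, base_s: float = 0.1, cap_s: float = 5.0,
                   rng: Optional[random.Random] = None) -> list:
    """Full-jitter backoff: delay k is uniform in [0, min(cap_s, base_s * 2**k)]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap_s, base_s * 2**k)) for k in range(attempts)]


# A record that is throttled five times before succeeding waits this long in total.
delays = backoff_delays(attempts=5, rng=random.Random(0))
added_latency_s = sum(delays)
```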
Methodology
- Backend: ghcr.io/etspaceman/kinesis-mock:0.5.2 on Docker, single container, default config, HTTPS on localhost:4567, self-signed cert.
- Host: Apple Mac16,7 (M4 Pro, 14 cores), 48 GB RAM, macOS, Python 3.12.13.
- Library version: aiokpl 0.2.0 at commit ad4c4b4.
- Records: 200-byte payloads, partition keys cycled across 256 distinct values.
- aiokpl configuration: defaults (record_max_buffered_time_ms=50 for benches, record_ttl_ms=30_000, max_outstanding_records ≥ N).
- boto3 / aws-kinesis-agg configuration: boto3 defaults (no custom pool sizing). aws-kinesis-agg uses 1 MiB aggregation cap; ships 500 aggregated records per put_records call.
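The record shape described above can be generated along these lines. This is a sketch: the key format and the random payload bytes are assumptions, and make_workload is a hypothetical helper rather than the function the benchmark scripts use.

```python
import itertools
import os

RECORD_SIZE = 200     # bytes per payload, as in the methodology
DISTINCT_KEYS = 256   # partition keys are cycled across this many values


def make_workload(n: int):
    """Yield (partition_key, payload) pairs for n records."""
    keys = itertools.cycle(f"pk-{i:03d}" for i in range(DISTINCT_KEYS))
    for _, key in zip(range(n), keys):
        yield key, os.urandom(RECORD_SIZE)


workload = list(make_workload(20_000))
```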
Reproducing
git clone https://github.com/DevArKaDiA/aiokpl
cd aiokpl
uv venv && uv pip install -e ".[dev,bench]"
docker pull ghcr.io/etspaceman/kinesis-mock:0.5.2
python -m benchmarks.bench_throughput | tee benchmarks/results/throughput.txt
python -m benchmarks.bench_latency | tee benchmarks/results/latency.txt
Total runtime: 2-3 minutes for throughput, 1-2 minutes for latency. Cross-machine comparison is fair — kinesis-mock's CPU starvation point is roughly machine-independent (it's event-loop saturation, not a throughput cap).
Is the 3× worth it?
The honest answer for most readers is yes. Three things put the gap in perspective.
1. 8 572 rps is already a lot
| Producer config | Records/day | Records/month |
|---|---|---|
| aiokpl confirmed @ 8 572 rps | 741 M / day | 22 B / month |
| aws-kinesis-agg + boto3 @ 28 960 rps | 2.5 B / day | 75 B / month |
At 200-byte records, 8 572 rps is ~1.7 MB/s sustained from one Python process. To find a workload where that cap is the bottleneck, you need to be doing one of:
- Streaming telemetry from a fleet of millions of devices.
- Ingesting logs from an entire org's microservice mesh.
- Already designing producer-side sharding (Kafka-style).
For webhooks, transactional events, audit logs, analytics, app telemetry, observability pipelines — 8 572 rps is headroom, not a ceiling.
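The daily, monthly, and MB/s figures in this subsection follow from simple arithmetic on the measured rps. A short check, assuming a 30-day month (which is what the table's monthly column implies):

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30    # assumption implied by the table's monthly figures
RECORD_BYTES = 200


def capacity(rps: int) -> dict:
    """Sustained volume for one producer process running at `rps`."""
    per_day = rps * SECONDS_PER_DAY
    return {
        "records_per_day": per_day,
        "records_per_month": per_day * DAYS_PER_MONTH,
        "megabytes_per_second": rps * RECORD_BYTES / 1e6,
    }


aiokpl_confirmed = capacity(8_572)
agg_boto3 = capacity(28_960)
```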
2. The 3× is the worst case (kinesis-mock has no network)
kinesis-mock is loopback HTTPS with negligible per-call latency. That
amplifies the gap between "encode + 1 call" (aws-kinesis-agg) and
"encode + pipeline + outcomes" (aiokpl). On real AWS each
PutRecords round-trip is ~10-50 ms, and that's the same cost for
both variants. What differs is the per-record bookkeeping aiokpl does
outside the HTTP path — task switches measured in microseconds,
dominated by AWS network latency.
Expect the gap to compress to roughly 1.3-1.5× against real AWS.
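The direction of this argument can be checked with a toy model. It treats the emulator results as pure per-record CPU cost, adds a fixed round-trip per PutRecords call, and assumes calls are issued serially with no overlap. The records_per_call value is hypothetical, so the model illustrates that the ratio shrinks as round-trip time grows; it does not predict by how much.

```python
# Per-record cost on the emulator, taken as 1 / measured throughput.
CPU_AGG_S = 1 / 28_960     # aws-kinesis-agg + boto3
CPU_AIOKPL_S = 1 / 8_572   # aiokpl async agg=on (confirmed)


def gap(rtt_s: float, records_per_call: int) -> float:
    """Throughput ratio when each PutRecords call adds rtt_s, amortized over its records."""
    network_per_record = rtt_s / records_per_call
    return (CPU_AIOKPL_S + network_per_record) / (CPU_AGG_S + network_per_record)


loopback = gap(rtt_s=0.0, records_per_call=250)
real_aws = [gap(rtt_s=ms / 1000, records_per_call=250) for ms in (10, 30, 50)]
```

How far the ratio falls depends heavily on how many user records each call carries, which this page does not report for real AWS.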
3. The "safety net" is what 2 000 lines of code do
aws-kinesis-agg + boto3 will give you one PutRecords response per
batch. If it reports that 47 of 500 entries failed (500 is the per-call maximum):
- The failed records are inside aggregated blobs, so you don't know which user-records map to which failures.
- There is no retry. They're lost.
- If the failure was a throttle (ProvisionedThroughputExceededException), the next batch is just as likely to fail.
- If the failure was a wrong-shard-after-split, the failed records will keep going to the wrong shard.
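What a batch-level response does give you is one result per request entry, in request order, with ErrorCode and ErrorMessage in place of SequenceNumber and ShardId on failures. A sketch of pulling the failed entries back out; the response below is a hand-written sample in that shape, not captured output, and failed_entries is a hypothetical helper:

```python
def failed_entries(request_records: list, response: dict) -> list:
    """Pair each failed request entry with its ErrorCode, relying on request order."""
    return [
        (entry, result["ErrorCode"])
        for entry, result in zip(request_records, response["Records"])
        if "ErrorCode" in result
    ]


# Hand-written sample; no network call is made.
request = [
    {"Data": b"blob-0", "PartitionKey": "a"},
    {"Data": b"blob-1", "PartitionKey": "b"},
]
response = {
    "FailedRecordCount": 1,
    "Records": [
        {"SequenceNumber": "1", "ShardId": "shardId-000000000000"},
        {"ErrorCode": "ProvisionedThroughputExceededException", "ErrorMessage": "throttled"},
    ],
}
retry_candidates = failed_entries(request, response)
```

Each entry here is an aggregated blob, so the user records inside a failed entry remain unidentified unless the caller kept its own blob-to-record mapping.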
To rebuild what aiokpl does on top of aws-kinesis-agg + boto3:
| You'd write | aiokpl ships |
|---|---|
| user-record → aggregated-blob → failure-result mapping | per-record Outcome[RecordResult] |
| Retry loop with throttle/transient/wrong-shard classification | Retrier (300 lines) |
| Exponential backoff with jitter | inside the Retrier |
| Shard map refresh on split_shard convergence | ShardMap (230 lines) |
| Per-shard token bucket (1000 rec/s + 1 MiB/s) | Limiter (152 lines) |
| Bounded-memory backpressure | max_outstanding_records semaphore |
| Vendor-neutral metrics export | MetricsSink (5 implementations) |
That's roughly the 2 000 lines aiokpl is. The 3× throughput gap is
the price of not writing them yourself.
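As an illustration of one row in that table, a per-shard limiter can be built from two token buckets, one for records and one for bytes, using the 1000 rec/s and 1 MiB/s limits quoted above. This is a generic sketch with an injectable clock for testing; TokenBucket and ShardLimiter are hypothetical and not the code of aiokpl's Limiter.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_s: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def can_take(self, amount: float) -> bool:
        # Refill according to elapsed time, never above capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        return amount <= self._tokens

    def take(self, amount: float) -> None:
        self._tokens -= amount


class ShardLimiter:
    """Admits a record only if both the record-count and byte buckets allow it."""

    def __init__(self, clock=time.monotonic) -> None:
        self.records = TokenBucket(1_000, 1_000, clock)
        self.bytes = TokenBucket(1 << 20, 1 << 20, clock)

    def try_admit(self, size_bytes: int) -> bool:
        # Check both buckets before taking, so a refusal does not burn tokens.
        if self.records.can_take(1) and self.bytes.can_take(size_bytes):
            self.records.take(1)
            self.bytes.take(size_bytes)
            return True
        return False


now = [0.0]                                   # fake clock, advanced manually
limiter = ShardLimiter(clock=lambda: now[0])
first_burst = sum(limiter.try_admit(200) for _ in range(1_500))
now[0] += 0.5                                 # half a second later
second_burst = sum(limiter.try_admit(200) for _ in range(1_500))
```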
Verdict
- Workload is "fire telemetry, downstream detects loss" → aws-kinesis-agg + boto3. Faster, less code, right tool.
- Records are events you care about (webhooks, transactions, audit logs, observability) → aiokpl. The 3× is the cost of the safety net; on real AWS it compresses to ~1.5×.
- You're not sure → aiokpl. 8 572 rps covers the long tail. When you discover you need retries (you will), you don't have to migrate.