NAME
celeriant - an append-only event store for the write side of CQRS
SYNOPSIS
// append to one aggregate, only if it is still at the version you read.
await pool.WriteAsync(key, [event], expectedVersion: version);
// or append to many aggregates in one atomic write, each guarded by its
// own version: a dynamic consistency boundary. all of it lands, or none.
await pool.WriteAsync(new WriteRequest {
    Writes = new() {
        [from] = new() { Events = [withdrawn], ExpectedVersion = fromVersion },
        [to] = new() { Events = [deposited], ExpectedVersion = toVersion },
    },
});
// read it back: stream every batch from a version, pagination followed
// for you, and fold each event into your projection.
await foreach (var batch in pool.ReadAllAsync(key, ReadFilters.From(version)))
    foreach (var e in batch.Events)
        state = Apply(state, e);
DESCRIPTION
Celeriant is the write side of CQRS: a distributed, append-only log for event sourcing. Optimistic concurrency, strict per-aggregate ordering, exactly-once writes, atomic writes across aggregates. Postgres-grade write correctness at Kafka-grade throughput, on two nodes.
Event sourcing lives or dies on one question: can I append this event only if nobody changed the aggregate under me? Kafka scales to millions of writes a second but cannot do a conditional, per-key append.[1] Postgres can, with a row lock and a conditional insert, then falls over around ten to twenty thousand conditional writes a second on commit-fsync latency, lock contention on hot aggregates, and the vacuum debt building underneath.[2] Miss the answer in production and two requests both read version 4, both decide they are valid, both append milliseconds apart. One just violated an invariant you thought you enforced. No error.
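The race above can be sketched as a conditional append. This is a minimal in-memory model in Rust, not Celeriant's client API: the `Store` type, its `append` method, and the choice of "version = number of events" are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Why an append was refused.
#[derive(Debug, PartialEq)]
enum AppendError {
    /// The aggregate moved since the caller read it.
    VersionConflict { expected: u64, actual: u64 },
}

/// Hypothetical in-memory stand-in for the store: aggregate key -> ordered events.
#[derive(Default)]
struct Store {
    streams: HashMap<String, Vec<String>>,
}

impl Store {
    /// Assumption: an aggregate's version is the number of events it holds.
    fn version(&self, key: &str) -> u64 {
        self.streams.get(key).map_or(0, |s| s.len() as u64)
    }

    /// Append only if the aggregate is still at `expected`; otherwise change nothing.
    fn append(&mut self, key: &str, event: &str, expected: u64) -> Result<u64, AppendError> {
        let actual = self.version(key);
        if actual != expected {
            return Err(AppendError::VersionConflict { expected, actual });
        }
        self.streams
            .entry(key.to_string())
            .or_default()
            .push(event.to_string());
        Ok(actual + 1)
    }
}
```

With this guard, two writers that both read version 4 cannot both succeed: the first append moves the aggregate to 5, and the second gets a `VersionConflict` instead of silently landing.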
AVAILABILITY: Apache-2.0 release coming.
The aggregate is the unit of ordering
Every event belongs to an aggregate, addressed by a three-part key:
org_id / aggregate_type_id / aggregate_id
Events within an aggregate are strictly ordered: no gaps, no concurrent writers. Aggregates map to shards deterministically, so that ordering holds across the cluster, not just on one box. One stream per user, per device, per order, per match. Model your domain, not your index budget: a stream per entity costs nothing, even at millions of them.
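One way to picture a deterministic key-to-shard mapping is to hash the three key parts with a fixed function and reduce modulo the shard count. The sketch below is an assumption throughout: the field types, the FNV-1a hash, and the modulo scheme are illustrative, and the text does not say how Celeriant actually assigns shards.

```rust
/// The three-part aggregate key from the text. Field types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
struct AggregateKey {
    org_id: u64,
    aggregate_type_id: u32,
    aggregate_id: String,
}

/// FNV-1a 64-bit offset basis.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
/// FNV-1a 64-bit prime.
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// FNV-1a over `bytes`, continuing from `hash`. The result depends only on
/// the input bytes, so every node computes the same value for the same key.
fn fnv1a(bytes: &[u8], mut hash: u64) -> u64 {
    for b in bytes {
        hash ^= *b as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

impl AggregateKey {
    /// Hypothetical shard assignment. `shard_count` must be non-zero.
    fn shard(&self, shard_count: u64) -> u64 {
        let mut h = fnv1a(&self.org_id.to_be_bytes(), FNV_OFFSET);
        h = fnv1a(&self.aggregate_type_id.to_be_bytes(), h);
        h = fnv1a(self.aggregate_id.as_bytes(), h);
        h % shard_count
    }
}
```

Because the mapping is a pure function of the key, every event for one aggregate reaches the same shard, which is what lets a single writer path enforce its ordering.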
Dynamic consistency boundaries
One atomic write can span many aggregates, each guarded by its own version. The consistency boundary is not fixed at a single aggregate; it is whatever set you write together, chosen per write. Debit one account and credit another in one write (the SYNOPSIS above), each with its own optimistic-concurrency check. If either moved since you read it, the whole write is rejected: no half-finished transfer, no distributed transaction, no saga, no outbox. Most event stores fix the boundary at one aggregate and make you bolt the rest on; few, if any, let you draw it per write. It is the cleanest answer to "these two aggregates must agree" short of abandoning event sourcing.
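The all-or-nothing rule can be sketched as "check every guard, then apply everything". This is an in-memory Rust model under stated assumptions: `Store`, `Write`, `Conflict`, and `write_all` are hypothetical names, and a real server would also have to make the two phases atomic against concurrent writers.

```rust
use std::collections::{BTreeMap, HashMap};

/// One aggregate's part of a multi-aggregate write.
struct Write {
    events: Vec<String>,
    expected_version: u64,
}

/// The first guard that failed.
#[derive(Debug, PartialEq)]
struct Conflict {
    key: String,
    expected: u64,
    actual: u64,
}

/// Hypothetical in-memory store: aggregate key -> ordered events.
#[derive(Default)]
struct Store {
    streams: HashMap<String, Vec<String>>,
}

impl Store {
    /// Assumption: version = number of events in the aggregate.
    fn version(&self, key: &str) -> u64 {
        self.streams.get(key).map_or(0, |s| s.len() as u64)
    }

    /// All of it lands, or none of it does.
    fn write_all(&mut self, writes: BTreeMap<String, Write>) -> Result<(), Conflict> {
        // Phase 1: validate each aggregate against its own expected version.
        for (key, w) in &writes {
            let actual = self.version(key);
            if actual != w.expected_version {
                return Err(Conflict {
                    key: key.clone(),
                    expected: w.expected_version,
                    actual,
                });
            }
        }
        // Phase 2: no guard failed, so apply every write.
        for (key, w) in writes {
            self.streams.entry(key).or_default().extend(w.events);
        }
        Ok(())
    }
}
```

Nothing is appended until every guard has passed, so a stale version on the credit side leaves the debit side untouched.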
Exactly-once writes
A retried write lands at most once: no double-apply, and no outbox, no
dedup table, no CDC pipeline to run. Each writer derives a sequence
number from the log it already reads while catching up; the server
deduplicates on (client, aggregate, sequence) and checks optimistic
concurrency before idempotency, so a concurrent writer's event
is never mistaken for your retry. The case that breaks most systems, a
timeout where you cannot tell whether the write landed, is safe here:
hold the sequence and retry, and it either lands once or comes back
recognised as already applied. The whole contract is one read-and-retry
loop; the reference implementation (account_service.rs) is a
few hundred lines with no extra moving parts.
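The deduplication rule can be sketched as follows. This is one possible reading of "optimistic concurrency before idempotency", not the server's actual logic: the `Server` type, the `Outcome` variants, and the exact order of checks are assumptions, and the real reference implementation is account_service.rs, which this does not reproduce.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
enum Outcome {
    /// The write landed now.
    Applied,
    /// A retry of a write that had already landed.
    AlreadyApplied,
    /// Someone else moved the aggregate.
    Conflict,
}

/// Hypothetical in-memory server state.
#[derive(Default)]
struct Server {
    streams: HashMap<String, Vec<String>>,
    /// Dedup keys that have landed: (client, aggregate, sequence).
    seen: HashSet<(String, String, u64)>,
}

impl Server {
    fn write(
        &mut self,
        client: &str,
        aggregate: &str,
        sequence: u64,
        expected_version: u64,
        event: &str,
    ) -> Outcome {
        let key = (client.to_string(), aggregate.to_string(), sequence);
        let actual = self.streams.get(aggregate).map_or(0, |s| s.len() as u64);

        // Optimistic concurrency first: a matching version means a fresh write.
        if actual == expected_version {
            self.streams
                .entry(aggregate.to_string())
                .or_default()
                .push(event.to_string());
            self.seen.insert(key);
            return Outcome::Applied;
        }

        // Then idempotency: a mismatch is a retry only if this exact
        // (client, aggregate, sequence) has landed before.
        if self.seen.contains(&key) {
            Outcome::AlreadyApplied
        } else {
            Outcome::Conflict
        }
    }
}
```

After a timeout the client resends the same sequence and expected version. If the first attempt landed, the answer is `AlreadyApplied`. A different client racing for the same version gets `Conflict`, because the dedup key includes the client.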
Two nodes and S3
The leader takes writes and replicates each batch to the follower; both fdatasync to disk before the leader acknowledges, so an acknowledged write is on two machines, not one. S3 is never in the write path, so it adds zero latency: it is only the coordination plane (leader election is a single S3 conditional write) and the backup replica (lose the follower and the leader keeps serving, replicating to S3 until it returns). No Zookeeper, no Raft library to misconfigure.
Writes commit speculatively and roll back cleanly: data hits disk before replication confirms, but a reader never sees a write that could still be rolled back. Rolling upgrades take one node at a time, and the restarted node catches up on its own. If S3 is unreachable, a leader handover can wait, but an acknowledged write is never lost. No split-brain, no lost writes.
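The visibility rule, where data is on disk before replication confirms but readers never see it until then, can be pictured as a log with a confirmation watermark. The sketch below is a deliberately small model: `LeaderLog` and its methods are hypothetical names, and it ignores disks, networking, and S3 entirely.

```rust
/// Hypothetical leader-side log. Entries are written locally before the
/// follower confirms them.
#[derive(Default)]
struct LeaderLog {
    entries: Vec<String>,
    /// How many entries the follower has confirmed as durable.
    confirmed: usize,
}

impl LeaderLog {
    /// Speculative step: the entry is local before replication confirms.
    fn append_local(&mut self, event: &str) {
        self.entries.push(event.to_string());
    }

    /// The follower confirmed the first `upto` entries; the watermark only
    /// moves forward and never past what exists locally.
    fn confirm(&mut self, upto: usize) {
        let upto = upto.min(self.entries.len());
        self.confirmed = self.confirmed.max(upto);
    }

    /// Readers see confirmed entries only, so nothing visible can be rolled back.
    fn visible(&self) -> &[String] {
        &self.entries[..self.confirmed]
    }

    /// Replication failed: drop the unconfirmed tail.
    fn roll_back(&mut self) {
        self.entries.truncate(self.confirmed);
    }
}
```

A rollback truncates only the tail past the watermark, which is exactly the part no reader was ever shown.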
The storage engine
Celeriant skips the kernel page cache and writes with Direct I/O. Buffered I/O can report a clean fsync and still lose the data when writeback fails later; that is documented kernel behaviour, not a hypothetical. One thread per core on io_uring, in the ScyllaDB and TigerBeetle lineage, removes the lock contention and the entire class of concurrency bugs that thread-pool databases never stop fighting.
Small events are stored inline in their metablock, saving a disk seek entirely. And memory stays bounded at any cardinality: each log segment carries a bloom filter, cold aggregates fall back to a pruned reverse scan of the log, and recent writes stay hot in a bounded LRU cache. Millions of aggregates and billions of events on a 32 GB box, with the log on NVMe. The only cost is slightly higher latency on the first read of a cold aggregate.
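The cold-read path, a reverse scan that skips segments whose bloom filter rules the aggregate out, can be sketched like this. The filter size, the seeded FNV-1a probes, and the `Segment` layout are assumptions chosen for brevity; the text does not describe Celeriant's on-disk format.

```rust
/// A tiny Bloom filter: it may wrongly say "maybe present", but never
/// wrongly says "absent".
struct Bloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl Bloom {
    fn new(bits: usize, hashes: u64) -> Self {
        Bloom { bits: vec![false; bits], hashes }
    }

    /// Seeded FNV-1a; each seed gives one probe position.
    fn index(&self, key: &str, seed: u64) -> usize {
        let mut h: u64 = 0xcbf2_9ce4_8422_2325 ^ seed.wrapping_mul(0x9e37_79b9_7f4a_7c15);
        for b in key.as_bytes() {
            h ^= *b as u64;
            h = h.wrapping_mul(0x0000_0100_0000_01b3);
        }
        (h % self.bits.len() as u64) as usize
    }

    fn insert(&mut self, key: &str) {
        for seed in 0..self.hashes {
            let i = self.index(key, seed);
            self.bits[i] = true;
        }
    }

    fn may_contain(&self, key: &str) -> bool {
        (0..self.hashes).all(|seed| self.bits[self.index(key, seed)])
    }
}

/// Hypothetical log segment: a filter plus (aggregate, payload) pairs.
struct Segment {
    filter: Bloom,
    events: Vec<(String, String)>,
}

/// Newest-first scan that skips any segment the filter rules out, then
/// returns the aggregate's events in append order.
fn read_cold<'a>(segments: &'a [Segment], aggregate: &str) -> Vec<&'a str> {
    let mut found = Vec::new();
    for seg in segments.iter().rev() {
        if !seg.filter.may_contain(aggregate) {
            continue; // definitely not in this segment
        }
        for (agg, payload) in seg.events.iter().rev() {
            if agg.as_str() == aggregate {
                found.push(payload.as_str());
            }
        }
    }
    found.reverse();
    found
}
```

A false positive costs one wasted segment scan and never a wrong answer, because the scan still compares the aggregate key on every event.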
On by default
You do not opt in to correctness. These come standard.
- Durability: an acknowledged write is fdatasync'd to disk on both nodes, through Direct I/O, before the ack returns. Pull the power on either one; it is still there.
- Schema validation: register a JSON Schema, Avro, or Protobuf schema per event type; malformed events are rejected at write time, on the server, with semantic versioning built in. Server-side validation is exactly what Kafka makes you buy Confluent for.
- Audit and encryption: every event is hash-chained to its predecessor with Blake3, so tampering is detectable; payloads can be encrypted per event with AES-GCM.
- Live watch: subscribe and react to writes as they land, per aggregate or across a whole org. Read models catch up by replaying from an offset, then follow the live tail.
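The hash chain under "Audit and encryption" can be illustrated with a short sketch. Note the swap: the text names Blake3, while this block uses the standard library's `DefaultHasher` as a stand-in so it runs with no dependencies. `DefaultHasher` is not cryptographic and would not make tampering detectable against an adversary; only the chaining structure is the point. `ChainedEvent`, `append`, and `first_broken` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// An event that records its predecessor's hash alongside its own.
struct ChainedEvent {
    payload: String,
    prev_hash: u64,
    hash: u64,
}

/// Stand-in digest over (previous hash, payload). NOT cryptographic;
/// a real chain would use Blake3 as the text describes.
fn digest(prev_hash: u64, payload: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev_hash.hash(&mut h);
    payload.hash(&mut h);
    h.finish()
}

/// Append an event linked to the current tail (0 for the first event).
fn append(chain: &mut Vec<ChainedEvent>, payload: &str) {
    let prev_hash = chain.last().map_or(0, |e| e.hash);
    chain.push(ChainedEvent {
        payload: payload.to_string(),
        prev_hash,
        hash: digest(prev_hash, payload),
    });
}

/// Index of the first event whose link or hash does not check out, if any.
fn first_broken(chain: &[ChainedEvent]) -> Option<usize> {
    let mut prev = 0u64;
    for (i, e) in chain.iter().enumerate() {
        if e.prev_hash != prev || e.hash != digest(e.prev_hash, &e.payload) {
            return Some(i);
        }
        prev = e.hash;
    }
    None
}
```

Editing any payload changes its digest, so verification fails at that event; recomputing its hash to hide the edit breaks the link stored in the next event instead.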
Performance
The full shape of one run, so you can judge whether it is honest. Not a single hero number: connections, throughput, the whole latency spread.
- Connections: 36,000
- Durable writes / sec: 325,000
- p50: 94 ms
- p95: 111 ms
- p99: 201 ms
- Concurrency: 36,000 durable writes in flight at once, across four load-generating clients. A saturation number well past Postgres's connection wall, not a single-threaded ping.
- Payload: one "Hello World" event per acknowledged write
- Hardware: two AWS i4i.16xlarge data nodes: 64 vCPU, four local NVMe drives striped RAID0
- Cost: the two i4i.16xlarge data nodes run $13.16 an hour on-demand in ap-southeast-2, about $9,600 a month, before reserved or spot discounts. It scales down hard: two i4i.large cost about $300 a month and still hold 30,000 durable writes a second at a p99 of 158 ms.
- Network: ap-southeast-2, single availability zone
- Security: mTLS on client connections and on cluster replication
- The write path: every write is fdatasync'd to disk on both nodes through Direct I/O, replicated to the follower, and acknowledged only after both succeed
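The cost figure can be checked with a few lines of arithmetic. This sketch assumes a 730-hour month (8,760 hours a year over 12 months) and, for the per-write figure, that the cluster runs at the quoted peak rate for the whole hour, which real workloads rarely do. The function names are illustrative.

```rust
/// Average hours in a month: 8,760 hours a year over 12 months.
const HOURS_PER_MONTH: f64 = 730.0;

/// On-demand monthly cost from an hourly rate.
fn monthly_cost(hourly_usd: f64) -> f64 {
    hourly_usd * HOURS_PER_MONTH
}

/// Dollars per million durable writes, assuming the rate is sustained
/// for the full hour.
fn cost_per_million_writes(hourly_usd: f64, writes_per_sec: f64) -> f64 {
    let writes_per_hour = writes_per_sec * 3600.0;
    hourly_usd / writes_per_hour * 1_000_000.0
}
```

`monthly_cost(13.16)` comes to roughly $9,607, matching the "about $9,600 a month" above. At 325,000 writes a second, `cost_per_million_writes(13.16, 325_000.0)` is a little over one cent per million durable writes.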
Every latency is end-to-end, including replication and both fsyncs, over mTLS on the client and replication paths. That is encrypted, durable, replicated throughput, not a page-cache number you cannot trust. The test ran in a single availability zone; expect worse numbers across zones. It was one sweep on AWS, and you can reproduce it for a few dollars.
CONFORMING TO
Apache-2.0, release pending. The server runs on Linux: the write path is built on io_uring and Direct I/O. First-party clients for .NET and Rust speak the binary protocol directly; a local-first HTTP and SSE gateway fronts browsers.
CAVEATS
When not to use it. Celeriant is narrow on purpose. Here is when it is the wrong tool.
- You have one writer per aggregate. No contention means you do not need optimistic concurrency. Kafka with an outbox and Debezium will serve you fine; reach for that instead.
- You need a query database. This is the write side. Reads are per-aggregate, by offset and event type: no SQL, no ad-hoc queries. Project into Postgres or your read store of choice for that.
- You need transactional reads across aggregates. Per aggregate, just-in-time catch-up keeps reads current: replay the stream at read time, then serve. What you give up is a single transactional snapshot across many aggregates. If you need that, use an RDBMS.
- You need stability today. Celeriant is pre-1.0 and the wire format can still change. If you cannot ride breaking changes, run EventStoreDB or Marten now and revisit later.
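Just-in-time catch-up, as described in the caveat on transactional reads, amounts to remembering how far a read model has folded and replaying only the rest at read time. The sketch below is a minimal in-memory version: `Balance`, `Event`, and `catch_up` are hypothetical, and the slice stands in for a stream read from the store.

```rust
/// A read model that remembers how far into the stream it has folded.
struct Balance {
    amount: i64,
    version: u64,
}

enum Event {
    Deposited(i64),
    Withdrawn(i64),
}

/// Fold one event into the model and advance its version.
fn apply(state: &mut Balance, event: &Event) {
    match event {
        Event::Deposited(n) => state.amount += *n,
        Event::Withdrawn(n) => state.amount -= *n,
    }
    state.version += 1;
}

/// Replay only the events past the model's version, then serve the result.
fn catch_up(state: &mut Balance, stream: &[Event]) -> i64 {
    for event in stream.iter().skip(state.version as usize) {
        apply(state, event);
    }
    state.amount
}
```

Each call folds only the events the model has not yet seen, so a read stays current for that one aggregate. Nothing here coordinates two aggregates, which is the snapshot guarantee the caveat says you give up.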
COLOPHON
Celeriant is releasing under Apache-2.0. The OSS core will be fully functional; no crippled community edition.
No signup form. Email tyson@celeriant.io and you will hear when it ships, and not otherwise. Or watch the repo on GitHub and find out the same day.
Questions: tyson@celeriant.io / LinkedIn.
[1] Kafka has idempotent and transactional producers, but no per-key compare-and-set on produce: the expected-offset append event sourcing needs. Tracked in KAFKA-2260.
[2] Throughput ceiling and vacuum/IOPS behaviour vary by workload and tuning; the failure mode under event-sourcing write patterns is the consistent part.