NAME

celeriant - an append-only event store for the write side of CQRS

SYNOPSIS

// append to one aggregate, only if it is still at the version you read.
await pool.WriteAsync(key, [event], expectedVersion: version);

// or append to many aggregates in one atomic write, each guarded by its
// own version: a dynamic consistency boundary. all of it lands, or none.
await pool.WriteAsync(new WriteRequest {
    Writes = new() {
        [from] = new() { Events = [withdrawn], ExpectedVersion = fromVersion },
        [to]   = new() { Events = [deposited], ExpectedVersion = toVersion },
    },
});

// read it back: stream every batch from a version, pagination followed
// for you, and fold each event into your projection.
await foreach (var batch in pool.ReadAllAsync(key, ReadFilters.From(version)))
    foreach (var e in batch.Events)
        state = Apply(state, e);

DESCRIPTION

Celeriant is the write side of CQRS: a distributed, append-only log for event sourcing. Optimistic concurrency, strict per-aggregate ordering, exactly-once writes, atomic writes across aggregates. Postgres-grade write correctness at Kafka-grade throughput, on two nodes.

Event sourcing lives or dies on one question: can I append this event only if nobody changed the aggregate under me? Kafka scales to millions of writes a second but cannot do a conditional, per-key append.[1] Postgres can, with a row lock and a conditional insert, up to around ten to twenty thousand conditional writes a second; past that, commit-fsync latency and lock contention on hot aggregates are the wall.[2] Miss the answer in production and two requests both read version 4, both decide they are valid, both append milliseconds apart. One just violated an invariant you thought you enforced. No error.
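
The race above is small enough to model in memory. A minimal sketch in Rust, assuming the version is simply the count of events in the stream; `Stream`, `append_if`, and `race` are illustrative names, not Celeriant's API:

```rust
/// In-memory stand-in for one aggregate's stream (hypothetical, not Celeriant's API).
#[derive(Debug, Default)]
pub struct Stream {
    events: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum AppendError {
    /// The stream moved since the caller read it.
    VersionConflict { expected: u64, actual: u64 },
}

impl Stream {
    /// Assumption: version = number of events appended so far.
    pub fn version(&self) -> u64 {
        self.events.len() as u64
    }

    /// Append only if the stream is still at `expected`; otherwise change nothing.
    pub fn append_if(&mut self, expected: u64, event: &str) -> Result<u64, AppendError> {
        let actual = self.version();
        if actual != expected {
            return Err(AppendError::VersionConflict { expected, actual });
        }
        self.events.push(event.to_string());
        Ok(self.version())
    }
}

/// Replays the race from the text: both writers read version 4, then both append.
pub fn race() -> (Result<u64, AppendError>, Result<u64, AppendError>) {
    let mut stream = Stream::default();
    for v in 0..4 {
        stream.append_if(v, "seed").unwrap();
    }
    let read_by_a = stream.version();
    let read_by_b = stream.version();
    let a = stream.append_if(read_by_a, "withdrawn:80");
    let b = stream.append_if(read_by_b, "withdrawn:80");
    (a, b)
}
```

With the guard in place the second writer gets an explicit conflict instead of a silent double append; it re-reads, re-decides, and retries.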

AVAILABILITY: Apache-2.0 release coming.

The aggregate is the unit of ordering

Every event belongs to an aggregate, addressed by a three-part key:

org_id / aggregate_type_id / aggregate_id

Events within an aggregate are strictly ordered: no gaps, no concurrent writers. Aggregates map to shards deterministically, so that ordering holds across the cluster, not just on one box. One stream per user, per device, per order, per match. Model your domain, not your index budget: a stream per entity costs nothing, even at millions of them.

Dynamic consistency boundaries

One atomic write can span many aggregates, each guarded by its own version. The consistency boundary is not fixed at a single aggregate; it is whatever set you write together, chosen per write. Debit one account and credit another in one write (the SYNOPSIS above), each with its own optimistic-concurrency check. If either moved since you read it, the whole write is rejected: no half-finished transfer, no distributed transaction, no saga, no outbox. Most event stores fix the boundary at one aggregate and make you bolt the rest on. Drawing the boundary per write is how you say "these two aggregates must agree" without abandoning event sourcing.
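
The all-or-nothing rule can be sketched as validate-then-apply. This is a hedged in-memory model, not the server's implementation: `Store`, `Guarded`, `write_all`, and the string keys are hypothetical, and each aggregate is assumed to appear at most once per write.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory store: aggregate key -> ordered events.
#[derive(Default)]
pub struct Store {
    streams: HashMap<String, Vec<String>>,
}

/// One guarded append inside a multi-aggregate write.
pub struct Guarded {
    pub key: String,
    pub expected_version: u64,
    pub events: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub struct Conflict {
    pub key: String,
    pub expected: u64,
    pub actual: u64,
}

/// Convenience constructor for a single-event guarded append.
pub fn guarded(key: &str, expected_version: u64, event: &str) -> Guarded {
    Guarded {
        key: key.to_string(),
        expected_version,
        events: vec![event.to_string()],
    }
}

impl Store {
    /// Assumption: version = number of events in the stream (0 if absent).
    pub fn version(&self, key: &str) -> u64 {
        self.streams.get(key).map_or(0, |s| s.len() as u64)
    }

    /// Check every guard first, then apply. One stale guard rejects the whole write.
    pub fn write_all(&mut self, writes: Vec<Guarded>) -> Result<(), Conflict> {
        for w in &writes {
            let actual = self.version(&w.key);
            if actual != w.expected_version {
                return Err(Conflict {
                    key: w.key.clone(),
                    expected: w.expected_version,
                    actual,
                });
            }
        }
        // Every guard held, so nothing below can fail halfway.
        for w in writes {
            self.streams.entry(w.key).or_default().extend(w.events);
        }
        Ok(())
    }
}
```

Because no stream is touched until every guard has passed, a rejected transfer leaves both accounts exactly as they were.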

An atomic write always lands on one shard. There is no cross-shard commit protocol. Routing is a plain modulus on an id you choose at cluster init (org, type, or aggregate), so multi-aggregate writes that must agree can be placed together intentionally. A write that spans shards is rejected whole, never half-applied.
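
Routing by modulus is short enough to write down. A sketch under assumptions: ids are taken to be unsigned integers, the shard count is non-zero, and `RouteBy`, `Key`, `shard_of`, and `single_shard` are illustrative names rather than Celeriant's own.

```rust
/// Which part of the key drives routing; per the text, chosen once at cluster init.
#[derive(Clone, Copy)]
pub enum RouteBy {
    Org,
    Type,
    Aggregate,
}

/// The three-part key, with numeric ids assumed for illustration.
#[derive(Clone, Copy)]
pub struct Key {
    pub org_id: u64,
    pub aggregate_type_id: u64,
    pub aggregate_id: u64,
}

/// Plain modulus on the chosen id. Assumes `shard_count` > 0.
pub fn shard_of(key: Key, route_by: RouteBy, shard_count: u64) -> u64 {
    let id = match route_by {
        RouteBy::Org => key.org_id,
        RouteBy::Type => key.aggregate_type_id,
        RouteBy::Aggregate => key.aggregate_id,
    };
    id % shard_count
}

/// The one shard a multi-aggregate write lands on, or None if the keys span
/// shards (or the write is empty), in which case the write is rejected whole.
pub fn single_shard(keys: &[Key], route_by: RouteBy, shard_count: u64) -> Option<u64> {
    let mut shards = keys.iter().map(|k| shard_of(*k, route_by, shard_count));
    let first = shards.next()?;
    if shards.all(|s| s == first) {
        Some(first)
    } else {
        None
    }
}
```

Routing by org puts every aggregate of one org on the same shard, so any two of them can share an atomic write; routing by aggregate id spreads load more evenly but makes co-location a matter of choosing ids.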

Exactly-once writes

A retried write lands at most once, and there is no outbox, dedup table, or CDC pipeline to babysit. Each writer derives a sequence number from the log it already reads while catching up; the server deduplicates on (client, aggregate, sequence) and checks optimistic concurrency before idempotency, so a concurrent writer's event is never mistaken for your retry. The whole contract is one read-and-retry loop; the reference implementation (account_service.rs) is a few hundred lines with no extra moving parts.
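
One plausible reading of that contract, as an in-memory sketch. The real server logic and wire format are not shown here, so everything below is an assumption: the names, the optional expected version, and the rule that a writer's next sequence is one past the highest it finds for itself in the log.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Appended { version: u64 },
    /// Same (client, sequence) already landed on this aggregate; nothing written.
    Duplicate,
    /// Expected version is stale; nothing written.
    Conflict { actual: u64 },
}

/// Hypothetical model of one aggregate on the server.
#[derive(Default)]
pub struct Aggregate {
    events: Vec<(u32, u64, String)>, // (client, sequence, payload)
    seen: HashSet<(u32, u64)>,       // dedup keys within this aggregate
}

impl Aggregate {
    pub fn version(&self) -> u64 {
        self.events.len() as u64
    }

    /// Writer side: derive the next sequence from the log it already reads.
    pub fn next_sequence(&self, client: u32) -> u64 {
        self.events
            .iter()
            .filter(|e| e.0 == client)
            .map(|e| e.1)
            .max()
            .map_or(0, |s| s + 1)
    }

    pub fn write(&mut self, client: u32, sequence: u64, expected: Option<u64>, payload: &str) -> Outcome {
        // 1. Optimistic concurrency first: a stale writer never reaches the dedup check.
        if let Some(v) = expected {
            if v != self.version() {
                return Outcome::Conflict { actual: self.version() };
            }
        }
        // 2. Idempotency second: a repeat of (client, sequence) is dropped.
        if !self.seen.insert((client, sequence)) {
            return Outcome::Duplicate;
        }
        self.events.push((client, sequence, payload.to_string()));
        Outcome::Appended { version: self.version() }
    }
}
```

In this model a writer that hits a conflict catches up from the log; if its own sequence has advanced, the earlier attempt landed and there is nothing to retry.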

Two nodes and S3

The leader takes writes and replicates each batch to the follower; both fdatasync to disk before the leader acknowledges, so an acknowledged write is on two machines, not one. In normal operation S3 is never in the write path, so it adds zero latency: it is only the coordination plane (leader election is a single S3 conditional write) and the backup replica (lose the follower and the leader keeps serving, replicating to S3 until the follower returns). There is no ZooKeeper to run and no Raft library to misconfigure.
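
Election by conditional write can be sketched without a network. S3 exposes conditional writes through the If-None-Match and If-Match request headers; the model below replaces the bucket with an in-memory object whose ETag is a counter. The lease layout (holder, epoch, expiry) and every name here are assumptions, not Celeriant's format.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Lease {
    pub holder: String,
    pub epoch: u64,
    pub expires_at_ms: u64,
}

/// Stand-in for the one S3 object that holds the lease: (etag, lease).
#[derive(Default)]
pub struct LeaseObject {
    current: Option<(u64, Lease)>,
}

impl LeaseObject {
    pub fn get(&self) -> Option<(u64, Lease)> {
        self.current.clone()
    }

    /// Conditional put. `if_match = None` means "only if the object does not exist";
    /// `Some(etag)` means "only if the object still has this etag".
    pub fn put_if(&mut self, if_match: Option<u64>, lease: Lease) -> bool {
        let current_etag = self.current.as_ref().map(|(etag, _)| *etag);
        if current_etag != if_match {
            return false; // someone else wrote first
        }
        let next_etag = current_etag.map_or(1, |e| e + 1);
        self.current = Some((next_etag, lease));
        true
    }
}

/// Try to take the lease with one read and one conditional write.
pub fn try_acquire(obj: &mut LeaseObject, node: &str, now_ms: u64, ttl_ms: u64) -> Option<Lease> {
    let seen = obj.get();
    let (if_match, epoch) = match &seen {
        None => (None, 1),
        // A live lease exists: do not contend.
        Some((_, lease)) if lease.expires_at_ms > now_ms => return None,
        // The lease expired: replace exactly the version we observed.
        Some((etag, lease)) => (Some(*etag), lease.epoch + 1),
    };
    let lease = Lease {
        holder: node.to_string(),
        epoch,
        expires_at_ms: now_ms + ttl_ms,
    };
    if obj.put_if(if_match, lease.clone()) {
        Some(lease)
    } else {
        None
    }
}
```

Two nodes that observe the same state and both attempt the write cannot both succeed: the first put changes the ETag and the second fails its precondition.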

Writes commit speculatively and roll back cleanly: data hits disk before replication confirms, but a reader never sees a write that could still be rolled back. Rolling upgrades take one node at a time, and the restarted node catches up on its own. If S3 is unreachable, a leader handover can wait, but an acknowledged write is never lost, and two nodes can never both hold the lease.
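
Speculative commit with a visibility watermark, as a hedged sketch. It assumes the leader tracks a single confirmed prefix of its local log; `SpeculativeLog` and its methods are illustrative names, and the real engine works on disk rather than a `Vec`.

```rust
/// Hypothetical leader-side log: entries land locally before replication confirms.
#[derive(Default)]
pub struct SpeculativeLog {
    entries: Vec<String>, // written locally, possibly not yet replicated
    confirmed: usize,     // entries[..confirmed] are durable on both nodes
}

impl SpeculativeLog {
    /// Write locally right away (speculative). Returns the entry's index.
    pub fn append_local(&mut self, event: &str) -> usize {
        self.entries.push(event.to_string());
        self.entries.len() - 1
    }

    /// The follower confirmed everything up to and including `index`.
    pub fn confirm_through(&mut self, index: usize) {
        let upto = (index + 1).min(self.entries.len());
        self.confirmed = self.confirmed.max(upto);
    }

    /// Readers only ever see the confirmed prefix.
    pub fn visible(&self) -> &[String] {
        &self.entries[..self.confirmed]
    }

    /// Replication failed: drop the unconfirmed tail. Returns how many entries were dropped.
    pub fn rollback_unconfirmed(&mut self) -> usize {
        let dropped = self.entries.len() - self.confirmed;
        self.entries.truncate(self.confirmed);
        dropped
    }
}
```

A rollback only ever removes entries past the watermark, so nothing a reader has already seen can disappear.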

The storage engine

Celeriant skips the kernel page cache and writes with Direct I/O. Buffered I/O can report a clean fsync and still lose the data when writeback fails later; that is documented kernel behaviour, not a hypothetical. One thread per core on io_uring, in the ScyllaDB and TigerBeetle lineage, removes the lock contention and the entire class of concurrency bugs that thread-pool databases never stop fighting.

Small events are stored inline in their metablock, saving a disk seek entirely. And memory stays bounded at any cardinality: each log segment carries a bloom filter, cold aggregates fall back to a pruned reverse scan of the log, and recent writes stay hot in a bounded LRU cache. Millions of aggregates and billions of events on a 32 GB box, with the log on NVMe. The only cost is slightly higher latency on the first read of a cold aggregate.
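
The pruned reverse scan might look roughly like this. It is a toy model under stated assumptions: a 1024-bit filter with two probes built on the standard library's `DefaultHasher`, segments held in memory, and names (`Bloom`, `Segment`, `latest_event`) invented for illustration. Celeriant's actual filter sizing and on-disk layout are not described here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Tiny bloom filter: two probes into 1024 bits. Sizes are illustrative only.
pub struct Bloom {
    bits: [u64; 16],
}

impl Bloom {
    pub fn new() -> Self {
        Bloom { bits: [0; 16] }
    }

    fn probes(key: &str) -> [usize; 2] {
        let mut out = [0usize; 2];
        for (seed, slot) in out.iter_mut().enumerate() {
            let mut h = DefaultHasher::new();
            (seed as u64).hash(&mut h);
            key.hash(&mut h);
            *slot = (h.finish() % 1024) as usize;
        }
        out
    }

    pub fn insert(&mut self, key: &str) {
        for p in Self::probes(key) {
            self.bits[p / 64] |= 1u64 << (p % 64);
        }
    }

    /// false: definitely absent. true: possibly present.
    pub fn may_contain(&self, key: &str) -> bool {
        Self::probes(key)
            .iter()
            .all(|&p| self.bits[p / 64] & (1u64 << (p % 64)) != 0)
    }
}

/// One log segment: its filter plus (aggregate key, payload) pairs in write order.
pub struct Segment {
    pub bloom: Bloom,
    pub events: Vec<(String, String)>,
}

impl Segment {
    pub fn from_events(events: Vec<(String, String)>) -> Self {
        let mut bloom = Bloom::new();
        for (key, _) in &events {
            bloom.insert(key);
        }
        Segment { bloom, events }
    }
}

/// Newest-first lookup of the latest event for `key`.
/// Returns the payload (if any) and how many segments were actually scanned.
pub fn latest_event<'a>(segments: &'a [Segment], key: &str) -> (Option<&'a str>, usize) {
    let mut scanned = 0;
    for seg in segments.iter().rev() {
        if !seg.bloom.may_contain(key) {
            continue; // filter says "definitely not here": skip without reading
        }
        scanned += 1;
        if let Some((_, payload)) = seg.events.iter().rev().find(|(k, _)| k.as_str() == key) {
            return (Some(payload.as_str()), scanned);
        }
    }
    (None, scanned)
}
```

The filter can return a false positive, which costs one wasted segment scan, but never a false negative, so a cold lookup is slower and never wrong.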

On by default

You do not opt in to correctness. These come standard.

Durability: An acknowledged write is fdatasync'd to disk on both nodes, through Direct I/O, before the ack returns. Pull the power on either one; it is still there.
Schema validation: Register a JSON Schema, Avro, or Protobuf schema per event type; malformed events are rejected at write time, on the server, with semantic versioning built in. Server-side validation is exactly what Kafka makes you buy Confluent for.
Audit and encryption: Every event is hash-chained to its predecessor with Blake3, so tampering is detectable; payloads can be encrypted per event with AES-GCM.
Live watch: Subscribe and react to writes as they land, per aggregate or across a whole org. Read models catch up by replaying from an offset, then follow the live tail.
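
Hash chaining, from the audit item above, in miniature. Loud assumption: this sketch uses the standard library's `DefaultHasher` as a stand-in so it runs with no dependencies. It is not Blake3 and not cryptographically secure; only the chaining structure is the point. All names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

pub struct ChainedEvent {
    pub payload: String,
    pub prev_hash: u64, // hash of the predecessor; 0 for the first event
    pub hash: u64,      // hash over (prev_hash, payload)
}

/// Stand-in link hash. NOT Blake3: DefaultHasher is used only to keep the sketch std-only.
fn link_hash(prev_hash: u64, payload: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev_hash.hash(&mut h);
    payload.hash(&mut h);
    h.finish()
}

/// Append an event, chaining it to the current tail.
pub fn append(chain: &mut Vec<ChainedEvent>, payload: &str) {
    let prev_hash = chain.last().map_or(0, |e| e.hash);
    chain.push(ChainedEvent {
        payload: payload.to_string(),
        prev_hash,
        hash: link_hash(prev_hash, payload),
    });
}

/// Index of the first event whose link fails to verify, or None if the chain is intact.
pub fn first_broken_link(chain: &[ChainedEvent]) -> Option<usize> {
    let mut expected_prev = 0;
    for (i, e) in chain.iter().enumerate() {
        if e.prev_hash != expected_prev || e.hash != link_hash(e.prev_hash, &e.payload) {
            return Some(i);
        }
        expected_prev = e.hash;
    }
    None
}
```

Editing any stored payload breaks the link at that event, and rewriting its hash to match breaks the link at the next one, so an edit cannot be hidden without rewriting the rest of the chain.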

Performance

Connections: 36,000
Durable writes / sec: 325,000
Latency p50: 94 ms
Latency p95: 111 ms
Latency p99: 201 ms

Client concurrency: 36,000 durable writes in flight at once, across four load-generating clients. A saturation number well past Postgres's connection wall, not a single-threaded ping.
Payload: one "Hello World" event per acknowledged write.
Hardware: two AWS i4i.16xlarge data nodes, each with 64 vCPU and four local NVMe drives striped as RAID 0.
Cost: the two i4i.16xlarge data nodes run $13.16 an hour on-demand in ap-southeast-2, about $9,600 a month, before reserved or spot discounts. It scales down hard: two i4i.large cost about $300 a month and still hold 30,000 durable writes a second at a p99 of 158 ms.
Network: ap-southeast-2, single availability zone.
Security: mTLS on client connections and on cluster replication.
Durability: every write is fdatasync'd to disk on both nodes through Direct I/O, replicated to the follower, and acknowledged only after both succeed.

Every latency is end-to-end, including replication and both fsyncs, over mTLS on the client and replication paths. That is encrypted, durable, replicated throughput, not a page-cache number you cannot trust. Throughput peaks just over 500k writes/sec, but you pay for it in higher p99 latency. These numbers come from a single availability zone; expect worse across AZs. This is one sweep on AWS; reproduce it yourself for a few dollars.

CONFORMING TO

Apache-2.0, release pending. The server runs on Linux: the write path is built on io_uring and Direct I/O. First-party clients for .NET and Rust speak the binary protocol directly; a local-first HTTP and SSE gateway fronts browsers.

CAVEATS

When not to use it. Celeriant is narrow on purpose. Here is when it is the wrong tool.

You have one writer per aggregate and no causal coupling: No contention means you do not need optimistic concurrency. Independent systems that share no state have no race conditions. Kafka with an outbox and Debezium will serve you fine; reach for it.
You need a query database: This is the write side. Reads are per-aggregate, by offset and event type: no SQL, no ad-hoc queries. Project into Postgres or your read store of choice for that.
You need transactional reads across aggregates: Per aggregate, just-in-time catch-up keeps reads current: replay the stream at read time, then serve. What you give up is a single transactional snapshot across many aggregates. If you need that, use an RDBMS.
You need stability today: Pre-1.0, the wire format can still change. If you cannot ride breaking changes, run EventStoreDB or Marten now and revisit later.
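
The just-in-time catch-up mentioned above is a fold over the events the read model has not yet seen. A minimal sketch, assuming an account-balance projection that remembers the stream version it reflects; `Balance`, `Event`, `apply`, and `catch_up` are illustrative, and the slice stands in for a read from the store.

```rust
/// Hypothetical read model: a running balance plus the stream version it reflects.
#[derive(Debug, Default, PartialEq)]
pub struct Balance {
    pub amount: i64,
    pub version: u64,
}

pub enum Event {
    Deposited(i64),
    Withdrawn(i64),
}

/// Fold one event into the projection.
pub fn apply(mut state: Balance, event: &Event) -> Balance {
    match event {
        Event::Deposited(n) => state.amount += n,
        Event::Withdrawn(n) => state.amount -= n,
    }
    state.version += 1;
    state
}

/// Catch up at read time: replay only the events past `state.version`, then serve.
pub fn catch_up(state: Balance, stream: &[Event]) -> Balance {
    let from = (state.version as usize).min(stream.len());
    stream[from..].iter().fold(state, apply)
}
```

Each aggregate is current as of its own read. Nothing in this loop gives two aggregates a shared snapshot, which is exactly the trade-off the caveat names.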

COLOPHON

Celeriant will be released under Apache-2.0. The OSS core will be fully functional; no crippled community edition.

No signup form. Email tyson@celeriant.io and you will hear when it ships, and not otherwise. Or watch the repo on GitHub and find out the same day.

Questions: tyson@celeriant.io / LinkedIn.

[1] Kafka has idempotent and transactional producers, but no per-key compare-and-set on produce: the expected-offset append that event sourcing needs. Tracked in KAFKA-2260.
[2] Throughput ceiling and vacuum/IOPS behaviour vary by workload and tuning; the failure mode under event-sourcing write patterns is the consistent part.