Write-Ahead Log (WAL)
The WAL ensures data durability. Every write is persisted to the WAL before being stored in the main storage engine, so it can be replayed after a crash.
Key Features
- Partitioned by db/collection/date: Each partition gets its own Writer instance
- Batch Writer: Groups commits and flushes them every 10ms to reduce the number of I/O operations
- Protobuf Encoding: Entries serialized with Protocol Buffers, framed with CRC32 checksums
- Segment Rotation: Max 64MB per segment file, automatic rotation
- Event-Driven Flush: FlushWorkerPool is notified only on new segment boundaries (not per record)
Architecture
┌───────────────────────────────────────────────────────┐
│ PartitionedWriter                                     │
│                                                       │
│  ┌─────────────────────────────────────────────────┐  │
│  │ Partition: db1/sensors/2026-01-29               │  │
│  │  ┌───────────────────────────────────────────┐  │  │
│  │  │ batchWriter                               │  │  │
│  │  │ Buffer → flushLoop (10ms) → baseWAL       │  │  │
│  │  │ Returns WriteResult{IsNewSegment: bool}   │  │  │
│  │  └───────────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────────┘  │
│                                                       │
│  ┌─────────────────────────────────────────────────┐  │
│  │ Partition: db1/metrics/2026-01-29               │  │
│  │ (lazy-created, idle cleanup after 10 min)       │  │
│  └─────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────┘
Entry Format
Each WAL entry is protobuf-encoded and framed:
[4 bytes: entry size] [N bytes: protobuf data] [4 bytes: CRC32 checksum]
Entry fields:
type Entry struct {
    Type       EntryType // Write or Delete
    Database   string
    Collection string
    ShardID    string
    GroupID    int
    Time       string // RFC3339 timestamp
    ID         string // device_id
    Fields     map[string]interface{}
    Timestamp  int64 // Unix timestamp
}
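The size/payload/CRC32 framing can be illustrated with a small sketch. The document does not state the byte order, the CRC32 polynomial, or what the checksum covers, so this example assumes little-endian integers and the IEEE polynomial computed over the payload only. `encodeFrame` and `decodeFrame` are hypothetical helpers, and the payload is treated as already-serialized protobuf bytes.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

var errCorrupt = errors.New("wal: corrupt or truncated frame")

// encodeFrame wraps a serialized payload as
// [4-byte size][payload][4-byte CRC32].
// Assumed: little-endian, CRC32 (IEEE) over the payload only.
func encodeFrame(payload []byte) []byte {
	buf := make([]byte, 4+len(payload)+4)
	binary.LittleEndian.PutUint32(buf[0:4], uint32(len(payload)))
	copy(buf[4:], payload)
	binary.LittleEndian.PutUint32(buf[4+len(payload):], crc32.ChecksumIEEE(payload))
	return buf
}

// decodeFrame reads one frame from the front of buf and returns the
// payload plus the bytes that follow it.
func decodeFrame(buf []byte) (payload, rest []byte, err error) {
	if len(buf) < 8 {
		return nil, nil, errCorrupt
	}
	size := binary.LittleEndian.Uint32(buf[0:4])
	if uint64(size) > uint64(len(buf)-8) {
		return nil, nil, errCorrupt // size field points past the buffer
	}
	n := int(size)
	payload = buf[4 : 4+n]
	want := binary.LittleEndian.Uint32(buf[4+n : 8+n])
	if crc32.ChecksumIEEE(payload) != want {
		return nil, nil, errCorrupt
	}
	return payload, buf[8+n:], nil
}

func main() {
	frame := encodeFrame([]byte("serialized entry"))
	payload, rest, err := decodeFrame(frame)
	fmt.Println(string(payload), len(rest), err)
}
```

Because the checksum trails the payload, a reader can detect a torn write at the end of a segment and stop there instead of replaying a partial entry.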
Segment Files
data/wal/
├── db1/
│   └── sensors/
│       └── 2026-01-29/
│           ├── wal-1706486400000.log  (64MB max per segment)
│           ├── wal-1706486500000.log
│           └── ...
Segments are rotated when they reach 64MB. Old segments are removed once they have been flushed successfully to TieredStorage.
Write Modes
Async Write (Default)
- ~137ns per write
- Entries are buffered in memory and flushed every 10ms or when the batch size (1000 entries) is reached
- Best for high-throughput scenarios
Sync Write
- ~18ms per write
- Attaches a notification channel to the write and blocks until the flush completes
- Each write is durable once the call returns
Batch Write
- Bulk insert of multiple entries in a single call
- Entries are buffered and flushed together
Flush Integration
The WAL integrates with the FlushWorkerPool through an event-driven model:
- WriteWorkerPool writes entries to the PartitionedWriter
- When a new segment is created, WriteResult.IsNewSegment = true
- FlushWorkerPool is notified only on new segments (not per record)
- FlushWorker calls PrepareFlushPartition(), which rotates the segment and returns the old files
- Old segments are read, their data is flushed to TieredStorage, and the segment files are then removed
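The "notify only on new segments" rule can be sketched as a small gate between the write path and the flush queue. `notifyOnNewSegment` and the channel-based queue are hypothetical; only `WriteResult.IsNewSegment` comes from the text. The non-blocking send is also an assumption, made so that a busy flush pool never stalls writers.

```go
package main

import "fmt"

// WriteResult mirrors the field named in the text.
type WriteResult struct {
	IsNewSegment bool
}

// notifyOnNewSegment forwards the partition key to the flush queue only
// when the write opened a new segment. It reports whether a notification
// was actually queued.
func notifyOnNewSegment(res WriteResult, partition string, flushQueue chan<- string) bool {
	if !res.IsNewSegment {
		return false // ordinary record: no flush event
	}
	select {
	case flushQueue <- partition:
		return true
	default:
		// Assumed policy: drop the event if the queue is full and rely
		// on a later segment boundary to trigger the flush.
		return false
	}
}

func main() {
	queue := make(chan string, 8)
	// Simulated write results: only the first and fourth open a segment.
	results := []WriteResult{{true}, {false}, {false}, {true}, {false}}
	for _, res := range results {
		notifyOnNewSegment(res, "db1/sensors/2026-01-29", queue)
	}
	fmt.Println("flush events queued:", len(queue))
}
```

With this gate, the number of flush events scales with the number of segments rather than the number of records.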
Recovery
On startup, the WAL replays entries from segments that have not yet been flushed and removed:
- Scan all partition directories for segment files
- Read entries segment-by-segment
- Load recent data into MemoryStore and flush all data to TieredStorage
- Processed segments are removed after successful recovery