LSM Tree Backend¶

Overview¶

The LSM (Log-Structured Merge) Tree is a write-optimized data structure designed for high-throughput insertion and update scenarios. It trades read performance for exceptional write performance through deferred merging.

Key Characteristics¶

Write-Optimized: Fast sequential writes to in-memory structures
Level-Based Structure: Multiple sorted runs at different levels
Asynchronous Compaction: Background merging of levels
Bloom Filters: Quick negative lookups to avoid unnecessary I/O

Architecture¶

┌─────────────────────────────────┐ │ Memtable (In-Memory) │ │ - Red-Black Tree or SkipList │ │ - Sorted by key │ │ - Fast writes and reads │ └────────────┬────────────────────┘ │ (when full) ▼ ┌─────────────────────────────────┐ │ Level 0 (On-Disk SSTables) │ │ - Multiple overlapping runs │ │ - Unsorted between runs │ │ - Uncompacted structure │ └────────────┬────────────────────┘ │ (Compaction) ▼ ┌─────────────────────────────────┐ │ Level 1 (On-Disk SSTables) │ │ - Sorted runs (no overlap) │ │ - Size ×10 of Level 0 │ └────────────┬────────────────────┘ │ (Compaction) ▼ ┌─────────────────────────────────┐ │ Level 2, 3, ... (Final Levels) │ │ - Progressively larger │ │ - Minimal overlap │ └─────────────────────────────────┘

Data Flow¶

Write Path¶

Client writes key-value pair
Write logged to WAL (for durability)
Insert into memtable
Return success to client
Background: Eventually flush to Level 0

Benefits: Single sequential write to WAL + one random write to memtable

Read Path¶

Check memtable first (fastest)
Check L0 SSTables in reverse insertion order
Check L1-Ln SSTables using binary search
Use bloom filters to skip unnecessary SSTables

SSTables (Sorted String Tables)¶

SSTables are immutable, sorted files containing key-value pairs:

┌──────────────────────────────┐ │ Header │ │ - Version │ │ - Key Range │ │ - Bloom Filter │ ├──────────────────────────────┤ │ Data Blocks │ │ - Index Block │ │ - Data entries │ ├──────────────────────────────┤ │ Footer │ │ - Checksum │ │ - Index offset │ │ - Metadata offset │ └──────────────────────────────┘

Bloom Filters¶

Probabilistic data structures for efficient negative lookups:

False Positive: May indicate key exists when it doesn't
False Negative: Never occurs - if bloom says "no", key definitely absent
Performance: O(1) lookups avoiding unnecessary I/O

Compaction Strategy¶

Compaction merges overlapping SSTables into larger files:

Level-Based Compaction¶

L0 → L1: When L0 accumulates enough SSTables
L1 → L2: When L1 exceeds size limit
Ln → L(n+1): Cascading up the levels

Each level is typically 10x the size of the previous level.

Compaction Process¶

``` L0 Compactions (overlapping runs): SSTables at L0 ─┬─ Merge sort ─┐ └─ Combine ──┴──► L1 SSTable

Li → Li+1 Compaction (non-overlapping): L1 SSTables ────┬─ Merge sort ─┐ └─ Combine ──┴──► L2 SSTable ```

Performance Characteristics¶

Operation	Time Complexity	Notes
Write	O(log n) avg	Amortized, due to compaction
Read	O(log n) + I/O	Logarithmic seeks + block I/O
Range Scan	O(k log n)	k results, requires level traversal
Space	O(n×1.1)	Small overhead due to compaction

Trade-offs vs B+ Tree¶

Aspect	LSM Tree	B+ Tree
Writes	Much faster	Slower (random I/O)
Reads	Slower	Faster
Space Amp.	Low	Higher (balanced tree)
Write Amp.	High (compaction)	Low
Range Scans	Good	Excellent
Latency	Variable	Predictable

Configuration Parameters¶

Memtable Size: When to flush to L0 (affects write latency)
Level Size Multiplier: Ratio between consecutive levels (typically 10)
Target File Size: Size of SSTables at each level
Bloom Filter Bits: Bits per key for bloom filters (affects false positive rate)
Compaction Strategy: When and how to trigger compaction