Indexer Pipeline Architecture
The sui-indexer-alt-framework provides two distinct pipeline architectures. Understanding their differences is crucial for choosing the right approach.
Sequential versus concurrent pipelines
Sequential pipelines commit complete checkpoints in order. Each checkpoint is fully committed before the next one, ensuring simple, consistent reads.
Concurrent pipelines commit checkpoints out of order and can commit an individual checkpoint in several pieces. This allows multiple checkpoints to be processed simultaneously for higher throughput, but requires reads to check which data is fully committed to ensure consistency.
When to use each pipeline
Both pipeline types can handle updates in place, aggregations, and complex business logic. While sequential pipelines have throughput limitations compared to concurrent pipelines, the decision between them is primarily about engineering complexity rather than performance needs.
Recommended: Sequential pipeline
Start here for most use cases: sequential pipelines are more straightforward to implement and maintain. Choose a sequential pipeline when:
- ✓ You want straightforward implementation with direct commits and simple queries.
- ✓ Team prefers predictable, easy-to-debug behavior.
- ✓ Current performance meets your requirements.
- ✓ Operational simplicity is valued.
Concurrent pipeline
Consider implementing a concurrent pipeline when:
- ✓ Performance optimization is essential.
- ✓ Sequential processing can't keep up with your data volume.
- ✓ Your team is willing to handle the additional implementation complexity for the performance benefits.
Supporting out-of-order commits introduces a few additional complexities to your pipeline:
- Watermark-aware queries: All reads must check which data is fully committed. See the watermark system section for details.
- Complex application logic: You must handle data that commits in pieces rather than as complete checkpoints.
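To make the first point concrete, here is a minimal sketch of a watermark-aware read in Rust. The row type, field names, and in-memory filtering are illustrative assumptions, not framework types; in practice this check is usually a SQL predicate against the pipeline's watermark table.

```rust
/// Illustrative row shape: each indexed row records the checkpoint
/// that produced it (hypothetical fields, not a framework type).
struct BalanceRow {
    cp_sequence_number: u64,
    owner: String,
    balance: u64,
}

/// Watermark-aware read: ignore rows above the committer watermark,
/// because checkpoints past it may be only partially committed.
/// The SQL equivalent is a predicate like:
///   WHERE cp_sequence_number <= (SELECT checkpoint_hi_inclusive ...)
fn read_balances(rows: &[BalanceRow], checkpoint_hi_inclusive: u64) -> Vec<&BalanceRow> {
    rows.iter()
        .filter(|row| row.cp_sequence_number <= checkpoint_hi_inclusive)
        .collect()
}

fn main() {
    let rows = vec![
        BalanceRow { cp_sequence_number: 1001, owner: "0xa".into(), balance: 10 },
        BalanceRow { cp_sequence_number: 1003, owner: "0xb".into(), balance: 20 },
    ];
    // With the watermark at 1001, the row from checkpoint 1003 is
    // invisible even though it is already committed.
    for row in read_balances(&rows, 1001) {
        println!("{}: {}", row.owner, row.balance);
    }
}
```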
Decision framework
If you're unsure which pipeline to choose for your project, start with a sequential pipeline because it's easier to implement and debug. Then measure performance under realistic load. If the sequential pipeline can't meet your project's requirements, switch to a concurrent pipeline.
While not an exhaustive list, some specific scenarios where a sequential pipeline might not meet requirements include:
- Your pipeline produces data that benefits from chunking and out-of-order commits. A single checkpoint can produce a large volume of data, or individual writes large enough to add noticeable latency.
- You're producing a lot of data that needs pruning. In this case, you must use a concurrent pipeline.
Beyond the decision of which pipeline to use, you also need to consider scaling. If you're indexing multiple kinds of data, then consider using multiple pipelines and watermarks.
The watermark system
For each pipeline, the indexer at minimum tracks the highest checkpoint for which all data up to and including that checkpoint is committed. This is tracked through the checkpoint_hi_inclusive committer watermark. Both concurrent and sequential pipelines rely on checkpoint_hi_inclusive to determine where to resume processing after a restart.
Optionally, the pipeline tracks reader_lo and pruner_hi, which define safe lower bounds for reading and pruning operations, if pruning is enabled. These watermarks are particularly crucial for concurrent pipelines to enable out-of-order processing while maintaining data integrity.
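The relationships between these three watermarks can be summarized in a small model. This struct is a sketch for exposition, not a type from the framework; the invariant follows from the definitions above.

```rust
/// Per-pipeline watermarks, as described above (illustrative model).
#[derive(Debug, Clone, Copy)]
struct Watermarks {
    /// Highest checkpoint such that all checkpoints up to and
    /// including it are fully committed; the resume point on restart.
    checkpoint_hi_inclusive: u64,
    /// Lowest checkpoint readers can rely on (inclusive); only
    /// meaningful when pruning is enabled.
    reader_lo: u64,
    /// Exclusive upper bound of pruned data: everything below this
    /// checkpoint has been deleted.
    pruner_hi: u64,
}

impl Watermarks {
    /// Pruned data sits strictly below readable data, which sits at or
    /// below the committer watermark.
    fn is_consistent(&self) -> bool {
        self.pruner_hi <= self.reader_lo
            && self.reader_lo <= self.checkpoint_hi_inclusive + 1
    }
}

fn main() {
    // The numbers from Scenario 2 below satisfy the invariant.
    let w = Watermarks { checkpoint_hi_inclusive: 1003, reader_lo: 1000, pruner_hi: 1000 };
    assert!(w.is_consistent());
}
```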
Safe pruning
The watermark system creates a robust data lifecycle management system:
- Guaranteed data availability: Data between reader_lo and checkpoint_hi_inclusive is always available, so readers can query it safely.
- Automatic cleanup process: The pruner periodically deletes checkpoints that fall outside the retention window, so storage doesn't grow indefinitely while the retention guarantee is maintained. The pruning process runs with a safety delay to avoid race conditions.
- Balanced approach: The system strikes a balance between safety and efficiency.
  - Storage efficiency: Old data is deleted automatically.
  - Data availability: At least the configured retention window of complete data is always available.
  - Safety guarantees: Readers never encounter gaps in the available range.
  - Performance: Out-of-order processing maximizes throughput.
This watermark system is what makes concurrent pipelines both high-performance and reliable, enabling massive throughput while maintaining strong data availability guarantees and automatic storage management.
Scenario 1: Basic watermark (no pruning)
With pruning disabled, the indexer reports each pipeline's committer checkpoint_hi_inclusive only. Consider the following timeline, where a number of checkpoints are being processed and some are committed out of order.
Checkpoint Processing Timeline:

```
[1000] [1001] [1002] [1003] [1004] [1005]
  ✓      ✓      ✗      ✓      ✗      ✗
         ^
         checkpoint_hi_inclusive = 1001

✓ = Committed (all data written)
✗ = Not Committed (processing or failed)
```
In this scenario, checkpoint_hi_inclusive is 1001, even though checkpoint 1003 is committed, because there is still a gap at 1002. The indexer must report the high watermark at 1001 to satisfy the guarantee that all data from the start up to checkpoint_hi_inclusive is available.
After checkpoint 1002 is committed, you can safely read data up to 1003.
```
[1000] [1001] [1002] [1003] [1004] [1005]
  ✓      ✓      ✓      ✓      ✗      ✗
[------ SAFE TO READ ------]
 (start → checkpoint_hi_inclusive at 1003)
```
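The gap rule in this scenario is easy to express in code. The sketch below, with hypothetical names, advances a watermark through a set of committed checkpoints and stops at the first gap, reproducing the jump from 1001 to 1003.

```rust
use std::collections::BTreeSet;

/// Advance the committer watermark as far as the committed set allows:
/// it stops just before the first uncommitted checkpoint.
fn advance_watermark(mut hi: u64, committed: &BTreeSet<u64>) -> u64 {
    while committed.contains(&(hi + 1)) {
        hi += 1;
    }
    hi
}

fn main() {
    // Everything up to 999 is assumed committed; 1002 is missing, so
    // the watermark is stuck at 1001 even though 1003 is committed.
    let mut committed: BTreeSet<u64> = [1000, 1001, 1003].into_iter().collect();
    assert_eq!(advance_watermark(999, &committed), 1001);

    // Once 1002 lands, the watermark jumps over it to 1003.
    committed.insert(1002);
    assert_eq!(advance_watermark(999, &committed), 1003);
}
```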
Scenario 2: Pruning enabled
Pruning is enabled for pipelines configured with a retention policy. For example, if your table is growing too large and you want to keep only the last four checkpoints, set retention = 4. The indexer then periodically updates reader_lo from checkpoint_hi_inclusive and the configured retention (reader_lo = checkpoint_hi_inclusive - retention + 1). A separate pruning task deletes data in the half-open range [pruner_hi, reader_lo).
```
[998] [999] [1000] [1001] [1002] [1003] [1004] [1005] [1006]
 🗑️    🗑️     ✓      ✓      ✓      ✓      ✗      ✗      ✗
              ^                    ^
      reader_lo = 1000      checkpoint_hi_inclusive = 1003

🗑️ = Pruned (deleted)
✓ = Committed
✗ = Not Committed
```
Current watermarks:
- checkpoint_hi_inclusive = 1003:
  - All data from start to 1003 is complete (no gaps).
  - Cannot advance past 1003 because 1004 is not committed yet (gap).
- reader_lo = 1000:
  - Lowest checkpoint guaranteed to be available.
  - Calculated as reader_lo = checkpoint_hi_inclusive - retention + 1 = 1003 - 4 + 1 = 1000.
- pruner_hi = 1000:
  - Exclusive upper bound of pruned data: everything below 1000 has been deleted.
  - Checkpoints 998 and 999 were deleted to save space.
Clear safe zones:
[998] [999] [1000] [1001] [1002] [1003] [1004] [1005] [1006]
🗑️ 🗑️ ✓ ✓ ✓ ✓ ✗ ✗ ✓
[--PRUNED--][--- Safe Reading Zone ---] [--- Processing ---]
How watermarks progress over time
Step 1: Checkpoint 1004 completes.
```
[999] [1000] [1001] [1002] [1003] [1004] [1005] [1006] [1007]
 🗑️     ✓      ✓      ✓      ✓      ✓      ✗      ✓      ✗
        ^                           ^
 reader_lo = 1000     checkpoint_hi_inclusive = 1004 (advanced by 1)
 pruner_hi = 1000
```
With checkpoint 1004 now committed, checkpoint_hi_inclusive can advance from 1003 to 1004 because there are no gaps up to 1004. Note that reader_lo and pruner_hi haven't changed yet.
Step 2: Reader watermark updates periodically.
```
[999] [1000] [1001] [1002] [1003] [1004] [1005] [1006] [1007]
 🗑️     ✓      ✓      ✓      ✓      ✓      ✗      ✓      ✗
               ^                    ^
 reader_lo = 1001     checkpoint_hi_inclusive = 1004
 (1004 - 4 + 1 = 1001)
 pruner_hi = 1000 (unchanged: the pruner hasn't run yet)
```
A separate reader watermark update task, which runs periodically at a configurable interval, advances reader_lo to 1001 (calculated as 1004 - 4 + 1 = 1001) based on the retention policy. However, the pruner hasn't run yet, so pruner_hi remains at 1000.
Step 3: Pruner runs after safety delay.
```
[999] [1000] [1001] [1002] [1003] [1004] [1005] [1006] [1007]
 🗑️    🗑️     ✓      ✓      ✓      ✓      ✗      ✓      ✗
               ^                    ^
 reader_lo = 1001     checkpoint_hi_inclusive = 1004
 pruner_hi = 1001
```
Because pruner_hi (1000) < reader_lo (1001), the pruner detects that some checkpoints have fallen outside the retention window. It deletes everything in the half-open range [pruner_hi, reader_lo), which here is just checkpoint 1000, and advances pruner_hi to reader_lo (1001).
Checkpoints older than reader_lo might still be temporarily available because of:
- The pruner's intentional safety delay, which protects in-flight queries.
- The pruner not yet having completed cleanup.
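Putting the three steps together, the bookkeeping can be replayed in a few lines. This models only the watermark arithmetic; in the framework, the committer, reader watermark task, and pruner run as separate tasks on their own schedules.

```rust
fn main() {
    let retention: u64 = 4;
    let (mut hi, mut reader_lo, mut pruner_hi) = (1003u64, 1000u64, 1000u64);

    // Starting state from Scenario 2.
    assert_eq!((hi, reader_lo, pruner_hi), (1003, 1000, 1000));

    // Step 1: checkpoint 1004 commits with no gap below it, so the
    // committer watermark advances.
    hi = 1004;

    // Step 2: the periodic reader watermark task recomputes reader_lo
    // from the retention policy: 1004 - 4 + 1 = 1001.
    reader_lo = hi + 1 - retention;

    // Step 3: after its safety delay, the pruner sees
    // pruner_hi (1000) < reader_lo (1001), deletes the half-open range
    // [1000, 1001), which is checkpoint 1000, and catches up.
    if pruner_hi < reader_lo {
        // ... delete checkpoints in pruner_hi..reader_lo here ...
        pruner_hi = reader_lo;
    }

    assert_eq!((hi, reader_lo, pruner_hi), (1004, 1001, 1001));
}
```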
Sequential pipeline architecture
Sequential pipelines provide a more straightforward yet powerful architecture for indexing that prioritizes ordered processing. While they sacrifice some throughput compared to concurrent pipelines, they offer stronger guarantees and are often easier to reason about.
Architecture overview
The sequential pipeline consists of only two main components, making it significantly simpler than the concurrent pipeline's six-component architecture.

The ingestion layer (Regulator + Broadcaster) and the Processor are identical to their concurrent counterparts: the same backpressure mechanisms, the same FANOUT parallel processing, and the same processor() implementations.
The key difference is the dramatically simplified pipeline core: a single Committer component handles ordering, batching, and database commits. Concurrent pipelines, by contrast, have five separate components in addition to the Processor: Collector, Committer, CommitterWatermark, ReaderWatermark, and Pruner.
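To illustrate what handling ordering, batching, and database commits looks like, here is a conceptual sketch of a sequential committer. It is not the framework's actual Committer; the types and the commit function are stand-ins. Processed checkpoints can arrive out of order from the parallel processors, but the committer buffers them and writes strictly in checkpoint order.

```rust
use std::collections::BTreeMap;

/// Stand-in for one checkpoint's processed output.
type Rows = Vec<String>;

/// Simplified sequential committer: buffers out-of-order arrivals and
/// commits the longest contiguous run, in order, as each gap closes.
struct SequentialCommitter {
    next: u64,                    // next checkpoint to commit
    pending: BTreeMap<u64, Rows>, // arrivals waiting on earlier checkpoints
}

impl SequentialCommitter {
    fn receive(&mut self, checkpoint: u64, rows: Rows) {
        self.pending.insert(checkpoint, rows);
        // Commit each checkpoint completely before the next one,
        // preserving the sequential pipeline's ordering guarantee.
        while let Some(rows) = self.pending.remove(&self.next) {
            commit(self.next, &rows);
            self.next += 1;
        }
    }
}

/// Stand-in for the database write (a transaction per checkpoint, or
/// per batch of checkpoints, in a real pipeline).
fn commit(checkpoint: u64, rows: &[String]) {
    println!("committed checkpoint {checkpoint} ({} rows)", rows.len());
}

fn main() {
    let mut c = SequentialCommitter { next: 1000, pending: BTreeMap::new() };
    c.receive(1001, vec!["b".into()]); // early: buffered, nothing committed
    c.receive(1000, vec!["a".into()]); // closes the gap: commits 1000, then 1001
}
```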
Sequential pipeline components
There are two main components to sequential pipelines.