[MB-70263] FetchMinSeqnos should fail on partial KV node responses to prevent stale seqno connection cache and disk snapshot accumulation – Couchbase Support

Product: Couchbase Server
Component: secondary-index
Issue Link: MB-70263
Affects Versions: 7.6, 8.0
Fix Versions: 7.6.12, 8.0.2

Summary

On clusters using MOI (Memory-Optimized Index) storage, the indexer's internal FetchMinSeqnos() function can silently return incomplete data when one or more KV nodes become unreachable over a stale cached connection. Vbuckets owned exclusively by the unreachable node(s) are returned with a sequence number of 0, which causes the snapshot cleanup routine (cleanupOldSnapshotFiles()) to treat the corresponding snapshots as still needed.

As a result, on-disk MOI snapshots accumulate on the index node up to the configured snapshot limit (indexer.recovery.max_disksnaps). Disk usage climbs steadily and then drops sharply when the indexer is restarted, producing a characteristic sawtooth pattern in disk usage metrics.

The root cause is that FetchMinSeqnos() did not treat a per-node failure as a hard error; it merged whatever results it had and returned success. With the fix, any per-node failure propagates as an error. This triggers delDBSbucket() to tear down the stale kvfeeds connection cache, so fresh connections are established on the next cycle and normal snapshot cleanup resumes.

Symptoms

Not all of the following symptoms need to be present for this issue to apply.

- Disk usage on the index node rises steadily and then drops suddenly in a repeating sawtooth pattern, without any application-side changes to data volumes.
- The index_raw_data_size and index_total_data_size metrics remain relatively flat while actual on-disk usage climbs: the growth is in rollback snapshot files, not in the index data itself.
- The indexer logs stop showing regular Removing disk snapshot messages during the accumulation phase.
- Restarting the indexer service temporarily relieves the disk pressure (disk usage drops), but the pattern resumes if the underlying connectivity issue persists.
- Approaching-disk-full alerts fire on the index node: Approaching full disk warning. Usage of disk "..." on node "..." is around 92%.

Triggers

- MOI (Memory-Optimized Index) storage mode must be in use on the affected indexes.
- The indexer node must have a stale TCP connection to one or more KV nodes. This is most commonly caused by a transient network interruption, a firewall or routing change, or the indexer and some KV nodes residing in different network subnets with inconsistent connectivity.
- The affected KV node(s) must be the sole owners (active + replica) of at least some vbuckets; those vbuckets will receive a merged seqno of 0, which is sufficient to block cleanup.

Verification

On the index node, run the following grep against indexer.log to confirm snapshot cleanup has stalled:

grep -ci "Removing disk snapshot" */indexer.log

A count of 0, or a value that does not increase over several minutes while mutations are active, confirms cleanup has stopped.
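
One way to check the "does not increase over several minutes" condition is to sample the count twice and compare. The following is a minimal sketch rather than a supported tool: the function names count_removals and cleanup_stalled and the 300-second default interval are illustrative assumptions, and it expects the path to a single indexer.log file.

```shell
# Count "Removing disk snapshot" lines in one log file (prints 0 if none).
count_removals() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that status.
    grep -ci "Removing disk snapshot" "$1" || true
}

# Sample the count twice, <interval> seconds apart (default 300).
# Exits 0 when the count did not grow, i.e. cleanup appears stalled.
cleanup_stalled() {
    log=$1
    interval=${2:-300}
    before=$(count_removals "$log")
    sleep "$interval"
    after=$(count_removals "$log")
    echo "before=$before after=$after"
    [ "$after" -le "$before" ]
}

# Example (hypothetical path):
#   cleanup_stalled /path/to/indexer.log 300 && echo "cleanup appears stalled"
```

A stalled result is only meaningful while mutations are active, and a log rotation between the two samples can lower the count and produce a false positive.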

Once the snapshot count reaches the hard cap (indexer.recovery.max_disksnaps + 1, typically observed as 5), the indexer removes one snapshot to stay within bounds but cannot clean further due to the stale min-seqno. This produces a repeating 5 → 4 oscillation visible in indexer.log. Look for Removing disk snapshot immediately followed by Skipped disk snapshot cleanup for the same index instance:

<TIMESTAMP> [Info] MemDBSlice Slice Id 0, IndexInstId <INDEX_INST_ID>, PartitionId <N> Removing disk snapshot .... Num snapshots 5.
<TIMESTAMP> [Info] MemDBSlice Slice Id 0, IndexInstId <INDEX_INST_ID>, PartitionId <N> Skipped disk snapshot cleanup .... Num snapshots 4.
<TIMESTAMP> [Info] MemDBSlice Slice Id 0, Threads 3, IndexInstId <INDEX_INST_ID>, PartitionId <N> created ondisk snapshot .... Took 1m32.927850892s

This cycle repeating confirms the issue is present. To grep for the oscillation pattern:

grep -E "Removing disk snapshot|Skipped disk snapshot cleanup" */indexer.log
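
To see which index instances are oscillating, the matches can be tallied per IndexInstId. This is a rough sketch that assumes the field layout shown in the sample log lines above (the token IndexInstId followed by the ID and a comma); summarize_oscillation is a hypothetical helper name, not part of any Couchbase tooling.

```shell
# Tally "Removing" and "Skipped" messages per index instance.
# Arguments: one or more indexer.log files.
summarize_oscillation() {
    grep -hE "Removing disk snapshot|Skipped disk snapshot cleanup" "$@" |
    awk '
        {
            # The instance ID is the field after the literal token "IndexInstId".
            id = "unknown"
            for (i = 1; i < NF; i++)
                if ($i == "IndexInstId") { id = $(i + 1); sub(/,$/, "", id) }
            if ($0 ~ /Removing disk snapshot/) removed[id]++
            else skipped[id]++
            seen[id] = 1
        }
        END {
            for (id in seen)
                printf "IndexInstId=%s removed=%d skipped=%d\n", id, removed[id], skipped[id]
        }'
}

# Example:
#   summarize_oscillation */indexer.log
```

An instance whose removed and skipped counts rise together over time matches the oscillation described above.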

Workarounds

Temporary mitigation - kill the indexer process: The Index Service cannot be restarted independently via systemctl; to restart it, kill the indexer process directly. The Cluster Manager will automatically respawn it. A standard SIGTERM should be tried first to allow for a cleaner shutdown:
```
kill $(pgrep -f '/indexer ')
```
This re-establishes KV connections and clears the stale snapshot backlog, but disk usage will climb again if the underlying network issue persists. Note that this incurs index downtime, and the Index Service must complete Indexer Warmup after restarting.
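
After sending the signal, it can be useful to confirm that the Cluster Manager has actually respawned the process. The sketch below polls for an indexer with a new PID; indexer_pid and wait_for_respawn are hypothetical helper names, the 60-second default timeout is an arbitrary assumption, and the pgrep pattern is the one used in the kill command above.

```shell
# PID of the running indexer, using the same pattern as the kill command.
indexer_pid() {
    pgrep -f '/indexer ' | head -n 1
}

# Poll until an indexer with a PID different from $1 appears, or give up
# after roughly $2 seconds (default 60). Returns 0 on respawn, 1 on timeout.
wait_for_respawn() {
    old_pid=$1
    timeout=${2:-60}
    waited=0
    while [ "$waited" -le "$timeout" ]; do
        new_pid=$(indexer_pid) || true
        if [ -n "$new_pid" ] && [ "$new_pid" != "$old_pid" ]; then
            echo "indexer respawned with PID $new_pid"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "indexer did not respawn within ${timeout}s" >&2
    return 1
}

# Example:
#   old=$(indexer_pid); kill "$old"; wait_for_respawn "$old" 120
```

A new PID only shows that the process is running again; the index remains unavailable until warmup completes.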
Temporary mitigation - restart Couchbase Server on the affected node: Restarting the full couchbase-server service achieves the same effect but will also incur downtime. Before doing so, take one of the following precautions to avoid an unplanned failover:
- Disable auto-failover in the Admin UI (Settings → Auto-Failover) before restarting, then re-enable it afterwards; or
- Manually fail over the node first and rebalance it back in after the restart; or
- Remove the node from the cluster, restart, and re-add it.
Network mitigation: Ensure all KV nodes are reachable from the indexer node over a stable, low-latency path. Moving KV nodes and the indexer into the same network subnet reduces the probability of partial connectivity failures.
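
A basic reachability check from the index node can help identify which KV nodes are affected. The sketch below is illustrative only: it assumes the default non-TLS Data Service port 11210 (override KV_PORT if TLS or custom ports are in use), assumes a netcat build that supports the -z and -w options, and uses hypothetical helper names probe and check_kv_nodes. The host names in the example are placeholders.

```shell
# Data Service port to probe; 11210 is the default for non-TLS connections.
KV_PORT=${KV_PORT:-11210}

# Attempt a TCP connection to one host with a 3-second timeout.
probe() {
    nc -z -w 3 "$1" "$KV_PORT" >/dev/null 2>&1
}

# Probe every host given as an argument and report the result.
# Returns 1 if at least one host could not be reached.
check_kv_nodes() {
    failed=0
    for host in "$@"; do
        if probe "$host"; then
            echo "OK $host"
        else
            echo "UNREACHABLE $host"
            failed=1
        fi
    done
    return "$failed"
}

# Example (placeholder host names):
#   check_kv_nodes kv1.example.com kv2.example.com
```

A successful probe opens a fresh connection, so it rules out ongoing unreachability but does not prove that the indexer's existing cached connections are healthy.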
