[MB-70453] Manifest calls retry and Discard connection on i/o timeout – Couchbase Support

Product: Couchbase Server
Component: Query-Service
Issue Link: MB-70453
Affects Versions: 7.6, 8.0
Fix Versions: 7.6.11, 8.0.2

Summary

When the Query service retrieves the collections manifest for a bucket, it previously always targeted the first (0th) node in the cluster map via a memcached connection on port 11210. If that node became temporarily unreachable and the connection attempt returned an i/o timeout, the errored connection was not discarded from the pool, so all subsequent manifest retrieval attempts continued to hit the same failing node. This caused the Query service to become stuck, persistently logging: "Unable to retrieve collections info for bucket <BUCKET>: Unable to get connection to retrieve collections manifest: dial tcp <NODE>:11210: i/o timeout. No collections access to bucket <BUCKET>." The state persisted through Couchbase service restarts (which reuse the same connection pools) but was resolved by killing and restarting only the cbq-engine process.

The fix introduces three improvements to GetCollectionsManifest(): (1) connections that error with an i/o timeout are now properly discarded from the pool; (2) a random node is selected from the cluster map rather than always the first node; (3) a retry loop with exponential backoff attempts the request on a fresh connection if a connection error is detected.
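The second and third changes can be illustrated with a small shell sketch. This is not the Query service code, which is internal to cbq-engine. The names fetch_with_retry, try_node and BACKOFF_BASE are hypothetical and exist only for this illustration. The first change is only approximated here: every attempt starts from scratch, so nothing from a failed attempt is reused.

```shell
# Illustration only. try_node is a hypothetical stand-in that the caller defines:
# it receives one node address and returns 0 on success, non-zero on failure.

# fetch_with_retry MAX_ATTEMPTS NODE [NODE...]
# Prints the node that answered and returns 0, or returns 1 if every attempt fails.
fetch_with_retry() {
    max_attempts=$1
    shift
    delay=${BACKOFF_BASE:-1}    # whole seconds to wait after the first failure (hypothetical knob)
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Choose a random candidate rather than always the first one.
        seed=$(( $(date +%s) + attempt + $$ ))
        idx=$(awk -v n="$#" -v seed="$seed" 'BEGIN { srand(seed); print int(rand() * n) + 1 }')
        eval "node=\${$idx}"
        # Each attempt is independent, so a failed connection is never reused.
        if try_node "$node"; then
            echo "$node"
            return 0
        fi
        # Exponential backoff: double the wait after every failed attempt.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
            delay=$((delay * 2))
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

With try_node defined as a real connectivity probe, a call such as fetch_with_retry 5 <NODE1> <NODE2> <NODE3> would try up to five randomly chosen nodes, waiting 1, 2, 4 and 8 seconds between failures.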

Symptoms

Not all of the following symptoms need to be present to confirm this issue.

Queries fail with a Scan failed error (error code 16062) after approximately 20 seconds.
One or more Query nodes become stuck and cannot service requests against a bucket, even though other nodes in the cluster are healthy.
Restarting the full Couchbase service on the affected node does not resolve the problem, but killing and restarting only the cbq-engine process does.
Query request plans show a Failed to get collection ID for scan error, e.g.:

    "errors": [
        {
            "_level": "exception",
            "caller": "seq_scan_index:414",
            "cause": {
                "_level": "exception",
                "caller": "memcached_scan:73",
                "cause": "dial tcp <IP>:11210: i/o timeout",
                "code": 16058,
                "key": "datastore.seq_scan.cid",
                "message": "Failed to get collection ID for scan"
            },
            "code": 16062,
            "key": "datastore.seq_scan.failed",
            "message": "Scan failed"
        }
    ]

Triggers

The Query service starts up (or restarts) and initialises its memcached connection pool while one KV node is temporarily slow or unreachable on port 11210, causing the initial manifest connection to time out.
The cluster map lists that unreachable node as its first (0th) entry, so all subsequent manifest requests target the same failing node.
The errored connection is returned to the pool rather than discarded, so retries continue to use it.
The issue is more likely to manifest on nodes that use the unencrypted memcached port (11210) when connecting to a peer that is momentarily unavailable during a rolling or simultaneous multi-node restart.

Verification

Examine ns_server.query.log on the affected Query node for repeated occurrences of the message below. A high count (hundreds or thousands) indicates that the Query service is stuck:

2026-02-02T08:51:55.783-06:00 [INFO] Unable to retrieve collections info for bucket <BUCKET>: Unable to get connection to retrieve collections manifest: dial tcp <NODE>:11210: i/o timeout. No collections access to bucket <BUCKET>.
2026-02-02T08:51:55.791-06:00 [INFO] Unable to retrieve collections info for bucket <BUCKET>: Unable to get connection to retrieve collections manifest: dial tcp <NODE>:11210: i/o timeout. No collections access to bucket <BUCKET>.
2026-02-02T08:52:15.795-06:00 [INFO] Unable to retrieve collections info for bucket <BUCKET>: Unable to get connection to retrieve collections manifest: dial tcp <NODE>:11210: i/o timeout. No collections access to bucket <BUCKET>.

Count the occurrences in each Query node's log to identify which node is affected (run this from the directory that contains the unzipped cbcollect_info bundles):

grep -ciF "i/o timeout. No collections access" */ns_server.query.log

Example output indicating the issue is present on one Query node:

cbcollect_info_ns_1@<NODE1>/ns_server.query.log:0
cbcollect_info_ns_1@<NODE2>/ns_server.query.log:847
cbcollect_info_ns_1@<NODE3>/ns_server.query.log:0

A non-zero count on any node (e.g. 847 on <NODE2> above) confirms that node is stuck in a retry loop against an unreachable KV node. A count of zero on the other nodes confirms the problem is isolated to the affected Query node rather than being cluster-wide.
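The per-file count shows which Query node is affected, but not which KV endpoint the timeouts point at. The following is a minimal sketch for extracting that, assuming the message format shown above and standard grep, sed, sort, uniq and awk. timeout_targets is a hypothetical helper name, not a Couchbase tool.

```shell
# timeout_targets FILE [FILE...]
# Prints "<count> <host:port>" for each endpoint named in the timeout messages,
# most frequent first.
timeout_targets() {
    grep -hF "i/o timeout. No collections access" "$@" |
        sed -n 's|.*dial tcp \([^ ]*\): i/o timeout.*|\1|p' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}
```

It can be run against the same files as the grep above, for example timeout_targets */ns_server.query.log. A single dominant endpoint in the output would be consistent with every manifest request targeting the same KV node.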

Also verify that the connection pool for the affected bucket shows zero open connections on the stuck node:

grep "bucket <BUCKET> node <NODE>" */ns_server.query.log | grep "open 0"

Example output:

cbcollect_info_ns_1@<NODE2>/ns_server.query.log:2026-02-02T08:45:12.001-06:00 [INFO] bucket <BUCKET> node <NODE>:11210 open 0 free 0 waiters 1

open 0 confirms that the pool holds no usable connections to that KV node. free 0 means none are available for reuse. waiters 1 (or higher) shows that Query threads are blocked waiting for a connection that will never succeed, which explains why queries hang until the 20-second timeout is reached.
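When the bucket and node names are not known in advance, the same check can be applied to every pool line at once. The following is a minimal sketch, assuming pool lines carry the "open N free N waiters N" fields in the format of the example output above. stuck_pools is a hypothetical helper name.

```shell
# stuck_pools FILE [FILE...]
# Prints every log line whose pool reports zero open connections while at
# least one waiter is blocked, prefixed with the file it came from.
stuck_pools() {
    awk '{
        opened = ""; waiting = ""
        for (i = 1; i < NF; i++) {
            if ($i == "open")    opened  = $(i + 1)
            if ($i == "waiters") waiting = $(i + 1)
        }
        if (opened != "" && opened + 0 == 0 && waiting + 0 > 0)
            print FILENAME ": " $0
    }' "$@"
}
```

For example, stuck_pools */ns_server.query.log lists the affected pools across all collected nodes. Lines with open connections or no waiters are skipped.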

Workarounds

Upgrade to a version of Couchbase Server with the fix: 7.6.11 or 8.0.2.
If an immediate upgrade is not possible, restart only the cbq-engine process (rather than the full Couchbase Server service) on the affected node. Killing and restarting cbq-engine forces the connection pool to be rebuilt, allowing the Query service to establish fresh connections to an available node:

# Find the PID of the cbq-engine process (the [c] stops grep from matching itself)
ps aux | grep '[c]bq-engine'
# Terminate the cbq-engine process
kill -9 <PID>
# The Cluster Manager will restart the cbq-engine process automatically

While cbq-engine restarts are generally safe, any currently running queries on that specific node will be terminated and will need to be retried by the application. For this reason, this step should be done during a period of low traffic or downtime.
Ensure that the non-secure memcached port (11210) is not blocked by firewall rules between cluster nodes, even when node-to-node encryption is set to all. Services still use non-secure ports for loopback (intra-node) communication, and blocking these ports can trigger the i/o timeout condition.
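Reachability of port 11210 can be spot-checked from the affected Query node. The following is a minimal sketch, assuming a netcat (nc) build that supports the -z and -w options. check_kv_port is a hypothetical helper name, and a successful TCP connect only shows that the port is open, not that the Data service is healthy.

```shell
# check_kv_port HOST [PORT] [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection can be opened, 1 if not, 2 if nc is missing.
check_kv_port() {
    host=$1
    port=${2:-11210}
    secs=${3:-5}
    if ! command -v nc >/dev/null 2>&1; then
        echo "nc not found; cannot test $host:$port" >&2
        return 2
    fi
    if nc -z -w "$secs" "$host" "$port" >/dev/null 2>&1; then
        echo "$host:$port reachable"
    else
        echo "$host:$port NOT reachable"
        return 1
    fi
}
```

Running check_kv_port <NODE> for each KV node, and for 127.0.0.1 to cover the loopback case, gives a quick view of whether a firewall rule is in the way.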
