[MB-70699] reallocateStoredValue loses locked_cas – Couchbase Support

Product: Couchbase Server
Component: Data-Service
Issue Link: MB-70699
Affects Versions: 7.2.4+, 7.6, 8.0
Fix Versions: 7.6.11, 8.0.1

Summary

An issue in the defragmenter's reallocateStoredValue function causes the locked_cas to be lost. This results in failures for GET_AND_LOCK and subsequent CAS operations (like REPLACE) because the server forgets the original CAS value that should be restored after the lock expires.

Symptoms

Not all of the following symptoms need to be observed for this issue to be present.

GET_AND_LOCK operations succeed but subsequent REPLACE operations using the returned CAS are rejected with LOCKED (status 0x0009), even though the lock has expired.
The REPLACE response carries a CAS of 0x0, indicating the server has lost the locked_cas for that document.
Repeated SDK retries all fail with LOCKED for the same document until the lock timeout is exhausted.
Documents may appear "stuck" or unable to be updated by the original owner after a lock period.
The lock_errors statistic reported by cbstats is non-zero on buckets where ep_defragmenter_sv_num_moved is also non-zero.

Triggers

The document must be currently locked (via GET_AND_LOCK).
The defragmenter must run and trigger a reallocation of the StoredValue (reallocateStoredValue) for that specific document while the lock is active.
The issue is more likely when the defragmenter is running aggressively, i.e. when ep_defragmenter_sleep_time is at or near its minimum value (0.6s), driven by high jemalloc heap fragmentation.
Affects Couchbase Server versions from 7.2.4 up to (but not including) the fix versions.

Verification

Verification requires a packet capture collected on the affected KV node and stats.log from a cbcollect_info log collection.

Step 1

Collect a packet capture on the affected KV node

Capture traffic on the KV port (11210) during a window where LOCKED errors are being observed:

tcpdump -i <interface> -s 0 port 11210 -w <pcap_file>.pcap

Step 2

Confirm the LOCKED error pattern in the pcap

Count LOCKED (status 0x0009) responses grouped by opcode:

tshark -r <pcap_file>.pcap -Y 'couchbase.status == 0x0009' -T fields -e couchbase.opcode | sort | uniq -c | sort -rn

Expected output when the issue is present, showing a high count of REPLACE (0x03) operations rejected as LOCKED:

   3278 0x03    <- REPLACE rejected as LOCKED
     97 0x94    <- GET_AND_LOCK rejected as LOCKED

Step 3

Confirm the server returned CAS 0x0 on the REPLACE response

Pick a CAS value returned in a GET_AND_LOCK response and trace the subsequent REPLACE to confirm the server lost the locked_cas:

tshark -r <pcap_file>.pcap \
  -Y 'couchbase.cas == <CAS_VALUE> or (couchbase.opcode == 0x03 and couchbase.status == 0x0009 and frame.time_relative >= <T_START> and frame.time_relative <= <T_END>)' \
  -T fields -e frame.time_relative -e ip.src -e ip.dst -e couchbase.opcode -e couchbase.status -e couchbase.cas

Expected output when the issue is present:

38.492908  <NODE_IP>    <CLIENT_IP>  0x94  0x0000  <CAS_VALUE>             <- GET_AND_LOCK response (SUCCESS)
38.493996  <CLIENT_IP>  <NODE_IP>    0x03          <CAS_VALUE>             <- REPLACE request (same CAS, 1.1ms later)
38.500812  <NODE_IP>    <CLIENT_IP>  0x03  0x0009  0x0000000000000000      <- REPLACE response (LOCKED, CAS lost)
38.505969  <CLIENT_IP>  <NODE_IP>    0x03          <CAS_VALUE>             <- SDK retry #1
... (further retries, all rejected as LOCKED)

The issue is confirmed by the third line: the REPLACE response carries status 0x0009 (LOCKED) and CAS 0x0000000000000000. The server accepted the lock but lost the locked_cas between the GET_AND_LOCK and REPLACE, causing every subsequent retry to also fail.
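
When the capture is large, the same check can be scripted. The following is a minimal sketch, not a supported tool: it assumes the six tab-separated columns produced by the tshark command above (time, source, destination, opcode, status, CAS), with an empty status column on requests. The function name and the sample addresses are hypothetical. Matching rows are only candidates; each one should still be traced back to a GET_AND_LOCK response as described above.

```python
from typing import Iterable, List, Tuple

REPLACE_OPCODE = 0x03   # REPLACE, as listed in Step 2
LOCKED_STATUS = 0x0009  # LOCKED


def lost_cas_responses(lines: Iterable[str]) -> List[Tuple[float, str, str]]:
    """Return (time, src, dst) for REPLACE responses that are LOCKED with CAS 0."""
    hits = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Skip malformed rows and requests (requests have no status value).
        if len(cols) != 6 or not all(cols[3:6]):
            continue
        # int(x, 0) accepts both "0x.." and plain decimal renderings.
        opcode, status, cas = (int(c, 0) for c in cols[3:6])
        if opcode == REPLACE_OPCODE and status == LOCKED_STATUS and cas == 0:
            hits.append((float(cols[0]), cols[1], cols[2]))
    return hits


# Hypothetical rows shaped like the expected output above.
SAMPLE_ROWS = [
    "38.492908\t192.0.2.10\t192.0.2.20\t0x94\t0x0000\t0x00000000000004d2",
    "38.493996\t192.0.2.20\t192.0.2.10\t0x03\t\t0x00000000000004d2",
    "38.500812\t192.0.2.10\t192.0.2.20\t0x03\t0x0009\t0x0000000000000000",
]

if __name__ == "__main__":
    for hit in lost_cas_responses(SAMPLE_ROWS):
        print(hit)
```

To run it against a real capture, pass the lines of the tshark output (for example, an open file or sys.stdin) to lost_cas_responses instead of SAMPLE_ROWS.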

Step 4

Correlate defragmenter activity with lock errors via cbstats

From stats.log in the cbcollect_info log collection, extract ep_defragmenter_sv_num_moved and lock_errors per bucket:

paste \
    <(awk '/^\*{20,}/{getline; bucket=$1} /ep_defragmenter_sv_num_moved/{printf "%-25s %s\n", bucket, $2}' stats.log) \
    <(awk '/^\*{20,}/{getline; bucket=$1} /^[[:space:]]*lock_errors:/{print $2}' stats.log)

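Expected output when the issue is present (the header row is added here for readability; the command itself prints only the data rows). Buckets with defragmenter activity (sv_num_moved > 0) generally show lock errors, while buckets with no defragmenter activity show none: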

BUCKET       sv_num_moved  lock_errors
<BUCKET_1>   1905765865    6420
<BUCKET_2>   5713777       2808
<BUCKET_3>   2910680       117
<BUCKET_4>   734826        0          <- sv_num_moved > 0 but lock_errors = 0 (low move rate)
<BUCKET_5>   0             0
<BUCKET_6>   0             0
<BUCKET_7>   0             0

The issue is strongly indicated when:

All buckets with ep_defragmenter_sv_num_moved = 0 show lock_errors = 0
Buckets with ep_defragmenter_sv_num_moved > 0 and active locking workloads show lock_errors > 0

Also check ep_defragmenter_sleep_time. A value at or near 0.6 (the minimum) indicates the defragmenter is running at maximum aggressiveness due to high heap fragmentation, which increases the probability of the issue occurring.
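
The two criteria and the sleep-time check can also be evaluated programmatically. This is a rough sketch under the same layout assumptions as the awk commands above: a line of 20 or more asterisks is followed by a line whose first field is the bucket name, and each statistic appears as "name: value". The function names and the sample bucket names are hypothetical.

```python
import re
from typing import Dict

SEPARATOR = re.compile(r"^\*{20,}")
WANTED = ("ep_defragmenter_sv_num_moved", "lock_errors",
          "ep_defragmenter_sleep_time")


def parse_stats(text: str) -> Dict[str, Dict[str, float]]:
    """Collect the wanted statistics per bucket from stats.log text."""
    stats: Dict[str, Dict[str, float]] = {}
    bucket = None
    take_name = False
    for line in text.splitlines():
        if SEPARATOR.match(line):
            take_name = True          # the next line names the bucket
            continue
        if take_name:
            take_name = False
            fields = line.split()
            bucket = fields[0] if fields else None
            continue
        if bucket is None or ":" not in line:
            continue
        name, _, value = line.strip().partition(":")
        if name in WANTED:
            stats.setdefault(bucket, {})[name] = float(value)
    return stats


def correlation_holds(stats: Dict[str, Dict[str, float]]) -> bool:
    """True when no idle bucket has lock errors and some moved bucket does."""
    idle_clean = all(s.get("lock_errors", 0) == 0 for s in stats.values()
                     if s.get("ep_defragmenter_sv_num_moved", 0) == 0)
    moved_hit = any(s.get("lock_errors", 0) > 0 for s in stats.values()
                    if s.get("ep_defragmenter_sv_num_moved", 0) > 0)
    return idle_clean and moved_hit


# Hypothetical excerpt in the assumed layout.
SAMPLE = """\
******************************************************************************
bucket_a

 ep_defragmenter_sleep_time:        0.6
 ep_defragmenter_sv_num_moved:      5713777
 lock_errors:                       2808
******************************************************************************
bucket_b

 ep_defragmenter_sleep_time:        10
 ep_defragmenter_sv_num_moved:      0
 lock_errors:                       0
"""

if __name__ == "__main__":
    parsed = parse_stats(SAMPLE)
    print(parsed, correlation_holds(parsed))
```

A True result only mirrors the two conditions listed above; it does not replace the packet-capture confirmation in Step 3.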

Workarounds

Upgrade to a version of Couchbase Server containing the fix: 7.6.11, 8.0.1, or 8.1.0 and above.
Implement application-level retries that fetch the latest CAS if a mismatch occurs after a lock expires.
Not recommended: the issue may be prevented by disabling the defragmenter; however, this results in high memory fragmentation, which could cause further issues. This step should only be performed after consulting Couchbase Technical Support.
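
As an illustration of the application-level retry workaround, the sketch below waits while the server reports LOCKED and re-reads the document when the CAS is reported as stale. It is deliberately SDK-agnostic: LockedError, CasMismatchError, the get/replace callables and FakeKV are hypothetical stand-ins, not Couchbase SDK APIs, and would need to be mapped onto the error types and operations of the SDK in use.

```python
import time
from typing import Any, Callable, Tuple


class LockedError(Exception):
    """Hypothetical stand-in for the SDK's 'document locked' error."""


class CasMismatchError(Exception):
    """Hypothetical stand-in for the SDK's 'CAS mismatch' error."""


def replace_with_refetch(
    get: Callable[[str], Tuple[Any, int]],
    replace: Callable[[str, Any, int], int],
    key: str,
    value: Any,
    cas: int,
    mutate: Callable[[Any], Any],
    max_attempts: int = 5,
    base_delay_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Apply mutate() and replace, waiting out LOCKED and refreshing a stale CAS."""
    for attempt in range(max_attempts):
        try:
            return replace(key, mutate(value), cas)
        except LockedError:
            # The lock is still in force: back off exponentially, then retry.
            sleep(base_delay_s * (2 ** attempt))
        except CasMismatchError:
            # Our CAS is stale: re-read value and CAS so the change is
            # re-applied to the current document rather than an old copy.
            value, cas = get(key)
    raise TimeoutError(f"could not replace {key!r} after {max_attempts} attempts")


class FakeKV:
    """In-memory stand-in: LOCKED on the first replace, then checks the CAS."""

    def __init__(self) -> None:
        self.value, self.cas, self.calls = 10, 7, 0

    def get(self, key: str) -> Tuple[Any, int]:
        return self.value, self.cas

    def replace(self, key: str, value: Any, cas: int) -> int:
        self.calls += 1
        if self.calls == 1:
            raise LockedError(key)
        if cas != self.cas:
            raise CasMismatchError(key)
        self.value, self.cas = value, self.cas + 1
        return self.cas


if __name__ == "__main__":
    kv = FakeKV()
    print(replace_with_refetch(kv.get, kv.replace, "doc", 10, 99,
                               lambda v: v + 1, sleep=lambda s: None))
```

Passing the change as a mutate function, rather than a precomputed value, means a re-read document is modified afresh, so a write made by another client in the meantime is not silently overwritten.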
