Force rebuild out-synced Shards with broken Merkle Tree


Right now during DBServer restart (so also during the upgrade) we can face an issue with out-synced shards.

This is known problem and occurs when Leader and Follower disagree on the number of documents in a shard, they will not be able to get in sync, and retry forever.

This feature is designed to solve this problem by forcing rebuild of out-synced shards with broken Merkle Tree, by using internal DBServer API.

This fix is addressed to the ArangoDB versions lower then: 3.10.6 and 3.9.11.

How to use

This feature is disabled by default.

  • To enable it use --deployment.feature.force-rebuild-out-synced-shards arg, which needs be passed to the operator.
  • Optionally we can override default timeouts by attaching following args to the operator:
    • --timeout.shard-rebuild {duration} - timeout after which particular out-synced shard is considered as failed and rebuild is triggered (default 60m0s)
    • --timeout.shard-rebuild-retry {duration} - timeout after which rebuild shards retry flow is triggered (default 60m0s)

Here is the example helm command which enables this feature and sets shard-rebuild timeout to 10 minutes:

helm upgrade --install kube-arangodb \$VER/kube-arangodb-$VER.tgz \
  --set "operator.args={--deployment.feature.force-rebuild-out-synced-shards,--timeout.shard-rebuild=10m}"