Live migration
This document describes the Bufstream live migration process for upgrading a Bufstream cluster to the latest architecture with zero downtime.
You control the migration by advancing through stages manually. Bufstream includes guardrails to prevent invalid cluster states.
Cluster health must be monitored continuously during the migration. While individual stages can be reverted or re-run, there is a point of no return after which full reversal is only possible through a backup and restore.
How it works
The migration moves your cluster to the new table schema without downtime by stepping the Bufstream metadata storage layer through a sequence of phases called read-write modes. Two intermediate dual-write modes keep both schemas in sync while data is gradually moved over, so the old tables remain available as a fallback until a cutover point is reached.
The cluster progresses through four modes:
- Mode 1: Read and write old tables (starting point)
- Mode 2: Read old tables, write to both
  - Key-value data is copied during this mode
- Mode 3: Read new tables, write to both
  - Message queues transition atomically during this mode
- Mode 4: Read and write new tables (migration complete)
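The transitions between these modes can be pictured as a small state machine. The sketch below is purely illustrative (it is not part of Bufstream); it encodes the forward path plus the rollback rules described in the migration steps: Mode 2 can revert to Mode 1, Mode 3 can revert to Mode 2, and Mode 4 is final.

```python
# Illustrative model of the four read-write modes (not Bufstream code).
ADVANCE = {1: 2, 2: 3, 3: 4}  # forward path through the migration
REVERT = {2: 1, 3: 2}         # Mode 4 is final; Mode 3 cannot reach Mode 1

def advance(mode: int) -> int:
    """Move the cluster one mode forward, if allowed."""
    if mode not in ADVANCE:
        raise ValueError(f"cannot advance from mode {mode}")
    return ADVANCE[mode]

def revert(mode: int) -> int:
    """Roll the cluster back one mode, where permitted."""
    if mode not in REVERT:
        raise ValueError(f"cannot revert from mode {mode}")
    return REVERT[mode]
```

Note that there is no path from Mode 3 or Mode 4 back to Mode 1, which is why the later steps are described as a point of no return.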
Before you begin
Verify the following before starting the migration:
Bufstream has been upgraded to a version ≥ 0.4.14 and < 0.5.0, and all brokers are confirmed to be running the same version.
Bufstream and Postgres have sufficient resources:
Connection capacity: Verify that the sum of the connections allocated to Bufstream brokers (`metadata.postgres.pool.maxConnections` times the number of brokers) is below Postgres's configured connection limit. For example, if you have eight brokers and Postgres's `max_connections` is 1000, then setting `metadata.postgres.pool.maxConnections` to 100 means Bufstream may use up to 8 × 100 = 800 connections, which is below the limit of 1000.
While the migration doesn't allocate beyond your configured limit, it does increase usage of existing connections. If connection pools are misconfigured, operations may time out and cause cascading failures. If you're close to the limit, increase Postgres's `max_connections` or reduce Bufstream's connection pool size before proceeding.
Postgres CPU and memory: The migration's dual-write modes increase demand on Postgres. Before the migration, ensure Postgres is operating at under 65% CPU utilization and 80% memory utilization on average.
Bufstream memory: If brokers frequently experience OOMKilled events or consistently use more than 80% of available memory, increase memory allocation before proceeding.
Auto-scaling is disabled: Running additional brokers doesn't negatively impact the migration as long as Postgres can support the additional connections. Nevertheless, keeping the broker count static during the migration is recommended because:
- Broker restarts and terminations, while tolerated, may slow migration progress.
- Temporary migration-related metric changes may trigger spurious scaling events.
Cluster is healthy: See Monitor cluster health for details.
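The connection-capacity arithmetic above is easy to script as a pre-flight check. A minimal sketch, using the example numbers from this section (eight brokers, 100 pooled connections each, a Postgres `max_connections` of 1000):

```python
def bufstream_connection_headroom(brokers: int,
                                  max_conns_per_broker: int,
                                  pg_max_connections: int) -> int:
    """Return how many Postgres connections remain below the limit.

    brokers * max_conns_per_broker mirrors
    metadata.postgres.pool.maxConnections times the broker count.
    """
    used = brokers * max_conns_per_broker
    return pg_max_connections - used

# Example from the text: 8 brokers x 100 connections = 800, limit 1000.
headroom = bufstream_connection_headroom(8, 100, 1000)
print(headroom)  # 200 connections of headroom
```

A negative result means your pool configuration can exceed Postgres's limit and should be corrected before starting the migration.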
Dry-run command
Bufstream provides a command that generates a migration plan and compares it against a full scan of your metadata store to verify that the migration will cover all of your data.
```
bufstream admin migrate metadata dry-run
```

This command is read-only, and though it is optional, you are encouraged to run it before performing the migration.
Note
The dry-run check may show false-positive "missed keys" warnings if any metadata changes while it runs (new topics, partitions, or consumer groups). If this happens, re-run the command; it should not warn consistently about the same key. Note that in the real migration, any new keys created during the process are already written to both stores via the dual-write system.
Monitor migration status
Run this command periodically (for example, using watch) as you perform the migration:
```
bufstream admin migrate metadata status
```

This prints a formatted summary of the migration status:

```
Cluster Mode: 1 (V1_ONLY)
Stability Window: 2m0s
Last Action: never
Stability Window Remaining: 0s (ready for next action)
Broker Modes:
bufstream-us-west1-a-0 (us-west1-a): 1 (V1_ONLY)
bufstream-us-west1-a-1 (us-west1-a): 1 (V1_ONLY)
bufstream-us-west1-b-0 (us-west1-b): 1 (V1_ONLY)
bufstream-us-west1-b-1 (us-west1-b): 1 (V1_ONLY)
Migration Action Revisions:
mode_2_rev: 0
mode_3_rev: 0
mode_4_rev: 0
sync_kv_started_rev: 0
sync_kv_done_rev: 0
migrate_queues_started_rev: 0
migrate_queues_done_rev: 0
```

Monitor cluster health
Metrics and logs are your early warning system during the migration. Watch for the following between each step, and if you see significant degradation in performance, consider reverting the last step and investigating before continuing:
- Metrics:
  - CPU and memory (brokers and Postgres) are stable.
  - Kafka error count is not increasing (obtained via `bufstream.kafka.request.count` with a non-empty `kafka.error_code`).
  - Consumer lag is bounded (`bufstream.kafka.consumer.group.lag`).
  - Producer throughput is consistent for constant-sized workloads (`bufstream.kafka.produce.bytes`).
- Logs:
- Bufstream: there should be no new ERROR-level messages (there may be INFO-level logs related to the migration itself).
- Clients: the migration is not expected to impact clients, so any changes in client-side logs may indicate a problem.
See the Bufstream metrics reference for a full list of available metrics.
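If you automate these checks, one concrete example is flagging a growing error counter. The sketch below assumes you have already sampled a cumulative counter (for example, `bufstream.kafka.request.count` filtered to a non-empty `kafka.error_code`) from your metrics backend; the sampling itself is not shown.

```python
def errors_increasing(samples: list[int], tolerance: int = 0) -> bool:
    """Return True if a cumulative error counter grew by more than
    `tolerance` across the sampled window.

    `samples` is a chronological list of counter readings.
    """
    if len(samples) < 2:
        return False
    return samples[-1] - samples[0] > tolerance

print(errors_increasing([5, 5, 5, 5]))    # False: flat error count is healthy
print(errors_increasing([5, 9, 14, 30]))  # True: errors are climbing
```

A small nonzero `tolerance` avoids alerting on a handful of transient errors while still catching sustained growth.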
Note
Metrics will fluctuate during the migration. In particular, Postgres CPU utilization will increase noticeably, and Bufstream latencies may increase slightly. You can expect these metrics to settle, at or below their pre-migration averages, a few minutes after the migration completes.
Migration steps
Run these commands in order. You can execute them in any Bufstream broker pod using `kubectl exec`.
Note
After each step, Bufstream enforces a 2-minute stability window during which further migration actions are blocked. Wait for this window to complete before proceeding.
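If you script the wait between steps, you can check readiness by parsing the `Stability Window Remaining` line of the status output shown earlier. A sketch, assuming the output format in this document:

```python
import re

def ready_for_next_action(status_output: str) -> bool:
    """Check the 'Stability Window Remaining' line of the status output.

    Bufstream reports '0s (ready for next action)' once the 2-minute
    stability window has elapsed.
    """
    match = re.search(r"Stability Window Remaining:\s*(.+)", status_output)
    if match is None:
        raise ValueError("no stability window line found")
    return "ready for next action" in match.group(1)

sample = "Cluster Mode: 2 (V1_WRITE_BOTH)\nStability Window Remaining: 0s (ready for next action)\n"
print(ready_for_next_action(sample))  # True
```

The cluster mode string in `sample` is a hypothetical placeholder; rely on the `status` command's actual output for the real values.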
Step 1: Enter dual-write mode (Mode 2)
The cluster writes to both old and new table schemas while reading from the old tables. New data is written to the new tables immediately; existing data is migrated in the next step.
```
bufstream admin migrate metadata advance-mode 2
```

Rollback: Revert to Mode 1.

```
bufstream admin migrate metadata revert-mode 1
```

Step 2: Migrate key-value data
Copies all existing key-value data from the old tables to the new tables in a background job. The cluster remains available during the job. Use the status command to monitor progress.
```
bufstream admin migrate metadata sync-kv start
```

Rollback: Cancel the job, then revert to Mode 1 if needed. If you revert to Mode 1, Bufstream will require you to run sync-kv again before continuing the migration.

```
bufstream admin migrate metadata sync-kv cancel
```

Concurrency
The `sync-kv start` command accepts a `--concurrency` option, which defaults to 1. This controls the number of data-copying threads within each broker. Increasing the concurrency copies data faster but also substantially increases Bufstream latency. Use the default concurrency unless you need faster completion and can tolerate higher latency.
Cancelling and resuming sync-kv
Because `sync-kv` may take several minutes to hours depending on the age of your cluster, it is resilient to broker restarts as well as termination of the `sync-kv start` command. It runs to completion unless you explicitly cancel it.
After cancelling `sync-kv`, you can resume without losing progress by starting the job again. You may wish to cancel and resume if, for example, you need to tune the concurrency or broker count.
If you terminate `sync-kv start` and rerun it without cancelling first, your client reconnects to the in-progress job and continues displaying its logs and progress.
Reverting to Mode 1 resets your progress, since Mode 1 is not a dual-write mode.
Step 3: Enter second dual-write mode (Mode 3)
The cluster begins reading from the new tables while still writing to both. Key-value data has been fully migrated. Message queues are atomically switched over upon first write; the next step ensures they all transition.
```
bufstream admin migrate metadata advance-mode 3
```

Rollback: Revert to Mode 2.

```
bufstream admin migrate metadata revert-mode 2
```

You cannot revert to Mode 1 after this step.
Step 4: Migrate message queues
Moves all message queues from the old tables to the new tables. The cluster remains available throughout.
```
bufstream admin migrate metadata migrate-queues
```

Rollback: Cancel the job.

```
bufstream admin migrate metadata migrate-queues cancel
```

If needed, you may revert to Mode 2 after cancelling this job.
You cannot revert to Mode 1 after this step.
Note: Unlike `sync-kv`, the duration of the `migrate-queues` job is bounded by the number of topics and partitions in your cluster. It typically finishes in seconds to minutes. However, terminating the `migrate-queues` command does not stop the job; you must explicitly cancel it.
Step 5: Finalize migration (Mode 4)
The cluster switches exclusively to the new tables. This step is irreversible without a backup restore.
```
bufstream admin migrate metadata advance-mode 4
```
No rollback available. Contact Buf if you experience problems after this step.