← Back to digest

Qubit-Scalable CVRP via Lagrangian Knapsack Decomposition and Noise-Aware Quantum Execution

Authors: Monit Sharma, Hoong Chuin Lau · arXiv: 2604.22194 · submitted 2026-04-27 · score 8/10 (HIGH)

Abstract

Hybrid quantum optimization for vehicle routing faces a practical bottleneck: (i) direct QUBO encodings of CVRP quickly exceed near-term qubit and gate budgets; and (ii) near-term quantum evaluations are expensive, noise-limited, and highly sensitive to backend and circuit configuration. We address this gap with an end-to-end decomposition-based pipeline that converts CVRP into a sequence of bounded-width quantum subproblems and treats quantum execution itself as a decision problem within the optimization loop. Starting from a Fisher–Jaikumar assignment linearization, we apply Lagrangian relaxation to dualize customer-assignment couplers, yielding independent, per-vehicle knapsack subproblems that admit QUBO/Ising evaluation. To replace brittle subgradient tuning in the outer loop, we learn a multiplier-update controller using a two-stage protocol (expert-guided pretraining followed by reinforcement learning fine-tuning) with rewards defined by execution-realized progress and route reconstruction. Furthermore, we introduce constrained contextual bandit as a hardware-aware execution layer that selects backend and circuit configuration with feasibility screening, thereby enabling safe adaptation across heterogeneous noisy resources and parallel multi-QPU scheduling. Computational results on multiple CVRPLIB families demonstrate that (i) the decomposition yields stable bounded-width subproblems across instance sizes, (ii) learned multiplier updates improve end-to-end routing quality relative to classical subgradient control under matched budgets, and (iii) hardware-aware configuration on real devices reduces median optimality gaps compared with static hardware execution, improving the utility of each costly quantum evaluation. Our goal is not to claim quantum advantage, but together, these components provide a practical end-to-end framework for scaling hybrid quantum CVRP optimization through OR decomposition, AI (learning-augmented dual control), and adaptive hardware-aware execution.

Executive summary

This paper builds an end-to-end hybrid pipeline for the capacitated vehicle routing problem (CVRP) on near-term hardware. The core trick is replacing the monolithic CVRP QUBO with a Fisher–Jaikumar linearization plus Lagrangian relaxation that produces a stream of per-vehicle knapsack QUBO subproblems whose logical width is bounded by the candidate set rather than the full customer count. Two learning layers sit on top: a PPO-fine-tuned multiplier-update controller replaces hand-tuned subgradient ascent, and a constrained contextual bandit selects backend, depth, qubit-placement, and entanglement pattern subject to gate-budget feasibility. For Yuan, the relevance is direct: the encoding-efficiency, hard-feasibility, and hybrid classical–quantum-decomposition themes line up squarely with Y2 (quasi-binary portfolio QAOA), Y3 (noise-aware QAOA portfolio), and Y4 (ADMM cardinality-constrained BO).

Main contribution

The paper contributes a four-piece system. (1) An OR decomposition that converts CVRP via Fisher–Jaikumar GAP linearization, then dualizes the customer-assignment couplers via Lagrangian relaxation, yielding K independent 0–1 knapsack subproblems with bounded width. (2) A novel tilted quadratic capacity penalty (Eq. 4) that biases towards a controlled headroom band W ∈ [Q−s, Q] instead of using slack variables — saving qubits and improving feasibility yield under sampling noise. (3) A two-stage learned multiplier controller (expert behavior cloning + PPO fine-tuning) with execution-realized rewards. (4) A constrained contextual bandit (LinUCB) for hardware-aware backend/configuration selection, with explicit gate-budget feasibility screening. Empirically, on 30 CVRPLIB instances, learned multiplier control alone cuts mean end-to-end gap from 26.0% (subgradient) to 14.8%, and hardware-aware execution further trims the bandit-enabled hardware setting from 17.8% (RL+HW) to 15.7% mean gap.

Key algorithms / formulations

Detailed walkthrough

The paper opens by emphasizing two NISQ pain points (Section 1): direct QUBO encodings of CVRP scale superlinearly because assignment and sequencing decisions are coupled in a monolithic formulation, and execution quality varies sharply with backend/configuration. The proposed system addresses both layers separately.

Decomposition (Section 3). The pipeline begins with the Fisher–Jaikumar linearization (Section 3.2) which fixes a per-arc cost approximation a_ik for assigning customer i to vehicle k, separating assignment from routing. Section 3.3 then takes the GAP and dualizes the customer-coverage constraints Σ_k y_ik = 1 via Lagrange multipliers λ_i. This is the key move: the Lagrangian is separable across vehicles, and each per-vehicle subproblem becomes a 0–1 knapsack of width equal to the candidate set considered for that vehicle (Section 3.4). A tilted quadratic penalty (Eq. 4) replaces slack-variable encodings of the capacity inequality, eliminating ⌈log₂ Q⌉ auxiliary qubits per vehicle and yielding a further width reduction beyond decomposition alone. The authors justify the tilt by noting that under sampling noise, near-feasible solutions that "hug" the boundary frequently flip to slight overloads; biasing toward a band below capacity raises feasibility yield in decoding (Appendix A micro-benchmark on noisy simulator).

Learned multiplier control (Section 5). Standard subgradient updates on λ are stepsize-sensitive and oscillate when subproblem solutions are noisy. The authors formulate multiplier updates as a controlled MDP whose state aggregates per-customer assignment statistics, dual progress signals, and instance descriptors. Stage 1 is behavior cloning of an expert subgradient demonstrator with curriculum-difficulty regularization; Stage 2 is PPO fine-tuning with reward shaped by dense per-iteration progress signals plus periodic end-to-end routing reward (after decoding/repair). The ablation (Section 7.3, Table I) is striking: subgradient → 26.0% mean gap, pretraining → 18.9%, pretraining + PPO → 14.8%, with mean runtime dropping from 148.6 s to 40.1 s. These numbers are before any quantum execution noise enters the picture, so the improvements come from outer-loop control quality rather than execution-side fidelity.

Hardware-aware execution (Section 6). Even bounded-width subproblems suffer SWAP overhead, depth–coherence tradeoffs, and calibration drift. The authors cast configuration as a constrained contextual bandit over (qubit-placement, ansatz family, depth) tuples. Context features include backend descriptors (T1, T2, two-qubit error rates, coupling map degree) and subproblem context (logical width, density). LinUCB with bounded execution-proxy reward learns offline from traces, then deploys with feasibility screening: any top-ranked arm whose estimated gate count exceeds G_max is rejected and the policy falls back to the best feasible alternative.

Results (Section 7). On 30 CVRPLIB test instances under matched evaluation budgets, four findings stand out. (i) FJ+LR gives orders-of-magnitude reduction in logical width vs direct CVRP QUBO, with the slack-free tilted penalty reducing width further. (ii) Learned multiplier control roughly halves both mean gap and runtime relative to subgradient. (iii) Bandit-enabled hardware execution lowers mean gap from 17.83% (RL+HW) to 15.72% (RL+HW+Bandit) and median from 17.02% to 14.32%; noisy simulation alone is worse (18.54% mean). (iv) Gate-budget hit rate is negligible at small widths but rises to 9–16% at larger widths — feasibility screening becomes operationally binding rather than ornamental. The honest discussion (Section 8) notes that none of these variants beats a 20s OR-Tools call in absolute terms; the contribution is at the systems level — making each expensive quantum evaluation more useful within an OR-driven decomposition.

Figures

Excess optimality gap relative to OR-Tools
Figure 1. Excess optimality gap relative to OR-Tools (20 s/instance). For each instance i and method m, Δgap = gap_m(i) − gap_ORTools(i) (percentage points), computed end-to-end after decoding/repair and route reconstruction. Lower is better; red dashed line at 0 denotes parity with OR-Tools.
Logical-width scaling: direct vs FJ+LR
Figure 2. Logical-width (qubit) scaling of direct CVRP QUBO encoding versus Fisher–Jaikumar (FJ) + Lagrangian relaxation (LR) decomposition into per-vehicle knapsack QUBOs. Bars report required logical qubits (log scale) for three instance classes; the slack-free tilted penalty further reduces width.
Multiplier-control ablation under classical subproblem solves
Figure 3. Multiplier-control ablation under classical knapsack subproblem solves (30 CVRPLIB instances). (a) End-to-end CVRP optimality gap (%) after decoding/repair and route reconstruction. (b) Wall-clock runtime per LR episode. Learned multiplier control reduces both median gap and time-to-solution relative to subgradient updates.
Offline bandit learning curve
Figure 4. Offline bandit learning curve under trace-driven training. Per-episode bounded reward in [0,10]; rolling-mean and smoothed traces summarize the learning trend.
Bandit arm-selection dynamics
Figure 5. Bandit arm-selection dynamics during trace-driven offline training. Rolling-window selection frequencies for the most frequently selected configuration arms; the policy transitions from broad exploration to concentration on high-utility configurations.
Gate-budget hit rate by logical-width bin
Figure 6. Gate-budget hit rate by logical-width bin. Bars show the fraction of consult calls in which the top-ranked arm is rejected because Ĝ > G_max; line shows logged consult calls per bin (right axis).
Budget–quality tradeoff
Figure 7. Budget–quality tradeoff: gap to BKS versus quantum-evaluation budget for multi-QPU baseline and noise-aware runs. Quantum budget is total transpiled circuit executions / QPU jobs (log scale); gray segments connect matched instances.
End-to-end CVRP optimality gap across execution variants
Figure 8. End-to-end CVRP optimality gap (%) after decoding/repair and route reconstruction, comparing execution variants under learned multiplier control: noisy simulation, real hardware without bandit, and real hardware with bandit-based configuration.

Citations to Yuan's papers

No direct citation to any of Y1–Y6 found in bibliography. (The single "Yuan" hit in the .bbl is Y.-x. Yuan, an unrelated optimization-methods author, in xiao2022class.)

Overlap with Y1–Y6

Recommended action for Yuan