Distributed Quantum Optimization for Large-Scale Higher-Order Problems with Dense Interactions
Abstract
Many real-world problems are naturally formulated as higher-order optimization (HUBO) tasks involving dense, multi-variable interactions, which are challenging to solve with classical methods. Quantum optimization offers a promising route, but hardware constraints and limitations to quadratic formulations have hampered their practicality. Here, we develop a distributed quantum optimization framework (DQOF) for dense, large-scale HUBO problems. DQOF assigns quantum circuits a central role in directly capturing higher-order interactions, while high-performance computing orchestrates large-scale parallelism and coordination. A clustering strategy enables wide quantum circuits without increasing depth, allowing efficient execution on near-term quantum hardware. We demonstrate high-quality solutions for HUBOs up to 500 variables within 170 seconds, significantly outperforming conventional approaches in solution quality and scalability. Applied to optical metamaterial design, DQOF efficiently discovers high-performance structures and shows that higher-order interactions are important for practical optimization problems. These results establish DQOF as a practical and scalable computational paradigm for large-scale scientific optimization.
Executive summary
A large ORNL+IBM team proposes DQOF — a HPC-orchestrated, distributed QAOA for higher-order unconstrained binary optimisation with up to three-body couplings. The core insight is that rather than quadratise cubic interactions (which blows up auxiliary variable count and distorts the landscape), you should attack the native HUBO directly on quantum hardware using QAOA with three-qubit Z-rotations. The authors scale this to N=500 variables and deliver approximation ratio >0.99 for N=40 in 37 s; they also run it end-to-end on IBM Heron r2 (ibm_fez) for 4–32-qubit combined Hamiltonians. A “clustering” trick packs multiple independent sub-HUBOs into a single wide circuit at fixed depth, exploiting the fact that Heron's width is easier to use than its depth. Directly relevant to Yuan's Y3 (QAOA + classical orchestration for constrained optimisation) and Y4 (ADMM-style hybrid decomposition).
Main contribution
Three interlocking pieces: (1) a decomposition framework that splits a dense N-variable HUBO into sub-HUBOs of size n, solves each via QAOA, and stitches sub-solutions into global candidates within an HPC-parallelised optimisation loop; (2) a clustering strategy that packs m independent sub-HUBOs into one mn-qubit circuit without deepening it, maximising hardware-width utilisation; (3) a full quantum-centric-supercomputing demonstration on IBM-Kingston and IBM-Fez Heron-r2 hardware — crucially including a comparison showing that hardware execution time is effectively width-independent, unlike simulator execution which is exponential in circuit width.
Key algorithmic ingredients
- Higher-order factorisation machine surrogate: ML surrogate fits linear + pairwise + triple-interaction coefficients, which map directly to HUBO self-, 2-local and 3-local terms.
- QAOA with 3-local cost Hamiltonian: exp(−iγ ZZZ) decomposed into 1-Z rotations + 2-qubit CNOT ladder; pairwise + cubic terms encoded natively without auxiliary variables.
- Sub-HUBO decomposition: N-variable HUBO split into many n=4 sub-HUBOs (concurrently optimised); each instance runs P parallel copies with different initialisations.
- Clustering combined Hamiltonian: pack m sub-Hamiltonians as independent blocks onto one hardware coupling map — total circuit width m×n, depth unchanged.
- Hardware execution configurations: S = noiseless simulator, E = fake-backend emulator, S+H = classical expectation / hardware sampling hybrid, H = full hardware.
- Active-learning feedback loop (AL-DQOF): surrogate model → HUBO → DQOF → physics simulation → new data, applied to transparent-radiative-cooler (TRC) metamaterial design (up to 40 bits).
Detailed walkthrough
Section 2.1 motivates HUBOs. The authors use a higher-order factorisation-machine surrogate (Hwang 2025) to express a design problem with linear, quadratic and cubic coefficients as a three-dimensional triangular-pyramid HUBO. Rugged energy landscapes with combinatorial numbers of local minima are characteristic; brute force is exponential in N. Quadratisation (standard path) introduces O(N2) auxiliary variables, inflating the search space; linearisation under MILP does similar damage. So directly attacking cubic terms is motivated.
Section 2.2 lays out the DQOF workflow (Fig. 1a). The full HUBO is decomposed into sub-HUBOs that can be solved in parallel; each sub-solution is aggregated into a candidate global solution, and the best of P candidates wins. Fig. 1b shows that at P=10 instances and N=40, DQOF achieves approximation ratio >0.99 in <37 s. The clustering operation (Fig. 1c) stacks m independent sub-Hamiltonians onto disjoint sets of physical qubits, recovering higher-quality statistics per hardware job; Fig. 1d contrasts the scaling: clustering increases width but keeps depth constant, while directly encoding a full HUBO of equivalent width requires deeper circuits.
Section 2.3 is the hardware story. They run on IBM Heron r2 (ibm_fez/Kingston) with m=1..8 cluster sizes (4–32 qubits wide), QAOA p=2, 200 classical iterations, 10000 shots. Fig. 2a shows approximation ratio across S/E/S+H/H configurations: the noisy emulator fails beyond 12 qubits because density-matrix simulation is O(4n); hardware sampling (“S+H”) introduces measurement error but remains usable; full hardware (“H”) works well up to m=4 and degrades for m≥5 under accumulated noise and the strict 200-iteration budget. The striking result is Fig. 2b,c: simulator time-to-solution grows exponentially in width, hardware time-to-solution is essentially width-flat at fixed depth. At widths ≥28 qubits hardware beats simulator on wall-clock, even accounting for queue time. This is a practical argument for quantum utility that does not rest on asymptotic speedup claims.
Section 2.4 pushes to N=100 with m=1..5, P=2, using S+H (because full-hardware AL loops are prohibitively expensive). Convergence improves with larger m; time-to-solution scales linearly in DQOF iterations; QPU usage is nearly m-independent, implying DQOF is natively quantum-centric-supercomputing friendly.
Section 2.5 compares DQOF against simulated annealing, hybrid quantum annealing (D-Wave), and MILP (Gurobi) on quadratised/linearised formulations. Fig. 3 shows DQOF maintains relative accuracy close to 1 out to N=500 where SA/HQA/MILP all degrade sharply — the quadratisation penalty-strength problem (no globally valid choice) is singled out as the root cause.
Section 2.6 is the applications layer: AL-DQOF optimises a TRC (transparent radiative cooler) photonic stack up to 40 bits. The key finding is that HUBO-based surrogates (with cubic terms) yield substantially lower figure-of-merit than QUBO-based surrogates — i.e. higher-order interactions are essential for the physics. Optimised structures deliver up to 28% cooling energy savings vs standard glass.
For Yuan, the distribution + clustering idea is the part most worth transplanting. Y3 already uses classical orchestration (dual annealing + layerwise) around a QAOA inner loop; DQOF's clustering strategy is a direct drop-in for Y3's portfolio-QAOA when targeting wider Heron devices. The HUBO-vs-QUBO argument also strongly supports the “don't-quadratise-if-you-don't-have-to” position Y2 implicitly takes by using a quasi-binary encoding rather than a penalty-blown-up QUBO.
Figures




Citations to Yuan's papers
Overlap with Y1–Y6
- Y3 (QAOA DGMVP portfolio + layerwise + classical orchestration) — Direct method/scope overlap. Y3 orchestrates classical optimisers around a QAOA inner loop; DQOF adds explicit HPC-level decomposition + clustering, precisely the infrastructure Y3 would need to scale beyond the 20-qubit regime it currently targets. The AL-DQOF loop is a natural fit for portfolio rebalancing.
- Y4 (Grover + ADMM cardinality-constrained BO) — Shared decomposition philosophy. Y4 uses classical ADMM to drive an inner Grover routine; DQOF uses classical HPC to drive an inner QAOA routine. Both build epsilon-approximation pipelines around quantum primitives, though DQOF is empirical while Y4 carries explicit convergence guarantees.
- Y2 (quasi-binary + hard mixer) — Indirect method overlap. DQOF's “don't quadratise” argument aligns with Y2's motivation for avoiding penalty-blown-up QUBOs; the two approaches are complementary (Y2 avoids penalties via encoding + mixer design, DQOF avoids them by handling cubic terms natively).
- Y1 (warm-started QAOA), Y5, Y6 — Limited direct overlap.
Recommended action for Yuan
- Cite in the next portfolio-QAOA or scalability paper. This is the strongest recent demonstration that clustering on Heron wins on wall-clock at width ≥ 28 — a clean empirical data point for Y3's quantum-advantage story.
- Apply the clustering strategy to Y3's DGMVP portfolio QAOA. Y3's current pipeline leaves most of Heron's width unused; packing multiple independent portfolios into one combined Hamiltonian would amortise submission overhead and strengthen Y3's hardware-vs-simulator story.
- Consider reaching out to Suh / Pascuzzi at ORNL / IBM. Their HPC + Heron pipeline is operational; a collaboration applying it to cardinality-constrained portfolio variants (Y3 meets Y4) is mechanically straightforward and could yield a high-visibility joint result.