Towards High Performance Quantum Computing (HPQ): Parallelisation of the Hamiltonian Auto Decomposition Optimisation Framework (HADOF)

Sankar, Miliotis, Caton · arXiv:2604.27836 · 2026-05-01 (new submission, quant-ph) · score 8/10 (HIGH) · ← Back to digest

Abstract

Practical applicability of quantum optimisation on near-term devices is constrained by limited qubit counts and hardware noise. The Hamiltonian Auto Decomposition Optimisation Framework (HADOF) addresses this by decomposing large QUBOs into smaller subproblems that can be solved iteratively on quantum or classical backends. We extend the evaluation of HADOF by benchmarking on real IBM QPUs across sequential, single-QPU parallel, and multi-QPU parallel execution modes, advancing toward High Performance Quantum (HPQ) computing for combinatorial optimisation. Experimental results on IBM hardware show 3–4× wall-clock reduction with four QPUs, and we further validate HADOF on real-world genome-assembly instances.

Executive summary

HADOF is a QAOA-style hybrid that decomposes large QUBOs into fixed-width 5-qubit sub-problems, iteratively solves them with mean-field substitution, and aggregates. The paper's contribution is multi-QPU parallelisation: instead of solving sub-Hamiltonians sequentially, it submits them concurrently to four IBM Heron-class QPUs (ibm_kingston, ibm_pittsburgh, ibm_fez, ibm_marrakesh) using digitised QAOA with annealing-style parameter schedules. Demonstrated on randomly generated QUBOs (100–500 vars) and a 248-variable genome-assembly TSP-derived QUBO. Direct overlap with Yuan's Y3 (end-to-end QAOA on real IBM NISQ hardware), Y2 (constraint-handling for QUBO), and Y4 (decomposition philosophy similar to ADMM hybrid). The trick of using 5-qubit subcircuits to solve 500-variable problems on a 156-qubit device by avoiding monolithic embedding is precisely the kind of practical engineering Y3's analysis would credit.

Main contribution

HADOF iteratively decomposes a global QUBO HQ into sub-Hamiltonians Hi defined over k=5 variables each, with the inactive (n−k) variables substituted by their expected values P(xj) from previous iterations. The QAOA layer count starts at 1 and grows by 1 per global iteration (up to 5 layers). β and γ parameters follow an annealing-trotterised schedule βm = 1−m/p, γm = m/p, removing the classical optimisation loop. The paper's parallel extension generates all sub-Hamiltonians at the start of each global iteration using last-iteration expected values, then submits them concurrently to one or more QPUs. Multi-QPU runs use the Qiskit IBM-runtime scheduler with separate jobs per device. Genome assembly is encoded as a Hamiltonian-path QUBO over a directed acyclic overlap graph (Sec. 2 of the paper); the φX 174 bacteriophage instance has 50 nodes / 248 edges / 248 binary variables.

Key constructions and algorithmic steps

Detailed walkthrough

HADOF (Sec. 3, Methods) is a Hamiltonian-decomposition variant of QAOA: instead of embedding the full n-variable QUBO into a single ansatz, it solves k=5-variable sub-Hamiltonians iteratively. Inactive variables are replaced by their expected values P(xj) ∈ [0,1], creating an effective sub-Hamiltonian whose ground state is biased toward the global landscape. The algorithm progressively grows the QAOA depth L from 1 to p=5 across iterations, mimicking quantum-annealing-style adiabatic evolution with hand-set β/γ schedules — no classical outer optimisation loop, in contrast to standard QAOA. The mean-field-style substitution P(xj) → ⟨xj⟩ is updated by sampling each qubit individually with 500 shots per measurement.

The parallel extension (Sec. 3) is conceptually simple but operationally significant: in the original sequential HADOF, each sub-Hamiltonian within a global iteration uses the most-recently-updated expected values, creating a sequential data dependency. In the parallel version, all sub-Hamiltonians within a global iteration use the previous-iteration expected values, breaking the dependency and allowing concurrent submission to multiple QPUs. Within each iteration, expected values are updated only once at the end. Initial P(xi) = 0.5 for all variables.

The hardware setup (Sec. 4) uses four IBM Heron-class 156-qubit devices: ibm_kingston (Heron r2), ibm_pittsburgh (Heron r3), ibm_fez (Heron r2), and ibm_marrakesh (Heron r2), all heavy-hex topology. Crucially, this is the same Heron family that Y6 uses for its PBR test (Heron2). The paper notes that submission is staggered to avoid queue-time bias; all reported wall-clock times exclude queueing where reported as QPU-usage.

The accuracy story (Sec. 4.2, Figs. acca/accb) is structured around two metrics: (i) accuracy of the most-probable sampled solution, and (ii) average accuracy across the full sampled distribution, both normalised to simulated-annealing benchmarks. Under ideal simulation, both sequential and parallel HADOF achieve accuracy > 0.95 across all sizes 100–500. Under noisy simulation accuracy drops marginally to 0.92–0.97. On real Heron hardware, best-solution accuracy stays consistently above 0.80; the average-distribution accuracy drops to ~0.5–0.6 at larger sizes. Parallelisation does not degrade quality — the wall-clock advantages come "for free" in solution-quality terms.

Section 4.3 is the most directly comparable to Yuan's Y3: head-to-head HADOF vs full-circuit QAOA. At 50 variables on Kingston, full QAOA's average accuracy collapses to 0.02 (effectively random), while HADOF holds 0.53. The reason is mechanistic: full-circuit QAOA at 50 dense variables requires SWAP-heavy embedding on the 156-qubit heavy-hex layout (limited to ~50 dense variables in practice), and the resulting circuit depth saturates the device's coherence budget. HADOF avoids this entirely by working in 5-qubit subcircuits regardless of n. The paper's framing — "HADOF is robust because it never uses the full-circuit embedding" — is essentially what Y3's NISQ analysis was forecasting: monolithic QAOA fails at the noise crossover, but properly-decomposed QAOA can keep going.

The genome-assembly demonstration (Sec. 4.4) is the most ambitious application. The φX 174 bacteriophage genome (5386 bp) is reconstructed from 50 simulated reads (600 bp each) generated by Grinder. The overlap graph has 50 nodes / 248 edges, mapped to a QUBO with 248 binary variables for the Hamiltonian-path objective. Simulated annealing finds the unique optimal energy −496 in 213/N runs; ideal-simulation HADOF (sequential and parallel) reaches −496 in 187 and 193 runs respectively. On real Heron hardware, the best-accuracy is 0.88–0.91, no optimal solutions in the sampled distribution, but 52% of sampled solutions reconstruct the correct final sequence after post-processing with GFAtools. Multi-QPU parallel run-time is 889 s vs 2345 s sequential — a 2.6× speedup.

Section 5 (Conclusions) flags adaptive/problem-specific decomposition as a future direction. This is exactly the design space Y4's ADMM hybrid occupies. A natural composition: replace HADOF's fixed k=5 decomposition with an ADMM-style adaptive partitioning that respects QUBO sparsity, and run on the same multi-QPU testbed. Yuan's Y4 ε-approximation guarantees would translate into a stronger statement about the correctness of the federated solution.

Figures

QAOA circuit
Standard QAOA circuit alternating cost and mixer Hamiltonians (per Sec. 3, Methods). HADOF uses this as the per-sub-Hamiltonian unit with k=5 qubits.
Annealing-trotterised QAOA parameters
Trotterised QAOA parameter schedule βm=1−m/p, γm=m/p (per Sec. 3). The schedule mimics quantum annealing and avoids the classical optimisation loop.
HADOF overview
HADOF framework: (1) decompose binary variables into size-5 subsets; (2) form sub-Hamiltonians using P(xi) as expected values; (3) run depth-L QAOA per subset; (4) aggregate sampled solutions into the global probability distribution. Steps can be sequential or parallel.
Time-to-solution ideal
Wall-clock time-to-solution under ideal simulation across 100–500 variables. Parallel HADOF achieves ~3× speedup over sequential at 500 variables; both scale roughly linearly in n.
Time-to-solution noisy and hardware
Wall-clock TTS under noisy simulation and on real hardware. Parallel HADOF gives 4–5× speedup over sequential under noise; real-hardware (Kingston) parallel run is 3× faster than sequential; multi-QPU is marginally better than single-QPU parallel.
Best-solution accuracy
Best-sampled-solution accuracy normalised to simulated-annealing baseline. Ideal HADOF stays > 0.95; noisy HADOF stays in 0.92–0.97; real-hardware HADOF stays above 0.80 across all sizes.
Average-solution accuracy
Average-distribution accuracy. Ideal: ~0.85; noisy: ~0.75; real-hardware: drops to ~0.50 at larger sizes — noise primarily shifts probability mass off the optimum rather than wrecking the best sampled solution.
phiX174 assembled
Assembled overlap graph of the φX 174 bacteriophage: longest Hamiltonian path through the 50-node / 248-edge overlap graph that reconstructs the 5386 bp genome.

Citations to Yuan's papers

No direct citation to any of Y1–Y6 found in bibliography.

Overlap with Y1–Y6

Recommended action for Yuan