Quantum Search without Global Diffusion
Abstract
Quantum search is among the most important algorithms in quantum computing, offering fundamentally new approaches to optimisation and information processing. At its core is quantum amplitude amplification, a technique that achieves a quadratic speedup over classical search by combining two global reflections: the oracle, which marks the target, and the diffusion operator, which reflects about the initial state. We show that this speedup can be preserved when the oracle is the only global operator, with all other operations acting locally on non-overlapping partitions of the search register. We present a recursive construction that, when the initial and target states both decompose as tensor products over these chosen partitions, admits an exact closed-form solution for the algorithm's dynamics. Applied to unstructured search, the approach retains theO(√N)oracle complexity of Grover search when each partition contains at leastlog₂(log₂ N)qubits. On an 18-qubit search problem, partitioning into two stages reduces the non-oracle circuit depth by 51%–96% relative to Grover, requiring up to 9% additional oracle calls.
Executive summary
Burke and Mc Goldrick prove that the global diffusion reflection at the heart of Grover/amplitude amplification is unnecessary whenever the state preparation unitary A and the target state both factor as tensor products across a partition of the qubit register. They give a recursive construction in which only the oracle remains global; every other reflection acts within a single partition. The principal angles between the recursively defined reflections collapse to two values governed by a single recursion sin(2γᵢ) = sin(2θᵢ) sin(2tᵢ₋₁γᵢ₋₁), yielding a closed-form expression for success probability and an oracle overhead of 1/∏cos(θᵢ) over standard QAA. For Yuan, the relevance is direct: Y4 builds a Grover-based algorithm for cardinality-constrained binary optimisation, and the cardinality-constrained feasible space (under quasi-binary encoding from Y2) is exactly the kind of structured product space where local diffusion may yield significant depth savings on hardware.
Main contribution
The central theorem (Section Per-Stage Oracle Complexity) shows that, with intermediate stage iterates set to tᵢ = 1, the recursive construction Wᵢ = (Sψᵢ Wᵢ₋₁)tᵢ Sψᵢ (Wᵢ₋₁ Sψᵢ)tᵢ drives the amplitude on the outermost register towards the target with the per-round oracle cost bounded by C ≤ CQAA/∏ᵢ cos(θᵢ). The key technical observation is that despite the reflected eigenspaces growing exponentially (dimension 2i-1), the principal angles between consecutive reflections collapse to two values, reducing the analysis at every level to a scalar 2×2 problem. The total complexity sums geometrically over recursive rounds and the dominant cost remains the first round.
Key theorems / lemmas / algorithms
- Tensor-decomposition recursion (Eq. decomp): When
A = ⊗ᵢ Aᵢand|x⟩ = ⊗ᵢ |xᵢ⟩, defineWᵢrecursively from local partial diffusionsSψᵢ = I − 2|ψᵢ⟩⟨ψᵢ|and the oracleSx. - Principal-angle collapse: The principal angles between successive reflection eigenspaces take exactly two values
γᵢ, π/2 − γᵢwithsin(2γᵢ) = sin(2θᵢ) sin(2tᵢ₋₁γᵢ₋₁). - Closed-form success probability (Eq. stage_succ):
‖Pxₘ(SψₘWₘ₋₁)tₘ|ψ⟩‖² = ½[1 − cos(2θₘ)/cos(2γₘ) · cos(2(2tₘ+1)γₘ)]with optimaltₘ* = ⌊π/(4γₘ) − ½⌉. - Per-stage oracle bound:
C ≤ CQAA/∏ᵢ cos(θᵢ). - Asymptotic Grover preservation: for unstructured search with stage size
s ≥ log₂ log₂ Nthe overhead converges to 1, retainingO(√N)queries; the geometric subsequent-round factor is1/(1 − 21−s/2)(≈3.41 at s=3, ≈2 at s=4, <1.15 at s≥6).
Detailed walkthrough
The paper formalises a striking observation about Grover-style amplitude amplification: the global diffusion 2|ψ⟩⟨ψ| − I is conventionally implemented as an n-qubit reflection, requiring multi-controlled gates that dominate the non-oracle depth on near-term hardware. The authors show that whenever the prepared state and the target state both factor across a chosen partition of the n-qubit register into m blocks of sizes n₁, …, nₘ, the global diffusion can be replaced by a recursive construction built only from local partial diffusions Sψᵢ (each acting on register i, implemented from Aᵢ, Aᵢ†, and a local zero-state phase flip) and the global oracle Sx — defining W₀ = Sx and Wᵢ = (SψᵢWᵢ₋₁)tᵢ Sψᵢ (Wᵢ₋₁ Sψᵢ)tᵢ (overview, Section 3.1).
The decisive technical result is the principal-angle degeneracy proved in Section 3.2 (proof.tex, the longest file in the source). Although Wᵢ reflects in a subspace of dimension 2i-1, all principal angles between the eigenspaces of consecutive reflections collapse to just two values γᵢ and π/2 − γᵢ. Without this collapse, the dynamics would require analysing an exponentially growing number of mixing rates; with it, the amplification reduces to a scalar SU(2)-like rotation per recursion level. Equation stage_succ then gives the closed-form success probability for the outermost register and the oracle-optimal iteration count tₘ* = ⌊π/(4γₘ) − ½⌉. The authors prove that intermediate iterates tᵢ < m are optimal at tᵢ = 1 via a doubling argument, leaving a single recursion sin(2γₘ) = ∏ sin(2θᵢ) = 2m sin(θ) ∏ cos(θᵢ).
The total cost (Section Total Oracle Complexity, final_complexity.tex) is dominated by the first round; remaining rounds operate on progressively smaller spaces and form a geometric series. For unstructured single-target search with equal stage size s, the bound is Ctotal ≤ CGrover(n) / [∏cos(θᵢ) · (1 − 21−s/2)]. Both factors converge to 1 once s ≥ log₂ log₂ N, recovering Grover's O(√N) query count exactly. This matches the asymptotic gate scaling of earlier sparse-diffusion Grover variants (Grover '02 trade-offs) but eliminates global non-oracle operators completely.
The evaluation (Section 4, evaluation.tex) backs the theory with Qiskit simulation on an 18-qubit search for |1⟩⊗18. The 3-stage [6,6,6] configuration measures the target in 96.5% of shots vs. theoretical 96.7%; the 2-stage [9,9] in 99.5% vs. 99.8%. Three multi-controlled-X decomposition scenarios (no ancilla, dirty V-chain via idle qubits, single clean ancilla) are compared. Under S2 the diffusion-circuit depth drops 96–97% versus Grover at 9–30% extra oracle calls; with a clean ancilla (S3) depth still drops 51–63%. Failure modes also retain partial-prefix structure: 57% of 3-stage failures still recover the first-stage substring correctly. The scaling analysis (n=20 to n=50) shows failure probability and oracle overhead both decay rapidly: 4-stage at n=20 has 10% failure / 55% overhead, both improving to <0.1%/<5% at n=50. Break-even oracle-to-diffusion ratios under S2 reach ~10³ at n=50 (2-stage), so even AES-128 oracles (≈10× depth) gain about 5% total depth reduction.
The discussion calls out three limitations: mid-circuit measurement noise (mitigated by scaling: 9 measurements vs. 250 oracle calls for 100 qubits in 10 stages), per-stage rounding error from integer iteration counts (potentially fixable via fixed-point amplitude amplification or phase-tunable variants), and the tensor-decomposition requirement on state preparation (a hard constraint, but the target-state version of the constraint can be relaxed at the cost of breaking the angle degeneracy and computing each rotation plane independently). The authors note their construction is particularly suited to distributed quantum architectures because all non-oracle operations stay within a single processing node, removing exactly the inter-node entanglement that dominates noise in distributed Grover demonstrations.
Figures
Citations to Yuan's papers
Overlap with Y1–Y6
- Y4 (Grover + ADMM cardinality-constrained BO) — strongest overlap. Y4 uses Grover-based amplitude amplification with
O(√(C(n,k)/M))rotations on a cardinality-constrained feasible space. The state preparation that produces a Dicke-state-like uniform superposition overC(n,k)bitstrings is itself often built register-wise (e.g. via inclusion-exclusion or sub-block combinatorial preparations). Wherever Y4's preparation can be cast as a tensor product across blocks of qubits — or even approximately so — Burke & Mc Goldrick's local-diffusion recursion gives a drop-in replacement that preserves theO(√N)scaling and dramatically cuts non-oracle depth, which is precisely the bottleneck on heavy-hex hardware. Even if exact factorisation fails, the discussion's pointer to operator-Schmidt low-rank approximations is a research direction Y4 could prototype. - Y2 (quasi-binary QAOA, hard mixer) — methodological adjacency. Y2 uses a hard constraint-preserving mixer over a structured feasible set (2^log₂(U−L+1) qubits per variable) and avoids penalty terms. The same partitioned register structure that makes Y2's mixer block-local is exactly the structure under which Burke's local diffusion applies. The result here suggests a Grover-inspired analogue of Y2's mixer where the global diffusion of constraint-preserving amplitude amplification can be replaced by per-variable local reflections.
- Y3, Y1 — weaker overlap. Y3/Y1 use QAOA rather than Grover, but Y3's circuit-depth analysis under thermal noise speaks the same hardware-cost language. Y3 shows that depth is what kills NISQ-era quantum advantage in DGMVP; this paper's depth-reduction approach is in principle complementary (different ansatz, but same target metric).
- Y5, Y6 — no meaningful overlap.
Recommended action for Yuan
- Read in depth and adapt for Y4 follow-up. Concretely: check whether the Grover state preparation in Y4 (Dicke-state superposition over cardinality-k binary strings) can be cast as a tensor product after a constant-depth basis change. If yes, the recursion in §3.1 gives a depth-reduced variant of Y4's Grover stage with the same query complexity. If only an approximate tensor decomposition is available, the operator-Schmidt truncation suggestion in the discussion is the natural starting point.
- Email John Burke (TCD) to ask whether the construction has been tested with non-uniform partition sizes matched to a heavy-hex topology — directly relevant for IBM Heron/Eagle deployments that Y3/Y6 use. The Qiskit code is referenced as available; it would be a low-cost starting point for a Y4-extension experiment.
- Discuss with collaborators Wu/Chen (Y4 co-authors): the constraint-preserving amplitude amplification family used in Y2/Y4 is the obvious testbed. A short note benchmarking local-diffusion Grover against Y4's baseline on heavy-hex hardware would be a natural near-term collaboration.