Quantum Search without Global Diffusion

Authors: John Burke, Ciaran Mc Goldrick (Trinity College Dublin) · arXiv:2604.15435 · submitted 20 April 2026 · score 8/10 (HIGH)

Abstract

Quantum search is among the most important algorithms in quantum computing, offering fundamentally new approaches to optimisation and information processing. At its core is quantum amplitude amplification, a technique that achieves a quadratic speedup over classical search by combining two global reflections: the oracle, which marks the target, and the diffusion operator, which reflects about the initial state. We show that this speedup can be preserved when the oracle is the only global operator, with all other operations acting locally on non-overlapping partitions of the search register. We present a recursive construction that, when the initial and target states both decompose as tensor products over these chosen partitions, admits an exact closed-form solution for the algorithm's dynamics. This is enabled by an intriguing degeneracy in the principal angles between successive reflections, which collapse to just two distinct values governed by a single recursively defined angle. Applied to unstructured search, a problem that naturally satisfies the tensor decomposition, the approach retains the O(√N) oracle complexity of Grover search when each partition contains at least log₂(log₂ N) qubits. On an 18-qubit search problem, partitioning into two stages reduces the non-oracle circuit depth by as much as 51%–96% relative to Grover, requiring up to 9% additional oracle calls.

Executive summary

Burke and Mc Goldrick prove that the global diffusion operator in Grover/amplitude amplification is not strictly necessary: if the state-preparation unitary and target state both factor as tensor products across a partition of the qubit register, the global reflection can be replaced entirely by local, register-scoped reflections, with the oracle as the only operator that still spans the full register. The proof identifies a surprising degeneracy — the principal angles between successive reflection subspaces (whose ambient dimension grows exponentially) collapse to two values governed by a scalar recursion. For unstructured search the overhead is negligible once each partition holds at least log₂(log₂ N) qubits, and at n = 50 the two-stage scheme matches Grover's success probability at oracle cost increase <0.1% while slashing non-oracle depth. For Yuan, this is directly load-bearing for Y4's Grover-ADMM pipeline: the dominant hardware cost of a Grover rotation for a cardinality-constrained feasible set is the non-oracle diffusion on the full n-qubit register, exactly what this paper localises.

Main contribution

The central object is a family of reflection operators W_i, defined recursively from the oracle S_x and local partial-diffusion reflections S_{ψ_i} = I − 2|ψ_i⟩⟨ψ_i| acting only on register i:

W_i = (S_{ψ_i} W_i−1)^t_i S_{ψ_i} (W_i−1 S_{ψ_i})^t_i, W₀ = S_x

Within the span of the initial and target register states, the product S_{ψ_i} W_i−1 acts as a rotation with two distinct principal angles {γ_i, π/2 − γ_i} where sin(2γ_i) = sin(2θ_i) · sin(2t_i−1γ_i−1) and γ₁ = θ₁. This scalar recursion yields a closed-form success probability per stage and an oracle overhead bounded by 1/∏_i cos(θ_i) relative to standard amplitude amplification. Measure the non-final registers, reinitialise, repeat on the shrinking remainder — the target is recovered stage-by-stage with a geometric-series overhead dominated by the first round.

Key theorems / lemmas / algorithms

Principal-angle degeneracy (Sec. 4. proof): Despite eigenspaces of dimension 2ⁱ⁻¹, principal angles between W_i−1's reflected subspace and the S_{ψ_i}-subspace collapse to two values governed by a single recursively defined angle γ_i.
Stage success probability (Eq. stage_succ): ||P_{x_m}(S_{ψ_m}W_m−1)^t_m|ψ⟩||² = ½[1 − (cos 2θ_m/cos 2γ_m) cos(2(2t_m+1)γ_m)], with optimal iteration count t*_m = ⌊π/(4γ_m) − ½⌉.
Oracle-overhead bound (Sec. final_complexity): Total oracle cost relative to standard AA is bounded by 1/∏_k=1^m cos(θ_k); for unstructured single-target search with partition sizes n_i ≥ log₂(log₂ N), the O(√N) Grover complexity is retained.
Algorithm (Sec. algorithm): Outer loop over stages m→1. At stage i, apply (S_{ψ_i} W_i−1)^t_i to the full register, measure registers 1…i−1, reinitialise, descend to the next stage. Qiskit reference implementation publicly available.

Detailed walkthrough

The paper's pitch is structural: standard Grover/AA requires two global reflections — an oracle S_x that marks the target, and a diffusion operator S_ψ = I − 2|ψ⟩⟨ψ| that reflects about the initial state. The second is typically implemented as a multi-controlled operation that touches every qubit; on near-term hardware with limited connectivity, this is expensive. Prior work on partial-diffusion operators (Grover-Radhakrishnan-Korepin, partial search, Zalka-style tradeoffs) reduces but does not eliminate the global reflection; when global diffusion is replaced entirely, the dynamics become a unitary on an exponentially growing subspace and do not admit a closed-form analysis in general.

The key insight of Sec. 3 (algorithm overview) is that if A = ⊗_i A_i and |x⟩ = ⊗_i |x_i⟩ factor across a register partition of sizes n₁+…+n_m = n (unstructured search with Hadamards satisfies this trivially with any partition), then the recursive reflection W_i defined above acts on span{|ψ_i⟩, |ψ_i^⊥⟩} as a rotation with just two principal angles. The proof in Sec. algorithm/proof.tex is an induction: at each level of the recursion the ambient invariant subspace doubles, but the reflection only "sees" two directions per register. As a consequence, the squared projections onto |ψ_i⟩ and |ψ_i^⊥⟩ are cos²(2t_iγ_i) and sin²(2t_iγ_i) irrespective of the lower-register state — which is what makes the stage-by-stage analysis tractable.

The success-probability formula is then derived by tracking projections onto the register-m target component. Measurement collapses the outer registers; the target register is in a mixed state on span{|ψ_m⟩, |ψ_m^⊥⟩}. The remaining registers are reinitialised and the search iterates on a shrinking register. Because each stage introduces only a multiplicative 1/cos(θ_i) in query count, and the partition sizes geometrically shrink (for unstructured search) or can be held uniform, the total oracle budget is a geometric series dominated by the first round.

Sec. 5 (Evaluation) does two things. First, an 18-qubit Qiskit simulation: for target |1⟩^⊗18, a 3-stage [6,6,6] configuration with (102, 25, 6) iterations per stage measures the target 96.5% of the time vs. the theoretical 96.7%; the 2-stage [9,9] configuration with (201, 17) iterations achieves 99.5% vs. 99.8%. Grover with matched success probability is the baseline. The authors evaluate non-oracle circuit depth under three MCX-decomposition scenarios — S1 (no ancillae), S2 (dirty V-chain using idle qubits), S3 (one clean ancilla). Under S1 at the 3-stage configuration, diffusion depth drops by 95%; under S2 by 97%; with one clean ancilla (S3) by 63%. The oracle-call overheads are 30% (3-stage) and 9% (2-stage). Importantly, the authors compute the break-even oracle-to-diffusion-depth ratio at which the recursive scheme still beats Grover in total depth: for 3-stage it is ∼3.1, 3.2, 1.7 (S1/S2/S3); for 2-stage it is ∼8.8, 10.6, 5.2.

Sec. 5.2 (Scaling analysis) extends to n = 20–50 qubits using the closed-form expressions. The headline: both failure probability and oracle overhead improve with n for fixed stage count. By n = 50, 2-stage failure < 10⁻⁷ with <0.1% oracle overhead; 3-stage < 10⁻⁴ with <1%; 4-stage < 10⁻³ with <5%. At this scale the break-even oracle depth ratio under S2 reaches ∼10³ for 2-stage, meaning the scheme is depth-favourable even when the oracle is three orders of magnitude deeper than a global diffuser. Applied to an AES-128 key-recovery oracle (oracle-to-diffusion depth ratio ∼10×), the recursive scheme reduces total depth by ∼5%.

A subtle point: success probability per stage saturates below Grover's (99.99%+) because each stage's probability is bounded by the scalar recursion — the paper is explicit that quadratic-speedup preservation comes with a small probability deficit that shrinks with n. For applications that tolerate a ~10⁻³–10⁻⁷ miss rate (essentially all quantum-advantage arguments for combinatorial search), this is a very favourable trade: dramatically shallower non-oracle circuits at essentially free success probability. For applications requiring exact success (deterministic Grover variants), this scheme is not a drop-in.

Figures

Figure 1. Simulation and depth comparison of the recursive search scheme against standard Grover on an 18-qubit single-target problem. Top row: measurement-outcome distributions for 3-stage [6,6,6] and 2-stage [9,9] configurations over 1000 shots. Bottom left: diffusion-operator circuit depth (log scale) under decomposition scenarios S1 (no ancillae), S2 (dirty V-chain), S3 (one clean ancilla). Bottom right: total oracle-call counts vs. Grover baseline matched to the same success probability.

Figure 2. Scaling of the recursive scheme for k = 2, 3, 4 near-equal stages across n = 20 to 50 qubits (unstructured single-target search). Left: failure probability 1 − p on log scale. Right: relative oracle overhead vs. Grover matched to the same success. Both quantities improve rapidly with n for every stage count.

Figure 3. Depth scaling of the recursive scheme vs. Grover for k = 2, 3, 4 stages across n = 20 to 50. Top row: break-even oracle-to-diffusion depth ratio under S1/S2/S3 decompositions (below this ratio the recursive scheme is strictly shallower). Bottom row: total circuit depth relative to Grover for oracle-depth multipliers 1×, 5×, 10× under S3. Dashed red line marks parity with Grover.

Citations to Yuan's papers

No direct citation to any of Y1–Y6 found in bibliography.

Overlap with Y1–Y6

Y4 (Grover + ADMM for cardinality-constrained BO) — primary overlap: Y4's algorithm relies on Grover rotations over a structured feasible space (cardinality-k n-bit strings, size C(n,k)) with total query cost O(√(C(n,k)/M)). The dominant per-rotation hardware cost is the diffusion over the full n-qubit register. Burke–Mc Goldrick replace that global diffusion with local partial diffusers over register partitions, at ≤9% extra oracle queries. In Y4's Grover-subroutine inner loop this could directly translate to 50–95% non-oracle depth reduction for the cardinality-feasible-space search, especially on heavy-hex topology (which Y4 is very likely to target since IBM Heron is Yuan's hardware of choice in Y6). Caveat: the partition-factoring condition |x⟩ = ⊗_i |x_i⟩ is trivial for unstructured search; adapting it to cardinality-constrained Grover (where the marked states live on a cardinality-k shell) requires checking whether the initial cardinality-preserving state can be written as a tensor product across some partition — generally it cannot, which is a nontrivial conceptual hurdle worth investigating.
Y1 & Y3 (QAOA warm-starting and DGMVP) — tangential: No direct method link; QAOA is not Grover-structured. However the general philosophy of reducing non-oracle/non-cost-Hamiltonian global operators in favour of local structure aligns with parameter/compilation efficiency work. The "scalar reduction on exponentially growing subspaces" analysis style is potentially portable to QAOA trajectories.
Y2 (quasi-binary QAOA) — no direct overlap.
Y5 (Gibbs + Pauli sparsity SDP) — no direct overlap.
Y6 (PBR on Heron2) — hardware scope only: Both target NISQ superconducting hardware; Burke–Mc Goldrick's heavy-hex depth numbers are directly relevant to what Yuan's group runs on Heron.

Recommended action for Yuan

Adapt to Y4 follow-up. The most tangible opportunity: investigate whether the recursive local-diffusion scheme extends to cardinality-constrained Grover. Key question — does the Dicke-state (or symmetric superposition on the cardinality-k shell) admit a tensor-product decomposition over some block partition? If not exactly, is there an approximate factorisation that preserves the quadratic speedup within an acceptable overhead? Concrete next step: write a one-page analysis applying the W_i recursion to the Dicke-state initial condition used in Y4.
Cite on next Y4 revision. Even without direct adaptation, this is the state of the art on "Grover with local diffusion" as of April 2026 — any future Y4 experimental section implementing Grover rotations on heavy-hex hardware should cite Burke–Mc Goldrick when discussing diffusion-circuit depth.
Read deeper before adapting. The proof of the principal-angle degeneracy is the paper's technical core; Yuan should read Sec. algorithm/proof.tex carefully before building on the scheme, since the degeneracy is what makes the scalar recursion tractable and is also likely what breaks in structured-subspace generalisations.