Quantum Search without Global Diffusion

John Burke, Ciaran Mc Goldrick (Trinity College Dublin) · arXiv:2604.15435 · new submission, 20 April 2026 · score 8/10

Abstract

Quantum search is among the most important algorithms in quantum computing, offering fundamentally new approaches to optimisation and information processing. At its core is quantum amplitude amplification, a technique that achieves a quadratic speedup over classical search by combining two global reflections: the oracle, which marks the target, and the diffusion operator, which reflects about the initial state. We show that this speedup can be preserved when the oracle is the only global operator, with all other operations acting locally on non-overlapping partitions of the search register. We present a recursive construction that, when the initial and target states both decompose as tensor products over these chosen partitions, admits an exact closed-form solution for the algorithm's dynamics. This is enabled by an intriguing degeneracy in the principal angles between successive reflections, which collapse to just two distinct values governed by a single recursively defined angle. Applied to unstructured search, a problem that naturally satisfies the tensor decomposition, the approach retains the O(√N) oracle complexity of Grover search when each partition contains at least log₂(log₂ N) qubits. On an 18-qubit search problem, partitioning into two stages reduces the non-oracle circuit depth by as much as 51%–96% relative to Grover, requiring up to 9% additional oracle calls. For larger problem sizes this oracle overhead rapidly diminishes, and valuable depth reductions persist when the oracle circuit is substantially deeper than the diffusion operator. More broadly, these results show that a global diffusion operator is not necessary to achieve the quadratic speedup in quantum search, offering a new perspective on this foundational algorithm.

Executive summary

Burke and Mc Goldrick eliminate the global diffusion operator from Grover-style amplitude amplification, replacing it with reflections that act entirely within tensor-product partitions of the register. The oracle remains the only operator that touches every qubit. They prove that despite the exponentially growing eigenspaces of the recursive partial-reflection product, the principal angles collapse to two values determined by a single recursively defined angle γ_i, yielding a closed-form success probability and an oracle overhead of 1/∏ cos(θ_i) over standard QAA. For Yuan, this is a direct upgrade to the Grover routine that underpins Y4 (cardinality-constrained binary optimisation via Grover): the partitioned local-diffusion construction promises significant non-oracle depth savings (51%–96% on 18 qubits, 25%–35% at n=50 when oracle and diffusion are comparable in depth) precisely the regime where Y4's amplitude-amplification rotations dominate hardware cost.

Main contribution

The paper proves that whenever the state preparation unitary A and the target |x⟩ factorise as A = ⊗A_i and |x⟩ = ⊗|x_i⟩ across an m-fold partition, recursive reflection W_i = (S_{ψ_i} W_i−1)^t_i S_{ψ_i} (W_i−1 S_{ψ_i})^t_i built from local partial diffusers S_{ψ_i} and the oracle W₀ = S_x drives amplitude into one register at a time. Despite the dimension-2ⁱ⁻¹ reflected subspaces, the principal angles between successive reflections collapse to {γ_i, π/2 − γ_i} with γ_i obeying sin(2γ_i) = sin(2θ_i)·sin(2t_i−1γ_i−1). The total oracle cost is bounded by C_Grover⁽ⁿ⁾ / [∏cos(θ_i) · (1 − 2^1−s/2)] for equal stage size s; for s ≥ log₂ log₂ N the algorithm achieves the asymptotically optimal O(√N) query complexity while every non-oracle operator acts on only s qubits.

Key theorems / lemmas / algorithms

Recursive reflection W_i (Eq. in §3.1): defines the building block from S_{ψ_i} and W_i−1; itself a reflection.
Principal-angle degeneracy: the spectrum of S_{ψ_i}W_i−1 collapses to two angles {γ_i, π/2 − γ_i} with sin(2γ_i) = sin(2θ_i)·sin(2t_i−1γ_i−1), γ₁ = θ₁ — the technical heart of the paper, reducing the dynamics to a scalar recursion.
Closed-form per-stage success: ‖P_{x_m} (S_{ψ_m}W_m−1)^t_m|ψ⟩‖² = ½[1 − cos(2θ_m)/cos(2γ_m) · cos(2(2t_m+1)γ_m)], with optimal t_m* = ⌊π/(4γ_m) − ½⌉.
Oracle overhead bound: per-stage factor 1/∏cos(θ_i) ≤ exp[log₂(N)/(s·2^s+1)], converging to unity when partitions are at least log₂log₂ N qubits.
Total-cost bound (boxed result, §3.4): C_total ≤ C_Grover⁽ⁿ⁾ / [∏cos(θ_i) · (1 − 2^1−s/2)], capturing the geometric series from re-running the search on shrinking sub-registers.

Detailed walkthrough

The classical Grover loop is a pair of global reflections: the oracle S_x marks |x⟩ and the diffusion operator S_ψ = I − 2|ψ⟩⟨ψ| reflects about the prepared initial state |ψ⟩ = A|0⟩. Their product is a rotation in the 2-D plane span{|ψ⟩, |x⟩}. The authors' question: can the second global reflection be replaced entirely by operators acting within sub-registers of the search space?

The answer is yes whenever A and |x⟩ both factor as tensor products across an m-partition of the n qubits. The construction defines partial diffusers S_{ψ_i} = I − 2|ψ_i⟩⟨ψ_i| ⊗ I (acting only on register i) and recursively builds W_i = (S_{ψ_i} W_i−1)^t_i S_{ψ_i} (W_i−1 S_{ψ_i})^t_i with base W₀ = S_x. Each W_i is itself a reflection — it conjugates S_{ψ_i} by the iterate (S_{ψ_i}W_i−1)^t_i.

The non-trivial mathematical claim (§3.2) is that the principal angles between the eigenspaces of S_{ψ_i} and W_i−1, which a priori could populate a complex spectrum of dimension 2ⁱ⁻¹, collapse to two values {γ_i, π/2 − γ_i} controlled by a single scalar γ_i satisfying sin(2γ_i) = sin(2θ_i)·sin(2t_i−1γ_i−1). This is the "scalar reduction" the authors highlight: the global recursion reduces to a 1-D angle recursion governed entirely by the local overlap angles θ_i = arcsin|⟨x_i|ψ_i⟩| and the per-stage iteration counts t_i. Applying (S_{ψ_m}W_m−1)^t_m to |ψ⟩ concentrates amplitude in register m onto |x_m⟩ with the closed-form success probability above. Measurement of the remaining registers projects register m to a pure state, after which the algorithm re-runs on a strictly smaller search space until the full target bit-string is recovered.

Section §3.5 specialises to unstructured Grover search. With equal stage size s = n/m the local overlap angle is θ_i = arcsin(2^−s/2) and the per-stage oracle overhead 1/cos(θ_i) shrinks to ≤ exp[log₂(N)/(s·2^s+1)] — negligible once s ≥ log₂log₂ N. The geometric series of subsequent rounds contributes a bounded constant: 3.41× at s=3, 2× at s=4, 1.15× at s≥6. This recovers the asymptotically optimal O(√N) query complexity while every non-oracle operator acts on at most s qubits — a log₂(n)-qubit local reflection rather than an n-qubit global one.

The evaluation (§4) implements the scheme in Qiskit on an 18-qubit single-target search and compares to standard Grover under three multi-controlled-X decomposition scenarios: S1 (no ancillae, strict locality), S2 (idle qubits used as dirty ancillae in a V-chain), and S3 (one clean ancilla). For the 3-stage [6,6,6] configuration with optimal iteration counts (102, 25, 6), measurement returned the target bit-string 96.5% of the time against the theoretical 96.7%; the 2-stage [9,9] configuration at (201, 17) iterations reached 99.5% (theory 99.8%). Of the 3-stage failures, 57% still produced the correct first-stage substring — useful structure when only a prefix is needed (success rises to 97.0% for the first 12 bits, 98.4% for the first 6). On non-oracle depth, S1/S2 deliver 95%/97% reductions for the 3-stage configuration at a 30% oracle overhead, and 81%/96% for the 2-stage at 9% overhead; S3 narrows these to 63%/51% reductions but still favourable. The break-even oracle-to-diffusion depth ratio is 3.1×–3.2× (S1/S2) and 1.7× (S3) for the 3-stage layout; the 2-stage variant tolerates oracles 8.8×–10.6× deeper before parity.

The scaling analysis (§4.2) covers n = 20–50 and confirms both the failure probability and the oracle overhead decay rapidly with n: at n = 50 the 2-stage failure rate drops below 10⁻⁷ with under 0.1% oracle overhead. With oracle depth equal to the diffusion operator at n = 50, the 2-/3-/4-stage configurations reduce total depth by 25%/34%/35%; at 5× oracle depth the configurations converge to ~10% reduction; at 10× to ~5%. As an applied example, AES-128 (oracle ≈10× diffusion) gives a ~5% depth saving.

The discussion (§5) is candid about three constraints. Mid-circuit measurements (m−1 per outer run) are noise sources but scale only with the number of stages, not √N — for a 100-qubit problem partitioned into 10 stages that is 9 measurements against 2⁵⁰ oracle calls. The integer-iteration rounding error can dilute success because γ_m ≫ θ in general; adjustable-phase variants of S_{ψ_i} on the two outermost registers would suffice to deterministically eliminate it. Finally, the tensor decomposition of A is necessary for partial diffusers to exist — the technique applies cleanly to Grover, partial Grover, and any algorithm whose state-preparation factors across the register, but more entangled state preparations require approximate low-rank operator-Schmidt truncations.

Figures

Figure 1. Simulation and depth comparison of the recursive search scheme against standard Grover search on an 18-qubit single-target problem. Top row: measurement outcome distributions for the 3-stage [6,6,6] (left) and 2-stage [9,9] (right) configurations over 1000 shots, showing the proportion of shots returning the correct target bitstring and partial matches. Bottom left: diffusion operator circuit depth (log scale) for both configurations under three decomposition scenarios — S1 (no ancillae), S2 (dirty V-chain using idle qubits), S3 (one clean ancilla). Bottom right: total oracle call counts, with percentages showing the recursive scheme's calls relative to the Grover baseline matched to the same success probability.

Figure 2. Scaling behaviour of the recursive search scheme for k = 2, 3, 4 stages of near-equal size, computed using the closed-form expressions and applied to unstructured search over n = 20 to n = 50 qubits with a single target state. Left: failure probability 1 − p on a logarithmic scale. Right: relative oracle overhead — the fractional increase in oracle calls over standard Grover search matched to the same success probability.

Figure 3. Depth scaling of the recursive search scheme relative to standard Grover search for k = 2, 3, 4 stages across n = 20 to n = 50 qubits. Top row: break-even oracle-to-diffusion depth ratio under the three multi-controlled-X decomposition scenarios (S1, S2, S3). Bottom row: total circuit depth of the recursive scheme relative to Grover, assuming a single oracle operator contributes 1×, 5×, and 10× the depth of a single diffusion operator respectively. The dashed red line marks parity with Grover; values below indicate a depth advantage for the recursive scheme.

Citations to Yuan's papers

No direct citation to any of Y1–Y6 found in bibliography.

Overlap with Y1–Y6

Y4 (Grover + ADMM for cardinality-constrained BO) — strongest overlap. Y4's quantum subroutine performs O(√(C(n,k)/M)) Grover rotations on a fixed-cardinality feasible space. Burke & Mc Goldrick give a drop-in non-oracle redesign of that loop. For Y4's hardware implementation the cardinality-constrained initial state (a Dicke state on k of n qubits) does not trivially factorise across tensor partitions, so the construction does not apply verbatim — but the approximate low-rank operator-Schmidt truncation route the authors flag (§5, "Limitations") and per-block Dicke decompositions are the natural bridge.
Y1 (warm-started QAOA). Both share the philosophy of preserving the quadratic-speedup geometry while reshaping the operators applied between oracle/cost calls. Y1 warms the initial QAOA state from a measurement; this paper restructures the reflections around the initial state. Conceptually adjacent, methodologically distinct.
Y2, Y3 (QAOA portfolio). Indirect — QAOA is not amplitude amplification, but the depth-reduction message ("global entanglement on every iteration is unnecessary") echoes Y3's broader theme that NISQ-era resource counts are what gate quantum advantage.
Y5 (Gibbs-state GW relaxations). Both papers chase quantum advantage via algorithmic restructuring rather than hardware scaling. No direct method link.
Y6 (PBR on Heron2). No overlap.

Recommended action for Yuan

Read deeper, with Y4 in mind. The break-even and crossover plots (Figs 2–3) are directly applicable to Y4's cost model — even a 5–10% total depth reduction at n=50 is material for the resource estimates Y4 reports. Worth pulling the Qiskit reference implementation (cited as burke2026code) and benchmarking against the cardinality-constrained Grover oracle in Y4.
Adapt the construction. Cardinality-constrained search has a Dicke-state preparation that does not factorise across partitions — but block-wise Dicke states or operator-Schmidt-truncated approximations would let Y4 inherit the local-diffuser advantage. This is a concrete follow-up paper.
Cite, when the right Y4-extension or Y1 revision lands. If/when Y4 is updated for journal submission with NISQ-resource estimates, this paper is a natural citation for "alternatives to global diffusion in our amplitude-amplification subroutine".