IsoHC: How I Solved the Stability Crisis in Hyper-Connections

Misaya Yang
Scholar
February 6, 2026


Introduction

When I first encountered Hyper-Connections (HC)—a powerful technique that generalizes residual networks by expanding the residual stream into multiple parallel pathways—I was immediately excited by its potential. HC allows layers to learn dynamic cross-layer connectivity, dramatically improving expressivity. But as I scaled up my experiments, I hit a wall: training became unstable at depth.

The problem? Unconstrained cross-stream mixing in HC destroys the identity-mapping invariants that make residual networks trainable in the first place. The recently proposed mHC (Manifold-Constrained HC) addressed this by projecting residual mixing matrices onto the Birkhoff polytope (doubly stochastic matrices), restoring mean-preservation. But I realized there was a deeper geometric structure we could exploit.

In this post, I'll share how I developed IsoHC—a new manifold constraint that preserves both (i) global stream mean and (ii) ℓ₂ energy along residuals. My key insight: we can constrain residual mixing to a special manifold of mean-preserving orthogonal matrices, achieving exact isometry while maintaining differentiability through efficient polar retractions.

Background: The Promise and Peril of Hyper-Connections

What Are Hyper-Connections?

Standard residual connections pass information through a single highway:

$$X_{l+1} = X_l + F(X_l, W_l)$$

Hyper-Connections generalize this by expanding the residual stream to $n$ parallel streams. An HC block looks like:

$$X_{l+1} = H_l^{\text{res}} X_l + (H_l^{\text{post}})^T F(H_l^{\text{pre}} X_l, W_l)$$

where:

  • $X_l \in \mathbb{R}^{n \times d}$ is the $n$-stream representation at layer $l$
  • $H_l^{\text{pre}}, H_l^{\text{post}} \in \mathbb{R}^{n \times n}$ aggregate and distribute information across streams
  • $H_l^{\text{res}} \in \mathbb{R}^{n \times n}$ mixes the residual streams

This diversifying connectivity improves expressivity—different streams can specialize in different features. But there's a catch: we lose the identity mapping guarantee that makes deep residual networks trainable.
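To make the block concrete, here is a minimal numpy sketch of an HC forward pass. The helper name `hc_block` and the toy choices of `H_pre`, `H_post`, and `f` are mine for illustration, not the HC paper's parameterization:

```python
import numpy as np

def hc_block(X, H_res, H_pre, H_post, f):
    """One Hyper-Connections block: X_{l+1} = H_res X + H_post^T f(H_pre X).

    X: (n, d) n-stream representation; H_*: (n, n) mixing matrices;
    f: the layer function (e.g., an MLP) mapping (n, d) -> (n, d).
    """
    return H_res @ X + H_post.T @ f(H_pre @ X)

# Toy example: n = 4 streams, d = 3 channels, f = tanh.
rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.standard_normal((n, d))
H_res = np.eye(n)                 # identity residual mixing: plain residual behavior
H_pre = np.full((n, n), 1.0 / n)  # aggregate: average the streams
H_post = np.full((n, n), 1.0 / n) # distribute: broadcast the update back
X_next = hc_block(X, H_res, H_pre, H_post, np.tanh)
print(X_next.shape)  # (4, 3)
```

With `H_res = I` and `f` returning zeros, the block reduces to the identity map, which is exactly the invariant the rest of the post is about preserving.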

The Stability Crisis

In standard ResNets, the identity path $X_{l+1} = X_l + \cdots$ ensures that gradients can flow backward without vanishing. With HC, if $H_l^{\text{res}}$ is unconstrained, the residual path can become:

$$X_L = \left( \prod_{l=0}^{L-1} H_l^{\text{res}} \right) X_0$$

If the product of $H_l^{\text{res}}$ matrices has eigenvalues that shrink or explode, training collapses. I observed this firsthand: networks with $L > 20$ layers would either fail to converge or suffer from gradient explosion.

mHC: The First Fix

The mHC paper proposed a clever solution: project $H_l^{\text{res}}$ onto the Birkhoff polytope—the set of doubly stochastic matrices:

$$\mathcal{B} = \left\{ H \in \mathbb{R}^{n \times n} : H \mathbf{1} = \mathbf{1},\ \mathbf{1}^T H = \mathbf{1}^T,\ H \geq 0 \right\}$$

where $\mathbf{1} = (1, 1, \ldots, 1)^T$. This ensures:

  1. Row-stochastic: Each row sums to 1 (convex combination of inputs)
  2. Column-stochastic: Each column sums to 1 (preserves global mean)
  3. Nonnegative: No cancellation between streams

The projection is done via the Sinkhorn-Knopp algorithm, which iteratively normalizes rows and columns until convergence.
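The iteration can be sketched in a few lines of numpy. Mapping entries through `exp` to guarantee positivity is my assumption here; mHC may parameterize positivity differently:

```python
import numpy as np

def sinkhorn_knopp(H_tilde, n_iters=20, eps=1e-8):
    """Project a matrix toward the Birkhoff polytope by alternately
    normalizing rows and columns of a strictly positive matrix."""
    H = np.exp(H_tilde)  # one way to ensure positivity (assumption)
    for _ in range(n_iters):
        H = H / (H.sum(axis=1, keepdims=True) + eps)  # row-normalize
        H = H / (H.sum(axis=0, keepdims=True) + eps)  # column-normalize
    return H

rng = np.random.default_rng(0)
H = sinkhorn_knopp(rng.standard_normal((4, 4)))
# Row and column sums are both ~1 after convergence.
print(np.abs(H.sum(axis=1) - 1).max(), np.abs(H.sum(axis=0) - 1).max())
```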

Why mHC Helps

Doubly stochastic matrices preserve the global stream mean:

$$\text{mean}(HX) := \frac{1}{n} \mathbf{1}^T HX = \frac{1}{n} \mathbf{1}^T X = \text{mean}(X)$$

This restores a key residual invariant. Moreover, $\|H\|_2 \leq 1$ for any doubly stochastic $H$, so gradients don't explode.

But There's a Problem...

While mHC stabilizes training, it has a subtle weakness: doubly stochastic mixing can be strictly contractive on the mean-zero subspace. Specifically, if $X \in \mathbf{1}^\perp$ (i.e., $\mathbf{1}^T X = 0$), then:

$$\|HX\|_2 \leq \|X\|_2$$

with equality only if $H$ is a permutation matrix. For general $H \in \mathcal{B}$, the inequality can be strict, meaning energy leaks out of the residual path over many layers.
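The extreme case makes the leak obvious: uniform averaging $H = \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is doubly stochastic yet annihilates every mean-zero signal. A quick numpy check:

```python
import numpy as np

n, d = 4, 3
H = np.full((n, n), 1.0 / n)  # uniform averaging: doubly stochastic, maximally diffusive
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
X = X - X.mean(axis=0, keepdims=True)  # project onto the mean-zero subspace 1^perp

print(np.linalg.norm(H @ X))  # ~0: averaging wipes out mean-zero signal entirely
print(np.linalg.norm(X))      # strictly larger
```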

I wanted to fix this: can we preserve both mean and energy exactly?

My Solution: The IsoHC Manifold

The Key Insight

I realized that what we really want is a matrix $H$ that:

  1. Preserves the global mean: $H \mathbf{1} = \mathbf{1}$

  2. Preserves ℓ₂ energy: $\|HX\|_2 = \|X\|_2$ for all $X$

The second condition means $H$ must be orthogonal: $H^T H = I$. Combining these, I arrived at the IsoHC residual manifold:

$$\mathcal{M}_{\text{iso}} := \left\{ H \in \mathbb{R}^{n \times n} : H^T H = I,\ H \mathbf{1} = \mathbf{1} \right\}$$

This is the stabilizer subgroup of $\mathbf{1}$ within the orthogonal group, isomorphic to $O(n-1)$. Geometrically, these are rotations and reflections that fix the $\mathbf{1}$ direction.

Theorem 1: Exact Invariants

I proved that $\mathcal{M}_{\text{iso}}$ gives us exactly what we want:

Theorem 1 (Residual mean and energy invariants): Let $H \in \mathcal{M}_{\text{iso}}$ and $X \in \mathbb{R}^{n \times d}$. Then:

(i) $\text{mean}(HX) = \text{mean}(X)$

(ii) For each channel $c \in \{1, \ldots, d\}$: $\|(HX)_{:,c}\|_2 = \|X_{:,c}\|_2$

Proof sketch:

  • (i) follows from $H \mathbf{1} = \mathbf{1}$ together with orthogonality: $H^T \mathbf{1} = H^{-1} \mathbf{1} = \mathbf{1}$, so $\mathbf{1}^T H X = \mathbf{1}^T X$

  • (ii) follows from orthogonality: for each column $x = X_{:,c}$, $\|Hx\|_2^2 = x^T H^T H x = x^T x = \|x\|_2^2$

This means no energy is lost or gained as signals propagate through the residual path—a perfect isometry.
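Theorem 1 is easy to verify numerically. The snippet below builds an element of $\mathcal{M}_{\text{iso}}$ directly as $H = e_0 e_0^T + U R U^T$ with a random orthogonal $R$ (a sketch of the construction using numpy's QR), then checks both invariants:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3

# Orthonormal basis: e0 spans the 1-direction, U spans 1^perp.
e0 = np.ones(n) / np.sqrt(n)
Q, _ = np.linalg.qr(np.column_stack([e0, rng.standard_normal((n, n - 1))]))
U = Q[:, 1:]                                        # (n, n-1) basis of 1^perp
R, _ = np.linalg.qr(rng.standard_normal((n - 1, n - 1)))  # random element of O(n-1)
H = np.outer(e0, e0) + U @ R @ U.T                  # element of M_iso

X = rng.standard_normal((n, d))
print(np.allclose(H.T @ H, np.eye(n)))        # orthogonal
print(np.allclose(H @ np.ones(n), np.ones(n)))  # fixes the 1 direction
print(np.allclose((H @ X).mean(axis=0), X.mean(axis=0)))  # (i) mean preserved
print(np.allclose(np.linalg.norm(H @ X, axis=0),
                  np.linalg.norm(X, axis=0)))     # (ii) per-channel energy preserved
```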

The Challenge: Efficient Projection

The hard part was figuring out how to project an arbitrary matrix $\tilde{H}$ onto $\mathcal{M}_{\text{iso}}$ efficiently and differentiably. Direct optimization on this manifold is tricky because the constraint set is non-convex.

My solution: subspace polar retraction.

Algorithm 1: Iso-NS (Isometric Newton-Schulz)

The key idea is to separate the $\mathbf{1}$ direction from its orthogonal complement $\mathbf{1}^\perp$, project the orthogonal part onto $O(n-1)$, then reconstruct.

Here's the algorithm:

Input: $\tilde{H} \in \mathbb{R}^{n \times n}$ (unconstrained matrix)

Output: $H \in \mathcal{M}_{\text{iso}}$

  1. Define the mean direction: $e_0 = \mathbf{1} / \sqrt{n}$

  2. **Choose an orthonormal basis for $\mathbf{1}^\perp$**: let $U \in \mathbb{R}^{n \times (n-1)}$ be such that $U^T U = I$ and $U^T e_0 = 0$

  3. **Decompose $\tilde{H}$**: write $\tilde{H} = e_0 e_0^T + U A U^T + \text{(off-diagonal terms)}$

  4. Extract the $\mathbf{1}^\perp$ block: $A = U^T \tilde{H} U \in \mathbb{R}^{(n-1) \times (n-1)}$

  5. Project $A$ onto $O(n-1)$ via Newton-Schulz:

  • Initialize: $X_0 = A / \gamma$, where $\gamma$ keeps $\sigma_{\max}(X_0)$ inside the convergence radius
  • Iterate for $K$ steps: $X_{k+1} = \frac{3}{2} X_k - \frac{1}{2} X_k X_k^T X_k$
  • Result: $R = X_K$, the polar factor of $A$, i.e., $R \in O(n-1)$

  6. **Reconstruct $H$**: $H \leftarrow e_0 e_0^T + U R U^T$
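A minimal numpy sketch of Algorithm 1 follows. The fixed choice of basis $U$ and the demo input are mine; in a real implementation, gradients would flow through these operations via autograd:

```python
import numpy as np

def iso_ns(H_tilde, K=5):
    """Project H_tilde onto M_iso via subspace polar retraction (Algorithm 1)."""
    n = H_tilde.shape[0]
    e0 = np.ones(n) / np.sqrt(n)                    # step 1: mean direction
    M = np.column_stack([e0, np.eye(n)[:, : n - 1]])
    Q, _ = np.linalg.qr(M)                          # step 2: orthonormal basis
    U = Q[:, 1:]                                    # columns span 1^perp
    A = U.T @ H_tilde @ U                           # step 4: extract 1^perp block
    X = A / (1.2 * np.linalg.norm(A, 2))            # step 5: scale into convergence radius
    for _ in range(K):
        X = 1.5 * X - 0.5 * X @ X.T @ X             # Newton-Schulz iteration
    return np.outer(e0, e0) + U @ X @ U.T           # step 6: reconstruct

# Demo on a near-identity matrix (e.g., identity-initialized mixing);
# badly conditioned inputs may need more than K = 5 iterations.
rng = np.random.default_rng(0)
H = iso_ns(np.eye(4) + 0.05 * rng.standard_normal((4, 4)))
print(np.abs(H.T @ H - np.eye(4)).max())  # small: H is (approximately) orthogonal
print(np.abs(H @ np.ones(4) - 1).max())   # ~machine precision: H fixes 1
```

Note that the mean-preservation constraint $H\mathbf{1} = \mathbf{1}$ holds exactly by construction, regardless of how far Newton-Schulz has converged; only the orthogonality of the $\mathbf{1}^\perp$ block is approximate.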

Why Newton-Schulz?

The Newton-Schulz iteration is a matrix-free method for computing the polar factor (the closest orthogonal matrix to AA ). It converges quadratically under standard spectral conditions, and crucially, it's fully differentiable—perfect for backpropagation.

In practice, I use $K = 3$–$5$ iterations, which is enough for high accuracy.

Computational Cost

The dominant cost is the $(n-1) \times (n-1)$ matrix multiplications in Newton-Schulz. For typical $n \in \{4, 8\}$, this is negligible compared to the transformer block itself. I measured ~1–2% overhead relative to standard HC.

Theoretical Deep Dive: Gradient Preservation

One of the most exciting properties of IsoHC is exact gradient norm preservation along the residual path.

Residual-Path Linearization

Consider the residual-only linearization (ignoring the nonlinear function $F$):

$$X_{l+1} = H_l^{\text{res}} X_l$$

Let $G_l = \frac{\partial \mathcal{L}}{\partial X_l} \in \mathbb{R}^{n \times d}$ be the gradient at layer $l$. By the chain rule:

$$G_l = (H_l^{\text{res}})^T G_{l+1}$$

Since $H_l^{\text{res}}$ is orthogonal, we have:

$$\|G_l\|_F = \|G_{l+1}\|_F$$

This means gradient norms are exactly preserved through deep products:

$$\|G_0\|_F = \|G_L\|_F$$

No vanishing, no explosion—perfect stability.
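This backward recursion is easy to simulate: draw a random $H_l^{\text{res}} \in \mathcal{M}_{\text{iso}}$ per layer, apply $G_l = (H_l^{\text{res}})^T G_{l+1}$ for 50 layers, and watch the norm stay flat. A numpy sketch (the helper `random_iso` is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, L = 4, 3, 50

def random_iso(n, rng):
    """A random element of M_iso: fix the 1-direction, rotate 1^perp."""
    e0 = np.ones(n) / np.sqrt(n)
    Q, _ = np.linalg.qr(np.column_stack([e0, rng.standard_normal((n, n - 1))]))
    U = Q[:, 1:]
    R, _ = np.linalg.qr(rng.standard_normal((n - 1, n - 1)))
    return np.outer(e0, e0) + U @ R @ U.T

G = rng.standard_normal((n, d))  # gradient arriving at the top layer
norms = [np.linalg.norm(G)]
for _ in range(L):               # backward pass through the residual path
    H = random_iso(n, rng)
    G = H.T @ G                  # G_l = (H_l^res)^T G_{l+1}
    norms.append(np.linalg.norm(G))

print(max(norms) - min(norms))   # ~0: norms are constant across all 50 layers
```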

Comparison with mHC

In contrast, doubly stochastic mixing guarantees non-expansiveness ($\|H\|_2 \leq 1$) but can be strictly contractive on $\mathbf{1}^\perp$. Over many layers, this can lead to gradual gradient decay, especially if the signal is mostly in the mean-zero subspace.

IsoHC avoids this by enforcing exact isometry.

The Nonnegativity Trade-Off

You might wonder: can we have both orthogonality and nonnegativity?

Unfortunately, no. I proved this with a simple proposition:

Proposition 1: Nonnegative orthogonal row-stochastic matrices are permutations.

Proof: Let $H \in \mathbb{R}^{n \times n}$ satisfy $H \geq 0$, $H^T H = I$, and $H \mathbf{1} = \mathbf{1}$. Since $H \geq 0$ and $H \mathbf{1} = \mathbf{1}$, each row $r$ satisfies $\sum_i r_i = 1$. Since $H$ is square with $H^T H = I$, it is orthogonal, so $H H^T = I$ as well, and each row has $\|r\|_2 = 1$. But for any nonnegative vector with $\|r\|_1 = 1$:

$$\|r\|_2^2 = \sum_i r_i^2 \leq \left( \sum_i r_i \right)^2 = 1$$

with equality if and only if exactly one entry is 1 and the rest are 0. Thus, $H$ must be a permutation matrix. $\square$

Design Trade-Off

This reveals a fundamental trade-off:

  • Birkhoff constraints (mHC): Enable continuous convex mixing (diffusive behavior), but allow energy contraction
  • Orthogonal constraints (IsoHC): Enable continuous isometries (rotations/reflections), but require signed coefficients

I view this as a feature, not a bug. Signed mixing allows richer representational dynamics, and the isometry constraint prevents runaway cancellation.

Hybrid Option

For users who want the best of both worlds, I also explored a diffusion-isometry interpolation:

$$H_l^{\text{res}} = (1 - \lambda_l) H_l^{\text{iso}} + \lambda_l H_l^{\text{birk}}$$

where $H_l^{\text{iso}} \in \mathcal{M}_{\text{iso}}$ (via Algorithm 1), $H_l^{\text{birk}}$ is Sinkhorn-projected doubly stochastic, and $\lambda_l \in [0, 1]$ is learnable or scheduled. This lets the model adaptively blend isometric and diffusive mixing.

Experiments

I evaluated IsoHC on deep transformer training with $n \in \{4, 8\}$ streams, comparing against:

  • Standard HC (unconstrained)
  • mHC (Birkhoff polytope)
  • IsoHC (my method)

Key Findings

  1. Stability at depth: IsoHC trains stably up to $L = 50$ layers, while unconstrained HC collapses beyond $L = 20$

  2. Gradient norm preservation: Measured gradient norms $\|G_l\|_F$ across layers. IsoHC maintains near-constant norms; mHC shows gradual decay

  3. Collapse diagnostics: Tracked stream similarity (cosine similarity across streams). IsoHC maintains diverse streams; unconstrained HC suffers from stream collapse

  4. Overhead: Newton-Schulz projection adds ~1.5% training time, compared to ~2% for Sinkhorn

  5. Performance: On language modeling benchmarks, IsoHC achieves slightly better perplexity than mHC at depth $L > 30$, likely due to better gradient flow

Limitations and Open Questions

Cancellation Risk

IsoHC allows signed residual mixing, so cancellation could still occur—energy is preserved in ℓ₂ norm, but individual components can cancel. I haven't observed this in practice, but it's worth monitoring.

Convergence Sensitivity

Newton-Schulz iterations require careful scaling (choosing $\gamma$) for convergence. I use a simple heuristic ($\gamma = 1.2 \cdot \sigma_{\max}(A)$), but more adaptive schemes could improve robustness.

Reduced Degrees of Freedom

The manifold $\mathcal{M}_{\text{iso}}$ has dimension $\frac{(n-1)(n-2)}{2}$, compared to $(n-1)^2$ for the Birkhoff polytope. This might limit expressivity in some settings.

Future Directions

  • Sign-stable parameterizations: Combine IsoHC with regularizers that control elementwise extremes without collapsing to permutations
  • Multi-modal architectures: How does IsoHC behave in vision transformers or multimodal models?
  • Theoretical optimality: Can we prove that $\mathcal{M}_{\text{iso}}$ is optimal for gradient flow in some sense?

Conclusion

IsoHC demonstrates that we can have our cake and eat it too: exact mean preservation and exact energy preservation in hyper-connected residual networks. By constraining residual mixing to a carefully chosen manifold and projecting via efficient polar retractions, we achieve stability at scale without sacrificing expressivity.

The key lesson: geometry matters. By respecting the manifold structure of mean-preserving isometries, we unlock training dynamics that are both stable and expressive.


Code: Full implementation will be released upon publication.

Acknowledgments: I thank the mHC authors for inspiring this work, and the reviewers for their thoughtful feedback.

References

  1. Zhenda Xie et al. mHC: Manifold-Constrained Hyper-Connections. arXiv:2512.24880, 2025.
  2. Hyper-Connections authors. Hyper-Connections. arXiv:2409.19606, 2024.
  3. R. Sinkhorn and P. Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 1967.
  4. E. Grishina, M. Smirnov, and M. Rakhuba. Accelerating Newton-Schulz Iteration for Orthogonalization via Chebyshev-type Polynomials. arXiv:2506.10935, 2025.
  5. D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2009.

Misaya Yang

Researcher focusing on Deep Learning, Transformers, Large Language Models, and Position Encoding.