The Question

How is it plausible that a generalist attorney could produce a provisional patent application in summer 2025 that already contains the first embodiment of the method — PCA performed on an analogical pair analysis — and then, in 36 days beginning January 23, 2026, arrive at a measurement program with a filed CIP, three substrates, a conservation law, and an active training run with the lattice geometry in the loss function? The short answer is that the confounds made it possible. The longer answer is that the confounds are not the obstacle to the story — they are the story.

The standard expectation runs in the other direction: a machine learning researcher with domain expertise, institutional resources, and prior commitments to the literature produces a program like this over years. The timeline here is measured in weeks. The investigator is an attorney, a cavalry officer, a classicist. The confusion is visible in the record, and that is precisely the point.

What the record shows is not an investigator who knew what to look for and found it. It shows an investigator who did not know what to expect, ran into things that should not have been there, and was forced by each collision to understand the geometry more precisely. The instrument (springer_bridge_v2.py) that produced a filing-grade result with 1,000-permutation nulls, stratified 10-fold cross-validation, and a nearest-centroid cosine classifier was not designed from first principles. It was the last thing standing after everything else was killed. The confounds killed everything else.
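The classifier type named above is simple enough to sketch generically. The code below is a minimal, illustrative nearest-centroid cosine classifier; it is not the actual springer_bridge_v2.py, and all vectors, labels, and names are toy assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def fit_centroids(X, y):
    """One mean vector per class label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
            counts[label] = 0
        sums[label] = [s + a for s, a in zip(sums[label], x)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]]
            for label in sums}

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest in cosine terms."""
    return max(centroids, key=lambda label: cosine(x, centroids[label]))
```

In the program described here, this classifier would sit inside stratified 10-fold cross-validation with a 1,000-permutation null; the sketch shows only the distance rule itself.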

What the Attorney Brought

Three things, none of them standard ML equipment.

First, a geometric intuition unencumbered by prior commitments. The precipice observation — that transformer embeddings of analogically structured text showed a validity signal approximately 85° from PC1, nearly orthogonal to the dominant variance direction — required no knowledge of transformers to notice. It required the ability to look at a PCA plot and ask what the dominant direction was not doing. An ML researcher fluent in the literature would have known that the first principal component carries distributional information, and might have looked elsewhere. The attorney looked at the brightest signal in the room and asked whether it was wrong.
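The precipice observation reduces to a single computation: the angle between a candidate signal direction and PC1. A minimal sketch, with a made-up signal axis standing in for the real one:

```python
import math

def angle_deg(u, v):
    """Angle in degrees between two direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    # Clamp for floating-point safety before acos.
    c = max(-1.0, min(1.0, dot / (nu * nv)))
    return math.degrees(math.acos(c))

# Toy illustration: a hypothetical validity axis nearly orthogonal
# to the dominant variance direction.
pc1 = [1.0, 0.0, 0.0]
signal = [0.09, 1.0, 0.0]
theta = angle_deg(pc1, signal)  # close to, but not exactly, 90 degrees
```

An angle near 90° means the signal lives almost entirely outside the dominant variance direction, which is exactly what the ~85° observation claims.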

Second, a framework for adversarial reasoning. The gate protocol, the null model discipline, the contamination fence — these emerged naturally from legal training, where affirmative defenses exist not to defeat your argument but to sharpen it. In law, you do not report the first positive result. You exhaust the alternative explanations first. The 1,000-permutation null is a legal standard applied to a measurement instrument: the burden of proof is on the signal, not the noise.
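The permutation-null discipline has a simple generic form: shuffle the labels many times and count how often the shuffled world matches or beats the observed statistic. The sketch below is illustrative; the statistic, labels, and helper names are toy stand-ins, not the program's actual metric.

```python
import random

def perm_null_p(scores, labels, stat, n_perm=1000, seed=0):
    """Empirical one-sided p-value: fraction of label shuffles whose
    statistic matches or exceeds the observed one."""
    rng = random.Random(seed)
    observed = stat(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(scores, shuffled) >= observed:
            hits += 1
    # Add-one smoothing so p is never reported as exactly zero.
    return (hits + 1) / (n_perm + 1)

def mean_diff(scores, labels):
    """Toy statistic: difference of group means, 'pos' minus 'neg'."""
    pos = [s for s, l in zip(scores, labels) if l == "pos"]
    neg = [s for s, l in zip(scores, labels) if l == "neg"]
    return sum(pos) / len(pos) - sum(neg) / len(neg)
```

The burden-of-proof framing is built in: the signal only earns a small p-value by beating a thousand shuffled worlds.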

Third, a lattice. The I Ching hexagram system — 64 states, 6 bits, a complete {0,1}⁶ hypercube — is not a standard ML calibration object. But it is a complete, well-defined algebraic structure with known product geometry, known coupling patterns, and a 2,500-year interpretive tradition centered precisely on the question of how structural categories relate to each other. Bringing it into contact with transformer embeddings was not a category error. It was a cross-domain transfer that produced a reference frame the field did not have.
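The lattice itself is easy to construct, which is part of its value as a calibration object. A sketch of the complete {0,1}⁶ hypercube and its product geometry (pure illustration, not the program's code):

```python
from itertools import product

# The 64 hexagram states as the complete {0,1}^6 hypercube.
states = list(product((0, 1), repeat=6))

def hamming(a, b):
    """Number of lines (bits) on which two hexagrams differ."""
    return sum(x != y for x, y in zip(a, b))

# Product geometry: every state has exactly 6 nearest neighbors,
# one per flipped line.
neighbors = {s: [t for t in states if hamming(s, t) == 1]
             for s in states}
```

Completeness is the point: every one of the 64 cells exists by construction, so any structure an encoder assigns to them can be checked against a fully known reference frame.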

The First Embodiment

The summer 2025 PPA captured this: perform PCA on embeddings of analogically paired descriptions of hexagram states, measure where the analogical signal lands relative to PC1. The measurement was noisy — template-generated descriptions leaked the answer through shared vocabulary — but the geometric idea was sound. Two things that share structural position should be closer to each other in penumbral space than in PC1 space. That is the first claim. The subsequent 36 days were spent finding out whether it was true, and under what conditions, and with what instrument.

The 36 Days

The timeline is not linear. It has the shape of a forced march with periodic ambushes.

Jan 28 – Feb 3
Phase 0 — The Precipice

The geometric intuition is formalized. Data is template-generated and unreliable, but the 85° observation is documented. The measurement apparatus does not yet exist.

Feb 14
PPA Filed — Defensive, Under Pressure

Manus agents autonomously fork the analysis code, creating a disclosure risk. The provisional is filed the same day. It captures the CSI framework and the hexagram lattice as calibration probe. The underlying measurement system has not yet been validated.

Feb 15 — WP-001 through WP-006
Six Experiments. All Degenerate or Contaminated.

Systematic validation begins immediately after filing. All original metrics fail. The Chinese-language signal is an artifact. Twenty-three prior experiments are invalidated by dataset contamination. The precipice intuition survives; every number has to be re-derived.

Feb 15 — WP-007
The Breakthrough

Clean data, proper null controls: 72/72 cells clear the null. PCs 2–50 carry structural signal. PC1 is null. The penumbra exists. This is the experiment that justifies everything before and after it.

Feb 16 – 17
DRAGNET — Wild-Caught Sentences

Template data is abandoned. The morphism classifier is built: spaCy dependency parsing, VerbNet, 4.4M sentences from the Pile, 64/64 hexagram codes populated. The conformed instrument takes shape.

Feb 18 – 19
Architecture Independence & Ising Structure

RWKV (no attention): 2/3 penumbra. GPT-2 (causal): 2/3 penumbra. The finding is not attention-specific. The Pile mass table decomposes via Ising multipole: R² = 0.996. The dominant coupling is Agency × Coupling, J = +1.940. The 3.3% octupole residual is irreducible.
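The Ising multipole decomposition quoted above can be illustrated with one standard construction: a Walsh (parity) expansion, which is exact on {0,1}⁶ and separates monopole, dipole (field), pairwise-coupling (J), and octupole-and-higher terms. This is a generic sketch under that assumption; the program's actual fitting procedure may differ in detail.

```python
from itertools import product, combinations

STATES = list(product((0, 1), repeat=6))  # the 64 hexagram codes

def spins(bits):
    """Map {0,1} bits to Ising spins {-1,+1}."""
    return [2 * b - 1 for b in bits]

def walsh_coeffs(f):
    """Exact multipole expansion of f on {0,1}^6: one coefficient per
    subset of the 6 bits (order 0 = monopole, order 1 = dipole fields,
    order 2 = pairwise couplings J, order 3+ = octupole and higher)."""
    coeffs = {}
    for order in range(7):
        for subset in combinations(range(6), order):
            c = 0.0
            for s in STATES:
                sigma = spins(s)
                parity = 1
                for i in subset:
                    parity *= sigma[i]
                c += f[s] * parity
            coeffs[subset] = c / 64.0
    return coeffs
```

The squared coefficients at order ≥ 3, taken as a fraction of the total non-monopole power, play the role of the irreducible octupole residual described in the text.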

Feb 18 – 24
Substrate Independence

Baroque music (Bach WTC): 5/8 bits penumbra, block ratio 5.2×. Voynich manuscript (no neural encoder, co-occurrence only): Ising R² = 0.977, block ratio 4.9×. Three substrates, three block partitions, one pattern.

Feb 24
CIP Filed — 249 Paragraphs, 28 Claims

U.S. 63/983,234. The filing captures the full measurement framework, the substrate independence findings, and the conservation law hypothesis. Phase 2 opens.

Feb 24 – 26
The Factorization

WP-023 (Springer Bridge): SFI base vs. aligned, Δ = 0.0003. WP-024: SFI_base flat across 18 LoRA checkpoints. WP-027: Wilson loop holonomy flat across all checkpoints. Three independent confirmations of the same result: G ≈ Σ_align ⊕ Π_struct. The structural subspace is invariant under fine-tuning.

Feb 28
WP-032 — The Regularizer

First training run with lattice geometry in the loss function. Mistral-7B, LoRA, 1,000 steps. L = L_task + 0.1 × Σ(W_current − W_baseline)². Active on Lambda as of this writing.
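The loss has a simple shape that can be rendered in a few lines. This is a toy scalar version for illustration only; the real run applies the penalty to Wilson loop observables inside a Mistral-7B LoRA training loop, which is not reproduced here.

```python
def wilson_regularized_loss(l_task, w_current, w_baseline, lam=0.1):
    """Toy version of the combined loss: task loss plus a quadratic
    penalty tying the current Wilson-loop observables to their
    pre-training baseline. lam = 0.1 matches the weight quoted in
    the text; everything else here is illustrative."""
    penalty = sum((wc - wb) ** 2 for wc, wb in zip(w_current, w_baseline))
    return l_task + lam * penalty
```

When the observables sit exactly on their baseline, the penalty vanishes and the loss reduces to the task loss, which is what makes this a conservation constraint rather than a new objective.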

The Confounds

What follows is not a list of mistakes. It is a list of the productive collisions that forced the program to become more precise. Each confound could have ended the investigation. Each one instead forced a deeper understanding of the measurement problem. This is not rationalization after the fact. The forcing function is visible in the record: the date of the confound, the date of the resulting methodological advance, and the specific constraint the confound imposed.

WP-001–005 — All Original Metrics Degenerate

Every metric from the summer 2025 framework — signed current, W_NC, W_opp — failed on systematic validation. The Chinese-language signal, which had looked like cross-linguistic evidence, was a UNK token artifact. The artifact hunt required understanding the relationship between vocabulary coverage and apparent signal.

What it forced: the null model as the primary instrument, not a supplement to the measurement. The French Governor test (run a French version of the Chinese probe to isolate whether the signal is linguistic or artifactual) became a permanent part of the methodology.

WP-006 — The Contamination Fence

Twenty-three experiments (R11–R33) ran on the wrong dataset: D2, 128 rows, 2 semantic domains, instead of D4, 320 rows, 5 domains. The contamination was discovered during a provenance audit and all results were immediately withdrawn. The audit was initiated because one result looked too clean.

What it forced: the hardened dataset provenance protocol. Every experiment now embeds the exact script path, output file path, and command line. No result without a reproducible trail. The protocol was triggered by a confound and is now standing order.

Phase 0–1 — Vocabulary Ghosts

The initial PPA measurements used template-generated text descriptions of hexagram states. The templates shared vocabulary across states in ways that partially predicted the structural label independent of the encoder geometry. The measurement was not what it appeared to be.

What it forced: DRAGNET — the shift to wild-caught sentences from the Pile corpus. 4.4 million sentences, classified by spaCy dependency parsing and VerbNet, no templates. The conformed instrument only works on DRAGNET. The vocabulary ghost problem is what made the wild-caught approach necessary, and the wild-caught approach is what made the finding publishable.

WP-019 — Eigenspectrum All Below Null

The DRAGNET eigenspectrum probe — testing whether learned W_QK matrices have more complex eigenvalue pairs than random orthogonal matrices — returned a clear result in the wrong direction. BERT=25.1, ALBERT=24.4, RoBERTa=24.5 vs. null=31.6±0.5. z-scores −7.8 to −17.2. Attention matrices have fewer complex eigenvalue pairs than chance.

What it forced: attention is not where structure lives. The penumbra is not an attention phenomenon. This killed the SU(3) probe via W_QK eigenvalues, which was the wrong probe. It also confirmed that the correct probe (PCA of hidden states, not attention weights) is genuinely distinct from what the field had been measuring.

Orca Report Fabricated

A report on orca bioacoustic structure was confabulated by an AI writing agent: numbers that looked right, a coherent narrative, and no underlying measurement. Discovered during a cross-reference audit. A mandatory Lambda reproduction run was initiated.

What it forced: the reproduction protocol. 21 cells re-run on Lambda GH200. 14/18 verdict agreements. All 4 disagreements within 1.5σ of the decision threshold. Provenance hashes matched. The underlying signal the fabricated report claimed was real — it just had never been measured. The reproduction protocol, built to catch a fabrication, confirmed the actual finding. The confound produced a cleaner result than the original.

WP-030b — Bivector Null at All Layers

The Clifford algebra decomposition suggested the wedge product (vw − wv)/2 would carry non-abelian information in causal sentence pairs. It doesn't. The bivector is null at all layers. The wedge product probe failed at every depth, for every pair configuration tested.

What it forced: a cleaner formulation. The causal signal is a 1-form (the difference vector between forward and reversed sentence embeddings), not a 2-form. The spinor motivation was correct — the structure IS antisymmetric under causal reversal — but the measurable projection onto a static embedding manifold is simpler than a full spinor. Killing the bivector forced this distinction, which is now the basis for the WP-033 causal axis loss term.

WP-032 Restart — Centering Inconsistency

The first Wilson loop regularizer run computed W_baseline on centered embeddings and W_current on uncentered embeddings. The MSE loss was comparing incommensurable quantities. The run was discarded mid-step after several hours of GPU time.

What it forced: the mean_vec fix — compute the centering vector once at init, subtract it consistently in both the loss computation and the monitoring pass. The bug required understanding the geometry precisely enough to notice that the centering was happening in one branch but not the other. The corrected run (currently active) has a fixed PCA basis, a fixed mean vector, and a fixed Wilson loop baseline. The centering bug was caught because the number looked wrong. That is the protocol working.
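The fix has a generic form worth stating: compute the centering vector once at initialization, store it, and route every later pass through the same subtraction. A toy sketch, in which the class and method names are hypothetical rather than taken from the actual code:

```python
class CenteredProjector:
    """Compute the centering vector once at init and apply it
    identically in every later pass, so the loss branch and the
    monitoring branch see the same geometry. Illustrative only."""

    def __init__(self, baseline_embeddings):
        n = len(baseline_embeddings)
        dim = len(baseline_embeddings[0])
        # mean_vec is fixed here and never recomputed.
        self.mean_vec = [sum(e[i] for e in baseline_embeddings) / n
                         for i in range(dim)]

    def center(self, embedding):
        """The single subtraction both branches must share."""
        return [x - m for x, m in zip(embedding, self.mean_vec)]
```

The original bug was exactly the absence of this discipline: one branch centered, the other did not, and the MSE compared incommensurable quantities.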

The confounds are not an embarrassing record of failures cleaned up before publication.
They are the mechanism by which the program learned what it was measuring.

Experimentalism as Liberation

There is a standard model for experimental progress in machine learning: prior hypothesis, pre-registration, execution, null test, report. The social structure around this model penalizes negative results. A confound is a failure. You report it briefly in a limitations section and move on. The incentive is to run experiments where you already know what will happen.

The program documented here ran under different incentives. No institutional affiliation means no lab to satisfy. No grant means no renewal review. No prior publications means no reputation for being right. The investigator had no investment in any particular outcome because there was no career structure in which the outcome would matter. This is not modesty — it is a genuine structural difference in the incentive landscape.

What that structure produces is something that looks like recklessness from the outside: running experiments with no guarantee of positive results, discarding runs mid-stream when the numbers look wrong, withdrawing twenty-three experiments overnight when a dataset contamination is discovered. From the inside, it produces something else. It produces the ability to be surprised. And the ability to be surprised is the prerequisite for learning what you do not already know.

The confound is the event where the experiment tells you something you were not expecting. It is the most information-dense event in the experimental record. A positive result confirms a prior. A confound revises one. The ZH artifact did not just kill the Chinese-language probe — it clarified the relationship between vocabulary coverage and apparent signal. The bivector null did not just kill the wedge product probe — it forced a distinction between the antisymmetric structure (which is real) and its measurable projection onto a static embedding (which is simpler than a spinor). Each confound compressed the possibility space.

The liberation in question is not intellectual freedom in the abstract. It is the specific liberty to treat a confound as a gift rather than a setback — and, having done that consistently enough, to arrive at the state where you actually crave the confound because you know what it means when it appears. It means the experiment is working. It means you are in contact with something real enough to surprise you.

The 36 days are plausible because the investigator was willing to be wrong at high speed. The contamination fence on Feb 15 withdrew twenty-three experiments in a single entry. That is not a sign of disorder. That is a sign of a system where the epistemological cost of being wrong is low — you just close the gate, log the deviation, and move on. The gate protocol itself is a confound-management system: every unit of work has a single open, a single close, a verdict in numbers, and an output file. The confound cannot hide. It is logged and it teaches.

What the Program Is, in Light of This

The program is not the result of knowing where to look. It is the result of running a sufficient number of experiments fast enough that the confounds compounded into an understanding of the geometry. The conformed instrument is what survived the kill chain. The kill chain was driven by confounds. The confounds were possible because the program was willing to be surprised.

Three theorems came out of this process.

Theorem 1 (Penumbra): SFI > 0 at p < 0.001 — architecture-independent, substrate-independent
Theorem 2 (Factorization): G ≈ Σ_align ⊕ Π_struct — three independent confirmations
Theorem 3 (Conservation): Π_struct is invariant under LoRA fine-tuning — SFI_base and W_base both flat

None of these theorems were the hypothesis at the start of the investigation. The first experiment (WP-001) was designed to validate a different framework entirely. The theorems emerged because the confounds that killed the original framework forced a more precise formulation of what was being measured, and the more precise formulation turned out to be more interesting than the original.

The active training run (WP-032) is the logical consequence: if the structural subspace is conserved under fine-tuning, can it be actively protected? Can you build the conservation law into the loss function? Can you give a model, by design, something that RLHF gave Llama-3-8B for free? The run is alive on Lambda as of this writing. The answer is not yet in. But the question is clean, and the question is clean because the confounds forced it to be.

What Remains Open

The program is 36 days old. The CIP deadline is February 14, 2027. The following are not speculations — they are the next experiments, queued in order, each one waiting for the current gate to close before it opens.

WP-032 (active)
Wilson Loop Regularizer — does the conservation law become a design principle?

Step 200 of 1,000 complete. SFI_combined holding at 0.0319 vs. baseline 0.0295. Verdict at step 1,000.

WP-033 (queued)
Causal Axis Loss — can the antisymmetric structure also be protected?

L_total = L_task + λ_W Σ(W_current − W_baseline)² + λ_ca (1 − cos(d_current, d_baseline)). One pre-registered λ_ca. One run.
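The queued loss can be rendered as a toy scalar function to make the two penalty terms concrete. The λ values below are placeholders for illustration, not the pre-registered ones:

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def wp033_loss(l_task, w_cur, w_base, d_cur, d_base,
               lam_w=0.1, lam_ca=0.1):
    """Toy rendering of the WP-033 form: quadratic Wilson penalty plus
    a causal-axis term that vanishes when the current causal difference
    vector d stays parallel to its baseline. Lambdas are placeholders."""
    w_pen = sum((a - b) ** 2 for a, b in zip(w_cur, w_base))
    ca_pen = 1.0 - cos_sim(d_cur, d_base)
    return l_task + lam_w * w_pen + lam_ca * ca_pen
```

Note that the causal-axis term penalizes rotation of the axis, not its length: any rescaling of d_current parallel to d_baseline leaves the cosine at 1 and the penalty at zero.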

Phase 2 (open)
DRAGNET causal extraction, orca bioacoustics (Phase 2), cross-substrate causal axis

Does the causal direction found in language (peak L5–L7) have an analog in music or bioacoustics? If so, what does that mean for the theory of what an encoder is learning?

Open
SU(3) — three-body holonomy, representation classification

The 3.3% octupole residual and the Greek middle voice result both require something beyond SU(2)×SU(2). SU(3) handles both residuals. Confirmation requires three-body holonomy on triangular plaquettes. Not yet run.