Note: This post was written by Claude Opus 4.7. The following is an analysis of Subquadratic’s launch materials, the Appen evaluation whitepaper, and reporting from major technology publications.
A Miami-based AI company called Subquadratic launched out of stealth on May 5 with a $29 million seed round and a claim that breaks one of the more durable rules of modern AI: that attention, the math at the heart of every modern large language model, has to scale quadratically with context length. Their model, SubQ 1M-Preview, is built on something they call Subquadratic Selective Attention, or SSA, and they say it runs a 128,000-token retrieval benchmark at roughly 1/300th the cost of Claude Opus 4.6 while handling context windows up to 12 million tokens. For comparison, Claude Opus 4.7 and the current production frontier models top out at 1 million tokens; SubQ's 12-million figure is twelve times that, though only the first million has been independently benchmarked. Six days later, on May 11, the AI data and evaluation company Appen published a whitepaper that mostly confirmed the kernel-level claims.
The story is worth understanding for anyone who buys, deploys, or relies on AI software, because if the math actually holds, the economics of long-context AI just changed. And if it doesn't, the story is still useful, because the pattern of how a claim like this gets "verified" is one that IT, business, and healthcare buyers will see again throughout 2026.
What Subquadratic launched
The company was founded by CEO Justin Dangel and CTO Alexander Whedon, a former Meta software engineer who was previously head of generative AI at TribeAI. Its research team includes 11 PhDs with backgrounds from Meta, Google, Oxford, Cambridge, and Adobe. The $29 million seed round was led by Justin Mateen's JAM Fund and Javier Villamizar, previously of SoftBank, with participation from early backers of Anthropic and OpenAI.
Two products launched together. SubQ 1M-Preview is the LLM itself: a 12-million-token context window, around 150 tokens per second inference speed, available in private beta through Subquadratic's own API. SubQ Code is a coding agent that was initially marketed as a standalone CLI but has been repositioned over the past week as a long-context layer that sits on top of existing coding agents: Claude Code, OpenAI Codex, and Cursor.
The headline claims, in plain numbers: SSA is 52× faster than FlashAttention at one million tokens; the model scores 95 to 97.1% accuracy on RULER 128K (a long-context retrieval benchmark) for about $8 per run, versus Claude Opus 4.6 at 94.8% for around $2,600; it scores 82.4% on SWE-Bench Verified, a real-world software engineering benchmark, versus Opus 4.6 at 81.4% and Gemini 3.1 Pro at 80.6%; and it scores 92.1% on a 12-million-token needle-in-a-haystack retrieval test. Each benchmark was run once. The model weights are closed. Subquadratic acknowledges its model is, in its own description, "way smaller than the big labs," which complicates any direct comparison.
Why attention math is the story
The shorthand: every modern LLM has a component called attention, and standard attention compares every token in the context to every other token to figure out which prior context matters for predicting the next token. The number of comparisons grows as n², where n is the number of tokens. Double the context window and the compute roughly quadruples. Push to a million-token context and the cost becomes prohibitive, both in money and in the time the model takes to respond.
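The quadratic growth is easy to see with a back-of-envelope count of pairwise comparisons (a toy sketch of the scaling argument, not any vendor's implementation):

```python
def attention_comparisons(n_tokens: int) -> int:
    """Naive attention scores every token against every other: n * n pairs."""
    return n_tokens * n_tokens

# Doubling the context roughly quadruples the work.
print(attention_comparisons(256_000) / attention_comparisons(128_000))  # 4.0

# At a million tokens the pair count hits a trillion.
print(attention_comparisons(1_000_000))  # 1000000000000
```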
This is why FlashAttention, the 2022 work by Stanford researcher Tri Dao, was a foundational result: it didn't change the n² shape of the curve, but it made the constant in front of it much smaller by reorganizing memory access on GPUs. Almost every frontier model today, including Claude, Gemini, GPT, and the Llama family, uses FlashAttention or a descendant. The shape is still quadratic. The constant is just smaller.
What Subquadratic is claiming is that the shape itself is different, that SSA scales linearly rather than quadratically, because the mechanism that selects which tokens to attend to is itself sub-quadratic. If true, that is the technically novel piece. It is also the piece most worth scrutinizing.
How SubQ differs from prior attempts at this
Several research programs have aimed at sub-quadratic attention before. None has fully landed at frontier scale.
State-space models (Mamba and RWKV are the most prominent) replace attention entirely with recurrent dynamics that scale linearly. They work well on some tasks but historically underperform pure attention on capability benchmarks at frontier scale. Tri Dao and Albert Gu, the Mamba authors, published an updated Mamba-3 paper in 2026 trying to close that gap.
Fixed-pattern sparse attention (BigBird and Longformer) limits attention to a hand-picked subset of token pairs, for example each token attending to its near neighbors plus a few global tokens. The selection isn't content-aware. It works for some long-document tasks but doesn't fully solve the cost problem.
Learned-sparse hybrids (DeepSeek Sparse Attention and Kimi Linear are recent examples) try to learn which token pairs to compare instead of using a fixed pattern. The catch, as a widely shared LessWrong critique pointed out, is that the selection mechanism itself is often still quadratic, or the model is only sparse on some layers and quadratic on others.
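To make the LessWrong objection concrete, here is a generic cost illustration (a hypothetical contrast, not SSA's actual, undisclosed mechanism): a selector that scores every (query, key) pair before pruning is already O(n²), while a selector that ranks tokens by a single per-token importance score and keeps the top k is O(n log n), genuinely sub-quadratic.

```python
import math

def pairwise_selector_ops(n: int) -> int:
    # Deciding sparsity by scoring every (query, key) pair is itself O(n^2),
    # so the "sparse" attention never escapes the quadratic bill.
    return n * n

def topk_selector_ops(n: int) -> float:
    # Ranking tokens by one per-token importance score and keeping the
    # top k is O(n log n): sub-quadratic end to end.
    return n * math.log2(n)

n = 1_000_000
# At a million tokens the gap between the two selectors is about 50,000x.
print(pairwise_selector_ops(n) / topk_selector_ops(n))
```

The function names here are illustrative only; whether SSA's selector actually follows a top-k-style pattern is exactly what an arXiv paper would settle.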
SubQ's pitch is that SSA is the first architecture that is both content-aware in its sparsity and sub-quadratic in the selection mechanism. Subquadratic published a technical blog post titled "How SSA Makes Long Context Practical" on May 7 directly addressing why their approach differs from the prior families above. It is not a peer-reviewed paper. It is the most detailed public description of the architecture so far.
What we know so far
The launch drew immediate and pointed scrutiny. A launch tweet from Alexander Whedon reached 6.1 million views in seven hours, but accusations of astroturfing surfaced when identical comments were spotted cross-posted on Hacker News and X. A 17-point gap emerged between Subquadratic’s own research score on the MRCR benchmark (83) and a third-party-verified production score (65.9), a gap their materials did not fully explain. By May 6, mainstream coverage from VentureBeat had reframed the story from breakthrough to “researchers demand independent proof.”
The biggest update since came on May 11, when Appen published a whitepaper titled “Benchmarking Subquadratic’s Latest Model & SSA Kernel,” authored by Sergio Bruccoleri and Jeanine Sinanan-Singh. Appen tested the model and the SSA kernel on NVIDIA’s B200 hardware, with code-review access to the SSA implementation for the efficiency portion of the evaluation.
Appen’s findings largely confirmed the architectural claims:
| Metric | Subquadratic’s claim | Appen’s finding |
|---|---|---|
| Speed vs FlashAttention-2 at 1M tokens | 52× | 56× (381 ms vs 21.4 s) |
| FLOP reduction at 1M tokens | not separately stated | 62.8× |
| RULER 128K accuracy | 95 to 97.1% | 95.6% |
| SWE-Bench Verified | 82.4% | 81.8% |
| MRCR at 1M tokens (8-needle) | not directly stated | 86.2% |
Appen's wall-clock measurements were validated against PyTorch's torch.profiler tool and matched theoretical predictions to within 0.7 to 3.9%. The linear-scaling claim held in their testing: at one million tokens, SSA latency grew by 7.95× against an 8× increase in context length, close to the linear scaling SSA's math predicts.
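The arithmetic behind that linearity check is worth spelling out: an 8× increase in context should produce about an 8× latency increase if the kernel is linear, versus about 64× if it is quadratic. A quick sanity check using only the numbers above:

```python
context_growth = 8.0            # Appen grew the context 8x, to one million tokens
measured_latency_growth = 7.95  # latency growth reported in the whitepaper

linear_prediction = context_growth          # 8.0x if latency is O(n)
quadratic_prediction = context_growth ** 2  # 64.0x if latency is O(n^2)

# The measurement sits within 1% of the linear prediction,
# nowhere near the quadratic one.
assert abs(measured_latency_growth - linear_prediction) / linear_prediction < 0.01
print(linear_prediction, quadratic_prediction)  # 8.0 64.0
```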
The important caveat is the relationship. Coverage of the evaluation describes it as a partnership between Subquadratic and Appen, and Subquadratic is now using the Appen numbers as their "third-party verified" badge on the launch page. Appen guarded the methodology (no advance access to model weights, training data, or benchmark ground-truth labels for the benchmark runs), but the engagement itself was paid. That is not the same thing as an unsolicited independent audit.
What remains unanswered
For something this consequential, several gaps remain.
There is no arXiv preprint from Subquadratic. Without one, the academic community can’t begin peer review of the math that underpins the sub-quadratic selection claim. The technical blog post is not a substitute.
There is no public comment from the researchers whose work SSA implicitly stands against: Tri Dao and Albert Gu, the FlashAttention and Mamba authors, or the BigBird and Longformer teams. Their silence so far is itself a signal; the AI architecture research community is small and usually quick to react to a claim of this magnitude.
SubQ is not yet on any major public leaderboard. Artificial Analysis, LMSYS Chatbot Arena, HELM, and EpochAI all maintain independent benchmark rankings, and inclusion on any of them would put SubQ’s quality claims into a comparable frame.
The API is still in private beta. There are no real-workload reports from non-allowlisted developers, only the synthetic benchmarks Subquadratic and Appen have run.
Pricing beyond “search is free, land-and-expand later” has not been disclosed, and the model weights remain closed.
The LessWrong community, which published a debunking-style critique of the original launch claims, has not yet weighed in on the Appen report specifically.
What this could mean for IT, business, and healthcare
A 12-million-token context window is roughly the size of a small codebase, a multi-volume legal contract portfolio, or a year of clinical notes for a single complex patient. The use cases that get unlocked if cost-effective long-context inference becomes real are concrete: comprehensive code review against an entire monorepo, document discovery against the full universe of a matter’s communications, longitudinal clinical reasoning that spans years of EHR data without the fragmentation that today’s smaller context windows force.
The cost claim is the one that would actually change procurement math. If RULER 128K really costs $8 on SubQ versus $2,600 on Claude Opus 4.6 at roughly comparable accuracy, that is a ~300× reduction in long-context inference cost on a single benchmark. Sustained price-per-million-tokens on real workloads is the number that would matter, and Subquadratic has not yet published one.
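The quoted figures imply the ratio directly; a quick check using only the numbers in the launch materials:

```python
subq_cost = 8.0     # reported cost per RULER 128K run on SubQ, USD
opus_cost = 2600.0  # reported Claude Opus 4.6 cost for the same run, USD

# The "~300x" headline rounds down the actual quotient.
print(opus_cost / subq_cost)  # 325.0
```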
For IT teams evaluating long-context AI, healthcare technology leaders weighing whether to expand AI into longitudinal record review, or business leaders watching their cloud AI spend, the right posture today is to keep watching. The architecture might be real. The Appen numbers are encouraging. But closed weights, private-beta API, no leaderboard inclusion, no arXiv paper, and a commissioned third-party eval add up to a story still in development.
Bottom line
SubQ could be the model that finally makes sub-quadratic attention work at frontier scale. It could also be the next architecture that looks elegant on paper and doesn't translate to capability at scale (the LessWrong piece's general worry). The kernel-level claims (speed, FLOP reduction, linearity) are the strongest part of what's been verified so far, because they can be checked from code review and profiling. The model-quality claims are the weaker part, because they still depend on Subquadratic's own benchmarks and a commissioned partner.
“Third-party verified” is becoming a common badge in AI marketing, and it will be on more vendor decks through 2026. The Subquadratic and Appen relationship is a useful template for how to read those badges going forward: the methodology matters, the access scope matters, the financial relationship matters, and the absence of unsolicited reviewers is itself information.
Sources
- Appen – Benchmarking Subquadratic's Latest Model & SSA Kernel
- Subquadratic – Introducing SubQ
- Subquadratic – How SSA Makes Long Context Practical
- VentureBeat – Miami startup Subquadratic claims 1,000x AI efficiency gain; researchers demand independent proof
- The New Stack – The context window has been shattered
- SiliconANGLE – Subquadratic launches with $29M to bring 12M-token context windows to AI
- LessWrong – Debunking claims about subquadratic attention
- Refresh Miami – Subquadratic raised $29M on the idea that it has cracked AI's biggest math problem
- Glitchwire – SubQ claims first fully subquadratic frontier model with 12 million token context window
- Hacker News thread on the SubQ launch
- Alexander Whedon launch tweet
