China's Coding Models Stopped Lagging — The Numbers That Changed the Narrative

The conventional wisdom held for years: American frontier models were roughly six to nine months ahead of the best Chinese labs on agentic coding tasks. Much of the evidence was anecdotal, but the gap felt real, the available benchmarks supported it, and nobody had reason to challenge it.

Then April 2026 happened. Four Chinese labs released open-weights coding models inside a twelve-day window. The numbers came back, and they didn't match the old story.

The Twelve-Day Sprint

According to Air Street Press's May 2026 State of AI report, Z.ai's GLM-5.1, MiniMax's M2.7, Moonshot's Kimi K2.6, and DeepSeek's V4 all landed in rapid succession. Z.ai's stock closed up 15.92% on the day GLM-5.1 launched. MiniMax's M2.7 completed over 100 rounds of self-optimizing its own scaffold. Kimi K2.6 ran a 12-hour continuous tool-use trace porting an inference engine to Zig. DeepSeek's V4-Pro claimed parity with Opus 4.6 and GPT-5.4.

The aggregate result on SWE-Bench Pro: all four models scored between 56 and 59. For context, that puts them in the same tier as the leading Western frontier models on coding tasks: not six to nine months behind, not merely catching up, but effectively at parity.

Why the Gap Closed

Three things happened in parallel. First, the open-weights ecosystem matured: once anyone can download and build on a model, a technique that works propagates fast. Second, the cost structure changed: all four of these models are priced at under one-third the cost of Claude Opus 4.7 per token, which means the economics of using them for production coding work are dramatically different from the American frontier options.

Third, and less discussed: the training methodologies converged. Reinforcement learning with verifiable rewards, scratchpad reasoning, and agentic tool use are not proprietary techniques. Once they work, they're documented, replicated, and improved on by the next lab within weeks.

The Benchmark Nuance

Air Street Press notes an important caveat: "Both true — different evaluators measuring different things." NIST's CAISI evaluation, an aggregate cross-domain benchmark, showed the Chinese models still lagging the leading American frontier by roughly eight months. DeepSeek's own model card, by contrast, claims V4-Pro at parity with Opus 4.6 and GPT-5.4.

The two claims are consistent, because different benchmarks measure different capabilities. SWE-Bench measures the ability to fix real-world GitHub issues, a task that is verifiable and reproducible. Broader capability benchmarks like the Intelligence Index include reasoning, planning, and multi-step task completion, where the gaps may be larger.

What matters for practical use: if you're evaluating these models for coding assistance, the SWE-Bench numbers are the relevant data point, and those numbers have compressed significantly.

What This Means for the Market

Cursor is reportedly raising $2 billion at a $50 billion-plus valuation with enterprise revenue surging toward a $6 billion run-rate exit (TechCrunch, April 2026). Cognition is in talks for a follow-on at $25 billion, up from $10.2 billion in September 2025. These valuations are being priced on the assumption that coding agents represent a large and defensible market. If the Chinese open-weights models are at parity on coding benchmarks at under one-third the cost, that changes the competitive dynamics for every team building in this space.

The practical implication: if you're building a coding agent workflow today and paying per-token for Claude Opus or GPT-5 class models, you're doing it with a cost structure that may not be defensible in six months. The open-weights alternatives are good enough for most tasks and significantly cheaper.
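
The cost argument can be made concrete with a little arithmetic. The sketch below is only an illustration: the frontier price and the monthly token volume are hypothetical placeholders rather than published rates, and the open-weights price is set at exactly one-third of the frontier price, which is the upper bound implied by "under one-third." The function name `monthly_cost` is made up for this example.

```python
def monthly_cost(tokens_per_month: int, price_per_million: float) -> float:
    """Return the monthly spend in USD for a given token volume and per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_million


# Hypothetical inputs, chosen only to show the shape of the calculation.
FRONTIER_PRICE = 15.00                    # assumed USD per million tokens
OPEN_WEIGHTS_PRICE = FRONTIER_PRICE / 3   # upper bound from "under one-third the cost"
TOKENS_PER_MONTH = 2_000_000_000          # assumed workload for a coding agent team

frontier = monthly_cost(TOKENS_PER_MONTH, FRONTIER_PRICE)
open_weights = monthly_cost(TOKENS_PER_MONTH, OPEN_WEIGHTS_PRICE)
savings = frontier - open_weights

print(f"Frontier model:     ${frontier:,.0f} per month")
print(f"Open-weights model: ${open_weights:,.0f} per month")
print(f"Savings:            ${savings:,.0f} ({savings / frontier:.0%} of frontier spend)")
```

Because the price ratio is fixed, the saving is at least two-thirds of the frontier bill regardless of the volume plugged in. Real workloads would also need to account for differences in input versus output pricing and in how many tokens each model needs to finish a task, which this sketch ignores.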

The Attribution Problem

Anthropic publicly stated in February 2026 that models from three Chinese labs, including some of those above, were essentially fine-tuned distillations of frontier models rather than the product of independent research. That accusation is worth noting and is still cited by defenders of the "China lags" narrative. But the SWE-Bench scores don't distinguish between "independent research" and "distillation." They measure output quality. If the output is at parity, the attribution argument is less relevant for people choosing which model to use in production.

Key Takeaways

GLM-5.1, M2.7, Kimi K2.6, and DeepSeek V4 all scored 56-59 on SWE-Bench Pro in April 2026 — at coding parity with Claude Opus 4.6 and GPT-5.4
All four models are priced at under one-third the cost of Claude Opus 4.7, fundamentally changing the economics of coding agent workflows
The "China lags 6-9 months" frame on coding no longer holds — SWE-Bench is verifiable and reproducible; the numbers don't support the old narrative
Benchmark disagreements (NIST CAISI vs. model cards) reflect different evaluation methodologies, not unreliable data; both can be true simultaneously