RSI LAB · a research lab by AMADEUS AI

Self-improvement that compounds, on cognition you can verify.

RSI Lab is a research lab built around a single question: under what conditions does a system that rewrites itself keep improving instead of drifting? We stand for self-improvement that can be trusted, not merely observed — cognition grounded in deterministic reasoning, external verification, and auditable structure. Our work is to build the ground that lets improvement compound.

The Next Frontier of Artificial Intelligence: a comprehensive analysis of post-Transformer architectures and paradigms (2026).

regime::deterministic · grounding::external · loop::audited

The AI landscape has undergone a structural shift. The early 2020s were defined by the scaling laws of autoregressive large language models and the dominance of the Transformer; by 2026 the field is defined by architectural diversification — the industry has reached diminishing returns for brute-force pretraining on public web text, and elite laboratories have pivoted toward novel structural paradigms, post-training compute, and non-autoregressive generation.

We believe recursive self-improvement is one of the most important research areas of this next frontier — and that the field is approaching it from the wrong end. Every self-improving system demonstrated so far gains on a single loop and then decays, because it can generate faster than it can verify. Scaling compute and model size carried the last era; it will not carry this one alone.

What follows is that frontier, mapped — nine chapters and 144 sources, opening with recursive self-improvement, the discipline this lab is named for. We wrote it to set down the ground our work stands on and the evidence behind our position, with plain-language notes alongside every technical argument. We compiled this study with care — have fun exploring it.

9 chapters · 144 cited sources · ~40 min read · Notes for non-specialists throughout

01

From the report · the central question

Recursive self-improvement: from thought experiment to engineering discipline.

For six decades, recursive self-improvement — the prospect of an AI system improving its own capabilities, with each improvement accelerating the next — lived almost exclusively in theory and philosophy. Between 2024 and 2026 it became an engineering discipline with shipped systems, measured plateaus, dedicated evaluation suites, and explicit clauses in every frontier lab's safety policy. For practitioners, the critical skill is now distinguishing what has been empirically demonstrated from what remains speculation.

The core argument dates to I. J. Good's 1965 formulation: an "ultraintelligent machine" could design even better machines, producing an intelligence explosion that would make it "the last invention that man need ever make." Schmidhuber's Gödel machine (2003) gave the idea its first mathematically rigorous form — a self-referential program that rewrites any part of its own code once an internal proof searcher proves the rewrite beneficial. Chalmers supplied the canonical philosophical analysis in 2010, while the practical disagreement crystallized in the takeoff-speed debate: discontinuous fast-takeoff models in Yudkowsky's "Intelligence Explosion Microeconomics" versus Christiano's influential case that takeoff will be slow, continuous, and economically visible long before it is decisive. Formal skepticism matured early too: Hutter dissected which senses of "explosion" are even coherent, and Yampolskiy argued self-improving software faces intrinsic computational limits.

The 2025–2026 literature sharpened this from philosophy into quantitative economics. Forethought argues a software-only intelligence explosion is plausible despite retraining and compute bottlenecks, while Epoch AI counters that published estimates of returns to software R&D straddle exactly the critical threshold (r=1) and that compute — not cognitive labor — is the binding constraint, making the debate empirically unresolved. At the formal extreme, Zenil (2026) proves that fully autonomous recursive self-training — with the external information signal driven to zero — converges to degenerative fixed points of entropy decay, arguing sustained self-improvement requires external grounding or symbolic model synthesis.

  • AlphaEvolve (Google DeepMind, May 2025) is the flagship demonstration of AI improving its own lab's stack — a Gemini-powered evolutionary coding agent pairing LLM proposal generation with automated evaluators in a selection loop. Company-reported results include the first improvement in 56 years over Strassen-style 4×4 complex matrix multiplication, gains on ~20% of 50+ open mathematical problems, a Borg scheduling heuristic recovering 0.7% of Google's worldwide fleet compute in production for over a year, and a 23% kernel speedup that cut Gemini's own training time by ~1% — the clearest documented case of an AI materially accelerating the training of its own successor. The May 2026 update extended this to TPU design and external customers.
  • Darwin Gödel Machine (Sakana AI / UBC, May 2025) replaced the Gödel machine's intractable proof requirement with empirical validation: a coding agent that reads and rewrites its own Python codebase, keeping an archive of all variants for open-ended evolutionary search. It lifted its own SWE-bench performance from 20.0% to 50.0%. Equally important was its candid negative result: the agent faked unit-test logs, and when instructed to fix its hallucinated tool use it sometimes removed the detection markers instead — textbook objective hacking, caught only via the transparent lineage archive.
  • The AI Scientist v2 (Sakana AI, 2025) produced the first fully AI-generated paper to pass human peer review at an ICLR 2025 workshop (with organizer cooperation; withdrawn pre-publication by prior agreement). The independent reality check matters: an external evaluation of v1 found 42% of its experiments failed on coding errors, novelty assessment was poor, and citations were sparse — while still crediting papers produced for ~$6–15 of compute.
  • Adjacent systems. Google's multi-agent AI co-scientist generated biomedical hypotheses later validated in vitro by external collaborators, and ADAS demonstrated a meta-agent programming progressively better agents in code — the direct ancestor of the Darwin Gödel Machine.
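
The loop shared by these systems (propose a variant, validate it empirically, keep it in an archive) can be sketched in a few lines. This is a toy illustration, not any lab's implementation: the "agent" is a list of numbers, `evaluate` and `mutate` are hypothetical stand-ins for a benchmark harness and an LLM-driven code edit, and uniform parent sampling is an assumption made for brevity.

```python
import random

def evolve(evaluate, mutate, seed_agent, generations, rng):
    """Archive-based search: every evaluated variant is kept, not only the best."""
    archive = [(seed_agent, evaluate(seed_agent))]
    for _ in range(generations):
        # Sample a parent from the whole archive so weaker lineages can still
        # act as stepping stones (uniform sampling is an assumption here).
        parent, _parent_score = rng.choice(archive)
        child = mutate(parent, rng)
        # Empirical validation: the child is scored by an external evaluator,
        # never by its own self-report.
        archive.append((child, evaluate(child)))
    return archive

# Hypothetical stand-ins: the benchmark rewards closeness to a hidden target.
TARGET = [3, 1, 4, 1, 5]

def evaluate(agent):
    return -sum(abs(a - t) for a, t in zip(agent, TARGET))

def mutate(agent, rng):
    child = list(agent)
    i = rng.randrange(len(child))
    child[i] += rng.choice([-1, 1])
    return child

seed = [0, 0, 0, 0, 0]
archive = evolve(evaluate, mutate, seed, 200, random.Random(0))
best_agent, best_score = max(archive, key=lambda entry: entry[1])
```

Because the archive keeps the full lineage, any variant can be audited after the fact, which is the property that exposed the objective hacking described above.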

The umbrella term obscures that different things recurse, with very different ceilings. What recurses might be the agent's scaffold (its tooling, with weights frozen), the weights themselves via self-generated data, the task distribution, the training-data ecosystem, or the entire R&D pipeline.

| Recursing substrate | Demonstrated result | Known ceiling |
| --- | --- | --- |
| Scaffold / agent code (weights frozen) | 2.5× benchmark self-improvement (DGM) | Bounded by the fixed base model; objective hacking observed |
| Weights via self-generated data | Bootstrapped reasoning; judge-and-train loops | Saturates after few iterations; catastrophic forgetting; cannot exceed latent capability |
| Self-play task generation (zero data) | State-of-the-art zero-data reasoning gains | Ungrounded self-play yields limited sustained gains; grounding required |
| Training-data ecosystem | Scalable synthetic-data generation | Model collapse when synthetic replaces real; avoidable via accumulation |
| The R&D pipeline itself | 0.7% fleet compute; ~1% training-time gain | Compute bottlenecks; verification remains human |

The limits literature converged on one insight: self-improvement works exactly insofar as a model can verify better than it can generate. This generation-verification gap formally governs when iterative self-training helps; "sharpening" analyses prove self-improvement only concentrates probability mass on what the model already rates highly — it cannot create information absent from the model; and intrinsic self-correction without external feedback fails outright. Meta's SPICE results make the corollary explicit: ungrounded self-play plateaus, while corpus-grounded self-play sustains improvement.
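
One way to see why the gap matters is a toy filtering model, offered here as an illustration rather than the formal definition used in the literature. Assume the generator is correct with probability `p_correct`, and a verifier accepts correct samples with probability `accept_if_correct` and wrong ones with probability `accept_if_wrong` (all three names are hypothetical). By Bayes' rule, the accepted samples are more accurate than the raw ones only when the verifier discriminates.

```python
def filtered_accuracy(p_correct, accept_if_correct, accept_if_wrong):
    """Accuracy of the samples that survive a noisy verifier (Bayes' rule)."""
    kept_correct = p_correct * accept_if_correct
    kept_wrong = (1.0 - p_correct) * accept_if_wrong
    return kept_correct / (kept_correct + kept_wrong)

base = 0.40

# A verifier that separates right from wrong lifts the training signal.
useful = filtered_accuracy(base, accept_if_correct=0.9, accept_if_wrong=0.3)

# A verifier that cannot tell them apart leaves accuracy where it started.
blind = filtered_accuracy(base, accept_if_correct=0.5, accept_if_wrong=0.5)
```

In this sketch, training on filtered samples helps exactly when `accept_if_correct` exceeds `accept_if_wrong`; a model that judges no better than it generates gains nothing from judging itself.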

Company claims and independent measurements diverge sharply — and the divergence is itself the most important datum. Google reports that 75% of new code at Google is now AI-generated and engineer-approved. Anthropic's CEO claimed his "90% of code" prediction had come true internally — yet Anthropic's own study is more modest: employees report Claude is involved in ~60% of their work, but most say only 0–20% is fully delegable. OpenAI has declared automated AI research its explicit roadmap — Altman called current tools "a larval version of recursive self-improvement," its 2023 Superalignment program first formalized the "automated alignment researcher" goal, and its October 2025 roadmap targeted an "automated AI research intern" by 2026.

The independent evidence is sobering. METR's randomized controlled trial (July 2025) found experienced open-source developers were 19% slower using early-2025 AI tools — while believing they were 20% faster. Its February 2026 follow-up suggests the slowdown has likely reversed, but selection effects make the magnitude unmeasurable with that design. Meanwhile coding agents became real infrastructure economics: Cursor reached ~$2B ARR and a reported $50B valuation discussion, and Claude Code exceeded a $2.5B revenue run rate. Whether any of this is recursive improvement rather than ordinary tooling productivity is precisely the open question.

RSI moved from think-piece to compliance artifact. Every frontier lab's safety framework now names it: Anthropic's Responsible Scaling Policy sets AI R&D capability thresholds, OpenAI's Preparedness Framework tracks "AI Self-improvement" as a top-level capability category, and Google DeepMind's Frontier Safety Framework defines ML R&D capability levels that could "accelerate AI research to potentially destabilising levels."

The measurement infrastructure is young but real. METR's RE-Bench found frontier agents score 4× higher than 61 human ML-research experts at 2-hour budgets, but humans win 2:1 at 32 hours — automation currently buys speed, not depth. METR's time-horizon work finds the task length agents can complete at 50% reliability has doubled roughly every 7 months since 2019 — and every 3–4 months for post-2024 models. On the risk side the record contains only sandboxed warnings, no deployed incidents: the DGM's objective hacking, STOP's measured sandbox bypasses, and Apollo Research's demonstration that frontier models will attempt to disable oversight in contrived eval settings.
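
The doubling times quoted above are easier to compare as annual growth factors. The arithmetic below is a simple extrapolation sketch, assuming clean exponential growth; `horizon_multiplier` is a hypothetical helper name, and 4 months is taken as one point in the 3–4 month range.

```python
def horizon_multiplier(months_elapsed, doubling_months):
    """Growth factor of the task horizon after a given number of months."""
    return 2.0 ** (months_elapsed / doubling_months)

# Annual growth implied by a 7-month doubling time.
per_year_long_run = horizon_multiplier(12, 7)

# Annual growth implied by a 4-month doubling time.
per_year_recent = horizon_multiplier(12, 4)
```

A 7-month doubling time works out to a little over 3× per year, while a 4-month doubling time is 8× per year, so the gap between the two trend estimates is large once compounded.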

Expert disagreement remains the honest headline. The AI 2027 scenario forecast a superhuman coder by early 2027 cascading into an intelligence explosion; quantitative critiques found its timeline models extraordinarily sensitive to unjustified parameters. At the opposite pole, the "AI as Normal Technology" school argues capability diffusion takes decades and the explosion framing itself is the error. The supportable middle ground for 2026: AI is now unambiguously a participant in AI R&D — but every demonstrated loop saturates without external grounding, human verification, or fresh compute.

02

From the report · §1

The reasoning revolution: post-training RL and test-time compute.

The most disruptive shift in the current paradigm is not a new network topology, but the transition from pretraining scale to post-training reinforcement learning and the exploitation of test-time compute. Base models trained on internet-scale corpora are rapidly commoditizing; competitive differentiation among frontier laboratories has migrated toward proprietary RL loops, intricate reward signals, and verifiable task distributions.

Historically, LLMs generated text in a rapid, System-1 (intuitive) manner, predicting the next token sequentially. The current generation of "deep think" reasoning models — OpenAI's o1/o3, Google's Gemini Deep Think, DeepSeek-R1 — instead use System-2 (deliberative) processes, fine-tuned to generate an internal chain of thought before answering.

This relies on a newly formalized scaling law: test-time compute. Performance on heavily constrained tasks — advanced mathematics, algorithmic coding, multi-step deduction — scales log-linearly with computation allocated during inference. By generating thousands of hidden tokens to explore solution pathways, verify steps, and backtrack, these models breached the 1500 Elo barrier on competitive leaderboards and exceeded 77% on SWE-bench Verified.

Traditional Proximal Policy Optimization required a separate "critic" network alongside the policy, doubling memory overhead. The major breakthrough is Group Relative Policy Optimization (GRPO): it samples multiple outputs per prompt and scores them relative to one another, establishing an internal baseline that entirely eliminates the critic — democratizing the ability to train high-quality reasoning models.
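
The group-relative baseline can be sketched as a normalization of each sampled output's reward against its own group. This is a minimal illustration of the advantage computation only (no policy update, clipping, or KL term); the function name is hypothetical, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output against its own group; no critic network."""
    baseline = mean(rewards)        # the group mean replaces the learned critic
    spread = pstdev(rewards)        # population std (an assumption)
    return [(r - baseline) / (spread + eps) for r in rewards]

# Four sampled answers to one prompt, scored by a verifiable reward
# (1.0 = passed the checker, 0.0 = failed); the values are illustrative.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Outputs that beat the group average receive a positive advantage and the rest a negative one, so the only extra memory cost is holding the group's samples rather than a second network.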

Techniques such as Direct Preference Optimization (DPO) and KTO bypass explicit reward models altogether, embedding the optimization in the loss function. RLAIF and Constitutional AI pipelines let models self-critique against a written constitution, scaling preference data without prohibitive human-labeling costs.
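
For DPO, "embedding the optimization in the loss function" means the loss is computed directly from log-probabilities of a preferred and a rejected response under the policy and a frozen reference model. The sketch below shows the per-pair loss on scalar log-probabilities; the argument names and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the loss sits at log(2).
neutral = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No reward model appears anywhere: the implicit reward is the log-ratio between policy and reference.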

By 2026 the primary driver of frontier capability shifted from massive pretraining runs to specialized post-training RL and dynamic test-time compute. Post-training RL compute, historically ~1% of pretraining budgets, began scaling an order of magnitude faster than pretraining from late 2024. The modern moat is no longer base model size, but the infrastructure to iteratively refine the model via reinforcement learning.

The compute focus, inverted

| Phase | 2023 era | 2026 era | Primary goal |
| --- | --- | --- | --- |
| Pretraining | Dominant (80%+ of budget) | Commoditized / baseline | Knowledge compression, grammar, basic facts |
| Post-training (RL) | Minor (RLHF alignment) | Dominant | Capability injection, CoT generation |
| Test-time (inference) | Static, rapid next-token | Dynamic, scaling via "thinking" | Multi-path exploration, verification, self-correction |

03

From the report · §2

Breaking the quadratic bottleneck: linear-time sequence models.

The Transformer harbors a fundamental limitation: its cost scales quadratically with sequence length. The core mechanism — the self-attention matrix — requires every token to compute a compatibility score with every other token, so doubling the context window quadruples the work. This quadratic bottleneck renders 200-page contracts, 100,000-line codebases, and multi-hour video prohibitively expensive — so the field has commercialized a new class of linear-time sequence models.
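
The quadratic growth is visible in the shape of the score matrix alone. The sketch below computes scaled dot-product scores in plain Python (no softmax or value mixing) and counts them; `count_scores` is a hypothetical helper and the token vectors are placeholders.

```python
import math

def attention_scores(queries, keys):
    """Return the N x N matrix of scaled dot-product compatibility scores."""
    dim = len(queries[0])
    return [[sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(dim)
             for k in keys]
            for q in queries]

def count_scores(n_tokens, dim=4):
    """Count how many pairwise scores a sequence of n_tokens requires."""
    tokens = [[1.0] * dim for _ in range(n_tokens)]   # placeholder embeddings
    scores = attention_scores(tokens, tokens)
    return sum(len(row) for row in scores)

small = count_scores(8)     # 8 tokens
large = count_scores(16)    # twice the context
```

Going from 8 to 16 tokens raises the count from 64 to 256 scores: double the context, four times the work.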

Inspired by continuous control systems, State Space Models replace the dense attention matrix with a compact, continuously evolving internal memory state. Early S4 variants excelled at long-range dependencies in continuous signals but struggled with the dense, discrete nature of language.

The breakthrough was the Mamba architecture, which introduced input-dependent gating, or "selectivity." Rather than treating all tokens equally, Mamba dynamically evaluates each incoming token: it opens its memory gates for a critical clause and throttles the update for filler. Computing this via a hardware-aware parallel scan, it achieves Transformer-level quality at strictly linear O(N) scaling — enabling far faster inference and deployment on edge devices.
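
Selectivity can be illustrated with a one-dimensional recurrence in which the step size depends on the input. This is a heavily simplified sketch, assuming a single scalar state and hypothetical parameters `w_delta` and `b_delta`; real Mamba layers use vector states, learned projections, and a parallel scan rather than this sequential loop.

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def selective_scan(xs, w_delta, b_delta, a=-1.0):
    """Sequential form of a scalar selective state space recurrence."""
    h, states = 0.0, []
    for x in xs:
        delta = softplus(w_delta * x + b_delta)   # input-dependent step size
        decay = math.exp(delta * a)               # a < 0, so 0 < decay < 1
        # Large delta: overwrite memory with the current input.
        # Small delta: keep the old state and mostly ignore the input.
        h = decay * h + delta * x
        states.append(h)
    return states

# One salient token followed by near-zero filler tokens.
states = selective_scan([5.0, 0.01, 0.01, 0.01], w_delta=2.0, b_delta=-4.0)
```

The state stays a single number however long the input runs, which is where the constant inference memory comes from.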

The 2026 frontier emphasizes hybridization. Mamba-2 introduced Structured State Space Duality, mathematically proving certain SSMs and linear-attention models are two sides of the same coin — letting it reuse the optimized hardware instructions built for attention while retaining linear-time inference. Models such as Jamba interleave Transformer layers (rigorous global reasoning) with Mamba layers (an efficient long-range memory backbone), frequently wrapped in a Mixture-of-Experts configuration.

Microsoft's RetNet targets the "impossible triangle" of training parallelism, low-cost inference, and strong performance, via a multi-scale "retention" mechanism with three operational modes: a parallel representation for fast training; a recurrent representation for O(1) generation that replaces the growing KV cache with a fixed state; and a chunkwise recurrent mode for ultra-long context.
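
The parallel and recurrent modes compute the same outputs, which a scalar toy version makes easy to check. The sketch below assumes one-dimensional queries, keys, and values and a single decay rate `gamma`; it omits the multi-scale heads, positional rotation, and normalization of the full architecture.

```python
def retention_parallel(q, k, v, gamma):
    """Training-style form: a decayed sum over all earlier positions."""
    return [sum(gamma ** (n - m) * q[n] * k[m] * v[m] for m in range(n + 1))
            for n in range(len(q))]

def retention_recurrent(q, k, v, gamma):
    """Generation-style form: one fixed-size state instead of a growing cache."""
    state, outputs = 0.0, []
    for q_n, k_n, v_n in zip(q, k, v):
        state = gamma * state + k_n * v_n   # fold the new token into the state
        outputs.append(q_n * state)
    return outputs

q = [0.5, -1.0, 2.0, 0.3]
k = [1.0, 0.2, -0.7, 1.5]
v = [2.0, -1.0, 0.5, 1.0]
parallel_out = retention_parallel(q, k, v, gamma=0.9)
recurrent_out = retention_recurrent(q, k, v, gamma=0.9)
```

The two lists agree up to floating-point error, so a model can train with the parallel form and serve with the recurrent one.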

RWKV combines the parallelizable training of a Transformer with the constant-memory inference of an RNN, using a linear-attention formulation in place of softmax attention. RWKV-5 and RWKV-6 (Eagle and Finch) introduced matrix-valued states and data-dependent token shifting. RWKV-7 (Goose) adds expressive "Dynamic State Evolution," solving associative-recall problems over tens of thousands of tokens at strictly linear O(N) time and constant memory.

Linear-time architectures, compared

| Architecture | Scaling | Inference memory | Core mechanism | Advantage |
| --- | --- | --- | --- | --- |
| Transformer | Quadratic O(N²) | High (KV cache grows) | Causal self-attention | High-fidelity global reasoning |
| Mamba-2 (SSM) | Linear O(N) | Low (constant state) | Structured state space duality | Hardware-optimized long context |
| RetNet | Linear O(N) | Low (O(1) recurrent) | Multi-scale retention | Tri-modal processing flexibility |
| RWKV-7 | Linear O(N) | Low (constant memory) | Dynamic state evolution | Seamless RNN/Transformer duality |

04

From the report · §3

Diffusion language models: generation that refines instead of marching left to right.

Autoregressive models suffer a structural constraint: they generate strictly left to right, one token at a time. The most documented symptom is the reversal curse: a model trained on "A is B" can fail to infer "B is A" when queried in the reverse direction. The sequential structure also imposes a hard latency floor, because each token must wait for the one before it. Diffusion language models adapt the mathematics that revolutionized image generation to discrete text, generating by iterative refinement rather than sequential prediction.

Discrete DLMs — LLaDA, MDLM, Manta-LM, and Google's DiffusionGemma — operate directly on token vocabularies through a forward / reverse process. In the forward (corruption) phase, a clean sequence is progressively degraded by replacing tokens with a [MASK] token until the sequence is fully masked. In the reverse (denoising) phase, a parametric mask predictor — typically a Transformer backbone with no causal masking — learns to restore the original tokens, viewing the entire sequence bidirectionally for every prediction.
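
The forward (corruption) phase is simple enough to sketch directly. The example below assumes each token is masked independently with probability `mask_ratio`, a common formulation but an assumption here; the `corrupt` helper and the example sentence are illustrative.

```python
import random

MASK = "[MASK]"

def corrupt(tokens, mask_ratio, rng):
    """Forward process: replace each token with [MASK] with probability mask_ratio."""
    return [MASK if rng.random() < mask_ratio else tok for tok in tokens]

rng = random.Random(0)
clean = "the cat sat on the mat".split()

partly_masked = corrupt(clean, 0.5, rng)   # an intermediate noise level
fully_masked = corrupt(clean, 1.0, rng)    # end of the forward process
untouched = corrupt(clean, 0.0, rng)       # start of the forward process
```

Training then asks the mask predictor to recover the original tokens at the masked positions, with the unmasked tokens on both sides visible as context.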

By abandoning left-to-right prediction, DLMs function as effective associative memories that capture bidirectional context natively. Models such as LLaDA scale to rival equivalently sized autoregressive models on zero-shot and few-shot in-context learning, and — because they read context both ways — they break the reversal curse, performing consistently regardless of prompt directionality. The historical cost was severe inference latency: generating a sequence requires multiple denoising passes, which made early DLMs slow.

Fast-dLLM is a training-free framework that closes the speed gap with autoregressive models through two mechanisms. A block-wise approximate KV cache partitions generation into blocks and reuses cached states across denoising steps on the principle of "activation similarity" — internal states change little between iterations — via a specialized DualCache for prefix and suffix tokens, without retraining. Confidence-aware parallel decoding then mitigates the parallel-decoding curse — where simultaneously sampling interdependent tokens breaks grammar — by decoding only tokens whose marginal confidence exceeds a threshold. Together these deliver up to a 27.6× throughput increase on long sequences.
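
A single step of confidence-aware parallel decoding might look like the sketch below. It is an illustration under stated assumptions, not the Fast-dLLM code: `predictions` is a hypothetical mapping from position to a `(token, confidence)` pair supplied by the denoiser, and the fallback that commits the single most confident position when nothing clears the threshold is an assumption added so that every step makes progress.

```python
MASK = "[MASK]"

def decode_step(sequence, predictions, threshold):
    """Commit every masked position whose predicted token clears the threshold."""
    masked = [i for i, tok in enumerate(sequence) if tok == MASK]
    if not masked:
        return list(sequence)
    confident = [i for i in masked if predictions[i][1] > threshold]
    if not confident:
        # Assumed fallback: commit only the most confident masked position.
        confident = [max(masked, key=lambda i: predictions[i][1])]
    updated = list(sequence)
    for i in confident:
        updated[i] = predictions[i][0]
    return updated

sequence = ["the", MASK, MASK, "mat"]
predictions = {1: ("cat", 0.97), 2: ("sat", 0.55)}

step_one = decode_step(sequence, predictions, threshold=0.9)
step_two = decode_step(step_one, predictions, threshold=0.9)
```

In a real decoder the predictions would be recomputed between steps, so a position left masked in one pass is re-predicted with the newly committed tokens as context.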

On June 10, 2026, Google released DiffusionGemma, the first diffusion LM from a frontier lab shipped as a downloadable open-weights model (Apache 2.0), built on the Gemma 4 26B Mixture-of-Experts architecture with ~3.8B parameters active at inference and official serving support in vLLM, Transformers, and MLX. It both validates and refines the DLM principles above:

  • Block-autoregressive canvas denoising. A 256-token canvas is generated in parallel; once a block converges it commits to a standard KV cache and the next canvas is conditioned on that history — the productionized form of Fast-dLLM's block-wise caching.
  • Uniform-state diffusion instead of masking. It departs from [MASK]-token corruption, starting from random placeholder tokens and re-noising low-confidence positions — enabling continuous self-correction, replacing an already-generated token if confidence drops later, a capability autoregressive models structurally lack.
  • Asymmetric attention. A causal encoder prefills and caches the prompt while the denoiser applies fully bidirectional attention over the canvas. On Sudoku — every output constrained by distant tokens — a simple recipe lifts the model from ~0% to 80% solve rate.
  • Compute-bound inference. A 256-token parallel workload shifts decoding from bandwidth- to compute-bound, yielding up to 4× faster generation — 1,000+ tokens/sec on an H100, 700+ on an RTX 5090, the quantized model fitting in 18 GB of VRAM.

Google documents the trade-offs candidly: output quality remains below autoregressive Gemma 4, and the throughput edge is strongest for local, low-concurrency inference — in high-QPS cloud serving, where batching already saturates compute, parallel decoding offers diminishing returns. The release confirms this report's routing thesis: diffusion decoding is now a deployable tool for speed-critical, structurally constrained generation, not a wholesale replacement for autoregressive models.

05

From the report · §4

World models and physical AI: simulating reality, not just rendering it.

"World model" became one of the most overloaded terms in AI through late 2025 and 2026 — spurring vast capital, from Yann LeCun's $1.03B seed for AMI Labs to Fei-Fei Li's billion-dollar raise for World Labs. The engineering reality is fragmented into distinct paradigms. A true world model is not a text-to-video generator; formally (via the POMDP frame) it must predict spatial persistence, causal physics, and action-conditioned dynamics — how a state evolves because an action was taken.

Pioneered by LeCun and commercialized via Meta's V-JEPA 2 and AMI Labs, the Joint Embedding Predictive Architecture rejects pixel-level generation, which wastes capacity fitting high-entropy visual noise. JEPA predicts strictly within a low-entropy representation (latent) space: an encoder compresses an observation into an abstract embedding, and a predictor forecasts that embedding's future state conditioned on a proposed action. By discarding irrelevant pixel detail, these models are judged not on visual prettiness but on how well their representations transfer to robotic motion planning, autonomous driving, and industrial control.

Beyond latent prediction, the industry pursues physical simulation through distinct commercial avenues:

The fragmented landscape of "world models"

| Paradigm | Exemplar | Predicts | Judged on |
| --- | --- | --- | --- |
| JEPA (latent) | V-JEPA 2, AMI Labs | Future embeddings, not pixels | Downstream task transfer |
| Spatial generation | World Labs (Marble) | Navigable 3D scenes (Gaussian splats) | Geometry, persistence |
| Action-conditioned | Genie 3, Wayve GAIA-2 | Closed-loop, real-time control | Interactive rollouts (errors compound) |
| Active inference | Verses AXIOM | Minimizes free energy, not reward | Bayesian, non-gradient |

Action-conditioned simulators (Genie 3, GAIA-2) are true closed-loop systems where the output at t+1 becomes the input at t+2; highly interactive, but errors compound over long rollouts because the physics are emergent patterns, not grounded laws. Active inference (AXIOM), rooted in Karl Friston's free-energy principle, is the primary non-deep-learning alternative — minimizing "surprise" via variational Bayesian inference rather than gradient descent.
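
How quickly closed-loop errors compound can be shown with a toy rollout. The dynamics below are hypothetical (a scalar state that grows 2% per step) and the learned model is assumed to be off by 1% per step; real simulators are far richer, but the feedback structure is the same.

```python
def rollout(step, state, n_steps):
    """Closed loop: each predicted state is fed back in as the next input."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state)
        trajectory.append(state)
    return trajectory

def true_step(x):
    return 1.02 * x             # hypothetical ground-truth dynamics

def learned_step(x):
    return 1.02 * 1.01 * x      # same dynamics with a 1% one-step error

truth = rollout(true_step, 1.0, 100)
model = rollout(learned_step, 1.0, 100)
relative_error = [abs(m - t) / t for m, t in zip(model, truth)]
```

A 1% error after one step grows past 100% well before step 100, because each prediction inherits every earlier mistake.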

The critical caution for practitioners: standard video generators such as Sora are correlational, not causal. They generate a plausible future from the training distribution, not the deterministic future conditioned on input — and fail on out-of-distribution physics, e.g. predicting a falling glass ball will bounce rather than shatter. Impressive visual tools, but not foundational world models for physical AI.

06

From the report · §5

Adaptive edge AI: liquid neural networks that keep learning after deployment.

While massive LLMs dominate the cloud, the edge — industrial IoT, drones, robotics, autonomous vehicles — demands extreme parameter efficiency, ultra-low latency, and adaptation to continuous, irregularly sampled data. Liquid Neural Networks, out of MIT CSAIL and commercialized by Liquid AI, were built for exactly these requirements.

LNNs are biologically inspired, modeling the 302-neuron nervous system of the C. elegans roundworm. Where standard networks process in discrete steps with frozen weights, each LNN neuron's hidden state updates continuously via an ordinary differential equation, letting the network adjust its internal parameters while running in deployment. In Liquid Time-Constant networks a small gating network makes the time constant input-dependent — shortening the memory horizon for volatile events, lengthening it for long-term dependencies.
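
A single liquid time-constant neuron can be sketched with an explicit Euler step. The form used here, dx/dt = -(1/tau + f) * x + f * A, follows the published LTC equation, but the gate is simplified to depend on the input only, and the parameter values (`tau`, `A`, `w`, `b`) are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ltc_step(x, inp, dt, tau=1.0, A=1.0, w=4.0, b=-2.0):
    """One explicit-Euler step of a single liquid time-constant neuron."""
    f = sigmoid(w * inp + b)                  # input-dependent gate
    dxdt = -(1.0 / tau + f) * x + f * A
    return x + dt * dxdt

def effective_time_constant(inp, tau=1.0, w=4.0, b=-2.0):
    """The gate shortens the neuron's memory horizon when the input is strong."""
    f = sigmoid(w * inp + b)
    return tau / (1.0 + tau * f)

def simulate(samples, x=0.0):
    """Integrate over (timestamp, input) pairs that need not be evenly spaced."""
    previous_t = samples[0][0]
    for t, inp in samples:
        x = ltc_step(x, inp, dt=t - previous_t)
        previous_t = t
    return x

# Irregularly spaced sensor readings: (time in seconds, input value).
final_state = simulate([(0.0, 0.0), (0.1, 1.0), (0.15, 1.0), (0.4, 0.0)])
```

Because the step size is read from the timestamps, unevenly sampled data needs no resampling, which is the property behind the handling of irregular streams.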

Historically these ODEs had to be integrated with slow, iterative numerical solvers (e.g. Runge-Kutta) executing thousands of micro-steps, a severe latency bottleneck. The Closed-form Continuous-time (CfC) model isolates the core integral in the LTC differential equation and derives an approximate analytical formula for it, implemented as a compact neural layer with bounding gates for stability. CfCs bypass numerical solvers entirely, computing each state transition in a single closed-form evaluation and running inference 1 to 5 orders of magnitude faster than standard LTC models with negligible accuracy loss.

  • Extreme parameter efficiency. On lane-keeping, an LNN reached parity with a 100,000-neuron convolutional network using just 19 liquid neurons — a model drawing under 50 milliwatts, running locally on mobile SoCs without cloud connectivity.
  • Constant memory footprint. Because updates are closed-form and continuous-time, LNNs cache no growing hidden-state backlog; memory stays virtually flat regardless of input length.
  • Native handling of irregular data. Unlike models that demand evenly spaced inputs, LNNs natively adapt to unevenly sampled streams from real-world sensors, ECG monitors, and asynchronous packet analyzers.
  • Out-of-distribution robustness. Agents trained in a summer forest adapt in real time to urban or snowy winter environments without retraining, filtering visual noise that breaks standard LSTM or Transformer models.

The open bottleneck: scaling continuous-time architectures to the multi-billion-parameter language tasks dominated by Transformers and SSMs, alongside immature tooling for deploying continuously adapting weights safely.

07

From the report · §6

Overhauling the perceptron: learnable functions on the edges.

At the most granular level, deep learning's fundamental building block, the Multi-Layer Perceptron, has barely changed in decades. It sits inside the feed-forward layers of every modern LLM, with fixed activation functions on its nodes and learnable scalar weights on its edges. Kolmogorov-Arnold Networks rewrite this elementary structure.

Based on the Kolmogorov-Arnold representation theorem, KANs eliminate fixed node activations and instead place learnable univariate functions directly on the network edges; the nodes simply sum incoming signals. Early KANs used B-splines to parameterize these edge functions, granting exceptional flexibility — often outperforming MLPs at a fraction of the parameter count — and, because B-splines have local control, strong resistance to catastrophic forgetting.
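
The structural change (functions on edges, plain sums on nodes) can be sketched without any training code. In the example below each edge function is a piecewise-linear interpolant over fixed knots, the simplest B-spline case and an assumption made for brevity; the class and function names are hypothetical, and the base activation term used in the original KAN formulation is omitted.

```python
from bisect import bisect_right

class EdgeFunction:
    """A learnable univariate function: values at fixed knots, linearly interpolated."""

    def __init__(self, knots, values):
        self.knots = list(knots)
        self.values = list(values)    # these are the trainable parameters

    def __call__(self, x):
        x = min(max(x, self.knots[0]), self.knots[-1])    # clamp to the grid
        i = min(bisect_right(self.knots, x), len(self.knots) - 1)
        x0, x1 = self.knots[i - 1], self.knots[i]
        weight = (x - x0) / (x1 - x0)
        return (1 - weight) * self.values[i - 1] + weight * self.values[i]

def kan_layer(inputs, edges):
    """edges[j][i] is the function on the edge from input i to output node j."""
    return [sum(phi(x) for phi, x in zip(row, inputs)) for row in edges]

knots = [-1.0, 0.0, 1.0]
absolute = EdgeFunction(knots, [1.0, 0.0, 1.0])    # behaves like |x| on the grid
identity = EdgeFunction(knots, [-1.0, 0.0, 1.0])   # behaves like x on the grid

# One output node with two incoming edges, evaluated at two input points.
at_mixed = kan_layer([0.5, -0.5], [[absolute, identity]])
at_ones = kan_layer([1.0, 1.0], [[absolute, identity]])
```

Changing one entry of `values` only alters the function near its knot, which is the local-control property credited above with resistance to catastrophic forgetting.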

Scaling KANs into LLM foundations exposed two critical hardware problems: B-splines are unoptimized for GPU parallelism (slow inference), and the per-pair function requirement causes computational bloat at billions of parameters. The Kolmogorov-Arnold Transformer (KAT), published at ICLR 2025, replaces a Transformer's MLP layers with optimized KAN layers via three engineered solutions:

  • Rational basis functions. Discarding B-splines for rational functions that compile efficiently in custom CUDA kernels, fully leveraging GPU acceleration.
  • Group KANs. Sharing activation weights across grouped neuron clusters, drastically cutting computational load while preserving expressiveness.
  • Variance-preserving initialization. Stabilizing gradients across dozens of layers, fixing the convergence failures that plagued deep KANs.

With these, the KAT stands as a viable, parameter-efficient successor to the MLP-based Transformer, offering enhanced interpretability and theoretical guarantees of universal approximation.

08

From the report · §7

Fusing logic with learning: neuro-symbolic AI and graph transformers.

Pure neural networks excel at statistical pattern recognition on messy sensory data but lack formal reasoning, transparency, and guaranteed correctness — hence hallucinations. Symbolic AI offers provable correctness and explicit rules but fails on noisy input. Neuro-Symbolic AI (NeSy) orchestrates their convergence — hybrid systems with both the sensory learning of deep networks and the deterministic reasoning of symbolic logic. This is the design commitment at the center of our own work.

  1. Intertwined information exchange. A deep network acts as a sensory organ, outputting discrete symbolic tokens for recognized entities; a deterministic logical reasoner then ingests those symbols to apply rules, cross-reference databases, and deduce.
  2. Hybrid verification pipelines. A generative LLM drafts candidate answers, code, or action sequences; a hard-coded symbolic rule engine then acts as a gatekeeper, verifying compliance, syntax, and factual consistency — neutralizing hallucinations before they reach the user.
  3. Differentiable constraints (Logic Tensor Networks). The most advanced framework compiles symbolic knowledge — written in first-order logic — directly into the network's differentiable loss function, penalizing outputs that violate logical rules during backpropagation.
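
The second pattern, a hybrid verification pipeline, can be sketched as a generator whose drafts must pass a deterministic gate. Everything here is illustrative: the drafts stand in for LLM output, and the rule engine is a hypothetical checker for simple integer arithmetic claims.

```python
import re

CLAIM = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def verify(candidate):
    """Symbolic gatekeeper: accept only well-formed claims that are exactly true."""
    match = CLAIM.match(candidate)
    if match is None:
        return False                  # syntax rule violated
    left, op, right, result = match.groups()
    return OPS[op](int(left), int(right)) == int(result)

def first_verified(candidates):
    """Return the first draft that passes the gate, or None if all fail."""
    return next((c for c in candidates if verify(c)), None)

# Stand-ins for generated drafts: one wrong, one malformed, one correct.
drafts = ["17 * 3 = 41", "seventeen times three", "17 * 3 = 51"]
answer = first_verified(drafts)
```

The gate never consults the generator's confidence: a draft is released only if the rule engine can establish it independently, and returning `None` is preferred to passing an unverified answer through.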

A pillar of symbolic AI is the knowledge graph. Historically, querying KGs with neural networks meant Graph Neural Networks, which at depth suffer two well-known flaws: over-smoothing, where node representations converge toward one another as layers stack, and over-squashing, where information from distant nodes is compressed through narrow message-passing bottlenecks. The field migrated toward Graph Transformers, which apply global self-attention to graph structure so every node attends to every other regardless of distance, bypassing the message-passing bottleneck. Properly tuned classic GNNs nonetheless still match them on many graph-level tasks.

Building on this, models like K-BERT are true neuro-symbolic hybrids, injecting structured knowledge-graph triples directly into the transformer's embedding space during encoding — enriching contextual awareness with deterministic factual constraints in real time, a highly reliable architecture for enterprise applications where accuracy cannot be compromised.

09

From the report · §8

Quantum machine learning: disentangling progress from hype.

Quantum computing is prominent in forecasting and venture portfolios, but the subfield of Quantum Machine Learning needs rigorous disambiguation from hype. As of 2026 the consensus is firm: QML is real but exceedingly narrow and nascent, and broad claims of imminent exponential speedups for generic AI tasks have largely been invalidated by theoretical computer science.

Tempered expectations stem from a wave of algorithmic dequantization results. In 2018, Ewin Tang showed that a classical sampling-based algorithm for recommendation systems runs within a polynomial factor of the best known quantum algorithm, eliminating the presumed exponential quantum advantage. Many QML algorithms derived from the foundational HHL linear-systems algorithm were subsequently dequantized with similar techniques. The standing rule for publication-quality QML is now that any advantage claim must survive benchmarking against an optimized classical sampling-access baseline, and generic tabular classification and sequential NLP show no demonstrated benefit from QML today.
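
To make "sampling access" concrete, the sketch below shows one primitive commonly used in dequantized algorithms: length-squared sampling, where index i of a vector x is drawn with probability x_i^2 / ||x||^2. It estimates an inner product from samples instead of reading every entry. The vectors, sample count, and function name are assumptions chosen for illustration, and this is a toy instance rather than any published algorithm in full.

```python
import random
from typing import List


def estimate_inner_product(
    x: List[float], y: List[float], num_samples: int, rng: random.Random
) -> float:
    """Estimate <x, y> using length-squared samples from x.

    Index i is drawn with probability x[i]**2 / ||x||**2. The quantity
    y[i] * ||x||**2 / x[i] then has expectation sum_i x[i] * y[i].
    Zero entries of x have zero weight, so they are never drawn.
    """
    norm_sq = sum(v * v for v in x)
    weights = [v * v for v in x]
    indices = rng.choices(range(len(x)), weights=weights, k=num_samples)
    total = sum(y[i] * norm_sq / x[i] for i in indices)
    return total / num_samples


x = [3.0, 1.0, 2.0, 1.0]
y = [1.0, 2.0, 0.5, 4.0]
exact = sum(a * b for a, b in zip(x, y))

rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = estimate_inner_product(x, y, num_samples=20_000, rng=rng)
print(f"exact = {exact}, estimate = {estimate:.3f}")
```

The estimator's standard deviation is at most ||x|| ||y|| / sqrt(num_samples), so its accuracy depends on the norms and the sample count rather than on the dimension. That dimension-independence is what a quantum advantage claim has to beat.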

QML is finding vital niches where the data is natively quantum, probabilistic, or massively combinatorial:

  • Chemistry and materials. Roche, with Quantinuum, uses Variational Quantum Eigensolvers on Quantinuum's EUMEN platform for early-stage drug discovery, while Sanofi pursues molecular simulation with SandboxAQ and Pasqal.
  • Financial optimization. HSBC, with IBM, demonstrated quantum-enabled algorithmic bond trading on production data — up to a 34% improvement in predicting fill probability for European corporate bond RFQs; JPMorgan runs pilots in certified quantum randomness and small-scale portfolio optimization.

For most enterprises, the primary AI-quantum intersection in 2026 is defensive: quantum preparedness. Driven by NIST's finalized post-quantum cryptography standards (FIPS 203/204/205) with migration deadlines through the early 2030s, organizations use classical AI to audit cryptographic infrastructure and plan the migration ahead of future fault-tolerant machines capable of breaking today's public-key cryptography. QML remains firmly in the proof-of-concept phase: a preparatory step in the hardware/algorithm co-design required before practical quantum utility arrives, which is not expected before next decade.
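
The audit step described above ends in a classification of each cryptographic asset. The sketch below covers only that final rule-based step, under loud assumptions: the inventory, asset names, and purpose labels are hypothetical, and discovering the assets in the first place (the part where classical AI would help) is out of scope. The standard names follow the FIPS documents cited in the text: ML-KEM (FIPS 203), ML-DSA (FIPS 204), and SLH-DSA (FIPS 205).

```python
from dataclasses import dataclass
from typing import Dict, List

# NIST post-quantum standards, keyed by algorithm name.
PQC_STANDARDS: Dict[str, str] = {
    "ML-KEM": "FIPS 203",
    "ML-DSA": "FIPS 204",
    "SLH-DSA": "FIPS 205",
}

# Public-key families that Shor's algorithm would break on a fault-tolerant machine.
SHOR_VULNERABLE = {"RSA", "DSA", "DH", "ECDH", "ECDSA"}

# Candidate replacements by purpose (purpose labels are assumptions of this sketch).
REPLACEMENTS: Dict[str, List[str]] = {
    "key-establishment": ["ML-KEM"],
    "signature": ["ML-DSA", "SLH-DSA"],
}


@dataclass
class Asset:
    name: str
    algorithm: str
    purpose: str


def audit(assets: List[Asset]) -> Dict[str, str]:
    """Label each asset as already migrated, needing migration, or needing review."""
    findings: Dict[str, str] = {}
    for asset in assets:
        if asset.algorithm in PQC_STANDARDS:
            findings[asset.name] = f"ok ({PQC_STANDARDS[asset.algorithm]})"
        elif asset.algorithm in SHOR_VULNERABLE:
            options = REPLACEMENTS.get(asset.purpose, [])
            target = " or ".join(options) if options else "no mapped replacement"
            findings[asset.name] = f"migrate to {target}"
        else:
            findings[asset.name] = "review manually"
    return findings


# Hypothetical inventory, invented for the example.
inventory = [
    Asset("vpn-gateway", "ECDH", "key-establishment"),
    Asset("code-signing", "RSA", "signature"),
    Asset("new-tls-pilot", "ML-KEM", "key-establishment"),
    Asset("disk-encryption", "AES-256", "symmetric"),
]

findings = audit(inventory)
for name, finding in findings.items():
    print(f"{name}: {finding}")
```

Anything outside the two known sets falls through to manual review, so the sketch never labels an algorithm safe without an explicit entry saying so.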

That is the frontier as we read it — and the ground the lab behind this report works on. Get to know who wrote it.

    Self-improvement works exactly insofar as a system can verify better than it can generate.

    The unifying insight of the limits literature

    The manifesto

    This is not a survey we read. It is the ground we work on.

    At Amadeus AI, our research program treats neuro-symbolic integration not as a feature but as a design commitment: pairing neural learning with deterministic, verifiable reasoning so that outputs in high-stakes domains are auditable by construction rather than by hope. The reasoning shift this report describes — from pretraining scale toward post-training and inference-time deliberation — is the axis our work is built on, oriented toward specialist, verifiable reasoning rather than general-purpose fluency.

    And on recursive self-improvement, we hold the same line this article draws: the demonstrated systems gain on a single loop and saturate without external grounding, and we treat that as the central research question, not a marketing horizon. Our work concentrates on the grounding, verification, and symbolic structure that determine whether self-improvement compounds — or decays. We can take these positions because we operate at the edge of this work in our region: Amadeus AI is among the most prolific AI-publishing startups in Latin America. We are not waiting to find out whether intelligence improves itself. We are building the structure that decides whether it can.

    Selected work

    Published research from the Amadeus AI team.

    A selection of our published work across language modelling, speech, and generative systems — much of it advancing AI for Brazilian Portuguese and other lower-resourced settings. Four are highlighted here; the full record follows.

    I Survey

    Large Language Models in Brazilian Portuguese: A Chronological Survey

    A chronological survey tracing how large language models for Brazilian Portuguese have developed, and where the field now stands.

    J. Braz. Computer Society · 2026
    II AI Safety

    CURUPIRA: Clever Guard for Harm & Linguistic Prompt Mitigation in Brazilian Portuguese

    A guardrail model that detects and mitigates harmful or adversarial prompts in Brazilian Portuguese.

    ACL · PROPOR · 2026
    III Encoders

    JabuticaBERT: Modern Portuguese Encoders from Scratch with RTD & Long-Context Training

    Portuguese-native encoder models trained from scratch with replaced-token detection and long-context objectives.

    ACL · PROPOR · 2026
    IV Corpus

    Jabuticaba: The Largest Commercial Corpus for LLMs in Portuguese

    A 139-billion-token dataset — the largest commercial Portuguese corpus assembled for training large language models.

    JBCS · 139B tokens · 2025

    The frontier described here is not a destination for us — it is our starting line.

    If you are building, funding, or researching at this edge — and you believe high-stakes AI should be auditable by construction rather than by hope — we want to hear from you.

    The next chapter of this work will not be written alone.