
Beyond Transformers: Mamba, RWKV, and State-Space Models Challenging the Dominant Architecture

Technical comparison of emerging transformer alternatives including Mamba's selective state spaces, RWKV's linear attention, and hybrid architectures that combine the best of both worlds.

The Transformer Bottleneck

Transformers have dominated language modeling since 2017, but their quadratic attention mechanism creates a fundamental scaling problem. Processing a sequence of length N requires O(N^2) computation and memory for the self-attention step. This means doubling the context length quadruples the cost. At 128K+ token context windows, this cost becomes prohibitive for many applications.

Several alternative architectures are emerging that achieve linear or near-linear scaling with sequence length while approaching transformer-quality performance.

Mamba and Selective State Spaces

Mamba, introduced by Albert Gu and Tri Dao in December 2023, is the most prominent transformer alternative. It builds on Structured State Space Models (S4) with a critical innovation: selective state spaces that allow the model to dynamically filter information based on input.


How Mamba Works

Traditional state space models process sequences through a fixed linear recurrence:

h_t = A * h_{t-1} + B * x_t    (state update)
y_t = C * h_t                  (output)

Where A, B, and C are fixed matrices. Mamba makes B, C, and the discretization step size input-dependent, allowing the model to selectively retain or forget information based on the current token.
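The selective update can be sketched in a few lines of NumPy. This is a toy single-channel version with made-up dimensions, not the actual Mamba block (which adds a convolution, gating, and a hardware-aware parallel scan); the projections `w_B`, `w_C`, and `w_d` are illustrative stand-ins for the learned input-dependent parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, seq_len = 4, 16

A = -np.abs(rng.standard_normal(d_state))   # fixed diagonal A (decaying)
w_B = rng.standard_normal(d_state) * 0.5    # makes B input-dependent
w_C = rng.standard_normal(d_state) * 0.5    # makes C input-dependent
w_d = 0.5                                   # makes the step size input-dependent

x = rng.standard_normal(seq_len)            # a single 1-D input channel
h = np.zeros(d_state)
y = np.empty(seq_len)

for t in range(seq_len):
    delta = np.log1p(np.exp(w_d * x[t]))    # softplus -> positive step size
    B_t = w_B * x[t]                        # selective B: depends on input
    C_t = w_C * x[t]                        # selective C: depends on input
    A_bar = np.exp(delta * A)               # zero-order-hold discretization
    h = A_bar * h + delta * B_t * x[t]      # state update
    y[t] = C_t @ h                          # output

print(y.shape)  # (16,)
```

Because `delta`, `B_t`, and `C_t` are recomputed from each token, the model can effectively reset or preserve its state on the fly, which a fixed-matrix SSM cannot do.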

Performance Characteristics

  • Linear time complexity: O(N) instead of O(N^2), enabling efficient processing of very long sequences
  • No KV cache: Mamba uses a fixed-size state instead of a growing KV cache, making inference memory constant regardless of sequence length
  • Hardware-efficient: The selective scan operation is implemented as a custom CUDA kernel that achieves high GPU utilization
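The constant-memory claim is easy to verify with back-of-envelope arithmetic. The sketch below compares a transformer's growing KV cache against a fixed SSM state, using hypothetical but plausible model dimensions (32 layers, 32 heads of dimension 128, fp16):

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128,
                   bytes_per_elem=2):
    # keys + values stored per token, per layer
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem * seq_len

def ssm_state_bytes(n_layers=32, d_model=4096, d_state=16, bytes_per_elem=2):
    # one fixed-size state per layer, independent of sequence length
    return n_layers * d_model * d_state * bytes_per_elem

for n in (4_096, 131_072):
    print(f"{n:>7} tokens: KV cache {kv_cache_bytes(n) / 2**30:6.1f} GiB, "
          f"SSM state {ssm_state_bytes() / 2**30:6.3f} GiB")
```

With these assumed dimensions, the KV cache reaches 64 GiB at a 128K context, while the SSM state stays at a few megabytes regardless of length.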

Mamba-2 and Improvements

Mamba-2, released in mid-2024, reformulated the selective state space as a form of structured matrix computation, connecting it theoretically to attention. This enabled:

  • 2-8x faster training than the original Mamba
  • Better parallelization across GPUs during training
  • Clearer theoretical understanding of what the model learns

RWKV: Linear Attention for Language

RWKV (pronounced "RwaKuv") combines the parallelizable training of transformers with the efficient inference of RNNs. It achieves this through a linear attention mechanism that avoids the softmax operation responsible for transformers' quadratic cost.


Architecture

RWKV uses two key mechanisms:

  • Time mixing: A linear interpolation between the current input and previous states, weighted by learned decay factors
  • Channel mixing: A feed-forward layer similar to transformers but applied with recurrent state

During training, RWKV processes all tokens in parallel (like a transformer). During inference, it operates as an RNN, processing one token at a time with constant memory and compute.
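This parallel-training/recurrent-inference duality can be demonstrated with a toy decayed linear attention in the spirit of RWKV's time mixing. Real RWKV uses per-channel learned decays and a bonus term for the current token; the scalar decay `w` here is a simplification:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, dim = 8, 4
k = rng.standard_normal((seq_len, dim))   # "key" activations
v = rng.standard_normal((seq_len, dim))   # "value" activations
w = 0.9                                   # learned decay factor (scalar here)

# Recurrent form: constant-size state, one token at a time (inference mode).
num, den = np.zeros(dim), np.zeros(dim)
out_recurrent = []
for t in range(seq_len):
    num = w * num + np.exp(k[t]) * v[t]
    den = w * den + np.exp(k[t])
    out_recurrent.append(num / den)
out_recurrent = np.array(out_recurrent)

# Parallel form: all positions computed at once (training mode).
decay = w ** (np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :])
mask = np.tril(np.ones((seq_len, seq_len)))   # causal mask
weights = decay * mask                        # weight w^(t-s) for s <= t
out_parallel = (weights @ (np.exp(k) * v)) / (weights @ np.exp(k))

print(np.allclose(out_recurrent, out_parallel))  # True
```

Both forms compute the same decayed weighted average over past tokens; the recurrent one needs only O(1) memory per step, while the parallel one keeps GPUs saturated during training.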

RWKV-5 and RWKV-6 (Eagle and Finch)

The recent RWKV generations introduce data-dependent linear recurrence, similar in spirit to Mamba's selective mechanism:

  • Eagle (RWKV-5): Replaces the scalar-valued recurrent state with multi-headed, matrix-valued states for greater expressivity
  • Finch (RWKV-6): Adds data-dependent token shifting and dynamic decay, letting the recurrence adapt to the input
  • Models available up to 14B parameters with competitive performance against similarly-sized transformers

Hybrid Architectures

The most practical approach emerging in 2025-2026 is hybrid architectures that combine transformer attention layers with linear-complexity layers.

Jamba (AI21)

Jamba interleaves Mamba layers with transformer attention layers and adds mixture-of-experts (MoE) for parameter efficiency. The result:

  • 256K token context window with manageable memory
  • Attention layers handle tasks requiring precise token-level recall
  • Mamba layers handle long-range dependencies efficiently
  • MoE keeps active parameter count reasonable
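A Jamba-style layer schedule can be sketched as a simple interleaving pattern. The function below is hypothetical; the ratios roughly follow Jamba's reported design (one attention layer per eight, MoE on alternating layers), not AI21's actual configuration:

```python
def hybrid_schedule(n_layers=32, attn_every=8, moe_every=2):
    """Return a (mixer, mlp) type for each layer in a hybrid stack."""
    layers = []
    for i in range(n_layers):
        mixer = "attention" if (i + 1) % attn_every == 0 else "mamba"
        mlp = "moe" if (i + 1) % moe_every == 0 else "dense"
        layers.append((mixer, mlp))
    return layers

schedule = hybrid_schedule()
print(sum(1 for mixer, _ in schedule if mixer == "attention"))  # 4 of 32 layers
```

With only 4 of 32 layers paying the quadratic attention cost, the KV cache shrinks by roughly 8x while the attention layers still provide exact token-level recall.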

NVIDIA's Hybrid Approach

NVIDIA has explored architectures that use Mamba for the majority of layers with strategically placed attention layers for tasks requiring exact retrieval (like copying specific strings from the context). This gives near-linear scaling for most of the model while preserving the capabilities that pure state-space models struggle with.

Where Non-Transformer Models Struggle

Despite their efficiency advantages, transformer alternatives have consistent weaknesses:

  • In-context learning: Transformers excel at learning new patterns from examples provided in the prompt. SSMs are weaker at this, likely because attention's O(N^2) comparison mechanism is genuinely useful for matching patterns across the context.
  • Exact recall: Tasks like "What was the third word in the second paragraph?" require precise attention to specific positions. Linear models tend to blur positional information.
  • Established ecosystem: The transformer ecosystem (optimization tools, deployment frameworks, fine-tuning methods) is vastly more mature.

Practical Implications

For most application developers, the LLM's underlying architecture is invisible: you call an API and get text back. Architecture matters when:

  • Self-hosting long-context models: Linear models require dramatically less memory for long sequences
  • Edge deployment: Mamba's constant-memory inference fits devices with limited RAM
  • Streaming applications: RNN-style inference (one token at a time, constant compute) suits real-time applications
  • Cost optimization: Linear scaling means 10x longer contexts cost 10x more, not 100x more
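The cost-scaling point is simple arithmetic, shown here for concreteness:

```python
# Relative compute when the context grows 10x (illustrative).
base, scaled = 10_000, 100_000

quadratic = (scaled / base) ** 2   # attention-style O(N^2)
linear = scaled / base             # SSM-style O(N)

print(f"quadratic: {quadratic:.0f}x, linear: {linear:.0f}x")  # 100x vs 10x
```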

The future likely involves hybrid architectures that combine attention where it matters most with linear layers for efficiency. Pure transformer dominance is ending, but transformers are not going away.

Sources: Mamba Paper - arXiv:2312.00752 | RWKV Project | Jamba Architecture - AI21 Labs

