ai-benchmark/tests/summarization/venturebeat.com_orchestration_mits-new-recursive-framework-lets-llms-process-10-million-tokens-without.txt

MIT’s new ‘recursive’ framework lets LLMs process 10 million tokens without context rot
Ben Dickson
January 20, 2026

Recursive language models (RLMs) are an inference technique developed by researchers at MIT CSAIL that treat long prompts as an external environment to the model. Instead of forcing the entire prompt into the model's context window, the framework allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the text.

Rather than expanding context windows or summarizing old information, the MIT team reframes long-context reasoning as a systems problem. By letting models treat prompts as something they can inspect with code, recursive language models allow LLMs to reason over millions of tokens without retraining. This offers enterprises a practical path to long-horizon tasks like codebase analysis, legal review, and multi-step reasoning that routinely break today’s models.

Because the framework is designed as a wrapper around existing models, it can serve as a drop-in replacement for applications that make direct calls to LLMs.

The LLM context problem

While frontier models are becoming increasingly sophisticated at reasoning, their ability to process massive amounts of information is not scaling at the same rate. This bottleneck is driven by two distinct limitations: the hard physical constraint on how much text a model can process at once (context length) and "context rot."

The challenge, the researchers argue, is whether it’s possible to scale the effective context size of general-purpose LLMs by orders of magnitude without retraining them. This capability is becoming increasingly important for enterprise applications, where LLMs are adopted for long-horizon tasks requiring the processing of millions of tokens, a challenge Zhang argues can’t be solved by simply expanding context windows.

"There is an entropy argument that implies you need exponentially more data samples as you increase the effective context window size," Alex Zhang, a co-author of the paper, told VentureBeat.

Current approaches to extending context often rely on compaction, where the model summarizes older parts of the conversation to free up space. However, this method fails for tasks requiring random access to specific details located in earlier parts of the prompt.

How RLMs work

The concept behind RLMs is drawn from "out-of-core" algorithms used in classical computing. These algorithms are designed to process datasets too large to fit into a computer's main memory by keeping the data on a hard drive and fetching only the necessary chunks as needed.

RLMs apply this logic to generative AI. Instead of feeding a long prompt directly into the neural network, the framework loads the text as a string variable inside a Python coding environment. The LLM is given general context about the data (such as the total character count) but does not "see" the text initially.

Once the prompt is stored as a variable, the LLM acts as a programmer. It writes Python code to interact with the external variable, using standard commands to peek into the data. For example, the model might use regular expressions to search for specific keywords like "Chapter 1" or "financial results."

When the code execution finds a relevant snippet, the RLM pulls only that specific chunk into its active context window for analysis. If the prompt is a massive book, for instance, the LLM might write a loop that identifies chapter boundaries and then triggers a sub-call to summarize each chapter individually.

Figure: RLM architecture (source: arXiv)

The architecture typically involves two agents. A "root language model," often a capability-heavy model like GPT-5, acts as the orchestrator. It plans the approach, writes the code, and manages the data flow within the REPL environment.
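
The chapter-by-chapter example described above can be sketched in a few lines of Python. This is only an illustration of the kind of code a root model might write inside the REPL, not the researchers' implementation: the names `split_chapters`, `sub_call`, and `summarize_book` are hypothetical, and `sub_call` is a local stub standing in for a real call to a worker model.

```python
import re


def sub_call(instruction: str, snippet: str) -> str:
    """Hypothetical stand-in for a worker-model call.

    A real system would send `instruction` and `snippet` to an LLM;
    this stub only reports what it was given.
    """
    lines = snippet.strip().splitlines()
    title = lines[0] if lines else ""
    return f"{instruction}: {title} ({len(snippet)} chars)"


def split_chapters(prompt: str) -> list[str]:
    """Split the text at lines that begin with 'Chapter <number>'.

    Text before the first heading is ignored in this sketch. If no
    heading is found, the whole prompt is returned as a single chunk.
    """
    starts = [m.start() for m in re.finditer(r"^Chapter \d+", prompt, flags=re.MULTILINE)]
    if not starts:
        return [prompt]
    bounds = starts + [len(prompt)]
    return [prompt[a:b] for a, b in zip(bounds, bounds[1:])]


def summarize_book(prompt: str) -> list[str]:
    """Issue one sub-call per chapter, so only one chunk is in play at a time."""
    return [sub_call("Summarize", chapter) for chapter in split_chapters(prompt)]


# The long prompt lives in a variable; the root model starts from metadata only.
prompt = "Chapter 1\nIt was a dark night.\nChapter 2\nThe financial results arrived.\n"
print("total characters:", len(prompt))
for line in summarize_book(prompt):
    print(line)
```

The point of the sketch is the division of labor: the full text never enters a model call, only the chunk that the code has isolated.
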
A "recursive language model," often a faster and cheaper model, acts as the worker. The root LM calls this worker to process the specific text snippets isolated by the code. Because the prompt resides in the environment's memory rather than the model's context window, the system can handle inputs far larger than the model's training limit.

Importantly, to the end-user, the RLM behaves exactly like a standard model: It accepts a string and returns an answer. This allows enterprise teams to swap standard API calls for RLMs. For developers looking to experiment, the RLM code is currently available on GitHub.

"A key argument for RLMs is that most complex tasks can be decomposed into smaller, 'local' sub-tasks," Zhang said. "However, how to perform this context/problem decomposition is non-trivial, and the model must be capable of performing this."

RLMs in action

To validate the framework, the researchers tested RLMs against base models and other agentic approaches like CodeAct and summary agents across a variety of long-context tasks, including retrieval and multi-hop question answering.

The results demonstrated strong performance gains at the 10 million+ token scale. On BrowseComp-Plus, a benchmark involving inputs of 6 to 11 million tokens, standard base models failed completely, scoring 0%. In contrast, the RLM powered by GPT-5 achieved a score of 91.33%, significantly outperforming the Summary Agent (70.47%) and CodeAct (51%).

The framework also excelled at tasks with high computational complexity. On OOLONG-Pairs, an information-dense reasoning benchmark where the difficulty scales quadratically with input length, base GPT-5 models failed catastrophically with a score of just 0.04%. The RLM achieved an F1 score (a balanced measure of precision and recall) of 58%, demonstrating emergent capabilities to handle dense tasks that paralyze standard models.
Similarly, on code understanding tasks (CodeQA benchmark), the RLM more than doubled the performance of the base GPT-5 model, jumping from 24% to 62%.

Figure: RLM maintains its performance even after it hits the context window limit of the underlying model (source: arXiv)

Regarding the context rot problem, the data showed that while the base GPT-5 performance degrades rapidly as task complexity increases, RLM performance holds steady, consistently outperforming the base model on contexts longer than 16,000 tokens.

Despite the increased complexity of the workflow, RLMs often maintained comparable or lower average costs than the baselines. On the BrowseComp-Plus benchmark, the RLM was up to three times cheaper than the summarization baseline.

However, the researchers noted that while median costs are low, RLM trajectories are "long-tailed." Outlier runs can become expensive if the model gets stuck in loops or performs redundant verifications. While GPT-5 was conservative in its sub-calls, the open-source Qwen3-Coder model sometimes attempted thousands of sub-calls for simple tasks.

"Today, you likely will have to implement your own guardrails and logic to control RLM behavior," Zhang said. However, he hypothesizes that future models could be trained to manage their own compute budgets more effectively. Companies like Prime Intellect are planning to integrate RLM into the training process of models, possibly addressing the edge cases where the model’s inference budget spikes.

For enterprise architects deciding where to place their bets, the RLM framework offers a new tool for handling information-dense problems.

"I think RLMs are still extremely useful for chatbots (think long chat histories), but ultimately they argue for an alternative way of using LMs," Zhang said. "I think RLMs work in tandem with standard retrieval methods like RAG; they do not serve as a replacement, and can be used in different settings or together."
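
As one illustration of the kind of guardrail Zhang describes, a team could wrap the worker model in a small budget-enforcing layer. The sketch below is an assumption about how that might look, not part of the published RLM code: `BudgetedWorker`, `SubCallBudgetExceeded`, and `max_calls` are hypothetical names, and any callable that maps a text snippet to a string can play the role of the worker.

```python
from typing import Callable


class SubCallBudgetExceeded(RuntimeError):
    """Raised when a run tries to make more sub-calls than allowed."""


class BudgetedWorker:
    """Wraps a worker callable with a hard cap on sub-calls and a result cache."""

    def __init__(self, worker: Callable[[str], str], max_calls: int) -> None:
        self._worker = worker
        self.max_calls = max_calls
        self.calls_made = 0
        self._cache: dict[str, str] = {}

    def __call__(self, snippet: str) -> str:
        # A repeated snippet (a redundant verification) is answered from the
        # cache and does not count against the budget.
        if snippet in self._cache:
            return self._cache[snippet]
        # Stop runaway trajectories before they issue another paid call.
        if self.calls_made >= self.max_calls:
            raise SubCallBudgetExceeded(
                f"sub-call budget of {self.max_calls} exhausted"
            )
        self.calls_made += 1
        result = self._worker(snippet)
        self._cache[snippet] = result
        return result


# Usage with a trivial stand-in worker (a real one would call an LLM).
worker = BudgetedWorker(lambda text: text.upper(), max_calls=2)
print(worker("first chunk"))
print(worker("first chunk"))   # served from the cache
print(worker.calls_made)
```

A hard cap trades completeness for predictability: a run that hits the limit fails loudly instead of quietly accumulating cost, which fits the long-tailed cost profile the researchers report.
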
==============
MIT has developed a new "recursive" framework that lets LLMs process 10 million tokens without context rot. The framework lets an LLM programmatically examine, decompose, and recursively call itself over snippets of the text, rather than simply expanding the context window or summarizing old information. This gives enterprises a practical way to handle long-horizon tasks such as codebase analysis, legal review, and multi-step reasoning, which today's models cannot handle. Because the framework is designed to work with existing models, it can serve as a replacement in applications that call LLMs directly.