News Summary for February 23, 2026

Summary

This week’s AI news is dominated by advances in agentic AI capabilities and by infrastructure challenges. Anthropic’s Claude Opus 4.6 reached a major benchmark milestone, achieving 50% success on multi-hour expert ML tasks, a strong signal that AI agents are approaching practical autonomy for complex knowledge work. Meanwhile, OpenAI faces serious infrastructure headwinds: the $500B Stargate project remains stalled, forcing a scramble for alternative compute through Oracle, CoreWeave, and multi-cloud partnerships. On the architectural front, Andrej Karpathy introduced “Claws” as a new paradigm for persistent AI agents, shifting the conversation from agents-as-scripts to agents-as-services with always-on scheduling and cross-session context. Finally, security concerns around Model Context Protocol implementations and production incidents caused by AI coding tools highlight the growing pains of deploying agentic AI at scale.

Top 3 Articles

1. Claude Opus 4.6 Achieves 50% on Multi-Hour Expert ML Tasks (METR Benchmark)

Source: Reddit (r/MachineLearning)

Date: February 21, 2026

Detailed Summary:

METR (Model Evaluation and Threat Research) has updated its task horizon benchmark, revealing that Anthropic’s Claude Opus 4.6 now achieves 50% success on complex multi-hour ML tasks—including challenges like “fix complex bug in ML research codebase.”

The METR methodology measures AI autonomous capabilities using a “time horizon” framework, evaluating success rates on tasks calibrated by how long they take human experts to complete. The benchmark suite (TH1.1) comprises 228 tasks spanning software engineering, machine learning, and cybersecurity, ranging from under 5 minutes to over 16 hours for human experts.
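To make the “time horizon” idea concrete, here is a minimal sketch in the spirit of that framework: fit a logistic curve of success against the log of human task time, then read off the task length at which predicted success is 50%. The function names, the sample results, and the plain gradient-ascent fit are all assumptions for illustration; METR’s actual fitting procedure and data may differ.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_time_horizon(results, lr=0.3, steps=3000):
    """Estimate the 50% time horizon, in minutes.

    results: list of (human_minutes, succeeded) pairs.
    Fits p(success) = sigmoid(a + b * (log2(minutes) - mean)) by
    gradient ascent on the log-likelihood, then solves for p = 0.5.
    """
    xs = [math.log2(minutes) for minutes, _ in results]
    ys = [1.0 if ok else 0.0 for _, ok in results]
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]  # centering speeds up convergence
    n = len(xs)
    a = b = 0.0
    for _ in range(steps):
        errors = [y - sigmoid(a + b * x) for x, y in zip(centered, ys)]
        a += lr * sum(errors) / n
        b += lr * sum(e * x for e, x in zip(errors, centered)) / n
    # p = 0.5 where a + b * (x - mean_x) = 0, i.e. x = mean_x - a / b.
    return 2 ** (mean_x - a / b)


# Hypothetical results: mostly successes on short tasks, failures on long ones.
demo_results = [
    (2, True), (4, True), (8, True), (16, True), (32, False),
    (64, True), (128, False), (256, False), (512, False), (1024, False),
]
horizon_minutes = fit_time_horizon(demo_results)
```

With these made-up results the estimated horizon falls between the 32-minute and 64-minute tasks, which is where successes and failures start to mix.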

Claude Opus 4.6’s demonstrated capabilities include: 76% accuracy on 8-needle 1M MRCR v2 (compared to 18.5% for Sonnet 4.5), the highest score on Terminal-Bench 2.0, and 72.9% on SWE-bench Verified. Partners report success on codebase migrations, debugging unfamiliar codebases, and multi-step planning without hand-holding.

METR has documented an exponential trend in which AI task completion horizons double approximately every 7 months, and possibly faster: the doubling time is reported as 89 days when measured since 2024. If this trajectory continues, AI agents capable of completing week-long projects autonomously could emerge within 2-4 years.
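The extrapolation behind that estimate is simple doubling arithmetic: the time to grow from a current horizon to a target horizon is the doubling time multiplied by the base-2 logarithm of their ratio. The sketch below shows the calculation; the function name and the 4-hour starting horizon are assumptions chosen for illustration, not figures from METR.

```python
import math


def months_until(target_hours, current_hours, doubling_months):
    """Months needed for the horizon to grow from current to target,
    assuming it keeps doubling every `doubling_months` months."""
    if min(target_hours, current_hours, doubling_months) <= 0:
        raise ValueError("all inputs must be positive")
    return doubling_months * math.log2(target_hours / current_hours)


# Hypothetical starting point: a 4-hour horizon today (an assumption).
work_week = months_until(40, 4, doubling_months=7)       # 40 working hours
calendar_week = months_until(168, 4, doubling_months=7)  # 7 x 24 hours
```

Under the assumed 4-hour starting point and a 7-month doubling time, a 40-hour work week is roughly 23 months away and a 168-hour calendar week a little over three years away, which is consistent with the 2-4 year range quoted above. A shorter doubling time shrinks both figures proportionally.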

Key implications for developers: The 50% success rate means AI can handle complex multi-hour debugging or ML research tasks independently about half the time—enabling significant productivity gains while still requiring human oversight. Development patterns are shifting toward longer-running autonomous workflows with built-in verification loops, graceful degradation, and human escalation points. The benchmark treats AI as a capable but probabilistically unreliable collaborator rather than a deterministic tool.
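One way to picture a workflow with a verification loop and a human escalation point is the sketch below. It is an illustrative pattern only: `run_with_verification`, `Outcome`, and the callables passed in are hypothetical names, and a real system would plug in an agent call and a test suite or reviewer check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    status: str            # "verified" or "escalated"
    result: Optional[str]  # the accepted result, or None if escalated
    attempts: int          # how many attempts were made


def run_with_verification(
    attempt: Callable[[int], str],
    verify: Callable[[str], bool],
    max_attempts: int = 3,
) -> Outcome:
    """Run an unreliable step until an independent check accepts its
    output, and hand off to a human once the attempt budget is spent."""
    for n in range(1, max_attempts + 1):
        candidate = attempt(n)   # e.g. ask an agent to fix a bug
        if verify(candidate):    # e.g. run the test suite on the patch
            return Outcome("verified", candidate, n)
    # Graceful degradation: report failure instead of shipping unverified work.
    return Outcome("escalated", None, max_attempts)
```

The key design choice is that the check is separate from the agent, so a plausible-looking but wrong result is caught by `verify` rather than trusted.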

2. Claws Are Now a New Layer on Top of LLM Agents

Source: Hacker News (via Andrej Karpathy on X/Twitter)

Date: February 21, 2026

Detailed Summary:

Andrej Karpathy has introduced “Claws” as a new evolutionary layer in the AI stack, sitting on top of LLM agents. As Karpathy defines it: “Just like LLM agents were a new layer on top of LLMs, Claws are now a new layer on top of LLM agents, taking the orchestration, scheduling, context, tool calls and a kind of persistence to a next level.”

The key architectural distinction is persistence. Unlike traditional agents that execute on-demand and terminate, Claws are designed as always-on services with continuous runtime on dedicated hardware (Mac Mini, home servers), autonomous scheduling without explicit prompts, cross-session context persistence, MCP-based communication for tool integration, and multi-agent orchestration.

Traditional patterns like ReAct and function calling operate in a request-response paradigm. Claws fundamentally differ by treating the agent as a service rather than a script—designed for continuous operation, self-scheduling, and cross-session state management.
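The service-versus-script distinction can be sketched in a few lines: the agent below wakes itself on a schedule and keeps its context on disk, so a restarted process picks up where the previous one stopped. This is a toy illustration, not how OpenClaw, NanoClaw, or any other Claw implementation works; the class name, state layout, and file path are all hypothetical.

```python
import json
import tempfile
import time
from pathlib import Path


class PersistentAgent:
    """Toy always-on agent: self-scheduled wake-ups plus on-disk state."""

    def __init__(self, state_path, interval_s):
        self.state_path = Path(state_path)
        self.interval_s = interval_s
        self.state = self._load()

    def _load(self):
        # Cross-session context: reuse earlier state if it exists.
        if self.state_path.exists():
            return json.loads(self.state_path.read_text())
        return {"runs": 0, "notes": []}

    def _save(self):
        self.state_path.write_text(json.dumps(self.state))

    def tick(self, now):
        # A real system would invoke an LLM agent and its tools here.
        self.state["runs"] += 1
        self.state["notes"].append(f"woke at {now:.0f}")
        self._save()

    def serve(self, max_ticks=None, sleep=time.sleep, clock=time.time):
        # Runs until stopped; max_ticks exists only so demos can terminate.
        ticks = 0
        while max_ticks is None or ticks < max_ticks:
            self.tick(clock())
            ticks += 1
            sleep(self.interval_s)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "state.json"
        no_sleep = lambda seconds: None  # skip real waiting in the demo
        PersistentAgent(path, interval_s=60).serve(max_ticks=2, sleep=no_sleep)
        restarted = PersistentAgent(path, interval_s=60)  # simulated restart
        restarted.serve(max_ticks=1, sleep=no_sleep)
        return restarted.state["runs"]
```

In the demo, the second agent instance continues counting from the first instance’s saved state rather than starting from zero, which is the property a request-response script lacks.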

Karpathy has expressed skepticism about the security of running OpenClaw’s 400K lines of “vibe coded” code with exposed API keys, citing reports of RCE vulnerabilities and supply chain attacks. He highlights NanoClaw (~4,000 lines) as appealing because it “fits into both my head and that of AI agents,” which makes it auditable. NanoClaw also introduces a novel configuration paradigm in which configuration is done through “skills” (instructions that modify code) rather than traditional config files.

Key implications for developers: This signals a shift toward agent-as-infrastructure thinking. Developers must design agents for continuous operation rather than request-response cycles. Security-first architecture becomes critical given the attack surface of persistent, tool-enabled agents. The Mac Mini trend suggests developers are deploying AI on personal hardware, blurring lines between development and production environments.

3. How OpenAI Scrambled for Compute as Stargate Stalled

Source: Techmeme / The Information

Date: February 22, 2026

Detailed Summary:

More than a year after President Trump announced the $500 billion Stargate project at a White House event, the joint venture between OpenAI, SoftBank, and Oracle has not hired staff, has not finalized a single site deal, and is not developing any OpenAI data centers. Safra Catz, then Oracle’s CEO, bluntly told investors: “Stargate is not formed yet.”

Key factors behind the stall include internal disputes over responsibilities, organizational structure, site locations, and operating terms; financing paralysis as SoftBank hasn’t finalized a financing blueprint; tariff risks that could raise data center costs 5-15%; and lender reluctance to back “a company with an unproven business model and heavy losses.”

Unable to wait for Stargate, OpenAI has pursued aggressive multi-cloud diversification:

Oracle: $30 billion annual deal for 4.5 gigawatts of U.S. capacity
CoreWeave: $11.9 billion compute partnership
AWS, Google Cloud, AMD, Cerebras: Additional deals to cover computing needs
Microsoft Azure shift: Actively diversifying away from historical reliance on Azure
Nvidia investment: $30 billion equity stake as part of a $100+ billion mega funding round

OpenAI now projects $600 billion in total compute spend through 2030 and expects to burn a further $111 billion in cash over that period. Inference costs increased fourfold in 2025, dropping gross margins from 40% to 33%.

Competitive implications: While OpenAI scrambles for capacity, Anthropic is deepening its AWS partnership and Google is leaning on its own TPU infrastructure. The compute constraints could slow OpenAI’s model development cadence at a critical juncture. The Stargate saga reveals fundamental tensions in AI infrastructure financing: unprecedented capital intensity, execution complexity even when capital commitments are enormous, and a shift toward sharing infrastructure risk with cloud providers rather than owning facilities outright.
