Mews — AI Industry Intelligence

Last 30 days in AI

See all news

Jun 02
Microsoft Build: MAI-Thinking-1 and MAI Family models, Surface RTX Spark Dev Box, and OpenClaw in Windows
Jun 02
Microsoft introduced MAI-Thinking-1, a 35B parameter MoE model with 256K context, achieving 97% on AIME 2025 and outperforming Sonnet 4.6 in human preference tests. The broader 7-model MAI family spans reasoning, code, image, speech, and voice, with third-party availability on OpenRouter, fal, and Baseten. The detailed 109-page technical report revealed insights on scaling, MFU, RL/post-training, and data curation, highlighting no third-party distillation and advanced prompt optimization techniques. Microsoft emphasized agent-native devices and local inference with projects like Project Solara / Scout and the Surface RTX Spark Dev Box, alongside software innovations such as the Copilot desktop app and MAI-Code-1-Flash integration. Meanwhile, local-first computer-use agents like Holo 3.1 (Qwen-based, 0.8B to 35B parameters) support laptops and small workstations with optimized formats and strong benchmark results. Desktop shells for agents, including Hermes Desktop, Devin Desktop, and agent-neutral approaches compatible with Devin, Claude Code, and Codex, are proliferating, with hybrid local/cloud execution becoming the default architecture as seen in Perplexity Computer's hybrid agentic inference.
Jun 01
not much happened today
Jun 01
NVIDIA led open-source AI model releases with Cosmos 3, a comprehensive omnimodal world model unifying language, image, video, audio, and action using a Mixture-of-Transformers design, and Nemotron 3 Ultra, a 550B parameter open-weight model noted for high serving speed and strong evaluation performance. The Cosmos Coalition was launched to foster an open ecosystem for physical AI world models. Meanwhile, MiniMax M3 debuted as a multimodal agent/coding model with 1M context and strong benchmark scores, gaining rapid ecosystem support from vendors like Novita and Vercel AI Gateway. However, MiniMax M3 showed some inefficiencies such as high token consumption and verbose self-check loops. These developments highlight advances in open physical AI, multimodality, and agent models with significant community and infrastructure engagement.
May 29
not much happened today
May 29
Anthropic rolled out Claude Opus 4.8, which shows incremental improvements but mixed benchmark results, including better cooperation and coding behavior but some regressions in document parsing. Platform updates include mid-conversation system instructions enhancing long agent sessions, though API pricing remains a concern. A Hugging Face analysis revealed a critical bug in multi-turn reinforcement learning training loops involving tokenization mismatches, with a proposed "Token-In, Token-Out" fix. Agent harness design is evolving as a key optimization area, with LangChain's Deep Agents v0.6 achieving strong performance at much lower cost, and vllm_project releasing native weight syncing APIs and a Rust BPE tokenizer to improve tokenization efficiency. Debate continues on the value of multi-agent systems, with some seeing them as speedups and others expecting capability breakthroughs.
May 28
Anthropic raises $65B in Series H at a $965B post-money valuation, releases Opus 4.8 and Dynamic Workflows
May 28
Anthropic announced a massive $65B Series H financing at a $965B valuation, led by Altimeter, Dragoneer, Greenoaks, and Sequoia, with run-rate revenue surpassing $47B. They launched Claude Opus 4.8, an update to Opus 4.7 featuring "sharper judgment," "more honesty," and longer autonomous work at the same price. Anthropic also introduced Dynamic Workflows in Claude Code, enabling orchestration of hundreds of parallel subagents for large tasks, available in research preview across multiple platforms. Opinions on Opus 4.8 vary, with some praising it as a major leap and others viewing it as incremental or catch-up to OpenAI's GPT-5.5 family.
May 26
not much happened today
May 26
Harness engineering is emerging as the key differentiator for coding agents, emphasizing the stack of model + harness + eval loop over just stronger base models. DeepSeek is building a harness team to optimize interaction and verification loops, while Google's Gemini Managed Agents and LangChain formalize harness concepts like context governance and dynamic skill routing. New benchmarks like DeepSWE align closely with real developer experience, with Qwen3.7 Max and Claude Opus 4.6 showing strong agentic coding performance. Anthropic introduced a security-guidance plugin for Claude Code reducing security PR comments by 30–40%, and OpenAI highlighted GPT-5.5 in Codex for improved document parsing. In research, Claude Mythos solved Erdős problem #90 with a cleaner proof path than previous models, showing latent capabilities unlocked by appropriate harnesses. The paper "Language Models Need Sleep" proposes a sleep-like consolidation phase for long-horizon memory, addressing bottlenecks in persistent context storage. Open research agents like QUEST (2B–35B parameters) advance long-horizon fact-seeking and citation grounding, while the CUSP benchmark from Sakana/Stanford/Oxford/AI2 evaluates current model capabilities in science.
May 26
not much happened today
May 26
Inference optimization is increasingly architectural, with EAGLE 3.1 improving speculative decoding and long-context handling, collaborating with vLLM and TorchSpec. Perplexity open-sourced a rebuilt Unigram tokenizer cutting CPU use by 5–6× and achieving 63 µs at 514 tokens. Qwen3.5 hits 580 tokens/s via joint efforts from Alibaba, LightSeek, NVIDIA, Mooncake, and FlashAttention-4 contributors. Price cuts in APIs from Chinese labs are sustainable due to structural KV-cache and attention improvements, exemplified by DeepSeek V4-Pro and Xiaomi MiMo reducing caching costs significantly. Agent engineering shifts focus from model quality to model-harness-memory fit, with LangChain releasing Deep Agents v0.6 and tools like LangSmith Engine automating evaluation loops. Trajectory launched a continual learning platform with $15M funding and partners like Clay and Harvey, supporting large models including a 397B-parameter model deployed on autoscaled H100 infrastructure. Open-source memory-centric agents and minimal training harnesses also gained attention.
May 21
not much happened today
May 21
RAEv2 advances representation-first tokenization with >10x faster convergence and improved generation, tested on text-to-image and world models. NVIDIA's Gated DeltaNet-2 innovates linear attention with channel-wise gates, outperforming KDA and Mamba-3 at 1.3B parameters on language modeling and reasoning tasks. Studies on subword tokenization reveal only some benefits at scale, while data filtering research suggests that with enough compute, no filtering may be optimal at around 1e30 FLOPs. Mechanistic interpretability updates propose clustering features by joint firing patterns for better geometry understanding. OpenAI's AI-assisted breakthrough on an Erdős unit-distance math problem sparks debate on AI's role in mathematical research. Harnesses remain key for capability improvements in agent infrastructure.
May 18
not much happened today
May 18
Agent infrastructure is advancing with LangSmith Engine providing CI/CD loops for agents and SmithDB enabling low-latency querying for observability. Cognition's Devin Auto-Triage offers persistent automation for bug triage with memory and subagent structures. Anthropic improves Claude Code for large codebases with prompt cache diagnostics and faster modes, while OpenAI enhances Codex workflows with remote execution and plugins. Microsoft released remote control for GitHub Copilot CLI and VS Code. The community emphasizes verification, decomposition, and feedback loops over prompt cleverness for coding agents. Cursor's Composer 2.5 is highlighted as a strong new coding model, with plans for a larger model trained with SpaceXAI using 10× more compute on Colossus 2 hardware, praised for efficiency and collaboration improvements.
May 18
Google I/O 2026: Gemini 3.5 Flash, Omni, and Google’s Agent Stack
May 18
Google announced at I/O the repositioning of Gemini as a consumer AI and developer/agent platform with three key releases: Gemini 3.5 Flash for fast agentic and coding tasks, Gemini Omni for multimodal generation and editing including video, and the expanded Antigravity 2.0 agent stack. Google reports processing over 3.2 quadrillion tokens per month, a 7x increase year-over-year, with 900M+ monthly Gemini users across 230+ countries and 70+ languages. Gemini 3.5 Flash features a 1M-token context window, 65k max output tokens, 4 thinking levels, and "thought preservation" across turns, outperforming Gemini 3.1 Pro on multiple benchmarks and running up to 12x faster in Antigravity. Independent benchmarks show Gemini 3.5 Flash scoring 55 on the Intelligence Index, with higher costs than previous versions. Gemini Omni Flash supports text, image, video, and audio inputs for generative media tasks, available now for paid users.
May 15
not much happened today
May 15
Cerebras made headlines with its IPO, marking a significant milestone for the company known for its contrarian hardware approach. The Cerebras CFO Bob Komin emphasized the company's capability to serve trillion-parameter models, including internal OpenAI 5.4 and 5.5 models, pushing back against the notion that Cerebras only supports small models. Investor Ishan N. Taneja praised Cerebras for its persistence and execution, calling their chip a "banger." The IPO is seen as a validation of Cerebras's long-term strategy in inference infrastructure, highlighting themes like compute scarcity, inference demand, and model routing.
May 14
not much happened today
May 14
OpenAI expanded Codex integration with the ChatGPT mobile app enabling remote task management and introduced Remote SSH, hooks, and programmatic tokens for enterprise automation. The IDE ecosystem is shifting to "agent-first" UX with GitHub Copilot App preview and VS Code launching a multi-agent workflow window. Open-source agents like Nous/Hermes integrated Codex runtime, and Kimi released a web bridge extension supporting multiple coding agents. LangChain released significant agent infrastructure including SmithDB for agent trace data and LangSmith Engine for trace analysis and continual learning, launching LangChain Labs to improve agents via production trace feedback loops.
May 13
not much happened today
May 13
Cline, LangChain, Notion, and Cursor advanced agent infrastructure and developer platforms with innovations like Cline SDK, LangSmith Engine, SmithDB (offering 12–15× faster observability), and Notion's External Agents API integrating third-party agents such as Claude and Codex. Agent UX trends emphasize long-running state, streaming, and orchestration over chat, with tools like Duet Agent and VS Code Agents window enhancing durable execution and inspectable states. Research highlights include Nous Research's Token Superposition Training achieving 2–3× speedup in pretraining, a multi-stream LLM architecture for parallel reasoning by Jonas Geiping et al., and δ-mem external memory improving benchmark scores. NVIDIA's Star Elastic offers post-training model compression at 360× lower cost than pretraining, while Datology focuses on data curation for vision-language models.
May 12
not much happened today
May 12
Research-level reasoning benchmarks are advancing with 439 new math problems from 64 mathematicians and expanded medical benchmarks in Medmarks v1.0 covering 30 benchmarks and 61 models. Google DeepMind's AI Co-Mathematician achieves 48% on FrontierMath Tier 4, while Gemini 3.1 Pro improves physics benchmark scores significantly. GPT-5.5 high/xhigh outperforms Opus 4.7 xhigh on program synthesis tasks. Retrieval benchmarks favor smaller models like LightOn's Agent-ModernColBERT with 149M parameters. Training optimization advances include SOAP/Muon-style updates reducing training steps, and a Lean4-to-TileLang superoptimizer achieving 1.8× speedup on A100 GPUs. Scaling laws are reconsidered with arguments for measuring in bytes rather than tokens. New training-time efficiency methods like Lighthouse Attention enable subquadratic training wrappers removable before deployment.
May 11
not much happened today
May 11
Thinking Machines previewed their new native interaction models designed for full-duplex multimodal interaction enabling real-time concurrent listening, speaking, watching, thinking, searching, and reacting, marking a shift beyond turn-based AI. This approach emphasizes continuous audio, video, and text processing, with innovations like visual proactivity and background tool use, implemented using SGLang. Meanwhile, OpenAI announced the OpenAI Deployment Company, a new unit with 150 Forward Deployed Engineers and $4B initial investment to help enterprises deploy frontier models, signaling a move into the deployment layer of the AI economy. OpenAI also launched Daybreak, a security-focused initiative integrating GPT-5.5 and Codex for cyber defense, threat modeling, and automated patching, offering differentiated access tiers including GPT-5.5-Cyber. This contrasts with Anthropic's more restrictive cyber approach, highlighting tensions in AI security strategies.
May 08
not much happened today
May 08
OpenAI rapidly expanded the GPT-5.5 family with multiple variants including gpt-image-2, GPT-5.5 Pro, and GPT-5.5 Cyber, receiving positive feedback for efficiency and usability. Codex evolved into a long-running agent runtime with a new /goal mechanism, achieving 61% success on ARC-AGI-3 games after extensive testing. OpenAI also introduced cybersecurity-focused models like GPT-5.5-Cyber targeting enterprise and government sectors. Meanwhile, Zyphra released the open-model ZAYA1-74B-Preview, a 74B parameter mixture-of-experts model trained on AMD hardware under Apache 2.0 license, alongside a vision-language model ZAYA1-VL-8B. Inference infrastructure competition intensified with vLLM updates improving throughput and latency, including support for DeepSeek V4 and enhanced quantization/backends.
May 07
GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs
May 07
OpenAI released GPT-Realtime-2, a voice model with GPT-5-class reasoning, tool use, interruption handling, and extended context windows up to 128K tokens, achieving top scores on Big Bench Audio and Conversational Dynamics benchmarks. They also launched a Chrome plugin for Codex enabling browser control and multitasking, and introduced GPT-5.5 with Trusted Access for Cyber for secure defensive workflows and red teaming. Anthropic introduced Natural Language Autoencoders for interpreting model activations as human-readable text, aiding interpretability and debugging, while Goodfire proposed a neural geometry research agenda focusing on manifolds as primitives for neural network behavior. Anthropic also announced The Anthropic Institute to advance AI safety and economic resilience research.
May 06
Anthropic-SpaceXai's 300MW/$5B/yr deal for Colossus I, ARR growth is 8000% annualized
May 06
Anthropic announced a new SpaceX compute partnership to significantly increase capacity for Claude products, doubling Claude Code's 5-hour rate limits for Pro, Max, Team, and Enterprise users, removing peak-hour limit reductions, and substantially increasing API rate limits for Opus models. The deal grants Anthropic access to Colossus 1 via SpaceXAI, with Claude inference expected to ramp up on Colossus soon. Anthropic also hosted a "Code with Claude" event featuring updates on Claude Code, GitHub-scale usage, and managed agents. Discussions highlighted compute bottlenecks, user reactions to limit changes, debates on managed-agent features, and ongoing safety/governance discourse around AGI trustworthiness.
May 04
not much happened today
May 04
AI Twitter Recap highlights the shift from model-centric AI to context pipelines and agent orchestration as key performance drivers. Notably, gpt-5.2-codex and gpt-5.3-codex showed significant benchmark improvements through prompt and middleware tuning. The ecosystem around open harnesses like Hermes, deepagents, and Flue is rapidly evolving, with innovations in multi-agent coordination and model-agnostic orchestration. Developer workflows are adapting to coding agents such as Codex and Claude Code, with emerging challenges in pricing models due to high token usage in agentic workloads. The practical takeaway is that agent performance depends on the synergy of model × harness × memory/context strategy, not just model weights alone.
May 04
not much happened today
May 04
OpenAI rolled out GPT-5.5 Instant as the new default for ChatGPT and API, enhancing factuality, intelligence, image understanding, and tone with stronger personalization features like saved memories and Gmail integration. OpenAI also shared infrastructure updates on a rebuilt WebRTC stack for voice and real-time API, aiming to reduce latency for speech-paced conversations. Developer tools expanded with an Agents SDK for TypeScript, sandbox agents, and open-source harnesses, improving coding and automation workflows. Discussions highlighted the importance of Model–Harness–Task fit over raw model quality for agent performance, with debates on agent coding UX and benchmarks. Community sentiment praises GPT-5.5 for high-token-budget coding and non-coding tasks.
May 04
not much happened today
May 04
OpenAI achieved a major math breakthrough by disproving a long-standing Erdős unit distance problem using a general-purpose reasoning model, marking a milestone in AI-driven formal science and long-horizon reasoning. The result was validated by prominent mathematicians like Timothy Gowers and OpenAI researcher Hongxun Wu, highlighting the model's advanced reasoning capabilities beyond prior AI math achievements. Meanwhile, Cohere released Command A+ as an open-source Apache 2.0 licensed model, featuring a 218B MoE / 25B active multimodal architecture supporting 48 languages and optimized for low hardware requirements, runnable on as little as 2× H100 GPUs. Benchmarks place Command A+ near Claude 4.5 Haiku in intelligence with strong non-hallucination but weaker scientific reasoning and coding. The architecture includes novel elements like a parallel transformer block, shared experts, and LayerNorm over RMSNorm.
May 04
not much happened today
May 04
AI News for 5/4/2026-5/5/2026 highlights a shift in AI product development emphasizing model + harness + workflow + UI + memory + economics over model quality alone, with notable updates from OpenAI Codex and Claude including new features like Appshots, auto mode, and Sonnet 4.6. DeepSeek made a significant market impact by permanently discounting DeepSeek-V4-Pro by 75%, drastically improving cost/performance ratios compared to Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.7. Meanwhile, Gemini 3.5 Flash showed benchmark improvements but received mixed feedback on practical utility. The competitive landscape continues to tighten with Qwen and other Chinese frontier models.
May 01
not much happened today
May 01
xAI released Grok 4.3, improving cost/performance with a 53 Intelligence Index score, 4 points higher than Grok 4.20, and significant gains on GDPval-AA and τ²-Bench Telecom. However, accuracy tradeoffs raised reliability concerns. Community opinions are mixed, with some praising token-efficiency and others noting regressions and pricing concerns. DeepSeek V4 Pro emerges as a leading open-weight coding/agent model, comparable to Codex and Claude Code, featuring a 1M context window and efficient attention mechanisms. Benchmarking shows open-weight models like Kimi K2.6, MiMo V2.5 Pro, and DeepSeek V4 Pro closing the gap with closed models such as Gemini 3.1 Pro Preview, Claude Opus 4.7, and GPT-5.5. DeepSeek's multimodal efforts focus on explicit spatial grounding with a novel "point while thinking" approach using DeepSeek-ViT and CSA compression.
Apr 30
not much happened today
Apr 30
OpenAI's GPT-5.5 achieves top-tier performance in long-horizon cyber tasks, matching or surpassing Claude Mythos Preview with a 71.4% pass rate and showing ongoing improvement beyond 100M tokens inference. OpenAI also released an Advanced Account Security update for ChatGPT enhancing phishing resistance. The Codex update expands beyond coding to general computer tasks, improving speed by up to 42% and introducing role-based onboarding and app integrations. Economically, GPT-5.5 Pro shows a slight SOTA improvement on CritPt with ~60% lower cost and token use compared to GPT-5.4 Pro. In open-weight models, Qwen3.6 27B leads under 150B parameters with an Intelligence Index score of 46, featuring 262K context, native multimodal input, and efficient BF16 weights. Tencent's Hy3-preview (295B total, 21B active MoE) scores 42 on the Intelligence Index with strong scientific reasoning on CritPt. xAI's Grok 4.3 shows sharp improvements on agentic benchmarks with reduced cost.
Apr 29
not much happened today
Apr 29
OpenAI is expanding Codex from a coding tool to a general work surface with persistent context, tools, integrations, and team rollout, including Codex-only seats with $0 seat fee for Business/Enterprise customers through June. Performance improvements focus on agent-loop systems engineering, achieving up to 40% faster agentic workflows via WebSocket mode on the Responses API. VS Code enhances coding-agent UX with semantic indexing, cross-repo search, chat session insights, and prompt/agent evaluation extensions. Cursor launches a Cursor SDK to enable programmable agent infrastructure for CI/CD, automations, and embedded agents, signaling a shift toward headless agent runtimes and usage-based economics. Research highlights Agentic Harness Engineering improving Terminal-Bench 2 pass@1 from 69.7% to 77.0%, surpassing human-designed baselines and reducing token use by 12%. Related work on HALO shows recursive self-improving agents with significant AppWorld score improvements. LangChain’s Deep Agents introduces Harness Profiles for model-specific harness tuning and deployability.
Apr 28
not much happened today
Apr 28
vLLM v0.20.0 introduces significant improvements in memory and MoE serving efficiency, including TurboQuant 2-bit KV cache for 4× KV capacity and a 2.1% latency improvement. The update supports multiple hardware platforms like DeepSeek V4 MegaMoE on Blackwell, Jetson Thor, ROCm, Intel XPU, and Grace-Blackwell setups. Early benchmarks show DeepSeek V4 Pro on B300 hardware can be up to 8× faster than H200. The ecosystem is rapidly adopting day-0 support for new open models such as Poolside Laguna XS.2, Ling-2.6-flash, and NVIDIA Nemotron 3 Nano Omni. Poolside released Laguna XS.2, a 33B total / 3B active MoE coding model under Apache 2.0, capable of running on a single GPU, with hybrid attention and FP8 KV cache, performing near Qwen-3.5. NVIDIA launched Nemotron 3 Nano Omni, a 30B / A3B multimodal MoE with 256K context, supporting text, image, video, audio, and documents, with immediate distribution across multiple platforms. Discussions highlighted tradeoffs in quantization methods and a shift away from CUDA lock-in towards heterogeneous accelerator support.
Apr 27
not much happened today
Apr 27
OpenAI loosens its Azure exclusivity, allowing distribution across Google TPU, AWS Trainium, and Bedrock with commitments through 2032 and revenue share through 2030. GPT-5.5 shows improved benchmarks but is not uniformly dominant, ranking variably across coding, document, math, and vision tasks. GitHub's Copilot shifts to usage-based billing starting June 1, reflecting increased runtime costs. OpenAI open-sourced Symphony, an orchestration layer for issue tracking and Codex agents. Xiaomi released MiMo-V2.5 and MiMo-V2.5-Pro, large context models with up to 1M-token context and trillions of tokens trained, emphasizing complex agent and omni-modal capabilities. Kimi K2.6 leads OpenRouter's leaderboard, noted for coding and long-horizon agent capabilities with large-scale sub-agent coordination.
Apr 24
DeepSeek v4
Apr 24
DeepSeek-V4 technical release features a 1.6T-parameter MoE with 49B active parameters and 1M-token context, showcasing hybrid attention and compressed KV schemes for major memory reductions. It ranks as the #2 open-weights reasoning model behind Kimi K2.6 but has a high hallucination rate and higher serving costs. Hardware-model co-design is emphasized, with NVIDIA Blackwell Ultra delivering 150+ TPS/user and support for FP4 and FP8 quantization enabling deployment on single nodes. Positioning among open Chinese models is competitive with GLM-5.1 and Xiaomi MiMo V2.5 Pro. Meanwhile, OpenAI launched GPT-5.5 and GPT-5.5 Pro APIs with a 1M context window, focusing on improved long-running workflows and token efficiency, quickly integrated into tools like GitHub Copilot and Cursor. *"GPT-5.5 handles complex, tool-heavy, ambiguous workflows with fewer retries,"* highlighting rapid distribution and agent integration.
Apr 23
GPT 5.5
Apr 23
OpenAI launched GPT-5.5 as its new flagship model for "real work and powering agents," immediately available in ChatGPT and Codex but with delayed API access due to enhanced safety requirements. The model features improved token efficiency and supports longer multi-step execution with tool use and self-checking. Pricing is set at $5/$30 per million tokens for GPT-5.5 and $30/$180 for GPT-5.5 Pro, roughly double the cost of GPT-5.4. The release includes significant Codex upgrades such as browser control, document handling, and OS-wide dictation. Early reactions are mixed but generally positive, noting improvements in coding and long-horizon tasks, though some benchmarks show incremental gains and hallucination issues persist. Third-party ecosystem support like Hermes Agent integration appeared quickly.
Apr 22
not much happened today
Apr 22
Alibaba released Qwen3.6-27B, a dense, Apache 2.0 open coding model with thinking and non-thinking modes, outperforming the larger Qwen3.5-397B-A17B on multiple coding benchmarks including SWE-bench and Terminal-Bench. It supports native vision-language reasoning over images and video, with immediate ecosystem support from vLLM, Unsloth, ggml, and Ollama. OpenAI open-sourced a practical Privacy Filter model for PII detection and masking, a 1.5B parameter token-classification model with a 128k context window aimed at enterprise redaction tasks. Xiaomi announced MiMo-V2.5-Pro and MiMo-V2.5 models, emphasizing software engineering advances, long-horizon agents, and large context windows (up to 1M tokens), with strong benchmark results and integrations with Hermes and Nous. At Google Cloud Next, Google and Google DeepMind unveiled 8th-gen TPUs (TPU 8t for training and TPU 8i for inference) with claims of scaling to a million TPUs in a cluster, and launched the Gemini Enterprise Agent Platform evolving Vertex AI with Agent Studio and access to 200+ models including Gemini 3.1 Pro and Gemini 3.1 Flash Image. This marks a significant vertical integration of hardware, models, and enterprise tooling.
Apr 21
GPT-Image-2
Apr 21
OpenAI launched GPT-Image-2, enhancing image generation with improved text rendering, layout fidelity, editing, multilingual support, and "thinking" capabilities. It supports generating slides, infographics, diagrams, UI mockups, and QR codes, and integrates with tools like Figma, Canva, Adobe Firefly, and Hermes Agent. Benchmarks show GPT-Image-2 leads image generation tasks with a +242 Elo advantage. Hugging Face released ml-intern, an open-source agent automating post-training research loops, improving scientific reasoning and healthcare benchmarks significantly. Hermes is evolving into a richer local/open agent platform with enhanced multi-process orchestration capabilities.