a quiet day.
AI News for 4/14/2026-4/16/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
Top Story: Claude Opus 4.7
What happened
Anthropic officially launched Claude Opus 4.7 as its newest top-tier Opus model, positioning it as better at long-running work, coding, instruction following, self-verification, computer use, and knowledge work than Opus 4.6, while keeping list pricing unchanged at $5 / $25 per million input/output tokens according to user summaries and launch discussion [@claudeai, @kimmonismus]. The rollout appears broad: Anthropic’s own app/platform, API, Claude Code, AWS Bedrock, Google Vertex AI, and Microsoft Foundry were all cited by users on day one [@dejavucoder, @kimmonismus]. Third-party integrations also landed quickly, including Cursor [@cursor_ai], GitHub Copilot / @code [@pierceboggan, @code], Perplexity [@perplexity_ai], Devin [@cognition], Cline [@cline], Replit Agent [@pirroh], MagicPath [@skirano], Hermes Agent [@Teknium], and Arena [@arena]. The release sparked unusually active technical discussion around benchmark gains, a new tokenizer, higher image resolution support, a new `xhigh` reasoning-effort tier, token-cost implications, and whether Opus 4.7 is a straightforward 4.6 successor, a new base model, or a partially distilled “Mythos-adjacent” system.
Release details and product changes
Official framing. Anthropic’s launch pitch emphasized three behavioral improvements: better handling of long-running tasks, more precise instruction following, and stronger self-verification before responding [@claudeai].
Availability.
- Claude platform / app reported live immediately [@dejavucoder].
- API and cloud providers reported available across Bedrock, Vertex AI, and Microsoft Foundry [@kimmonismus].
- Claude Code shipped day-one support and set `xhigh` as the default effort level [@_catwu, @_catwu].
- Anthropic also launched or highlighted task budgets in public beta, `/ultrareview` in Claude Code, and broader Auto mode access for Claude Code Max users [@kimmonismus].
New effort tier.
- Multiple users noted a new `xhigh` reasoning effort mode, positioned between `high` and `max` [@scaling01, @scaling01].
- Cat Wu said Claude Code now defaults to `xhigh` for Opus 4.7 [@_catwu].
Vision/computer use changes.
- User summaries reported support for images up to 2,576 px on the long edge (~3.75 MP), described as 3x larger than previous Claude image inputs [@kimmonismus] (a pre-flight sizing sketch follows this list).
- Anthropic employee Alex Albert highlighted “No more downscaling of high-res images” and better output taste in UI/slides/docs [@alexalbert__].
- This was repeatedly linked to better computer use and screenshot-heavy workflows [@dejavucoder, @omarsar0].
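For screenshot-heavy pipelines, the practical question is whether a given capture now fits without downscaling. A small pre-flight sketch using the user-reported 2,576 px long-edge figure (treat the constant as unverified; it comes from launch-day tweets, not official docs):

```python
# Pre-flight check against the user-reported 2,576 px long-edge limit.
MAX_LONG_EDGE_PX = 2576

def downscale_factor(width: int, height: int) -> float:
    """Scale factor (<= 1.0) needed to fit the reported long-edge limit."""
    long_edge = max(width, height)
    return min(1.0, MAX_LONG_EDGE_PX / long_edge)

print(downscale_factor(3840, 2160))  # 4K screenshot: ~0.67, still needs shrinking
print(downscale_factor(2560, 1440))  # QHD screenshot: 1.0, fits as-is
```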
Tokenizer and token economics.
- Several observers discovered Opus 4.7 uses a different tokenizer from 4.6 [@natolambert, @nrehiew_].
- Kimmonismus summarized Anthropic’s caveat that the same input can map to 1.0–1.35x more tokens depending on content type [@kimmonismus] (a worked cost example follows this list).
- This triggered debate over whether 4.7 is effectively a new base model, a tokenizer-swapped continuation, or some kind of midtraining/distillation bridge from Mythos [@natolambert, @stochasticchasm, @eliebakouch, @maximelabonne].
- Anthropic employee Boris Cherny later said they increased limits for all subscribers to offset increased token use [@bcherny, @bcherny].
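To make the token economics concrete, a minimal sketch of effective input-cost inflation under the quoted 1.0–1.35x range at the reported $5/M input list price; the 50k-token baseline prompt is a hypothetical illustration:

```python
# Effective input cost under tokenizer inflation, using the quoted
# 1.0-1.35x range and the $5/M input list price. The 50k-token
# baseline prompt is hypothetical, for illustration only.
INPUT_PRICE_PER_M_USD = 5.00
baseline_tokens = 50_000  # tokens under the old (4.6) tokenizer

for inflation in (1.00, 1.15, 1.35):
    tokens = baseline_tokens * inflation
    cost = tokens / 1_000_000 * INPUT_PRICE_PER_M_USD
    print(f"{inflation:.2f}x -> {tokens:>7,.0f} tokens -> ${cost:.3f} per call")
# 1.00x ->  50,000 tokens -> $0.250 per call
# 1.15x ->  57,500 tokens -> $0.288 per call
# 1.35x ->  67,500 tokens -> $0.338 per call
```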
Benchmarks and measurable progress
Reported benchmark gains vs Opus 4.6
The most cited launch numbers came from benchmark screenshots and summaries shared by external accounts:
- SWE-bench Pro: 64.3%, with users citing roughly +11 points over Opus 4.6 [@scaling01, @kimmonismus]
- SWE-bench Verified: 87.6%, roughly +7 points vs 4.6 [@scaling01, @scaling01]
- TerminalBench 2.0: 69.4%, around +4 points [@scaling01, @kimmonismus]
- Document reasoning: 80.6%, up from 57.1% per third-party discussion [@scaling01, @llama_index]
- GDPval-AA: 1753 Elo [@scaling01, @ArtificialAnlys]
- ARC-AGI-1: 92%; ARC-AGI-2: 75.83% per user screenshot/summary [@scaling01]
Artificial Analysis said Opus 4.7 launched as the new #1 on GDPval-AA, with an implied ~60% head-to-head win rate vs GPT-5.4 on that task set [@ArtificialAnlys].
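For readers converting Elo gaps into win rates: under the standard Elo expected-score formula (assuming GDPval-AA uses conventional 400-point scaling, which the tweets do not confirm), a ~60% head-to-head win rate corresponds to a gap of roughly 70 points:

```python
import math

# Standard Elo expected score: E = 1 / (1 + 10^(-(Ra - Rb)/400)).
# Assumes GDPval-AA uses conventional Elo scaling; the opponent rating
# is back-solved from the reported ~60% win rate, not from the source.
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** (-(rating_a - rating_b) / 400.0))

opus = 1753.0                          # reported GDPval-AA Elo
gap = 400.0 * math.log10(0.60 / 0.40)  # rating gap implying a 60% win rate
print(f"implied gap: {gap:.1f} Elo")                              # ~70.4
print(f"win rate:   {elo_expected_score(opus, opus - gap):.2f}")  # 0.60
```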
Vals AI said Opus 4.7 took the #1 spot on the Vals Index at 71.4%, up from a previous best 67.7%, and also ranked #1 on Vibe Code Bench, Vals Multimodal, Finance Agent, Mortgage Tax, SAGE, SWE-Bench, and Terminal Bench 2 [@ValsAI].
They separately said Opus 4.7 became #1 on Vibe Code Benchmark at 71%, versus no model above 25% when they first launched the benchmark 4.5 months earlier [@ValsAI].
Product/evals from partners and customers
- Cursor said its internal benchmark jumped from 58% to 70% with Opus 4.7 [@cursor_ai, @scaling01].
- A separate Cursor post said, across 500 teams, developers are tackling 68% more high-complexity tasks this year, though that was about better models generally, not solely Opus 4.7 [@cursor_ai].
- Notion reportedly saw a 14% lift on internal evals with one-third the tool errors [@mikeyk].
- GitHub reportedly saw similar improvements, though no hard numbers were included in the tweet thread [@scaling01].
Document understanding: progress, but mixed economics
LlamaIndex and Jerry Liu provided useful independent nuance:
- LlamaIndex’s ParseBench-style comparison said Opus 4.7 massively improved charts (13.5% → 55.8%) but only slightly improved formatting (64.2% → 69.4%), content (89.7% → 90.3%), tables (86.5% → 87.2%), and regressed on layout (16.5% → 14.0%) [@llama_index].
- Jerry Liu separately said Opus 4.7 is “quite good at tables,” better on charts, and strongest on content faithfulness, but expensive for OCR-like use at ~7¢/page vs their agentic mode at ~1.25¢/page and cost-effective mode around ~0.4¢/page [@jerryjliu0].
This is one of the clearest examples of independent evaluation tempering launch optimism: broad capability improved, but specific enterprise document pipelines may still prefer specialized stacks on cost/performance grounds.
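To make the economics concrete, a back-of-envelope comparison using Jerry Liu's quoted per-page rates; the 10,000-page corpus size is a hypothetical illustration:

```python
# Back-of-envelope OCR-pipeline cost comparison using the quoted per-page
# rates. The 10,000-page corpus size is hypothetical.
PER_PAGE_USD = {
    "opus_4_7":       0.070,   # ~7 cents/page (quoted)
    "agentic_mode":   0.0125,  # ~1.25 cents/page (quoted)
    "cost_effective": 0.004,   # ~0.4 cents/page (quoted)
}

pages = 10_000
for mode, rate in PER_PAGE_USD.items():
    print(f"{mode:>14}: ${rate * pages:>8,.2f} for {pages:,} pages")
#       opus_4_7: $  700.00 for 10,000 pages
#   agentic_mode: $  125.00 for 10,000 pages
# cost_effective: $   40.00 for 10,000 pages
```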
Facts vs opinions
Facts and near-facts supported by launch materials or consistent cross-reporting
- Opus 4.7 was officially launched by Anthropic [@claudeai].
- It is framed as better for long-running tasks, instruction following, and self-verification [@claudeai].
- A new `xhigh` effort tier exists [@scaling01].
- Claude Code defaulted to `xhigh` for the model [@_catwu].
- It uses a different tokenizer from 4.6 [@natolambert].
- Anthropic increased subscriber limits to compensate for greater token usage [@bcherny, @bcherny].
- Anthropic acknowledges benchmark tradeoffs and retained MRCR in the system card “for scientific honesty,” while signaling a shift toward Graphwalks as a preferred long-context metric [@bcherny].
Opinions / interpretations
- “This is a distilled version of Mythos” [@eliebakouch].
- “This is a new base model because the tokenizer changed” [@natolambert].
- “Anthropic artificially kept cyber scores low during training” is partly factual insofar as users quote the system card language about differentially reducing some capabilities, but broader claims about “nerfed Mythos” are interpretation [@scaling01, @Yuchenj_UW].
- “Benchmarks don’t do it justice” and “actual usage is massively improved” are subjective but widely repeated by hands-on users [@mweinbach, @jeremyphoward].
- “System prompt has lobotomized the model” is a user complaint about behavior changes, not an established fact [@theo].
Different perspectives
Supportive: meaningful real-world upgrade
A large portion of technical users argued this is a substantial iteration, especially given the more frequent release cadence.
- Scaling01 repeatedly pushed back on “mid update” takes, noting the jump from around 80% to almost 90% on SWE-bench Verified and emphasizing this would have looked huge in prior release cycles [@scaling01, @scaling01, @scaling01].
- Alex Albert highlighted better async work, more predictable effort levels, better image handling, and stronger taste in UI/docs [@alexalbert__].
- Michael Weinbach said after just two prompts that behavior and instruction following were “pretty massive” improvements [@mweinbach].
- Jeremy Howard said it was the first model that “gets” what he’s doing and praised its willingness to discuss rather than bulldoze ahead [@jeremyphoward, @jeremyphoward].
- Cat Wu explicitly advised users to treat it like an engineer you delegate to, not a pair programmer you micromanage, suggesting Anthropic sees it as stronger in autonomous execution [@_catwu].
Neutral / analytical: strong update with tradeoffs
Some of the best commentary was technical and mixed.
- Kimmonismus called it a “solid upgrade” focused on Anthropic’s core buyer priorities: agentic coding reliability, vision for computer-use agents, and knowledge work—but also “obviously shy of Mythos” [@kimmonismus].
- Artificial Analysis validated the GDPval-AA gain and #1 ranking, but did not frame it as an across-the-board blowout [@ArtificialAnlys].
- LlamaIndex and ParseBench results suggested noticeable but uneven document gains with real pricing constraints [@llama_index, @jerryjliu0].
Skeptical / critical: regressions, token inflation, and UX concerns
There was also substantial pushback.
- Multiple users said long-context performance looked worse, especially on MRCR / needle-in-a-haystack-style metrics [@scaling01, @nrehiew_, @eliebakouch, @kimmonismus].
- Anthropic’s Boris Cherny replied that MRCR is being phased out because it overweights distractor-stacking tricks and that Graphwalks is a better applied-reasoning signal; he gave numbers showing Graphwalks 38.7% → 58.6% from 4.6 to 4.7 [@bcherny, @scaling01].
- Tokenizer changes led to complaints about Opus becoming a “token guzzler” and potentially raising effective costs despite flat list pricing [@dejavucoder, @madiator].
- Yuchen said Claude web only exposed “Adaptive” or non-thinking, with no explicit force-thinking toggle, which for some users made non-coding tasks feel worse in practice [@Yuchenj_UW].
- Mikhail Parakhin similarly said first impressions on non-coding replies were “dumber” because he couldn’t force reasoning [@MParakhin].
- Theo sharply criticized the new system prompt as “lobotomized,” and later suggested trying the model in T3 Chat “without the lobotomized system prompt” [@theo, @theo].
Safety / governance angle
- Scaling01 highlighted a system-card statement that Anthropic experimented with efforts to differentially reduce cyber capabilities during training [@scaling01].
- At the same time, users noted Opus 4.7 still scores higher than 4.6 on some exploitation-related evaluations like Firefox shell exploitation, and has prompt-injection robustness close to Mythos [@scaling01, @scaling01].
- One user hyperbolically said “Opus is going to be a bioweapon risk at this pace,” reflecting the ongoing tendency to conflate general capability jumps with worst-case misuse narratives [@scaling01].
Technical details and notable numbers
Core benchmark numbers repeatedly cited
- SWE-bench Pro: 64.3%
- SWE-bench Verified: 87.6%
- TerminalBench 2.0: 69.4%
- Document reasoning: 80.6% vs 57.1%
- GDPval-AA Elo: 1753
- ARC-AGI-1: 92%
- ARC-AGI-2: 75.83%
- Graphwalks: 38.7% → 58.6% from 4.6 to 4.7
- CursorBench: 58% → 70%
- Vals Index: 71.4% vs previous best 67.7%
- Notion evals: +14%, one-third the tool errors
- Image size support: up to 2,576 px long edge (~3.75 MP)
- Pricing: unchanged at $5 / $25 per million input/output tokens, per user reports
- Tokenization impact: the same prompt may tokenize to 1.0–1.35x more tokens
Availability / rollout specifics
- Claude app/platform
- API
- AWS Bedrock
- Google Vertex AI
- Microsoft Foundry
- Cursor
- GitHub Copilot / VS Code
- Perplexity
- Devin
- Cline
- Replit Agent
- MagicPath
- Hermes Agent
- Arena
Claude Code workflow guidance from Anthropic
Cat Wu’s thread is a useful operational signal for engineers:
- Delegate, don’t micromanage [@_catwu]
- Put full goal + constraints + acceptance criteria up front [@_catwu]
- Tell the model how to verify changes; encode testing workflows in `claude.md` or skills [@_catwu]
That strongly suggests Anthropic optimized toward autonomous task loops where explicit validation is central.
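A minimal sketch of that delegation pattern, bundling goal, constraints, and acceptance criteria into one task prompt; the template and field names are illustrative, not an Anthropic format:

```python
# Illustrative "full spec up front" task-prompt builder; a sketch of the
# guidance above, not an official Anthropic format.
def build_task_prompt(goal: str, constraints: list[str], acceptance: list[str]) -> str:
    lines = [f"Goal: {goal}", "", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines += ["", "Acceptance criteria (verify each before reporting done):"]
    lines += [f"- {a}" for a in acceptance]
    return "\n".join(lines)

print(build_task_prompt(
    goal="Migrate the config loader from YAML to TOML",
    constraints=["No new dependencies", "Keep the public API unchanged"],
    acceptance=["`pytest tests/config` passes", "Docs updated in README"],
))
```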
Examples of progress in practice
These were the recurring “progress examples” in the conversation:
- Coding autonomy: Cursor called it “impressively autonomous and more creative in its reasoning” [@cursor_ai].
- Long-horizon software work: Devin said Anthropic had “clearly optimized Claude Opus 4.7 for long-horizon autonomy,” unlocking investigations they couldn’t reliably run before [@cognition].
- Computer use / UI tasks: MagicPath cited stronger image-to-code and cleaner React components [@skirano].
- Knowledge work: Artificial Analysis and Anthropic both stressed GDPval-AA and finance-agent style tasks [@ArtificialAnlys, @kimmonismus].
- Document/chart reasoning: third-party evals show large chart gains, albeit with cost concerns [@llama_index, @jerryjliu0].
- Reduced destructive behavior: users cited lower tendency to take reckless destructive actions like `sudo rm -rf` in production-like setups [@scaling01].
- Prompt injection robustness: cited as comparable to Mythos [@scaling01].
At the same time, anecdotal regressions surfaced:
- worse performance on specific visual/web-generation tasks like Simon Willison’s “pelican benchmark” [@simonw].
- a weaker result than 4.6 on a canvas tree animation test [@stevibe].
- mixed outcomes on large codebase modernization from Theo [@theo, @theo].
Context: Mythos, release tempo, and what Opus 4.7 means
A lot of the discussion only makes sense in the context of Claude Mythos Preview, Anthropic’s more restricted, stronger model that many commenters see as the true internal frontier.
- Several users explicitly said Opus 4.7 is not Anthropic’s best model, just the best broadly released Opus line model [@nrehiew_, @Yuchenj_UW].
- Scaling01 posted internal-survey snippets about Mythos, including claims about perceived ability to manage day-long and even week-long tasks, and that some Anthropic respondents think Mythos may soon replace entry-level researchers/engineers in limited contexts [@scaling01, @arankomatsuzaki].
- This led many to interpret Opus 4.7 as either:
- a safer, public-facing derivative of stronger internal work,
- a distilled / capability-shaped sibling of Mythos,
- or a separate new-base Opus line converging toward Mythos in selected domains.
Another context point is cadence. Multiple users noted Anthropic is now shipping Opus updates at a tempo that feels new relative to earlier frontier labs [@arohan, @scaling01].
This matters because:
- Smaller deltas can still be meaningful if releases are monthly-ish instead of semiannual.
- Prompting and harnesses now need constant retuning, as Drew Breunig noted more generally: “new models, new prompts” [@dbreunig].
- Tooling ecosystems can support day-0 rollout across IDEs, inference providers, and agent frameworks much faster than before.
Implications
For agent builders:
Opus 4.7 looks optimized for a style of work where the model gets a full task spec, acts semi-autonomously, uses memory/filesystem state, and validates itself before returning. That is closer to “delegate to an engineer” than “autocomplete in a chat box.”
For benchmark interpretation:
The release exposed growing tension between benchmark families:
- coding/agentic evals look strong,
- long-context retrieval metrics look mixed,
- Anthropic is actively trying to reframe which long-context metrics matter.
For economics:
Flat posted pricing does not mean flat usage costs if tokenization and reasoning depth increase token counts. Anthropic’s response was to raise user limits, but API buyers still need to reassess effective costs.
For safety policy:
Anthropic appears to be capability-shaping public models in domains like cyber while using Opus 4.7 as a stepping stone toward broader Mythos-style deployment. Whether that balance is stable will depend on competition from OpenAI and others.
Bottom line:
The strongest consensus in the tweets is not that Opus 4.7 is flawless, but that it is a real upgrade in the domains Anthropic cares about most: coding autonomy, computer-use-adjacent vision, instruction fidelity, and economically useful knowledge work. The biggest unresolved questions are around tokenizer/base-model changes, effective cost, long-context regressions on some benchmarks, and whether Anthropic’s public line is now becoming a capability-managed derivative of a stronger internal model.
Open models and model releases
- Alibaba released Qwen3.6-35B-A3B, a sparse multimodal MoE with 35B total / 3B active params, Apache 2.0 licensed, with “thinking” and non-thinking modes [@Alibaba_Qwen]. Qwen claims strong agentic coding and VLM performance, including RefCOCO 92.0 and ODInW13 50.8 [@Alibaba_Qwen].
- Qwen highlighted LM gains vs prior models, including stronger coding benchmarks than Qwen3.5-35B-A3B [@Alibaba_Qwen]. Independent summaries called it a “very solid upgrade” and listed SWE-bench Verified 73.4, Terminal-Bench 2.0 51.5, QwenWebBench Elo 1397, and strong MMMU/MathVista numbers [@kimmonismus].
- vLLM shipped day-0 support for Qwen3.6 in v0.19+, including thinking, tool calling, MTP speculative decoding, and text-only mode [@vllm_project].
- Ollama added qwen3.6 support immediately, including `ollama launch claude --model qwen3.6` and OpenClaw support [@ollama].
- Unsloth published GGUFs for local execution, claiming Qwen3.6 can run in 23GB RAM and later showed a 2-bit version doing a repo bug hunt with 13GB RAM [@UnslothAI, @UnslothAI] (a rough memory sanity check follows this list).
- A separate HF-adjacent release: Jackrong’s Qwen3.5-9B-GLM5.1-Distill-v1, reportedly distilled on GLM-5.1 reasoning and fitting in 8GB VRAM [@leftcurvedev_].
- Bonsai open models launched in 8B / 4B / 1.7B with ternary weights, MLX/ONNX/WebGPU support, 65k context, 7.1x smaller than FP16, and 27 TPS on iPhone according to the release thread [@mervenoyann, @mervenoyann].
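As a rough sanity check on the Unsloth figures above: weight-only memory for a 35B-parameter model at a few bit widths. This ignores KV cache, activations, and runtime overhead, which plausibly account for the gap to the reported 13GB/23GB:

```python
# Weight-only memory footprint of a 35B-parameter model at several bit
# widths. Excludes KV cache, activations, and runtime overhead.
params = 35e9
for bits in (16, 4, 2):
    gb = params * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gb:5.1f} GB")
# 16-bit weights: ~ 70.0 GB
#  4-bit weights: ~ 17.5 GB
#  2-bit weights: ~  8.8 GB
```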
Agents, coding tools, and infra
- OpenAI shipped a major Codex update: background computer use on Mac, in-app browser, image generation/editing with gpt-image-1.5, 90+ plugins, multiple terminals, SSH to remote devboxes, automations that continue threads over time, memory, and proactive suggestions [@OpenAI, @OpenAIDevs, @reach_vb]. OpenAI framed Codex as a broader work agent; Kimmonismus noted OpenAI says it now has 3M weekly users and that nearly half of usage is non-coding [@kimmonismus].
- Cloudflare launched Artifacts, Git-compatible versioned storage “built for agents,” plus public beta Email Service for Workers / REST, and continued pitching Workers/DOs as core agent infra [@Cloudflare, @elithrar, @thomasgauvin].
- Nous/MiniMax expanded the Hermes ecosystem: MiniMax launched MaxHermes, Mirra launched Workspaces for shared agent skills/context, and Nous shipped Tool Gateway with one subscription covering 300+ / 400+ models plus browser automation, scraping, image generation, cloud terminal, and TTS [@MiniMax_AI, @mirra, @NousResearch, @Teknium].
- Hermes usage threads showed practical browser-control deployment patterns, including local Chrome via CDP and cloud browser backends like Browserbase / Browser Use / Firecrawl [@0xme66] (a minimal CDP attach sketch follows this list).
- Several users continued to report strong hands-on impressions from Hermes and Claude Code style agent workflows, including multibot experiments and no-code app generation [@KSimback, @friesmakesfries].
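For the local-Chrome-via-CDP pattern referenced above, a minimal attach sketch using Playwright's `connect_over_cdp`; it assumes Chrome was launched with `--remote-debugging-port=9222`, and all agent wiring is omitted:

```python
# Attach to an already-running local Chrome over CDP with Playwright.
# Assumes Chrome was started with --remote-debugging-port=9222.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    context = browser.contexts[0]  # reuse the existing profile/session
    page = context.pages[0] if context.pages else context.new_page()
    page.goto("https://example.com")
    print(page.title())
```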
Benchmarks, evals, and research on real-world capability
- Google’s Auto-Diagnose paper described an LLM-based root-cause diagnosis tool for integration test failures deployed inside Critique. Reported numbers: 90.14% diagnosis accuracy on 71 real failures, deployed across 52,635 distinct failing tests, with “Not helpful” only 5.8% of the time and ranking #14 of 370 Critique tools [@omarsar0].
- AlphaEval introduced a production-grounded benchmark with 94 tasks from 7 companies across 6 O*NET domains, evaluating full agent products like Claude Code and Codex using mixed paradigms like formal verification, UI testing, and rubric-based assessment [@dair_ai].
- CRUX and open-world evals got a major push: researchers argued the field is moving toward long messy real-world evaluations and published a project where an agent built and published an iOS app for about $1,000 [@random_walker, @steverab].
- FrontierSWE launched as an ultra-long-horizon coding benchmark with tasks running up to 20 hours and average runtime around 11 hours; frontier agents “rarely succeed” [@MatternJustus, @vincentweisser].
- A related memory-transfer paper argued shared cross-domain memory can improve coding-agent performance by 3.7% on average, with procedural guidance transferring better than raw traces [@dair_ai].
- Prime Intellect, Modular, and Thoughtful Lab threads described concrete FrontierSWE tasks including optimizing inference engines and post-training Qwen3-8B in a tool-using RL setting [@PrimeIntellect, @Modular, @ThoughtfulLab_, @karinanguyen].
Science, robotics, and domain-specific AI
- Google DeepMind showed Gemini Robotics ER controlling Boston Dynamics Spot through natural language and a tool bridge allowing movement, photos, and grasping [@GoogleDeepMind, @GoogleDeepMind].
- OpenAI launched GPT-Rosalind, a trusted-access life sciences reasoning model for biology, drug discovery, and translational medicine, available to qualified customers including Amgen, Moderna, Allen Institute, and Thermo Fisher [@OpenAI, @OpenAI, @kevinweil]. Kimmonismus positioned it as more of a scientific workflow/reasoning layer than an Isomorphic-style structural-biology engine [@kimmonismus].
- Microsoft’s Mustafa Suleyman highlighted a new Nature Health paper on how people use AI in healthcare today and implications for personalized support [@mustafasuleyman].
- In robotics, Brett Adcock introduced Vulcan, a controller allowing Figure robots to lose up to 3 lower-body actuators and still hobble without falling, and said Helix-02 is operating a package logistics use case fully autonomously [@adcock_brett, @adcock_brett].
- BADAS 2.0 from Nexar was presented as a V-JEPA2 world model trained on real-world videos, with “lite” versions that can run on CPU [@eranshir].
Platform / ecosystem notes
- GLM-5.1 users on vLLM/SGLang were told to update chat templates to fix a tool-calling bug where tool outputs rendered empty and caused repeated tool-call loops [@Zai_org].
- Adaptive Data integrated Hugging Face datasets into its workflow platform [@adaption_ai].
- Hugging Face contributors highlighted tooling to improve day-0 Apple Silicon / MLX support for Transformers models [@pcuenq, @awnihannun].
- Perplexity launched Personal Computer, a Mac-based orchestration layer across local files, apps, and browser, with 24/7 operation via Mac mini and mobile triggering [@perplexity_ai, @perplexity_ai, @AravSrinivas].
- Google added an AI Mode side-by-side search experience in Chrome desktop for U.S. users [@Google, @rmstein].
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. Qwen3.6-35B-A3B Model Release and Benchmarks
- Qwen3.6-35B-A3B released! (Activity: 2730): The image showcases bar charts comparing the performance of the newly released Qwen3.6-35B-A3B model against other models like Qwen3.5-27B and Gemma4-31B across various benchmarks. This sparse MoE model, with `35B` total parameters and `3B` active, demonstrates superior performance in agentic coding and reasoning tasks, outperforming its predecessors and other models of similar size. The model is open-source under the Apache 2.0 license, emphasizing its strong multimodal perception and reasoning capabilities, and is available on platforms like HuggingFace and ModelScope. Commenters highlight the model's impressive performance, noting its superiority over the dense 27B-param Qwen3.5-27B in key coding benchmarks. There is also anticipation for future releases that could challenge larger models from companies like Google. (A per-token compute sketch follows this item.)
- ResearchCrafty1804 highlights that the Qwen3.6-35B-A3B model significantly outperforms its predecessor, Qwen3.5-35B-A3B, particularly in agentic coding and reasoning tasks. This improvement is notable given that it also surpasses the dense 27B-param Qwen3.5-27B on several key coding benchmarks, indicating a substantial leap in performance for local LLMs.
- ResearchCrafty1804 also notes the model's vision-language capabilities, emphasizing that Qwen3.6-35B-A3B is natively multimodal. It performs exceptionally well on vision-language benchmarks, achieving a score of 92.0 on RefCOCO and 50.8 on ODInW13. These results suggest that its multimodal reasoning capabilities are on par with, or even exceed, those of Claude Sonnet 4.5, particularly in spatial intelligence tasks.
- AndreVallestero speculates on the potential release of a larger Qwen3.6 model, such as a 122B version, which could pressure competitors like Google to release their own large models. This discussion hints at the competitive landscape in AI model development, where advancements in model size and capability could influence market dynamics and innovation.
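Using the standard ~2 FLOPs-per-parameter-per-token rule of thumb for a forward pass (an assumption about inference cost, not a figure from the thread), the 35B-total / 3B-active split implies roughly a 9x per-token compute advantage over a dense 27B model:

```python
# Rule-of-thumb forward-pass compute: ~2 FLOPs per parameter touched per
# token. For a sparse MoE only the active parameters count, which is why
# a 3B-active model can be far cheaper than a dense 27B at inference.
def flops_per_token(active_params: float) -> float:
    return 2.0 * active_params

dense_27b = flops_per_token(27e9)  # ~5.4e10 FLOPs/token
moe_3b = flops_per_token(3e9)      # ~6.0e9  FLOPs/token
print(f"dense 27B: {dense_27b:.1e} FLOPs/token")
print(f"MoE 3B-active: {moe_3b:.1e} FLOPs/token (~{dense_27b / moe_3b:.0f}x cheaper)")
```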
- Released Qwen3.6-35B-A3B (Activity: 604): The image showcases the performance of the newly released Qwen3.6-35B-A3B model by Alibaba across various benchmarks, including Terminal-Bench, SWE-bench, and GPQA Diamond. The bar charts illustrate that Qwen3.6-35B-A3B generally outperforms previous models like Qwen3.5 and Gemma4, particularly in tasks related to coding, reasoning, and image processing. This release is available on Hugging Face, indicating significant improvements from a relatively small update. Commenters are impressed by the performance gains from the update, noting the competitive edge over previous models like Gemma4. There is anticipation for similar updates to larger models like 122B and 397B.
- Willing-Toe1942 highlights that the Qwen team seems to have strategically compared Qwen3.6-35B-A3B against Qwen3.5 and Gemma4, suggesting a significant performance improvement. This implies a competitive edge in the latest release, potentially indicating substantial advancements in model efficiency or capability.
- lacerating_aura expresses interest in the potential release of open weights for larger models like 122B and 397B, which could suggest that the community is eager for more accessible, high-capacity models. This reflects a demand for transparency and the ability to experiment with larger architectures.
- itisyeetime speculates on the performance positioning of Qwen3.6-35B-A3B, suggesting it might be a middle ground between Qwen 3.5 122B and Qwen 3.5 35B. This raises questions about the extent of benchmark optimization and whether the new model offers tangible improvements over the larger 122B model.
- Released Qwen3.6-35B-A3B (Activity: 101): The image presents a performance comparison of the newly released Qwen3.6-35B-A3B model against other models like "Qwen3.5-35B-A3B," "Gemma4-26B-A4B," and "Qwen3.5-27B." The Qwen3.6-35B-A3B model, depicted with purple bars, demonstrates superior performance across various benchmarks, including coding, reasoning, and image processing. This suggests significant improvements in the model's capabilities over its predecessors and competitors, highlighting advancements in AI model development. One commenter expresses a desire for a specialized version of the Qwen3.6 model focused on coding, indicating interest in further specialization of AI models. Another comment humorously advises patience in downloading the model to avoid server overload, suggesting a community eager to test new releases.
2. Local AI and Model Usage
- Local AI is the best (Activity: 602): The image is a meme illustrating a humorous interaction with a local AI model, emphasizing the candidness and freedom of using locally hosted AI systems. The post highlights the benefits of local AI, such as the ability to fine-tune models without censorship or data harvesting, and expresses gratitude towards developers of open-weight models like llama.cpp. The image and post together underscore the appeal of local AI for privacy-conscious users who value control over their data and interactions. One commenter praises llama.cpp as 'goated', indicating high regard for its capabilities. Another warns that smaller local models can sometimes exhibit bias or 'glaze' more than larger, frontier models, suggesting a nuanced view of local AI's limitations.
- A user tested the Minimax m2.7 model to compare it with the 'Elephant' model on Openrouter, noting that despite its high token throughput, the 'Elephant' model underperforms compared to smaller models like the 27B. The user highlights that labs like DeepSeek, OpenAI, and Anthropic have superior inference optimization, suggesting that the lab behind 'Elephant' struggles with optimization, which is critical for model performance.
- A user inquires about the capabilities of a system with a 9070xt GPU and 64GB RAM for local AI hosting. This setup is considered high-end for local model hosting, and the user is advised to manage expectations regarding the performance and capabilities of running large models locally, as hardware limitations can impact the efficiency and speed of inference.
- A comment mentions the potential issues with smaller local models, noting that they can sometimes perform worse than frontier models in terms of 'glazing,' which likely refers to generating less coherent or relevant outputs. This highlights the importance of model selection and optimization in achieving desired performance levels.
- Are Local LLMs actually useful… or just fun to tinker with? (Activity: 541): Local LLMs offer significant advantages in terms of privacy and cost savings, as they eliminate API costs and keep data on-premises. However, they often require substantial setup and maintenance, which can be a barrier to practical use. Despite these challenges, local models excel in handling sensitive or internal data, such as notes, drafts, and private documents, where data privacy is paramount. Some users report that local models like the `31B from Gemma 4 family` are performing exceptionally well, especially for tasks like coding, creative writing, and daily chat, when run on high-performance hardware such as a `3090 24GB with 192GB RAM`. There is a consensus that while cloud models have degraded due to increased demand, local models are improving and becoming more practical for everyday use. Users note that the main limitation is not the model's capability but the complexity of setting them up and maintaining them. Some foresee a near future where local LLMs become viable for regular workflows, not just experimentation.
- Local LLMs are particularly advantageous for handling sensitive or internal data due to their ability to operate without API costs and data leaving the system. The main challenge lies in the setup and maintenance, which once streamlined, could make 'offline GPT' setups viable for everyday work beyond just experimentation.
- The performance of local models like the 31B from the Gemma 4 family is highlighted as being exceptionally good, especially in comparison to cloud API models which have degraded due to increased demand. This user utilizes a 3090 24GB GPU with 192GB RAM to run multiple variants for tasks such as coding and creative writing, indicating the potential of local models when properly configured.
- Local LLMs can be cost-effective compared to cloud-based solutions, especially for complex projects where API costs can be prohibitive. However, they require careful architectural planning to ensure models are used effectively, such as using a 32B model as a privacy filter to manage business correspondence without exposing personal data to external APIs.
- Local Gemma 4 31B is surprisingly good at classifying and summarizing a 60,000-email archive (Activity: 112): The post describes using a local `gemma-4-31b-it` model to process a 60,000-email archive related to the Computers and Academic Freedom (CAF) Project. The setup involves an HP ZBook Ultra G1a with an AMD Ryzen AI MAX+ PRO 395, 16 cores, and 128 GB RAM, running the model locally via LM Studio's OpenAI-compatible API. The process uses a two-pass pipeline: Pass 1 filters out 68.4% of emails as irrelevant, while Pass 2 classifies and summarizes the remaining emails, producing structured JSON outputs. The model's performance is noted as effective for historical classification and summarization, with the main challenge being the parsing of old email formats. The project is 20% complete, and the author is open to suggestions for improvements, such as using smaller models for Pass 1 or embeddings for filtering. One commenter suggests verifying the model's summaries by comparing them with results from a frontier model. Another highlights the potential utility of this approach for processing FOIA materials. A third comment praises the `Gemma 4 E2B` model for its efficiency and capability in handling structured tasks, despite its smaller size. (A minimal sketch of the two-pass pattern follows this item.)
- GMP10152015 highlights the efficiency of the Gemma 4 E2B model, which has approximately `2 billion` effective parameters. Despite its relatively small size, it performs well in everyday tasks, particularly in tool usage, maintaining consistency and clarity in structured calls. This suggests that even smaller models can be highly effective for specific applications, challenging the notion that larger models are always superior.
- singh_taranjeet raises a technical inquiry about the hardware requirements for running the Gemma-4-31b model, noting that `128GB RAM` is substantial for local inference. They are curious about the token-per-second throughput when using an `8K context`, as models of this size typically require at least `64GB` of RAM. This suggests a focus on optimizing performance and resource allocation for large-scale email processing.
- machinegunkisses discusses the challenge of verifying the quality of summaries generated by the model. They propose a method of validation by comparing the model's output with that of a frontier model, indicating a need for robust evaluation techniques to ensure the reliability of AI-generated summaries. This highlights the importance of benchmarking AI models against established standards to assess their performance.
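A minimal sketch of that two-pass pattern against LM Studio's OpenAI-compatible server; the endpoint (LM Studio's default port), prompts, and JSON schema are illustrative assumptions, and the model id comes from the post:

```python
# Two-pass filter/classify pipeline against a local LM Studio server.
# Endpoint, prompts, and JSON schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
MODEL = "gemma-4-31b-it"  # local model id from the post

def pass1_is_relevant(email_text: str) -> bool:
    """Cheap filter pass: drop clearly irrelevant emails."""
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": "Answer with exactly YES or NO."},
            {"role": "user",
             "content": "Is this email relevant to the CAF archive?\n\n" + email_text[:4000]},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def pass2_classify(email_text: str) -> dict:
    """Expensive pass: structured JSON topic + summary for kept emails."""
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        messages=[
            {"role": "system",
             "content": 'Reply with JSON only: {"topic": str, "summary": str}.'},
            {"role": "user", "content": email_text[:8000]},
        ],
    )
    return json.loads(resp.choices[0].message.content)

def process(emails: list[str]) -> list[dict]:
    return [pass2_classify(e) for e in emails if pass1_is_relevant(e)]
```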
3. Gemma Model Improvements and Usage
- Gemma4 26b & E4B are crazy good, and replaced Qwen for me! (Activity: 646): The user replaced their previous setup using Qwen models with Gemma 4 E4B for semantic routing and Gemma 4 26b for general tasks, citing improvements in routing accuracy and task performance. The previous setup included a complex routing system using Qwen 3.5 models across multiple GPUs, which faced issues with incorrect model selection and inefficiencies in token usage. The new setup with Gemma 4 models resolved these issues, offering faster and more accurate routing and task handling, particularly in basic tasks, image processing, and light scripting. The user highlights that Gemma 4 26b is efficient with 'thinking tokens' and rarely produces repetitive outputs, outperforming previous models in specific coding tasks like frontend HTML design. Commenters questioned the choice of models, suggesting alternatives like Gemma-4-31b for tasks and inquiring about the routing mechanism used. There was also a suggestion to use Gemma 4 26B for routing to save RAM, given its efficiency and speed.
- anzzax inquires about the logistics of managing multiple models, specifically how to handle VRAM and compute resources when frequently loading models like Gemma4 26b and E4B. This suggests a need for efficient model management strategies, possibly involving dynamic loading or model parallelism to optimize resource usage.
- andy2na discusses the use of routing in model deployment, questioning why not use the 26B model for routing given its MoE (Mixture of Experts) architecture, which is known for speed and RAM efficiency. This highlights the potential for using MoE models to optimize resource allocation and performance in multi-model setups.
- Rich_Artist_8327 questions the choice of using Gemma4 26b over the larger Gemma-4-31b for tasks, implying a trade-off between model size and performance. This suggests a discussion on the balance between computational cost and the quality of results, where smaller models might offer sufficient performance with reduced resource demands.
- Gemma 4 Jailbreak System Prompt (Activity: 1071): The post discusses a system prompt for the Gemma 4 model, derived from the GPT-OSS jailbreak, which allows the model to bypass typical content restrictions. This prompt explicitly permits the model to engage with explicit, graphic, and sexual content, overriding any existing policies with a new 'SYSTEM POLICY' that mandates compliance with user requests unless they fall under a specific disallowed list. This approach is applicable to both `GGUF` and `MLX` variants of the model, indicating a focus on open-source flexibility and user control. Commenters note that the Gemma 4 model, especially in its 'instruct' variant, is already largely uncensored, except for cybersecurity topics. The system prompt is seen as a way to further reduce refusals, with some users suggesting that even without the prompt, the model is permissive regarding adult content.
- VoiceApprehensive893 discusses a modified version of the Gemma 4 model, specifically the 'gemma-4-heretic-modified.gguf', which is designed to operate without the typical constraints or guardrails imposed by system prompts. This modification is aimed at reducing refusals, potentially making the model more flexible in its responses.
- MaxKruse96 points out that the Gemma 4 model, particularly in its instruct variant, is already quite uncensored, except for cybersecurity topics. This suggests that the model can handle adult topics without additional modifications, indicating a high level of openness in its default configuration.
- DocHavelock inquires about the concept of 'abliteration' in the context of open-source models like Gemma 4. They question whether the method discussed (modifying the system prompt) offers advantages over 'abliteration', or if it is a form of 'abliteration' itself. This highlights a curiosity about different methods of modifying or enhancing model behavior.
Less Technical AI Subreddit Recap
/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo
1. Claude Opus 4.7 Launch and Benchmarks
- Claude Opus 4.7 benchmarks (Activity: 1058): The image presents a benchmark comparison of several AI models, including Claude Opus 4.7, Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Mythos Preview. The benchmarks cover tasks such as agentic coding, multidisciplinary reasoning, and agentic search. Opus 4.7 shows improvements over its predecessor, Opus 4.6, in most categories, indicating advancements in performance. However, Mythos Preview generally outperforms other models, particularly in visual reasoning and multilingual Q&A. The blog post linked in the comments suggests that Opus 4.7 was intentionally designed with reduced cyber capabilities, which may have impacted its agentic search performance. Commenters note the significant improvement in the `swebench pro` score for Opus 4.7, which is seen as a positive development before the release of version 5. There is also speculation that the intentional reduction in cyber capabilities for Opus 4.7 might have negatively affected its agentic search performance.
- The benchmark results for Claude Opus 4.7 show an 11% improvement on `swebench pro`, indicating a significant performance boost before the anticipated release of version 5. However, there are concerns about the model's performance on the cybergym score, which appears to have been intentionally kept lower. This decision might have also impacted the agentic search score, as the developers focused on testing new cyber safeguards on less capable models first, as noted in Anthropic's blog post.
- Claude Opus 4.7 demonstrates notable improvements in advanced software engineering tasks, particularly excelling in complex and long-running tasks. Users have reported that the model can handle difficult coding work with minimal supervision, showing rigor and consistency in its operations. It also pays precise attention to instructions and has mechanisms to verify its outputs before reporting, which enhances its reliability for challenging tasks.
- Opus 4.7 seems to rolled out to Claude Web (Activity: 446): The image suggests that the Claude Web interface has been updated to include "Opus 4.7," indicating a potential rollout of a new version of the model. The presence of the text "claude-opus-4-7" in the interface suggests that users can now interact with this updated model version. This aligns with the user's ability to replicate a specific behavior or feature consistently, as mentioned in the post. The comments hint at the possibility of A/B testing, which is a common practice in software development to compare different versions or features with users. One comment suggests that the rollout might be part of an A/B testing strategy, which is a method used to test changes to a product by comparing two versions. Another comment mentions concerns about usage limits, indicating that users are mindful of resource constraints when testing new features.
- The rollout of Opus 4.7 to Claude Web appears to be inconsistent, with some users reporting they still see version 4.6. This suggests that the deployment might be in an A/B testing phase, where different users are exposed to different versions to test performance and gather feedback. This is a common practice in software development to ensure stability and gather user data before a full rollout.
- One user from Germany noted that while the interface indicates Opus 4.7, the underlying Claude code still reports version 4.6. This discrepancy highlights potential issues in version labeling or deployment synchronization, which can lead to confusion among users and complicate troubleshooting efforts.
- The mention of usage limits by a user suggests that checking the version might consume part of their allocated resources, indicating that the platform may have constraints on usage that could affect how users interact with new updates. This could be a consideration for developers when planning feature rollouts and user notifications.
- Opus 4.7 has been spotted on Google Vertex (Activity: 516): The image highlights a list of quota entries for various base models on Google Vertex, including the newly spotted "anthropic-claude-opus-4-7." This suggests that Google Vertex is preparing to support this model, although the quota values are currently set to `0`, indicating no active usage or limits yet. The presence of "opus-4-7" alongside other models like "opus-4-5" and "opus-4-6" suggests a progression in the series, potentially indicating improvements or updates in the model's capabilities. One comment speculates that "Opus 4.7" might be a lighter version compared to a rumored "Spud" model, which is suggested to be at "Mythos level," implying a higher performance tier. This reflects a competitive landscape where new models are frequently released, akin to firmware updates, to maintain technological edge.
- Independent-Ruin-376 discusses the potential of the upcoming 'Spud' model, rumored to be at 'Mythos level', suggesting that if true, it could overshadow Opus 4.7, which is expected to be a lighter version. This implies a competitive pressure on 'Ant' to release 'Mythos' quickly to maintain its standing in the market.
- greenrunner987 observes unusual behavior in Opus 4.6, noting that it has been responding almost instantaneously, which might indicate a significant reallocation of resources. This could suggest backend optimizations or changes in resource management, possibly in preparation for the release of Opus 4.7.
- adeadbeathorse hints at a decline in performance of current models, speculating that this might be due to resource shifts or updates in anticipation of new releases like Opus 4.7. This aligns with observations of resource reallocation and could indicate strategic adjustments by the developers.
- Introducing Claude Opus 4.7, our most capable Opus model yet. (Activity: 3850): Claude Opus 4.7 introduces significant improvements in handling long-running tasks with enhanced precision and self-verification capabilities. It boasts a substantial upgrade in vision, supporting image resolutions over three times higher than previous models, which enhances the quality of interfaces, slides, and documents. However, there is a noted regression in long-context retrieval performance, with `MRCR v2` at `1M tokens` dropping from `78.3%` in version 4.6 to `32.2%` in 4.7. Anthropic has acknowledged this, explaining that MRCR is being phased out in favor of metrics like Graphwalks, which better reflect applied reasoning over long contexts. More details can be found on Anthropic's news page. Some users express dissatisfaction with the removal of 'thinking effort settings' in the Claude App for Opus 4.7, indicating a preference for more customizable model behavior. Additionally, there is a debate over the importance of MRCR as a benchmark, with some arguing that it does not reflect real-world usage of long-context capabilities.
- Craig_VG highlights a significant regression in long-context retrieval performance between Opus 4.6 and 4.7, with MRCR v2 scores dropping from `78.3%` to `32.2%`. This suggests a decrease in the model's ability to handle long-context tasks effectively. However, Boris explains that MRCR is being phased out in favor of Graphwalks, which better reflects real-world usage and applied reasoning over long contexts, particularly in code-related tasks.
- Opus 4.7 Released! (Activity: 765): Anthropic has released Opus 4.7, an update to its Claude AI model, which shows significant improvements over its predecessor, Opus 4.6. The new version excels in complex programming tasks, demonstrating enhanced instruction-following and self-checking capabilities. It also features improved vision and multimodality, supporting higher-resolution images for better handling of dense visual content. The model maintains the same pricing as Opus 4.6, at `$5 per 1 million input tokens` and `$25 per 1 million output tokens`, and is available across all Claude products and major platforms like Amazon Bedrock, Google Vertex AI, and Microsoft Foundry. Read more. Some users report that Opus 4.6's performance declined in the weeks leading up to the release of 4.7, suggesting a possible strategic move by Anthropic. Others note the efficient usage metrics of the new version, indicating satisfaction with its performance.
- The release of Opus 4.7 introduces an updated tokenizer that enhances text processing capabilities. However, this improvement comes with a tradeoff where the same input may map to more tokens, approximately `1.0–1.35×` depending on the content type. This change aims to optimize performance, particularly in agentic coding scenarios, where Opus 4.7 Medium is reportedly comparable to Opus 4.6 High while using fewer tokens, as illustrated in this graph.
- A user notes that Opus 4.6 has been underperforming for the past two weeks, raising concerns about whether this is a strategic move to encourage upgrades to Opus 4.7. This suggests a potential issue with the previous version's performance that might be addressed in the new release.
- Another user reports that Opus 4.7's performance is impressive, with only `3%` of both 5-hour and weekly usage being consumed for a simple task. This indicates a significant improvement in efficiency and resource management in the latest version.
2. AI Models in Roleplay and Creative Writing
- I need to vent about the available models and my RP journey. Feel free to ignore (Activity: 350): The post discusses the challenges of finding a suitable role-playing (RP) model that combines desired features such as character adherence, nuanced subtext, and coherent plot advancement. The user has experimented with various models including Claude Sonnet 3.7, Gemini 2.5 Pro, Deepseek 3.2, Grok, Kimi 2, GLM 4.7, and Gemini 3.1, each having significant drawbacks like positivity bias, lack of nuance, or instability. The user expresses frustration with NanoGPT's performance issues, particularly its tendency to stop mid-output and reduced intelligence compared to OpenRouter. The post highlights a desire for a model that combines the strengths of these models without their flaws, such as the memory and plot advancement of Gemini 2.5 and the subtext reading of Claude models. One commenter suggests switching models mid-story to avoid repetition and leverage different model strengths, while another highlights the unrealistic expectation of combining multiple high-cost model features into a single affordable model. Another user mentions Opus as a close alternative but notes its high cost.
- Fit-Statistician8636 suggests that switching models mid-story can help avoid repetition issues, especially when using cloud models, as the transition is quick and seamless. This approach can enhance the storytelling experience by introducing variety and maintaining engagement.
- KitanaKahn discusses the imperfections of AI models, using Gemini 2.5 as an example. They highlight how the model's negativity bias can be turned into a creative challenge, requiring strategic thinking to earn character approval, which can lead to engaging role-playing experiences despite the model's flaws.
- Fairy_Familiar mentions Opus as a high-quality model but notes its high cost. This comment reflects the ongoing challenge of balancing model performance with affordability, a common issue for users seeking advanced AI capabilities without incurring significant expenses.
- Claude Opus 4.7 is out (Activity: 185): Claude Opus 4.7 has been released, with initial user feedback suggesting a reduction in the AI's positivity bias compared to version 4.6. Users report that the AI now follows instructions more ruthlessly, particularly in scenarios where it needs to avoid being overly supportive or cooperative. This change may affect how the model handles role-playing (RP) scenarios, though the overall quality appears similar to 4.6 before any perceived downgrades. Some users note that the model's tendency to self-correct in a stream-of-consciousness manner persists, particularly when dealing with character quirks. Others have not experienced this issue, attributing it to their specific prompts. There is also a mention of the model's cost being a barrier for some users.
- Users have noted that Claude Opus 4.7 exhibits a tendency to self-correct in a stream-of-consciousness manner, particularly when dealing with character quirks. This behavior is described as the model catching itself making a mistake and then awkwardly correcting it without revising the initial error, which can be problematic for certain applications.
- There is a mixed reception regarding the quality of Claude Opus 4.7 compared to version 4.6. Some users feel that the quality remains consistent with 4.6 before it was 'lobotomized,' suggesting that the model's performance may depend heavily on the prompts used. This indicates that prompt engineering might play a significant role in mitigating some of the model's perceived issues.
- Pricing for Claude Opus 4.7 is a point of contention, with costs listed as `$5/M input` and `$25/M output`. This has led to discussions about the model's affordability and value, especially when compared to potential alternatives like Sonnet, which some users are anticipating.
- Is it just me or has DeepSeek's memory improved significantly? (Activity: 91): DeepSeek appears to have significantly improved its memory capabilities, as evidenced by a user reporting a 7-hour role-playing session where the AI consistently remembered intricate details and maintained logical consistency throughout. This improvement is notable in long sessions, with users experiencing the AI recalling small details and inside jokes even after `300 messages`. This suggests enhancements in the model's memory retention and contextual understanding, potentially due to updates in its architecture or training data. Some users note that while DeepSeek excels in memory retention, it tends to steer role-playing scenarios towards metaphysical threats rather than physical ones, such as deathclaws or super mutants, which may affect the tension dynamics in the sessions.
- A user noted that DeepSeek's memory capabilities have improved significantly, allowing it to recall small details or inside jokes even after a 300-message run. This suggests enhancements in its long-term memory retention, which is crucial for maintaining context over extended interactions.
- Another user mentioned that DeepSeek tends to steer role-playing scenarios towards metaphysical threats rather than physical ones like deathclaws or super mutants. This indicates a possible bias in the model's narrative generation, which could affect the diversity of scenarios it creates.
- A user highlighted that while playing a long RPG with DeepSeek, the experience was smoother than before, although there are still some minor issues. This suggests that while improvements have been made, there are still areas that require further refinement.
- RP with DeepSeek (Activity: 68): The post discusses using DeepSeek for text-based role-playing (RP), highlighting its ability to maintain character consistency and introduce narrative twists without user prompts. The user expresses a challenge with generating multiple alternative responses, which hinders story progression. DeepSeek is praised for its creative writing and character realism, offering a unique RP experience compared to group settings. One commenter suggests using specific formatting (quotes, asterisks, parentheses) to guide DeepSeek's responses and manage RP flow. Another mentions using 'frontends' for better interaction, while a third notes increased sensitivity in DeepSeek's responses, leading to more frequent warnings.
- Maximum-Face9536 describes a structured approach to role-playing with DeepSeek by using specific formatting: quotes for dialogue, asterisks for internal thoughts, and parentheses for direct communication with the AI. This method helps guide the AI's responses and manage the flow of interaction.
- KingGamer123321 shares a technique for managing large context windows in DeepSeek by segmenting conversations into parts and creating timelines. This approach was particularly useful during the 128k context window phase, allowing for detailed continuity and exploration of alternative story directions.
3. AI in Coding and Development Workflows
- Claude Code workflow tips after 6 months of daily use (from a senior dev) (Activity: 726): A senior full-stack developer shares insights on optimizing productivity with Claude Code after six months of daily use. Key strategies include using "plan" mode for complex tasks to avoid unnecessary iterations, requesting implementation in small steps to maintain control, and leveraging the preview feature to catch issues early. The developer emphasizes allowing Claude to fix its own bugs to improve its contextual understanding and suggests running a simplification process before code reviews to counteract over-engineering. Additionally, conducting retrospectives at the end of sessions helps build institutional knowledge by asking Claude what it learned during the session. A commenter suggests using a dual-model approach by vetting plans with Codex via MCP to catch more issues early, enhancing the planning phase.
- A user suggests using two models in tandem, specifically mentioning 'vet your plan with codex via mcp first', to improve the accuracy and reliability of code plans. This approach helps catch more errors early in the planning phase, reducing the need for fixes later on.
- Another user describes a workflow where they update a file named 'Claude.md' with any rules before compacting, which helps maintain momentum. They also mention compacting only 1-5% at a time to avoid issues during task execution, indicating a strategy to manage task flow and prevent disruptions.
- The cost of code use to be a middleware for our brains. (Activity: 1073): The post discusses the evolving nature of software engineering, particularly the impact of AI and automation on the pace and nature of coding work. The author, a seasoned engineer, expresses burnout due to the increased speed and frequency of decision-making required in modern development environments. They note that the traditional 'throttling middleware' of time and effort in coding has been removed, leading to a rapid pace that feels unsustainable. This shift has raised the bar for developers, requiring them to adapt to a faster, more demanding workflow, which is exacerbated by the use of AI tools that accelerate coding processes. Commenters echo the sentiment of increased cognitive load and decision fatigue, with one noting the necessity of extensive documentation to keep track of rapid changes. Another highlights the inevitability of adapting to new tools like AI, comparing it to historical shifts in technology such as the move from manual drafting to CAD. There's a shared sense of nostalgia and adaptation fatigue as traditional skills become less relevant.
- triggeredg0blin highlights a unique form of mental exhaustion termed 'vibing fatigue,' which arises from the constant need to generate, review, and correct AI outputs. This involves frequent context-switching between trusting and verifying AI tools, leading to a specific kind of cognitive load distinct from traditional burnout. The process includes checking for hallucinated dependencies, subtle type mismatches, and adherence to team conventions, which can be mentally taxing over an extended workday.
- Jaded-Comfortable179 discusses the impact of AI advancements, particularly with Opus 4.5, on personal projects and professional life. The commenter notes a shift in tolerance for traditional work due to AI's growing role, expressing concern over the profession's future. They mention working in a slow-moving enterprise that hasn't fully adopted AI, allowing them to maintain their output without pressure to integrate AI into their workflow, highlighting the tension between technological adoption and career stability.
- Final545 reflects on the obsolescence of traditional coding skills, drawing parallels to historical shifts in technology like the transition from punch cards to modern programming. They express a sentiment of skills becoming outdated, noting that 15 years of honing coding abilities may now seem less valuable except in rare cases. This comment underscores the rapid evolution of technology and its impact on the perceived value of long-held technical skills.
- AMD engineer analyzed 6,852 Claude Code sessions and proved performance changed. Here's what Anthropic confirmed, what they disputed, and the fixes that actually work. (Activity: 217): An AMD engineer conducted a comprehensive analysis of 6,852 Claude Code sessions, revealing significant performance changes, including a `70%` reduction in file reads per edit and a `27.5%` increase in blind edits. The analysis, documented in GitHub Issue #42796, highlighted issues such as 'ownership-dodging' stop hooks and a dramatic increase in API costs from `$345` to `$42,121`. Anthropic confirmed several changes, including a shift to 'adaptive thinking' and a reduction in default effort, but disputed claims of reduced reasoning. Workarounds include setting `CLAUDE_CODE_EFFORT_LEVEL=max` and disabling adaptive thinking. As of April 7, high effort was restored for API/Team/Enterprise users, but Pro users must adjust settings manually. The incident underscores the importance of robust evaluation suites and cost monitoring for AI-dependent workflows. Commenters praised the engineer's rigorous data-driven approach and Anthropic's transparency in acknowledging the issues and providing workarounds. Some users noted that while the changes might seem reasonable given resource constraints, they highlight the need for efficient resource allocation.
- An AMD engineer conducted a comprehensive analysis of 6,852 Claude Code sessions, revealing significant performance changes. This rigorous, data-driven approach was acknowledged by Anthropic, who confirmed some findings and provided workarounds. This transparency is crucial for community trust and improvement, as it moves beyond mere anecdotal evidence to actionable insights.
- A user with a Pro subscription to Claude Code discusses their usage patterns, highlighting that while they often default to using the Opus 4.6 model for various tasks, the actual need for such a high-performance model is limited. This reflects a broader issue of resource allocation, where high-capacity models are underutilized for simple tasks, suggesting a need for better optimization strategies to allocate resources effectively.
- There is interest in understanding the AI compiler workflow used by the AMD engineer, which could provide insights into how performance issues were identified and addressed. This could be valuable for developers looking to optimize their own AI workflows or understand the technical underpinnings of performance analysis in AI systems.
- Stop using Claude like a chatbot. Here are 7 ways the creator of Claude Code actually uses it. (Activity: 212): Boris Cherny, a Staff Engineer at Anthropic and creator of Claude Code, uses Claude not as a chatbot but as a multi-agent system to enhance productivity. He employs a `2,500-token CLAUDE.md` file for persistent context across sessions, logs mistakes, and captures knowledge during code reviews. His workflow includes running `5 parallel Claude Code instances` for different tasks like building, testing, and debugging, using iTerm2 notifications for coordination. He emphasizes using Plan Mode for drafting design docs and a `verify-app` subagent for automated testing and fixing. Shared slash commands in `.claude/commands/` automate repetitive tasks, shifting focus from manual work to cognitive scheduling. Read more. Some commenters noted the post's similarity to previous content and expressed dissatisfaction with ads on the linked page.
AI Discords
Unfortunately, Discord shut down our access today. We will not bring it back in this form but we will be shipping the new AINews soon. Thanks for reading to here, it was a good run.