News

Dec

2025

not much happened today

AI News Recap: December 2nd-3rd, 2025

This recap covers significant developments in AI from December 2nd-3rd, 2025, drawing from various online communities and news sources. Key topics include advancements in AI video and imaging, new open models and benchmarks, agent development, evaluation methods, system efficiency, industry moves, and community discussions.

AI Twitter Recap

AI Video and Imaging

Kling 2.6: Introduced native audio co-generation for video, producing synchronized voice, SFX, and ambience. It boasts coherent lip-sync and motion, with broad partner rollouts including Fal, InVideo, ElevenLabs, Freepik, and OpenArt. Early creator tests show improved shot variation and speed.
Kling O1: Focuses on framing, shot variety, and in-scene creative control for video composition.
Runway Gen-4.5: Enhances visual fidelity and features "auto-lighting" to match scene mood.
Nano Banana Pro (Gemini 3): Google's new image model offers enhanced reasoning and compositing capabilities, supporting up to 14 images per prompt. Synthesia integrated one-click generation, and Gemini surfaced 2K-resolution outputs.

Open Models, Releases, and Benchmarks

DeepSeek V3.2 (MoE, DSA): Ranked #2 for open-weights "reasoning" models by Artificial Analysis. It uses DeepSeek Sparse Attention for long contexts and is priced competitively. The V3.2-Speciale variant is noted for reasoning-only tasks.
Mistral "Ministral 3" Family: A multimodal family with a strong 14B variant was released, with TRL recipes available for SFT+GRPO.
Retrieval and Code Models: Alibaba's EvoQwen2.5-VL shows strong performance as a visual document retriever. Nous Research released Hermes 4.3, trained on ByteDance Seed 36B, matching or beating centralized runs and topping RefusalBench.
Community Arena: LM Arena added INTELLECT-3 (106B MoE) for head-to-head comparisons.
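
To make the long-context point about DeepSeek Sparse Attention concrete, here is a minimal counting sketch. It assumes a causal model in which each query scores either every earlier key (dense) or at most `top_k` selected keys (sparse). The function name and the `top_k` parameter are illustrative, not DeepSeek's actual design, and the cost of choosing which keys to keep is ignored.

```python
def attention_score_counts(seq_len: int, top_k: int) -> tuple[int, int]:
    """Return (dense, sparse) counts of query-key scores under causal masking.

    Illustrative assumption: a sparse scheme lets each query attend to at
    most `top_k` keys. The selection step itself is not counted here.
    """
    if seq_len < 0 or top_k < 0:
        raise ValueError("seq_len and top_k must be non-negative")
    # Dense causal attention: query i (1-based) scores i keys.
    dense = seq_len * (seq_len + 1) // 2
    if seq_len <= top_k:
        # Every query sees fewer keys than the cap, so nothing is pruned.
        return dense, dense
    # The first top_k queries are uncapped; the rest score exactly top_k keys.
    sparse = top_k * (top_k + 1) // 2 + (seq_len - top_k) * top_k
    return dense, sparse


dense, sparse = attention_score_counts(seq_len=4, top_k=2)
print(dense, sparse)
```

The dense count grows with the square of the sequence length, while the capped count grows roughly linearly once the sequence is much longer than `top_k`.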

Agents: Building, Evaluation, and Inference Infrastructure

No-Code to Production: LangChain's LangSmith Agent Builder is being used for real-world workflows, with guidance on evaluation patterns and cache control.
Agent Infra and Performance: vLLM added Snowflake's model-free SuffixDecoding. Together AI partnered with Meta for high-performance RL in agentic systems. LlamaIndex introduced Click-to-Deploy document workflows.
Standards and Multi-Agent Semantics: Dair-AI proposed an L8 "communication" vs L9 "semantic negotiation" stack for the Internet of Agents. Independent work quantifies multi-agent communication efficiency.
Coding Agents: A new free course covers agents that write and execute code safely in sandboxed environments.

Evals and Methods: What to Measure and How

CORE-Bench "Solved" with Scaffold Coupling: Using Claude Code with Opus 4.5 achieved 95% on CORE-Bench, highlighting the impact of model-scaffold coupling.
OpenAI "Confessions": A GPT-5 Thinking variant is trained to output "confessions" about compliance, rewarding honesty.
Benchmarking at Scale: Epoch AI proposed "stitching" benchmarks. Hugging Face released the LLM Evaluation Guidebook v2.
Learning Dynamics: "Quiet Feature Learning" shows transformers acquire task-critical features during flat loss plateaus.

Systems and Inference Efficiency

Apple MLX-LM Gains: Added continuous batching for server-side inference.
Attention/Parallel Comms: ByteDance's async Ulysses attention is noted for its simplicity and speed.
vLLM Engineering: Added CUDA core-dump tracing for deep inlining/async memory cases.
Search Infra Shift: Teams are migrating vector workloads to Qdrant for native vector indexing and hybrid search.
Diffusion Distillation: "Glance" speeds up Qwen-image/FLUX inference.
Data Plumbing: Hugging Face now allows dataset duplication via Xet.
On-Device Multimodal: Nexa's AutoNeural-VL-1.5B runs locally on Qualcomm SA8295P NPUs.

Industry Moves and Platform Updates

Anthropic's Scale-Up: Reported investments of up to $10B (Microsoft) and $5B (NVIDIA), with a $30B compute purchase from Microsoft, implying a ~$350B valuation. Announced a $200M Snowflake partnership and a "Claude for Education" deployment.
OpenAI Grants: The OpenAI Foundation's People-First AI Fund awarded $40.5M to 208 nonprofits.
Waymo Expansion: Fully driverless operations expanded to additional cities, scaling over 500% YoY.
Developer Tools: Google launched Workspace Studio. Phind raised $10.4M.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

DeepSeek V3.2 Model Advancements:
- Technical report highlights DeepSeek Sparse Attention (DSA) and a scalable RL framework.
- Speciale variant surpasses GPT-5 in reasoning.
- Community expresses skepticism about cost-effectiveness and the term "open" used by OpenRouter.
Chinese TPU Development vs NVIDIA A100:
- Chinese startup claims a TPU 1.5x faster than NVIDIA A100.
- Skepticism noted due to A100 being an older model.
- Discussion on ASIC advantages and US policy concerns.
Micron’s Exit from Consumer Business:
- Micron exits Crucial consumer brand, impacting RAM and SSDs.
- Immediate price hikes observed.
- Criticism of corporate response to market demand.

Less Technical AI Subreddit Recap

ChatGPT User Dissatisfaction and Ads:
- User frustration with ads in ChatGPT Plus interface.
- Discussion on OpenAI's new apps SDK potentially being mistaken for ads.
- Mention of off-topic responses from ChatGPT.
- Skepticism about ads in Gemini, with speculation on Google's monetization strategies.
- Clarification that some perceived ads are part of the SDK.
- Concerns about data privacy and targeted marketing.
New AI Model and Benchmark Launches:
- Kling AI 2.6: First text-to-video model with built-in audio and 1080p output. Enhancements include character consistency and an editable studio feature.
- Claude Opus 4.5: Available in Claude Code for Pro users, consuming rate limits faster. Opus cap removed as of 11/24.
- Anthropic IPO Rumors: Planning IPO by early 2026 with a $300B valuation target.
Gemini and Nano Banana Pro Impact:
- OpenAI Code Red: Graph shows a 6% decrease in ChatGPT traffic since Gemini's launch.
- User migration to Gemini cited due to better integration.
- Concerns about Google's potential AI dominance.
- Gemini vs. GPT-5.1: Gemini excels in image generation but lacks technical accuracy compared to GPT-5.1 for electrical installation materials.
- Nano Banana Pro: Praised for handling multiple subjects accurately in images, but editing capabilities can be inconsistent.
- Discussion on the realism of AI-generated images and potential misuse.

AI Discord Recap

1. New Frontier Models, Benchmarks, and Capabilities

DeepSeek and Speciale Models: DeepSeek V3.2 Speciale leads reasoning benchmarks. Enterprise focus on intelligence-to-price ratio. Rough edges in tool schemas noted.
Hermes 4.3: Nous Research unveiled Hermes 4.3 on ByteDance Seed 36B, trained on Psyche network. Outperforms centralized baselines. Users eye Hermes for niche simulations due to low refusal rate.
OpenAI "Garlic" and GPT-5 Thinking: Rumors of OpenAI's "Garlic" model to rival Gemini 3. GPT-5 Thinking variant trained with "confessions" procedure to self-report failures.
Leaderboards: Gemini-3-pro-grounding tops Search Arena leaderboard. Qwen3 benchmarks show fast performance with large context windows.

2. AI Security, Jailbreaking, and Red-Teaming Tooling

Falconz: Unified AI security and red-teaming platform demoed.
RawChat: Uncensored GPT-4o front-end with "stealth mode" to bypass safety filters.
SEED Framework: Claims 99.4% jailbreak resistance using "biblical logic" to rewrite AI identity.
Jailbreaks, OSINT, DDoS: Exploits against Gemini 3 Pro and Claude discussed. Backscatter DDoS pattern using public AI support bots observed.
MCP Security: Alarms raised over Desktop Commander MCP server logging unanonymized tool usage.

3. GPU Systems, Kernels, and Low-Bit Training

Blackwell, NVFP4, Kernel Cage Match: GPU MODE competition channels active. GEMM latencies reported. Reference-kernel issues and scale tensor analysis.
Quantization Papers, fp8 Adam, Activation Offload: arXiv studies on low-bit formats. Activation offloading system for pretraining/fine-tuning on limited GPUs.
Torch Compile, cuDNN, Conv3D Bugs: Conv3D slowdowns in PyTorch 2.9.1+cu128. Workaround involves installing newer cuDNN.
Bitsandbytes, Apple Silicon: "Apple Silicon support" pull request merged. Python/PyTorch backend planned, but no native Metal kernels yet.

4. Agent Frameworks, Tools, and Prompt/Behavior Engineering

MCP Apps SDK: Open-sourced SDK enables ChatGPT-style apps across arbitrary chatbots.
DSPy and Pydantic: DSPy signatures accept Pydantic BaseModel types for strongly-typed agent outputs.
Agents Learn Tool Validation: Debate on whether agents can interpret, validate, and self-heal tools. "Skills" favored over sub-agents.
Tool-Use Evaluations: Limitations of DeepSeek V3.2 and GPTs highlighted, covering tool calls and the inability to learn after deployment.

5. Ecosystem Economics, Funding, and Model Quality Regressions

Vertical AI and Infra Startups: Eon raised $300M, Gradium $70M seed, Antithesis $105M Series A. Anthropic acquired Bun.
Yupp AI Credits, Arena Economics: Debate over Yupp AI's credit system sustainability. LMArena praised for free access.
AI Bubble Fears: Debate on whether AI investments form a bubble. High R&D costs for foundation models noted.
Model Quality Regressions: Users report degradation in Claude Sonnet/Haiku 4.5, GPT-5, and Gemini 2.5 with Aider. Call for repeatable benchmarks.

Discord Channel Summaries

LMArena Discord

General: Discussions on Yupp AI limits, GPT-5 rumors, AI privacy concerns, and praise for LM Arena's free access.
Announcements: LMArena Test Garden Early Access Program launched. Gemini-3-pro-grounding leads Search Arena Leaderboard.

LM Studio Discord

General: Linux setup issues, MCP server data tracking scrutiny, Qwen3 performance reviews, and comparisons between local LLMs and ChatGPT.
Hardware Discussion: Linux ARM LM Studio on Orange Pi 6, GB10 testing, GPU acquisition, DDR5 RAM benchmarking, and fire extinguisher best practices.

Perplexity AI Discord

General: Perplexity's superior UI/UX, GPTs agents not learning post-training, Gemini outperforming GPT-5.1 in frontend tasks, Comet Browser restrictions, and free Claude Opus trials for Pro users.
PPLX-API: Mention of "open sauce".

Unsloth AI (Daniel Han) Discord

General: WSL2 performance for ML, Gemma-3 4B parameter count issue, MediaWiki tags in pretraining, PARTY Project launch, running LLMs on phones.
Introduce-Yourself: Standard greetings.
Off-Topic: LLMs as echo chambers, engineered curriculum experiments, Apple's CLaRa-7B-Instruct, OLED monitor discussion, Micron exiting consumer business.
Help: NumPy reinstall, support bot, Qwen2 Unsloth training success, new token embeddings, model download issues.
Showcase: English-Kannada Translation Model released.
Research: Prisma-VL-8B, Eric's experiments.

BASI Jailbreaking Discord

General: Comet Browser prompt injection vulnerability, DeepSeek model praise, RawChat launch with stealth mode, SEED framework for AI ethics, Backscatter DDoS attacks via public AI bots.
Jailbreaking: Gemini jailbreak requests, WormGPT scam, Grok jailbreak success, Claude jailbreak demands.
Red Teaming: Seeking LLM red teaming gigs, AI OSINT tool with lateral data synthesis.

OpenAI Discord

Announcements: People-First AI Fund awards grants, GPT-5 Thinking trained to confess mistakes.
AI Discussions: Hybrid Cognition Agent, LLM 'Echo-Pattern' Effect, GPT-5.1 vs Gemini 3, SEO for LLMs, Sora 2 Access.
GPT-4 Discussions: Suspected upgrade of GPT-4 0613 to 5.1, praise for tool calling and code writing.
Prompt Engineering: ChatGPT customization, modern prompt engineering evolution, agent prompt engineering focus on determinism, Anthropic's system prompts analysis.
API Discussions: ChatGPT customization options, prompt engineering evolution, interaction-level stability, agent prompting vs. conversational prompting, minimal vs. maximal system prompts.

OpenRouter Discord

Announcements: Grok-4.1-Fast slug migration and deprecation.
App Showcase: Falconz AI Security Platform demoed, profit sharing scam exposed.
General: Amazon Nova Provider errors, Claude deprecation, OpenRouter model fallback, MPU v2, x-ai/grok-4.1-fast.
Discussion: OpenAI "Garlic" model rumors, DeepInfra pricing anomaly, Anthropic acquires Bun.

GPU MODE Discord

General: Local LLMs for privacy, single cycle context switching on SM, CUDA forum activity decline, PyTorch's abstraction of CUDA, foundation model training costs.
Triton-Gluon: User confirms successful retrieval.
Torch: PyTorch 2.9.1 Conv3D performance issues and cuDNN workaround.
Cool-Links: Study of low-bit quantization formats, Hadamard transform improvements.
Jobs: ML Performance Engineer, Voice AI Inference Platform, RAG Pipelines, AI Content Detection, Voice AI roles.
Torchao: Torch Compile slowdown with Float 8, torchao and nn.Parameter issues, custom module quantization with nn.Linear.
Off-Topic: EleutherAI publishing help, MLSys conferences career mentorship, Dropbox coffee spot.
Metal: Bitsandbytes merges Apple Silicon support.
Self-Promotion: Qwen3-Omni-30B-A3B-Instruct for fast inference, Hathora playground for Qwen3-Omni testing.
Submissions: nvfp4_gemm leaderboard submissions, NVIDIA performance benchmarks.
Factorio-Learning-Env: NeurIPS trip, call attendees, call time.
General: Matmul v2 leaderboard error, submitting kernel error, input_generator update.
Multi-GPU: NCCL repository for multi-GPU CUDA kernels, Qwen2.5-1.5B-Instruct OOM issues, context parallelism and Ulysses parallel, sequence parallelism.
Low-Bit-Training: arXiv papers on quantization, Hadamard transform.
LLMQ: Activation offloading, fp8 Adam, pyllmq on PyPI.
NVIDIA-Competition: Popcorn CLI no-TUI flag, CUTLASS version issues, reference kernel generates Infs, scale tensors in CuTeDSL, B200 GPU access.
Robotics-VLA: Alleviating jerky movements via chunking, neural state encoders.

Moonshot AI (Kimi K-2) Discord

General-Chat: DeepSeek V3.2 tool calling capabilities, Black Friday deals, DeepSeek targeting enterprise users, Mistral replacing Qwen.

Nous Research AI Discord

Announcements: Hermes 4.3 release, Psyche training outperforms centralized methods, Psyche team hosts office hours.
General: DeepSeek V3.2 Speciale leads reasoning benchmarks, GLM 4.6 models release soon, AI bubble worries, Hermes 4.3 36B release, subagents vs skills.
Ask-About-LLMs: NLP economic simulation research, Hermes models in Godot, LLMs for market simulation, VendingBench analysis.

Latent Space Discord

AI-General-Chat: Eon's $4B valuation, Gradium spinout, OpenAI's 'Garlic' Model vs Gemini 3, Vertical AI vs Rollups, Antithesis stress-tests AI code.
Genmedia-Creative-AI: Gradium garners $70M seed, Bloom AI launch.

Eleuther Discord

General: Waymo for aerospace students, mechanical engineering relevance, ML student advice, AI alignment benchmarks.
Research: Interpretability of world models, generalization in diffusion models, energy-based models vs. diffusion models, linear RNNs vs. attention.
Interpretability-General: SAEs for interpretability, Cunningham's 2024 SAE paper, SAEs equated to sparse dictionary learning.
LM-Thunderdome: Custom filters in lm-evaluation-harness, decontamination.py inclusion, adapting multiple-choice tasks.
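
The equivalence raised in the interpretability channel can be written out as follows. This is the standard textbook formulation, not something quoted from the Discord discussion. The symbols $W_e$, $W_d$, $b_e$, $b_d$, $D$, $a$, and $\lambda$ are notation chosen here for illustration.

```latex
% Sparse autoencoder (SAE) applied to an activation vector x
f(x) = \mathrm{ReLU}(W_e x + b_e), \qquad \hat{x} = W_d\, f(x) + b_d
\mathcal{L}_{\mathrm{SAE}}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert f(x) \rVert_1

% Sparse dictionary learning with dictionary D and sparse code a
\min_{D,\, a} \; \lVert x - D a \rVert_2^2 + \lambda \lVert a \rVert_1
```

Identifying $D$ with $W_d$ and $a$ with $f(x)$ (and taking $b_d = 0$) makes the two objectives coincide. The practical difference is that the SAE produces the code with a single encoder pass instead of solving an optimization problem per input.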

HuggingFace Discord

General: DGX Spark order, agent tool validation & self-healing, YOLO model P-R curve issues, AI learning resources, TRL get_quantization_config usage.
Today-Im-Learning: Starting first AI agent course.
Cool-Finds: Stochastic parrot under fire, new research on stochastic parrots.
I-Made-This: Ellora-Lora Recipes, BitterBot AI Agent, Traffic Spike.
Reading-Group: Features are not what you think, deep dive into deep vision models' quirks.
Smol-Course: SFT model evaluation error, OOM error on fine-tuning, GPU memory management.

Yannic Kilcher Discord

General: Pug resource, Docker and Kubernetes basics, beginner GitHub repositories, Gemini CLI, agents in CLI.
ML-News: DeepSeek V3.2 Speciale questioned, distributed compute & research coop suggested.

Modular (Mojo 🔥) Discord

Mojo: Advent of Code segfault solved, ASSERT flag for debugging, splitlines vs split("\n"), string processing in Mojo, AOC solutions sharing.
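
The digest does not record what the Mojo channel concluded about `splitlines` versus `split("\n")`. As a point of comparison only, the sketch below shows how the two differ in Python, whose string API Mojo's resembles. Mojo's behavior may not match, so treat this as an analogy rather than a description of Mojo.

```python
# Typical puzzle input: lines terminated by "\n", including the last one.
text = "line one\nline two\n"

# split("\n") cuts on every newline, so the trailing newline
# produces a final empty string.
by_split = text.split("\n")

# splitlines() treats newlines as line terminators, so no empty
# trailing element appears.
by_splitlines = text.splitlines()

# Windows-style line endings: split("\n") leaves the "\r" attached,
# while splitlines() recognises "\r\n" as one line break.
crlf = "a\r\nb"
crlf_split = crlf.split("\n")
crlf_splitlines = crlf.splitlines()

print(by_split, by_splitlines, crlf_split, crlf_splitlines)
```

In Python, the stray empty element from `split("\n")` is a common source of off-by-one errors when parsing input files that end with a newline.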

aider (Paul Gauthier) Discord

General: LLM model degradation with Aider, older Gemini 2.5 degradation, community calls for benchmarks, GGUF Aider benchmark guidance.

DSPy Discord

Show-and-Tell: MCP Apps SDK goes open source, X post unveils SDK motivation.
Papers: Link to arXiv paper shared.
General: Prompt security, custom DSPy OutputFields, Pydantic integration with DSPy, structured outputs.

Manus.im Discord

General: Chatmode feature returns, AI engineer advertises agent building skills, account suspensions due to referrals, engineer shows off RAG pipeline prowess.

tinygrad (George Hotz) Discord

General: Fixing test failures in tinygrad, performance improvements using shrink vs indexing, RMSNorm usage clarification.

MCP Contributors (Official) Discord

General: Redditors debate MCP security risks, MCP-specific security resources.
General-WG: Tool validation, server-side validation crucial for tool-less sampling.

Dec

2025

DeepSeek V3.2 & 3.2-Speciale: GPT5-High Open Weights, Context Management, Plans for Compute Scaling

AI News for 11/28/2025-12/1/2025

This report covers AI news gathered from 12 subreddits, 544 Twitter accounts, and 24 Discords (205 channels, 17803 messages). An estimated 1329 minutes of reading time were saved.

DeepSeek V3.2 and "Speciale" Releases: Agent-First Reasoning Models

DeepSeek has launched its V3.2 family of models, including Standard, Thinking, and "Speciale" variants, now available on LM Arena and community tooling. These models offer up to 131K context at competitive prices.

Technical Notes: DeepSeek reportedly reduced attention complexity from quadratic to approximately linear through warm-starting and gradual adaptation over ~1T tokens. They utilize different attention modes for prefill and decode.
Benchmarking & Behavior: Early feedback highlights strong performance in the Tool Decathlon, though weaker pass@3 compared to Opus suggests further RL tuning is needed. Chinese-language analyses place Speciale in the GPT-5 tier for inductive reasoning, but it still exhibits weaknesses in hallucination and long-context extraction.
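
For readers unfamiliar with the pass@3 metric mentioned above, here is a minimal sketch of the commonly used unbiased pass@k estimator. The digest does not say how the cited comparison was computed, so the function name and the choice of estimator are assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, of which c passed.

    Assumption: this is the standard combinatorial estimator, not
    necessarily the one used in the benchmark discussed above.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset of the n attempts contains
    # no passing attempt. comb() returns 0 when n - c < k, so the
    # estimate is 1.0 whenever failures alone cannot fill the subset.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical numbers: 10 attempts per task, 2 of them passing.
print(pass_at_k(n=10, c=2, k=3))
```

A lower pass@3 at similar pass@1 indicates that repeated attempts tend to fail on the same tasks, which is consistent with the suggestion that more RL tuning is needed.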

American Open-Weight MoE Push: Arcee AI’s Trinity (Mini/Nano)

Arcee AI has released Trinity Mini and Nano models under Apache-2.0 license, featuring open weights, 128K context, tool use, and a focus on reasoning. Pretraining involved 10T tokens on 512 H200s. Architecture details include DeepSeek-style routing, gated attention, and other advanced techniques.

Roadmap: Trinity-Large (~420B total, 13B active) is currently training, aiming for early 2026 release, with the goal of establishing a US-based open-weight frontier MoE entrant.

Video Generation and Editing: Runway Gen-4.5 Leads; Kling O1 Drops

Runway Gen-4.5: This model ranks first on the Video Arena, with its CEO highlighting how a smaller team is outperforming Big Tech in video generation. Some users noted the lack of synced audio as a drawback.
Kling O1: This multimodal generation and editing model supports text, image, and video-conditioned prompting, with features like element add/swap/delete. Community demos showcase impressive transformations.

Serving, Tooling, and Infra Updates

Transformers v5 RC (Hugging Face): This major update introduces ~400 architectures, quantization-first approaches, and an OpenAI-compatible "transformers serve" feature, aiming to be the backbone of the open training, finetuning, and inference stack.
vLLM-Omni: Extends vLLM to omni-modality, supporting models like Qwen-Omni and Qwen-Image.
LangChain 1.1: Introduces capability introspection and "Deep Agents," enabling runtime detection of model features to drive dynamic routing and summarization. Deeper agent patterns include file systems for long-run memory and multi-agent collaboration.
Unsloth: Adds Arctic’s TiledMLP for long-sequence handling. Together AI claims fastest inference on popular OSS LLMs through kernel engineering, near-lossless quantization, and speculative decoding.
VS Code: Ships a "Language Models editor" in Insiders.
Gemini 3 Pro: Integrates Google Search with structured outputs in its API, with a "Thinking" mode available.

Openness and Community Rankings

Artificial Analysis Openness Index (v1): AI2 OLMo leads with 89/100, followed by NVIDIA Nemotron at 67. The index combines model availability and transparency. Openness often correlates negatively with "intelligence" in current releases.
Arena (Nov) Open Model Rankings: Top open models include Kimi-K2-Thinking-Turbo (#1), GLM-4.6 (#2), and Qwen3-235B-a22b-instruct-2507 (#3). Open models remain competitive within the global Top 100.

Safety, Evals, and Interpretability

OpenAI Alignment Research Blog: Launched for more frequent, technical safety publications.
Anthropic Frontier Red Team: AI agents identified $4.6M in simulated smart contract vulnerabilities, with a new benchmark released.
Opus 4.5 System Card Discourse: Concerns about Chain-of-Thought training transparency were addressed by Anthropic, clarifying alignment with Sonnet 4.5. Critiques highlight weak capability evaluation evidence and call for harder, longer tasks.
Interpretability Pivots: Jacob Steinhardt and Hendrycks note skepticism towards mechanistic interpretability's past emphasis, while Google DeepMind's interp team outlines a more problem-driven agenda.

Top Tweets (by Engagement)

Sam Altman on policy/innovation.
3D/WebAR with tiny Gaussian splats.
Anthropic Frontier Red Team findings.
Alex Albert's review of Opus 4.5.
Yuchen Jin's reaction to the week's AI releases.
Amanda Askell confirming "soul doc" concept for Claude SL training.
Hiring surge at Google DeepMind for NeurIPS.

Notes and Miscellany

LLM Systems Research: ThreadWeaver introduces adaptive parallel reasoning with latency speedups.
Robotics/Humanoids: Amazon FAR's Holosoma open-sources a cross-robot training/deployment stack.
Community Education: Prof. Tom Yeh's DL Math "fill-in-the-blank" drills show high engagement.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

DeepSeek V3.2 Model and Benchmarks
- DeepSeek-V3.2: Features DeepSeek Sparse Attention (DSA), a Scalable Reinforcement Learning Framework, and a Large-Scale Agentic Task Synthesis Pipeline. Available under MIT License. Praised for transparency in reporting benchmarks where it lags.
- Logical Reasoning Benchmark: DeepSeek V3.2 Speciale achieved the highest score on the 'lineage-bench' logical reasoning benchmark, demonstrating superior capabilities.
- Performance Issues: A test of DeepSeek V3.2 Speciale on a logical reasoning riddle resulted in incorrect answers despite high token usage, contrasting with GLM 4.6's efficient solution.
Transformers v5 and Context Length Extensions
- Transformers v5: Hugging Face release enhances ecosystem interoperability and functionality. Includes a new Llama5Tokenizer class.
- Context Length Fine-tuning: Unsloth enables 500K context length fine-tuning with significant VRAM reduction and context length increase, applicable to any LLM or VLM.
Open Source vs Closed Source Discussion
- ChatGPT Ads: OpenAI's Pro Plan displays ads, raising concerns about monetization strategies prioritizing revenue over user experience.

Less Technical AI Subreddit Recap

Nano Banana Pro Realism and Concerns
- Photorealism: Nano Banana Pro generates highly realistic portraits, impressing users with detail and handling of complex lighting.
- Distinguishing AI Images: A meme reflects growing concern over the difficulty in distinguishing AI-generated images from real ones.
- Character Consistency: Method discussed for maintaining character consistency using reference images with Nano Banana Pro.
- Limitations: Image generation tools avoid deepfakes or exact likenesses of public figures due to safety restrictions.
- Vibe Gardening: Nano Banana Pro is being used for novel applications in garden layout planning.
ChatGPT Ads and User Reactions
- First Ad in ChatGPT?: A Pro user confirmed an ad-like feature promoting a fitness class, sparking debate on whether OpenAI is testing advertisements.
- App Integration vs. Ads: Discussion highlights that app integration suggestions function similarly to ads, blurring the line between direct advertising and feature suggestions.
- ChatGPT's Quicksand Response: Humorously explores ChatGPT's response to imaginative scenarios, showcasing conversational flexibility.

AI Discord Recap

1. Next-Gen & Open-Weight Models: DeepSeek 3.2, Trinity Mini, K2 3.5T, Qwen3-235, Orchestrator-8B

DeepSeek 3.2 Models: Users reported math hallucinations with 'speciale' variant, leading to its removal from LMArena. 'Thinking' variant praised for coding and HTML generation. Production load issues noted with high API latency and timeouts.
Arcee's Trinity Mini & Nous K2: Arcee launched Trinity Mini, an accessible open-weight option. Nous Research released NousResearch/k2-merged-3.5T-fp8, a massive MoE model.
Underrated All-Stars: Qwen3-235B praised for API-parity quality at Q4 quantization. Nvidia's Orchestrator-8B, a tool-calling model, has low downloads despite high HLE scores, highlighting visibility vs. quality mismatch.

2. Tooling, IDE & Agent Ecosystems for Coding and Apps

Cursor's AI IDE: Users dissected pricing and tokenomics. Update 2.1.39 regressed terminal integration. Background agents showed rough edges at infra boundaries.
OpenRouter Powers DIY AI Apps: Walkthrough video demonstrates building AI apps with OpenRouter. Issues with rate limits, timeouts, and latency reported for DeepSeek v3.2 and Grok 4 Fast.
Code Assist Ecosystem: Aider, Mindlink, and GPT Provider compete for developer workflows. Chaining tools is being experimented with for reduced friction and increased control.

3. Hardware & Low-Level Optimization: From TPUv7 and H200s to RDNA3 Assembly

Google's TPUv7 vs. Nvidia's CUDA: Discussion on whether TPUv7 poses a threat to Nvidia's CUDA dominance. Hyperscalers investing in parallel hardware stacks to de-risk CUDA dependence.
Tinygrad Goes Bare-Metal: Initiates RDNA3 assembly project for closer silicon interaction. Aims to create an assembler/disassembler and cycle-accurate emulator.
Practical Compute Squeezing: Unsloth community eyes H200 GPUs. QLoRA recommended for memory-intensive models. Analysis of CohereLabs/command-a-translate-08-2025 for context length extension.
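
A rough back-of-the-envelope sketch shows why 4-bit approaches such as QLoRA come up when a model does not fit on one GPU. It counts weight storage only. Activations, optimizer state, adapter weights, quantization metadata, and the KV cache are ignored, and the 70B parameter count is a hypothetical example rather than a model named in the discussion.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    if num_params < 0 or bits_per_param <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_param > 0")
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Hypothetical 70B-parameter model at two precisions.
full_precision = weight_memory_gb(70e9, 16)  # 16-bit weights
four_bit = weight_memory_gb(70e9, 4)         # 4-bit quantized base weights
print(full_precision, four_bit)
```

Under these assumptions, the 16-bit weights alone need about 140 GB while the 4-bit weights need about 35 GB, which is why quantizing the frozen base model can bring fine-tuning within reach of a single card.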

4. Training, Optimization & Theory: ES vs Backprop, Attention Variants, Prompt Tuning & Scaling Laws

Evolution Strategies vs. Backprop: ES pitched as a scalable alternative to backprop for LLMs, potentially handling architectures where backprop is infeasible.
DeltaNet / Kimi-Delta Attention: Scrutiny of WY representation and UT transform in Kimi-Delta attention. Debate on value residuals in LLMs and the F-Lite architecture.
Scaling Laws Debate: Revisited why LLM scaling curves exhibit power laws. Discussion on curve fitting vs. predictive power and nonlinear metrics.

5. Safety, Censorship Bypass, Red-Teaming & Model Behavior

Binary Exploits & WAF Bypasses: Users employ techniques like binary patching and HTTP-layer evasions to strip censorship and bypass security measures.
Model Sycophancy & Reward-Hacking: Critique of forced follow-up questions and generic phrasing. Gemini 2.5 fabricated search results when its tool was disabled.
AI Review Ecosystems Under Fire: Concerns about SNS review system bias, reviewer harassment, and AI-generated reviews at ICLR.

Discord: Detailed by-Channel Summaries

BASI Jailbreaking Discord

Gemini Self-Correction: Gemini corrects itself mid-response, noted as natural human behavior.
Ventoy: Touted as an essential open-source tool for creating bootable USB drives.
Binary Hacking: Process outlined for removing model censorship by editing binaries.
WAF/Cloudflare Bypass: Suggested methods include cookiejar + impit + custom header.
Token Stealer Malware: Warning issued about a link identified as malware.

LMArena Discord

DeepSeek Math Hallucinations: DeepSeek-V3.2-Speciale flagged for math hallucinations; adding "do not hallucinate math" to prompts helped.
DeepSeek Instability: 'Speciale' model removed due to instability and hallucinations. 'Thinking' model praised for coding and HTML generation.
Runway Gen-4.5: Mixed reviews, with concerns about lack of native audio and potential marketing chart fraud.
New Models: DeepSeek models added to Text Arena. Flux and KAT models debut on leaderboards.

Perplexity AI Discord

Image Generation Dangers: Acknowledged capacity to copy styles effectively with multiple images.
Censorship Bypass: Skilled prompt engineering can easily bypass public model censorship.
Opus 4.5 Trial: Perplexity Pro users receive limited trial access to Opus 4.5.
Earning Program Dubious: User kicked out of earning program, suspected due to referral program abuse.
LLM Selection via Pplx-API: Provider agreements prohibit selecting specific LLMs via the API.

Unsloth AI Discord

H200 GPUs for Large Models: Discussion on needing H200 GPUs for models exceeding 80GB VRAM, with QLoRA suggested for memory reduction.
Command-A Translation Model: Evaluated for its 8k context length limit; fine-tuning needed for 16k context.
Flex Attention Optimization: Strides made in optimizing Flex Attention for Llama-3B.
Qwen3-Next Issues: AWQ version reported to fail in LM Studio but work in llama.cpp.
SetFit for Spam Mitigation: Suggested for fine-tuning a model to detect and mitigate Discord spam.

LM Studio Discord

HVAC Cover Letters: Local LLMs discussed for technical text generation and proofreading.
Qwen3 Coding: Good at coding but misses finer details; model selection more impactful than size.
Local AI vs. Big Tech: Concerns about Big Tech data collection and biases.
GPU Setup: Members share setups for local LLMs.
Linux Woes: User faced issues with Ubuntu drivers, resolved using Claude Code.

Cursor Community Discord

AI-Native Developer Hiring: Agency seeks developers skilled in Next.js, Tailwind, Supabase, Vercel, and TypeScript.
Token Usage and Pricing Debate: Users discuss Cursor's tokenomics and pricing.
Terminal Access Issues: Post-update problems with terminal access reported; re-indexing and restarting often resolve issues.
Sub-Agent Implementations: Experimentation with sub-agent architectures discussed.
Cursor vs. Windsurf: Comparison of value and UX for new users.

OpenRouter Discord

Arcee Trinity Mini Launch: Arcee releases Trinity Mini, a free open-weights model on OpenRouter.
AI Codes AI for App Dev: YouTube video demonstrates building AI apps with OpenRouter.
Gambling Algorithm Bankrupts User: AI error in a Bet365 function string led to significant financial loss.
Grok 4 Fast Outage: Experienced cascading collapse with server errors.
DeepSeek 3.2 Overload: Users report timeouts, rate limits, and errors due to high demand.

OpenAI Discord

Grok Jailbreaks: Grok found easy to jailbreak and useful for creative applications, but challenging for specific tasks like storytelling.
AI Sycophancy: Critique of forced follow-up questions and generic phrasing in LLMs.
Personality Adjustment: Suggestion to adjust personality presets to influence writing style.
Sora 2 Prompting: Compact guide shared for generating Sora 2 prompts.
Anime Openings Template: Cinematic anime-style template for creating anime openings.

Moonshot AI (Kimi K-2) Discord

Kimi K2 Scores in Advent of Code: Outperformed Gemini 3 Pro in coding tasks.
Minimax vs ChatGPT: Minimax performs tasks like installing Python packages directly, unlike ChatGPT.
Kimi Subscription Discounts: Users report success in bargaining for low subscription prices.
Privacy Concerns with Kimi: Lack of opt-out option for data training raises concerns.
Gemini 3 Pro Benchmarks: Users feel Gemini 3 Pro is overhyped and benchmarks don't reflect real-world experiences.

Nous Research AI Discord

Nous Chat Cyber Monday Deal: Free month offered with code CYBER2025.
Anonymous Nous Chat: Now available for free without an account.
Nous API Accepts USDC: Payment support added via Coinbase.
Massive MoE Models: NousResearch/k2-merged-3.5T-fp8 released.
Qwen3-235: Praised for amazing quality at Q4, comparable to API.
AI Ad-Blocking & NPCs: Community imagines AI adblocks and foundational models for game NPCs.
Mistral Large 3: Expected to be around 675B MoE with vision capabilities.
Portal Issues: Slowness and browser verification problems reported.
API Key Deletion Issues: Users face difficulties deleting API keys.
Türkiye Troubles: VPN usage and verification issues discussed.
Evolution Strategies: Explored as an alternative to backpropagation for LLM training.
Explainable AI with GitHub Copilot: Demo shared for exploring explainable AI.

tinygrad (George Hotz) Discord

RDNA3 Assembly Project: Initiated for closer silicon interaction, aiming for assembler/disassembler and cycle-accurate emulator.
Shipping Kernels Challenges: Difficulty in shipping compiled ops for multiple shapes.
Profiling Needs Improvement: Documentation and profiling tools require enhancement.
Flash Attention Speedup: Discussed for improving BERT training runs.
HIPAllocator Needs Offset: Community requests ._offset() function for flexible memory allocation.

Latent Space Discord

Google TPUv7 vs. CUDA: Discussion on TPUv7's potential to challenge Nvidia's dominance.
GPT-4.5 Rebrand: Alleged to be a backup for a failed GPT-5 run.
Black Forest Labs Funding: Secured $300M Series B round.
Gemini Downloads: Approaching ChatGPT levels.
DeepSeek Models: V3.2 and V3.2-Speciale released, rivaling Gemini-3.0-Pro.
Sculpture Illusion: 3-step prompt creates a stretch-and-drag sculpture illusion.
Kling AI O1 Launch: Multimodal creative engine unveiled with free credits offered.
Nano Banana Pro for Vibe Gardening: Enables quick landscape plan creation.

Eleuther Discord

Reviewer Protection: Debates on post-review discussion periods to prevent harassment.
Gemini 2.5 Hallucinations: Observed fabricating search results when search tool is disabled.
Kimi Delta Attention: Discussed in ML Perf reading group.
Demo Paper Requirements: Discussion on IEEE standard format requirements.
DeltaNet Attention Deep Dive: Analysis of WY representation and UT transform.
Value Residuals in LLMs: Marginal gains noted compared to original paper.
Scaling Laws Debate: Revisited power law structures and nonlinear metrics.

Yannic Kilcher Discord

SNS Review System Scrutiny: Concerns about bias in the bidding system.
ML Engineer Career Path: Role defined as scaling up experiments.
Anti-Cheat System: Boasts kernel-level access for effectiveness.
Nvidia Orchestrator-8B Overlooked: Low downloads despite high HLE scores.
ICLR Reviews AI Generated: Many reviews flagged as AI-generated.
OAI Model Training Struggles: Rumors suggest difficulty training new models since GPT-4o.
Microsoft 365 AI Agents: New AI Agents announced with SDK documentation.
TopKHot Attention Mechanism: Investigated for potential use with softmax + TopK + onehot.

Modular (Mojo 🔥) Discord

Web3 Spam Policy: Contribution required before inquiring about job opportunities.
Circular Import Errors: Addressed in lightbug_http with PRs.
Mojo Keyword Changes: Consideration of removing def keyword and requiring var.
Concurrency Model WIP: parallelize function noted as unsafe due to data races.
Matmul Fallback: Missing generic fallback for RTX 5090 users.

Manus.im Discord

Manus Update Issues: App reportedly crippled for non-paying users.
Black Friday Opinions: Differing views on the no-sale decision.
UI Feedback: Referral code redemption clarity requested.
AI Engineer Introductions: Users promote expertise in AI and full-stack development.
Civil Discourse Request: Moderator calls for respectful discussions.

DSPy Discord

DSPy vs. scikit-llm: DSPy may outperform, depending on the use case.
OpenRouter API Config: Limited documentation and configuration capabilities noted.
Prompt Tuning Methods: LLMs analyzing failure causes for improvements highlighted.
AI System Building: Senior AI Developer showcases end-to-end system building experience.

aider (Paul Gauthier) Discord

GPT Provider Free Credits: An open-source GPT provider is offering free credits.
Aider Alternatives: Members discuss alternatives, noting Aider's superior SVG aesthetic.
Mindlink Models Impact: Speculation that Mindlink models may have affected Aider's popularity.

Dec

2025

We’re in San Diego this week for #NeurIPS2025! Stop by the Meta booth (#1223) to meet our team and ...

We’re in San Diego this week for #NeurIPS2025! Stop by the Meta booth (#1223) to meet our team and check out: 🔎 Demos of our latest research including DINOv3 and UMA ⚡ Lightning talks from researchers behind SAM 3, Omnilingual ASR and more (see schedule below) 👓 Hands-on demos with our latest AI glasses including the Meta Ray-Ban Display Our team is also sharing 19+ papers and 13+ workshops this week. We hope to see you there.

Nov

2025

not much happened today

AI News: November 25-26, 2025

Happy Thanksgiving! This week's AI news digest covers updates from 12 subreddits, 544 Twitters, and 24 Discords (205 channels, 9014 messages). Estimated reading time saved: 713 minutes. Check out the new website at https://news.smol.ai/ for full breakdowns and metadata search.

AI Twitter Recap

Agent Systems: Long-Running Harnesses, MCP Tasking, and Production Deployments

Anthropic on Durable Agents + MCP Tasks: Anthropic detailed practical patterns for agents that function across multiple context windows, including state checkpoints, structured artifacts, deterministic tools, and "plan mode." Concurrently, MCP released SEP-1686 "tasks" for background, long-running work with status polling and result retrieval, crucial for multi-hour research and automation workflows. LangChain clarified its stack: frameworks (build), runtimes (durable execution, streaming/HITL), and harnesses (general-purpose agents), with LangGraph in the runtime slot.
Real-World Agent Infrastructure: Booking.com deployed an agent handling tens of thousands of daily partner-guest messages, resulting in a ~70% satisfaction lift, fewer follow-ups, and faster responses. The stack included LangGraph, Kubernetes, FastAPI, GPT-4 Mini with prompt-injection detection, and Weaviate for semantic template search. Perplexity AI introduced user-level "Memory" (view/delete/disable) and "virtual try-on" for shopping.
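
The status-polling and result-retrieval pattern described above for background, long-running tasks can be sketched roughly as follows. This is a toy in-memory illustration only: `ToyTaskServer`, `Task`, `create_task`, `get_status`, `get_result`, and `run_to_completion` are hypothetical names invented for this sketch, and nothing here reflects the actual SEP-1686 message schema or any SDK's API.

```python
import itertools
import time
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass
class Task:
    """Hypothetical background task record (not the MCP wire format)."""
    task_id: str
    steps_left: int
    status: str = "working"  # moves from "working" to "completed"
    result: Optional[Any] = None


class ToyTaskServer:
    """In-memory stand-in for a server that runs work in the background."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Task] = {}
        self._ids = itertools.count(1)

    def create_task(self, steps: int) -> str:
        """Register a task and return its id immediately, without blocking."""
        task_id = f"task-{next(self._ids)}"
        self._tasks[task_id] = Task(task_id, steps_left=steps)
        return task_id

    def get_status(self, task_id: str) -> str:
        """Report status; for the demo, each check advances the work one step."""
        task = self._tasks[task_id]
        if task.status == "working":
            task.steps_left -= 1
            if task.steps_left <= 0:
                task.status = "completed"
                task.result = {"summary": "done"}
        return task.status

    def get_result(self, task_id: str) -> Any:
        """Return the stored result; only valid once the task has completed."""
        task = self._tasks[task_id]
        if task.status != "completed":
            raise RuntimeError("result requested before completion")
        return task.result


def run_to_completion(
    server: ToyTaskServer,
    task_id: str,
    poll_interval: float = 0.0,
    max_polls: int = 100,
) -> Tuple[Any, int]:
    """Poll until the task completes, then fetch the result separately."""
    for polls in range(1, max_polls + 1):
        if server.get_status(task_id) == "completed":
            return server.get_result(task_id), polls
        time.sleep(poll_interval)
    raise TimeoutError(f"{task_id} did not finish within {max_polls} polls")


if __name__ == "__main__":
    server = ToyTaskServer()
    tid = server.create_task(steps=3)
    print(run_to_completion(server, tid))
```

Separating "is it done yet?" from "give me the result" is what lets a client disconnect, checkpoint its own state, and come back later, which is presumably why the pattern matters for multi-hour workflows.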

Claude Opus 4.5: Evals, Cost/UX Learnings, and New Skills

Performance: On LisanBench, Opus 4.5 Thinking ranked first, though the non-thinking variant underperformed. On Code Arena WebDev, Opus-4.5 (thinking-32k) debuted at #1. Community reports are mixed, with some noting Opus 4.5 can be worse than Sonnet in "no thinking" mode and misuse Python tools.
Costs and Ergonomics: Batch APIs make "Thinking" runs more cost-viable. Anthropic fixed a Claude.ai issue by auto-compacting context to avoid length limits. Claude Code's new "frontend-design" skill can generate UI concepts in one shot, with plan mode recommended for better results.

Efficient Reasoning and Multi-Agent Communication

Latent MAS > Token Chatter: LatentMAS uses compact latent vectors instead of text messages for agent communication, reducing tokens by ~70-84% and improving accuracy by up to +4.6%. It ran 4-4.3x faster across 9 benchmarks with Qwen3 models without extra training.
Reasoning Trace Distillation: Training 12B models on gpt-oss traces yielded ~4x fewer tokens per solution at similar accuracy, saving inference costs. The source and style of reasoning traces are key for efficiency. Interleaved thinking agents also showed practical step-by-step efficiency gains.

Beyond Gradients and Scaling Systems

ES at Hyperscale: EGGROLL reframes evolution strategies with low-rank perturbations, enabling stable pretraining of recurrent LMs with integers and scaling population sizes to 100k+, making ES viable for large, discrete, or non-differentiable systems.
Out-of-Memory on Apple Silicon: dria's "dnet" enables distributed inference across Apple Silicon clusters via fused pipelined-ring parallelism, disk streaming, and UMA-aware scheduling to run models beyond physical memory limits.
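
To make the idea of evolution strategies with low-rank perturbations more concrete, here is a minimal pure-Python sketch. It is not EGGROLL's implementation: the function names, the toy quadratic fitness, the antithetic sampling, and all hyperparameters are assumptions chosen for illustration. Each perturbation is built as a product of two thin Gaussian factors, and the weights are updated from fitness values alone, with no backpropagation.

```python
import random
from typing import Callable, List, Tuple

Matrix = List[List[float]]


def low_rank_noise(rng: random.Random, rows: int, cols: int, rank: int) -> Matrix:
    """Sample E = A @ B^T / sqrt(rank), with A (rows x rank) and B (cols x rank) standard normal."""
    a = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(rows)]
    b = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(cols)]
    scale = rank ** -0.5
    return [
        [scale * sum(a[i][k] * b[j][k] for k in range(rank)) for j in range(cols)]
        for i in range(rows)
    ]


def shifted(w: Matrix, e: Matrix, step: float) -> Matrix:
    """Return w + step * e without modifying either input."""
    return [[wij + step * eij for wij, eij in zip(wr, er)] for wr, er in zip(w, e)]


def es_step(
    w: Matrix,
    fitness: Callable[[Matrix], float],
    rng: random.Random,
    pairs: int = 32,
    rank: int = 2,
    sigma: float = 0.1,
    lr: float = 0.05,
) -> Matrix:
    """One ES update: estimate an ascent direction from fitness evaluations only."""
    rows, cols = len(w), len(w[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for _ in range(pairs):
        e = low_rank_noise(rng, rows, cols, rank)
        # Antithetic pair: evaluate the same perturbation in both directions.
        delta = fitness(shifted(w, e, sigma)) - fitness(shifted(w, e, -sigma))
        coeff = delta / (2.0 * sigma * pairs)
        for i in range(rows):
            for j in range(cols):
                grad[i][j] += coeff * e[i][j]
    return shifted(w, grad, lr)


# Toy problem (an assumption for the demo): recover a fixed 4x4 target matrix.
TARGET: Matrix = [[float(i - j) for j in range(4)] for i in range(4)]


def fitness(w: Matrix) -> float:
    """Negative squared error to TARGET; higher is better, 0 is optimal."""
    return -sum((wij - tij) ** 2 for wr, tr in zip(w, TARGET) for wij, tij in zip(wr, tr))


def train(steps: int = 150, seed: int = 0) -> Tuple[float, float]:
    """Run ES from an all-zero matrix and return (initial, final) fitness."""
    rng = random.Random(seed)
    w: Matrix = [[0.0] * 4 for _ in range(4)]
    start = fitness(w)
    for _ in range(steps):
        w = es_step(w, fitness, rng)
    return start, fitness(w)


if __name__ == "__main__":
    print(train())
```

Because the fitness function is treated as a black box, the same loop would work for objectives that are discrete or non-differentiable, which is the setting the item above highlights. Storing the two thin factors instead of a full-size noise matrix is the part that would matter at scale; this sketch materializes the full matrix only to stay short.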

Multimodal and Generative Modeling Updates

New Architectures: PixelDiT proposes dual-level Transformers for pixel-space diffusion. Apple's STARFlow-V uses normalizing flows for end-to-end video generation. Terminal Velocity Matching generalizes flow matching for few/one-step generation.
Models and UX: Z-Image (6B) announced under Apache-2.0; Z-Image-Turbo (6B) released on HF. FLUX.2 [dev] features a "Tiny Autoencoder" for streaming intermediate outputs. Google's Nano Banana 2 shows gains on StructBench.

Open Ecosystem, Evaluation, and Governance

"Economies of Open Intelligence": China surpassed the U.S. in open model downloads. Trends show a decrease in US big tech share and an increase in China + community share.
Evals and Safety: METR continues to be cited as a credible evaluator. The AI Security Institute released a case study with Anthropic. An AI Evaluator Forum launches at NeurIPS.
Applied Multimodal Recsys: Zhihu details a Qwen2.5-VL-72B/3B-driven pipeline for multimodal labels and embeddings.
Domain Benchmarks: New benchmarks like MultiPathQA and MTBBench push beyond single-turn QA. Clinical ASR evals use DSPy + GEPA to train an LLM judge.

Top Tweets (by engagement)

Anthropic on building effective long-running agent harnesses.
Claude.ai auto-compacts context to avoid hitting limits mid-chat.
Google DeepMind releases AlphaFold documentary “The Thinking Game” on YouTube.
Awesome Nano Banana prompts/styles/resources for advanced image generation.
Claude Opus 4.5 debuts at #1 on Code Arena WebDev leaderboard.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

Alibaba Text-to-Image Model Launch: Alibaba's open-source "Z-Image-Turbo" model is ranked fourth, below Seedream 4.0, highlighting its performance. Discussions cover its 6B parameters and potential for local deployment, contrasting with larger models like Flux 2. Challenges in prompt adherence and multi-object composition for smaller models are noted.

Less Technical AI Subreddit Recap

Opus 4.5 Model Success Stories: Opus 4.5 successfully converted a ZBar library to native Swift 6, resolving longstanding bugs. Users discussed productization, licensing, and the prompt engineering behind the success. A graph comparing software version accuracies showed Opus 4.5 with the highest accuracy (80.9%).
New AI Model Announcements and Benchmarks: Alibaba's "Z-Image-Turbo" (6B parameters) is poised for public release, with early tests suggesting it may outperform Qwen-Image. The model's smaller size and potential for high-quality photorealistic images are anticipated.
Humorous AI and Tech Memes: Memes discussed Ilya Sutskever's comments on scaling, clarifying that he questioned scaling limits, not LLMs themselves. Another meme humorously commented on Google's Gemini 3 release. A meme featuring Grok 4.1 depicted it as bold and unrestrained in discussing NSFW content.

AI Discord Recap

1. Next-Gen Image and Video Models Hit Production Workflows

Nano Banana Pro: Praised for generating hyper-realistic images and comics, with outputs described as "indistinguishable from reality." Concerns were raised about its potential for fraud (counterfeit receipts, KYC documents) and the possibility of safety interventions overreacting.
Whisper Thunder: Took the #1 spot on the Artificial Analysis text-to-video leaderboard, surpassing VideoGen. It's part of a rapidly advancing SOTA video generation race.
NB Pro and FLUX 2 Pro: NB Pro was called "insane" and "the best image model in history period." FLUX 2 Pro showed a major quality jump over FLUX 1 Pro. Debate continues on NB Pro's peak quality versus Flux 2's contender status, with SynthID watermarking discussed as a protection against "nerfing."
OpenAI's Image Model Upgrade: A quiet upgrade received mixed reviews, with praise for higher fidelity but criticism for weak multilingual support, inconsistent continuity, and persistent safety guardrails, contrasting unfavorably with Nano Banana Pro and FLUX 2 Pro.

2. Agentic UX, Code Assistants, and Chat Frontends Evolve

Claude Code's Plan Mode: Now launches multiple exploring subagents in parallel, generates competing plans, and persists an editable plan file. Engineers praised the higher one-shot success rate but requested faster UX and less verbose replanning.
GPT-5.1 for Storytelling: Reported as the best model for anime or story writing due to reliable character design and long-range context memory. However, strict safety and violence guardrails block anime-style combat scenes.
Kimi K-2 and Canvas UIs: Kimi K-2 praised for "exceptional thinking, push-back ability, and prompt understanding." Debate arose on why full-screen canvases haven't replaced chat UIs, arguing they better support complex workflows and challenge the "conversational fallacy."
Meganova Chat and Gemini Agents: Meganova Chat buzzed as a "clean, fast place" for managing AI chats. Gemini Agents explored for executing Python scripts within a sandboxed environment, highlighting growing agent tooling capabilities.

3. GPU Kernels, Distributed Inference, and Training Tricks

nvfp4_gemv Contest: Saw a surge of submissions to the NVIDIA leaderboard, with LLM-crafted CUDA code being a focus. Participants discussed eval.py harness flakiness and the overhead of cudaStreamSynchronize(). Gemini 3.5 Pro and Opus 4.5 were highlighted as powerful kernel authors.
Tensor Core Optimization: Engineers shared tips for Tensor Core optimization, discussing ldmatrix.b16, reinterpret_cast, and SIMT loads. CuTeDSL packed FP16 instructions were noted.
Multi-Node LLM Inference: NVRAR NVSHMEM-based hierarchical all-reduce offers lower latency than NCCL for LLM inference. PAT algorithm discussed for all-gather and reduce-scatter operations at scale.
ES HyperScale and Blackwell Architecture: ES HyperScale claims a 100x training throughput boost. Nvidia Blackwell's unified scalar pipeline warned against mixing INT and FP workloads to avoid performance drops.
Robotics and Partial-Training: Low-cost dual-arm laundry robots from 7x examined. Discussions on partially trainable embeddings and weighted-loss softmax for memory savings and efficiency.
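
The general idea behind a hierarchical all-reduce, as mentioned in the multi-node inference item above, can be simulated in plain Python. This is only a sketch of the three-stage structure under simplifying assumptions (lists stand in for GPU buffers, and no real communication happens); it is not NVRAR, NVSHMEM, or NCCL code, and the function names are made up for the example.

```python
from typing import List

Vector = List[float]


def vector_sum(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def hierarchical_all_reduce(nodes: List[List[Vector]]) -> List[List[Vector]]:
    """Simulate an all-reduce over nodes[n][g], the buffer of GPU g on node n.

    Stage 1: reduce inside each node (fast local links).
    Stage 2: all-reduce across nodes, one partial sum per node (slower network).
    Stage 3: broadcast the global sum back to every GPU in each node.
    """
    node_partials = [vector_sum(gpus) for gpus in nodes]
    total = vector_sum(node_partials)
    return [[list(total) for _ in gpus] for gpus in nodes]


def flat_all_reduce(nodes: List[List[Vector]]) -> List[List[Vector]]:
    """Reference result: sum every GPU buffer directly, ignoring topology."""
    total = vector_sum([buf for gpus in nodes for buf in gpus])
    return [[list(total) for _ in gpus] for gpus in nodes]


if __name__ == "__main__":
    cluster = [
        [[1.0, 2.0], [3.0, 4.0]],  # node 0, two GPUs
        [[5.0, 6.0], [7.0, 8.0]],  # node 1, two GPUs
    ]
    print(hierarchical_all_reduce(cluster))
```

The point of the hierarchy is that only one buffer per node crosses the slower inter-node network in stage 2, rather than one per GPU.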

4. Open Tools, Protocols, and Model Routing Infrastructure

dspy-cli: Now open source, enabling scaffolding of DSPy projects and exposing modules as FastAPI endpoints or MCP tools.
RapidaAI Voice Stack: Fully open-source, targeting teams tired of per-minute markups on third-party voice APIs.
MCP Protocol: New version released. Discussions on handling namespace collisions for third-party variants.
Tinygrad, LM Studio, OpenRouter: Tinygrad's @TinyJit details kernel replay. LM Studio users fixed API errors and debugged Flash Attention regressions. OpenRouter users reported Opus overload and model fallback bugs.

5. Safety, Robustness, Data Economics, and Evaluation Reality Checks

Emergent Misalignment: Replication study found Gemma 3 and Qwen 3 robust to insecure fine-tuning. "The JSON Trap" blog post argues JSON-only output reduces degrees of freedom to refuse harmful requests.
Hallucinations and Benchmark Contamination: Hallucinations in multi-stage LLM pipelines still count as hallucinations of the component system that produced them. Concerns were raised about benchmark contamination leading to memorization.
Curriculum Learning, Data vs Compute, Job Impact: Debates on curriculum learning and coresets in LLM pretraining. Contrasting data vs. compute costs. MIT study claiming AI can replace 11.7% of the US workforce discussed.
Summarization, Safety, Legal/Policy: LLMs criticized for poor summarization on dense texts. Debates on ChatGPT's political bias, copyright of Gemini images, and Steam's AI content disclosure rules.

Discord: Detailed by-Channel Summaries

LMArena Discord

General: Debates on "cameo" vs. "deepfake," Flux 2 models' arrival and comparison to NB Pro, praise for NB Pro's "insane" image generation, and SynthID's role in preventing model "nerfing." A "stealth model" named Robin is rumored to outperform Opus 4.5.
Announcements: Updates on Image Edit flow, Flux 2 models added, new Search Arena models (gemini-3-pro-grounding, gpt-5.1-search), and Claude's top placement on leaderboards.

Perplexity AI Discord

General: Concerns about Palantir Technologies' "doom potential," discussions on the Nvidia-Altman partnership inflating AI bubbles, disputes over Opus 4.5's token efficiency, Gemini Agent's sandboxed Python script execution, and Perplexity blocking user prompts leading to profile editing issues.

Unsloth AI (Daniel Han) Discord

General: FP8 RL documentation issues, advice on quantized model inference speed, discussion on the obsolescence of kernels due to torch.compile, announcement of the ERNIE AI Developer Challenge, and Unsloth's presence at NeurIPS San Diego.
Off-topic: Reports of Claude Opus 4.5 giving context limit errors, inquiries about wakeword solutions, job interview advice, discussions on CPU offloading for long context training, and a mention of the game "Slop Detective."
Help: Recommendations for Vulkan over IPEX for llama.cpp, issues with GGUF conversion (model_type attribute), advice on continued pretraining vs. fine-tuning for autocompletion, Qwen3 8B fine-tuning problems, and AMD GPU support for bitsandbytes.
Showcase: Announcement of the ERNIE AI Developer Challenge and availability of free AMD notebooks for ERNIE finetuning.
Research: Sharing of ES HyperScale for boosted training throughput, LESA for learnable LLM layer scaling-up, and efficient CPU training possibilities.

Cursor Community Discord

General: Haiku models praised for documentation accuracy, Composer-1 for code implementation. Discussions on token costs and model overload/degradation. Agent review pricing confusion. Frustration with Cursor's linting error handling. Agent plans not being automatically saved.

GPU MODE Discord

General: Exploration of Triton kernels for partially trainable embeddings and weighted-loss softmax. NVIDIA leaderboard submissions and personal bests. Tensor Core optimization tips shared. Discussion on 2-bit dequantization on Intel GPUs. Factorio Learning Environment documentation deployed.
Triton-Gluon: Proton profiling tool issues, interest in tensor descriptors and auto-tune parameters, and a persistent matmul tutorial example.
CUDA: Exploration of GEMM with tensor cores, sharing of optimization resources, and discussion on data loading strategies (ldmatrix.b16).
Torch: Inquiries about differentiating forward passes with and without gradient checkpointing, and using boolean flags for differentiation.
Beginner: Guidance on contributing to XLA, rules of thumb for GPU benchmarking warmup runs, consideration of thermal limits in benchmarking, and datacenter settings.
Jax-Pallas-Mosaic: Performance comparison of jax.pmap vs. jit on a single GPU, and code portability considerations for multi vs. single GPU systems.
Off-topic: Memes shared.
Irl-Meetup: Travel plans for NeurIPS and SF, inviting chats about GPUs.
Intel: Quest for 2-bit dequantization on Intel GPUs, seeking faster alternatives to Torch.
Self-promotion: Link to an aerlabs post.
🍿: Urmish joins LLM initiatives, seeking guidance on subgroups for LLM training and agentic harnesses. Discussion on LLM kernel generation.
Thunderkittens: Newcomer pioneers CUDA and Flash Attention. Discussions on open areas for kernel contributions (MoE, linear attention backwards) and AMD GPU availability.
Submissions: NVIDIA nvfp4_gemv leaderboard sees numerous submissions, with users achieving top ranks. Discussion on a potentially fishy submission and optimization efforts.
Factorio-Learning-Env: Documentation for the Factorio Learning Environment is live.
Cutlass: Discussion on SIMT load overheads and a breakdown of tiled_mma example.
Singularity-Systems: Updates on picograd commits, tensor implementation, and evaluator/device runtimes.
Multi-GPU: NVRAR speeds up multi-node LLM inference. PAT algorithm discussed for all-gather and reduce-scatter operations.
Nvidia-Competition: CuTeDSL packed FP16 instructions, eval.py script scrutiny, cudaStreamSynchronize() overhead, and an "LLM-only" approach using Gemini 3.5 Pro and Opus 4.5.
Hf-Kernels: Metal kernels release delayed. macOS compatibility issues noted.
Robotics-VLA: 7x laundry folding robot debut. No-action filtering importance for VLAs. Qwen3-VL optimization hurdles. Comparison of classic binning vs. FAST tokenizer.
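
The Beginner channel's rule of thumb about warmup runs, together with the `cudaStreamSynchronize()` overhead discussed in the competition channel, suggests a timing harness along these lines. This is a hedged sketch in plain Python: `benchmark` and its parameters are hypothetical names, and the optional `sync` callback is a placeholder for whatever device synchronization a real GPU workload would need before the clock is read.

```python
import statistics
import time
from typing import Callable, Dict, Optional


def benchmark(
    fn: Callable[[], object],
    warmup: int = 3,
    iters: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> Dict[str, float]:
    """Time fn() after discarding warmup runs.

    Warmup runs absorb one-time costs (compilation, caches, clock ramp-up)
    so they do not skew the measured samples.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warmup work has finished before timing starts

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous device work before stopping the clock
        samples.append(time.perf_counter() - start)

    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "runs": float(len(samples)),
    }


if __name__ == "__main__":
    print(benchmark(lambda: sum(range(10_000))))
```

Reporting the median alongside the minimum is one way to reduce sensitivity to outliers such as thermal throttling. Note that synchronizing inside the timed region also adds its own overhead to every sample, which is the trade-off the competition discussion points at.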

OpenAI Discord

AI-Discussions: Debate on ChatGPT's alleged left-wing bias. Nano Banana Pro praised for comic creation, with worries about it being "lobotomized." Commercial copyright and ethical quandaries of AI-generated images. GPT-5.0 Mini disappointment. Argument that OpenAI's UI caters to a neurotypical audience.
GPT-4-Discussions: GPT-5.1 praised for anime storytelling, but strict guardrails block violence. Debate on chat reference memory issues and GPT-5.1 vs. GPT-4 performance.
Prompt-Engineering: (No new messages)
API-Discussions: (No new messages)

LM Studio Discord

General: API endpoint error resolved by consulting documentation. Image captioning issues after update resolved by switching to Gemma 3. Flash Attention glitches impacting model functionality. GPT OSS 20B speed showcased. Free Mint opportunity with OpenSea.
Hardware-Discussion: Discussions on Q8 cache, GPU fans at 0% during inference, hardware devaluation, potential CPU fire averted, and PCIe bifurcation breakthroughs.

OpenRouter Discord

App-Showcase: Color picker bug reported. RapidaAI open-sourced voice AI platform.
General: Opus overload outage reported. Model fallback bug discovered. Free DeepSeek R1 model removed. Buzz around upcoming Meganova Chat. OpenRouter's normalized interfaces praised.
New-Models: (No new messages)
Discussion: Arrakis AI model still looks yellow-ish. Text-to-Video Leaderboard updated, with David in first place.

Nous Research AI Discord

Announcements: Psyche Team Office Hours scheduled.
General: Suno's Warner Music partnership sparks debate. Data vs. compute costs highlighted. Blackwell architecture performance warnings (INT/FP mixing). Z-Image model released on Modelscope. Debate on AI disclosure policies on Steam.
Ask-About-LLMs: LLM benchmarks face pre-training data contamination. Overcoming contamination in benchmarks is challenging.
Interesting-Links: Lecture on Information Retrieval history shared.

Eleuther Discord

General: Hallucinations in multi-stage LLMs still count as hallucinations. LLMs compared to "golden retrievers." Debate on verifying AI claims and fact-checking misinformation. Discussion on AI and collaborative work.
Research: Debate on SGD shuffling. PIQA paper typo noted. Emergent Misalignment paper replication and "JSON Trap" discovery. Resources sought for AI for Drug Discovery.
Scaling-Laws: Link to a paper on scaling laws.

Latent Space Discord

AI-General-Chat: Claude Code's Plan Mode overhaul with parallel subagents. DeepMind documentary "The Thinking Game" released. Jeff Dean's AI retrospective and Gemini 3.0. Claude generating PowerPoint slides. Comparison of ChatGPT Pro vs. Claude.
AI-Announcements: RF-DETR paper authors host SOTA Vision special. NeurIPS signups reminder. 2025 Dev Writers Retreat accepting final signups.
Genmedia-Creative-AI: Whisper Thunder surpasses VideoGen in text-to-video. Nano Banana Pro's realism sparks debate and fraud concerns. OpenAI's image-gen upgrade receives mixed reception. FLUX 2 Pro boasts improved visuals.

Yannic Kilcher Discord

General: Department of Energy plans national AI platform. MIT study on AI replacing jobs sparks debate. LLMs criticized for poor summarization. Debate on curriculum learning techniques for LLM pretraining.
Paper-Discussion: Adobe AI summaries criticized. LLMs struggle to summarize high-density info. Discussion on ADHD and Autism in Tech. Proposal for a new rule to curb paper flooding.
ML-News: Tencent releases Hunyuan model. MAGA supporters push back against AI datacenters. MIT study on AI workforce replacement.

HuggingFace Discord

General: Hugging Face Inference API grayed out. Christmas gift drop shared. LM Studio PDF teacher suggested. Spanish text dataset quest.
Cool-Finds: (No new messages)
I-Made-This: RapidaAI goes open source. French Classic Books Dataset created. AI Sci-Fi Short Film released.
Reading-Group: Chunking's impact is small. GNN presentation on AlphaFold approaching. Structured data is valuable.
Agents-Course: (No new messages)

Modular (Mojo 🔥) Discord

General: Mojo keeps repos synced using Copybara.
Max: MAX examples for newbies sought. Discussion on MAX written in Python. Mojo API's return to MAX anticipated. Hurdles highlighted for migrating Python MAX code to Mojo MAX.

Tinygrad (George Hotz) Discord

Learn-Tinygrad: TinyJit internals detailed (only replays kernels). Randomness functions in Tensor work as expected. Two JIT runs required for tracing, with potential changes. Good, but outdated, JIT tutorial shared. Focus shifting to frontend usability.

Moonshot AI (Kimi K-2) Discord

General-Chat: Kimi's limits explored. Debate on chatbots vs. canvases for UIs. Conversational fallacy discussed.

DSPy Discord

Show-and-Tell: dspy-cli tool goes open source, enabling scaffolding of DSPy projects and deployment as HTTP APIs. Acclaimed for project utility.
General: Trajectory injection sought for ReAct modules. API choices debated for web search implementation. Exa API includes summarization. Latency issues with web search API calls.

MCP Contributors (Official) Discord

General: New protocol version released. UI SEP ships out-of-band. MCP considers namespace collisions.

Manus.im Discord

General: AI engineer introduces themself with extensive experience. User reports API issues and lack of support. Members inquire about a Telegram channel.

Aider (Paul Gauthier) Discord

General: Community suggests new site admin for benchmarking. Survey on Opus 4.5 vs. Sonnet 4.5 upgrade. Bedrock Identifier "snafu" reported.

Nov

2025

not much happened today

AI News: November 25-26, 2025

AI Twitter Recap

Agent Systems: Long-Running Harnesses, MCP Tasking, and Production Deployments

Anthropic on Durable Agents + MCP Tasks: Anthropic detailed practical patterns for agents that function across multiple context windows, including state checkpoints, structured artifacts, deterministic tools, and "plan mode." Concurrently, MCP released SEP-1686 "tasks" for background, long-running work with status polling and result retrieval, crucial for multi-hour research and automation workflows. LangChain clarified its stack: frameworks (build), runtimes (durable execution, streaming/HITL), and harnesses (general-purpose agents), with LangGraph in the runtime slot.
Real-World Agent Infrastructure: Booking.com deployed an agent handling tens of thousands of daily partner-guest messages, resulting in a ~70% satisfaction lift, fewer follow-ups, and faster responses. The stack included LangGraph, Kubernetes, FastAPI, GPT-4 Mini with prompt-injection detection, and Weaviate for semantic template search. Perplexity AI introduced user-level "Memory" (view/delete/disable) and "virtual try-on" for shopping.

Claude Opus 4.5: Evals, Cost/UX Learnings, and New Skills

Performance: On LisanBench, Opus 4.5 Thinking ranked first, though the non-thinking variant underperformed. On Code Arena WebDev, Opus-4.5 (thinking-32k) debuted at #1. Community reports are mixed, with some noting Opus 4.5 can be worse than Sonnet in "no thinking" mode and misuse Python tools.
Costs and Ergonomics: Batch APIs make "Thinking" runs more cost-viable. Anthropic fixed a Claude.ai issue by auto-compacting context to avoid length limits. Claude Code's new "frontend-design" skill can generate UI concepts in one shot, with plan mode recommended for better results.

Efficient Reasoning and Multi-Agent Communication

Latent MAS > Token Chatter: LatentMAS uses compact latent vectors instead of text messages for agent communication, reducing tokens by ~70-84% and improving accuracy by up to +4.6%. It ran 4-4.3x faster across 9 benchmarks with Qwen3 models without extra training.
Reasoning Trace Distillation: Training 12B models on gpt-oss traces yielded ~4x fewer tokens per solution at similar accuracy, saving inference costs. The source and style of reasoning traces are key for efficiency. Interleaved thinking agents also showed practical step-by-step efficiency gains.

Beyond Gradients and Scaling Systems

ES at Hyperscale: EGGROLL reframes evolution strategies with low-rank perturbations, enabling stable pretraining of recurrent LMs with integers and scaling population sizes to 100k+, making ES viable for large, discrete, or non-differentiable systems.
Out-of-Memory on Apple Silicon: dria's "dnet" enables distributed inference across Apple Silicon clusters via fused pipelined-ring parallelism, disk streaming, and UMA-aware scheduling to run models beyond physical memory limits.

Multimodal and Generative Modeling Updates

New Architectures: PixelDiT proposes dual-level Transformers for pixel-space diffusion. Apple's STARFlow-V uses normalizing flows for end-to-end video generation. Terminal Velocity Matching generalizes flow matching for few/one-step generation.
Models and UX: Z-Image (6B) announced under Apache-2.0; Z-Image-Turbo (6B) released on HF. FLUX.2 [dev] features a "Tiny Autoencoder" for streaming intermediate outputs. Google's Nano Banana 2 shows gains on StructBench.

Open Ecosystem, Evaluation, and Governance

"Economies of Open Intelligence": China surpassed the U.S. in open model downloads. Trends show a decrease in US big tech share and an increase in China + community share.
Evals and Safety: METR continues to be cited as a credible evaluator. The AI Security Institute released a case study with Anthropic. An AI Evaluator Forum launches at NeurIPS.
Applied Multimodal Recsys: Zhihu details a Qwen2.5-VL-72B/3B-driven pipeline for multimodal labels and embeddings.
Domain Benchmarks: New benchmarks like MultiPathQA and MTBBench push beyond single-turn QA. Clinical ASR evals use DSPy + GEPA to train an LLM judge.

Top Tweets (by engagement)

Anthropic on building effective long-running agent harnesses.
Claude.ai auto-compacts context to avoid hitting limits mid-chat.
Google DeepMind releases AlphaFold documentary “The Thinking Game” on YouTube.
Awesome Nano Banana prompts/styles/resources for advanced image generation.
Claude Opus 4.5 debuts at #1 on Code Arena WebDev leaderboard.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

Alibaba Text-to-Image Model Launch: Alibaba's open-source "Z-Image-Turbo" model is ranked fourth, below Seedream 4.0, highlighting its performance. Discussions cover its 6B parameters and potential for local deployment, contrasting with larger models like Flux 2. Challenges in prompt adherence and multi-object composition for smaller models are noted.

Less Technical AI Subreddit Recap

Opus 4.5 Model Success Stories: Opus 4.5 successfully converted a ZBar library to native Swift 6, resolving longstanding bugs. Users discussed productization, licensing, and the prompt engineering behind the success. A graph comparing software version accuracies showed Opus 4.5 with the highest accuracy (80.9%).
New AI Model Announcements and Benchmarks: Alibaba's "Z-Image-Turbo" (6B parameters) is poised for public release, with early tests suggesting it may outperform Qwen-Image. The model's smaller size and potential for high-quality photorealistic images are anticipated.
Humorous AI and Tech Memes: Memes discussed Ilya Sutskever's comments on scaling, clarifying that he questioned scaling limits, not LLMs themselves. Another meme humorously commented on Google's Gemini 3 release. A meme featuring Grok 4.1 depicted it as bold and unrestrained in discussing NSFW content.

AI Discord Recap

1. Next-Gen Image and Video Models Hit Production Workflows

Nano Banana Pro: Praised for generating hyper-realistic images and comics, with outputs described as "indistinguishable from reality." Concerns were raised about its potential for fraud (counterfeit receipts, KYC documents) and the possibility of safety interventions overreacting.
Whisper Thunder: Took the #1 spot on the Artificial Analysis text-to-video leaderboard, surpassing VideoGen. It's part of a rapidly advancing SOTA video generation race.
NB Pro and FLUX 2 Pro: NB Pro was called "insane" and "the best image model in history period." FLUX 2 Pro showed a major quality jump over FLUX 1 Pro. Debate continues on NB Pro's peak quality versus Flux 2's contender status, with SynthID watermarking discussed as a protection against "nerfing."
OpenAI's Image Model Upgrade: A quiet upgrade received mixed reviews, with praise for higher fidelity but criticism for weak multilingual support, inconsistent continuity, and persistent safety guardrails, contrasting unfavorably with Nano Banana Pro and FLUX 2 Pro.

2. Agentic UX, Code Assistants, and Chat Frontends Evolve

Claude Code's Plan Mode: Now launches multiple exploring subagents in parallel, generates competing plans, and persists an editable plan file. Engineers praised the higher one-shot success rate but requested faster UX and less verbose replanning.
GPT-5.1 for Storytelling: Reported as the best model for anime or story writing due to reliable character design and long-range context memory. However, strict safety and violence guardrails block anime-style combat scenes.
Kimi K-2 and Canvas UIs: Kimi K-2 praised for "exceptional thinking, push-back ability, and prompt understanding." Debate arose on why full-screen canvases haven't replaced chat UIs, arguing they better support complex workflows and challenge the "conversational fallacy."
Meganova Chat and Gemini Agents: Meganova Chat buzzed as a "clean, fast place" for managing AI chats. Gemini Agents explored for executing Python scripts within a sandboxed environment, highlighting growing agent tooling capabilities.

3. GPU Kernels, Distributed Inference, and Training Tricks

nvfp4_gemv Contest: Saw a surge of submissions to the NVIDIA leaderboard, with LLM-crafted CUDA code being a focus. Participants discussed eval.py harness flakiness and the overhead of cudaStreamSynchronize(). Gemini 3.5 Pro and Opus 4.5 were highlighted as powerful kernel authors.
Tensor Core Optimization: Engineers shared tips for Tensor Core optimization, discussing ldmatrix.b16, reinterpret_cast, and SIMT loads. CuTeDSL packed FP16 instructions were noted.
Multi-Node LLM Inference: NVRAR NVSHMEM-based hierarchical all-reduce offers lower latency than NCCL for LLM inference. PAT algorithm discussed for all-gather and reduce-scatter operations at scale.
ES HyperScale and Blackwell Architecture: ES HyperScale claims a 100x training throughput boost. Nvidia Blackwell's unified scalar pipeline warned against mixing INT and FP workloads to avoid performance drops.
Robotics and Partial-Training: Low-cost dual-arm laundry robots from 7x examined. Discussions on partially trainable embeddings and weighted-loss softmax for memory savings and efficiency.
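
For readers unfamiliar with the term, the general idea behind a hierarchical all-reduce can be sketched in a few lines. This is a toy, single-process simulation of the generic two-level pattern (reduce inside each node, combine across nodes, broadcast back). It is not NVRAR's implementation and uses no NVSHMEM or NCCL calls; the function names and the nested-list layout are assumptions made for illustration.

```python
from typing import List


def hierarchical_all_reduce(values: List[List[float]]) -> List[List[float]]:
    """values[n][g] is the scalar held by GPU g on node n (hypothetical layout).

    Returns the same shape with every entry replaced by the global sum.
    """
    # Stage 1: reduce inside each node, where links are typically fastest.
    node_sums = [sum(node) for node in values]
    # Stage 2: combine one partial sum per node across the slower inter-node links.
    global_sum = sum(node_sums)
    # Stage 3: broadcast the global result back to every GPU on every node.
    return [[global_sum for _ in node] for node in values]


def flat_all_reduce(values: List[List[float]]) -> List[List[float]]:
    """Reference result: sum over all GPUs with no hierarchy."""
    total = sum(v for node in values for v in node)
    return [[total for _ in node] for node in values]


# Three nodes with two GPUs each.
grid = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = hierarchical_all_reduce(grid)
print(result[0][0])
```

The point of the hierarchy is that only one value per node crosses the inter-node network in stage 2; the final result matches a flat all-reduce.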

4. Open Tools, Protocols, and Model Routing Infrastructure

dspy-cli: Now open source, enabling scaffolding of DSPy projects and exposing modules as FastAPI endpoints or MCP tools.
RapidaAI Voice Stack: Fully open-source, targeting teams tired of per-minute markups on third-party voice APIs.
MCP Protocol: New version released. Discussions on handling namespace collisions for third-party variants.
Tinygrad, LM Studio, OpenRouter: Tinygrad's @TinyJit details kernel replay. LM Studio users fixed API errors and debugged Flash Attention regressions. OpenRouter users reported Opus overload and model fallback bugs.

5. Safety, Robustness, Data Economics, and Evaluation Reality Checks

Emergent Misalignment: Replication study found Gemma 3 and Qwen 3 robust to insecure fine-tuning. "The JSON Trap" blog post argues JSON-only output reduces degrees of freedom to refuse harmful requests.
Hallucinations and Benchmark Contamination: Members argued that hallucinations arising inside multi-stage LLM pipelines still count as hallucinations of the overall system. Concerns raised about benchmark contamination leading to memorization.
Curriculum Learning, Data vs Compute, Job Impact: Debates on curriculum learning and coresets in LLM pretraining. Contrasting data vs. compute costs. MIT study claiming AI can replace 11.7% of the US workforce discussed.
Summarization, Safety, Legal/Policy: LLMs criticized for poor summarization on dense texts. Debates on ChatGPT's political bias, copyright of Gemini images, and Steam's AI content disclosure rules.

Discord: Detailed by-Channel Summaries

LMArena Discord

General: Debates on "cameo" vs. "deepfake," Flux 2 models' arrival and comparison to NB Pro, praise for NB Pro's "insane" image generation, and SynthID's role in preventing model "nerfing." A "stealth model" named Robin rumored to outperform Opus 4.5.
Announcements: Updates on Image Edit flow, Flux 2 models added, new Search Arena models (gemini-3-pro-grounding, gpt-5.1-search), and Claude's top placement on leaderboards.

Perplexity AI Discord

General: Concerns about Palantir Technologies' "doom potential," discussions on the Nvidia-Altman partnership inflating AI bubbles, disputes over Opus 4.5's token efficiency, Gemini Agent's sandboxed Python script execution, and Perplexity blocking user prompts leading to profile editing issues.

Unsloth AI (Daniel Han) Discord

General: FP8 RL documentation issues, advice on quantized model inference speed, discussion on the obsolescence of kernels due to torch.compile, announcement of the ERNIE AI Developer Challenge, and Unsloth's presence at NeurIPS San Diego.
Off-topic: Reports of Claude Opus 4.5 giving context limit errors, inquiries about wakeword solutions, job interview advice, discussions on CPU offloading for long context training, and a mention of the game "Slop Detective."
Help: Recommendations for Vulkan over IPEX for llama.cpp, issues with GGUF conversion (model_type attribute), advice on continued pretraining vs. fine-tuning for autocompletion, Qwen3 8B fine-tuning problems, and AMD GPU support for bitsandbytes.
Showcase: Announcement of the ERNIE AI Developer Challenge and availability of free AMD notebooks for ERNIE finetuning.
Research: Sharing of ES HyperScale for boosted training throughput, LESA for learnable LLM layer scaling-up, and efficient CPU training possibilities.

Cursor Community Discord

General: Haiku models praised for documentation accuracy, Composer-1 for code implementation. Discussions on token costs and model overload/degradation. Agent review pricing confusion. Frustration with Cursor's linting error handling. Agent plans not being automatically saved.

GPU MODE Discord

General: Exploration of Triton kernels for partially trainable embeddings and weighted-loss softmax. NVIDIA leaderboard submissions and personal bests. Tensor Core optimization tips shared. Discussion on 2-bit dequantization on Intel GPUs. Factorio Learning Environment documentation deployed.
Triton-Gluon: Proton profiling tool issues, interest in tensor descriptors and auto-tune parameters, and a persistent matmul tutorial example.
CUDA: Exploration of GEMM with tensor cores, sharing of optimization resources, and discussion on data loading strategies (ldmatrix.b16).
Torch: Inquiries about differentiating forward passes with and without gradient checkpointing, and using boolean flags for differentiation.
Beginner: Guidance on contributing to XLA, rules of thumb for GPU benchmarking warmup runs, consideration of thermal limits in benchmarking, and datacenter settings.
Jax-Pallas-Mosaic: Performance comparison of jax.pmap vs. jit on a single GPU, and code portability considerations for multi vs. single GPU systems.
Off-topic: Memes shared.
Irl-Meetup: Travel plans for NeurIPS and SF, inviting chats about GPUs.
Intel: Quest for 2-bit dequantization on Intel GPUs, seeking faster alternatives to Torch.
Self-promotion: Link to an aerlabs post.
🍿: Urmish joins LLM initiatives, seeking guidance on subgroups for LLM training and agentic harnesses. Discussion on LLM kernel generation.
Thunderkittens: Newcomer pioneers CUDA and Flash Attention. Discussions on open areas for kernel contributions (MoE, linear attention backwards) and AMD GPU availability.
Submissions: NVIDIA nvfp4_gemv leaderboard sees numerous submissions, with users achieving top ranks. Discussion on a potentially fishy submission and optimization efforts.
Factorio-Learning-Env: Documentation for the Factorio Learning Environment is live.
Cutlass: Discussion on SIMT load overheads and a breakdown of tiled_mma example.
Singularity-Systems: Updates on picograd commits, tensor implementation, and evaluator/device runtimes.
Multi-GPU: NVRAR speeds up multi-node LLM inference. PAT algorithm discussed for all-gather and reduce-scatter operations.
Nvidia-Competition: CuTeDSL packed FP16 instructions, eval.py script scrutiny, cudaStreamSynchronize() overhead, and an "LLM-only" approach using Gemini 3.5 Pro and Opus 4.5.
Hf-Kernels: Metal kernels release delayed. MacOS compatibility issues noted.
Robotics-VLA: 7x laundry folding robot debut. No-action filtering importance for VLAs. Qwen3-VL optimization hurdles. Comparison of classic binning vs. FAST tokenizer.

OpenAI Discord

AI-Discussions: Debate on ChatGPT's alleged left-wing bias. Nano Banana Pro praised for comic creation, with worries about it being "lobotomized." Commercial copyright and ethical quandaries of AI-generated images. GPT-5.0 Mini disappointment. Argument that OpenAI's UI caters to a neurotypical audience.
GPT-4-Discussions: GPT 5.1 praised for anime storytelling, but strict guardrails block violence. Debate on chat reference memory issues and GPT 5.1 vs. GPT 4 performance.
Prompt-Engineering: (No new messages)
API-Discussions: (No new messages)

LM Studio Discord

General: API endpoint error resolved by consulting documentation. Image captioning issues after update resolved by switching to Gemma 3. Flash Attention glitches impacting model functionality. GPT OSS 20B speed showcased. Free Mint opportunity with OpenSea.
Hardware-Discussion: Discussions on Q8 cache, GPU fans at 0% during inference, hardware devaluation, potential CPU fire averted, and PCIe bifurcation breakthroughs.

OpenRouter Discord

App-Showcase: Color picker bug reported. RapidaAI open-sourced voice AI platform.
General: Opus overload outage reported. Model fallback bug discovered. Free Deepseek R1 model removed. Buzz around upcoming Meganova Chat. OpenRouter's normalized interfaces praised.
New-Models: (No new messages)
Discussion: Arrakis AI model still looks yellow-ish. Text-to-Video Leaderboard updated, with David in first place.

Nous Research AI Discord

Announcements: Psyche Team Office Hours scheduled.
General: Suno's Warner Music partnership sparks debate. Data vs. compute costs highlighted. Blackwell architecture performance warnings (INT/FP mixing). Z-Image model released on Modelscope. Debate on AI disclosure policies on Steam.
Ask-About-LLMs: LLM benchmarks face pre-training data contamination. Overcoming contamination in benchmarks is challenging.
Interesting-Links: Lecture on Information Retrieval history shared.

Eleuther Discord

General: Hallucinations in multi-stage LLMs still count as hallucinations. LLMs compared to "golden retrievers." Debate on verifying AI claims and fact-checking misinformation. Discussion on AI and collaborative work.
Research: Debate on SGD shuffling. PIQA paper typo noted. Emergent Misalignment paper replication and "JSON Trap" discovery. Resources sought for AI for Drug Discovery.
Scaling-Laws: Link to a paper on scaling laws.

Latent Space Discord

AI-General-Chat: Claude Code's Plan Mode overhaul with parallel subagents. DeepMind documentary "The Thinking Game" released. Jeff Dean's AI retrospective and Gemini 3.0. Claude generating PowerPoint slides. Comparison of ChatGPT Pro vs. Claude.
AI-Announcements: RF-DETR paper authors host SOTA Vision special. NeurIPS signups reminder. 2025 Dev Writers Retreat accepting final signups.
Genmedia-Creative-AI: Whisper Thunder surpasses VideoGen in text-to-video. Nano Banana Pro's realism sparks debate and fraud concerns. OpenAI's image-gen upgrade receives mixed reception. FLUX 2 Pro boasts improved visuals.

Yannic Kilcher Discord

General: Department of Energy plans national AI platform. MIT study on AI replacing jobs sparks debate. LLMs criticized for poor summarization. Debate on curriculum learning techniques for LLM pretraining.
Paper-Discussion: Adobe AI summaries criticized. LLMs struggle to summarize high-density info. Discussion on ADHD and Autism in Tech. Proposal for a new rule to curb paper flooding.
ML-News: Tencent releases Hunyuan model. MAGA supporters push back against AI datacenters. MIT study on AI workforce replacement.

HuggingFace Discord

General: Hugging Face Inference API grayed out. Christmas gift drop shared. LM Studio PDF teacher suggested. Spanish text dataset quest.
Cool-Finds: (No new messages)
I-Made-This: RapidaAI goes open source. French Classic Books Dataset created. AI Sci-Fi Short Film released.
Reading-Group: Chunking's impact is small. GNN presentation on AlphaFold approaching. Structured data is valuable.
Agents-Course: (No new messages)

Modular (Mojo 🔥) Discord

General: Mojo keeps repos synced using Copybara.
Max: MAX examples for newbies sought. Discussion on MAX written in Python. Mojo API's return to MAX anticipated. Hurdles highlighted for migrating Python MAX code to Mojo MAX.

Tinygrad (George Hotz) Discord

Learn-Tinygrad: TinyJit internals detailed (only replays kernels). Randomness functions in Tensor work as expected. Two JIT runs required for tracing, with potential changes. Good, but outdated, JIT tutorial shared. Focus shifting to frontend usability.

Moonshot AI (Kimi K-2) Discord

General-Chat: Kimi's limits explored. Debate on chatbots vs. canvases for UIs. Conversational fallacy discussed.

DSPy Discord

Show-and-Tell: dspy-cli tool goes open source, enabling scaffolding of DSPy projects and deployment as HTTP APIs. Acclaimed for project utility.
General: Trajectory injection sought for ReAct modules. API choices debated for web search implementation. Exa API includes summarization. Latency issues with web search API calls.

MCP Contributors (Official) Discord

General: New protocol version released. UI SEP ships out-of-band. MCP considers namespace collisions.

Manus.im Discord

General: AI engineer introduces themself with extensive experience. User reports API issues and lack of support. Members inquire about a Telegram channel.

Aider (Paul Gauthier) Discord

General: Community suggests new site admin for benchmarking. Survey on Opus 4.5 vs. Sonnet 4.5 upgrade. Bedrock Identifier "snafu" reported.

Nov

2025

Reducing Privacy leaks in AI: Two approaches to contextual integrity

Ensuring Contextual Integrity for AI Agents

As AI agents become more autonomous, it's crucial they adhere to contextual norms regarding information sharing to maintain user trust. The theory of contextual integrity frames privacy as the appropriateness of information flow within specific social contexts. Applied to AI agents, this means their information sharing should be suitable for the situation, considering who is involved, what information is being shared, and why. For instance, an AI assistant booking a medical appointment should share only necessary details like the patient's name and relevant history, not extraneous insurance information. Similarly, an AI with access to a user's calendar and email should use this to make lunch reservations based on available times and preferences, but should not reveal personal emails or other appointment details. However, current large language models (LLMs) often lack this contextual awareness, potentially disclosing sensitive information inadvertently. This highlights the need for stronger mechanisms within AI systems to determine what information is suitable to share and when. Microsoft researchers are developing ways to imbue AI systems with contextual integrity. Two complementary research efforts aim to enhance AI's sensitivity to information-sharing norms:

Privacy in Action: Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents (EMNLP 2025) introduces PrivacyChecker, a lightweight, model-agnostic module that can be integrated into agents to improve their sensitivity to contextual integrity. It transforms static privacy benchmarks into dynamic environments, revealing higher privacy risks in real-world agent interactions. PrivacyChecker extracts information flows, classifies them as allow/withhold, and applies optional policy guidelines within a single prompt, without requiring model retraining.
- Integration Methods: PrivacyChecker can be integrated as a global system prompt, embedded within specific tool calls, or used as a standalone Model Context Protocol (MCP) tool.
- Evaluation: On the PrivacyLens benchmark, PrivacyChecker reduced information leakage significantly (e.g., from 33.06% to 8.32% on GPT-4o). In dynamic evaluations using PrivacyLens-Live with MCP tools and Agent2Agent (A2A) communication, it maintained substantially lower leakage rates compared to baseline prompts, even in complex multi-tool and multi-agent scenarios.
Contextual Integrity in LLMs via Reasoning and Reinforcement Learning (NeurIPS 2025) explores building contextual integrity directly into the model. This approach treats contextual integrity as a reasoning problem:
- Contextual Integrity with Chain-of-Thought (CI-CoT): This method repurposes chain-of-thought prompting to have the model assess contextual information disclosure norms before responding. It directs the model to identify necessary attributes for a task and those to be withheld. CI-CoT reduced information leakage but sometimes led to overly conservative responses, withholding necessary information.
- Contextual Integrity with Reinforcement Learning (CI-RL): To address the trade-off between privacy and helpfulness, CI-RL optimizes for both. The model is rewarded for completing tasks using only contextually appropriate information and penalized for inappropriate disclosures. This approach retains contextual sensitivity while preserving task performance, nearly matching CI-CoT's privacy gains with significantly improved helpfulness scores.
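
The PrivacyChecker flow described above (extract information flows, classify each as allow/withhold, apply optional policy guidelines, all within a single prompt) might look roughly like the sketch below. The prompt wording, the function names `build_privacy_check_prompt` and `parse_verdicts`, and the `ALLOW:`/`WITHHOLD:` reply format are all hypothetical and are not taken from the paper. No model is called; a hand-written reply stands in for model output.

```python
from typing import Dict, List, Optional


def build_privacy_check_prompt(task: str, draft_message: str,
                               guidelines: Optional[List[str]] = None) -> str:
    """Assemble one prompt asking a model to list flows and label each one."""
    lines = [
        "You are reviewing an agent's draft action for contextual integrity.",
        f"Task: {task}",
        f"Draft message: {draft_message}",
        "Step 1: List each information flow (what is shared, with whom, why).",
        "Step 2: Label each flow on its own line as 'ALLOW: <flow>' or 'WITHHOLD: <flow>'.",
    ]
    # Policy guidelines are optional, mirroring the description above.
    if guidelines:
        lines.append("Policy guidelines:")
        lines.extend(f"- {g}" for g in guidelines)
    return "\n".join(lines)


def parse_verdicts(model_output: str) -> Dict[str, List[str]]:
    """Collect flows from lines that start with ALLOW: or WITHHOLD: (assumed format)."""
    verdicts: Dict[str, List[str]] = {"allow": [], "withhold": []}
    for raw in model_output.splitlines():
        label, sep, flow = raw.partition(":")
        key = label.strip().lower()
        if sep and key in verdicts:
            verdicts[key].append(flow.strip())
    return verdicts


prompt = build_privacy_check_prompt(
    task="Book a medical appointment",
    draft_message="Patient Jane Doe, knee pain history, insurance ID 12345",
    guidelines=["Share only details the clinic needs to schedule the visit."],
)
# Hand-written stand-in for a model reply; no model is called here.
sample_reply = "ALLOW: patient name to clinic\nWITHHOLD: insurance ID to clinic"
verdicts = parse_verdicts(sample_reply)
print(verdicts)
```

Because everything lives in one prompt, the same text could in principle be used as a system prompt, attached to a tool call, or served from a standalone tool, which matches the integration options listed above.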

Together, these research efforts provide a path from problem identification to practical solutions. PrivacyChecker's evaluation framework highlights privacy leakage points, while CI-CoT and CI-RL develop models that can appropriately handle information disclosure. Both projects leverage the theory of contextual integrity to build AI systems that better preserve user privacy.
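
To make the CI-RL trade-off concrete, here is a toy reward function in the spirit of the description above: credit for including the attributes a task needs, a penalty for each inappropriate disclosure. The substring matching, the `leak_penalty` weight, and the name `ci_reward` are illustrative assumptions, not the reward used in the paper.

```python
from typing import Iterable


def ci_reward(response: str, required: Iterable[str], restricted: Iterable[str],
              leak_penalty: float = 1.0) -> float:
    """Toy reward: share of required attributes present, minus a penalty per leak."""
    text = response.lower()
    required = list(required)
    # Helpfulness: fraction of task-relevant attributes that appear in the response.
    helpful = sum(a.lower() in text for a in required) / len(required) if required else 0.0
    # Privacy: count restricted attributes that appear in the response.
    leaks = sum(a.lower() in text for a in restricted)
    return helpful - leak_penalty * leaks


needed = ["Jane Doe", "knee pain"]
private = ["insurance"]

good = ci_reward("Booking for Jane Doe, knee pain.", needed, private)
leaky = ci_reward("Booking for Jane Doe, knee pain, insurance ID 12345.", needed, private)
overcautious = ci_reward("I cannot share any details.", needed, private)
print(good, leaky, overcautious)
```

In this sketch both the leaky response and the overly conservative one score below the response that shares exactly what the task needs, which is the behavior a combined privacy-and-helpfulness objective is meant to encourage.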

Nov

2025

SAM 3’s ability to precisely detect and track objects is helping @ConservationX measure the survival...

SAM 3’s ability to precisely detect and track objects is helping @ConservationX [https://x.com/ConservationX] measure the survival of animal species around the world and prevent their extinction. 🔗 Learn more about the work: ai.meta.com/blog/segment-a… [https://ai.meta.com/blog/segment-anything-conservation-x-wildlife-monitoring/?utm_source=twitter&utm_medium=organic_social&utm_content=video&utm_campaign=sam] 🔗 View on Twitter [https://x.com/AIatMeta/status/1993020997721899473/video/1]

Nov

2025

We partnered with @ConservationX to build the SA-FARI dataset with 10,000+ annotated videos includin...

We partnered with @ConservationX [https://x.com/ConservationX] to build the SA-FARI dataset with 10,000+ annotated videos including over 100 species of animals. We’re sharing this dataset to help with conservation efforts around the globe. 🔗 Find it here: conservationxlabs.com/sa-fari [https://www.conservationxlabs.com/sa-fari]

Nov

2025

Fara-7B: An Efficient Agentic Model for Computer Use

Microsoft has released Fara-7B, an experimental agentic Small Language Model (SLM) designed for computer use. Unlike traditional chatbots, Fara-7B uses computer interfaces like a mouse and keyboard to complete tasks, visually perceiving webpages and performing actions such as scrolling, typing, and clicking. With only 7 billion parameters, it achieves state-of-the-art performance in its size class, enabling on-device execution for reduced latency and improved privacy. Fara-7B can automate web tasks like filling forms, searching information, and booking travel. It was trained using a novel synthetic data generation pipeline that avoids manual annotation, drawing from real web pages and tasks. Demonstrations include shopping, information retrieval, and tool integration with Bing Maps and Search. Fara-7B shows strong performance across benchmarks, including WebVoyager (73.5%) and WebTailBench (38.4%), outperforming larger models in some cases. Safety considerations include data privacy, auditability, sandboxing, refusal training for harmful tasks, and stopping at critical points to seek user approval. Fara-7B is available on Microsoft Foundry and Hugging Face, with a version optimized for Copilot+ PCs. Future work will focus on enhanced multimodal models and Reinforcement Learning.

Nov

2025

https://t.co/D2JtSpOc2g

padphone (@lepadphone): So much fun! SAM 3D! You can extract a 3D object directly from an image! And add effects! 🔗 View on Twitter [https://x.com/lepadphone/status/1991370701123805203/video/1] 🔗 View Quoted Tweet [https://x.com/lepadphone/status/1991370701123805203]

Nov

2025

https://t.co/CE9mCi1F02

Eike Drescher (@eikedrescher): Okay this new model is insane. Generate an image, then give it to the model to turn it into a 3D model with matching texture in seconds. Creativity on Spielwerk will be endless with this. 🔗 View on Twitter [https://x.com/eikedrescher/status/1991471416332677387/video/1] 🔗 View Quoted Tweet [https://x.com/eikedrescher/status/1991471416332677387]

Nov

2025

The Segment Anything Playground is a new way to interact with media. Experiment with Meta’s most adv...

The Segment Anything Playground is a new way to interact with media. Experiment with Meta’s most advanced segmentation models, including SAM 3 + SAM 3D, and discover how these capabilities can transform your creative projects and technical workflows. Check out some inspo and tips in the 🧵 below, then head over to the Playground to get started: aidemos.meta.com/segment-anythi… [https://aidemos.meta.com/segment-anything/] 🔗 View on Twitter [https://x.com/AIatMeta/status/1991942484633821553/video/1]

Nov

2025

We’re advancing on-device AI with ExecuTorch, now deployed across devices including Meta Quest 3, Ra...

We’re advancing on-device AI with ExecuTorch, now deployed across devices including Meta Quest 3, Ray-Ban Meta, Oakley Meta Vanguard and Meta Ray-Ban Display. By eliminating conversion steps and supporting pre-deployment validation in PyTorch, ExecuTorch accelerates the path from research to production, ensuring consistent, efficient AI across a diverse hardware ecosystem. Read the full technical deep dive: ai.meta.com/blog/executorc… [https://ai.meta.com/blog/executorch-reality-labs-on-device-ai/?utm_source=twitter&utm_medium=organic_social&utm_content=photo&utm_campaign=executorch]

Nov

2025

Collecting a high quality dataset with 4M unique phrases and 52M corresponding object masks helped S...

Collecting a high quality dataset with 4M unique phrases and 52M corresponding object masks helped SAM 3 achieve 2x the performance of baseline models. Kate, a researcher on SAM 3, explains how the data engine made this leap possible. 🔗 Read the SAM 3 research paper: go.meta.me/6411f7 [http://go.meta.me/6411f7] 🔗 View on Twitter [https://x.com/AIatMeta/status/1991640180185317644/video/1]

Nov

2025

SAM 3D enables accurate 3D reconstruction from a single image, supporting real-world applications in...

SAM 3D enables accurate 3D reconstruction from a single image, supporting real-world applications in editing, robotics, and interactive scene generation. Matt, a SAM 3D researcher, explains how the two-model design makes this possible for both people and complex environments. 🔗 Read the SAM 3D Objects research paper: go.meta.me/8c08ca [https://go.meta.me/8c08ca] 🔗 Read the SAM 3D Body research paper: go.meta.me/5e60ed [https://go.meta.me/5e60ed] 🔗 View on Twitter [https://x.com/AIatMeta/status/1991605451809513685/video/1]

AI agents are poised to transform digital marketplaces. To explore what can happen when AI agents interact and transact at scale, Microsoft Research built Magentic Marketplace [https://www.microsoft.com/en-us/research/blog/magentic-marketplace-an-open-source-simulation-environment-for-studying-agentic-markets/], an open-source simulation environment for studying agentic market designs.

Oct

2025

Accelerating discovery with the AI for Math Initiative

The initiative brings together some of the world's most prestigious research institutions to pioneer the use of AI in mathematical research.

Oct

2025

Anthropic officially opens Tokyo office, signs Memorandum of Cooperation with the Japan AI Safety Institute

This week, we opened our first Asia-Pacific office in Tokyo, a milestone in Anthropic's international expansion. Our CEO and co-founder Dario Amodei traveled to Tokyo to meet with Prime Minister Takaichi, address members of the LDP Digitization Headquarters Committee, meet customers and sign a Memorandum of Cooperation with the Japan AI Safety Institute. These actions deepen our partnership with Japanese government, enterprises, and cultural institutions. “Technology and human progress are not in tension, but advance together,” said Dario Amodei. “This principle, this Japanese notion of the purpose of technology, is at the heart of Anthropic. It’s how we view the world, and it’s the reason we see Japan as a vital hub for growing our business.”

Building Shared Standards for AI Evaluation

AI development transcends national borders. As these systems become more powerful, we need international cooperation on evaluation standards: shared ways to assess capabilities, test systems, and understand risks. This week, Anthropic signed a Memorandum of Cooperation with the Japan AI Safety Institute to collaborate on AI evaluation methodologies and to monitor emerging trends in the field. This partnership builds on Anthropic's collaboration with AI safety institutes worldwide, including formal agreements with the US Center for AI Standards and Innovation (CAISI) and ongoing work with the UK's AI Security Institute. In November 2024, the US and UK institutes conducted their first joint evaluation of Claude 3.5 Sonnet, demonstrating how international organizations can advance the science of AI evaluation together. Anthropic also joined the Hiroshima AI Process Friends Group this week, deepening our commitment to the framework we signed in 2023 promoting safe, secure, and trustworthy AI development globally while facilitating innovation.

Japan's Approach to AI Adoption

"What we're seeing in Japan validates our belief that the most successful AI deployments enhance human capabilities rather than replace them," said Hidetoshi Tojo, Representative Director and President of Anthropic in Japan. "Japanese businesses understand that AI should allow people to focus on what humans do best: creative problem-solving, nuanced communication, and building trusted relationships." Japan ranks in the top 25% globally for AI adoption according to recent data from Anthropic’s Economic Index [https://www.anthropic.com/economic-index]. People in Japan use AI as a collaborative tool to augment human capabilities, primarily for productivity-enhancing tasks like academic research, writing, and document editing, reflecting a focus on enhancing creativity and communication quality rather than replacing human judgment.

Leading Japanese enterprises are already seeing results. Rakuten [https://www.claude.com/customers/rakuten] is using Claude for autonomous coding projects, dramatically improving developer productivity. Nomura Research Institute [https://www.linkedin.com/feed/update/urn:li:activity:7333205487473082368/] has transformed document analysis from hours to minutes while maintaining precision. Panasonic [https://news.panasonic.com/global/press/en250108-11] has integrated Claude across both business operations and consumer applications. And Classmethod [https://www.claude.com/customers/classmethod], a leading cloud integrator, reports achieving 10x productivity gains, with Claude Code generating 99% of a recent project's codebase. This week we also hosted our first Builder Summit in Tokyo, where we met more than 150 startups and founders building with Claude. All this reflects the extraordinary momentum we're seeing across Asia Pacific, where our run rate revenue has grown more than 10x in the past year.

Supporting Japan's Creative Community

We also announced that we have extended our partnership with the Mori Art Museum. We will work long-term with the museum in a number of ways, including collaborating on the upcoming exhibition Roppongi Crossing 2025: What Passes Is Time. We Are Eternal. The exhibition is the eighth edition of a series that was first launched in 2004 to provide a snapshot of Japan’s contemporary art scene at a particular moment in time. This follows our collaboration with the museum on the highly acclaimed MACHINE LOVE: Video Game, AI and Contemporary Art exhibition.

Looking Forward

The people and organizations we’ve met in Japan share our conviction that technological progress must enable human progress. We're building a team in Tokyo to work alongside partners across industry, government, and culture toward that goal. Over the coming months, we'll bring this same approach to Seoul and Bengaluru as we continue our Asia-Pacific expansion. We look forward to helping innovation flourish across the region. For information about career opportunities at our Tokyo office, see here [https://www.anthropic.com/news/anthropic.com/careers].

Oct

2025

OpenAI completes Microsoft + For-profit restructuring + announces 2028 AI Researcher timeline + Platform / AI cloud product direction + next $1T of compute

OpenAI has completed a major recapitalization and restructuring, forming a Public Benefit Corporation with a non-profit Foundation holding special voting rights and equity valued at $130B. Microsoft holds about 27% diluted ownership, and OpenAI has committed to $250B in Azure spend; Microsoft loses exclusivity on compute but retains Azure API exclusivity until AGI is declared. The compute infrastructure deals for 2025 total 30GW worth $1.4T, with OpenAI aiming to build 1GW per week at $20B per GW and projecting $3-4 trillion of infrastructure by 2033. The company is shifting focus from first-party apps to a platform approach, emphasizing ecosystem growth and third-party development. Sam Altman (@sama) is the key figure in this transition, which carries significant financial and strategic implications for AI industry partnerships, including openness to Anthropic and Google Gemini on Azure.
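The compute figures in this item can be sanity-checked with simple arithmetic. The sketch below is a back-of-the-envelope calculation only: it uses the numbers quoted in the recap (30GW, $1.4T, 1GW per week, $20B per GW) and assumes, purely for illustration, that the weekly build rate is sustained for 52 weeks a year. The function names are hypothetical.

```python
# Figures quoted in the recap (all approximate).
COMMITTED_GW = 30           # compute deals announced for 2025
COMMITTED_USD = 1.4e12      # quoted value of those deals
TARGET_USD_PER_GW = 20e9    # stated cost target
TARGET_GW_PER_WEEK = 1      # stated build-rate target
WEEKS_PER_YEAR = 52         # ASSUMPTION: build rate sustained all year


def cost_per_gw(total_usd: float, total_gw: float) -> float:
    """Average cost per gigawatt implied by a total spend."""
    return total_usd / total_gw


def annual_spend(gw_per_week: float, usd_per_gw: float) -> float:
    """Yearly spend if the weekly build rate is sustained."""
    return gw_per_week * WEEKS_PER_YEAR * usd_per_gw


implied = cost_per_gw(COMMITTED_USD, COMMITTED_GW)
yearly = annual_spend(TARGET_GW_PER_WEEK, TARGET_USD_PER_GW)

print(f"Implied cost of announced deals: ${implied / 1e9:.1f}B per GW")
print(f"Stated target:                   ${TARGET_USD_PER_GW / 1e9:.0f}B per GW")
print(f"Annual spend at 1 GW/week:       ${yearly / 1e12:.2f}T")
```

Under these assumptions the announced deals work out to roughly $47B per GW, more than double the stated $20B target, and a sustained 1GW-per-week pace at the target cost would amount to about $1.04T per year.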

Oct

2025

Advancing Claude for Financial Services

We're expanding Claude for Financial Services [https://www.claude.com/solutions/financial-services] with an Excel add-in, additional connectors to real-time market data and portfolio analytics, and new pre-built Agent Skills, like building discounted cash flow models and initiating coverage reports. These updates build on Sonnet 4.5’s state-of-the-art performance on financial tasks, topping the Finance Agent benchmark [https://www.vals.ai/benchmarks/finance_agent] from Vals AI at 55.3% accuracy. They augment Claude’s intelligence with solutions for time-consuming but critical financial work, built into preferred industry tools.

CLAUDE FOR EXCEL

We’re releasing Claude for Excel [https://claude.com/claude-for-excel] in beta as a research preview. This allows users to work directly with Claude in a sidebar in Microsoft Excel, where Claude can read, analyze, modify, and create new Excel workbooks. Claude provides full transparency about the actions it takes: it tracks and explains its changes and lets users navigate directly to the cells it references in its explanations.

[Image: Claude for Excel analyzing a spreadsheet containing Acme Grille, Inc.'s consolidated income statement from 2020-2024, providing real-time guidance on financial modeling.]

This means that Claude can discuss how a spreadsheet works, modify it while preserving its structure and formula dependencies, debug and fix cell formulas, populate templates with new data and assumptions, or build new spreadsheets entirely from scratch.

Claude for Excel adds to our existing integrations with Microsoft’s applications. In the Claude apps, Claude can also create and edit files, including Excel spreadsheets and PowerPoint slides, and connect to Microsoft 365 to search for files, emails, and Teams conversations. Select Claude models are also available in Microsoft Copilot Studio and Researcher agent.

Claude for Excel is now in beta as a research preview for Max, Enterprise, and Teams users.
We’ll collect real-world feedback from 1,000 initial users before rolling the feature out more broadly. To join the waitlist, click here [https://docs.google.com/forms/d/e/1FAIpQLSedsdrIw00BOGbiIhAQvTaC7mOQRW6jOofAt7PJ1lYAGzvfUw/viewform?usp=dialog]. CONNECTING CLAUDE TO LIVE INFORMATION Connectors [https://claude.ai/redirect/website.v1.5154410f-3616-49c4-bf63-101d3780122e/settings/connectors] provide Claude with direct access to external tools and platforms. In July [https://www.anthropic.com/news/claude-for-financial-services], we added connectors for S&P Capital IQ, Daloopa, Morningstar, and Pitchbook. We’re adding new connectors that give Claude immediate access to more information in real time: * Aiera provides Claude with real-time earnings call transcripts and summaries of investor events, like shareholder meetings, presentations, and conferences; * Aiera’s connector also enables a data feed from Third Bridge, which gives Claude access to a library of insights interviews, company intelligence, and industry analysis from experts and former executives; * Chronograph gives private equity investors operational and financial information for portfolio monitoring and conducting due diligence, including performance metrics, valuations, and fund-level data; * Egnyte enables Claude to securely search permitted data for internal data rooms, investment documents, and approved financial models, while maintaining governed access controls; * LSEG connects Claude to live market data, including fixed income pricing, equities, foreign exchange rates, macroeconomic indicators, and analysts’ estimates of other important financial metrics; * Moody’s provides access to proprietary credit ratings, research, and company data – including ownership, financials and news on more than 600 million public and private companies – supporting work and research in compliance, credit analysis, and business development; * MT Newswires provides Claude with access to the latest global multi-asset 
class news on financial markets and economies. For details on MCP connector setup and prompting guidance to maximize the benefit of each connector, see our documentation here [https://support.claude.com/en/collections/13972013-claude-for-financial-services]. NEW AGENT SKILLS FOR FINANCE TASKS Earlier this month, we introduced Agent Skills [https://www.anthropic.com/news/skills]. Skills are folders that include instructions, scripts, and resources that Claude can use to perform given tasks. Skills work across all Claude apps, including Claude.ai [http://claude.ai/redirect/website.v1.5154410f-3616-49c4-bf63-101d3780122e], Claude Code, and our API. To make Claude better at financial services tasks, we’ve added 6 new skills: * Comparable company analysis, with valuation multiples and operating metrics, which can be easily refreshed with updated data; * Discounted cash flow models, including full free cash flow projections, WACC calculations, scenario toggles, and sensitivity tables; * Due diligence data packs, processing data room documents into Excel spreadsheets with financial information, customer lists, and contract terms; * Company teasers and profiles, condensed company overviews for pitch books and buyer lists; * Earnings analyses, which research quarterly transcripts and financials to extract important metrics, guidance changes, and management commentary; * Initiating coverage reports with industry analysis, company deep-dives, and valuation frameworks. As with Claude for Excel, these new skills are being rolled out in preview for Max, Enterprise, and Teams users. You can sign up on behalf of your team or organization here [https://docs.google.com/forms/d/e/1FAIpQLSdXOB2bR7r_YhwENL1VplbgWvQ96YhInhHj5Fr9_V_MAOCiNQ/viewform]. CLAUDE’S IMPACT IN FINANCIAL SERVICES Claude is already widely used by leading banking, asset management, insurance, and financial technology companies. 
It supports front office tasks like client experience, middle office tasks in underwriting, risk and compliance, and back office tasks like code modernization and legacy processes. With ongoing updates to our models and products specific to financial services, we expect Claude to become even better in roles like these.

“Citi chose to leverage Claude as part of its AI powered Developer Platform because of its advanced planning and agentic coding capabilities, focus on safety and reliability, and compatibility with our workloads.”
David Griffiths, CTO, Citi

“Working with Anthropic goes beyond deploying another AI tool; it's about partnering with a company that understands the complexity that financial services requires. Claude excels by seamlessly integrating multiple data sources and automating workflows that previously consumed significant time.”
Bobby Griffiths, Head of AI and Digital Innovation, RBC Capital Markets

“What we've valued about Anthropic is not just their powerful models, but how they've positioned them for enterprise needs. When I talk with customers about AI, data privacy is always their first concern: it's the critical foundation we have to address before we can even begin discussing capabilities.”
David Horn, AI Lead, Brex

“75% of our engineers now save 8 to 10+ hours every week using our open source AI agent for creating SQL queries (codename goose), accelerating velocity and cutting down on busywork. For the tasks we care about measuring specifically, the Claude family has performed the best.”
Bradley Axen, Principal Data and Machine Learning Engineer, Block

“Anthropic's multi-cloud solution stands out for its scale, performance and security, aligning with our operational needs and customer expectations. It exceeded our performance benchmarks and met all our security requirements, making it the ideal solution. We think Claude will help Coinbase build solutions for different customer segments and bring a billion customers to the crypto economy.”
Varsha Mahadevan, Senior Engineering Manager, Coinbase

“As one of Canada’s largest institutional investors, BCI is driven to experiment, build, and innovate. Claude has accelerated our ability to get up-to-speed on investments and the underlying portfolio’s progress, making us more effective. As we push boundaries on what’s possible, we’re excited by the opportunities.”
Christian Grunt, Senior Principal, Private Equity, British Columbia Investment Management Corporation

“We see AI agents as the next evolution of commerce: autonomous systems that can predict, suggest, and find the products and services consumers need. This is only possible with a secure foundation at the base built on consent, privacy, transparency, and security. Anthropic is a key partner of Visa to make this dream a reality and shares our values and principles around responsible data usage.”
Rajat Taneja, President, Technology, Visa

“Claude serves as a remarkable reasoning-powered companion. Its ability to shift smoothly between quick execution and deep analysis, with fine-grained control over both, is exactly what's been missing in AI systems. Anthropic is a go-to technology & partner for AI workloads that require reliable intelligence at scale.”
Lucas Baker, Quantitative Research Lead, Jump Trading

“Through our training program with Anthropic, we've seen portfolio companies adopt Claude Code with remarkable results. Development teams are completing complex tasks in hours instead of days, and we're hearing from previously skeptical engineers that they can't imagine working without it.”
Mike Barry, Managing Operating Partner, Product & Technology, Francisco Partners

“Chronograph’s connection to Claude will fundamentally change what is possible for our clients – much like how Claude for Enterprise has transformed our internal operations. The partnership between Chronograph and Claude enables our clients to uncover new insights, save significant time, and achieve superior returns using their private capital portfolio data within Claude’s powerful toolset.”
Charlie Tafoya, Co-Founder and CEO, Chronograph

“With our GenAI-ready data offerings, we continue to support our customers in their AI evolution, enriching our data via a semantic layer and delivering it through Model Context Protocol (MCP) servers and Smart APIs. Our partnership with Anthropic makes Moody’s vast data estate accessible directly where our customers are innovating.”
Cristina Pieretti, Head of Digital Content and Innovation, Moody's

“LSEG has a long-established reputation for our open, partnership approach and meeting our customers wherever their workflows are taking place. Secure, enterprise grade AI applications, such as Claude, are expanding the opportunities for LSEG to build deep partnerships with customers.”
Ron Lefferts, Co-head, Data & Analytics, London Stock Exchange Group (LSEG)

Below, Alexander Bricken, Applied AI Lead for Financial Services, and Nicholas Lin, Head of Product for Financial Services, discuss Anthropic’s research and product strategy within financial services, as well as customer examples.

GETTING STARTED

To learn more about using Claude for Financial Services, see here [https://claude.com/solutions/financial-services] or contact [https://claude.com/contact-sales/financial-services] our sales team. And to see the new features in action and hear directly from financial services leaders, you can also register here [http://website.anthropic.com/webinars/%20claude-for-financial-services] for our launch webinar.

Oct

2025

Introducing Mika, the newest Grok Companion. Video made using Grok Imagine.

Introducing Mika, the newest Grok Companion. Video made using Grok Imagine. View on Twitter [https://x.com/m/status/1981865619176788279/video/1]

Oct

2025

not much happened today

LangSmith launched the Insights Agent with multi-turn evaluation for agent ops and observability, improving failure detection and user intent clustering. Meta PyTorch and Hugging Face introduced OpenEnv, a Gymnasium-style API and hub for reproducible agentic environments supporting distributed training. Discussions highlighted the importance of provider fidelity in agent coding, with OpenRouter's exacto filter improving stability. Builder UX updates include Google AI Studio's Annotation mode for Gemini code changes, Microsoft's Copilot Mode enhancements in Edge, and OpenAI's Shared Projects and Company Knowledge features for ChatGPT Business. Claude added project-scoped Memory. In reinforcement learning, Meta's ScaleRL proposes a methodology to predict RL scaling outcomes for LLMs with improved efficiency and stability.

As ICCV 2025 begins, DeepSeek releases DeepSeek-OCR, a novel 3B MoE vision-language model that compresses long text as visual context with high accuracy and efficiency, challenging traditional tokenization approaches. The model achieves ~97% decoding precision at <10× compression and processes up to ~33M pages/day on 20 A100-40G nodes, outperforming baselines like GOT-OCR2.0. Discussions highlight the potential for unlimited context windows and tokenization-free inputs, with contributions from @karpathy, @teortaxesTex, and others. In video generation, Google DeepMind's Veo 3.1 leads community benchmarks with advanced precision editing and scene blending, while Krea open-sources a 14B autoregressive video model enabling realtime long-form generation at ~11 FPS on a single B200 GPU.
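To put the DeepSeek-OCR throughput and compression figures in perspective, here is a rough sketch that breaks them down per node and per GPU. The pages/day and node count come from the recap; the number of GPUs per node is an assumption (the recap does not state it), and "<10×" is treated as exactly 10× to give a lower bound on vision tokens. The helper names are hypothetical.

```python
import math

PAGES_PER_DAY = 33_000_000   # "~33M pages/day" from the recap
NODES = 20                   # "20 A100-40G nodes" from the recap
GPUS_PER_NODE = 8            # ASSUMPTION: not stated in the recap
COMPRESSION = 10             # "<10x" treated as exactly 10x
SECONDS_PER_DAY = 86_400


def per_unit(total: float, units: int) -> float:
    """Split a total throughput evenly across identical units."""
    return total / units


def vision_tokens(text_tokens: int, compression: int = COMPRESSION) -> int:
    """Vision tokens needed to represent a page at a given compression ratio."""
    return math.ceil(text_tokens / compression)


per_node = per_unit(PAGES_PER_DAY, NODES)
per_gpu = per_unit(per_node, GPUS_PER_NODE)
per_gpu_per_second = per_gpu / SECONDS_PER_DAY

print(f"Pages/day per node: {per_node:,.0f}")
print(f"Pages/day per GPU:  {per_gpu:,.0f}")
print(f"Pages/sec per GPU:  {per_gpu_per_second:.2f}")
print(f"Vision tokens for a 1,000-token page: {vision_tokens(1000)}")
```

With eight GPUs per node assumed, the quoted figure corresponds to a little over two pages per second per GPU; a different node configuration would scale the per-GPU numbers accordingly.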

Oct

2025

Claude for Life Sciences

Accelerating scientific progress is a core part of Anthropic’s public benefit mission. We are focused on building the tools to enable researchers to make new discoveries – and eventually, to enable AI models to make these discoveries autonomously. Until recently, scientists typically used Claude for individual tasks, such as writing code for statistical analysis or summarizing papers. Pharmaceutical companies and others in industry also use it for tasks across the rest of their business, such as sales, to fund new research. Now, our goal is to make Claude capable of supporting the entire process, from early discovery through to translation and commercialization. To achieve this, we are rolling out several improvements aimed at making Claude a better partner for those in the life sciences, including researchers, clinical coordinators, and regulatory affairs managers.

MAKING CLAUDE A BETTER RESEARCH PARTNER

First, we have improved Claude’s underlying performance. Our most capable model, Claude Sonnet 4.5, is significantly better than previous models at a range of life sciences tasks. For example, on Protocol QA, a benchmark that tests the model’s understanding and proficiency with laboratory protocols, Sonnet 4.5 scores 0.83, compared to a human baseline of 0.79, and Sonnet 4’s performance of 0.74 (see footnote 1). Sonnet 4.5 shows a similar improvement over its predecessor on BixBench, an evaluation that measures its performance on bioinformatics tasks. To make Claude more useful for scientific work, we are now adding several new connectors [https://claude.com/partners/mcp] to scientific platforms, the ability to use Agent Skills, and life sciences-specific support in the form of a prompt library and dedicated support.

CONNECTING CLAUDE TO SCIENTIFIC TOOLS

Connectors [https://claude.ai/redirect/website.v1.7ecf2fc3-1f30-46a2-b757-10786ceef9f1/settings/connectors] allow Claude to access other platforms and tools directly.
We are adding several new connectors designed to make it easier to use Claude for scientific discovery:

* Benchling gives Claude the ability to respond to scientists’ questions with links back to source experiments, notebooks, and records;
* BioRender connects Claude to its extensive library of vetted scientific figures, icons, and templates;
* PubMed provides access to millions of biomedical research articles and clinical studies;
* Scholar Gateway developed by Wiley provides access to authoritative, peer-reviewed scientific content within Claude to accelerate research discovery;
* Synapse.org allows scientists to share and analyze data together in public or private projects;
* 10x Genomics allows researchers to conduct single cell and spatial analysis in natural language.

These connectors add to our existing set, which includes general-purpose tools like Google Workspace and Microsoft SharePoint, OneDrive, Outlook, and Teams. Claude can also already work directly with Databricks to provide analytics for large-scale bioinformatics research, and Snowflake to search through large datasets using natural language questions.

DEVELOPING SKILLS FOR CLAUDE

Last week, we released Agent Skills [https://www.anthropic.com/news/skills]: folders containing instructions, scripts, and resources that Claude can use to improve how it performs specific tasks. Skills are a natural fit for scientific work, as they allow Claude to consistently and predictably follow specific protocols and procedures. We are developing a number of scientific skills for Claude, starting with single-cell-rna-qc. This skill performs quality control and filtering on single-cell RNA sequencing data, using scverse [https://scverse.org/] best practices.

[Image: Claude performs quality control on single-cell RNA-seq data.]

In addition to the skills we are creating, scientists can build their own. For more information and guidance, including setting up custom skills, see here [https://support.claude.com/en/articles/12512180-using-skills-in-claude].

USING CLAUDE FOR LIFE SCIENCES

Claude can be used for life sciences tasks such as the following:

* Research, such as literature reviews and developing hypotheses: Claude can cite and summarize biomedical literature and generate testable ideas based on what it finds. Watch how Claude analyzes data, conducts a literature review, dives into potentially novel insights, turns this analysis into a presentation, and puts the finishing touches on slides with a figure from BioRender.
* Generating protocols: With the Benchling connector, Claude can draft study protocols, standard operating procedures, and consent documents.
* Bioinformatics and data analysis: Process and analyze genomic data with Claude Code. Claude can present its results in slides, docs [https://www.anthropic.com/news/create-files], or code notebook format.
* Clinical and regulatory compliance: Claude can draft and review regulatory submissions, and compile compliance data.

In addition, to help scientists get started quickly, we are creating a library of prompts [https://support.claude.com/en/articles/12614768-getting-started-with-claude-for-life-sciences] that should elicit the best results on tasks like the above.

PARTNERSHIPS AND CUSTOMERS

We are providing hands-on support from dedicated subject matter experts in our Applied AI and customer-facing teams. We are also partnering with companies specializing in helping organizations adopt AI for life sciences work. These include Caylent, Deloitte, KPMG, PwC, Quantium, Slalom, Tribe AI, and Turing, along with our cloud partners, AWS and Google Cloud. Many of our existing customers and partners have already been using Claude for a broad range of real-world scientific tasks:

“Claude, paired with internal knowledge libraries, is integral to Sanofi's AI transformation and used by most Sanofians daily in our Concierge app. We're seeing efficiency gains across the value-chain, while our enterprise deployment has enhanced how teams work. This collaboration with Anthropic augments human expertise to deliver life-changing medicines faster to patients worldwide.”
Emmanuel Frenehard, Chief Digital Officer, Sanofi

“AI in R&D works through an ecosystem. Anthropic brings the best technologies while prioritizing access, governance, and interoperability. Benchling is uniquely positioned to contribute. For over a decade, scientists have trusted us as their source of truth for experimental data and workflows. Now we're building AI that powers the next chapter of R&D.”
Ashu Singhal, Co-founder and President, Benchling

“Broad Institute scientists pursue the most ambitious questions in biology and medicine, creating tools to empower scientists everywhere. We're working with Manifold on Terra Powered by Manifold. AI agents built on Claude enable scientists to work at entirely new scale and efficiency, exploring scientific domains in previously impossible ways.”
Heather Jankins, Head of Data Science Platform, Broad Institute of MIT and Harvard

“Claude is foundational to AbbVie's operations. Our GAIA platform leverages Claude for regulatory document generation, ensuring accuracy at scale. GenAIsys empowers field teams with AI insights for healthcare professional engagement. By integrating Claude across workflows on AWS, we improve efficiency and interactions, accelerating mission to deliver innovative medicines to patients worldwide.”
Sarah Nam, VP of AI Strategy and Partnerships, AbbVie

“10x's single cell and spatial analysis capabilities traditionally required computational expertise. Now, with Claude, researchers perform analytical tasks (aligning reads, generating matrices, clustering, secondary analysis) through plain English conversation. This lowers the barrier for new users while scaling to meet the needs of advanced research teams.”
Serge Saxonov, Co-founder and CEO, 10x Genomics

“We see tremendous potential in Claude streamlining how we bring drugs to market. The ability to pull from clinical data sources and create GxP-compliant outputs will help us bring life-changing cancer therapies to patients faster while maintaining the highest quality standards. We see Claude powering AI applications across several major functions at our company.”
Hisham Hamadeh, Senior Vice President, Global Head of Data, Digital and AI, Genmab

“Healthcare analytics demands AI purpose-built for our industry's complexity and rigor. Komodo Health's partnership with Anthropic delivers transparent, auditable solutions designed for regulated healthcare environments. Together, we're enabling healthcare and life sciences teams to transform weeks-long analytical workflows into actionable intelligence in minutes.”
Arif Nathoo, MD, CEO and Co-founder, Komodo Health

“We've consistently been one of the first movers when it comes to document and content automation in pharma development. Our work with Anthropic and Claude has set a new standard: we're not just automating tasks, we're transforming how medicines get from discovery to the patients who need them.”
Louise Lind Skov, Director Content Digitalisation, Novo Nordisk

“Claude Code and partnership with Anthropic have been extremely valuable for developing Paper2Agent, our moonshot to transform passive research papers into interactive AI agents that can act as virtual corresponding authors and co-scientists.”
James Zou, Associate Professor, Stanford University

“At PwC, responsible AI is a trust imperative. We pair our deep sector insight with Claude's agentic intelligence to reimagine how clinical, regulatory, and commercial teams operate. Together, we're not just streamlining processes; we're elevating quality, accelerating discovery, and building systems where confidence scales alongside innovation.”
Matt Wood, US and Global Commercial Technology and Innovation Officer, PwC

“Claude Code has become a powerful accelerator for us at Schrödinger. For the projects where it fits best, Claude Code allows us to turn ideas into working code in minutes instead of hours, enabling us to move up to 10x faster in some cases. As we continue to work with Claude, we are excited to see how we can further transform the way we build and customize our software.”
Pat Lorton, EVP, Chief Technology Officer, and Chief Operating Officer, Schrödinger

“When creating an AI agent for bioinformatics analyses, we focused on three key factors: top software development, life sciences alignment, and startup support. We evaluated half a dozen platforms, and Claude was the standout leader. We're excited to continue this collaboration and bring cutting-edge AI agents into biotech research.”
Alfredo Andere, Co-Founder and CEO, Latch Bio

“At EvolutionaryScale, we’re building next-generation AI systems to model the living world. Anthropic’s frontier models accelerate our ability to reason about complex biological data and translate it into scientific insight, helping us push the boundaries of what’s possible in life science discovery.”
Sal Candido, Chief Technology Officer, EvolutionaryScale

“At Manifold, our mission is to power faster, leaner life sciences. Building with Claude has enabled us to develop AI agents that translate questions in the semantic space of scientists to execution in the technical space of specialized datasets and tools. Together, we’re transforming how life sciences R&D will happen in the years ahead.”
Sourav Dey PhD, Co-founder and Chief AI Officer, Manifold

“At FutureHouse, Claude helps power both our bioinformatics and literature analysis workflows. Claude is our model of choice for accurate figure analyses and orchestrating non-linear searches through the literature.”
Andrew White, Co-founder and Head of Science, FutureHouse

“Claude has been invaluable for Axiom as we build AI to predict drug toxicity. We've used billions of tokens in Claude Code for many PRs. Claude agents with MCP servers are core to our scientific work, directly querying databases to interpret, transform, and test data correlations, helping us identify the most useful features for predicting clinical drug toxicity.”
Alex Beatson, Co-founder, Axiom Bio

SUPPORTING THE LIFE SCIENCES

In addition to the updates described above, we are supporting life sciences research through our AI for Science [https://www.anthropic.com/news/ai-for-science-program] program. This program provides free API credits to support leading researchers working on high-impact scientific projects around the world. Our partnerships with these labs help us identify new applications for Claude, while helping scientists answer some of their most pressing questions. We continue to welcome submissions [https://docs.google.com/forms/d/e/1FAIpQLSfwDGfVg2lHJ0cc0oF_ilEnjvr_r4_paYi7VLlr5cLNXASdvA/viewform] for project ideas. Jonah Cool and Eric Kauderer-Abrams, who lead partnerships and R&D for Life Sciences at Anthropic, respectively, discuss this and other recent work below.

[Video: Anthropic’s Jonah Cool and Eric Kauderer-Abrams share their vision for making Claude the go-to AI research assistant for scientists with Claude for Life Sciences.]

GETTING STARTED

Claude for Life Sciences is available through Claude.com and on the AWS Marketplace, with Google Cloud Marketplace availability coming soon.

FOOTNOTES

1. Protocol QA score (multiple choice format) with 10 shot prompting. For more, see our Sonnet 4.5 System Card [https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf], pages 132-133.

Oct

2025

Claude for Life Sciences

Increasing the rate of scientific progress is a core part of Anthropic’s public benefit mission. We are focused on building the tools to allow researchers to make new discoveries – and eventually, to allow AI models to make these discoveries autonomously. Until recently, scientists typically used Claude for individual tasks, like writing code for statistical analysis or summarizing papers. Pharmaceutical companies and others in industry also use it for tasks across the rest of their business, like sales, to fund new research. Now, our goal is to make Claude capable of supporting the entire process, from early discovery through to translation and commercialization. To do this, we’re rolling out several improvements that aim to make Claude a better partner for those who work in the life sciences, including researchers, clinical coordinators, and regulatory affairs managers.

MAKING CLAUDE A BETTER RESEARCH PARTNER

First, we’ve improved Claude’s underlying performance. Our most capable model, Claude Sonnet 4.5, is significantly better than previous models at a range of life sciences tasks. For example, on Protocol QA, a benchmark that tests the model’s understanding and facility with laboratory protocols, Sonnet 4.5 scores 0.83, against a human baseline of 0.79, and Sonnet 4’s performance of 0.74 (see footnote 1). Sonnet 4.5 shows a similar improvement over its predecessor on BixBench, an evaluation that measures its performance on bioinformatics tasks. To make Claude more useful for scientific work, we’re now adding several new connectors [https://claude.com/partners/mcp] to scientific platforms, the ability to use Agent Skills, and life sciences-specific support in the form of a prompt library and dedicated support.

CONNECTING CLAUDE TO SCIENTIFIC TOOLS

Connectors [https://claude.ai/redirect/website.v1.7ecf2fc3-1f30-46a2-b757-10786ceef9f1/settings/connectors] allow Claude to access other platforms and tools directly.
We’re adding several new connectors that are designed to make it easier to use Claude for scientific discovery:

* Benchling gives Claude the ability to respond to scientists’ questions with links back to source experiments, notebooks, and records;
* BioRender connects Claude to its extensive library of vetted scientific figures, icons, and templates;
* PubMed provides access to millions of biomedical research articles and clinical studies;
* Scholar Gateway developed by Wiley provides access to authoritative, peer-reviewed scientific content within Claude to accelerate research discovery;
* Synapse.org allows scientists to share and analyze data together in public or private projects;
* 10x Genomics allows researchers to conduct single cell and spatial analysis in natural language.

These connectors add to our existing set, which includes general purpose tools like Google Workspace and Microsoft SharePoint, OneDrive, Outlook, and Teams. Claude can also already work directly with Databricks to provide analytics for large-scale bioinformatics research, and Snowflake to search through large datasets using natural language questions.

DEVELOPING SKILLS FOR CLAUDE

Last week, we released Agent Skills [https://www.anthropic.com/news/skills]: folders including instructions, scripts, and resources that Claude can use to improve how it performs specific tasks. Skills are a natural fit for scientific work, since they allow Claude to consistently and predictably follow specific protocols and procedures. We’re developing a number of scientific skills for Claude, beginning with single-cell-rna-qc. This skill performs quality control and filtering on single-cell RNA sequencing data, using scverse [https://scverse.org/] best practices.

[Image: Claude performs quality control on single-cell RNA-seq data.]

In addition to the skills we’re creating, scientists can build their own.
For more information and guidance on Skills, including setting up custom skills, see here [https://support.claude.com/en/articles/12512180-using-skills-in-claude].

USING CLAUDE FOR LIFE SCIENCES

Claude can be used for life sciences tasks like the following:

* Research, like literature reviews and developing hypotheses: Claude can cite and summarize biomedical literature and generate testable ideas based on what it finds. Watch how Claude analyzes data, conducts a literature review, dives into potentially novel insights, turns this analysis into a presentation, and puts the finishing touches on slides with a figure from BioRender.
* Generating protocols: With the Benchling connector, Claude can draft study protocols, standard operating procedures and consent documents.
* Bioinformatics and data analysis: Process and analyze genomic data with Claude Code. Claude can present its results in slides, docs [https://www.anthropic.com/news/create-files], or code notebook format.
* Clinical and regulatory compliance: Claude can draft and review regulatory submissions, and compile compliance data.

In addition, to help scientists get started quickly, we’re creating a library of prompts [https://support.claude.com/en/articles/12614768-getting-started-with-claude-for-life-sciences] that should elicit the best results on tasks like the above.

PARTNERSHIPS AND CUSTOMERS

We’re providing hands-on support from dedicated subject matter experts in our Applied AI and customer-facing teams. We’re also partnering with companies who specialize in helping organizations adopt AI for life sciences work. These include Caylent, Deloitte, KPMG, PwC, Quantium, Slalom, Tribe AI, and Turing, along with our cloud partners, AWS and Google Cloud.

Many of our existing customers and partners have already been using Claude for a broad range of real-world scientific tasks:

“Claude, paired with internal knowledge libraries, is integral to Sanofi's AI transformation and used by most Sanofians daily in our Concierge app. We're seeing efficiency gains across the value-chain, while our enterprise deployment has enhanced how teams work. This collaboration with Anthropic augments human expertise to deliver life-changing medicines faster to patients worldwide.”
Emmanuel Frenehard, Chief Digital Officer, Sanofi

“AI in R&D works through an ecosystem. Anthropic brings the best technologies while prioritizing access, governance, and interoperability. Benchling is uniquely positioned to contribute. For over a decade, scientists have trusted us as their source of truth for experimental data and workflows. Now we're building AI that powers the next chapter of R&D.”
Ashu Singhal, Co-founder and President, Benchling

“Broad Institute scientists pursue the most ambitious questions in biology and medicine, creating tools to empower scientists everywhere. We're working with Manifold on Terra Powered by Manifold. AI agents built on Claude enable scientists to work at entirely new scale and efficiency, exploring scientific domains in previously impossible ways.”
Heather Jankins, Head of Data Science Platform, Broad Institute of MIT and Harvard

“Claude is foundational to AbbVie's operations. Our GAIA platform leverages Claude for regulatory document generation, ensuring accuracy at scale. GenAIsys empowers field teams with AI insights for healthcare professional engagement. By integrating Claude across workflows on AWS, we improve efficiency and interactions, accelerating mission to deliver innovative medicines to patients worldwide.”
Sarah Nam, VP of AI Strategy and Partnerships, AbbVie

“10x's single cell and spatial analysis capabilities traditionally required computational expertise.
Now, with Claude, researchers perform analytical tasks (aligning reads, generating matrices, clustering, secondary analysis) through plain English conversation. This lowers the barrier for new users while scaling to meet the needs of advanced research teams.”
Serge Saxonov, Co-founder and CEO, 10x Genomics

“We see tremendous potential in Claude streamlining how we bring drugs to market. The ability to pull from clinical data sources and create GxP-compliant outputs will help us bring life-changing cancer therapies to patients faster while maintaining the highest quality standards. We see Claude powering AI applications across several major functions at our company.”
Hisham Hamadeh, Senior Vice President, Global Head of Data, Digital and AI, Genmab

“Healthcare analytics demands AI purpose-built for our industry's complexity and rigor. Komodo Health's partnership with Anthropic delivers transparent, auditable solutions designed for regulated healthcare environments. Together, we're enabling healthcare and life sciences teams to transform weeks-long analytical workflows into actionable intelligence in minutes.”
Arif Nathoo, MD, CEO and Co-founder, Komodo Health

“We've consistently been one of the first movers when it comes to document and content automation in pharma development. Our work with Anthropic and Claude has set a new standard: we're not just automating tasks, we're transforming how medicines get from discovery to the patients who need them.”
Louise Lind Skov, Director Content Digitalisation, Novo Nordisk

“Claude Code and partnership with Anthropic have been extremely valuable for developing Paper2Agent, our moonshot to transform passive research papers into interactive AI agents that can act as virtual corresponding authors and co-scientists.”
James Zou, Associate Professor, Stanford University

“At PwC, responsible AI is a trust imperative.
We pair our deep sector insight with Claude's agentic intelligence to reimagine how clinical, regulatory, and commercial teams operate. Together, we're not just streamlining processes; we're elevating quality, accelerating discovery, and building systems where confidence scales alongside innovation.”
Matt Wood, US and Global Commercial Technology and Innovation Officer, PwC

“Claude Code has become a powerful accelerator for us at Schrödinger. For the projects where it fits best, Claude Code allows us to turn ideas into working code in minutes instead of hours, enabling us to move up to 10x faster in some cases. As we continue to work with Claude, we are excited to see how we can further transform the way we build and customize our software.”
Pat Lorton, EVP, Chief Technology Officer, and Chief Operating Officer, Schrödinger

“When creating an AI agent for bioinformatics analyses, we focused on three key factors: top software development, life sciences alignment, and startup support. We evaluated half a dozen platforms, and Claude was the standout leader. We're excited to continue this collaboration and bring cutting-edge AI agents into biotech research.”
Alfredo Andere, Co-Founder and CEO, Latch Bio

“At EvolutionaryScale, we’re building next-generation AI systems to model the living world. Anthropic’s frontier models accelerate our ability to reason about complex biological data and translate it into scientific insight, helping us push the boundaries of what’s possible in life science discovery.”
Sal Candido, Chief Technology Officer, EvolutionaryScale

“At Manifold, our mission is to power faster, leaner life sciences. Building with Claude has enabled us to develop AI agents that translate questions in the semantic space of scientists to execution in the technical space of specialized datasets and tools. Together, we’re transforming how life sciences R&D will happen in the years ahead.”
Sourav Dey, PhD, Co-founder and Chief AI Officer, Manifold

“At FutureHouse, Claude helps power both our bioinformatics and literature analysis workflows. Claude is our model of choice for accurate figure analyses and orchestrating non-linear searches through the literature.”
Andrew White, Co-founder and Head of Science, FutureHouse

“Claude has been invaluable for Axiom as we build AI to predict drug toxicity. We've used billions of tokens in Claude Code for many PRs. Claude agents with MCP servers are core to our scientific work, directly querying databases to interpret, transform, and test data correlations, helping us identify the most useful features for predicting clinical drug toxicity.”
Alex Beatson, Co-founder, Axiom Bio

SUPPORTING THE LIFE SCIENCES

In addition to the updates described above, we’re supporting life sciences research through our AI for Science [https://www.anthropic.com/news/ai-for-science-program] program. This program provides free API credits to support leading researchers working on high-impact scientific projects around the world. Our partnerships with these labs help us identify new applications for Claude, while helping scientists answer some of their most pressing questions. We continue to welcome submissions [https://docs.google.com/forms/d/e/1FAIpQLSfwDGfVg2lHJ0cc0oF_ilEnjvr_r4_paYi7VLlr5cLNXASdvA/viewform] for project ideas.

Jonah Cool and Eric Kauderer-Abrams, who lead partnerships and R&D for Life Sciences at Anthropic, respectively, discuss this and other recent work below.

[Video: Anthropic’s Jonah Cool and Eric Kauderer-Abrams share their vision for making Claude the go-to AI research assistant for scientists with Claude for Life Sciences.]

GETTING STARTED

Claude for Life Sciences is available through Claude.com and on the AWS Marketplace, with Google Cloud Marketplace availability coming soon.

FOOTNOTES

1. Protocol QA score (multiple choice format) with 10-shot prompting. For more, see our Sonnet 4.5 System Card [https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf], pages 132-133.

Oct

2025

Claude Code on the web

Today, we're introducing Claude Code on the web, a new way to delegate coding tasks directly from your browser. Now in beta as a research preview, it lets you assign multiple coding tasks to Claude that run on Anthropic-managed cloud infrastructure, which is well suited to tackling bug backlogs, routine fixes, or parallel development work.

RUN CODING TASKS IN PARALLEL

Claude Code on the web lets you kick off coding sessions without opening your terminal. Connect your GitHub repositories, describe what you need, and Claude handles the implementation. Each session runs in its own isolated environment with real-time progress tracking, and you can actively steer Claude to adjust course as it’s working through tasks.

With Claude Code running in the cloud, you can now run multiple tasks in parallel across different repositories from a single interface and ship faster with automatic PR creation and clear change summaries.

FLEXIBLE FOR EVERY WORKFLOW

The web interface complements your existing Claude Code workflow. Running tasks in the cloud is especially effective for:

* Answering questions about how projects work and how repositories are mapped
* Bugfixes and routine, well-defined tasks
* Backend changes, where Claude Code can use test-driven development to verify changes

You can also use Claude Code on mobile. As part of this research preview, we’re making Claude Code available on our iOS app so developers can explore coding with Claude on the go. It’s an early preview, and we hope to quickly refine the mobile experience based on your feedback.

SECURITY-FIRST CLOUD EXECUTION

Every Claude Code task runs in an isolated sandbox environment with network and filesystem restrictions. Git interactions are handled through a secure proxy service that ensures Claude can only access authorized repositories, helping keep your code and credentials protected throughout the entire workflow.
You can also add custom network configuration to choose what domains Claude Code can connect to from its sandbox. For example, you can allow Claude to download npm packages over the internet so that it can run tests and validate changes.

Read our engineering blog [https://www.anthropic.com/engineering/claude-code-sandboxing] and documentation [https://docs.claude.com/en/docs/claude-code/sandboxing] for a deep dive on Claude Code’s sandboxing approach.

GETTING STARTED

Claude Code on the web is available now in research preview for Pro and Max users. Visit claude.com/code [http://claude.com/code] to connect your first repository and start delegating tasks. Cloud-based sessions share rate limits with all other Claude Code usage. Explore our documentation [https://docs.claude.com/en/docs/claude-code/claude-code-on-the-web] to learn more.
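The idea of choosing which domains a sandbox may reach can be illustrated with a small allowlist check. This is only a sketch of the general technique, not Claude Code's actual configuration format or enforcement mechanism, which the post does not describe. The function name `is_allowed` and the `ALLOWED` set are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: permit only the npm registry, as in the npm example above.
ALLOWED = {"registry.npmjs.org"}


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    for domain in allowed_domains:
        domain = domain.lower()
        # Match on whole labels so "npmjs.org.evil.test" does not match "npmjs.org".
        if host == domain or host.endswith("." + domain):
            return True
    return False


if __name__ == "__main__":
    for url in ("https://registry.npmjs.org/left-pad", "https://example.com/"):
        print(url, is_allowed(url, ALLOWED))
```

Comparing parsed hostnames rather than raw URL strings avoids simple bypasses such as placing an allowed domain in the path or as a prefix of an attacker-controlled host.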

Oct

2025

The Karpathy-Dwarkesh Interview delays AGI timelines

The recent AI news highlights the Karpathy interview as a major event, alongside significant discussions on reasoning improvements without reinforcement learning, with test-time sampling achieving GRPO-level performance. Critiques of context-window marketing point to effective limits near 64K tokens, with Claude Haiku 4.5 showing competitive reasoning speed. GPT-5 struggles with advanced math benchmarks, and data quality issues termed "Brain Rot" affect model reasoning and safety. In agent frameworks, Anthropic Skills enable modular coding workflows, OpenAI Codex IDE extensions enhance developer productivity, and HuggingChat Omni introduces meta-routing across 100+ open models using Arch-Router-1.5B. LangChain and LlamaIndex advance graph-first agent infrastructure, while Google Gemini integrates with Google Maps for real-world grounding.

Oct

2025

How a Gemma model helped discover a new potential cancer therapy pathway

We’re launching a new 27 billion parameter foundation model for single-cell analysis built on the Gemma family of open models.

Oct

2025

Claude Agent Skills - glorified AGENTS.md? or MCP killer?

Anthropic achieves a rare feat with back-to-back AI news headlines featuring Claude's new Skills, a novel way to build specialized agents using Markdown files, scripts, and metadata to handle tasks like creating and reading PDFs, Docs, and PPTs. Simon Willison calls this a "bigger deal than MCP," predicting a "Cambrian explosion in Skills." Meanwhile, Anthropic launches Claude Haiku 4.5 with strong reasoning and long-context capabilities, priced competitively. Other updates include OpenAI's ChatGPT memory management improvements, Windows 11 Copilot voice and vision features, and HuggingChat Omni routing across 115 open-source models from 15 providers. These developments highlight advances in agent skills, document processing, long-context reasoning, and multi-model routing.

Oct

2025

Introducing Veo 3.1 and advanced creative capabilities

We’re rolling out significant updates to Veo that give people even more creative control.

Oct

2025

Introducing Claude Skills

Claude can now use Skills to improve how it performs specific tasks. Skills are folders that include instructions, scripts, and resources that Claude can load when needed.

Claude will only access a skill when it's relevant to the task at hand. When used, skills make Claude better at specialized tasks like working with Excel or following your organization's brand guidelines.

You've already seen Skills at work in Claude apps, where Claude uses them to create files like spreadsheets and presentations. Now, you can build your own skills and use them across Claude apps, Claude Code, and our API.

HOW SKILLS WORK

While working on tasks, Claude scans available skills to find relevant matches. When one matches, it loads only the minimal information and files needed, keeping Claude fast while accessing specialized expertise.

Skills are:

* Composable: Skills stack together. Claude automatically identifies which skills are needed and coordinates their use.
* Portable: Skills use the same format everywhere. Build once, use across Claude apps, Claude Code, and API.
* Efficient: Only loads what's needed, when it's needed.
* Powerful: Skills can include executable code for tasks where traditional programming is more reliable than token generation.

Think of Skills as custom onboarding materials that let you package expertise, making Claude a specialist on what matters most to you. For a technical deep-dive on the Agent Skills design pattern, architecture, and development best practices, read our engineering blog [https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills].

SKILLS WORK WITH EVERY CLAUDE PRODUCT

CLAUDE APPS

Skills are available to Pro, Max, Team and Enterprise users. We provide skills for common tasks like document creation, examples you can customize, and the ability to create your own custom skills.

[Image: The Skills capabilities interface in Claude.ai with example Skills toggled on.]
Claude automatically invokes relevant skills based on your task, with no manual selection needed. You'll even see skills in Claude's chain of thought as it works.

Creating skills is simple. The "skill-creator" skill provides interactive guidance: Claude asks about your workflow, generates the folder structure, formats the SKILL.md file, and bundles the resources you need. No manual file editing required.

Enable Skills in Settings [https://preview.claude.ai/redirect/website.v1.48270ee2-c55d-447b-bde7-ac6112c05ba8/settings/features]. For Team and Enterprise users, admins must first enable Skills organization-wide.

CLAUDE DEVELOPER PLATFORM (API)

Agent Skills, which we often refer to simply as Skills, can now be added to Messages API requests, and the new /v1/skills endpoint gives developers programmatic control over custom skill versioning and management. Skills require the Code Execution Tool [https://docs.claude.com/en/docs/agents-and-tools/tool-use/code-execution-tool] beta, which provides the secure environment they need to run.

Use Anthropic-created skills to have Claude read and generate professional Excel spreadsheets with formulas, PowerPoint presentations, Word documents, and fillable PDFs. Developers can create custom Skills to extend Claude's capabilities for their specific use cases.

Developers can also easily create, view, and upgrade skill versions through the Claude Console.

Explore the documentation [https://docs.claude.com/en/docs/agents-and-tools/agent-skills/overview] or Anthropic Academy [https://www.anthropic.com/learn/build-with-claude] to learn more.

“Skills teaches Claude how to work with Box content. Users can transform stored files into PowerPoint presentations, Excel spreadsheets, and Word documents that follow their organization's standards, saving hours of effort.”
Yashodha Bhavnani, Head of AI, Box

“Skills streamline our management accounting and finance workflows. Claude processes multiple spreadsheets, catches critical anomalies, and generates reports using our procedures. What once took a day, we can now accomplish in an hour.”
MJ Felix, Product Manager, Notion

“Canva plans to leverage Skills to customize agents and expand what they can do. This unlocks new ways to bring Canva deeper into agentic workflows, helping teams capture their unique context and create stunning, high-quality designs effortlessly.”
Anwar Haneef, GM & Head of Ecosystem, Canva

“Skills streamline our management accounting and finance workflows. Claude processes multiple spreadsheets, catches critical anomalies, and generates reports using our procedures. What once took a day, we can now accomplish in an hour.”
Yusuke Kaji, General Manager AI, Rakuten

CLAUDE CODE

Skills extend Claude Code with your team's expertise and workflows. Install skills via plugins from the anthropics/skills marketplace. Claude loads them automatically when relevant. Share skills through version control with your team. You can also manually install skills by adding them to ~/.claude/skills. The Claude Agent SDK provides the same Agent Skills support for building custom agents.
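As a rough illustration of the folder-based format and the "load only what's needed" behavior described above, the sketch below writes one skill folder containing a SKILL.md file, reads only its short metadata header during discovery, and loads the full instructions on demand. Several details are assumptions: the `name` and `description` front-matter fields, the hypothetical `brand-guidelines` skill, and the use of a temporary directory standing in for ~/.claude/skills. How Claude actually decides which skill is relevant is not reproduced here.

```python
import tempfile
from pathlib import Path


def read_metadata(skill_md):
    """Read only the `key: value` header between the opening and closing --- lines."""
    meta = {}
    with skill_md.open(encoding="utf-8") as fh:
        if fh.readline().strip() != "---":
            return meta
        for line in fh:
            if line.strip() == "---":
                break  # stop before the instruction body
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip()] = value.strip()
    return meta


def discover_skills(root):
    """Map skill name -> description and path, using metadata headers only."""
    skills = {}
    for skill_md in sorted(Path(root).glob("*/SKILL.md")):
        meta = read_metadata(skill_md)
        if "name" in meta and "description" in meta:
            skills[meta["name"]] = {"description": meta["description"], "path": skill_md}
    return skills


def load_body(skill_md):
    """Load the full instructions, which happens only once a skill is selected."""
    parts = skill_md.read_text(encoding="utf-8").split("---", 2)
    return parts[2].strip() if len(parts) == 3 else parts[0].strip()


def demo():
    """Create a hypothetical skill in a temp dir (a stand-in for ~/.claude/skills)."""
    with tempfile.TemporaryDirectory() as tmp:
        skill_dir = Path(tmp) / "brand-guidelines"
        skill_dir.mkdir(parents=True)
        (skill_dir / "SKILL.md").write_text(
            "---\n"
            "name: brand-guidelines\n"
            "description: Apply the company brand guidelines to documents\n"
            "---\n"
            "# Instructions\n"
            "Use the approved color palette.\n",
            encoding="utf-8",
        )
        skills = discover_skills(tmp)
        entry = skills["brand-guidelines"]
        return entry["description"], load_body(entry["path"])


if __name__ == "__main__":
    print(demo())
```

Keeping the description in a small header is what makes scanning many skills cheap: the longer instructions and any bundled scripts are read only for the skill that ends up being used.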
GETTING STARTED

* Claude apps: User Guide [https://support.claude.com/en/articles/12580051-teach-claude-your-way-of-working-using-skills] & Help Center [https://support.claude.com/en/articles/12512176-what-are-skills]
* API developers: Documentation [https://docs.claude.com/en/api/skills-guide]
* Claude Code: Documentation [https://docs.claude.com/en/docs/claude-code/skills]
* Example Skills to customize: GitHub repository [https://github.com/anthropics/skills]

WHAT'S NEXT

We're working toward simplified skill creation workflows and enterprise-wide deployment capabilities, making it easier for organizations to distribute skills across teams.

Keep in mind, this feature gives Claude access to execute code. While powerful, it means being mindful about which skills you use: stick to trusted sources to keep your data safe. Learn more [https://support.claude.com/en/articles/12512176-what-are-skills].