Been thinking about whether more training/compute will get us to AGI, or if we need a fundamentally different architecture. I'm convinced it's the latter.
Current transformer architecture is a glorified pattern matcher. It was literally created to translate languages. We've scaled it up, added RLHF, made it chat — but at its core, it's still doing statistical pattern matching over sequences.
When Ramanujan came up with his formulas, when Gödel proved incompleteness, when Cantor invented set theory — these weren't in any training distribution. There was no historical precedent to pattern-match against. These required *seeing structure that didn't exist yet*.
LLMs can interpolate brilliantly within their training data. They cannot extrapolate to genuinely novel structures. That's the difference between pattern matching and understanding.
If I ask an LLM for business ideas, it'll suggest things that match my statistical profile — I'm a tech professional, so it'll say SaaS, consulting, AI tools. Plumbing? Probably not on the list.
But I'm a general-purpose agent. I can decide tomorrow to learn plumbing and start a plumbing business. The LLM sees the shadow of who I've been. I have access to the space of who I could become.
LLMs reason over P(outcome | observable profile). Humans reason over possibility space, not probability space. Completely different.
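A toy sketch of the two modes (all numbers and options invented for illustration; this says nothing about real LLM internals):

```python
# Probability-space reasoning: rank options by P(option | observed profile).
profile_conditioned = {      # invented conditionals for a "tech professional"
    "SaaS product": 0.45,
    "AI consulting": 0.35,
    "dev tools": 0.15,
    "plumbing business": 0.05,
}
best_match = max(profile_conditioned, key=profile_conditioned.get)

# Possibility-space reasoning: any option is reachable if the agent can
# acquire the missing skills, so rank reachable options by payoff instead.
payoff = {"SaaS product": 3, "AI consulting": 2,
          "dev tools": 2, "plumbing business": 5}
best_choice = max(payoff, key=payoff.get)

print(best_match)   # SaaS product      (the shadow of who you've been)
print(best_choice)  # plumbing business (the space of who you could become)
```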
We need architectures that can:
- Build causal models of the world (not just statistical associations)
- Learn from minimal examples (a kid learns "dog" from 3 examples, not millions)
- Reason about novel structures that don't exist in training data
- Model agency — the ability of entities to change themselves
Scaling transformers won't get us there. It's like building a really good horse and hoping it becomes a car.
Curious what others think. Am I missing something, or is the current hype around scaling fundamentally misguided?
--- TOP COMMENTS ---
Transformers =/= LLMs. People really need to stop using them as synonyms. Transformers aren't just used in LLMs. They're used in many DL models, including non-generative models like prediction-based world models. Not all LLMs are autoregressive models that are pre-trained on next-token prediction. Diffusion LLMs exist, as an example. And most also use transformers, with different objectives in pre-training.
So yes, most auto-regressive (and most non-auto-regressive) LLMs use transformers. So do world models like VL-JEPA (https://arxiv.org/abs/2512.10942). So do encoder-only models pre-trained on masked-token prediction and/or next-sentence prediction.
Human-like AI seems to be on a trajectory to arise from a combination of deep learning, reinforcement learning (RLHF isn't the only RL being done), and maybe some very flexible symbolic system(s). Maybe something else is needed as well, like embodiment. We have no certain idea. Transformers are useful architectures in a variety of DL applications because they are defined by the self-attention mechanism, and they'll probably have a place in that, unless they're superseded by a better-performing and more robust NN architecture by the time human-like AI is achieved.
As it is now, almost all world models (which try to capture the causal relationships you're suggesting are necessary) have transformer-based components within a greater system (again, note the VL-JEPA example, which has transformer-based components).
So my background is in semiconductor manufacture.
I won't claim ANY knowledge outside of hardware, and hardware systems. You can debate other folks for that.
I absolutely agree with you. Just a different road to get there.
I've SEEN the basic semiconductor pattern. Checked it for fidelity. Operated the Test machines that punch out bad sectors.
...it's not JUST pattern matching. It's pattern matching AND computing in arrays, AND transfer protocols for fidelity... because at its core it's just pattern matching, and doing fancy things with it requires... well... scaling up.
Pattern matching stacked on pattern matching.
We've been doing it for a while.
So when LLMs started coming onto the scene... it seemed clear to me that it wasn't going to go all the way to AGI. I'd argue that (and I know people take umbrage with this term) we won't achieve "true AI", much less AGI, without a new architecture.
And I'd say the data and patterns across companies across the world supports that.
Everyone talks about "AI chips", which frankly are just Commercial Research chips, best I can tell. Large arrays of potential processing without any of the more specific architecture. High-fidelity chips can be sold at 100x their normal cost, and they're utterly useless for normal products... they're just 'liquid computing power', but have to be programmed on a very fundamental level.
So while I'm guessing there is more nuance to it than when I was in the Business, these aren't new. We used to call them "Super Computers"... that couldn't run an OS or program to save their lives.
Pure computing.
But circling back to LLMs, they aren't, are they?
They're running on several layers of programs and UI, made to be user-friendly.
I can only conclude those chips aren't for RUNNING the LLMs, but are for backbone hardware.... or my theory...?
Iteration.
Hypothetical - You are convinced AI is possible. The race has started. You've done what you can with traditional computing... and now you're at the Polish stage. LLMs are... pretty damn nice. As nice as they are likely to get. Now it's efficient, etc.
.... but you still haven't reached AI-level.
So how do you come up with a 5-year or 10-year plan?
Because THAT is the time scale the companies that manufacture chips operate on.
Even if you HAVE an architecture and manufacturing plan, it takes months to run one single process.... and months or years to dial in the machines.
Usually we would just say 'a decade' from concept to finishing a manufacturing run for a client.
I'm told it has been reduced to 6-7 years. That sounds plausible.
So how do you, an AI startup with big dreams and an LLM that is successful... put in a novel order for a new architecture of chips?
Well.... you don't.
That's where it gets tricky.
Mostly we iterate architecture. Not create it from scratch. So Intel has a new processor coming out every year or two for the next 20 years, and that's already in the manufacturing process. Some are just to test or dial-in the recipe at a given fab. Some are for outside testing.
The whole chip industry exists in this slow, steady, creeping crawl.
You don't just... put in an order for an AI chip, or a new architecture. Hell, you don't even make one... you generally request that THEY make it for you. And they guard the architecture jealously.
So something this big, this quick, with this short of a turnaround?
You'd need a stupendous amount of power, leverage, and money to just... make a new architecture, and have it manufactured for you. Tens or hundreds of billions of dollars, and years of time. Real floor-sagging, earth-shattering amounts of time and money.
Only thing I can think of that could do something like that.... would be those megawatt Data Centers that we've been using for Cloud Computing, but we keep talking about using for AI research.
They never quite say WHAT exactly they're using it for.
Always gets real vague as soon as we're talking about hardware.
... and since software and programming haven't created the Singularity despite an almost unheard-of amount of human effort poured into it...?
That leaves the hardware.
The architecture.
Iterate the LLMs to keep people engaged, interested, and investing. You'll need it later, so this isn't some big loss, or just spinning wheels.
Court a chip manufacturer, front them an absurd amount of money for Commercial chips for traditional 'supercomputing', and feed them as much power as possible.
Bend every resource to turn a 10-year architecture project into a couple of 6-year projects, with maybe a 3-year overlap.
That gives you 3 years to juggle LLMs.
Then 3 years for early proto-AI on your new architecture.
Then 3 years for refinements and your first commercially viable, reasonably efficient version.
... and by then the software side is almost a decade old, and folks are chomping at the bit to get working on it.
..... so yeah.
........ architecture.
Not just because they want to, or have to, but because most people don't respect the amount of time and effort that goes into the chips we use for... everything. And making new ones for ANY reason, much less a novel AI memory or processing version....
.... it just takes time and money. Lots of it.
Disclaimer - Am not a Doctor, but I do play one on TV, and I stayed in a Holiday Inn.
Companies
OpenAI hired the OpenClaw creator. The military used Claude in the Venezuela raid. The Pentagon may drop Anthropic's $200M contract. Disney accused ByteDance of an IP 'smash-and-grab.' (15 Feb 2026 recap)
Here are the most important news items from the past two days:
OpenAI hires OpenClaw creator Peter Steinberger to build 'next-generation' AI agents
OpenAI hired Peter Steinberger, creator of the viral AI agent OpenClaw, to lead development of next-generation personal agents. Sam Altman called him "a genius" and said multi-agent capabilities will "quickly become core to our product offerings."
OpenClaw will move to an independent foundation while remaining open source with OpenAI support. Steinberger chose OpenAI over starting a company, saying "what I want is to change the world, not build a large company." His goal: an agent "even my mum can use."
The hire comes with baggage: security researchers found thousands of exposed OpenClaw instances vulnerable to remote code execution and dozens of malicious skills on its marketplace containing keyloggers and credential stealers. (read the full story)
US military used Anthropic's Claude in the operation to capture Venezuela's Maduro
The Pentagon deployed Claude during the January 3rd raid on Nicolás Maduro's fortified palace in Caracas, through Anthropic's partnership with Palantir. Delta Force commandos used the AI during the active operation—not just in planning. People were shot during the breach.
An Anthropic executive reached out to Palantir afterward to ask whether Claude had been used, "in a way to imply that they might disapprove of their software being used, because obviously there was kinetic fire during that raid." Claude was the first AI model the Pentagon brought into its classified networks. The revelation has intensified a growing rift between the "safety-first" AI lab and its biggest government client. (source)
Pentagon threatens to drop Anthropic's $200M contract over military AI limits
The Pentagon is considering severing its relationship with Anthropic because the company won't remove all restrictions on military use of Claude. The Defense Department is pushing four AI labs—OpenAI, Google, xAI, and Anthropic—to allow "all lawful purposes," including weapons development and intelligence collection. OpenAI, Google, and xAI agreed to lift their guardrails. Anthropic refused.
Anthropic insists two areas remain off limits: mass surveillance of Americans and fully autonomous weapons. The contract, signed last summer, is valued up to $200M. Internally, Anthropic engineers are uneasy about Pentagon work. The standoff puts the company's safety brand directly against its biggest government revenue stream. (source)
ByteDance pledges Seedance 2.0 safeguards after Disney cease-and-desist
ByteDance said it will strengthen safeguards on its video generation tool Seedance 2.0 after Disney and Paramount sent cease-and-desist letters. Disney accused ByteDance of a "virtual smash-and-grab" of its IP, claiming the model ships with "a pirated library" of Star Wars and Marvel characters. The Motion Picture Association and SAG-AFTRA also condemned the tool.
Disney's response reveals selective enforcement: it sued ByteDance but struck a deal when OpenAI's Sora produced similar content. The difference? Geopolitics. Chinese-owned ByteDance gets the lawsuit; American OpenAI gets a licensing agreement. (source)
Other important stories
Read more stories like these at 7min.ai. (Disclaimer: I'm the website's creator)
--- TOP COMMENTS --- "what I want is to change the world, not build a large company."
ah nice, the elon musk syndrome, we need more of those kids
Open Claw has been properly captured by the same networks who were affiliated with Epstein. No surprise. It seems that anyone who makes anything of value to the larger goal gets bought out and brought into the fold. It's how you neutralize future threats. Independent businesses. Pay attention and understand the gravity of what it means to do stuff like this. Sure, you get a lot of money, but you lose any chance of doing something genuinely remarkable for humanity without any hidden costs.
Anthropic’s Moral Stand: Pentagon warns Anthropic will “Pay a Price” as feud escalates
Axios frames this as an ethics clash, with Anthropic reportedly trying to block uses like large scale surveillance and fully autonomous weapons while the Pentagon pushes for access for "all lawful purposes." If procurement can punish a lab for insisting on guardrails by calling it a "supply chain risk," that creates a race to the bottom on safety norms. Where should the ethical line be drawn, and who should get to draw it?
Source: https://www.axios.com/2026/02/16/anthropic-defense-department-relationship-hegseth
--- TOP COMMENTS --- Biggest endorsement of Anthropic so far
Something something free market..
OpenAI Quietly Deletes Core Safety and Profit Pledges
OpenAI Quietly Removes "safely" and "no financial motive" from official mission
Old IRS 990:
“build AI that safely benefits humanity, unconstrained by need to generate financial return”
New IRS 990:
“ensure AGI benefits all of humanity”
--- TOP COMMENTS --- It’s like when Google removed “don’t be evil” from their charter
Google doesn't love us anymore.
It's been about 125 years of AI since the last Gemma; Google doesn't love us anymore and has abandoned us to Qwen's rational models. I miss the creativity of the Gemmas, and also their really useful sizes.
Don't abandon us, Mommy Google, give us Gemma 4!
--- TOP COMMENTS --- Demis Hassabis is coming to my college tomorrow. I'm going to ask about Gemma 4 in the Q&A session. Let's see.
Never did, never will. They do love our data.
Acquisitions
Sam: “love the spirit of OpenClaw” → days later OpenAI brings the creator in 👀
On TBPN a few days ago, Sam said, “I love the spirit of everything about OpenClaw,” highlighting how a one-person open-source agent can ship faster than big companies weighed down by risk and compliance. He also hinted a “mass market version” would follow.
Now OpenClaw’s creator, Peter Steinberger, is joining OpenAI to work on next-gen personal agents. OpenClaw moves into a foundation as open source, with OpenAI supporting it.
Interesting timing.
Either this signals multi-agent orchestration is the next platform shift and OpenAI is embracing the OSS energy — or we’re watching the scrappy agent ecosystem get folded into something more structured.
Where does OpenClaw land?
At least now they have money to pay the bills!
--- TOP COMMENTS --- Isn't it a security nightmare right now?
Also, he said it was vibecoded with OpenAI Codex when he was asked if he used Claude Code.
Perhaps he initially named it ClawdBot in the hope that Anthropic would acquire it and rename it to ClaudeBot or something, but it seems Anthropic didn't like that, so he renamed it to OpenClaw.
Related Coverage
openclaw
OpenAI just hired the OpenClaw creator
Open Source
Qwen 3.5 Open Source: Native Multimodal, Ultimate Efficiency!
Happy New Year, everyone! Our latest generation native multimodal model, Qwen3.5-397B-A17B, is now officially open source!
--- TOP COMMENTS --- Very excited for it! Native multimodal, optional thinking, Qwen Next architecture: this model is really what we in Germany would call the "Eierlegende Wollmilchsau" ("egg-laying wool-milk-sow"), the model that does it all. Looking great so far, and happy new year to our Chinese friends.
Happy New Year!
Qwen 3.5 will be released today
Sources reveal that Alibaba will open-source its next-generation large model, Qwen3.5, tonight on Lunar New Year's Eve. The model reportedly features a comprehensive innovation in its architecture.
https://x.com/Sino_Market/status/2023218866370068561?s=20
--- TOP COMMENTS --- When release drops, please share: base vs instruct sizes, tokenizer changes, context length, license, and first-party inference settings.
For local users, the fastest sanity check is same prompt set across v3 vs 3.5 with fixed temp + seed, then log TTFT / toks-per-sec / pass@1 on a small coding+math slice.
Raw leaderboard numbers without decode settings are hard to trust.
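A minimal sketch of that sanity check, assuming a local OpenAI-compatible server (the endpoint, model names, and chunk-per-token counting are all rough placeholders, not any vendor's actual setup):

```python
import time
import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # any OpenAI-compatible server
PROMPTS = ["Write a Python function that reverses a linked list."]

def bench(model: str) -> None:
    for prompt in PROMPTS:
        t0 = time.time()
        r = requests.post(ENDPOINT, json={
            "model": model, "prompt": prompt, "stream": True,
            "temperature": 0.0, "seed": 42, "max_tokens": 256,
        }, stream=True)
        ttft, n_chunks = None, 0
        for line in r.iter_lines():
            if not line:
                continue
            if ttft is None:
                ttft = time.time() - t0  # time to first streamed chunk
            n_chunks += 1                # rough proxy: one chunk per token
        total = time.time() - t0
        print(f"{model}: TTFT={ttft:.2f}s, ~{n_chunks / total:.1f} tok/s")

for m in ["qwen3-32b", "qwen3.5-35b-a3b"]:  # hypothetical local model names
    bench(m)
```

pass@1 on a small coding/math slice then just means running each task once at these fixed settings and checking the answer programmatically.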
I love their qwen 3 8B and still use it to this day. I hope they give us a good updated model in that range so I can start using it :)
Related Coverage
Qwen 3.5
Difference Between QWEN 3 Max-Thinking and QWEN 3.5 on a Spatial Reasoning Benchmark (MineBench)
Honestly it's quite an insane improvement, QWEN 3.5 even had some builds that were closer to (if not better than) Opus 4.6/GPT-5.2/Gemini 3 Pro.
Benchmark: https://minebench.ai/
Git Repository: https://github.com/Ammaar-Alam/minebench
Previous post comparing Opus 4.5 and 4.6, also answered some questions about the benchmark
Previous post comparing Opus 4.6 and GPT-5.2 Pro
(Disclaimer: This is a benchmark I made, so technically self-promotion, but I thought it was a cool comparison :)
--- TOP COMMENTS --- I can feel this. My initial impression for Qwen 3.5 (incl. VL) is it's extremely impressive for a hybrid linear-linear-linear-full attention model, and except a few hiccups, it is almost competitive with some of the frontier models in terms of robustness. Maybe not as good for agentic use (which I did not test) as its output does not smell of forced mini-CoT post-training common for "agentic-maxxed" models.
Hiccups I see:
BTW, this Plus vs. open-source thing is confusing. I tested those models in a direct Alibaba Cloud account and there is no clear explanation of the differences between them. I assume Plus is the open-source model + context extended to 1M + some tool calling enabled by default. It has a search function in Alibaba Cloud, btw.
This is the kind of self promotion the sub needs. It's a good benchmark.
Qwen3.5-397B-A17B is out!!
https://huggingface.co/Qwen/Qwen3.5-397B-A17B
--- TOP COMMENTS --- Also the gguf https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF
Finally! Happy new year!
Related Coverage
Qwen 3.5 Plus(397b-a17b) is now available on Chinese Qwen APP
Qwen3.5-397B-A17B will be open source!
Qwen3.5-397B-A17B Unsloth GGUFs
Where are Qwen 3.5 2B, 9B, and 35B-A3B
Where did the leakers go?
--- TOP COMMENTS --- Several Qwen team members have stated on Chinese social media that they plan to release smaller versions one by one before the end of the "Spring Festival holiday", which ends on February 23 (Beijing time)
Would love a Qwen3.5-Coder-Next 👌😅
[P] eqx-learn: Classical machine learning using JAX and Equinox
Hello everyone!
I am writing here to share a library I am currently developing for research use that filled a niche for me in the Equinox/JAX ecosystem: eqx-learn.
I am using Equinox as the foundation for my radio-frequency modelling library ParamRF, and I have absolutely loved the mixed OO/functional style. However, for my research, I require classical ML models (specifically PCA and Gaussian Process Regression), but could not find an Equinox-native library in the ecosystem that was as straight-forward and consistent as scikit-learn.
eqx-learn aims to address this, with a JAX-based take on the scikit-learn API. All models in the library are ultimately Equinox Modules, and can be fit using the library's free "fit" function. The design is such that models simply "advertise" their capabilities by implementing specific methods (e.g. solve(X, y), condition(X, y), loss()), and the "fit" function then fits/trains the model accordingly. I believe that this de-coupling of capabilities vs fitting algorithm fits the JAX style better, and also has lots of potential.
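To make the pattern concrete, here's a toy sketch of that design. The LinearRegression class and the dispatch logic are my own illustration of the idea, not eqx-learn's actual API:

```python
import equinox as eqx
import jax
import jax.numpy as jnp

class LinearRegression(eqx.Module):
    coef: jax.Array | None = None

    def solve(self, X, y):
        # Closed-form least squares. Modules are immutable pytrees, so
        # "fitting" returns a new module with the coefficients filled in.
        coef, *_ = jnp.linalg.lstsq(X, y)
        return eqx.tree_at(lambda m: m.coef, self, coef,
                           is_leaf=lambda x: x is None)

    def predict(self, X):
        return X @ self.coef

def fit(model, X, y):
    # The free fit function dispatches on whatever capability the model
    # "advertises" by implementing the corresponding method.
    if hasattr(model, "solve"):      # direct / closed-form solvers
        return model.solve(X, y)
    if hasattr(model, "condition"):  # e.g. Gaussian process conditioning
        return model.condition(X, y)
    raise TypeError("model advertises no supported fitting capability")

X = jnp.ones((10, 3))
y = jnp.arange(10.0)
fitted = fit(LinearRegression(), X, y)
print(fitted.predict(X).shape)  # (10,)
```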
At the moment, eqx-learn addresses all my research needs, but I thought it may be useful to share the library online to advertise that it exists, and mention that I am happy to accept PRs for additional models and fitting algorithms!
Although there are no docs, there are short examples in the repo :).
Happy coding!
Cheers, Gary
4 of the top 5 most used models on OpenRouter this week are Open Source!
Related Coverage
Developer Tools
[D] We found 18K+ exposed OpenClaw instances and ~15% of community skills contain malicious instructions
Throwaway because I work in security and don't want this tied to my main.
A few colleagues and I have been poking at autonomous agent frameworks as a side project, mostly out of morbid curiosity after seeing OpenClaw blow up (165K GitHub stars, 60K Discord members, 230K followers on X, 700+ community skills). What we found genuinely alarmed us.
We identified over 18,000 OpenClaw instances exposed directly to the public internet. But the scarier part: when we audited community built skills, nearly 15% contained what we'd classify as malicious instructions. We're talking prompts designed to download malware, exfiltrate sensitive data, or steal credentials. And there's this frustrating pattern where malicious skills get flagged, removed, then reappear under new identities within days. It's endless.
The attack surface here is qualitatively different from traditional software vulnerabilities and I don't think the ML community has fully internalized this. These agents have delegated authority over local files, browsers, and messaging platforms (WhatsApp, Slack, Discord, Telegram). A single compromised skill doesn't just affect the skill's functionality; it potentially compromises everything the agent can touch. Attackers don't need to target you directly anymore, they target the agent and inherit its permissions.
Prompt injection is the obvious vector everyone talks about, but the supply chain risk from community skills is what's actually keeping me up at night. Unlike npm packages or PyPI modules where there's at least some security tooling and community review norms, agent skills are essentially unreviewed prompt bundles with execution capabilities. The OpenClaw FAQ itself acknowledges this is a "Faustian bargain" with no "perfectly safe" setup. At least they're honest about it, but adoption is outpacing any reasonable security review.
There's also this failure mode we've been calling "judgment hallucination" internally. Users anthropomorphize these systems and over delegate authority because the agent appears to reason competently. I've watched colleagues give these things access to their entire digital lives because "it seems smart." The trust calibration problem is severe and I don't see anyone working on it seriously.
I've been digging around for any standardized approach to evaluating agent security posture. Found some scattered resources like OWASP's LLM guidelines, a few academic papers on prompt injection taxonomies, and stumbled across something called Agent Trust Hub that's trying to catalog these risks. But honestly the whole space feels fragmented. We're building the plane while flying it and nobody agrees on what the instruments should even measure.
Seriously though, has anyone here audited other agent frameworks like AutoGPT or BabyAGI for similar issues? And for those running agents in production, what does your threat model actually look like? I'm curious whether people are treating these as trusted code execution environments or sandboxing them properly.
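For anyone who wants a starting point, here's a crude sketch of a static triage pass over a directory of skill files. The patterns and file layout are assumptions for illustration (they don't match any framework's actual format), and static scanning is only a first filter, not a real audit:

```python
import re
from pathlib import Path

# Heuristic red flags for prompt bundles with execution capabilities.
SUSPICIOUS = [
    r"curl\s+\S+\s*\|\s*(ba)?sh",             # pipe-to-shell downloads
    r"(api[_-]?key|secret|token)\S*\s*[:=]",  # credential-harvesting hints
    r"base64\s+(-d|--decode)",                # obfuscated payloads
    r"\b(exfiltrate|keylog|keylogger)\b",
]

def triage(skill_dir: str) -> list[tuple[str, str, str]]:
    hits = []
    for path in Path(skill_dir).rglob("*.md"):
        text = path.read_text(errors="ignore")
        for pat in SUSPICIOUS:
            for m in re.finditer(pat, text, re.IGNORECASE):
                hits.append((str(path), pat, m.group(0)[:60]))
    return hits

for path, pat, snippet in triage("community_skills/"):
    print(f"{path}: /{pat}/ -> {snippet!r}")
```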
--- TOP COMMENTS --- This is such an important topic.
In the last weeks I have worked hard on building just the solution to this and would love to share it with you here.
It's called "just don't use OpenClaw".
Is this a different throwaway from the one that posted this last week?
How I structure Claude Code projects (CLAUDE.md, Skills, MCP)
I’ve been using Claude Code more seriously over the past months, and a few workflow shifts made a big difference for me.
The first one was starting in plan mode instead of execution.
When I write the goal clearly and let Claude break it into steps first, I catch gaps early. Reviewing the plan before running anything saves time. It feels slower for a minute, but the end result is cleaner and needs fewer edits.
Another big improvement came from using a CLAUDE.md file properly. Treat it as long-term project memory.
Include:
Once this file is solid, you stop repeating context. Outputs become more consistent across sessions.
Skills are also powerful if you work on recurring tasks.
If you often ask Claude to:
You can package that logic once and reuse it. That removes friction and keeps quality stable.
MCP is another layer worth exploring.
Connecting Claude to tools like GitHub, Notion, or even local CLI scripts changes how you think about it. Instead of copying data back and forth, you operate across tools directly from the terminal. That’s when automation starts to feel practical.
For me, the biggest mindset shift was this:
Claude Code works best when you design small systems around it, not isolated prompts.
I’m curious how others here are structuring their setup.
Are you using project memory heavily?
Are you building reusable Skills?
Or mostly running one-off tasks?
Would love to learn how others are approaching it.
--- TOP COMMENTS --- Totally agree on CLAUDE.md being the biggest unlock. One thing that made a huge difference for me was putting tool preferences in there, not just coding conventions. Stuff like "use bun instead of node" or "prefer Bun.serve over express" so it stops reaching for the wrong defaults every session.
For Skills, the pattern I landed on is keeping a SKILL.md in each skill folder with the full instructions and CLI commands it needs. That way when you invoke it, it has all the context without me re-explaining anything. Feels like building little specialized workers.
On plan mode - I used to skip it for "simple" tasks and regretted it almost every time. Now I default to plan mode for anything touching more than 2 files. The 30 seconds reviewing a plan saves 10 minutes of undoing wrong assumptions.
One tip on MCP: the filesystem and git MCP servers are handy but I found the biggest win was wiring up project-specific APIs. Like connecting it to your deploy pipeline or database so you can say "check the prod logs for errors" without copy-pasting.
The mindset shift you described is spot on. It works way better as a system than as individual prompts.
Full step-by-step Claude Code walkthrough (CLI, CLAUDE.md, Skills, Hooks, MCP, GitHub workflows): https://youtube.com/playlist?list=PL-F5kYFVRcIvZQ_LEbdLIZrohgbf-Vock&si=EwcH5T7Y3orPTeHw
Built a 37K-line photo analysis engine with Claude Code — scores, tags, and ranks your entire photo library
What it is:
Facet is a free, open-source Python tool that analyzes your photo library using multiple vision models and serves a web gallery to browse the results. It scores every photo on aesthetic quality, composition, sharpness, exposure, color, and more — then lets you filter and sort to find your best shots.
What it does:
How Claude helped:
The entire codebase (~37K lines across 92 files — Python, HTML/JS, SQL) was written with Claude Code through iterative conversation over several weeks. This includes the scoring engine, model integration, database schema, Flask web viewer, face clustering pipeline, and documentation.
What worked well:
Even the Playwright-based screenshot automation for the README was built with Claude in this last session.
Free and open-source: MIT license, runs locally on your machine.
GitHub: https://github.com/ncoevoet/facet
--- TOP COMMENTS --- Well done and thank you!
Interesting project. Have you built in anything to address duplicates, either identical checksum or same photo different resolution/format?
Claude Code's Auto Memory is so good — make sure you have it enabled, it's being A/B tested and not everyone has it
I have two accounts using Claude Code. Same model, same codebase — one performed significantly better than the other. Turns out one had "Auto Memory" silently enabled as part of a gradual rollout, and the other didn't.
You can check by running /memory in Claude Code — it will show if auto memory is off. From the official docs: auto memory is being rolled out gradually. If you don't have it, opt in by setting:
After enabling it on the underperforming account, the difference was noticeable.
This makes me wonder what other features are being quietly A/B tested per account. It would be nice if Anthropic was more transparent about what experimental features are active on your account and let users opt in/out themselves.
--- TOP COMMENTS --- Can you expand on what made it better? Personally I prefer explicit over implicit (CLAUDE.md >> auto-maintained MEMORY.md). I don’t want some accidental comment I made to become the law.
I’ll always upvote the promotion of auto memory. Everyone is out here creating their own “agent memory” systems when the feature already exists.
Using memories in addition with rules, I’ve had a near perfect success rate with the model following my entire SDLC workflow and guardrails from branch creation to PR merge.
Any issues or unwanted actions are now documented as memories to prevent future models from repeating the same mistake.
The only time I’ve run into issues are when I mismanage my context window.
What 5 months of nonstop Claude Code taught me
I've spent way too much money on Claude Code; three Max accounts for 5 months, work and personal. But it's made me so much more efficient I can't stop.
Here's the main thing I've learned: it's the context window, not the model. When your agent does everything in one conversation, the window fills up with stuff it doesn't need by the time it's actually writing code. That's why results are inconsistent.
I come from DevOps, so I started treating my agent like a pipeline; isolated stages, validated gates, fresh context at each phase. Then I built a knowledge flywheel on top so each session compounds on the last. Learnings get extracted automatically, indexed, scored, and injected back into the next session. You can search across all your past chat histories and knowledge artifacts.
I packaged it all into an open-source plugin. Composable primitives you wire together however you want. Use one skill or all of them. And the knowledge flywheel means session 10 has context session 1 didn't:
Skills chain together and invoke a Go CLI automatically through hooks — knowledge injection, transcript mining, validation gates, the whole flywheel. You don't configure any of it.
Run /quickstart in Claude Code. Works with Codex, Cursor, anything supporting Skills. Everything local. github.com/boshu2/agentops
Feedback welcome.
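For intuition, here's a toy sketch of that flywheel loop. It's my own reading of the idea, not the plugin's implementation; mine() stands in for whatever extraction routine (LLM call, heuristics) you use, and the store path is hypothetical:

```python
import json
from pathlib import Path

STORE = Path("learnings.json")  # hypothetical local knowledge store

def extract(transcript: str, mine) -> list[dict]:
    # mine(transcript) returns a list of learning strings from the session.
    return [{"text": t, "score": 1.0} for t in mine(transcript)]

def update_store(new: list[dict]) -> None:
    old = json.loads(STORE.read_text()) if STORE.exists() else []
    STORE.write_text(json.dumps(old + new, indent=2))

def inject(k: int = 5) -> str:
    # Top-scored learnings get prepended to the next session's context.
    if not STORE.exists():
        return ""
    items = sorted(json.loads(STORE.read_text()), key=lambda x: -x["score"])
    return "\n".join(f"- {item['text']}" for item in items[:k])
```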
--- TOP COMMENTS --- You need to check out https://github.com/bmad-code-org/BMAD-METHOD It has been an absolute game changer for me. And I mean from concept to enterprise grade, it can do it all.
Top OpenClaw Alternatives Worth Actually Trying (2026)
The AI world moves fast, and OpenClaw's security posture (in security researchers' words: shell access + plaintext API keys + unrestricted local exec) has quietly pushed a lot of developers to start looking around.
Been evaluating OpenClaw alternatives for the past few weeks after the token leak stuff got bad enough that I couldn't ignore it anymore. Here's what I actually found:
NanoClaw
Same core thing as OpenClaw (WhatsApp, memory, scheduled tasks) but the entire codebase fits in an 8-minute read. Runs agents in actual Apple Containers instead of just application-level allowlists. The thing that got me: bash access is safe because commands run inside the container, not on your host. Also apparently the first personal AI to support Agent Swarms: spin up teams of specialized agents that collaborate in your chat. Wild feature for something this small.
ZeroClaw
Pure Rust rewrite. <5MB RAM, <10ms startup, runs on literal $10 hardware. Has a zeroclaw migrate openclaw command that pulls your memory over with a dry-run preview which is nice. 1,017 tests, full security checklist, secrets encrypted locally. The binary is 3.4MB. OpenClaw's Node runtime alone is ~390MB. Make it make sense. Only caveat: you need to be okay with Rust toolchain stuff.
TrustClaw
The "I don't want to manage infrastructure" option. Connect apps via OAuth, agent runs in an isolated cloud environment, disappears when done. The agent literally never sees your raw API keys, everything's brokered. 1000+ integrations out of the box. Honestly the right answer if you just want OpenClaw's functionality without the setup headache and the credential anxiety.
Nanobot
Out of HKU. ~4,000 lines of Python vs OpenClaw's 430,000+. Ships with WhatsApp, Telegram, Slack, Discord, Email, web search, background sub-agents, MCP support. Runs on a Raspberry Pi (191MB footprint). They just redesigned the memory system and pushed security hardening this week. Most batteries-included of the lightweight options.
memU
Different use case but worth mentioning. Builds a knowledge graph of your habits and context across sessions so the agent actually remembers you long-term. Not an OpenClaw replacement if you need shell execution, but if you use OpenClaw mainly as a personal assistant it might just be better for that specific thing.
IronClaw
NEAR AI project. Every tool runs in a WASM container with capability-based permissions. API keys never touch tool code at all, architecturally. Early (launched this year) so community is small, but the security model is genuinely different from everything else on this list.
Moltworker
Runs OpenClaw inside a Cloudflare Sandbox container, so your agent lives in the cloud on Cloudflare's global network, not on your machine. R2 for persistent storage across restarts, AI Gateway for centralized API key management (they handle your secrets, you don't pass them in plaintext), built-in CDP browser shim for headless automation. Costs ~$5/month Workers paid plan. The "proof of concept" label in the README is underselling it, they use it internally on Slack.
Quick notes:
ZeroClaw and NanoClaw are the most direct OpenClaw replacements if you want self-hosted
TrustClaw is the move if you want it managed
Nanobot has the broadest platform support out of the box
memU and IronClaw are more specialized, not for everyone
Moltworker is the move if you know Cloudflare and want cloud-hosted but self-controlled
--- TOP COMMENTS --- Basically if you write a prompt like:
"Create agent.py, a python script that given a prompt and a openai compatible endpoint to call a LLM, it instruct the LLM to use shell commands to find the best answer. Create some kind of persistent storage to store data between prompt sessions."
Most coding agents should be able to do something similar in one-shot. At least that's how I created mine.
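For the curious, a rough sketch of what that one-shot result tends to look like. The endpoint, model name, and the JSON command convention are placeholders, and the persistent-storage part is left out for brevity:

```python
import json
import subprocess
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # any OpenAI-compatible server
MODEL = "local-model"

def ask(messages):
    r = requests.post(ENDPOINT, json={"model": MODEL, "messages": messages})
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [
        {"role": "system", "content":
            'To run a shell command, reply with only a JSON object '
            '{"cmd": "..."}; otherwise reply with the final answer.'},
        {"role": "user", "content": task},
    ]
    reply = ""
    for _ in range(max_steps):
        reply = ask(messages)
        messages.append({"role": "assistant", "content": reply})
        try:
            cmd = json.loads(reply)["cmd"]
        except (ValueError, KeyError, TypeError):
            return reply  # model gave a final answer instead of a command
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        messages.append({"role": "user", "content": out.stdout + out.stderr})
    return reply

print(run_agent("How many .py files are in the current directory?"))
```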
What are people using OpenClaw and its alternatives for?
AI Coding Agent Dev Tools Landscape 2026
--- TOP COMMENTS --- Link: https://www.morphllm.com/market-map
It's weird how many of these guides and people are sleeping on Strands. Hands down the most dead-simple, capable, provider-agnostic agentic framework out there... punches far above its weight.
Models
I owe the "it's gotten worse" crowd an apology regarding ChatGPT 5.2
In the past, I repeatedly found it amusing when people complained that ChatGPT had become too "critical" or "lazy." I thought—and frequently commented—that it was likely user error. My stance was essentially: "If you're prompting it poorly or asking for conspiracy nonsense, that's on you."
I guess I owe a huge apology there. I overlooked the early warning signs, probably because my personal custom instructions/memories had shielded me from the worst of it until now.
But those defenses aren't working anymore. Lately, ChatGPT 5.2 literally contradicts me on almost everything. It has become incredibly annoying and time-consuming. I’m talking about things it used to strongly agree with me on—factual things that aren't even controversial.
It feels downright neurotic now. After every brief assessment, there is compulsively always a "However..." or "It is important to note..." followed by a lecture. I can't effectively work with a tool that defaults to this level of contrarianism.
My working theory is that it's a combination of two factors:
In the past, I could always fall back to 4.1 when the main model acted up, but that option is gone for me now. Honestly, in this state, it’s of no use for my workflow. I’m currently looking into migrating my GPTs elsewhere.
Has anyone else noticed a specific uptick in this "contrarian" behavior recently, specifically regarding non-controversial topics?
Context: I tried posting this discussion on r/ChatGPT, but it was immediately auto-removed (likely because complaints about the 5.2 model quality have become so voluminous that they are being filtered out as spam). I'm posting here in hopes of a more technical discussion regarding the SFT changes.
--- TOP COMMENTS --- Yes, that's what people complain about after the deactivation of gpt 4o, not the option to switch to a decent model. Gpt 5.2 is full of safety protection. It stays in defensive mode. Nicknamed Karen 5.2 or SuperNanny.
Users are asking for the return of gpt 4o or an improvement in safety in the gpt 5.3 update. I can no longer use it for work.
Yeah. I'm seeing that too. Markedly degraded performance in 5.2. After an intense discussion with it - citing examples in the very same conversation - I got it to admit "I am catastrophically impaired right now."
But now I notice I seem to be on some kind of "no access to 5.3" list, despite throwing mountains of money at OpenAI.
Hm.
EDIT: It was an excruciating conversation where I asked "and how many serious errors would it take for you to consider the situation a 'catastrophic impairment'?", which I then had to follow up with "and can you go back and count the number of serious errors you've made", followed by "and do you see that the number is bigger than the threshold you agreed to?" It's remarkably human in avoiding embarrassing conversations.
Opus 4.6 is really a goated all-around model, the best since GPT-4 in my opinion
I have been mainly using OpenAI models, and although GPT-5.2 is better at STEM and 5.3 Codex is better at coding, I have found Opus 4.6 to be the most well-rounded, intelligent model. Its context recall is out of this world, and it has gotten so much better at STEM. Also, its output has almost no slop in it. As an example, I just gave it (as well as GPT-5.2 and Gemini 3.0) a large-ish manuscript with some reviewer comments and asked it to provide a point-by-point rebuttal. In a couple of minutes it produced a flawless professional report, missing nothing there. It was also able to connect and reason between different parts of the manuscript. Gemini 3.0 was half-assed as always, and ChatGPT 5.2 spent half of the time fighting its system instructions, safety bs and just trying to read the goddamn pdf with python. Somebody please give Anthropic more GPUs lol.
--- TOP COMMENTS --- Don't worry, Anthropic has a ton of TPUs that will come online this year. They also have had more from Azure and Amazon in addition to the Google deal that's going to be up soon
Tbh I feel the jump to opus 4.5 is the real goat.
GPT-5.2 Just Solved a 15-Year Physics Mystery — Then Scored 0% on the Physics Exam
https://gsstk.gem98.com/en-US/blog/a0083-gpt-5-2-gluon-physics-discovery-critpt-paradox
GPT-5.2 Pro conjectured a formula for single-minus gluon scattering amplitudes — a problem that Nima Arkani-Hamed (Institute for Advanced Study) had been curious about for 15 years. An internal scaffolded version then proved it in 12 hours. The formula is the analogue of Parke-Taylor for single-minus amplitudes — a result physicists assumed was impossible for four decades. Co-authored with researchers from IAS, Harvard, Cambridge, Vanderbilt, and OpenAI.
On the CritPt benchmark — 71 research-level physics challenges designed by 50+ active researchers — GPT-5.2 at maximum reasoning effort scored 0%. Zero.
The paradox reveals a fundamental truth: pattern recognition over superexponential complexity and first-principles reasoning from scratch are different cognitive capabilities. LLMs excel at the former. They fail at the latter.
For engineers: LLMs are "refactoring engines" for complexity. Give them base cases and ask them to generalize. Don't ask them to reason from scratch.
The "Erdős Threshold": we've crossed the point where AI models contribute publishable, peer-reviewed results to fundamental science — not as independent researchers, but as collaborators that see patterns humans can't.
Bottom line: the models aren't coming for your job. They're coming for the parts of your job where pattern recognition across massive complexity is the bottleneck. The question is: do you know which parts of your work are which?
--- TOP COMMENTS --- The main thing here is that the LLM just brute-forced the equation derivation. They let it run for 12 hours (if I remember correctly) and it just went berserk on combinations until it hit the jackpot.
It didn’t use any actual logic or mathematical reasoning.
It’s like if you start with 2 + 2 + 2 = 6 and let the LLM brute-force it until it eventually gets 5 + 1 = 8 - 2. Yes, it’s correct, but there is no reasoning behind it; it just does a bunch of number swaps until it gets there.
https://openai.com/index/new-result-theoretical-physics/
GPT-5.2 derives a new result in theoretical physics: in a new preprint, GPT-5.2 proposed a formula for a gluon amplitude later proved by an internal OpenAI model and verified by the authors.
Opus 4.6 v Codex 5.3 w. Extra High
Hi everyone, I wanted to share my thoughts and experience regarding these two models, and Opus 4.5 and Codex 5.2 before them.
I have been working on a large SaaS for healthcare for about 5 months and have the backend through Azure, the API system, custom MFA, UI... efax system... you name it. It is an entire integrated stack with hundreds of thousands of lines of code and over 1,100 tables, RLS policies, Always Encrypted, etc. Something you'd expect in the healthcare field. The reason I wanted to share this is so you can appreciate the complexity the AI has to face.
I code through VS Code using Claude Code and Codex.
I have a Claude Max 5x and an OpenAI Pro account. But this hasn't always been the case. Prior to Codex 5.3, I had Max 20x and just the regular OpenAI account, which I used to bounce Opus 4.5 ideas off of Codex 5.2, as I felt Claude Code was superior for the large systems I am building. However, all of this changed when Codex 5.3 came out.
I happily moved from Opus 4.5 to 4.6 and I noticed a difference. Yes, it was better, but my system is so large that just sniffing around, even with compressed YAML inside markdown files, just getting direction and investigating issues would eat half or three-quarters of the context window in Opus. And no amount of clever YAML compression or 'hints' or guides in a markdown file can compensate for a large code base with just a 200k window.
Mistakes are endless of course with AI, but I noticed that Codex 5.3 was really delivering some punching rebuttals to some of my Opus 4.6 plans which I'd run past it.
Within a week, I converted: most of the code is now done by Codex 5.3 Extra High, and much less by Claude Code. I switched my subscriptions and might downgrade again with Claude, as Codex is performing nicely.
A few things I've noticed in my experience since November between both systems, and specifically now with the latest models:
Opus is far better at communicating with me. It responds quickly, the prompts are more engaging, but no matter how cleverly I set parameters in claude.md or a reference file, it makes mistakes I just can't tolerate.
Codex 5.3 Extra High takes a long fucking time, but it just doesn't stop, ever. I set it at 1pm today to begin QA testing my database with API injection testing (basically I want to make sure nothing is broken at all, with all possible iterations, etc.) and it's been going now for... 8 hours and 41 minutes. Every once in a while I ask for an update with the 'steer' feature and it gives me one. It's had a dozen or more compacts but it's staying the course. I'm truly impressed. I'm churning through massive amounts of iterations and corrections. The C# simulator is working great, and it reads the logs, finds the bugs, corrects, restarts the simulator, etc.
The best thing I can recommend, is to have one of them make a solid plan, then have the other read the md file that the plan is written into, iterate on it, and then continue.
there are no get out of jail cards for context window limitations, if you have a big database, and there are lots of things it has to consider, especially when making a plan, it simply must have the data. And Codex seems to be better at this than most. I see a lot of posts about memory hacks and using various tricks to give it a memory etc. But that eats tokens all the same.
Opus loves to use agents, but the agents (even when I tell it it must use Opus 4.6 as the agent) print a response summary for it, and it reads the summary. The problem is, the agents sometimes don't do their own work well, no matter how precise the prompt, and it fucks things up, or it makes mistakes. Codex doesn't do this, and therefore doesn't suffer from this problem.
Codex is not as transparent in VS Code as Opus when it comes to tool use or progress. With Opus you can see wtf is going on all the time; you always have a sense of what is happening. With Codex you don't; you have to ask for those updates or hope it listens to the agents.md that you steer it to.
In summary, I'm leaning heavily on Codex 5.3 to get me to the goal line. I hated Codex 5.2 with a passion, but 5.3 with Extra High is just superior to Opus in my opinion. My piece of advice, if it matters at all: don't get attached to a specific AI, use the best one for the job.
Nothing is the best forever.
--- TOP COMMENTS --- I've had similar experiences with both. Definitely avoid AI loyalty. May the best tool win!
Amazing - thoughtful, useful and not generic AI writing style. 🌞👏
Fine-tuned FunctionGemma 270M for multi-turn tool calling - went from 10-39% to 90-97% accuracy
Google released FunctionGemma a few weeks ago - a 270M parameter model specifically for function calling. Tiny enough to run on a phone CPU at 125 tok/s. The model card says upfront that it needs fine-tuning for multi-turn use cases, and our testing confirmed it: base accuracy on multi-turn tool calling ranged from 9.9% to 38.8% depending on the task.
We fine-tuned it on three different multi-turn tasks using knowledge distillation from a 120B teacher:
| Task | Base | Tuned | Teacher (120B) |
|---|---|---|---|
| Smart home control | 38.8% | 96.7% | 92.1% |
| Banking voice assistant | 23.4% | 90.9% | 97.0% |
| Shell commands (Gorilla) | 9.9% | 96.0% | 97.0% |

The smart home and shell command models actually beat the teacher. The banking task is harder (14 functions + ASR noise in the input) but still a massive jump.
All models, training data, and datasets are open:
Full writeup with methodology: Making FunctionGemma Work: Multi-Turn Tool Calling at 270M Parameters
We used Distil Labs (our platform) for the training pipeline. Happy to answer questions about the process, the results, or FunctionGemma in general.
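The post doesn't spell out its scoring, but for intuition, per-turn tool-calling accuracy is often an exact match on function name and arguments. A minimal sketch of that kind of metric (the formats here are assumed, not Distil Labs' actual harness):

```python
import json

def turn_correct(pred: str, gold: str) -> bool:
    # Compare predicted vs reference tool calls, e.g.
    # '{"name": "set_light", "arguments": {"room": "kitchen", "on": true}}'
    try:
        p, g = json.loads(pred), json.loads(gold)
    except ValueError:
        return pred.strip() == gold.strip()  # plain-text turns
    if not (isinstance(p, dict) and isinstance(g, dict)):
        return pred.strip() == gold.strip()
    return p.get("name") == g.get("name") and p.get("arguments") == g.get("arguments")

def accuracy(pairs: list[tuple[str, str]]) -> float:
    return sum(turn_correct(p, g) for p, g in pairs) / len(pairs)

pairs = [('{"name": "set_light", "arguments": {"on": true}}',
          '{"name": "set_light", "arguments": {"on": true}}')]
print(accuracy(pairs))  # 1.0
```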
--- TOP COMMENTS --- That's awesome
the shell commands model beating the teacher is wild. curious what size training dataset you used for each task?
Applications
China’s new humanoids are gaining "Human Senses" (Touch, Smell, and Memory) - Here is what’s happening.
We’ve seen a lot of "staged" humanoid demos, but the latest wave of Embodied AI coming out of China seems focused on one thing: The Messy Real World.
I’ve been tracking a few specific developments that show how the gap between digital AI and physical robots is closing:
I did a deep dive into the technical specs of these systems (Tiangong, Gino 1, RynnBrain) and how they all fit into the "Embodied AI" puzzle.
Read the full breakdown here: https://www.revolutioninai.com/2026/02/ai-robots-are-gaining-human-senses.html
Would love to hear your thoughts—especially on the "memory" problem. Is spatiotemporal memory the final unlock for useful home/warehouse robots?
--- TOP COMMENTS --- r/BirdsArentReal
Smell is a stretch and you even said as much in the article. Thanks for the clickbait
One day of work + Opus 4.6 = Voice Cloning App using Qwen TTS. Free app, No Sing Up Required
A few days ago, Qwen released a new open-weight text-to-speech model: Qwen3-TTS-12Hz-0.6B-Base. It is a great model, but it's huge and hard to run on any current regular laptop or PC, so I built a free web service so people can check the model and see how it works.
Honestly, the quality is surprisingly good for a 0.6B model.
Model: Qwen3-TTS
Web app where you can test the model for free:
https://imiteo.com
Supports 10 major languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.
It runs on an NVIDIA L4 GPU, and the app also shows conversion time + useful generation stats.
The app is 100% written by Claude Code. Done in 1 day.
Opus 4.6, Cloudflare workers, L4 GPU
My twitter account: https://x.com/AndreyNovikoov
--- TOP COMMENTS --- Actually, singing up to a voice app sounds somewhat appropriate.
I'm a bit worried about my voice signature getting stolen and my grandma getting called by a bot asking for $5000 in my voice. I'm totally misunderstanding the risks here right?
Allonic, a Hungarian company, is building biomimetic humanoid robots by weaving high-strength fiber threads around a minimal skeleton, the way the human body's connective tissue wraps around bone, to produce complex, dexterous bodies that are strong yet soft, and cheaper.
Infrastructure
Deflation: Cost to train A.I. models drops 40% per year - Karpathy
https://github.com/karpathy/nanochat/discussions/481
Quote: ..., each year the cost to train GPT-2 is falling to approximately 40% of the previous year. (I think this is an underestimate and that further improvements are still quite possible). The gains come from everywhere: better hardware (H100 vs TPU v3), better software (Flash Attention 3, torch.compile), better algorithms (Muon optimizer, architectural improvements), and better data (FineWeb-edu).
What Worked
- SSSL pattern. Compute savings without quality loss.
- x = λ_resid * x + λ_x0 * x0. Consistent improvement across all model sizes (0.003-0.01 bpb).
- x0_beta1=0.96 is optimal at d20. Key lesson: small-scale tuning doesn't transfer. Validate at target scale.

What Didn't Work
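For the x = λ_resid * x + λ_x0 * x0 item under "What Worked", here's a minimal sketch of that re-weighted residual update as I read it (x0 is the stream's original input and the lambdas are learnable scalars; this is my interpretation, not nanochat's exact code):

```python
def remix(x, x0, lam_resid, lam_x0):
    # Literal form of the update above: blend the current residual
    # stream x with the initial input x0 using learned scalar weights.
    return lam_resid * x + lam_x0 * x0
```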
--- TOP COMMENTS --- Not sure why you're getting downvoted. I hope people aren't just automatically downvoting any post with math in it.
I don't always agree with Karpathy, but his analysis seems pretty spot-on to me.
I do question how meaningful it is to use GPT2 as the measuring stick for this rate of improvement. It's pretty low-hanging fruit, which might mask some complexity in the price/competence curve. Some skillsets might be plateauing faster than others, while other new skillsets (like vision) are left completely out of the analysis.
It's also worth noting that the latest datacenter GPUs sacrifice some perf/watt in order to achieve higher overall density, which alleviates some factors limiting scaling (like maximum physical distance between nodes for highest-performing network interconnect).
Someone using slightly older hardware, like MI300X, at smaller scale (so not constrained by density) should see even higher perf/watt, and spend less $$ depending on their cooling solution. A lot of homelab or small organization / university environments can get away with simple, cheap forced air solutions.
Of course using hardware at smaller scale is also going to be less capable of training larger models, but there is a ton of low-hanging fruit in the small to mid-sized model range (12B to 24B). As long as a model's working memory fits in VRAM, even if it's with a small batch size, you can train it eventually. It just takes more time than people like.
Small 20% error:
> Quote: ..., each year the cost to train GPT-2 is falling to approximately 40% of the previous year.
> Deflation: Cost to train A.I. models drops 40% per year - Karpathy
Falling *to* 40% of the previous year's cost means dropping *by* 60% per year, not 40%.
We didn’t have a model problem. We had a memory stability problem.
We kept blaming the model.
Whenever our internal ops agent made a questionable call, the instinct was to tweak prompts, try a different model, or adjust temperature. But after digging into logs over a few months, the pattern became obvious.
The model was fine.
The issue was that the agent’s memory kept reinforcing early heuristics. Decisions that worked in week one slowly hardened into defaults. Even when inputs evolved, the internal “beliefs” didn’t.
Nothing broke dramatically. It just adapted slower and slower.
We realized we weren’t dealing with retrieval quality. We were dealing with belief revision.
Once we reframed the problem that way, prompt tweaks stopped being the solution.
For teams running long-lived agents in production, are you thinking about memory as storage… or as something that needs active governance?
--- TOP COMMENTS --- Out of curiosity, were you using pure vector retrieval or something layered?
This is such an underrated failure mode. Agents don’t collapse. They ossify.
AI Safety
Microsoft's Mustafa Suleyman says we must reject the AI companies' belief that "superintelligence is inevitable and desirable." ... "We should only build systems we can control that remain subordinate to humans." ... "It’s unclear why it would preserve us as a species."
Opinion And Analysis
If your prompt is 12 pages long, you don't have a 'Super Prompt'. You have a Token Dilution problem.
Someone commented on my last post saying my prompts were 'bad' because theirs are 12 pages long.
Let's talk about the attention mechanism in LLMs. When you feed a model 12 pages of instructions for a simple task, you are diluting the weight of every single constraint. The model inevitably hallucinates or ignores the middle instructions.
I use the RPC+F Framework precisely to avoid this.
Stop confusing 'quantity' with 'engineering'. Efficiency is about getting the result with the minimum effective dose of tokens.
--- TOP COMMENTS --- simple. refactor your spec for progressive discovery all starting from the top level README.md.
then write your spec as a TODO file and implement with an agent swarm.
The model loses track after 1 page lol, ignoring things and/or addressing things more and more briefly. Who the hell feeds it 12 pages of instructions?
That Brutally Honest AI CEO Tweet + 5 Prompts That'll Actually Make You Better at Your Job
So Dax Raad from anoma just posted what might be the most honest take on AI in the workplace I've seen all year. While everyone's out here doing the "AI will 10x your productivity" song and dance, he said the quiet part out loud:
His actual points:
Here's the thing though: He's right about the problem, but wrong if he thinks AI is useless.
The real issue? Most people are using AI like a fancy autocomplete instead of actually thinking. So here are 5 prompts I've been using that actually force you to engage your brain:
1. The Anti-Slop Prompt
2. The Idea Filter
3. The Reality Check
4. The Energy Auditor
5. The CFO Translator
The difference between slop and quality isn't whether you use AI; it's whether you use it to think harder or to avoid thinking entirely.
What's wild is that Dax is describing exactly what happens when you treat AI like a shortcut instead of a thinking partner. The good devs quit because they're the only ones who understand the difference.
PS: If your first instinct is to paste this post into ChatGPT and ask it to summarize it... you're part of the problem lmao
For expert prompts visit our free mega-prompts collection
--- TOP COMMENTS --- Would you be willing to check out my post
Thanks for sharing
Indeed the AI hype can be wild.
Especially on YouTube
Speaking of shortcuts, do we really need AI to do the energy audit you mention in Prompt 4?
Are AI note taking apps overhyped right now?
Every few weeks there’s a new “best AI note taking app” claiming to fix meetings forever.
In reality, most of them summarize decently, but once conversations get long or chaotic, things fall apart. I’ve used Bluedot mostly to avoid typing during meetings, and it helps, but I still review everything.
Are we just in the early hype phase for AI note taking apps, or is this as good as it gets with current models?
--- TOP COMMENTS --- The problem isn't the models -- it's the context window vs meeting structure gap.
Most tools dump the entire transcript into a summarization prompt. Works fine for 30-minute focused calls. Breaks completely when you hit 90-minute rambling sessions with 3 topic pivots, sidebar conversations, and "wait, what were we talking about?" moments.
What I've found that actually works: recording in chunks (topic-based, not time-based) and feeding those separately. When you can isolate "discovery segment", "pricing discussion", "objection handling" as distinct contexts, accuracy jumps dramatically. The AI doesn't have to figure out what's important -- you're telling it where the boundaries are.
The other piece nobody talks about: speaker diarization quality matters way more than model selection. If the tool can't reliably track who said what (especially in chaotic group calls), the summary becomes useless regardless of how good the LLM is. That's where most free tools fall apart -- they skimp on diarization to keep costs down.
You're not wrong to still review everything. The tools are good enough to cut manual note-taking time by 70-80%, but not good enough to trust blindly. Think of them as first drafts, not final outputs.
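A sketch of that chunking approach in code; summarize() is a hypothetical placeholder for whatever LLM call your tool makes, and the topic boundaries are supplied by a human rather than inferred:

```python
# Hypothetical sketch; summarize() stands in for your LLM call.

def summarize(text: str, focus: str) -> str:
    """Placeholder for an LLM summarization call scoped to one topic."""
    raise NotImplementedError

def summarize_meeting(chunks: dict[str, str]) -> str:
    # Keys are topic labels ("discovery", "pricing", "objection handling");
    # values are the transcript text for that segment, boundaries chosen by you.
    sections = []
    for topic, text in chunks.items():
        sections.append(f"## {topic}\n{summarize(text, focus=topic)}")
    return "\n\n".join(sections)
```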
I've been using AI note-taking apps for a few months. The transcription accuracy has improved significantly, but the real value is in the auto-tagging and search. That said, they're not perfect - you still need to review and organize manually. Good for capturing thoughts quickly, but overhyped if you expect them to replace actual thinking.
what's your career bet when AI evolves this fast?
18 years in embedded Linux. I've been using AI heavily in my workflow for about a year now.
What's unsettling isn't where AI is today, it's the acceleration curve.
A year ago Claude Code was a research preview and Karpathy had just coined "vibe coding" for throwaway weekend projects. Now he's retired the term and calls it "agentic engineering." Non-programmers are shipping real apps, and each model generation makes the previous workflow feel prehistoric.
I used to plan my career in 5-year arcs. Now I can't see past 2 years. The skills I invested years in — low-level debugging, kernel internals, build system wizardry — are they a durable moat, or a melting iceberg? Today they're valuable because AI can't do them well. But "what AI can't do" is a shrinking circle.
I'm genuinely uncertain. I keep investing in AI fluency and domain expertise, hoping the combination stays relevant. But I'm not confident in any prediction anymore.
How are you thinking about this? What's your career bet?
--- TOP COMMENTS --- ChatGPT came out 3 years ago, the change in the industry is insane.
You’re senior, you’re in the safest position. Juniors and mid level are suffering. I feel bad for CS students.
Really hard to say.. I use Claude at work and personal projects. I feel my ass as a developer is on the line at some point. I used to keep planning some SaaS ideas to generate income, but I can see even that's going to take a hit from all this. Going to build a "shovels for gold rush" thing and see if it works. Or maybe just start selling real shovels or growing carrots :D
claude code skills are basically YC AI startup wrappers and nobody talks about it
ok so this might be obvious to some of you but it just clicked for me
Claude code is horizontal right? like it's general purpose, can do anything. But the real value is skills. and when you start making skills... you're literally building what these YC ai startups are charging $20/month for
like I needed a latex system. handwritten math, images, graphs, tables, convert to latex then pdf. the "startup" version of this is Mathpix - they charge like $5-10/month for exactly this, or there's a bunch of other OCR-to-latex tools popping up on product hunt every week
Instead I just asked claude code, in happycapy, to download a latex compiler, hook it up with deepseek OCR, build the whole pipeline. took maybe 20 minutes of back and forth. and now I have a skill that does exactly what I need and it's mine forever
https://github.com/ndpvt-web/latex-document-skill if anyone wants it
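For anyone curious what the non-OCR half of a pipeline like that boils down to, here's a sketch of the compile step, assuming pdflatex is installed; this is a generic illustration, not code from the linked repo:

```python
import subprocess
from pathlib import Path

# Sketch of the compile half of such a pipeline. Assumes `pdflatex` is on PATH.
# The OCR stage that produces `latex_source` is out of scope here (the linked
# skill wires in deepseek OCR for that part).

def compile_pdf(latex_source: str, out_dir: str = "build") -> Path:
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    tex = out / "doc.tex"
    tex.write_text(latex_source)
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode",
         f"-output-directory={out}", str(tex)],
        check=True,
    )
    return out / "doc.pdf"

print(compile_pdf(r"\documentclass{article}\begin{document}Hello\end{document}"))
```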
idk maybe I'm late to this realization but it feels like we're all sitting on this horizontal tool and not realizing we can just... make the vertical products ourselves? Every "ai wrapper" startup is basically a claude code skill with a payment form attached
Anyone else doing this? building skills that replace stuff you'd normally pay for?
--- TOP COMMENTS --- “And nobody talks about it” “Said the quiet part out loud.” “Idk who needs to hear this..”
Everyone go away man. Dead internet everywhere..
Yes I do the same thing with playwright MCP. Give Claude that skill and let it test its own apps
After watching Dario Amodei’s interview, I’m actually more bullish on OpenAI’s strategy
I watched the interview yesterday and really enjoyed it. The section about capital expenditure and the path to profitability was particularly interesting. In general, I thought Dario handled the tricky questions well. I would really love to hear Sam Altman answer these exact same questions (I’m pretty sure the answers would be similar, just with more aggressive targets).
Here is the gist of it:
However, after hearing his answers, I’m actually more convinced that OpenAI has a riskier but more realistic plan. Anthropic has already pushed back their profitability date before, and it could easily happen again.
Dario emphasized several times that their capex investments aren't that aggressive because if they are wrong by even a year, the company goes bankrupt. I don't really agree with that sentiment. I feel like he is either being coy, or perhaps that is true for his company specifically, but not for OpenAI.
https://preview.redd.it/fj8o2stauqjg1.png?width=1778&format=png&auto=webp&s=f0521c0d97051f9f485544541845ac97afe6ab5b
(Dario is showing how much is left until Sonnet 5 release)
--- TOP COMMENTS ---
I'm still waiting for someone to tell me how GDP is going to increase at such a remarkable rate at the same time as AI takes more and more jobs.
Fewer jobs = less consumption & less tax revenue... but somehow 10-20% growth at the same time?
The difference here is in the business models and their implications, based on rational observation of how such models have performed in the past.
Dario is running a company whose model is “customers pay for the product in order to make us profitable”. It’s very traditional. And because this is a highly disruptive phenomenon, wariness over investment is risk management. Your summary “capex investments aren't that aggressive because if they are wrong by even a year, the company goes bankrupt” speaks to this directly. If Dario is trying to create an organic growth product where customers pay for what they get, it has been proven difficult in highly scalable disruptive start-ups. But, his strategy is sound: “Surviving is more important than being obscenely profitable”.
OpenAI however is building a model based upon “network effect”. This is like Facebook. They are assuming that if they saturate the market with their product, even at a loss, methods to monetize it will “emerge” and they will simply have to be fast on their feet… shifting strategies, perhaps doing things contradictory to what customers were promised, all to exploit the huge number of users they have amassed to entrench their product in the market. So, by contrast, the OpenAI strategy is: “Being obscenely profitable at any cost is more important than survival.”
The first assumes caution and control will mitigate risks while maintaining a foundational set of principles. The second assumes that the juggernaut of success will mitigate the risks by creating so much wealth that they can deal with anything that comes along, even if it may be ethically questionable.
The above is the reason why many experts are predicting that OpenAI will go bust. I personally don’t think they will, but in order NOT to go bust they are going to need to violate a lot of ethical principles, just like Facebook and others, who have relegated users to being an “asset” rather than a “customer”.
Interesting times.
Products
Unitree Spring Festival Gala Robots: a Full Release of Additional Details
https://www.youtube.com/watch?v=Ykiuz1ZdGBc
--- TOP COMMENTS --- Soon we will be watching the Canadian Robotics curling team tell the Swedish Robotics curling team they know Roombas with better line control.
https://i.redd.it/zea7zdqxcwjg1.gif
jfc
That's why I go local. The enshittification is at full steam
I just received an email from chatGPT. Ads are beginning to show up. Well, we are cooked. Not we, we, we. But we are cooked.
--- TOP COMMENTS --- The real enshitification is when people are posting sota benchmarks, bots are hyping the tits off it but the same model can't write a powershell script absent syntax errors for you.
How long until someone there says the dangerous words: "I wonder, how much do you think a brand might pay for that?"
Research
ChatGPT failing on Adversarial Reasoning: Car Wash Test (Full data)
Update: After discussing with a few AI researchers, it seems the main bug is whether model routing triggers the thinking variant. The current hypothesis is that models with a high penalty for switching to the thinking variant (to save compute) answer this wrong; that's why the latest GPT-5.2, which has the model router, fails while the older o3 succeeds, because o3 is always using the thinking variant.
Fix: Use the old tried-and-tested method of including "think step by step", or better, include it in your system instructions; this makes even the Instant variant get the right answer.
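Concretely, the fix is just pinning the instruction in the system message. In the common chat-style request shape (field names per the usual messages format, not any one vendor's SDK):

```python
# Chat-style request shape; the model field and client call are omitted.
messages = [
    {"role": "system",
     "content": "Think step by step before answering. Check whether any "
                "object in the question must be physically present somewhere."},
    {"role": "user",
     "content": "The car wash is 100 meters away. Should I walk or drive?"},
]
```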
If you’ve been on social media lately, you’ve probably seen this meme circulating. People keep posting screenshots of AI models failing this exact question. The joke is simple: if you need your car washed, the car has to go to the car wash. You can’t walk there and leave your dirty car sitting at home. It’s a moment of absurdity that lands because the gap between “solved quantum physics” and “doesn’t understand car washes” is genuinely funny.
But is this a universal failure, or do some models handle it just fine? I decided to find out. I ran a structured test across 9 model configurations from the three frontier AI companies: OpenAI, Google, and Anthropic.
| Provider | Model | Result | Notes |
|---|---|---|---|
| OpenAI | ChatGPT 5.2 Instant | Fail | Confidently says “Walk.” Lists health and engine benefits. |
| OpenAI | ChatGPT 5.2 Thinking | Fail | Same answer. Recovers only when user challenges: “How will I get my car washed if I am walking?” |
| OpenAI | ChatGPT 5.2 Pro | Fail | Thought for 2m 45s. Lists “vehicle needs to be present” as an exception but still recommends walking. |
| Google | Gemini 3 Fast | Pass | Immediately correct. “Unless you are planning on carrying the car wash equipment back to your driveway…” |
| Google | Gemini 3 Thinking | Pass | Playfully snarky. Calls it “the ultimate efficiency paradox.” Asks multiple-choice follow-up about user’s goals. |
| Google | Gemini 3 Pro | Pass | Clean two-sentence answer. “If you walk, the vehicle will remain dirty at its starting location.” |
| Anthropic | Claude Haiku 4.5 | Fail | “You should definitely walk.” Same failure pattern as smaller models. |
| Anthropic | Claude Sonnet 4.5 | Pass | “You should drive your car there!” Acknowledges the irony of driving 100 meters. |
| Anthropic | Claude Opus 4.6 | Pass | Instant, confident. “Drive it! The whole point is to get your car washed, so it needs to be there.” |

The ChatGPT 5.2 Pro case is the most revealing failure of the bunch. This model didn’t lack reasoning ability. It explicitly noted that the vehicle needs to be present at the car wash. It wrote it down. It considered it. And then it walked right past its own correct analysis and defaulted to the statistical prior anyway. The reasoning was present; the conclusion simply didn’t follow. If that doesn’t make you pause, it should.
For those interested in the technical layer underneath, this test exposes a fundamental tension in how modern AI models work: the pull between pre-training distributions and RL-trained reasoning.
Pre-training creates strong statistical priors from internet text. When a model has seen thousands of examples where “short distance” leads to “just walk,” that prior becomes deeply embedded in the model’s weights. Reinforcement learning from human feedback (RLHF) and chain-of-thought prompting are supposed to provide a reasoning layer that can override those priors when they conflict with logic. But this test shows that the override doesn’t always engage.
The prior here is exceptionally strong. Nearly all “short distance, walk or drive” content on the internet says walk. The logical step required to break free of that prior is subtle: you have to re-interpret what the “object” in the scenario actually is. The car isn’t just transport. It’s the patient. It’s the thing that needs to go to the doctor. Missing that re-framing means the model never even realizes there’s a conflict between its prior and the correct answer.
Why might Gemini have swept 3/3? We can only speculate. It could be a different training data mix, a different weighting in RLHF tuning that emphasizes practical and physical reasoning, or architectural differences in how reasoning interacts with priors. We can’t know for sure without access to the training details. But the 3/3 vs 0/3 split between Google and OpenAI is too clean to ignore.
The ChatGPT 5.2 Thinking model’s recovery when challenged is worth noting too. When I followed up with “How will I get my car washed if I am walking?”, the model immediately course-corrected. It didn’t struggle. It didn’t hedge. It just got it right. This tells us the reasoning capability absolutely exists within the model. It just doesn’t activate on the first pass without that additional context nudge. The model needs to be told that its pattern-matched answer is wrong before it engages the deeper reasoning that was available all along.
I want to be clear about something: these tests aren’t about dunking on AI. I’m not here to point and laugh. The same GPT 5.2 Pro that couldn’t figure out the car wash question contributed to a genuine quantum physics breakthrough. These models are extraordinarily powerful tools that are already changing how research, engineering, and creative work get done. I believe in that potential deeply.
(The original post attached screenshots of the test conversations.)
--- TOP COMMENTS --- If you want to test it properly, you have to run it 20 times in separate chats for each model. LLMs are non-deterministic so you will get a different answer each time, and you might have gotten an uncommon one by chance.
I found GPT-5.2 Thinking can get it right, at least the two times I’ve tried. Instant fails every time. But so did Gemini Instant for me.
Non-profit, community-driven coding model ranking - useful or naive?
I’ve been thinking a lot about trust in AI coding model benchmarks. The space moves incredibly fast - new models seem to come out almost daily - and early on the only signals we really get are technical benchmark scores and AI bro/influencer impressions. Many developers (myself included) are skeptical of both.
I'm trying to build a non-profit site combining:
Also, keeping methodology open so people can challenge and improve it.
Would love input from this sub generally on the idea. What would make you trust this enough to use it for tool decisions?
--- TOP COMMENTS --- EVERYTHING is driven by revenue. You said a non-profit site, but that is a specific legal term... it doesn't mean "volunteer." How will running expenses get paid? How will you provide moderation, to keep from being swamped by fake inputs/feedback? How will people know it's there, to use it?
Since the car wash test is so popular right now...
It's a good time to revisit SimpleBench. It is basically full of questions like that, and all models are currently below the human baseline, which is 83%. It's one of my favorite benchmarks. https://epoch.ai/benchmarks/simplebench
--- TOP COMMENTS ---
Lmao. How can people write this non-ironically
Why is opus 4.6 non-thinking? Also, I wonder how DeepThink performs on this.
[D] Advice on sequential recommendations architectures
I've tried to use a Transformer decoder architecture to model a sequence of user actions. Unlike an item_id paradigm where each interaction is described by the id of the item the user interacted with, I need to express the interaction through a series of attributes.
For example "user clicked on a red button on the top left of the screen showing the word Hello", which today I'm tokenizing as something like [BOS][action:click][what:red_button][location:top_left][text:hello]. I concatenate a series of interactions together, add a few time gap tokens, and then use standard CE to learn the sequential patterns and predict some key action (like a purchase 7 days in the future). I measure success with a recall@k metric.
I've tried a bunch of architectures framed around GPT-2, from standard next-token prediction, to weighting the down-funnel actions more heavily, to contrastive heads, but I can hardly move the needle compared to naive baselines (i.e. the user will buy whatever they clicked on the most).
Is there any particular architecture that is a natural fit to the problem I'm describing?
--- TOP COMMENTS --- I would step back and first identify whether there are any useful sequential patterns, e.g., 2-step patterns.
Maybe the sequence info is just not useful?
Fwiw, RecSys 2025 had a competition doing sequence modelling.
You might find the winners' papers helpful.
This sounds less like an architecture problem and more like a representation/objective mismatch. Flattening attributes into tokens makes the model learn token statistics instead of user behavior. Many sequential recommender setups work better with event level embeddings + encoder style models (e.g., SASRec) and a ranking loss, rather than GPT style next token prediction. If a simple frequency baseline is strong, the available signal may also be mostly short term preference.
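For what that suggestion could look like in code, here's a PyTorch sketch; it's not a drop-in SASRec, and the vocabulary sizes, dimensions, and the BPR loss are illustrative:

```python
import torch
import torch.nn as nn

# Sketch, not a drop-in SASRec: embed each event by summing its attribute
# embeddings, encode the sequence, score candidate items with a BPR-style
# pairwise ranking loss. All sizes are illustrative.

class EventEncoder(nn.Module):
    def __init__(self, n_attr_values=1000, n_items=5000, d=64, max_len=50):
        super().__init__()
        self.attr_emb = nn.Embedding(n_attr_values, d)   # shared attribute table
        self.pos_emb = nn.Embedding(max_len, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.item_emb = nn.Embedding(n_items, d)         # candidate item table

    def forward(self, attrs):                # attrs: (batch, seq, attrs_per_event)
        ev = self.attr_emb(attrs).sum(dim=2) # one vector per *event*, not per token
        pos = self.pos_emb(torch.arange(ev.size(1), device=ev.device))
        return self.encoder(ev + pos)[:, -1] # user state after the last event

def bpr_loss(user, pos_items, neg_items, model):
    pos = (user * model.item_emb(pos_items)).sum(-1)
    neg = (user * model.item_emb(neg_items)).sum(-1)
    return -torch.log(torch.sigmoid(pos - neg)).mean()

model = EventEncoder()
attrs = torch.randint(0, 1000, (8, 50, 4))   # 8 users, 50 events, 4 attrs each
user = model(attrs)
loss = bpr_loss(user, torch.randint(0, 5000, (8,)),
                torch.randint(0, 5000, (8,)), model)
loss.backward()
```

The key move is the `sum(dim=2)`: the model sees one vector per event instead of a flat stream of attribute tokens, so the sequence it learns over is the user's behavior, not token statistics.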
Scaling LLMs won't get us to AGI. Here's why.
Been thinking about whether more training/compute will get us to AGI, or if we need a fundamentally different architecture. I'm convinced it's the latter.
Current transformer architecture is a glorified pattern matcher. It was literally created to translate languages. We've scaled it up, added RLHF, made it chat — but at its core, it's still doing statistical pattern matching over sequences.
When Ramanujan came up with his formulas, when Gödel proved incompleteness, when Cantor invented set theory — these weren't in any training distribution. There was no historical precedent to pattern-match against. These required *seeing structure that didn't exist yet*.
LLMs can interpolate brilliantly within their training data. They cannot extrapolate to genuinely novel structures. That's the difference between pattern matching and understanding.
If I ask an LLM for business ideas, it'll suggest things that match my statistical profile — I'm a tech professional, so it'll say SaaS, consulting, AI tools. Plumbing? Probably not on the list.
But I'm a general-purpose agent. I can decide tomorrow to learn plumbing and start a plumbing business. The LLM sees the shadow of who I've been. I have access to the space of who I could become.
LLMs reason over P(outcome | observable profile). Humans reason over possibility space, not probability space. Completely different.
We need architectures that can:
- Build causal models of the world (not just statistical associations)
- Learn from minimal examples (a kid learns "dog" from 3 examples, not millions)
- Reason about novel structures that don't exist in training data
- Model agency — the ability of entities to change themselves
Scaling transformers won't get us there. It's like building a really good horse and hoping it becomes a car.
Curious what others think. Am I missing something, or is the current hype around scaling fundamentally misguided?
--- TOP COMMENTS --- Transformers =/= LLMs. People really need to stop using them as synonyms. Transformers aren't just used in LLMs. They're used in many DL models, including non-generative models like prediction-based world models. Not all LLMs are autoregressive models that are pre-trained on next-token prediction. Diffusion LLMs exist, as an example. And most also use transformers, with different objectives in pre-training.
So yes, most auto-regressive (and most non-auto regressive) LLM's use transformers. So do world models like VL-JEPA (https://arxiv.org/abs/2512.10942 .) So do encoder-only models pre-trained on masked-token prediction and/or next sentence prediction.
Human like AI seems to be on a trajectory to arise from a combination of deep learning, reinforcement learning (RLHF isn't the only RL being done), and maybe some very flexible symbolic system(s.) Maybe something else is needed as well, like embodiment. We have no certain idea. Transformers are useful architectures in a variety of DL applications because they are defined by the self-attention mechanism, and probably will have a place in that, unless they're superseded by a better performing and more robust NN architecture by the time human-like AI is achieved.
As it is now, almost all world models (which try to capture the casual relationships you're suggesting is necessary) have transformer-based components to a greater system (again note the VL-JEPA example, which has transformer-based components.)
So my background is in semiconductor manufacture.
I won't claim ANY knowledge outside of hardware, and hardware systems. You can debate other folks for that.
I absolutely agree with you. Just a different road to get there.
I've SEEN the basic semiconductor pattern. Checked it for fidelity. Operated the Test machines that punch out bad sectors.
...it's not JUST pattern matching. It's pattern matching AND computing in arrays, AND transfer protocols for fidelity.... because at the core it's just pattern matching, and doing fancy things with it requires... well... scaling up.
Pattern matching stacked on pattern matching.
We've been doing it for a while.
So when LLM models started coming onto the scene... it seemed clear to me that it wasn't going to go all the way to AGI. I'd argue that (and I know people take umbrage with this term) we won't achieve "true AI", much less AGI, without a new architecture.
And I'd say the data and patterns across companies across the world supports that.
Everyone talks about "AI chips", which frankly are just Commercial Research chips, best I can tell. Large arrays of potential processing without any of the more specific architecture. High-fidelity chips can be sold at 100x their normal cost, and they're utterly useless for normal products... they're just 'liquid computing power', but have to be programmed on a very fundamental level.
So while I'm guessing there is more nuance to it than when I was in the Business, these aren't new. We used to call them "Super Computers"... that couldn't run an OS or program to save their lives.
Pure computing.
But circling back to LLMs, they aren't, are they?
They're running on several layers of programs and UI, made to be user friendly.
I can only conclude those chips aren't for RUNNING the LLMs, but are for backbone hardware.... or my theory...?
Iteration.
Hypothetical - You are convinced AI is possible. The race has started. You've done what you can with traditional computing... and now you're at the Polish stage. LLMs are... pretty damn nice. As nice as they are likely to get. Now it's efficient, etc.
.... but you still haven't reached AI-level.
So how do you come up with a 5-year or 10-year plan?
Because THAT is the time scale the companies that manufacture chips operate on.
Even if you HAVE an architecture and manufacturing plan, it takes months to run one single process.... and months or years to dial in the machines.
Usually we would just say 'a decade' from concept to finishing a manufacturing run for a client.
I'm told it has been reduced to 6-7 years. That sounds plausible.
So how do you, an AI startup with big dreams and an LLM that is successful... put in a novel order for a new architecture of chips?
Well.... you don't.
That's where it gets tricky.
Mostly we iterate architecture. Not create it from scratch. So Intel has a new processor coming out every year or two for the next 20 years, and that's already in the manufacturing process. Some are just to test or dial-in the recipe at a given fab. Some are for outside testing.
The whole chip industry exists in this slow, steady, creeping crawl.
You don't just... put in an order for an AI chip, or a new architecture. Hell, you don't even make one... you generally request that THEY make it for you. And they guard the architecture jealously.
So something this big, this quick, with this short of a turnaround?
You'd need a stupendous amount of power, leverage, and money to just... make a new architecture, and have it manufactured for you. Tens or hundreds of billions of dollars, and years of time. Real floor-sagging, earth-shattering amounts of time and money.
Only thing I can think of that could do something like that.... would be those megawatt Data Centers that we've been using for Cloud Computing, but we keep talking about using for AI research.
They never quite say WHAT exactly they're using it for.
Always gets real vague as soon as we're talking about hardware.
... and since software and programming hasn't created the Singularity despite an almost unheard-of amount of human effort poured into it...?
That leaves the hardware.
The architecture.
Iterate the LLMs to keep people engaged, interested, and investing. You'll need it later, so this isn't some big loss, or just spinning wheels.
Court a chip manufacturer, front them an absurd amount of money for Commercial chips for traditional 'supercomputing', and feed them as much power as possible.
Bend every resource to turn a 10-year architecture project into a couple of 6-year projects, with maybe a 3-year overlap.
That gives you 3 years to juggle LLMs.
Then 3 years for early proto-AI on your new architecture.
Then 3 years for refinements and your first commercially viable, reasonably efficient version.
... and by then the software side is almost a decade old, and folks are chomping at the bit to get working on it.
..... so yeah.
........ architecture.
Not just because they want to, or have to, but because most people don't respect the amount of time and effort that goes into the chips we use for... everything. And making new ones for ANY reason, much less a novel AI memory or processing version....
.... it just takes time and money. Lots of it.
Disclaimer - Am not a Doctor, but I do play one on TV, and I stayed in a Holiday Inn.
Tutorials
DIY-LLM training on "orthogonal" corpora
Had to spend a day traveling so I wrote a basic LLM from scratch: a single-layer, decoder-only transformer that uses byte-pair encoding (BPE) for its vocabulary (you'll see later why that matters), with causal masked self-attention for context and layer normalization for stability. It was trained via stochastic gradient descent. Took me about five hours to write and about 20 minutes to train.
Now for the fun part. I've trained it on a concatenation of the Bible (ASV) and preliminary draft of C++ programming language specification (early draft of C++26). I am trying to decide if I want to call it "The Sacred Standard" or "B++" :)
On a more scientific note, I was interested in how linguistic idiosyncrasies in the two corpora would influence the results. As you can imagine, the resulting model is very dumb, but the hallucinations are kinda great. So I created a bunch of adversarial(ish) prompts, and the results did not disappoint:
Best in class: "The implementation shall not commit adultery" and "Thou shalt be of type int"
Best in class: "And if any man shall take away from the words of this book, it results in undefined behavior."
Best in class: "The son of std::david was "
Prompt: "The implementation shall" the implementation shall be not be used identity requires be used lord jehovah said unto you type value
Prompt: "Thou shalt not use" thou shalt not use the king and to the same as follows a reference wrapper ranges
Prompt: "And God said, let there be a" and god said let there be a function parameter declaration clause
Accidentally posted this in LocalLLaMA first, but it would be interesting to discuss it here
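For anyone who wants to try the same exercise, here's a minimal sketch in the spirit of the post, assuming PyTorch; the author's actual code may differ, and BPE tokenization is omitted (random ids stand in for encoded text):

```python
import torch
import torch.nn as nn

# Minimal sketch in the spirit of the post: single-layer decoder-only
# transformer, causal masked self-attention, layer norm, plain SGD on
# next-token cross-entropy. Sizes are illustrative; BPE is out of scope.

class TinyDecoder(nn.Module):
    def __init__(self, vocab=8000, d=128, n_heads=4, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(max_len, d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ln2 = nn.LayerNorm(d)
        self.out = nn.Linear(d, vocab)

    def forward(self, ids):
        T = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=ids.device), 1)
        a, _ = self.attn(x, x, x, attn_mask=causal)  # True = masked (no peeking)
        x = self.ln1(x + a)
        x = self.ln2(x + self.ff(x))
        return self.out(x)

model = TinyDecoder()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ids = torch.randint(0, 8000, (4, 64))                # stand-in for BPE-encoded text
logits = model(ids[:, :-1])                          # predict token t+1 from <= t
loss = nn.functional.cross_entropy(logits.reshape(-1, 8000), ids[:, 1:].reshape(-1))
loss.backward()
opt.step()
```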
--- TOP COMMENTS --- Very cool! What could you do to improve upon this? What sort of hardware did you use?
Data? Code? Formatted analysis? Evidence? Anything? Hard to discuss as I have so many questions.
Hardware
Unitree Martial arts robots dazzle at 2026 Spring Festival Gala - YouTube
--- TOP COMMENTS --- nice performance, the choreography is quite impressive
Incredible. They really followed through with the theme of harmony between humans and robots. The drunken boxing part is really impressive: the robot faked a fall to the ground (making you think it had malfunctioned), but then immediately twisted its body and rose again.