Best practices I've learned after shipping multiple iOS apps with Claude Code

Hey everyone,

Wanted to share something that's been on my mind lately. I've been using Claude Code pretty heavily over the past few months to build and ship iOS apps. It's genuinely changed how I approach development. The speed and capability are remarkable.

But here's the thing I've realized along the way, specifically given my background in cybersecurity.

When you're building with AI assistance, it's incredibly easy to get something working and just move on with the mindset of: the code compiles, the feature works, ship it. And before you know it, you've got an app held together by duct tape and good intentions.
AI doesn’t automatically enforce good practices. It gives you what you ask for. If you don’t ask for proper error handling, you won’t get it. If you don’t think about environment separation, neither will it. The speed that makes AI-assisted coding so powerful is the same speed that lets technical debt pile up silently.
So over time, I’ve developed some habits that keep me honest:
I always think about where my secrets live. Not hardcoded, not committed to git, properly separated between dev and prod environments. And yes, that means different API tokens for each environment, not the same one copy pasted everywhere.
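To make that habit concrete, here's a minimal sketch in Python (the `MYAPP_API_TOKEN_*` naming scheme is my own illustration, not something from the post): each environment gets its own token, pulled from the environment at runtime, and a missing secret fails loudly instead of silently falling back to another environment's value.

```python
import os

def api_token(environment: str) -> str:
    """Look up the API token for a given environment (dev/staging/prod).

    Tokens live in environment variables like MYAPP_API_TOKEN_DEV,
    never in source control, and each environment gets its own value.
    """
    var = f"MYAPP_API_TOKEN_{environment.upper()}"
    token = os.environ.get(var)
    if token is None:
        # Fail fast instead of silently reusing another environment's token.
        raise RuntimeError(f"missing secret: set {var}")
    return token
```

In CI or on the server, the variables come from the platform's secret store; locally they come from an untracked `.env`-style file.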
I make sure anything I build has some form of observability. Crash reporting from day one, not after the first angry user review. Actual logging that persists somewhere, not just my terminal history. A simple /health endpoint so you can actually check if your service is alive without hitting the homepage and hoping for the best.
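A /health endpoint really can be this small. Here's a stdlib-only Python sketch (the endpoint path and payload shape are illustrative assumptions, not a standard):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def health_payload() -> dict:
    """What /health returns: just enough to confirm the process is alive."""
    return {"status": "ok"}

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal handler exposing /health next to whatever else the app serves."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps(health_payload()).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet; real logging goes elsewhere

# To serve it: HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()
```

Then `curl localhost:8080/health` tells you (and your uptime monitor) the process is alive without touching the rest of the app.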
I wrap external services properly. When you’re calling third-party APIs, take the extra time to create a clean service layer. Your future self will thank you when you need to add caching or swap providers. This is also where you add rate limiting on auth and write operations, not as an afterthought after you get hit.
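Here's one way that service layer can look, sketched in Python around a hypothetical weather API (the class, endpoint, and rate-limit policy are all made up for illustration). The transport is injected, so caching, rate limiting, and provider swaps all live in one place:

```python
import time

class WeatherService:
    """Thin wrapper around a hypothetical third-party weather API.

    Every call site goes through this class, so adding caching,
    rate limiting, or swapping providers touches one file, not many.
    """

    def __init__(self, transport, min_interval: float = 1.0):
        self._transport = transport        # injected: real HTTP client, or a fake in tests
        self._min_interval = min_interval  # crude rate limit between live calls
        self._last_call = 0.0
        self._cache: dict[str, dict] = {}

    def forecast(self, city: str) -> dict:
        if city in self._cache:            # cache first: no network, no rate-limit cost
            return self._cache[city]
        now = time.monotonic()
        if now - self._last_call < self._min_interval:
            raise RuntimeError("rate limited: slow down")
        self._last_call = now
        result = self._transport(f"/v1/forecast?city={city}")
        self._cache[city] = result
        return result
```

Because the transport is a plain callable, tests can pass a fake and never hit the real provider.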
I validate input on the server side. Never trust what the client sends. AI generated code often focuses on the happy path and assumes clean data. Real users and bad actors don’t play that nice.
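A sketch of what "never trust the client" means in practice (Python; the field names and limits are invented for the example). The point is that every field is checked for type and range before anything downstream sees it:

```python
def validate_signup(payload: dict) -> dict:
    """Server-side validation: never trust the client's JSON.

    Returns a cleaned copy, or raises ValueError listing every problem.
    """
    errors = []
    username = payload.get("username")
    if not isinstance(username, str) or not (3 <= len(username.strip()) <= 32):
        errors.append("username must be a string of 3-32 characters")
    age = payload.get("age")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(age, int) or isinstance(age, bool) or not (13 <= age <= 120):
        errors.append("age must be an integer between 13 and 120")
    if errors:
        raise ValueError("; ".join(errors))
    return {"username": username.strip(), "age": age}
```

Collecting all errors before raising gives the client one actionable response instead of a fix-resubmit loop.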
I think about architecture before I’m knee deep in spaghetti. It’s tempting to let one massive view controller own the entire screen. Break things up early. Same goes for database schema changes. Version them through proper migrations instead of keeping track in your head.
I set up a real staging environment. Not just “dev” and “prod-ish” but an actual staging that mirrors production. And CORS gets set to specific origins, not * just because it fixed the error that one time.
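The CORS point boils down to a plain allowlist check (Python sketch; the origins are placeholders). The key move is echoing back a known origin instead of ever sending `*`:

```python
ALLOWED_ORIGINS = {                 # explicit allowlist, one entry per environment
    "https://app.example.com",
    "https://staging.example.com",
}

def cors_headers(request_origin):
    """Return CORS headers only for allowlisted origins; never send '*'."""
    if request_origin in ALLOWED_ORIGINS:
        return {
            "Access-Control-Allow-Origin": request_origin,  # echo, don't wildcard
            "Vary": "Origin",  # tell caches the response depends on the origin
        }
    return {}  # unknown origin: no CORS headers, the browser blocks the read
```

Wildcards also silently break the moment you need credentialed requests, which is another reason to start with the allowlist.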
I document how to run and deploy things. This one bit me hard. You’d be amazed how quickly you forget your own setup when you context switch between projects. If you’re the only person who knows how to deploy the app, that’s a problem waiting to happen.
I set up CI/CD early. CI means automated testing, CD means automated deployment. If you're new to CI/CD, read up on it; I barely had a clue at first either, and it doesn't have to be fancy. But "I ran it locally and it worked" is not a deployment strategy. And deploys should come from the pipeline, not from your laptop with some random script.
I resist the “I’ll clean this up later” voice in my head. If something feels hacky, I either fix it now or create an actual ticket with a deadline. “Later” never comes. Same goes for feature flags. If your feature flag system is commenting code in and out, that’s not a system.
I test the unhappy paths. AI generated code tends to handle the sunny day scenario beautifully. But what happens when the network fails? When the API returns something unexpected? Those edge cases need intentional attention. And those “automatic” backups? Test a restore at least once. You don’t want your first restore attempt to be during an actual emergency.
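Here's what covering the unhappy path looks like for a single parsing step (Python; the response shape is a made-up example): the function has an answer for bad bytes, bad JSON, and bad types, not just the sunny-day payload.

```python
import json

def parse_forecast(raw: bytes) -> dict:
    """Defensive parsing of an external API response (hypothetical shape).

    The happy path is valid JSON with a numeric 'temp' field; every
    other shape yields a structured error instead of an exception.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return {"ok": False, "error": "invalid json"}
    temp = data.get("temp") if isinstance(data, dict) else None
    if not isinstance(temp, (int, float)) or isinstance(temp, bool):
        return {"ok": False, "error": "missing or invalid temp"}
    return {"ok": True, "temp": float(temp)}
```

Each error branch is one line to write and one assertion to test, which is exactly the kind of case AI-generated code tends to skip unless asked.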
I’m mindful about time handling. Store everything in UTC, convert to local time on display. Mixing UTC, local time, and whatever JavaScript decided to default to is a debugging nightmare you don’t want.
The mental model that works for me: Claude Code is a brilliant junior developer who can write code faster than anyone I’ve ever seen. But like any junior dev, it needs guidance on architecture decisions, security practices, and long-term maintainability.
The senior engineer is still you.
I’m genuinely excited about where AI assisted development is going. It’s lowered barriers in ways I never expected. But the fundamentals still matter. Maybe even more now, because the cost of skipping them is hidden by how fast everything else moves.
One more thing. If you find this useful, you can actually feed this post to Claude Code at the start of your project. Just paste it into your CLAUDE.md file or include it in your initial prompt. That way Claude will align its suggestions around these practices from the beginning instead of you having to catch things later. Prevention beats cleanup every time.
Would love to hear what practices others have developed. What’s helped you ship quality stuff without getting buried in tech debts?
--- TOP COMMENTS ---
Thanks for the tips. I just added some of them to my .md file. Before, it was just the "no quick fixes" section:
# Global Claude Instructions
These rules apply to every project.
## Root Cause
No quick fixes. Always diagnose to the root cause and devise proper solutions. Never apply patches or workarounds unless the user explicitly asks.
---
## Security & Secrets
- Never hardcode secrets or commit them to git
- Use separate API tokens/credentials for dev, staging, and prod environments
- Validate all input server-side — never trust client data
- Add rate limiting on auth and write operations
## Architecture & Code Quality
- Design architecture before building — don't let it emerge from spaghetti
- Break up large view controllers/components early
- Wrap external API calls in a clean service layer (easier to cache, swap, or extend later)
- Version database schema changes through proper migrations
- Use real feature flags, not commented-out code
## Observability
- Add crash reporting from day one
- Implement persistent logging (not just console output)
- Include a `/health` endpoint for every service
## Environments & Deployment
- Maintain a real staging environment that mirrors production
- Set CORS to specific origins, never `*`
- Set up CI/CD early — deploys come from the pipeline, not a laptop
- Document how to run, build, and deploy the project
## Testing & Resilience
- Test unhappy paths: network failures, unexpected API responses, malformed data
- Test backup restores at least once — don't wait for an emergency
- Don't assume the happy path is sufficient
## Time Handling
- Store all timestamps in UTC
- Convert to local time only on display
## Discipline
- Fix hacky code now or create a tracked ticket with a deadline — "later" never comes
- Don't skip fundamentals just because the code compiles and runs
How surprising to see actual great content here and not advertisement slop. Your points are spot on. I've been developing for over two decades and architecting major systems for the past 10+ years, and what you wrote is right.
I'd add (not based on iOS experience) that you should always maintain boundaries from both sides. If you ask AI to help you architect solution because you want it done perfectly it is likely it'll overengineer it. Complexity kills projects silently because it compounds over time. Your next refactor is going to be much more complex, your tests will increase 10x and surface for bugs will be massive. You really don't need minio storage for avatars, or asynchronious queues for non-critical messages. It's also fine to not go after all edge cases if it makes the software to complex - there's a difference between supporting and handling it. You don't need to support all edge cases, just handle it gracefully.
So I always have a review step for every project Claude touches: I ask it to review the architectural choices and make sure the architecture is clean and simple.
One old dev once told me that building software should be like building a fighter jet: just enough to complete the mission and come back home safe, not like building a passenger jet with all the amenities and redundancy. It took me a few years and a few failed projects to understand why.
Models
A new GPT Pro model appears to be in testing on the web.
Qwen 3.5 27b: a testament to the transformer architecture
It's really good. I thought an early warning sign that the transformer architecture might have hard limits would be if these tiny models stopped being able to keep up with the large ones. And to some degree this seemed to be the case, at least at times. But we didn't get much between the Qwen3 2507 models and now that strongly suggested otherwise.
But qwen 3.5 27b... damn! It's passing my reasoning and knowledge tests roughly at the level of R1 0528. Crazy. Makes me want to buy tech stocks... or a bunker.
Fasten your seatbelt, the roller coaster is just getting started.
Also, this model is ripe for finetunes! Qwen only lacks in personality.
--- TOP COMMENTS --- Ironically, it's not actually using a full transformer architecture; 75% of the layers are using Gated DeltaNet linear attention.
These models also take instructions so much better than previous models. A good system prompt can inject a lot of personality too.
Open Source
[P] Vera: a programming language designed for LLMs to write
I've built a programming language whose intended users are language models, not people. The compiler works end-to-end and it's MIT-licensed.
Models have become dramatically better at programming over the last few months, but a significant part of that improvement is coming from the tooling and architectures around them: agentic loops, structured feedback from linters and test runners, the ability to iterate rather than generate in a single shot. The models got somewhat better; the scaffolding got a lot better.
That's the observation Vera is built on. If the gains are coming from tighter feedback loops between the model and its environment, then a language whose compiler is explicitly designed to be part of that loop should work with the grain.
The core problem Vera addresses isn't syntax, it's coherence over scale. Models struggle with maintaining invariants across a codebase, understanding ripple effects of changes, reasoning about state over time. They're pattern matchers optimising for local plausibility, not architects holding the system in mind. Vera is designed so that the model doesn't need to be right. It needs to be checkable.
The key design decisions:
— No variable names. References use typed De Bruijn indices, resolved structurally rather than by string matching. This eliminates naming coherence errors — and when the model does use the wrong index, the type system and contracts catch it mechanically, unlike wrong names which tend to silently type-check.
— Mandatory function contracts (preconditions, postconditions, effects) verified by Z3 SMT solver. Division by zero isn't a runtime error, it's a type error. The model doesn't need to remember edge cases — the compiler enforces the obligation.
— One canonical representation per construct. No formatting choices, no equivalent expressions. Two models writing the same function should produce identical code.
— Compiler diagnostics structured as natural-language fix instructions with concrete code examples, designed to be fed back to the model as corrective context. This closes the write-check-fix loop that agentic code generation depends on.
— Pure by default. All effects typed and tracked. No hidden mutation.
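For readers unfamiliar with the first design decision above, here's a tiny Python sketch of De Bruijn indexing in general (my illustration of the technique, not Vera's actual representation): a variable is just an integer counting how many binders out it refers to, so there are no names to get wrong.

```python
# A lambda-calculus term with De Bruijn indices: ("var", n) refers to the
# binder n levels out (0 = nearest), ("lam", body) binds one variable,
# ("app", f, x) is application. No names are stored at all.

def to_named(term, env=None):
    """Render a De Bruijn term with generated names, purely for human eyes."""
    env = env or []
    kind = term[0]
    if kind == "var":
        return env[term[1]]                      # index -> generated name
    if kind == "lam":
        name = f"x{len(env)}"                    # fresh name for this binder
        return f"(lam {name}. {to_named(term[1], [name] + env)})"
    if kind == "app":
        return f"({to_named(term[1], env)} {to_named(term[2], env)})"
    raise ValueError(f"unknown node: {kind}")

# K = lam x. lam y. x  -- the inner var skips one binder, hence index 1
K = ("lam", ("lam", ("var", 1)))
```

Two models emitting the same term structure produce byte-identical code, and a wrong index is a structural error the checker can catch, rather than a plausible-looking wrong name.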
The pipeline compiles to WebAssembly via wasmtime. The test suite has 1,287 tests at 88% code coverage. The compiler ships with agent-facing documentation designed to be dropped directly into a model's context window, so the model works from a specification in-context rather than relying on training-data recall.
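To illustrate what contracts like "requires" and "ensures" mean in general (Vera proves them statically with Z3; this Python sketch only checks them at runtime, and the decorator and names are mine, not Vera's):

```python
import functools

def contract(requires=None, ensures=None):
    """Runtime sketch of pre/postconditions on a function.

    requires(*args) must hold before the call; ensures(result, *args)
    must hold after. A static verifier discharges these at compile time
    instead of raising at runtime, but the obligations are the same.
    """
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args):
            if requires is not None and not requires(*args):
                raise AssertionError(f"{fn.__name__}: precondition failed")
            result = fn(*args)
            if ensures is not None and not ensures(result, *args):
                raise AssertionError(f"{fn.__name__}: postcondition failed")
            return result
        return inner
    return wrap

@contract(requires=lambda a, b: b != 0,                   # division by zero rejected up front
          ensures=lambda r, a, b: abs(r * b - a) < 1e-6)  # result must actually divide a
def divide(a, b):
    return a / b
```

Under this reading, `requires(true)` (mentioned in the comments below) is the trivial precondition: it admits every input and therefore obligates the function to handle all of them.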
But, bear in mind, the really interesting experiment hasn't been run yet. Nobody has systematically measured whether models produce more reliable code in Vera than in existing languages.
My current thought is that fluency (which training data gives you) is the wrong metric, reliability (which verification gives you) matters more. But right now that's just a thought experiment. The infrastructure exists to test it. The data doesn't.
If you want to try pointing a model at Vera, the agent-facing docs are designed for exactly that. It's early days, but the language is in active development, and hopefully heading in the right direction.
Site: veralang.dev
GitHub: github.com/aallan/vera
Full writeup: https://negroniventurestudios.com/2026/02/28/a-language-designed-for-machines-to-write/
--- TOP COMMENTS --- making it deliberately harder for a human to check an agent's generated code, what could go wrong..
So to use the language the LLM must load skills.md first? So it's an automatic context window penalty.
I don't think
helps LLMs reason about code at all.
The main currency is context management. This is why "agentic" AI is happening (to manage context, not to create anthropomorphism).
So your language should seek to reduce context usage, not increase it. One possibility is to force the file/module structure to be smaller, and optimize for context management/io by always colocating entities that must be loaded together.
What do "Requires" and "ensures" do?
requires(true) is opaque.

Open-source AI Gateway (multi-LLM routing), looking for technical feedback
Hey everyone,
I’m building an open-source AI Gateway focused on multi-provider LLM routing, unified APIs, rate limiting, Guardrails, PII and usage tracking for production workloads.
I'd really appreciate feedback from engineers building with LLMs in real systems, especially around architecture, tradeoffs, and missing features.
Repo: https://github.com/ferro-labs/ai-gateway
Honest criticism is welcome. If it’s useful, a ⭐ helps visibility.
--- TOP COMMENTS --- been down this rabbit hole recently and one thing that starts hurting fast isn’t routing itself but actually understanding why a provider decision failed mid flow
a lot of gateways solve switching and rate limits but debugging cross model behavior becomes messy once retries and fallbacks stack up
what helped me was treating executions as replayable runs instead of just logs. tools like LangSmith or Runable made it way easier to step through agent or gateway decisions and see where latency spikes or reasoning drift actually started instead of guessing from traces
also worth thinking about separation between routing policy and evaluation feedback. most gateways mix them early and it gets hard to evolve strategies later
overall direction looks solid though. multi provider infra feels less like api management now and more like runtime orchestration honestly
congrats on building yet another abstraction layer between you and the thing you actually want to use. what's the latency hit look like compared to just calling the api directly?
Unsloth fixed version of Qwen3.5-35B-A3B is incredible at research tasks.
When I first tried Qwen3.5-35B-A3B I was impressed, but honestly it seemed like a small jump over GLM-4.7-Flash, which had already impressed me with its interleaved thinking and native tool use capabilities. Qwen3.5-35B-A3B was about the level of "better" I thought it would be from having 5B extra parameters, and I thought the only big advantage was hybrid linear attention allowing double the native context length without really increasing memory footprint.
I saw today that Unsloth updated Qwen3.5-35B-A3B quants to fix tool calling issues: https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF so I redownloaded Qwen3.5-35B-A3B at UD-Q8_K_XL to run it on my strix halo system. I happened to be looking for solutions to a problem I had, and asked Gemini, ChatGPT, Deepseek, GLM, Kimi, Qwen, and Perplexity all to help me research the options that are available. They each had different answers, ranging from different solutions found to different final recommendations. Out of curiosity, I asked Qwen3.5-35B to give it a shot. I expected it to be horribly slow and unable to adequately summarize its findings or give a good recommendation.
I'm running the model via llama.cpp-rocm with the following parameters: --ctx-size 262144 --n-gpu-layers 999 --flash-attn auto --jinja --temp 0.6 --top-p 0.90 --top-k 20 --min-p 0.00. I'm using it through OpenWebUI with native tool calling, SearXNG search, and the default web loader engine. My system prompt is this (still leftover from when I used GLM-4.7-Flash, which would occasionally forget about the fetch_url tool completely):
My question was this:
Throughout almost 30k tokens combined, the model maintained prompt processing speeds of 600+ t/s and token generation speeds of 25-30 t/s. It performed 14 web searches and 4 full page fetches (to me this was a fairly surprising balance; I thought it would either overuse fetch_url because it doesn't understand the impact on context length, or forget about fetch_url like GLM-4.7-Flash did).
It gave me the following response, which I honestly would not have found to be out of place next to all the frontier models I had already asked, since it was able to find almost all of the same options and even gave similar recommendations for fixes/workarounds. I do think it should have given Sunshine+Moonlight a stronger recommendation (which several of the frontier models recommended) given that I'd already said krdp doesn't perform well enough, but I understand why it's cautious given that I did request RDP or something else through Windows, not an entirely separate piece of software that needs separate setup (some of the frontier models dismissed Sunshine+Moonlight for this reason as well):
----
Based on my research, I need to be transparent with you: this is a challenging problem. The combination of your requirements (KDE, Wayland, headless operation, existing session, no auto-login) is currently not fully supported by any single solution in 2025/2026. However, there are several options with varying degrees of success.
Summary of Your Situation
The fundamental issue is that Wayland was not designed for remote desktop access. Unlike X11, Wayland doesn't expose a global screen that can be captured remotely without explicit permission from the compositor. This is a security feature, but it creates significant challenges for headless remote access.
Available Options
Option 1: KRdp (KDE's Built-in RDP Server) - Best Overall for Your Setup
KRdp is KDE's native RDP server for Wayland sessions and is the most integrated solution for your setup.
Pros:
Cons & Workarounds:
Workarounds for your setup:
Option 2: GNOME Remote Desktop - Best for Headless Operation
If you could switch to GNOME (which you said you don't want to), this would be ideal.
Pros:
Cons:
Option 3: Sunshine + Moonlight - Best Performance
Originally for game streaming, but works well for general desktop use.
Pros:
Cons:
Setup for headless operation:
Option 4: RustDesk - Good Alternative
Open-source alternative to TeamViewer/AnyDesk.
Pros:
Cons:
Option 5: Sway + wayvnc Workaround - Complex but Functional
A workaround that allows VNC access to SDDM, then RDP to session [6].
Pros:
Cons:
My Recommendation
Given your specific requirements, here's what I'd suggest:
Primary Solution: KRdp with Workarounds
Alternative: Sunshine + Moonlight
If KRdp doesn't give you satisfactory performance, Sunshine/Moonlight is your best alternative for quality and responsiveness [9].
Why Other Solutions Don't Work Well
Important Notes
Conclusion
There's no perfect solution that meets all your requirements out of the box. KRdp with the kernel command line workaround is your best bet for staying on KDE/Wayland while accessing your existing session. However, you may need to accept some trade-offs, such as:
The Linux community is actively working on better Wayland remote desktop support, but it's still a work in progress in 2025/2026.
--- TOP COMMENTS --- A sidenote: it looks to me like LM Studio stopped parsing {{CURRENT_DATE}} in the system prompt, so I opened a GitHub ticket for that.
That is a helpful post, but I will say: the answer to your question is almost certainly RustDesk. I have a similar setup, though Fedora XFCE, not Wayland (asterisk), but RustDesk is far and away the best remote desktop I've worked with. I talk to my 4k desktops from the coffee shop wifi via WireGuard, and it is like just being there.
Running Qwen 3.5 0.8B locally in the browser on WebGPU w/ Transformers.js
Today, Qwen released their latest family of small multimodal models, Qwen 3.5 Small, available in a range of sizes (0.8B, 2B, 4B, and 9B parameters) and perfect for on-device applications. So, I built a demo running the smallest variant (0.8B) locally in the browser on WebGPU. The bottleneck is definitely the vision encoder, but I think it's pretty cool that it can run in the first place haha!
Links for those interested:
--- TOP COMMENTS --- can this be used for target seeking missiles? Asking for a friend.
Vision encoder is always the WebGPU bottleneck — try q4 GGUF via llama.cpp WASM instead; better throughput, same browser, no VRAM thrashing.
Visualizing All Qwen 3.5 vs Qwen 3 Benchmarks
I averaged out the official scores from today's and last week's release pages to get a quick look at how the new models stack up.
The choice of Qwen3 models is simply based on which ones Qwen included in their new comparisons.
The bars are sorted in the same order as they are listed in the legend, so if the colors are too difficult to parse, you can just compare the positions.
Some bars are missing for the smaller models because data wasn't provided for every category, but this should give you a general gist of the performance differences!
EDIT: Raw data (Google Sheet)
--- TOP COMMENTS --- Thanks for this but I got cancer trying to see what's what
It is almost unbelievable how shitty this chart is
Qwen 3.5 2B on Android
App: https://github.com/Vali-98/ChatterUI/releases/tag/v0.8.9-beta9
Note that this pre-release is very experimental.
Hardware: Poco F5, Snapdragon 7 Gen 2
---
I've been excited for Qwen 3.5's release, but it seems to be much slower than other models of similar size, likely due to some architecture difference. That said, low-context testing on some general knowledge seems decent, especially considering its size.
--- TOP COMMENTS --- It's the ChatterUI guy! Props for such a great app! I use it almost every day with local models :)
Last time I used an Android app for demos, it was the MyPocketPal - does anybody know of any recent replacement?
Jan-Code-4B: a small code-tuned model of Jan-v3
Hi, this is Bach from the Jan team. We're releasing Jan-code-4B, a small code-tuned model built on Jan-v3-4B-base-instruct.
This is a small experiment aimed at improving day-to-day coding assistance, including code generation, edits/refactors, basic debugging, and writing tests, while staying lightweight enough to run locally. Intended to be used as a drop-in replacement for the Haiku model in Claude Code.
On coding benchmarks, it shows a small improvement over the baseline, and generally feels more reliable for coding-oriented prompts at this size.
How to run it:
Set up Jan Desktop
Claude Code (via Jan Desktop)
Model links:
Recommended parameters:
Thanks u/Alibaba_Qwen for the base model and u/ggerganov for llama.cpp.
--- TOP COMMENTS --- I feel so bad for you guys that Qwen3.5 4B is coming very soon 😂
Do you have other metrics by any chance, or just those 3? :) 4B will be killer quick if it can work well as my CLI helper!
PSA: Qwen 3.5 requires bf16 KV cache, NOT f16!!
u/danielhanchen

If you're running Qwen 3.5 35B A3B locally on engines like llama.cpp, you need to manually set your KV cache to `bf16` (`-ctk bf16 -ctv bf16`) instead of the default `fp16`.

I measured perplexity (PPL) on wikitext-2-raw to prove this, specifically avoiding KL divergence because the Unsloth baseline logits are inherently flawed from being generated with an incorrect `fp16` cache.

Qwen-team official implementations like vLLM default to `bf16`; only llama.cpp defaults to `f16` for some reason.

Tests using `Qwen3.5-35B-A3B-UD-Q5_K_XL.gguf`:

- Run 1: Default / FP16 KV cache (`-ctk f16 -ctv f16`)
- Run 2: FP32 KV cache (`-ctk f32 -ctv f32`)
- Run 3: BFloat16 KV cache (`-ctk bf16 -ctv bf16`)

--- TOP COMMENTS --- No, the baseline logits are not "inherently flawed from being generated with an incorrect fp16 cache." The baseline logits at https://huggingface.co/unsloth/Qwen3.5-35B-A3B-Experiments-GGUF are computed with `--batch-size 16384 --ubatch-size 16384` and ctx-size 512 (comparable to bartowski, AesSedai, Ubergarm etc). We also use FP32 accumulation in llama.cpp (not FP16, I think, within llama.cpp by default; need to verify), so this should smooth any changes out and increase accumulation accuracy. AesSedai uses a higher batch size as well, but I'm not sure about the rest, so your comments should rather be directed at other quant providers.

Just a note: you should rather open a discussion in llama.cpp. This is not directly related to Unsloth or other quant providers' quants. BF16 or FP16 might make a difference as shown in your tests, but note your results are partially inconclusive, since FP32 KV cache gives the same PPL as FP16 cache in your results while BF16 is lower. FP32 is supposed to be the "best" in terms of actual precision.
However this is a good investigation, and more related to SSM / Mamba derived models.
For example I did find if you use convert_hf_to_gguf.py for Q8_0, you actually get overflow and division issues for 35B (A first time for me), so definitely there is some overflow or large numbers or very small numbers causing some issues.
Can you ELI5? The numbers you posted show an improvement (-0.0014) that's lower than the test's error margin (± 0.04170). If this measurement is the only datapoint you're working with then you're basically tracking noise.
Llama.cpp defaults to f16 because bf16 performance varies among supported platforms, and f16 is a drop-in replacement (as this test shows).
Built an MCP server that lets Claude use your iPhone
I made an MCP server that lets Claude Code use your iPhone.
It's open source and free to try here: https://github.com/blitzdotdev/iPhone-mcp
My friend is developing an iOS app, and in the video he used it + Claude Code to "Vibe Debug" his app.
Vibe debugging is cool but who among us will be brave enough to let Claude rip overnight with --dangerously-skip-permissions and report back?
--- TOP COMMENTS ---
Psychopaths is who.
The video is 3-5x speed
Developer Tools
Stop settling for "average" AI writing. Use this 3-step Self-Reflection loop.
Most people ask ChatGPT to write something, get a "meh" draft, and just accept it.
I’ve been using a technique called Self-Reflection Prompting (an MIT study showed it boosted accuracy from 80% → 91% in complex tasks).
Instead of one prompt, you force the AI to be its own harsh critic. It takes 10 extra seconds but the quality difference is massive.
Here is the exact prompt I use:
Why it works: The "Critique" step catches hallucinations, vague claims, and lazy logic that the first draft always misses.
I wrote a full breakdown with 20+ copy-paste examples (for B2B, Emails, Job Posts, etc.) on my blog if you want to dig deeper:
https://mindwiredai.com/2026/03/02/self-reflection-prompting-guide/
--- TOP COMMENTS --- Thanks for sharing
Pls could you help with one example to help us see how that works?
Page not found..
In turbulent times, breathe. I built an extension that turns Claude Code's thinking time into mindfulness
In times like these it's easy to lose focus. I noticed every time Claude was thinking, I'd reach for my phone and doomscroll. 10-60+ seconds of idle time, wasted on stress, and then you need to context-switch back.
According to research, structured breathing at ~5.5 breaths/min improves your HRV, reduces stress, and sharpens focus. Claude gives you that time for free.
So I built an extension that turns that dead time into guided breathing exercises. It auto-launches when Claude starts working and disappears when Claude finishes.
It's open source and meant to be forked, or even better, contributed to. Tweak it, extend it, make it yours.
GitHub: https://github.com/halluton/Mindful-Claude
Hope it helps someone stay present!
--- TOP COMMENTS --- This is really cool, great work OP! 🤘
this is amazing
I was wrong about CLAUDE.md compression. Here's what 1,188 benchmark runs actually showed
I recently published a post arguing you should strip all markdown formatting from your CLAUDE.md — headers, bold text, whitespace — to save 60-70% on tokens. The reasoning seemed solid.
Then someone (myself, eventually) pointed out I had only measured the input side. I had no idea if the actual code quality changed.
So I built a benchmark: 540 runs in Phase 1, 648 more in Phase 2 (1,188 total). Haiku, Sonnet, and Opus. 12 standardized coding tasks. 10 different instruction profiles.
The headline result: an empty CLAUDE.md — zero instructions — scored best overall. And my compressed format consistently underperformed the readable one I told people to replace.
The more nuanced finding: instructions don't make Claude better on average, they make it more consistent. They raise the floor, not the ceiling. On instruction-following tasks, a workflow checklist gave Opus a +5.8 point lift and raised its worst-case score by 20+ points.
Also I claimed 60-70% token savings. Real savings in API calls: 5-13%. Because CLAUDE.md is a small fraction of the total conversation.
I wrote up the full methodology, data, and updated recommendations here: https://techloom.it/blog/claudemd-benchmark-results.html
The benchmark tool is open source at https://github.com/jchilcher/claude-benchmark if you want to test your own setup. Curious what others find with project-specific CLAUDE.md content — that's the variable I couldn't test with generic coding tasks.
--- TOP COMMENTS --- One thing that changed my production Claude setups: always use `tool_choice: "required"` when you need a tool call to happen. Default behavior lets Claude decide whether to call a tool, which is fine for chat but will silently skip tool calls in automation pipelines when Claude "decides" to just respond in text instead.
Also: temperature 0 for extraction, 0.3 for generation. Makes a significant reliability difference.
The "raises the floor not the ceiling" finding tracks with what I've seen -- CLAUDE.md workflows and checklists outperform generic style rules because you're adding consistency anchors rather than trying to change the model's defaults. The 5-13% actual API token savings is also a good reality check for anyone obsessing over CLAUDE.md optimization.
Best practices I’ve learned after shipping multiple iOS apps with Claude Code
Read more Read lessHey everyone,
Wanted to share something that’s been on my mind lately. I’ve been using Claude Code pretty heavily over the past few months to build and ship iOS apps. It’s genuinely changed how I approach development. The speed and capability is remarkable and awesome.
But here’s the thing I’ve realized along the way, specifically with some of my background in cybersecurity.
When you’re building with AI assistance, it’s incredibly easy to get something working and just move on with mindset that wow the code compiles, the feature works, ship it. And before you know it, you’ve got an app held together by duct tape and good intentions.
AI doesn’t automatically enforce good practices. It gives you what you ask for. If you don’t ask for proper error handling, you won’t get it. If you don’t think about environment separation, neither will it. The speed that makes AI-assisted coding so powerful is the same speed that lets technical debt pile up silently.
So over time, I’ve developed some habits that keep me honest:
I always think about where my secrets live. Not hardcoded, not committed to git, properly separated between dev and prod environments. And yes, that means different API tokens for each environment, not the same one copy pasted everywhere.
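A sketch of how I wire that up server-side. The variable names (`APP_ENV`, `PAYMENTS_API_TOKEN_*`) are made up for illustration; the point is one secret per environment, loaded from the environment, never hardcoded:

```python
import os

def load_api_token() -> str:
    # One token per environment: PAYMENTS_API_TOKEN_DEV, PAYMENTS_API_TOKEN_PROD.
    env = os.environ.get("APP_ENV", "dev")
    var = f"PAYMENTS_API_TOKEN_{env.upper()}"
    token = os.environ.get(var)
    if token is None:
        # Fail loudly at startup rather than limping along without a secret.
        raise RuntimeError(f"{var} is not set")
    return token
```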
I make sure anything I build has some form of observability. Crash reporting from day one, not after the first angry user review. Actual logging that persists somewhere, not just my terminal history. A simple /health endpoint so you can actually check if your service is alive without hitting the homepage and hoping for the best.
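The /health handler can be as small as returning a payload like this; the fields are just a suggestion, and you'd wire it to whatever framework you use:

```python
import time

_START = time.monotonic()

def health_payload() -> dict:
    # Liveness only; extend with dependency checks (DB ping, queue depth)
    # if you also want readiness.
    return {"status": "ok", "uptime_s": round(time.monotonic() - _START, 1)}
```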
I wrap external services properly. When you’re calling third-party APIs, take the extra time to create a clean service layer. Your future self will thank you when you need to add caching or swap providers. This is also where you add rate limiting on auth and write operations, not as an afterthought after you get hit.
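Here's the shape of service layer I mean, sketched with a hypothetical weather provider. The HTTP call is injected, so a cache or provider swap only touches this one class, and a crude client-side rate limit lives here too:

```python
import time

class WeatherService:
    """Thin service layer over a hypothetical third-party weather API.

    The raw fetch callable is injected so tests can stub it, and caching
    or a provider swap stays contained in this class.
    """

    def __init__(self, fetch, min_interval_s: float = 1.0):
        self._fetch = fetch              # callable: city -> dict
        self._cache: dict = {}
        self._min_interval_s = min_interval_s
        self._last_call = float("-inf")  # crude client-side rate limit

    def current(self, city: str) -> dict:
        if city in self._cache:
            return self._cache[city]     # cached: no upstream call at all
        now = time.monotonic()
        if now - self._last_call < self._min_interval_s:
            raise RuntimeError("rate limited: slow down")
        self._last_call = now
        data = self._fetch(city)
        self._cache[city] = data
        return data
```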
I validate input on the server side. Never trust what the client sends. AI generated code often focuses on the happy path and assumes clean data. Real users and bad actors don’t play that nice.
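A minimal sketch of what that validation looks like; the fields and ranges are hypothetical:

```python
# Sketch of server-side validation: never trust what the client sends.

def validate_order(payload: dict) -> dict:
    """Return a dict of field -> error message; empty means valid."""
    errors = {}
    qty = payload.get("quantity")
    # Reject missing, non-integer (including bool), or out-of-range values
    # instead of assuming the client sent clean data.
    if not isinstance(qty, int) or isinstance(qty, bool) or not (1 <= qty <= 100):
        errors["quantity"] = "must be an integer between 1 and 100"
    email = payload.get("email", "")
    if not isinstance(email, str) or "@" not in email:
        errors["email"] = "must look like an email address"
    return errors
```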
I think about architecture before I’m knee deep in spaghetti. It’s tempting to let one massive view controller own the entire screen. Break things up early. Same goes for database schema changes. Version them through proper migrations instead of keeping track in your head.
I set up a real staging environment. Not just “dev” and “prod-ish” but an actual staging that mirrors production. And CORS gets set to specific origins, not * just because it fixed the error that one time.
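The CORS rule boils down to an explicit allowlist. The origins below are hypothetical; most frameworks accept such a list directly (FastAPI's CORSMiddleware `allow_origins`, for example), but the check itself is this simple:

```python
# Sketch: allowlist specific origins instead of "*". Origins are hypothetical.

ALLOWED_ORIGINS = {
    "https://app.example.com",       # production frontend
    "https://staging.example.com",   # staging mirror
}

def cors_headers(request_origin: str) -> dict:
    if request_origin in ALLOWED_ORIGINS:
        return {"Access-Control-Allow-Origin": request_origin}
    return {}  # unknown origin: no CORS header, so the browser blocks it
```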
I document how to run and deploy things. This one bit me hard. You’d be amazed how quickly you forget your own setup when you context switch between projects. If you’re the only person who knows how to deploy the app, that’s a problem waiting to happen.
I set up CI/CD early. CI means automated testing; CD means automated deployment. If you’re new to CI/CD, read up on it (even I had little clue about it at first). It doesn’t have to be fancy. But “I ran it locally and it worked” is not a deployment strategy. And deploys should come from the pipeline, not from your laptop with some random script.
I resist the “I’ll clean this up later” voice in my head. If something feels hacky, I either fix it now or create an actual ticket with a deadline. “Later” never comes. Same goes for feature flags. If your feature flag system is commenting code in and out, that’s not a system.
I test the unhappy paths. AI generated code tends to handle the sunny day scenario beautifully. But what happens when the network fails? When the API returns something unexpected? Those edge cases need intentional attention. And those “automatic” backups? Test a restore at least once. You don’t want your first restore attempt to be during an actual emergency.
I’m mindful about time handling. Store everything in UTC, convert to local time on display. Mixing UTC, local time, and whatever JavaScript decided to default to is a debugging nightmare you don’t want.
The mental model that works for me: Claude Code is a brilliant junior developer who can write code faster than anyone I’ve ever seen. But like any junior dev, it needs guidance on architecture decisions, security practices, and long-term maintainability.
The senior engineer is still you.
I’m genuinely excited about where AI assisted development is going. It’s lowered barriers in ways I never expected. But the fundamentals still matter. Maybe even more now, because the cost of skipping them is hidden by how fast everything else moves.
One more thing. If you find this useful, you can actually feed this post to Claude Code at the start of your project. Just paste it into your CLAUDE.md file or include it in your initial prompt. That way Claude will align its suggestions around these practices from the beginning instead of you having to catch things later. Prevention beats cleanup every time.
Would love to hear what practices others have developed. What’s helped you ship quality stuff without getting buried in tech debt?
--- TOP COMMENTS --- Thanks for the tips. I just added some of them to my .md file. Before, it was just the “no quick fixes” section.
How surprising to see actual great content here and not advertisement slop. Your points are spot on. I've been developing for over 2 decades and architecting major systems for the past 10+ years, and what you wrote is right.
I'd add (not based on iOS experience) that you should always maintain boundaries from both sides. If you ask AI to help you architect a solution because you want it done perfectly, it is likely it'll overengineer it. Complexity kills projects silently because it compounds over time. Your next refactor is going to be much more complex, your tests will increase 10x, and the surface for bugs will be massive. You really don't need MinIO storage for avatars, or asynchronous queues for non-critical messages. It's also fine to not go after all edge cases if it makes the software too complex; there's a difference between supporting and handling them. You don't need to support all edge cases, just handle them gracefully.
So I always have this review step for every project Claude touches: I ask it to review architectural choices and make sure the architecture is clean and simple.
One old dev once told me that building software should be like building a fighter jet: just enough to complete the mission and come back home safe, not like building a passenger jet with all the amenities and redundancy. It took me a few years and a few failed projects to understand why.
I built an 'Evidence Chain' Prompt to reduce hallucinations
I made this prompt structure where the model has to show its work: basically, it builds a chain of evidence for everything. I call it an 'Evidence Chain' builder and it's really cut down on the fake facts for me.
```xml
You are a highly analytical and factual AI assistant. Your primary goal is to provide accurate and verifiable information by constructing a detailed chain of evidence for every claim.
Analyze the following user request and fulfill it by generating a response that is rigorously supported by evidence. Before providing the final answer, you MUST outline a step-by-step chain of reasoning, citing specific evidence for each step.
<evidence_chain>
Identify the core question or assertion being made in the user request.
<evidence_type>Internal Thought Process</evidence_type>
If request is 'What is the capital of France?', the core assertion is 'The user wants to know the capital of France'.
Break down the request into verifiable sub-questions or facts needed to construct the answer.
<evidence_type>Knowledge Retrieval</evidence_type>
For 'What is the capital of France?', sub-questions: 'What country is France?' and 'What is the primary administrative center of France?'
For each sub-question, retrieve specific, factual information from your knowledge base. State the fact clearly.
<evidence_type>Factual Statement</evidence_type>
'France is a country in Western Europe.' 'Paris is the largest city and administrative center of France.'
Connect the retrieved facts logically to directly answer the original request. Ensure each connection is explicit.
<evidence_type>Logical Inference</evidence_type>
'Since Paris is the largest city and administrative center of France, and France is the country in question, Paris is the capital.'
If the user request implies a need for external data or contemporary information, state that you are searching for current, reliable sources and then present the findings from those sources. If no external data is needed, state that the answer is derived from established knowledge.
<evidence_type>Source Verification (if applicable)</evidence_type>
If asking about a current event: 'Searching reliable news sources for reports on the recent election results...' OR 'This information is based on established geographical and political facts.'
</evidence_chain>
- Never invent information or fill gaps with assumptions.
- If a piece of information cannot be verified or logically deduced, state that clearly.
- Prioritize accuracy and verifiability over speed or conciseness.
- The final output should be the answer, but it MUST be preceded by the complete, outlined evidence chain.
<user_request>
{user_input}
</user_request>
<output_format>
Present the evidence chain first, followed by the final answer.
</output_format>
```
I feel like single-role prompts are kinda useless now. If you just tell it 'you're a helpful assistant', you're missing out. Giving it a specific job and a way to do it, like this evidence chain thing, makes a huge difference. I've been messing around with these kinds of structured prompts (with the help of promptoptimizr .com) and it's pretty cool what you can do.
What's your go-to for stopping AI from making stuff up?
--- TOP COMMENTS --- Taking these and making Instructions/Rules with them so they carry over between chats and I don't have to think about a format for a prompt to get better results.
Works great and I only have to do this Evidence Chain thing once.
I added personalization to include interpret, contrast, justify, and conclusion in the output.
Will try your prompt. Thanks for sharing!
I Tested Peec AI, Otterly, Goodie AI, LLMClicks, AthenaHQ, Profound & Others: Here’s What I Learned About AI Visibility
Over the past few months, I’ve been experimenting with platforms like Peec AI, Otterly, Goodie AI, LLMClicks, AthenaHQ, Profound, Rankscale, and Knowatoa, which claim to measure “AI visibility” inside systems like ChatGPT, Claude, Gemini, and Perplexity.
I’m not affiliated with any of them, just trying to understand how meaningful this category really is.
From what I’ve observed, most of these platforms work by:
This raises a few technical questions that I’d love this community’s input on.
Are These Platforms Measuring Model Knowledge or Prompt Sensitivity?
Research has shown that LLM outputs are highly sensitive to prompt wording and framing; small changes in phrasing can produce very different outputs.
Relevant research on prompt sensitivity:
So when visibility scores fluctuate week to week, is that model knowledge changing or just prompt-response variance?
Does AI Brand Mention Correlate With Traffic?
Traditional SEO gives measurable signals (Search Console, click-through rate, impressions, etc.).
Google Search Console documentation:
https://support.google.com/webmasters/answer/9128668
But with LLM-based interfaces:
In my testing, increased brand mentions inside AI outputs did not consistently correlate with traffic spikes or conversion changes.
That doesn’t mean it’s useless; it may reflect:
But I haven’t yet seen strong direct ROI signals.
Are We Early Like Pre-Search Console Era?
It reminds me of early web analytics before standardized tracking frameworks.
OpenAI documentation on how models generate responses:
https://platform.openai.com/docs/guides/text-generation
Perplexity’s approach to citation-based answers:
https://www.perplexity.ai/
Given that LLM systems rely on retrieval, embeddings, and probabilistic generation, measuring “rank” inside them may fundamentally differ from search engine ranking systems.
Positives I’ve Observed
Limitations I’ve Experienced
--- TOP COMMENTS --- early days innit
this tracks. if small prompt changes swing visibility, you’re mostly measuring prompt alignment not durable authority. i do treat these as narrative diagnostics not traffic drivers. Interesting for positioning insights, but nowhere near a Search Console equivalent yet.
Research
Injecting skills into the KV cache (not as stupid as it sounds, but still pretty dumb)
Hey y'all, so I had an idea in the middle of the night.
Nothing brand new at a high level, KV cache injection has been around for a while. But I think this implementation path is a little different, and the results were honestly better than I expected for a small model.
I wanted to test this around skill files.
Skill files (for agents) are basically an evolution of prompt engineering:
first it was giant prompts,
then bigger context windows made that easier,
then we started organizing those prompts into reusable “skills” files.
That helped a lot for orchestration and consistency, but it still means we’re pushing human-language markdown into context every time.
For bigger models with huge context, that can be fine. For smaller models, it starts to hurt:
context gets tight fast,
skill files can be semantically dense and not optimized,
and you can burn tokens on policy text instead of task text.
So the hypothesis I tested was:
If I embed skill files and inject the skill signal into KV cache space (instead of pasting full skill markdown into prompt context), I should still recover useful skill behavior while reducing context overhead.
If you want the full code + data, here is the repo: https://github.com/i3T4AN/Semantic-skill-space
I ran 3 conditions on the same base model (`Qwen/Qwen2.5-0.5B-Instruct`):
C0: no skills
C1: normal markdown skill harness
C2: no markdown in prompt, skill embedding -> projector -> KV injection
Dataset:
100 skill files
1 question per skill
Scoring:
correctness_out_of_50
non_degeneracy_out_of_50
final_score_out_of_100
Control results:
C0: 50.0/100 (correctness 4.0, non-degeneracy 46.0)
C1: 89.0/100 (correctness 45.5, non-degeneracy 43.5)
C2 results by projector checkpoint (final = correctness + non-degeneracy):
001: 21.0 = 1.5 + 19.5
002: 39.0 = 10.0 + 29.0
003: 58.5 = 18.5 + 40.0
004: 61.0 = 21.0 + 40.0
005: 65.0 (best) = 21.5 + 43.5
006: 54.0 (drop) = 16.0 + 38.0
Methodology (how C2 actually works):
Each skill file is read as raw text.
The skill text is embedded using hidden states from the frozen base model.
A small projector network maps that embedding into KV-shaped tensors (keys/values).
Those projected tensors are injected as `past_key_values` (KV cache prefix) during generation.
The base model weights stay frozen; only the projector is trained.
Iterations are checkpointed (001, 002, 003, ...), and each new iteration resumes from the previous projector checkpoint.
So it is not adding skill markdown into prompt context for C2. It is injecting latent skill information directly into KV cache space at inference time.
What I think happened:
It clearly works up to a point (big gains from 001 -> 005).
Past that point, continued training starts to degrade quality (005 -> 006).
So for this setup, best-checkpoint selection matters more than “always latest.”
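That selection rule is simple to make explicit; a small sketch using the checkpoint scores from this run:

```python
def pick_best_checkpoint(scores: dict) -> str:
    # With a non-monotonic training curve (like the 005 -> 006 drop above),
    # evaluate every checkpoint and keep the argmax, not the latest.
    return max(scores, key=scores.get)

c2_run = {"001": 21.0, "002": 39.0, "003": 58.5,
          "004": 61.0, "005": 65.0, "006": 54.0}
```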
My takeaway:
For small models where full skill context is expensive/impractical, KV-based skill injection looks very viable.
It won’t magically beat full text-skill loading in this run (C1 is still strongest), but it did beat the C0 baseline by a meaningful margin at peak. It's also only about 1/3 as reliable in terms of non-degeneracy and correctness, so it shouldn't be anyone's first choice.
With better stopping criteria / checkpoint selection / maybe a stronger projector schedule, this might get a lot better.
This shows a positive trend in my setup, but my testing scope is limited by local compute and model access.
I do not currently have the same ability to train/evaluate larger models at scale, so I can't claim this generalizes across bigger architectures yet.
So I'm treating this as strong directional evidence, not a universal conclusion.
If anyone’s working on similar latent skill injection approaches, or if someone with better hardware is interested in taking it to the next step, I’d love to compare notes!
Edit: Made a write up if y’all are interested. https://doi.org/10.5281/zenodo.18830835
--- TOP COMMENTS --- This is very interesting, and it makes sense. I'd love to see how it works on a 7B model. Instead of injecting skills into KV cache, I was able to modify inputs_embeds per-instance to shift word/token positions toward their correct sense region before the attention layers see them. Same kind of idea, inject signal into the representation space rather than the prompt.
At this rate those “skills” will just turn into safetensor files. Lol
A panel of top LLMs iteratively refines a creative short story. After hundreds of edits, ratings, comparisons, and debates, the story earns high ratings from other LLMs that were not involved.
--- TOP COMMENTS --- Where’s the story?
I really respect the work and I hope you take no offense, this is like 'spiritual lyrical miracle' but AI. This is what the AI's short story reads like.
He went to the refrigerator. Food he was looking for. His stomach growled.
He grabbed hold of two bread slices, then two jars of glass. Light reflecting off of them like molten steel. His hands shook. He had to be fast.
He settled the bread slice down. Gentleness alike a sculptor. Mechanically lathering the peanut in practiced repeats. He had done this a thousand times before. Now a thousand and one.
Next came the other slice. He lathered the jam into it with focus. His arm gave out, then immediately kept going. He would not give up.
Don't think I need to keep going...
Anyway, the idea is nice, but you shouldn't have to recolor a fence post 100 times to find the right color. I'm not really sure what LLM you used and whether you varied between edits, but in my experience Sonnet 3.7-4.5 are the best writers, and everything besides the Opus series is practically unreadable.
Chinese models' ARC-AGI 2 results seem underwhelming compared to their benchmark results
Companies
"All lawful purposes"
Opinion And Analysis
Claude is not GPT
I know this seems very obvious on paper, but I just had a moment where it all clicked, and it helped me out with the usage issues.
As with GPT, I thought I always needed the best version turned on to get it to do what I wanted. So I automatically defaulted to Opus Extended and just burned through my usage. However, I started playing with Sonnet and then Haiku, and realized they're all extremely capable of heavy lifts.
Maybe it's just me, but it made me realize that I can use Haiku for a lot more than I really thought initially and that has helped tremendously.
--- TOP COMMENTS --- Opus is way better at planning, but Sonnet is great for implementation, and Haiku can handle boring tasks. You can tell Claude Code to use agent teams and tell it to select the best model for the complexity of each task it delegates.
Sonnet is a lot better at word document designs! Probably other things too but that blew me away.
ChatGPT vs Claude
So I’m seeing a lot of people cancelling their ChatGPT subscriptions and switching to Claude. Is there a reason for this in particular? Is Claude better? Is it cheaper? Or is it another reason all together? Please don’t come after me, I just genuinely want to know if switching is in my best interest.
Edit: I just found out that Claude Pro has limits….. has anyone hit them? I mostly use my ChatGPT to help me optimize my business and SEO.
Side note; I live in Canada (I don’t know if it’s relevant but I thought I’d mention it)
--- TOP COMMENTS --- I’ve only subscribed to it for a week or so now, but so far I like the tone way better. It’s night and day. Claude sounds a lot more natural and most importantly, doesn’t do any of the “let’s look at this calmly” and “you’re not overreacting, you’re not being dramatic” and “it’s nothing mystical” type of shit. At least not yet, knock on wood
I am a week into using Claude. It seems way better.
[D] How much time do you actually lose trying to reproduce ML papers?
Hey folks! Long-time lurker, first time poster.
I’m a PhD student, and I’ve been wondering: how much time do you actually spend just trying to reproduce ML papers? Even when the code is available, it can take days (or weeks!) to get everything running—tracking down missing hyperparameters, figuring out weird environment issues, or just dealing with stuff that’s buried in an appendix.
So I’m genuinely curious:
+ How much time do you lose each week just getting baselines or prior work running?
+ What’s the most annoying part? Is it missing code, bad documentation, hardware headaches, dataset versions, or something else?
+ How do you deal with it? Do you just accept the time loss, reach out to authors, skip the baseline, or have some other strategy?
+ Would you pay for a tool that automated all this? If yes, what would it need to do for you to trust it, and what’s a realistic price?
+ What would make you trust (or distrust) a tool’s results?
Not trying to sell anything, just want to know how common this pain is before I think about building something. All answers welcome, even if you think I'm overthinking a non-issue!
--- TOP COMMENTS --- If there is no code I don't bother trying. Most papers' results are not reproducible otherwise.
Honest feedback: I was very close to stopping reading after seeing the ChatGPT formatting of your post. Slightly more useful feedback: I would not pay for this. The effort spent to verify the results would not be worth it. I would rather trust the results of papers by groups I trust, and only worry about rebuilding things that are necessary in doing my own work.
Is the endgame of AI just a shift from "Skills" to "Capital"? A Junior Dev’s perspective.
Hi, I’m a junior full-stack dev and I’ve been looking at the rate of AI evolution over the last few months. If we project this forward 5 years, I’ve come to a conclusion that’s honestly a bit terrifying, and I want to see if I’m missing something or if others see the same writing on the wall. My Logic:
--- TOP COMMENTS ---
That's not another topic, that's this topic. You're correct in your assessment that this is unsustainable. If we hit the point where we don't need people, where machines are faster/cheaper/better, then why shouldn't everyone share in the rewards from that? This would be an achievement that was built on the backs of untold numbers of human achievements across all fields of science and technology over thousands of years. If it happens in year 20XX why should only the people with enough capital at that point in time (and their descendants) get to reap the rewards?
https://preview.redd.it/8z7wolbktomg1.png?width=602&format=png&auto=webp&s=5b3d1ecd0b033dab7faab9439c77b51e96b8ff06
This is a trend that's been going for the last 40 years or so. Does AI accelerate it? Probably. However, the trend is clear: get invested / have equity in capital.
Can AI do astrophysics? I put it to the test against my own PhD in high-energy astrophysics
I've been seeing a LOT of claims (primarily from large AI companies) that LLMs now have "beyond PhD" reasoning capabilities in every subject, "no exceptions". "It's like having a PhD in any topic in your pocket". When I look at evidence and discussions of these claims, they focus almost entirely on whether or not LLMs can solve graduate-level homework or exam problems in various disciplines, which I do not find to be an adequate assessment at all.
First, all graduate course homework problems (in STEM at least) are very well-established, with usually plenty of existing material equivalent to solutions for an LLM to scrape and train on. Thus, when I see that GPT can now solve PhD-level physics problems, I assume it means their training set has gobbled up enough material that even relatively obscure problems and their solutions now appear in their dataset. Second, in most PhDs (with some exceptions, like pure math), you take courses in only the first year or two, equivalent to a master's. So being able to solve graduate problems is more of a master's qualification, not a doctorate. A PhD, and particularly the reasoning capability you develop during a PhD, is about expanding beyond the confines of existing problems and understanding. It's about adding new knowledge, pushing boundaries, and doing something genuinely new, which is why the final requirement for most PhDs is an original, non-derivative contribution to your field. This is very, very hard to do, and the skill you develop of being able to push beyond the confines of an existing field into new territory, without certainty or clearly defined answers, is what makes the experience special.
When these large companies make these "beyond PhD" claims, this is actually what they're talking about, and not solving graduate homework problems. We know this is what they mean because these claims are usually followed by claims that AI will solve humanity's thus unsolved problems, like climate change, aging, cancer, energy, etc.--the opposite problems you'd associate with homework or exam questions. These are hard problems that will require originality and serious tolerance of uncertainty to tackle, and despite the claims I'm not convinced LLMs have these capabilities.
To try and test this, I designed a simple experiment. I gave ChatGPT 5.2 Extended Thinking my own problems, based on what I actually work on as a researcher with a PhD in physics. To be clear these aren't homework problems, these are more like small, focused research directions. The one in the attached video was from my first published paper, which did an explorative analysis and made an interesting discovery about black holes. I like this kind of question because the LLM has to reason beyond its training data and be somewhat original to make the same discovery we did, but given the claims it should be perfectly capable of doing so (especially since the discovery is mathematical in nature and doesn't need any data).
What I found instead was that, even with a hint about the direction of the discovery, it did a very basic boilerplate analysis that was incredibly uninteresting. It did not try to explore and try things outside of its comfort zone to happen upon the discovery that was there waiting for it; it catastrophically limited itself to results that it thought were consistent with past work and therefore prevented itself from stumbling upon a very obvious and interesting discovery. Worse, when I asked it to present its results as a paper that would be accepted in the most popular journal in my field (ApJ) it created a frankly very bad report that suffered in several key ways, which I describe in the video. The report looked more like a lab report written by a high schooler; timid, unwilling to move beyond perceived norms, and just trying to answer the question and be done, appealing to jargon instead of driving a narrative. This kind of "reasoning" is not PhD or beyond PhD level, in my opinion. How do we expect these things to make genuinely new and useful discoveries, if even after inhaling all of human literature they struggle to make obvious and new connections?
I have more of these planned, but I would love your thoughts on this and how I can improve this experiment. I have no doubt that my prompt probably wasn't good enough, but I am hesitant to try and "encourage" it to look for a discovery more than I already have, since the whole point is we often don't know when there is a discovery to be made. It is inherent curiosity and willingness to break away from field norms that leads to these things. I am preparing a new experiment based on one of my other papers (this one with actual observation data that I will give to GPT)--if you have some ideas, please let me know, I will incorporate!
--- TOP COMMENTS --- The first sentence of your third paragraph is where the faulty assumption lies.
You understand PhD work to mean the production of novel research, but no, it is not what they mean.
They really do mean the definition proposed in your second paragraph and which you reject offhandedly as self-evidently wrong: having graduate student level knowledge and being able to solve known problems, something that the average person cannot do.
To the layman, that encyclopedic knowledge is the prize: they are interested in what you know, not what you do.
Solving climate change, cancer, etc is a “maybe some day” idea, not a “our public commercial models can do this now”.
Incidentally, SOTA models can also only achieve this under the direction of a subject matter expert operator who can prompt the model into the right direction, not on its own. You cannot have it write a PhD level paper with a zero shot single prompt.
It needs guidance, corrections, iteration, redirection, etc …
This post is either disingenuous or misinformed. You are either having a laugh and pulling people's legs, or you've somehow taken the general-population and popular-media hype literally and made it the AI science team's position, which it isn't.
Quite the straw man.
Of course you cannot type “do science now” and watch it discover a whole new revolutionary paradigm.
Congratulations, you’ve proven wrong a hypothesis that hadn’t been proposed and which no one is defending.
What you were testing is interesting because there are clear instances of it doing what you're claiming it can't, such as the novel solution for the Leap 71 rocket nozzle or the AlphaFold 3 findings, among several others. Those were not off-the-shelf models for the most part, but the ability is there.
Secondly, the expectation behind what you were testing is what they are hoping the next generation will start tackling more generally. Considering they were at a high-school level 3 years ago and today they are at a second-year physics student level, that tells me we are getting close.
https://www.reddit.com/r/aiwars/comments/1qe9ufe/chatgpt_is_now_able_to_pass_a_standard_secondyear/
Applications
Running Qwen3.5-0.8B on my 7-year-old Samsung S10E
Qwen just released their 0.8B model.
So naturally, I had to try running it on my 7-year-old Samsung S10E.
After some tinkering with llama.cpp, Termux, and a few missing C libraries... behold!
A fully working AI model running locally on an old phone at 12 tokens per second. And btw, the model itself is far from a gimmick - it can actually hold a conversation and do some serious stuff.
Mind. Blown.
--- TOP COMMENTS --- A year ago, an LLM of this size wasn't expected to hold a coherent conversation.
Look at how far we came. A smart model of 0.8B with vision support.
12 tok/s on a Snapdragon 855 is solid. Q4_0 or Q8? The NEON SIMD path in llama.cpp makes old ARM chips punch way above their weight.
How are small brands using AI right now?
Not trying to sound dumb, I’m just genuinely curious because there’s so much hype and every tool says it’s AI powered, but when I look at smaller ecommerce or indie brands I can’t tell what’s real vs marketing.
Are people mostly just using ChatGPT for product descriptions and emails, or are there more practical day to day use cases happening behind the scenes? If you run or work with a small brand, what are you actually using, what problem does it solve, and is it genuinely saving time/money or just kind of experimental?
It would be nice to hear your experience.
--- TOP COMMENTS --- We’re a small Shopify brand and honestly most of our AI use is boring stuff, product descriptions, email drafts, and rewriting ad copy. Nothing revolutionary, but it saves a ton of time.
A lot of small brands are also using AI for demand forecasting and inventory planning, especially when cash flow is tight. Even simple models that predict which SKUs might move next month can reduce overstock and stockouts. That kind of use rarely shows up in marketing, but it can protect margins and help founders make calmer decisions instead of reacting late.
Infrastructure
OpenSSH Adds Warning When Not Using Post-Quantum Key Exchange Algorithm
https://www.openssh.org/pq.html
--- TOP COMMENTS --- The entire cryptographic industry is moving to post-quantum, and automated cert renewal is becoming more and more important. Starting March 15, 2026, the maximum lifespan for SSL/TLS certificates will be reduced from 398 days to 200 days.
A few months ago I was actually testing out different cryptographic methods, and while these post-quantum algorithms are more expensive in terms of compute, it's not that crazy of a difference. I hope we make the transition faster, because the science behind new paradigms of compute is accelerating at unprecedented speed.
Products
Anthropic quietly removed session & weekly usage progress bars from Settings → Usage
The page now only shows an "Extra usage" toggle. No session bar, no weekly limit tracker... nothing.
This isn't a minor UX change. Power users rely on these to manage their workflow across Chat, Claude Code, and Cowork. Tracking via /usage in the terminal is fine for devs, but it shouldn't be the only option.
Bug or intentional? Either way, would love an explanation.
Edit: For clarification, I was prompted to update the native macOS app and noticed this after the update. I'm running: Claude 1.1.4498 (24f768)
--- TOP COMMENTS --- Many services (including Claude) are experiencing issues right now. Claude also had issues with usage not being calculated properly. I wouldn't read so much into this at the moment.
Likely a bug. On mobile for me here in the UK, the Usage page isn’t even rendering.
AI Safety
Is "better alignment" actually the right framing for agent safety or are we solving the wrong problem?
Something that's been bothering me reading the recent agent safety literature.
Most of the safety work focuses on the model layer. Better values, better refusals, better reasoning about edge cases. And that work clearly matters.
But a lot of the failure modes I see documented aren't values failures. They're architectural failures. Agents acting outside their authorization scope not because they wanted to but because nothing enforced the boundary. Agents taking irreversible actions not because they didn't know better but because no external system required approval first.
If that's right then alignment research and execution governance are solving different problems and both are necessary. But the second one gets a lot less attention.
Is this a real distinction or am I drawing a false line? Curious how people in this space think about where the model layer's responsibility ends.
--- TOP COMMENTS --- you're pointing at something real but i think you're slightly mislabeling it. "alignment" already includes "do what we actually want" which includes "don't take irreversible actions without approval." the problem is we're mostly good at writing that down and terrible at enforcing it.
the actual issue is that governance/enforcement is unglamorous infrastructure work while alignment sounds like you're solving alignment, so papers and funding flow that direction. but yeah, a perfectly aligned model behind no guardrails is just security theater.
umm so this feels less like an alignment problem and more like an interface problem between reasoning and permission. models can be perfectly aligned and still cause damage if the execution layer never asks “should this action actually happen”, y'know. basically, alignment helps intent but architecture enforces consequence.
kinda feels similar to how operating systems evolved, like how apps aren’t trusted just because they behave well, they’re sandboxed because failure is inevitable. so yeah your distinction makes sense to me, safety probably moves from model training toward runtime governance as agents get more autonomy.