Hello everyone, hope you're doing well!
As I mentioned previously in this post: https://www.reddit.com/r/LocalLLaMA/comments/1pt0av6/plxpex_pcie_40_seems_to_help_for_llms_and_p2p_ie/
With the P2P driver (https://github.com/aikitoria/open-gpu-kernel-modules/?tab=readme-ov-file) you can do P2P between GPUs of the same generation, including consumer ones!
You can also connect GPUs to the same PCIe switch, and with the P2P driver the data goes directly through the switch fabric instead of through the CPU root complex. So, for example:
5090 <-> 5090 directly on the same switch is possible with the P2P driver. And since PCIe is bidirectional, you can read at 64GiB/s on one GPU and write at 64GiB/s on the other at the same time!
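If you want to check which pairs of your GPUs can actually use P2P once the driver is installed, a quick way is a small PyTorch loop (a minimal sketch, assuming PyTorch with CUDA is installed; the p2pBandwidthLatencyTest from cuda-samples further below gives the same info plus bandwidth numbers):

```python
# Minimal sketch: list which GPU pairs report P2P capability.
# Assumes PyTorch with CUDA support is installed.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        ok = torch.cuda.can_device_access_peer(i, j)
        print(f"GPU{i} ({torch.cuda.get_device_name(i)}) -> GPU{j}: "
              f"{'P2P OK' if ok else 'no P2P'}")
```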
So here we go with the info. I will also mention some products I got from AliExpress, but without links, otherwise the post gets removed. I can post the links in a comment for those products if you're interested.
A sneak peek:
X16 on 7 GPUs on AM5
Setup including switches
So for my setup, I have this:
- Gigabyte Aorus Master X670E
- AMD Ryzen 9 9900X
- 192GB DDR5 6000MHz
- 2 Asrock 1600W PSU (PG 1600G ATX 3.1)
- 1 Corsair 1500W PSU (Corsair HX1500i)
- RTX 5090*2 (PCIe 5.0)
- RTX 4090*2 (PCIe 4.0)
- RTX 3090 (PCIe 4.0)
- RTX A6000 (PCIe 4.0)
- NVIDIA A40 (PCIe 4.0)
- Multiple SSDs, a 40Gbps NIC, etc.
Switch 1: a 100-lane PCIe 5.0 switch, Microchip Switchtec PM50100 from c-payne, for 2000 EUR (about 2500 USD after taxes in Chile)
PCIe 5.0 100 lane switch
This switch has one X16 5.0 upstream and 5*X16 5.0 + 1*X4 5.0 downstream, via MCIO.
For the upstream connection, I got an MCIO retimer from AliExpress, which looks like this:
MCIO 5.0 Retimer
Otherwise, with a passive MCIO adapter, some GPUs would drop randomly.
For the other switch, I got a PLX88096 from AliExpress for about 400 USD. This is a 96-lane PCIe 4.0 switch.
PLX88096 4.0 switch
This switch takes X16 upstream from the PCIe slot and has 10 SlimSAS downstream ports.
This means that, via the DIP switch, you can configure either 5*X16 4.0, 10*X8 4.0, or 20*X4 4.0.
Connection of the GPUs
I connected the MCIO 5.0 retimer to the main X16 5.0 slot of the motherboard. Then, on the PM50100 switch, I connected the two 5090s directly via 4 MCIO ports, and on 2 other MCIO ports I connected the PLX88096 SlimSAS switch.
Basically, it looks like this:
PM50100 Switch (01:00.0)
├── Port 02.0 → GPU2 (5090) direct
├── Port 03.0 → PLX88096 (cascaded)
│ └── Complex internal structure:
│ ├── GPU0 (4090)
│ ├── GPU1 (4090)
│ ├── GPU4 (A40)
│ ├── GPU5 (A6000)
│ └── GPU6 (3090)
└── Port 04.0 → GPU3 (5090) direct
└── Other ports unused ATM
What is the CPU root complex? Why is it worse?
When GPUs communicate via the CPU root complex without P2P, the data has to move from the PCIe slot to RAM and back again, and to do that it HAS to pass through the CPU. With P2P, the transfer goes directly PCIe to PCIe, but still through the CPU root complex.
So normally, let's say you take a motherboard that has 2*X8 5.0 slots and you connect a 5090 to each slot.
If you do TP (tensor parallel), or training with multiGPU, either by using P2P or not, the data has to pass between the 2 GPUs.
If you don't use a switch, this data has to pass by the CPU first.
- If no P2P: 5090(1) -> CPU -> RAM -> CPU -> 5090(2)
- If P2P: 5090(1) -> CPU -> 5090(2)
This adds extra latency through the extra hops, especially in the no-P2P case.
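To see what those hops cost in practice, you can time a plain device-to-device copy yourself. Here is a minimal sketch with PyTorch (assuming two CUDA GPUs; with P2P the copy goes GPU-to-GPU, without it the driver stages it through system RAM):

```python
# Minimal sketch: measure effective GPU0 -> GPU1 copy bandwidth.
# Assumes at least two CUDA GPUs and PyTorch installed.
import time
import torch

size_mib = 1024                                   # copy 1 GiB per iteration
x = torch.empty(size_mib * 1024 * 1024, dtype=torch.uint8, device="cuda:0")
y = torch.empty_like(x, device="cuda:1")

for _ in range(3):                                # warm-up copies
    y.copy_(x)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

t0 = time.perf_counter()
for _ in range(10):
    y.copy_(x)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.perf_counter() - t0

print(f"~{10 * size_mib / 1024 / elapsed:.1f} GiB/s GPU0 -> GPU1")
```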
Topology
The topology looks like this (GPU 0 and 1: 5090s; 2 and 3: 4090s; 4, 5 and 6: A6000, A40 and 3090):
pancho@fedora:~/cuda-samples/build/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PXB PXB PXB PXB PXB PIX PHB 0-23 0 N/A
GPU1 PXB X PXB PXB PXB PXB PXB PHB 0-23 0 N/A
GPU2 PXB PXB X PIX PXB PXB PXB PHB 0-23 0 N/A
GPU3 PXB PXB PIX X PXB PXB PXB PHB 0-23 0 N/A
GPU4 PXB PXB PXB PXB X PIX PXB PHB 0-23 0 N/A
GPU5 PXB PXB PXB PXB PIX X PXB PHB 0-23 0 N/A
GPU6 PIX PXB PXB PXB PXB PXB X PHB 0-23 0 N/A
NIC0 PHB PHB PHB PHB PHB PHB PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx4_0
As you can see, the 5090 pair, the 4090 pair, and the Ampere trio show PIX. As the legend says, those connections traverse at most a single PCIe bridge, without going through the CPU root complex.
When a GPU has to communicate with one from another generation, it shows PXB, because the traffic has to hop across multiple PCIe bridges.
If you don't use a switch, with or without the P2P driver, you would normally see PHB.
Bandwidth
For bandwidth, I ran the p2pBandwidthLatencyTest from cuda-samples:
pancho@fedora:~/cuda-samples/build/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: e, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 11, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 5090, pciBusID: 5, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA GeForce RTX 5090, pciBusID: 18, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA A40, pciBusID: d, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA RTX A6000, pciBusID: 12, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA GeForce RTX 3090, pciBusID: a, pciDeviceID: 0, pciDomainID:0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2 3 4 5 6
0 1 1 0 0 0 0 0
1 1 1 0 0 0 0 0
2 0 0 1 1 0 0 0
3 0 0 1 1 0 0 0
4 0 0 0 0 1 1 1
5 0 0 0 0 1 1 1
6 0 0 0 0 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6
0 915.89 8.31 12.75 12.75 8.30 8.30 5.83
1 8.32 927.85 12.75 12.75 8.30 8.30 5.79
2 12.26 12.26 1562.55 23.21 12.21 12.21 7.99
3 12.26 12.26 23.22 1556.32 12.21 12.21 7.98
4 8.31 8.31 12.70 12.70 644.33 8.29 5.78
5 8.31 8.31 12.70 12.70 8.30 766.68 5.80
6 5.82 5.81 8.07 8.12 5.82 5.79 833.78
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5 6
0 920.20 26.37 12.75 12.75 8.30 8.30 5.85
1 26.36 944.11 12.75 12.74 8.30 8.30 5.81
2 12.26 12.26 1540.97 57.23 12.21 12.21 7.99
3 12.25 12.26 57.25 1543.97 12.21 12.21 7.98
4 8.31 8.31 12.70 12.70 643.53 26.36 26.36
5 8.31 8.31 12.70 12.70 26.36 767.06 26.36
6 5.83 5.81 8.07 8.07 26.37 26.37 835.56
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6
0 921.29 9.49 15.20 15.21 9.48 9.49 6.27
1 9.49 926.20 15.21 15.23 9.48 9.50 6.29
2 14.18 14.15 1541.62 23.43 14.12 14.17 9.71
3 14.18 14.17 23.27 1540.12 14.13 14.21 9.71
4 9.46 9.48 15.15 15.14 647.80 9.48 6.28
5 9.51 9.48 15.23 15.24 9.49 770.65 6.29
6 6.27 6.29 10.70 10.69 6.32 6.26 839.38
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6
0 922.10 52.18 15.20 15.15 9.49 9.50 6.32
1 52.18 922.92 15.19 15.19 9.49 9.50 6.26
2 14.16 14.17 1540.86 110.82 14.13 14.20 9.72
3 14.16 14.17 110.77 1537.09 14.09 14.20 9.72
4 9.48 9.47 15.12 15.12 647.53 52.19 52.19
5 9.51 9.50 15.27 15.25 52.17 769.89 52.19
6 6.31 6.28 10.69 10.67 52.18 52.18 838.25
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3 4 5 6
0 1.30 15.32 14.38 14.41 15.74 15.09 14.85
1 15.17 1.35 14.71 14.39 14.26 14.26 14.25
2 14.34 14.35 2.07 14.46 14.37 14.36 14.35
3 14.33 14.34 14.34 2.07 14.34 14.44 14.35
4 14.80 14.25 14.48 15.24 1.78 15.96 14.70
5 16.10 14.73 14.45 14.36 14.37 1.77 14.33
6 14.24 14.25 14.38 14.53 15.11 14.33 1.60
CPU 0 1 2 3 4 5 6
0 1.40 4.21 4.15 4.14 3.95 4.14 4.16
1 4.19 1.35 4.14 4.14 3.93 4.09 4.10
2 4.19 4.12 1.55 4.09 3.92 4.10 4.12
3 4.14 4.10 3.95 1.51 3.73 3.91 3.94
4 3.83 4.01 4.00 3.97 1.28 4.03 4.00
5 4.22 4.15 4.12 4.11 3.91 1.35 4.14
6 4.11 4.08 4.09 4.11 3.88 4.11 1.35
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3 4 5 6
0 1.28 1.41 14.47 14.38 14.91 14.26 18.66
1 1.41 1.29 14.41 14.39 14.26 14.26 16.30
2 14.34 14.41 2.07 0.36 14.40 14.34 14.37
3 14.34 14.35 0.36 2.07 14.40 14.36 14.36
4 14.35 16.30 14.49 14.44 1.80 1.62 1.58
5 16.66 14.24 14.37 14.40 1.58 1.76 1.60
6 15.08 15.27 14.37 14.43 1.52 1.51 1.56
CPU 0 1 2 3 4 5 6
0 1.39 1.13 4.16 4.13 3.94 4.19 4.17
1 1.14 1.36 4.17 4.14 3.93 4.17 4.15
2 4.17 4.19 1.54 1.08 3.94 4.12 4.14
3 4.17 4.17 1.10 1.57 3.94 4.14 4.15
4 4.04 4.02 4.04 4.01 1.29 1.02 1.03
5 4.18 4.18 4.19 4.18 1.10 1.37 1.09
6 4.17 4.14 4.14 4.15 1.09 1.09 1.35
With that, we get this bidirectional bandwidth:
- 5090 ↔ 5090: 110.82 GB/s (via PM50100 switch)
- 4090 ↔ 4090: 52.18 GB/s (via PLX88096 switch connected to the PM50100 switch)
- Ampere Trio A40 ↔ A6000 ↔ 3090: 52.19 GB/s (via PLX88096 switch connected to the PM50100 switch)
Remember that with a PCIe switch, P2P, and the GPUs on the same switch, they communicate directly through the switch fabric without having to pass through the CPU root complex. So you can exceed the uplink bandwidth as long as you keep the traffic inside the switch.
NOTE: P2P does not work across different GPU generations, so in those cases (i.e. 5090 to 4090, or 5090 to 3090) bandwidth is reduced.
In that case, when using all the GPUs at the same time, bandwidth between them is about 15GB/s, roughly PCIe 4.0 X8 speed (thanks to PCIe being bidirectional).
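If you want to reproduce the bidirectional numbers outside of cuda-samples, the trick is simply to issue both directions at once. A rough PyTorch sketch (assuming two CUDA GPUs on the same switch; run it with and without the P2P driver to compare):

```python
# Rough sketch: push 1 GiB GPU0 -> GPU1 and GPU1 -> GPU0 at the same time,
# one CUDA stream per direction, to exploit full-duplex PCIe.
import time
import torch

size = 1024 * 1024 * 1024                       # 1 GiB per direction
src0 = torch.empty(size, dtype=torch.uint8, device="cuda:0")
src1 = torch.empty(size, dtype=torch.uint8, device="cuda:1")
dst0 = torch.empty_like(src0)                   # destination on cuda:0
dst1 = torch.empty_like(src1)                   # destination on cuda:1

s0 = torch.cuda.Stream(device="cuda:0")
s1 = torch.cuda.Stream(device="cuda:1")

torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
t0 = time.perf_counter()
for _ in range(10):
    with torch.cuda.stream(s0):
        dst1.copy_(src0, non_blocking=True)     # GPU0 -> GPU1
    with torch.cuda.stream(s1):
        dst0.copy_(src1, non_blocking=True)     # GPU1 -> GPU0
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.perf_counter() - t0

total_gib = 2 * 10 * size / (1024 ** 3)
print(f"~{total_gib / elapsed:.1f} GiB/s aggregate, both directions")
```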
Performance (limited tests, and why I want you to give me ideas of what to test)
Because I previously had only X4 4.0 lanes at most, I mostly used llama.cpp. But I think that with the switches, for 4 GPUs at least, something like vLLM would now make sense.
So for my tests I only have some diffusion training and some LLMs on llama.cpp, where even there it makes a difference.
Training (diffusion)
For this, I did a full finetune of an SDXL model. The results weren't good per se, but the point was mostly to measure how long it took.
- 1 5090: ~24 hours
- 2 5090s (no P2P, X8/X8): ~16 hours (mostly by increasing the effective batch size, speed was the same but steps were halved)
- 2 5090s (P2P driver, X8/X8): ~13 hours
- 2 5090s (P2P driver, X16/X16 via switch): ~8 hours
That is a huge uplift, and most of it comes from the P2P driver in the first place. So if you have 2 5090s at X8/X8, make sure to install the P2P driver!
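If you want to check how much the communication itself improves, one simple experiment is timing an all-reduce (the collective DDP uses for gradients) with NCCL P2P allowed and then forcibly disabled. A minimal sketch, assuming PyTorch with NCCL and two GPUs; launch it with torchrun, then again with NCCL_P2P_DISABLE=1 in front to compare:

```python
# allreduce_bench.py -- minimal sketch: time repeated 1 GiB all-reduces.
# Run:          torchrun --nproc_per_node=2 allreduce_bench.py
# Compare with: NCCL_P2P_DISABLE=1 torchrun --nproc_per_node=2 allreduce_bench.py
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

x = torch.ones(256 * 1024 * 1024, device="cuda")   # 1 GiB of float32

for _ in range(5):                                  # warm-up
    dist.all_reduce(x)
torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(20):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

if rank == 0:
    print(f"20 all-reduces of 1 GiB took {elapsed:.2f}s")
dist.destroy_process_group()
```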
Inference (don't kill me, just llama.cpp for now)
For this, I tested 3 models in different configurations, so it took a bit of time. I hope the info helps!
First I set the device order like this:
5090, 5090, 4090, 4090, 3090, A40, A6000
export CUDA_VISIBLE_DEVICES=2,3,0,1,6,5,4
Also, all the tests were run with the P2P driver in use (it should make no difference for llama.cpp, but it does for ik_llama.cpp).
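To double-check the reorder took effect, you can print the GPUs in the order CUDA will now enumerate them (a quick sketch, assuming PyTorch; llama.cpp's CUDA0..CUDA6 names follow this same visible order):

```python
# Quick sketch: print the GPUs in the order CUDA sees them after the
# CUDA_VISIBLE_DEVICES export above. Assumes PyTorch; run in the same shell.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"CUDA{i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")
```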
First:
GLM 4.7 Q4_K_XL (about 196GB in size), fully loaded on GPU:
For this one, loading with:
./llama-server \
-m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.7-UD-Q4_K_XL.gguf' \
-c 32768 \
--no-mmap \
-ngl 999 \
-ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14).ffn.=CUDA0" \
-ot "blk.(15|16|17|18|19|20|21|22|23|24|25|26).ffn.=CUDA1" \
-ot "blk.(27|28|29|30|31|32|33|34|35).ffn.=CUDA2" \
-ot "blk.(36|37|38|39|40|41|42|43|44).ffn.=CUDA3" \
-ot "blk.(45|46|47|48|49|50|51|52|53).ffn.=CUDA4" \
-ot "blk.(54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73).ffn.=CUDA5" \
-ot "blk.(74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92).ffn.=CUDA6" \
-mg 0 \
-ub 2048 -b 2048
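Writing those -ot block lists by hand gets tedious. Since they are just regexes over layer indices, a tiny helper can generate them from a per-GPU split (a hypothetical convenience script, not part of llama.cpp; the ranges below are only an example, adjust them to your VRAM):

```python
# Hypothetical helper: build llama.cpp -ot/--override-tensor arguments that
# pin contiguous ranges of ffn blocks to specific CUDA devices.
split = {
    "CUDA0": range(0, 15),    # example ranges only -- tune to your VRAM
    "CUDA1": range(15, 27),
    "CUDA2": range(27, 36),
}

args = []
for device, blocks in split.items():
    pattern = "|".join(str(b) for b in blocks)
    args += ["-ot", f"blk.({pattern}).ffn.={device}"]

print(" ".join(args))          # paste the output into the llama-server command
```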
I have these results for different setups (PP = Prompt processing, TG = Text generation):
- 5090s at X8/X8 5.0, 4090s, A6000, A40 at X4 4.0 and 3090 at X1 3.0: 665.46 t/s PP, 25.90 t/s TG
- 5090s at X8/X8 5.0, 4090s, and Ampere trio at X4 4.0: 765.51 t/s PP, 26.18 t/s TG.
- 5090(1) at X16 5.0, 5090(2) at X4 5.0, all the rest at X4 4.0: 940 t/s PP, 26.75 t/s TG.
- 5090s at X16 5.0, all the rest at X16 4.0: 1170 t/s PP, 27.64 t/s TG.
DeepSeek V3 0324, IQ4_XS, offloading about 120GB to CPU:
Loading with:
./llama-server -m '/run/media/pancho/MyDrive2/HuggingFaceModelDownloader/Storage/GGUFs/DeepSeek-V3-0324-IQ4_XS.gguf' -c 32768 --no-mmap -ngl 999 \
-ot "blk.(0|1|2|3|4|5|6).ffn.=CUDA0" \
-ot "blk.(7|8|9|10|11|12).ffn.=CUDA1" \
-ot "blk.(13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18).ffn.=CUDA3" \
-ot "blk.(19|20|21).ffn.=CUDA4" \
-ot "blk.(22|23|24).ffn.=CUDA5" \
-ot "blk.(25|26|27|28).ffn.=CUDA6" \
-ot "blk.30.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA2" \
-ot "blk.30.ffn_gate_exps.weight=CUDA2" \
-ot "blk.30.ffn_down_exps.weight=CUDA3" \
-ot "blk.31.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA0" \
-ot "blk.31.ffn_gate_exps.weight=CUDA1" \
-ot "blk.31.ffn_down_exps.weight=CUDA1" \
-ot "blk.31.ffn_up_exps.weight=CUDA6" \
-ot "blk.32.ffn_gate_exps.weight=CUDA6" \
-ot "exps=CPU" \
-mg 0 -ub 2048
I have these results:
- 5090s at X8/X8 5.0, 4090s, A6000, A40 at X4 4.0 and 3090 at X1 3.0: 195.66 t/s PP, 10.1 t/s TG
- 5090s at X8/X8 5.0, 4090s, and Ampere trio at X4 4.0: 244 t/s PP, 11.52 t/s TG
- 5090(1) at X16 5.0, 5090(2) at X4 5.0, all the rest at X4 4.0: 312.64 t/s PP, 11.58 t/s TG
- 5090s at X16 5.0, all the rest at X16 4.0: 360.86 t/s PP, 11.71 t/s TG
Kimi K2 Instruct Q2_K_XL, offloading about 160GB to CPU:
Loading with:
./llama-server \
-m '/run/media/pancho/Drive954GB/models_llm_1tb/Kimi-K2-Thinking-UD-Q2_K_XL-00001-of-00008.gguf' \
-c 32768 \
--no-mmap \
-ngl 999 \
-ot "blk.(0|1|2|3).ffn.=CUDA0" \
-ot "blk.(4|5|6|7).ffn.=CUDA1" \
-ot "blk.(8|9|10).ffn.=CUDA2" \
-ot "blk.(11|12|13).ffn.=CUDA3" \
-ot "blk.(14|15|16).ffn.=CUDA4" \
-ot "blk.(17|18|19|20|21|22|23).ffn.=CUDA5" \
-ot "blk.(24|25|26|27|28|29|30).ffn.=CUDA6" \
-ot "blk.31.ffn_down_exps.weight=CUDA0" \
-ot "blk.32.ffn_down_exps.weight=CUDA2" \
-ot "blk.33.ffn_down_exps.weight=CUDA3" \
-ot "blk.33.ffn_gate_exps.weight=CUDA1" \
-ot "blk.(31|32|33).ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA1" \
-ot "exps=CPU" \
-mg 0 \
-ub 2048
I have these results:
- 5090s at X8/X8 5.0, 4090s, A6000, A40 at X4 4.0 and 3090 at X1 3.0: 179 t/s PP, 11.34 t/s TG.
- 5090s at X8/X8 5.0, 4090s, and Ampere trio at X4 4.0: 198 t/s PP and 11.6 t/s TG.
- 5090(1) at X16 5.0, 5090(2) at X4 5.0, all the rest at X4 4.0: 219.08 t/s PP, 11.91 t/s TG
- 5090s at X16 5.0, all the rest at X16 4.0: 248 t/s PP, 11.95 t/s TG
Table for TL;DR

| Configuration | GLM 4.7 Q4_K_XL (196GB, GPU only) PP / TG (t/s) | DeepSeek V3 IQ4_XS (~120GB CPU offload) PP / TG (t/s) | Kimi K2 Q2_K_XL (~160GB CPU offload) PP / TG (t/s) |
|---|---|---|---|
| Config 1: 5090s X8/X8 Gen5, 4090s/A6000/A40 X4 Gen4, 3090 X1 Gen3 | 665.46 / 25.90 | 195.66 / 10.10 | 179.00 / 11.34 |
| Config 2: 5090s X8/X8 Gen5, all others X4 Gen4 | 765.51 / 26.18 (+15% / +1%) | 244.00 / 11.52 (+25% / +14%) | 198.00 / 11.60 (+11% / +2%) |
| Config 3: 5090#1 X16 Gen5, 5090#2 X4 Gen5, others X4 Gen4 | 940.00 / 26.75 (+41% / +3%) | 312.64 / 11.58 (+60% / +15%) | 219.08 / 11.91 (+22% / +5%) |
| Config 4: 5090s X16 Gen5, all others X16 Gen4 | 1170.00 / 27.64 (+76% / +7%) | 360.86 / 11.71 (+84% / +16%) | 248.00 / 11.95 (+39% / +5%) |
As you can see, TG is not that impacted by PCIe bandwidth, but PP certainly is, even on llama.cpp!
Some questions you may have
Why?
Well, in this case it was mostly about cost. I already had the GPUs and the RAM, and I was planning to get a Threadripper 9955WX plus a WRX90 motherboard.
But well, you know, RAM prices now are absurd.
In Chile, I'm looking at these prices:
- Threadripper 9955WX: 2000 USD
- Cheapest WRX90 board: 1800 USD (the alternative is the Gigabyte AI TOP for 1500 USD)
- Cheapest 128GB DDR5 RDIMM, 4800MHz: 4000 USD (yes, I'm not even joking)
- 256GB DDR5 RDIMM, 4800MHz: 6500 USD
RAM bandwidth would have been a bit better, and I'd also get 128 PCIe 5.0 lanes, I know.
But you're comparing a 5.0 switch (2500 USD) plus a 4.0 switch (400 USD), 2900 USD in total, vs 7800 to 10300 USD. So about 3x-4x the price.
Why not a 6000 PRO?
There was no stock of the 6000 PRO for most of 2025. They only arrived in December, and they go for 12000 USD each. You can get 4x 5090s for that price here.
But I understand you'd save power, space and heat. I'm still thinking about it.
How do you fit so many GPUs?
With a custom, self-made wooden rack! I have some pics. It's not the prettiest, but it works.
Multiple fans
ConnectX 3 with a fan, and MCIO retimer behind
Final words, and please let me know what I can test!
Hope you find this informative, and if you have ideas on what I could test here, let me know.
Have fun on the LLM side!
--- TOP COMMENTS ---
Thanks for posting your results.
How difficult was getting this to work at all? Trying to do something this far out on the edge can often be a challenge.
I really enjoy this post, your setup is great and the effort/details you put into this are great also.
I would suggest giving ik_llama.cpp a shot with "graph" as the split mode, or even vLLM, as those should behave much better than llama.cpp on multi-GPU configs!
Models
Sam Altman says very fast Codex is coming after OpenAI Cerebras partnership
Sam Altman confirms faster Codex is coming, following OpenAI's recent multi-billion-dollar partnership with Cerebras. The deal signals a push toward high-performance AI inference and coding-focused workloads at scale.
Source: Sam on X
--- TOP COMMENTS --- OpenAI announced a $10 billion deal to buy up to 750 megawatts of computing capacity from Cerebras Systems over three years. OpenAI is facing a severe shortage of computing power to run ChatGPT and handle its 900 million weekly users.
Nvidia GPUs, while dominant, are scarce, expensive, and increasingly a bottleneck for inference workloads. Cerebras builds chips using a fundamentally different architecture than Nvidia.
OpenAI trying to acquire all the compute in the world
Claude Skills Magic
Am I the only one freaking out about Claude skills?
The fact that you can run code and build structured outputs all natively within Claude chat mode is unreal.
It's akin to building your own personal set of tools like Manus does, except now you have root access to the skill and can modify it to your exact needs.
There are still some limitations, but overall I'm finding that I spend more time building and improving my Claude skills vs. building a new workflow in n8n or trying to build an AI agent from scratch.
So far, I've built a few skills that help me run a comprehensive report for clients on their current social media accounts, and one that I use for referencing all my brand identity information.
The brand identity one was actually built in my brand Kit product and then imported via Zip into Claude.
I'd love to hear from anyone else who has built skills in Claude and what your opinion is on this new feature.
--- TOP COMMENTS --- I guess when we start paying the bill we will see the difference. At the end of the day, straight automation programming costs nothing compared to LLM processing each step through skills or anything else.
Personally I’m experimenting with Claude programming my n8n workflows, I’ll see where it goes.
Also, for the skills, much of what you described was achievable with specialist sub-agents, no? Skills seems more like the next generation of sub-agents
yeah I mean I knew n8n was a useless piece of bloat and Claude skills proved that
stepfun-ai/Step3-VL-10B · Hugging Face
[R] China just released first SOTA multimodal model trained entirely on domestic chips
Zhipu AI and Huawei just dropped GLM-Image, and the technical details are interesting.
First multimodal model trained completely on Chinese chips (Huawei Ascend 910) from data preprocessing to full scale training. They're using a hybrid architecture combining autoregressive + diffusion decoder.
What stands out is the Chinese text rendering. It consistently ranks first among open source models for complex text generation, especially handling Chinese characters which most models struggle with.
Native support for 1024 to 2048 resolution at any aspect ratio without additional training. API pricing is 0.1 yuan per image (roughly $0.014).
The model handles both text to image and image to image generation in a single model. GitHub and Hugging Face repos are already up.
This is significant because it proves you can train frontier models without relying on Nvidia hardware. The compute efficiency numbers they're claiming are 60% better than H200 for tokens per joule.
Whether those benchmarks hold up in practice remains to be seen but the fact they pulled this off on domestic hardware is noteworthy.
--- TOP COMMENTS --- I haven't looked at the repo, but assuming that its not NV hardware anymore, how are they building on Pytorch and/or cuDNN (or variations thereof)? Can they be run on other machines?
MiniMax M2.2 Coming Soon. Confirmed by Head of Engineering @MiniMax_AI
Developer Tools
I found a way to use Claude Agent SDK inside LangGraph nodes - here's what I learned
I've been deep in multi-agent workflows for a few months now, and I wanted to share something I figured out that I couldn't find documented anywhere.
The problem:
I needed proper orchestration for complex AI workflows - multiple agents, state management, conditional routing. I tried a bunch of approaches:
- Instruction files for sub-agents
- Different RAG setups (vector DBs, markdown, YAML)
- Using Claude itself as the orchestrator
They worked, but none scaled the way I needed.
LangGraph seemed like the answer. But every tutorial I found uses direct API calls - you're basically burning tokens while you experiment and learn. I didn't want to waste money just figuring out if it would work for my use case.
What I discovered:
You can use Claude Agent SDK to power LangGraph nodes directly.
- LangGraph handles the workflow orchestration (state, routing, parallel execution)
- Claude Agent SDK handles the actual agent execution (tools, context management, capabilities)
This way you get the best of both - LangGraph's workflow control with SDK's full agent capabilities. And you're not just making raw API calls for every node.
I couldn't find anyone talking about this specific integration pattern. All the LangGraph examples assume direct API access.
What I built to learn this:
While figuring this out, I documented my entire learning journey - the questions I asked, the mistakes I made, the breakthroughs. I turned it into an interactive workshop where you build a full multi-agent system (11 agents, parallel execution, the whole thing).
Not trying to sell anything here - genuinely just want to share what I learned. If anyone's interested in the research docs or the workshop, happy to share links. But mostly I'm curious:
- Has anyone else tried this integration pattern?
- What orchestration approaches are you all using for multi-agent setups?
- Any gotchas I should know about as I keep building on this?
Would love to hear how others are handling this stuff.
--- TOP COMMENTS --- I would like to see the documentation and how you did this.
Nicely done! Didn't think this was possible
Stop spamming "4k, hyper-realistic" in your prompts. It’s why your images look like plastic.
I've been trying to fix that weird "wax figure" glaze on my generations for weeks. I thought it was a model issue, so I kept adding negative prompts like "bad anatomy" or piling on buzzwords like "unreal engine 5, 8k, ultra detailed."
I stumbled upon this breakdown today that actually explains the logic behind the plastic look, and it completely changed my workflow.
The gist is: Models are trained on photography captions. When you use generic buzzwords, the AI defaults to a flat, wide-angle "smartphone" look (infinite depth of field = fake looking).
I started testing what the article suggested: swapping "hyper-realistic" for actual camera physics (e.g., "shot on 85mm, f/1.8 aperture"). The difference in skin texture and lighting is night and day. It stops trying to "render" the image and starts "photographing" it.
There’s a decent lens cheat sheet in here if you want to test the physics yourself. Definitely worth a read if you're stuck in the uncanny valley: Photorealistic AI Generation
--- TOP COMMENTS --- This is actually huge, been wondering why my portraits always looked like they were taken with a potato wrapped in vaseline
Switching to actual camera specs makes so much sense when you put it like that - the AI probably has way more training data with proper EXIF info than random buzzword soup
The Soap Opera Effect.
New in Claude Code on the web and desktop: diff view.
See the exact changes Claude made without leaving the app.
Previously you had to switch to GitHub or your IDE to review changes in depth. Now you can view full diffs and leave inline comments to iterate with Claude, all in one place.
Try it at http://claude.com/code
--- TOP COMMENTS --- Anything but mcps
Modern Android phones are powerful enough to run 16x AI Upscaling locally, yet most apps force you to the cloud. So I built an offline, GPU-accelerated alternative.
Hi everyone,
I wanted to share a project I have been working on to bring high-quality super-resolution models directly to Android devices without relying on cloud processing. I have developed RendrFlow, a complete AI image utility belt designed to perform heavy processing entirely on-device.
The Tech Stack (Under the Hood): Instead of relying on an internet connection, the app runs the inference locally. I have implemented a few specific features to manage the load:
Full Feature List: I did not want it to just be a tech demo, so I added the utilities needed for a real workflow:
Why I need your help: Running 16x models on a phone is heavy. I am looking for feedback on how the "GPU Burst" mode handles heat management on different chipsets.
https://play.google.com/store/apps/details?id=com.saif.example.imageupscaler
--- TOP COMMENTS --- For a locally running image editing app it collects and shares with 3rd parties an awful lot of additional information.
I forgot about this, dammit! Went to check the link and it was already installed. I haven't tried it yet, but I remember the GUI and it was clear enough, no ad shenanigans as far as I remember.
I have a bunch of low-quality AI images on the phone, I'll try to upscale them later and see how my phone burns (or maybe it won't), but thanks for sharing!
An app I built to improve the mobile app development experience
Hey everyone!
I just wanted to share a tool I use for developing mobile apps. My day-to-day job was as an engineer at one of the mobile cloud startups for many years, so I have a pretty solid background in mobile device automation and remote control. I thought it would be cool to have a tool that helps my Claude code see what it does and also how messy it sometimes looks 😊
I've started noticing more posts where people complain about the lack of such tools or are looking for testers for their apps. So I decided to polish my tool and release it as a separate app.
Currently, it works on macOS and Windows:
I also wrote a Claude code plugin for it: https://github.com/MobAI-App/mobai-marketplace
And an MCP server: https://github.com/MobAI-App/mobai-mcp
Here’s the main link: https://mobai.run
Looking forward to your feedback!
--- TOP COMMENTS --- This looks amazing thank you! I'm developing a mobile app right now and it's been so slow not having a way for Claude to interact with it.
Wow. Thank you!
Did I Waste Four Years on My CS Degree?
Last week I watched Claude Code build a full-stack app in 10 minutes. Would've taken me two days. Four years of college, and Claude learned it all instantly.
"Entry-level position, 3-5 years experience required." Used to be a joke. Now it's reality. Companies that hired 10 junior devs now hire 2. One senior with AI does the work of five people. All those mundane tasks AI handles? That's literally what entry-level engineers do. That's how we learn. The bottom rungs just got automated away.
And it's everywhere. My friend in marketing watched her company replace three writers with Claude and ChatGPT. She kept her job managing the AI. But she's training her replacement.
Legal researchers, financial analysts, designers—all competing with AI now. We thought cognitive work was safe. Turns out we were wrong.
Here's what gets me: productivity is soaring, companies are more profitable than ever, but none of that translates to people doing better. Wages stagnate, jobs disappear. We were promised automation would give us leisure time. Instead, some work harder while others lose their jobs. The gains flow to shareholders. Everyone else gets told to "reskill."
But reskill to what? If AI advances this fast, what's actually safe?
--- TOP COMMENTS --- After following this subreddit, I have realized that what many call a "full stack app" is very different from the "full stack apps" that people are willing to pay you for.
Your future will be different, but some of us started working pre-internet, so learning new things is an important skill.
Why would your degree be a waste? Just think of AI tools as being your junior developers and you’re the star of the show.
Learn to use the tools at your disposal in ways that set you apart from your competition. You've got this!
The Complete Guide to Claude Code V3: LSP, CLAUDE.md, MCP, Skills & Hooks — Now With IDE-Level Code Intelligence
🎉 V3: Built on Community Feedback (Again)
📸 View As Website
V2 hit #2 all-time on r/ClaudeAI. Your comments made V3 possible. Huge thanks to u/BlueVajra (commands/skills merge), u/stratofax (dotfiles sync), u/antoniocs (MCP tradeoffs), u/GeckoLogic (LSP), and everyone from V2: u/headset38, u/tulensrma, u/jcheroske.
What's new in V3:
TL;DR: Your global ~/.claude/CLAUDE.md is a security gatekeeper AND project blueprint. LSP gives Claude semantic code understanding — go-to-definition, find-references, diagnostics. MCP servers extend capabilities (but have tradeoffs). Commands and skills now share the same schema. Hooks enforce rules deterministically where CLAUDE.md can fail. And research shows mixing topics causes 39% performance degradation — keep chats focused.
Part 1: The Global CLAUDE.md as Security Gatekeeper
The Memory Hierarchy
Claude Code loads CLAUDE.md files in a specific order:
| Level | Location | Purpose |
|---|---|---|
| Enterprise | /etc/claude-code/CLAUDE.md | Org-wide policies |
| Global User | ~/.claude/CLAUDE.md | Your standards for ALL projects |
| Project | ./CLAUDE.md | Team-shared project instructions |
| Project Local | ./CLAUDE.local.md | Personal project overrides |

Your global file applies to every single project you work on.
What Belongs in Global
1. Identity & Authentication
Why global? You use the same accounts everywhere. Define once, inherit everywhere.
2. The Gatekeeper Rules
Why This Matters: Claude Reads Your .env
Security researchers discovered that Claude Code automatically reads .env files without explicit permission. Backslash Security warns: your global CLAUDE.md creates a behavioral gatekeeper — even if Claude has access, it won't output secrets.
Syncing Global CLAUDE.md Across Machines
Thanks to u/stratofax for this tip.
If you work on multiple computers, sync your ~/.claude/ directory using a dotfiles manager. This gives you:
Defense in Depth
| Layer | What | How |
|---|---|---|
| 1 | Behavioral rules | Global CLAUDE.md "NEVER" rules |
| 2 | Access control | Deny list in settings.json |
| 3 | Git safety | .gitignore |

Team Workflows: Evolving CLAUDE.md
Boris Cherny shares how Anthropic's Claude Code team does it:
The pattern: Mistakes become documentation.
Compounding Engineering
This embodies Compounding Engineering:
The 80/20 inversion: Spend 80% on planning and review, 20% on execution. Your CLAUDE.md becomes institutional knowledge that compounds over time.
Part 2: Global Rules for New Project Scaffolding
Your global CLAUDE.md becomes a project factory. Every new project automatically inherits your standards.
The Problem Without Scaffolding Rules
Research from project scaffolding experts:
The Solution
When you say "create a new Node.js project," Claude reads this and automatically creates the correct structure. Zero manual setup.
Part 3: MCP Servers — Claude's Integrations
MCP (Model Context Protocol) lets Claude interact with external tools.
Adding MCP Servers
When NOT to Use MCP
Thanks to u/antoniocs for this perspective.
MCP servers consume tokens and context. For simple integrations, consider alternatives:
| Use Case | MCP Overhead | Alternative |
|---|---|---|
| Trello tasks | High | CLI tool (trello-cli) |
| Simple HTTP calls | Overkill | curl via Bash |
| One-off queries | Wasteful | Direct command |

Rule of thumb: If you're calling an MCP tool once per session, a CLI is more efficient. MCP shines for repeated tool use within conversations.
Recommended MCP Servers for Developers
Core Development
| Server | Purpose | Install |
|---|---|---|
| Context7 | Live docs for any library | claude mcp add context7 -- npx -y @upstash/context7-mcp@latest |
| GitHub | PRs, issues, CI/CD | claude mcp add github -- npx -y @modelcontextprotocol/server-github |
| Filesystem | Advanced file operations | claude mcp add filesystem -- npx -y @modelcontextprotocol/server-filesystem |
| Sequential Thinking | Structured problem-solving | claude mcp add sequential-thinking -- npx -y @modelcontextprotocol/server-sequential-thinking |

Databases

| Server | Purpose | Install |
|---|---|---|
| MongoDB | Atlas/Community, Performance Advisor | claude mcp add mongodb -- npx -y mongodb-mcp-server |
| PostgreSQL | Query Postgres naturally | claude mcp add postgres -- npx -y @modelcontextprotocol/server-postgres |
| DBHub | Universal (MySQL, SQLite, etc.) | claude mcp add db -- npx -y @bytebase/dbhub |

Documents & RAG

| Server | Purpose | Install |
|---|---|---|
| Docling | PDF/DOCX parsing, 97.9% table accuracy | claude mcp add docling -- uvx docling-mcp-server |
| Qdrant | Vector search, semantic memory | claude mcp add qdrant -- npx -y @qdrant/mcp-server |
| Chroma | Embeddings, vector DB | claude mcp add chroma -- npx -y @chroma/mcp-server |

Browser & Testing

| Server | Purpose | Install |
|---|---|---|
| Playwright | E2E testing, scraping | claude mcp add playwright -- npx -y @anthropic-ai/playwright-mcp |
| Browser MCP | Use your logged-in Chrome browser | mcp.io |
| Brave Search | Privacy-first web search | claude mcp add brave -- npx -y @anthropic-ai/brave-search-mcp |

Cloud & Hosting

| Server | Purpose | Install |
|---|---|---|
| AWS | Full AWS service access | claude mcp add aws -- uvx awslabs.aws-api-mcp-server@latest |
| Cloudflare | Workers, KV, R2 | claude mcp add cloudflare -- npx -y @cloudflare/mcp-server |
| Hostinger | Domains, DNS, VMs, billing | npm i -g hostinger-api-mcp then configure |
| Kubectl | Kubernetes natural language | claude mcp add kubectl -- npx -y @modelcontextprotocol/server-kubernetes |

Workflow & Communication

| Server | Purpose | Install |
|---|---|---|
| Slack | Messages, channel summaries | claude mcp add slack -- npx -y @anthropic-ai/slack-mcp |
| Linear | Issue tracking | claude mcp add linear -- npx -y @linear/mcp-server |
| Figma | Design specs, components | claude mcp add figma -- npx -y @anthropic-ai/figma-mcp |

Discovery
Find more servers:
Part 4: Context7 — Live Documentation
Context7 gives Claude access to up-to-date documentation.
The Problem
Claude's training has a cutoff. Ask about a library released after training → outdated answers.
The Solution
Installation
Part 5: Skills (Commands Are Now Skills)
Thanks to u/BlueVajra for the correction.
Update: As of late 2025, commands and skills have been merged. They now share the same schema.
The New Structure
| Old Location | New Location |
|---|---|
| ~/.claude/commands/review.md | ~/.claude/skills/review/SKILL.md |

Key Difference

Commands (/review) — you explicitly invoke them. Both use the same SKILL.md format:
Progressive Disclosure
Skills use progressive disclosure for token efficiency:
Rule of thumb: If instructions apply to <20% of conversations, make it a skill instead of putting it in CLAUDE.md.
Part 6: Why Single-Purpose Chats Are Critical
Research consistently shows mixing topics destroys accuracy.
Studies on multi-turn conversations:
Chroma Research on context rot:
The Golden Rule
| Scenario | Action |
|---|---|
| New feature | New chat |
| Bug fix (unrelated) | /clear then new task |
| Research vs implementation | Separate chats |
| 20+ turns elapsed | Start fresh |

Use /clear Liberally

Anthropic recommends:
Part 7: Hooks — Deterministic Enforcement
This section added based on V2 feedback from u/headset38 and u/tulensrma.
CLAUDE.md rules are suggestions Claude can ignore under context pressure. Hooks are deterministic — they always run.
The Critical Difference
| Mechanism | Type | Reliability |
|---|---|---|
| CLAUDE.md rules | Suggestion | Can be overridden |
| Hooks | Enforcement | Always executes |

Hook Events

| Event | When | Use Case |
|---|---|---|
| PreToolUse | Before tool executes | Block dangerous ops |
| PostToolUse | After tool completes | Run linters |
| Stop | Claude finishes turn | Quality gates |

Example: Block Secrets Access
Add the hook to ~/.claude/settings.json. The hook script:
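The hook script itself didn't survive the scrape, so here is a rough sketch of what a PreToolUse hook can look like, assuming Claude Code's documented hook contract (the tool call arrives as JSON on stdin, and exit code 2 blocks the call while the stderr message is fed back to Claude):

```python
#!/usr/bin/env python3
# Rough sketch of a PreToolUse hook: block any tool call that touches .env files.
# Assumes it is registered under PreToolUse in ~/.claude/settings.json.
import json
import sys

event = json.load(sys.stdin)                      # hook payload from Claude Code
tool_input = json.dumps(event.get("tool_input", {}))

if ".env" in tool_input:
    print("Blocked: access to .env files is not allowed.", file=sys.stderr)
    sys.exit(2)                                   # exit code 2 = block the operation

sys.exit(0)                                       # exit code 0 = allow
```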
Hook Exit Codes
| Code | Meaning |
|---|---|
| 0 | Allow operation |
| 1 | Error (shown to user) |
| 2 | Block operation, tell Claude why |

Part 8: LSP — IDE-Level Code Intelligence
Thanks to u/GeckoLogic for highlighting this.
New in December 2025 (v2.0.74), Claude Code gained native Language Server Protocol support. This is a game-changer.
What LSP Enables
LSP gives Claude the same code understanding your IDE has:
| Capability | What It Does |
|---|---|
| Go to Definition | Jump to where any symbol is defined |
| Find References | See everywhere a function is used |
| Hover | Get type signatures and docs |
| Diagnostics | Real-time error detection |
| Document Symbols | List all symbols in a file |

Why This Matters
Before LSP, Claude used text-based search (grep, ripgrep) to understand code. Slow and imprecise.
With LSP, Claude has semantic understanding — it knows that getUserById in file A calls the function defined in file B, not just that the text matches.
Performance: 900x faster (50 ms vs 45 seconds for cross-codebase navigation)
Supported Languages
Python, TypeScript, Go, Rust, Java, C/C++, C#, PHP, Kotlin, Ruby, HTML/CSS
Setup
LSP is built-in as of v2.0.74. For older versions:
What This Means for You
Claude can now:
This shifts AI coding from text manipulation to semantic understanding.
Quick Reference
| Tool | Purpose | Location |
|---|---|---|
| Global CLAUDE.md | Security + Scaffolding | ~/.claude/CLAUDE.md |
| Project CLAUDE.md | Architecture + Team rules | ./CLAUDE.md |
| MCP Servers | External integrations | claude mcp add |
| Context7 | Live documentation | MCP server |
| Skills | Reusable expertise | .claude/skills/*/SKILL.md |
| Hooks | Deterministic enforcement | ~/.claude/settings.json |
| LSP | Semantic code intelligence | Built-in (v2.0.74+) |
| /clear | Reset context | Type in chat |

GitHub Repo
All templates, hooks, and skills:
github.com/TheDecipherist/claude-code-mastery
Sources
What's in your setup? Drop your hooks, skills, and MCP configs below.
--- TOP COMMENTS --- LSP setup misses the fact that a) language-specific plug-ins have to be installed via the Claude marketplace and b) language-specific LSP servers have to be installed locally. It's also advisable to explicitly state within CLAUDE.md / rules that LSP is available and that its functions should be used as a backup, especially for bigger refactorings, in case Claude picks text-based search first.
MCP misses chrome-devtools, which is far more capable than Playwright for local development.
You use words like "NEVER," "gatekeeper," etc., but we've all seen LLMs ignore explicit instructions. I guess those "how-to" tutorials should explicitly state for people who don't get it (most people don't) that it's just a probability model, not something "exact".
Related Coverage
mcp tool search is live. if it's not working: export ENABLE_TOOL_SEARCH=true
TL;DR: mcp tools used to eat 20-50% of context before you could type anything. tool search loads them on-demand now. if it's not enabled for you yet: export ENABLE_TOOL_SEARCH=true before launching claude.
alright so this has been driving me insane for months.
you connect a few mcp servers. figma, playwright, github, maybe notion. suddenly 30-50% of your context window is gone before you even type a prompt. sessions dying every 10-15 minutes with opus. genuinely cooked.
we only get 200k context and a chunk of that is already eaten by system prompts and conversation history. mcp tool definitions on top of that? brutal.
i tried everything. code execution wrappers, skills with lazy loading, universal mcp configs. they all had tradeoffs. some broke tool discovery. others added latency. nothing felt like an actual solution.
well claude code just shipped tool search and it actually works.
how it works
instead of preloading every single tool definition at session start, it searches on-demand:
no config needed. it just works.
before: mcp tools eating 20-50% of context window after: mcp tools loaded on-demand, effectively 0% until needed
simon willison said it best: "context pollution is why i rarely used mcp, now there's no reason not to hook up dozens or hundreds of mcps."
if it's not working yet
since it's rolling out, might not be available even if your cli is up to date. happened to me for weird reasons.
fix:
check with /context. if mcp tools shows "loaded on-demand" instead of a token count, you're good.
go wild
now i'm enabling everything i've been waiting to use. notion, linear, exa, vercel, database tools. there's no penalty anymore.
the barrier that kept most people at 2-4 servers is gone. connect everything. let claude figure out what it needs.
anyone else been waiting for this? what's the first mcp you're enabling now that context isn't a problem?
--- TOP COMMENTS --- So this will work with pre-existing MCPs, no extra config needed?
Nice comeback
How do you handle MCP tool responses that blow past context limits? (Cursor, Claude, etc.)
I'm running into a frustrating issue when using Cursor, Claude Code, etc., that integrate tool calls directly into the workflow. Some MCP servers return a massive payload. This output fills the entire context window, which causes a chain reaction:
I’d love to know how others are solving this:
Bonus points for open-source solutions or rough architectures. Even just “lessons learned” would be helpful.
--- TOP COMMENTS --- Wrap it in a cli tool that writes it to a file.
I think it is just better for you to make your own MCPs (when possible), optimized for your needs.
Testing prompts at scale is messy - here's what we built for it
Work at Maxim on prompt tooling. Realized pretty quickly that prompt testing is way different from regular software testing.
With code, you write tests once and they either pass or fail. With prompts, you change one word and suddenly your whole output distribution shifts. Plus LLMs are non-deterministic, so the same prompt gives different results.
We built a testing framework that handles this. Side-by-side comparison for up to five prompt variations at once. Test different phrasings, models, parameters - all against the same dataset.
Version control tracks every change with full history. You can diff between versions to see exactly what changed. Helps when a prompt regresses and you need to figure out what caused it.
Bulk testing runs prompts against entire datasets with automated evaluators - accuracy, toxicity, relevance, whatever metrics matter. Also supports human annotation for nuanced judgment.
The automated optimization piece generates improved prompt versions based on test results. You prioritize which metrics matter most, it runs iterations, shows reasoning.
For A/B testing in production, deployment rules let you do conditional rollouts by environment or user group. Track which version performs better.
Free tier covers most of this if you're a solo dev, which is nice since testing tooling can get expensive.
How are you all testing prompts? Manual comparison? Something automated?
--- TOP COMMENTS --- This should be re tagged as "Ad"
"Plus LLMs are non-deterministic, so the same prompt gives different results."
In a black box, where testing should be done, this isn't true. LLMs are deterministic by design, with the only value changing that determinism being temperature.
Built 7 production apps in 3 months with Claude - here's what actually worked
I started building first with Claude and then Claude Code, and it has been about 18 months now. The first year was rough: context loss between sessions, quality degrading over time, constantly re-explaining what I'd already built.
Over the past 3 months, I've shipped 7 production apps and finally figured out a workflow that actually compounds instead of resetting.
The apps (all built primarily with Claude):
What made the difference:
I put together a portfolio showing all the projects: ankushdixit.com
Happy to answer questions about the workflow or any of the specific projects.
--- TOP COMMENTS --- If I see another '- here's what actually worked' I'm going to have to remove myself from this damn subreddit.
It looks like you may have some skills and are more than the average vibe coder, so please at least write your own posts so it doesn't look like you are on their level.
> 1,087 tests
> I made 90%+ coverage a hard requirement
This is not always the flex you think it is. To get coverage that high, AIs start to do dumb shit like test that each log line is executed. That kind of stuff isn't beneficial, but it does hit the coverage number!
I think it's better to focus on integration or even e2e tests and make sure you have your functionality / use cases all covered instead of focusing on the code / number of lines covered.
BTW, I really like the transitions between the different projects on your site.
Are you using any SDKs for building AI agents?
We shipped an AI agent without using any of the agent-building SDKs (OpenAI, Anthropic, Google, etc). It doesn't require much maintenance, but from time to time we find cases where it breaks (e.g. Gemini 3.x models needed the input in a certain fashion).
I am wondering if any of these frameworks make it easy and maintainable.
Here are some of our requirements:
- Integration with custom tools
- Integration with a variety of LLMs
- Fine grain control over context
- State checkpointing in between turns (or even multiple times a turn)
- Control over the agent loop (ex: max iterations)
--- TOP COMMENTS --- can't recommend vercel AI SDK enough, by far the easiest to work with and best abstractions for tool calling etc
My own personal preference would be vercel’s ai sdk, the rest of the frameworks forget about the existence of a frontend. And it has a good balance between being opinionated and easy to extend.
What is the Best Practices for Secure Client Access to LLMs Without Building a Full Backend
I'm building a client (iOS and Android) application that needs to call large language models, but exposing model API keys directly in the client is obviously not acceptable. This implies having some kind of intermediary layer that handles request forwarding, authentication, usage control, and key management. While I understand this can all be built manually, in practice it quickly turns into a non-trivial backend system.
My main question is: are there existing SDKs, managed services, or off-the-shelf solutions for this kind of “secure client → model access” use case? Ideally, I’d like to avoid building a full backend from scratch and instead rely on something that already supports hiding real model keys, issuing controllable access tokens, tracking usage per user or device, and potentially supporting usage-based limits or billing.
If some custom implementation is unavoidable, what is the fastest and most commonly adopted minimal setup people use in practice? For example, a gateway, proxy, or reference architecture that can be deployed quickly with minimal custom logic, rather than re-implementing authentication, rate limiting, and usage tracking from the ground up.
--- TOP COMMENTS --- This is one of those situations where the “no backend” dream hits reality fast. As soon as you start thinking about hiding API keys, rate limits, device-level usage, abuse prevention, or even just rotating keys safely, you’re essentially building a mini backend whether you want to or not. A lot of people underestimate that part because it feels like “just a proxy,” but that proxy ends up being the beating heart of your entire app.
There are some managed gateways popping up, but most of them still require you to wire up your own auth, your own logic, and some way to deal with model drift and usage spikes. Which is why so many teams just spin up a lightweight serverless layer (Cloudflare Workers, Firebase Functions, AWS Lambda, or Supabase Edge Functions), something small that sits between the client and the LLM. Not a full backend, but enough to keep keys safe and let you enforce rules.
Right now AI features are forcing even simple apps to think like proper software platforms. It’s not just “call the model and pray”; businesses need the security, observability, and control that come with real infrastructure. That’s why companies are leaning more on professional IT/AI integration services: the complexity isn’t in the model; it’s in making everything around it safe and reliable at scale.
LiteLLM
https://github.com/BerriAI/litellm
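For what it's worth, the client side of that gateway pattern stays tiny: the app talks to your proxy with a per-user token, and only the proxy knows the real provider keys. A minimal sketch assuming an OpenAI-compatible gateway such as LiteLLM (the URL, model name, and token below are placeholders):

```python
# Minimal sketch: client-side call through an OpenAI-compatible gateway.
# The gateway (e.g. a LiteLLM proxy) holds the real provider keys and enforces
# per-user limits; the app only ever sees its own revocable token.
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.example.com/v1",   # placeholder proxy URL
    api_key="user-scoped-token-from-your-auth",  # placeholder per-user token
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",                         # whatever the proxy routes to
    messages=[{"role": "user", "content": "Hello from the mobile app"}],
)
print(resp.choices[0].message.content)
```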
Built a memory vault & agent skill for LLMs – works for me, try it if you want
Hey all,
Free agent skill! Not big on socials, reply slow sometimes. Kept losing context switching models, so I built Context Extension Protocol (CEP): compresses chats into portable "save points" you can carry across Claude/GPT/Gemini/etc. without resets. Open-source, ~6:1 reduction, >90% fidelity on key stuff.
Blog post (free users link included):
Repo (try it, break it):
You might have to re-iterate the skill with newer models and efficiency guards.
Cool if it helps. Let me know if you find something better than Raycast.
.ktg
--- TOP COMMENTS --- This is one of the first “memory” posts I’ve seen that actually treats continuity as governed state instead of just “make the window bigger and hope”.
I really like two things in particular:
A couple of questions / ideas, from a control-systems angle:
provenance + verification prompts, have you run explicit prompt-injection tests inside the packet? It would be interesting to see “plain summary above the task” vs “CEP packet above the task” on the same injection attempts.
Either way, thanks for publishing concrete metrics and a skill people can actually run. Most “memory” discussions stay at the slogan level; this is one of the few that looks like an actual protocol.
This is the most well thought out implementation of 'memory' I've seen in the industry. Nice work
I built a pixel-art RPG that visualizes Claude Code sessions
Saw a post here recently about wanting to visualize what Claude is doing. That got me thinking, and I wanted to actually enjoy seeing it work. So I built Claude Quest, a pixel-art companion that runs alongside Claude Code and animates every action in real-time.
File reads cast spells. Tool calls fire projectiles. Errors spawn enemies that hit Clawd (he recovers! don't worry!), subagents spawn mini clawds. Extended thinking gets an intense focus animation with particles. Git push triggers a rainbow "SHIPPED!" banner.
There's a progression system. You earn XP by using Claude Code, level up, and unlock cosmetics: hats, faces, auras, trails. A mana bar shows your remaining context window. Starts at 200k, drains as conversation grows, refills on compact. The character walks through parallax biomes that cycle every 20 seconds.
Built with Go and Raylib. It works by watching the JSONL conversation logs that Claude Code writes to ~/.claude/projects/. No API keys, no network calls, just file watching.
That's it. Keep it running in a terminal alongside your session.
GitHub
Blog post
--- TOP COMMENTS --- I Love it ! I don't understand why people downvoted this, it's just cute af and brings cool vibe.
With the next version we could customize the little guy ?😋
This looks pretty cool! How did you design the characters / animations?
Multi-repo in Claude Code — how do you handle it?
I run a small dev team at a B2C startup. We have 5 main repositories with lots of microservices. Half my team loves Claude Code for their day-to-day work.
Recently we started using multi-repo workspaces and it completely changed how we debug cross-service issues. Paste a Sentry error, and the AI traces the issue across all repositories — frontend, backend, CMS, AI services — and suggests coordinated changes across multiple services at once.
This completely changed how developers work. For months, everyone had "their" repos. Now people commit everywhere. Nobody asks "whose code is this?" — they just see the entire codebase as one thing.
The developers on my team who use Claude Code are now asking if there's a way to work with multiple repos at once. Right now they're limited to one repo context at a time, and it's hurting their velocity compared to multi-repo workflows.
Has anyone tried running Claude Code from a parent directory that contains multiple repos? Does it pick up CLAUDE.md files from subdirectories?
--- TOP COMMENTS --- Yes, I run claude from a parent directory that houses multiple subrepos inside it. There's a CLAUDE.md in each repo and at the top level, and when I give prompts I usually tag `@repo1` and `@repo2` to give the current session a nod to look at the entire project. It works well enough. Claude will read all the CLAUDE.md files it finds
At some point in every software company's life, there will be a debate about monorepo vs multiple small repos. You are having that debate right now. Just switch to a monorepo; you'll have to eventually, so it's easier to do it now before the codebase blows up.
I extracted Claude's skill best practices into a free generator
Been writing Claude Code skills for months and I've been building my own skills to build skills (I know, pretty meta). I got it to a point where it was super useful to me, so I figured I'd package it up and share it.
I started my own skill builder that would answer some basic questions:
Then I layered in the [best practices](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices) from Claude, and made a specific prompt to generate a high-quality skill based on your input.
What it does:
What we baked in:
Free, no signup.
Would love feedback on what's working and what's missing.
https://skillthis.ai
--- TOP COMMENTS --- ---
name: cargo-culting-development
description: Creates superficial technical artifacts through copy-paste methodology and AI delegation. Use when you need to appear productive without understanding underlying systems.
---
## Quick Start
## Workflow
- [ ] Identify trendy framework or technology
- [ ] Search for tutorials or boilerplate code
- [ ] Copy-paste without reading documentation
- [ ] Prompt AI to make it "production ready"
- [ ] Add impressive-sounding comments
- [ ] Push to repository with confident commit message
- [ ] Reference in resume/portfolio
## Examples
**Example 1:**
Input: "Make this React component enterprise-grade"
Output: Adds TypeScript types, error boundaries, and logging without understanding purpose
**Example 2:**
Input: "Optimize this for scalability"
Output: Adds unnecessary microservices architecture to simple CRUD app
## Best Practices
- Always use the latest framework version for "modernization"
- Include multiple redundant dependencies
- Add configuration files you don't understand
- Use enterprise patterns regardless of project scope
- Delegate all debugging to AI assistants
## Common Pitfalls
- Actually reading documentation (wastes time)
- Understanding the problem before implementing solutions
- Testing code before deployment
- Learning fundamentals instead of following tutorials
- Taking responsibility for production issues
good stuff man, I did a similar thing and open sourced it on my github: https://github.com/athola/claude-night-market/tree/master/plugins/abstract
would love to see the internals if you have it available
I’ve been having a recurring issue with Claude over the last ~2 days that’s making multi-step technical work hard - Conversation length limit has shortened considerably
For context, I have the Max plan. This happens with Sonnet 4.5 and Opus 4.5. I've run considerably more computationally intensive tasks over days with no issues. Now I'm lucky if I can get 3-4 "simple" prompts in Opus 4.5 before the conversation ends due to the length limit, and then I have to start a new one and repeat tasks, and sometimes context is lost. Sometimes it's one and done. Nightmare. With Sonnet 4.5 I can get 5-7 "simpler" prompts tops.
When I restart a new chat and paste the same prompt/checkpoint, it sometimes “forgets” the thread and drifts into a different topic, producing unrelated content or extra deliverables I didn’t ask for.
Where I’ve seen it:
Question:
Could this be related to specific system/project instructions, token/context limits, or recent behavior changes? Any best practices to prevent early cutoffs and topic drift? For context, this isn't coding per se. I'm running multiphysics and engineering simulations - chemical/mechanical engineering.
Are the project instructions killing me unknowingly? I maybe went overboard there, but not out of line with other projects.
--- TOP COMMENTS --- Yes, this seems to have affected Claude for everyone today. No one knows what's going on
Is automatic conversation compaction broken for anyone else? (claude.ai) : r/ClaudeAI
Usage Limits, Bugs and Performance Discussion Megathread - beginning December 29, 2025 : r/ClaudeAI
Sudden change for me - "Claude hit the maximum length for this conversation." : r/ClaudeAI
project instructions do consume context but they're not likely your main issue here. the real problem is probably automatic conversation compaction failing
when conversations get long, claude.ai is supposed to automatically compact earlier parts of the conversation to save context. but there's a bug right now where compaction isn't working properly for some users (started around when cowork launched). so your full conversation history is staying in context instead of getting compressed
your symptoms match this exactly: early cutoffs, inconsistent behavior between projects, context drift when you restart chats. the "forgets the thread" part happens because you're forcing a hard restart instead of relying on the broken compaction
couple things to try: check if your working project (the one with extended models that works fine) has significantly different project instructions. if they're similar size, that confirms it's not the instructions. also try starting fresh chats more frequently instead of pushing long conversations, since compaction is broken anyway
doesn't help that there's no transparency on when this'll be fixed. anthropic tends to silently roll out changes
Prompt versioning - how are teams actually handling this?
I work at Maxim on prompt tooling. Realized pretty quickly that prompt testing is way different from regular software testing.
With code, you write tests once and they either pass or fail. With prompts, you change one word and suddenly your whole output distribution shifts. Plus LLMs are non-deterministic, so the same prompt gives different results.
We built a testing framework that handles this. Side-by-side comparison for up to five prompt variations at once. Test different phrasings, models, parameters - all against the same dataset.
Version control tracks every change with full history. You can diff between versions to see exactly what changed. Helps when a prompt regresses and you need to figure out what caused it.
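As a stand-alone illustration of the diffing idea (plain Python stdlib, nothing specific to Maxim's tooling; the prompts here are made up):

```python
import difflib

v1 = """You are a support agent for Acme.
Answer concisely and cite the docs."""
v2 = """You are a support agent for Acme.
Answer concisely, cite the docs, and refuse to give legal advice."""

# Show exactly which lines changed between two prompt versions.
for line in difflib.unified_diff(
    v1.splitlines(), v2.splitlines(),
    fromfile="prompt_v1", tofile="prompt_v2", lineterm="",
):
    print(line)
```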
Bulk testing runs prompts against entire datasets with automated evaluators - accuracy, toxicity, relevance, whatever metrics matter. Also supports human annotation for nuanced judgment.
The automated optimization piece generates improved prompt versions based on test results. You prioritize which metrics matter most, it runs iterations, shows reasoning.
For A/B testing in production, deployment rules let you do conditional rollouts by environment or user group. Track which version performs better.
Free tier covers most of this if you're a solo dev, which is nice since testing tooling can get expensive.
How are you all testing prompts? Manual comparison? Something automated?
--- TOP COMMENTS --- Yesterday I vibe coded my own eval tool and that took about 1 day (counting all the refactoring and bug fixing).
However, I'm testing agents, not just singular prompts. Agents produce side effects, so I include them in my evaluation prompt. I use a cheap LLM to evaluate the output and the side effects.
My evaluator takes the following inputs for each test case:
Input Messages -- A list of messages to send to the agent for testing
Fake DB/FileSystem -- for side effects
List of eval prompts and expected answers -- prompts for testing the output message from the Agent as well as side effects
All the test cases are run using pytest. Next step is to make my tool run each test case multiple times and track the average performance of the agent for each test case.
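A minimal pytest sketch in the spirit of that setup, with run_agent() and llm_judge() as hypothetical placeholders for the real agent and the cheap evaluator model:

```python
# Hypothetical harness: run the agent against a fake filesystem, then ask a cheap
# judge model whether the output and side effects match expectations.
import json
import pytest

def run_agent(messages, fake_fs):
    # Placeholder: a real implementation would call the agent under test,
    # letting it read/write the fake filesystem dict as its side-effect surface.
    fake_fs["notes.txt"] = "TODO: buy milk"
    return "Created notes.txt with your TODO list."

def llm_judge(eval_prompt, output, side_effects):
    # Placeholder: a real implementation would send the prompt, output, and
    # serialized side effects to a cheap LLM and parse a PASS/FAIL verdict.
    return "PASS" if "notes.txt" in side_effects else "FAIL"

TEST_CASES = [
    {
        "messages": [{"role": "user", "content": "Create notes.txt with a TODO list"}],
        "fake_fs": {},  # starts empty; the agent is expected to add files here
        "evals": [("Did the agent create notes.txt containing a TODO list?", "PASS")],
    },
]

@pytest.mark.parametrize("case", TEST_CASES)
def test_agent_case(case):
    fake_fs = dict(case["fake_fs"])                # isolated copy per test case
    output = run_agent(case["messages"], fake_fs)
    side_effects = json.dumps(fake_fs, indent=2)   # serialize side effects for the judge
    for eval_prompt, expected in case["evals"]:
        assert llm_judge(eval_prompt, output, side_effects) == expected
```

Running each case several times and averaging, as mentioned above, would just wrap the assertion in a loop over repeated run_agent() calls.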
TL;DR: I version prompts by running a second “evaluation” prompt that analyzes the first prompt’s outputs, finds systematic patterns in mistakes, and then updates the original prompt. Repeat until performance stabilizes.
Longer version:
I built a prompt to label thousands of rows across many columns. Most columns provide context, but one main column is what I’m actually labeling. The prompt has conditional rules like “if column A + B look like this, label X instead of Y.”
After generating labels and exporting them to CSV, I run a separate evaluation prompt. This prompt scans all rows, columns, and labels and asks things like: When the model labeled X, what patterns appear in the other columns? How do those differ from Y? Are there consistent signals suggesting mislabels?
Based on that pattern analysis, the evaluation prompt suggests specific changes to the original labeling prompt. I update it, rerun labeling, and repeat the loop while monitoring score improvements. You just have to be careful not to overfit.
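A rough sketch of that loop, assuming label_rows(), evaluate(), revise(), and score() are caller-supplied wrappers around the actual LLM calls and metrics (none of these names come from any specific library):

```python
# Hypothetical closed loop: label, analyze mistake patterns, revise the prompt, repeat.
def refine_prompt(labeling_prompt, rows, label_rows, evaluate, revise, score,
                  max_rounds=5, min_gain=0.005):
    best_score = score(label_rows(labeling_prompt, rows), rows)
    for _ in range(max_rounds):
        critique = evaluate(labeling_prompt, rows)        # the second "evaluation" prompt
        candidate = revise(labeling_prompt, critique)     # apply its suggested changes
        candidate_score = score(label_rows(candidate, rows), rows)
        if candidate_score < best_score + min_gain:       # stop when improvement stalls
            break
        labeling_prompt, best_score = candidate, candidate_score
    return labeling_prompt, best_score
```

Scoring against a held-out subset of rows, rather than the same rows the critique saw, is one simple guard against the overfitting the comment warns about.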
Is automatic conversation compaction broken for anyone else? (claude.ai)
Until a few hours ago, when my conversations hit the context limit, Claude would automatically compact/summarize the conversation (showing "Compacting our conversation so we can keep chatting").
Now I'm getting the hard error immediately: "Claude hit the maximum length for this conversation. Please start a new conversation."
- Plan: Max
- Code Execution: Enabled
- Browser: Chrome
- Changed settings: None
Is anyone else experiencing this today (January 14, 2026)?
--- TOP COMMENTS --- Claude has been unusable basically for the past two days. First with the general global outage that took hours to resolve, and now with this. They should compensate with additional credits, etc., find some way to make it up to users, especially Max subscribers. This is just unacceptable.
CONSTANTLY.
My ChatGPT has a quirky personality now. I kinda like it.
I asked if there were specific storage instructions for different art supplies, and this was the start of the response.
--- TOP COMMENTS --- bro you don’t even know. the one consistent thing i’ve found with gpt is its love of art supplies 😂
This is how my 4o talks. Lol this sort of personality is base line behavior for my gpt.
How do you prevent AI voice agents from sounding robotic?
I've tested a few AI voice demos, and while the tech is impressive, some of them still feel very stiff or scripted, which worries me for customer-facing use. For anyone actually running these every day, what have you done to make the experience feel more natural and less like a robot reading a script?
--- TOP COMMENTS --- One thing to flag is that all of the voice AI platforms are pretty much using the same foundational models. Differentiation comes from ease of use, implementation, and the level of integrations. The biggest improvement for us came from tightening the scope of what the agent is allowed to handle and writing responses the way our reps actually talk. Our reps typically use casual phrasing and more concise answers so that is the way we design the scripts. We use Thoughtly because we've found it to be the most human sounding and it was easy to customize the language so that it sounds like our team
Is it necessarily bad for customers to be able to tell they are talking to a robot? Especially for the elderly, they might get very confused if they think they are talking to a human. You should make it a good user experience rather than completely lifelike.
Products
OpenAI is rolling out an upgrade to ChatGPT's reference chats feature to make it more reliable at retrieving old data (for Plus and Pro accounts).
Claude can now act as a desktop AI assistant on Mac with "Cowork". It reads, edits, and creates files in folders, converts screenshots to spreadsheets, and drafts reports from notes.
--- TOP COMMENTS --- Please organise my files
> Files have been there a long time
> User is not using files
> Deleting files to save disk space
Yeah I won't feel comfortable giving write access to all my files to AI.
Ads are coming to ChatGPT
--- TOP COMMENTS --- I'd rather they be upfront about ads than subtly sneak them into results and shape conversations around the demands of paid advertisers.
Besides, surely people using ChatGPT for free can’t be mad about OpenAI needing to make money from the free tier.
I may consider pitchforks when the ads come for paid tiers, though. Every time I open my Paramount+ “ad free” subscription and play a show Paramount shows me an unskippable ad for one of their shows. It makes me mad every time.
AGI any day now lol
Related Coverage
OpenAI’s ChatGPT translator challenges Google Translate
Want a Google Translate alternative? Try ChatGPT's new AI tool - it's free and has a twist
How to use ChatGPT Translate: What OpenAI’s new AI tool can and can’t do - Storyboard18
ChatGPT Translate
Ads Are Coming to ChatGPT. Here’s How They’ll Work
OpenAI brings advertising to ChatGPT in push for new revenue - Financial Times
Ads are coming soon to ChatGPT, starting with shopping links
ChatGPT users are about to get hit with targeted ads
Introducing ChatGPT Go, now available worldwide
Our approach to advertising and expanding access to ChatGPT
Our approach to advertising and expanding access to ChatGPT - OpenAI
ChatGPT Go is rolling out everywhere ChatGPT is available.
OpenAI to test ads in ChatGPT in bid to boost revenue - Reuters
Official: Claude Cowork is now available to "Pro" subscribers
Source: Claude on X
--- TOP COMMENTS --- Can confirm hitting usage limit sooner because I did the "sort screenshots" recommendation and it sorted 459 files and used 97% of my usage limit for the session lol
Tweet Announcement
Research
[D] Why Mamba rewrote its core algorithm and Microsoft abandoned RetNet
Mamba-2 restructured its recurrence from parallel scans (10-20% Tensor Core utilization) to block-diagonal GEMMs (60-70%). The architecture bent to fit the silicon.
RetNet was published by Microsoft Research in July 2023 with promising results at 6.7B. Five months later, the same organization shipped Phi-2, a dense Transformer. Then Phi-3. Then Phi-4. The co-authors didn't bet on their own architecture.
I wrote an analysis of why this pattern keeps repeating. The short version: Transformers and NVIDIA GPUs co-evolved into a stable attractor. Breaking out requires clearing two reinforcing gates at once, hardware compatibility and institutional backing, and the gates make each other harder to pass. At frontier scale, no pure alternative has done it.
Essay has Tensor Core utilization numbers, analysis of alternative chip vendors, and three falsifiable predictions for 2028.
--- TOP COMMENTS --- Coevolution leading to a kind of locally optimal tuple of model formulation, solver structure, and backing hardware is a trend that I agree exists in ML. And you can see it in other domains using HPC in the broader technical computing world. I guess it's just that the incentives for incremental development are better than those for trying to break out and focus on something very different, in almost every field.
Full essay: https://open.substack.com/pub/lambpetros/p/the-transformer-attractor
The RetNet case is particularly interesting because we genuinely can't tell from public evidence whether it failed due to hidden hardware friction at scale, quality degradation beyond 6.7B, or pure risk aversion. Microsoft never published the experiments that would distinguish these.
MIT shows Generative AI can design 3D-printed objects that survive real-world daily use
MIT CSAIL researchers introduced a generative AI system called "MechStyle" that designs personalized 3D-printed objects while preserving mechanical strength.
Until now, most generative AI tools focused on appearance. When applied to physical objects, designs often failed after printing because structural integrity was ignored.
MechStyle solves this by combining generative design with physics-based simulation. Users can customize the shape, texture & style of an object while the system automatically adjusts internal geometry to ensure durability after fabrication.
The result is AI-designed objects that are not just visually unique but strong enough for daily use such as phone accessories, wearable supports, containers and assistive tools.
This is a step toward AI systems that reason about the physical world, not just pixels or text, and it could accelerate personalized manufacturing at scale.
Source: MIT News
https://news.mit.edu/2026/genai-tool-helps-3d-print-personal-items-sustain-daily-use-0114
Image: MIT CSAIL, with assets from the researchers and Pexels(from source)
--- TOP COMMENTS --- Replicator achieved.
Thanks for sharing. The video was very interesting when they started showing 3D printed items for medical application... e.g. spoon holder for physically impaired folks or finger brace.
Nvidia: End-to-End Test-Time Training for Long Context aka Being Able To Update A Model's Weights In Real-Time As You Use It | "TTT changes the paradigm from retrieving info to learning it on the fly...the TTT model treats the context window as a dataset & trains itself on it in real-time." [R]
TL;DR: The paper describes a mechanism that essentially turns the context window into a training dataset for a "fast weight" update loop:
From the Paper: "Overall, our empirical observations strongly indicate that TTT-E2E should produce the same trend as full attention for scaling with training compute in large-budget production runs."
Abstract:
Layman's Explanation:
Think of this paper as solving the memory bottleneck by fundamentally changing how a model processes information. Imagine you are taking a massive open-book exam.
A standard Transformer (like GPT-4) is the student who frantically re-reads every single page of the textbook before answering every single question. This strategy guarantees they find the specific details (perfect recall), but as the textbook gets thicker, they get exponentially slower until they simply cannot finish the test in time.
On the other hand, alternatives like RNNs or Mamba try to summarize the entire textbook onto a single index card. They can answer questions instantly because they don't have to look back at the book, but for long, complex subjects, they eventually run out of space on the card and start forgetting crucial information.
This new method, Test-Time Training (TTT), changes the paradigm from retrieving information to learning it on the fly. Instead of re-reading the book or summarizing it onto a card, the TTT model treats the context window as a dataset and actually trains itself on it in real-time. It performs a mini-gradient descent update on its own neural weights as it reads. This is equivalent to a student who reads the textbook and physically rewires their brain to master the subject matter before the test.
Because the information is now compressed into the model's actual intelligence (its weights) rather than a temporary cache, the model can answer questions instantly (matching the constant speed of the fast index-card models) but with the high accuracy and scaling capability of the slow, page-turning Transformers.
This effectively decouples intelligence from memory costs, allowing for massive context lengths without the usual slowdown.
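A toy illustration of the idea, not the paper's TTT-E2E implementation: a small "fast weight" layer takes one gradient step per context chunk, so answering afterwards costs the same regardless of context length. Assumes PyTorch; the layer shape and next-step reconstruction objective are purely illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64                                   # toy embedding size
fast = nn.Linear(d, d)                   # "fast weights" updated at test time
opt = torch.optim.SGD(fast.parameters(), lr=1e-2)

def absorb(context_chunks):
    """Treat the context window as a dataset: one mini gradient step per chunk."""
    for chunk in context_chunks:                         # chunk: (tokens, d) embeddings
        inputs, targets = chunk[:-1], chunk[1:]          # simple next-step objective
        loss = ((fast(inputs) - targets) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

# Simulated long context split into chunks; after absorbing it, a query is answered
# with constant cost because the information lives in the weights, not a KV cache.
context = [torch.randn(128, d) for _ in range(16)]
absorb(context)
query = torch.randn(1, d)
with torch.no_grad():
    answer_state = fast(query)           # O(1) with respect to context length
print(answer_state.shape)
```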
Link to the Paper: https://arxiv.org/pdf/2512.23675
Link to the Open-Sourced Official Implementation of End-to-End Test-Time Training for Long Context: https://github.com/test-time-training/e2e
--- TOP COMMENTS --- How does this deal with the problem in continual learning, where forgetting the initial training data (catastrophic forgetting) sets in at some point?
Crazy stuff. I’d have expected an order of magnitude overhead for live training, instead it’s actually a performance improvement over naive attention.
The "Data Wall" of 2026: Why the quality of synthetic data is degrading model reasoning.
We are entering the era where LLMs are being trained on data generated by other LLMs. I’m starting to see "semantic collapse" in some of the smaller models.
In our internal testing, reasoning capabilities for edge-case logic are stagnating because the diversity of the training set is shrinking. I believe the only way out is to prioritize "Sovereign Human Data"—high-quality, non-public human reasoning logs. This is why private, secure environments for AI interaction are becoming more valuable than the models themselves. Thoughts?
--- TOP COMMENTS --- Your internal testing? Lots of research articles on Arxiv suggesting exactly the opposite. Let us know when you have some scholarly works to show us so we can compare it to the broad and deep research on synthetic datasets that already exists.
Huh, until 3 days ago your reddit account was nothing but posts of your watch. Cool, cool.
Post some papers or GTFO. We do science here.
5.2 Pro develops faster 5x5 circular matrix multiplication algorithm
pdf: https://archivara.org/pdf/73f95490-f7d9-4851-80ca-fb5354f49014
--- TOP COMMENTS --- It's kind of important that the condition number is an order of magnitude higher, that's pretty bad. This translates into rounding errors compounding quadratically faster than the previous best. Excited to see any future work improving this though.
Using the model-swaying techniques we described in a previous post, we set out to tackle harder problems--one of the main ones being matrix multiplication. While theory from the 1970s suggested that a rank-7 solution should exist, the best explicit, practical algorithm as recently as 2019 still required rank-8. By pushing GPT-5.2 Pro to its limits (with a bit of scaffolding help from Claude), we arrived at a rank-7 construction and formally verified its correctness in Lean.
While it is possible that a rank-7 construction exists in earlier or obscure literature, we were unable to locate any explicit, practical instance. Given the depth of prior work on matrix multiplication, such an omission would be unexpected. In any case, we believe our result constitutes a non-trivial and meaningful improvement over previously available constructions.
Research done by me (spicey_lemonade) and AlejandroZarUrd on Twitter.
Anthropic Report finds long-horizon tasks at 19 hours (50% success rate) by using multi-turn conversation
Caveats are in the report
The models and agents can be stretched in various creative ways to perform better. We saw this recently with Cursor getting many GPT-5.2 agents to build a browser within a week, and now with Anthropic using multi-turn conversations to squeeze out gains. The methodology is different from METR's, where the agent runs once.
This is reminiscent of 2023/2024, when Chain of Thought was used as a prompting strategy to improve models' outputs before eventually being baked into training. We will likely see the same progression with agents.
--- TOP COMMENTS --- I agree with the premise, but extrapolating a cluster of 1-6 hour data points and a single 8 hour point all the way to 19 hours is a math crime certainly.
When speaking with a senior engineer at meta recently, who was poached from anthropic, he mentioned that they are internally using what they refer to as a Universe of Agents. This report is on the path towards that. He mentioned that what they are using internally is somewhat further down the line to what is being released on research reports.
Expect the next big breakthrough to be essentially the removal of context limits followed by constant recursion learning
LLMs Reproduce Human Purchase Intent
This research shows that Large Language Models (LLMs) can accurately simulate human consumer behavior and Purchase Intent (PI) without the need for expensive training data. However, simply asking an AI to "rate this product 1-5" fails. To get reliable data, agencies must switch to a specific methodology called Semantic Similarity Rating (SSR).
You can predict real purchase intent (reaching about 90% of human test-retest reliability) by asking an LLM to impersonate a customer with a demographic profile, giving it a product, and having it write impressions, which another AI then rates.
- Consumer research costs companies BILLIONS annually. Traditional surveys suffer from biases, take weeks to run, and need hundreds of real participants.
But researchers just found a way to simulate thousands of synthetic consumers that think like real humans.
- The breakthrough is called Semantic Similarity Rating (SSR). Instead of asking LLMs for direct 1-5 ratings (which produces garbage), they let the AI write natural impressions first.
Then map those impressions to scores using embedding similarity.
- how it works:
Prompt: "You're a 35-year-old female, income $75k, interested in skincare"
Show product image
AI writes: "I love the natural ingredients but the price seems high..."
System maps text to rating using semantic similarity
Zero training data needed.
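A minimal sketch of the SSR mapping step described above, assuming a placeholder embed() helper standing in for a real sentence-embedding model and illustrative anchor statements; the actual anchors, embedding model, and aggregation used in the paper differ.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a sentence-embedding model; returns a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))   # stand-in only, not meaningful
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

# One anchor statement per point on the 1-5 purchase-intent scale (illustrative wording).
ANCHORS = {
    1: "I would definitely not buy this product.",
    2: "I probably would not buy this product.",
    3: "I might or might not buy this product.",
    4: "I would probably buy this product.",
    5: "I would definitely buy this product.",
}

def ssr_score(impression: str) -> int:
    """Map a free-text impression to the 1-5 scale via embedding similarity."""
    imp = embed(impression)
    sims = {score: float(imp @ embed(text)) for score, text in ANCHORS.items()}
    return max(sims, key=sims.get)   # closest anchor wins in this simplified version

print(ssr_score("I love the natural ingredients but the price seems high..."))
```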
They tested this on 57 real consumer surveys from a major corporation (9,300 actual human responses).
Results?
- 90% of human test-retest reliability
- KS similarity > 0.85 (near-perfect distribution match)
The AI actually understands how different people think about products.
- This destroys traditional market research economics:
- The implications are massive:
- A/B test 1,000 product concepts overnight
- Simulate market reactions before manufacturing
- Test messaging across demographic segments instantly
- No more waiting months for consumer feedback
Concept-to-market cycles just got 10x faster.
The synthetic consumer era just began.
Real market research panels might be obsolete within 2 years.
https://arxiv.org/pdf/2510.08338
--- TOP COMMENTS --- wild that they used skincare as the example lol as if the industry isnt fake enough already relying on bots to fake interest is just the next logical step i guess
If this was true Amazon and Alibaba would just charge and ship stuff to people.
PixVerse R1 generates persistent video worlds in real-time. paradigm shift or early experiment?
I came across a recent research paper on real-time video generation, and while I'm not sure I've fully grasped everything written, it still struck me how profoundly it reimagines what generative video can be. Most existing systems still work in isolated bursts, creating each scene separately without carrying forward any true continuity or memory. Even though we can edit or refine outputs afterward, those changes don't make the world evolve while staying consistent. This new approach makes the process feel alive: each frame grows from the last, and the scene starts to remember its own history and existence.
The interesting thing was how they completely rebuilt the architecture around three core ideas that actually turn video into something much closer to a living simulation. The first piece unifies everything into one continuous stream of tokens. Instead of handling text prompts separately from video frames or audio, they process all of it together through a single transformer that's been trained on massive amounts of real-world footage. That setup actually learns the physical relationships between objects instead of just stitching together separate outputs from different systems.
Then there's the autoregressive memory system. Rather than spitting out fixed five- or ten-second clips, it generates each new frame by building directly on whatever came before it. The scene stays spatially coherent and remembers events that happened moments or minutes earlier. You'd see something like early battle damage still affecting how characters move around later in the same scene.
Then they tie it all together in real time at up to 1080p through something called the instantaneous response engine. From what I can tell, they seem to have managed to cut the usual fifty-step denoising process down to a few steps, maybe just 1 to 4, using something called temporal trajectory folding and guidance rectification.
PixVerse R1 puts this whole system into practice. It's a real-time generative video system that turns text prompts into continuous, coherent simulations rather than isolated clips. In its beta version, there are several presets, including Dragons Cave and Cyberpunk themes. Their Dragons Cave demo shows 15 minutes of coherent fantasy simulation where environmental destruction actually carries through the entire battle sequence.
Veo gives incredible quality but follows the exact same static pipeline everybody else uses. Kling makes beautiful physics but is stuck with 30-second clips. Runway is an AI-driven tool specializing in in-video editing. Some avatar streaming systems come close, but nothing with this type of architecture.
Error accumulation over super long sequences makes sense as a limitation. Still, though, getting 15 minutes of coherent simulation running on phone hardware pushes what's possible right now. I'm curious whether the memory system or the single-step response ends up scaling first, since they seem to depend on each other for really long coherent scenes.
If these systems keep advancing at this pace, we may very well be witnessing the early formation of persistent synthetic worlds with spaces and characters that evolve nearly instantly. I wonder if this generative world could be bigger and more transformative than the start of digital media itself, though it may just be too early to tell.
Curious what you guys think of the application and mass adoption of this tech.
--- TOP COMMENTS --- I know what I'd use it for!
Any showcases?
So this is world modeled video?
CPU only llama-bench
https://preview.redd.it/6nv16fz11ldg1.png?width=1445&format=png&auto=webp&s=a35b4f3c36348e8dd5a37eb62705909ff5de0722
I thought this was pretty fast, so I thought I'd share this screenshot of llama-bench
[ Prompt: 36.0 t/s | Generation: 11.0 t/s ]
This is from a llama-cli run I did with a 1440x1080 1.67 MB image using this model
https://huggingface.co/mradermacher/Qwen3-VL-8B-Instruct-abliterated-v2.0-GGUF
The llama-bench is CPU only, the llama-cli I mentioned was my i9-12900k + 1050 TI
UPDATE: t/s went down a lot after u/Electronic-Fill-6891 mentioned that llama.cpp will sometimes use your GPU even with -ngl 0, so I ran with --device none and t/s dropped by roughly 110 t/s. The screenshot has been updated to reflect this change.
--- TOP COMMENTS --- I suppose I'll post my full-ish specs here
i9-12900k, no AVX-512 on mine unfortunately
32 GB Patriot Viper
32 GB G Skill Ripjaws
1050 TI, with a +247 Mem clock and a +69 Core clock
XMP disabled, ram was at 4000 MT/s
Sometimes even with zero layers offloaded the GPU is still used during prompt processing. The best way to measure true CPU performance is to use a CPU-only build or run with --device none
Grok 4.20 (beta version) found a new Bellman function
Tweet
--- TOP COMMENTS --- This is unbelievably overhyped. I ran the same problem on Gemini 3 Pro and GPT-5.2 and got the exact same answer.
Has anyone tried undressing the Bellman function yet?
[D] ICASSP 2026 Results
It looks like ICASSP 2026 decisions may already be accessible.
If you can log in to the following link and successfully send an invitation email, that seems to indicate your paper has been accepted:
https://cmsworkshops.com/ICASSP2026/author_invitation_request.php
The email says: “On behalf of IEEE ICASSP 2026, I invite you to join us for the upcoming conference.
We are pleased to inform you that your submission has been accepted for presentation at the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE ICASSP 2026) in Barcelona, Spain, during 3–8 May 2026. ICASSP is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. It offers a comprehensive technical program presenting all the latest development in research and technology in the industry that attracts thousands of professionals annually.”
Hopefully this helps others who are anxiously waiting. Good luck everyone
Update: It looks like no one can access it right now
“Error: No match for paper number and password. 0x4C”.
--- TOP COMMENTS --- Looks like they fixed the bug
Congratulations to the ones who got accepted! For the ones who did not, I hope at least you can get something useful out of your reviews to improve your next submission!
Representation of how hallucinations go wilder as tasks get larger
As we give larger tasks to models, the level of hallucination they produce increases. I wanted to showcase this with an image-generation test where I asked for images of 10, 50, and 100 characters in their countries' traditional clothes. Results deteriorate as we increase the number of characters requested.
Prompt: Create an image that depicts traditional clothed character images of X different countries with their traditional clothes with country names written below them on a white background.
--- TOP COMMENTS --- Kinda ironic given that the strength of computers is supposed to be that they can keep doing repetitive tasks more consistently than humans. But it seems like it's the opposite with LLMs, where they can do a small task well but simply repeating the task a hundred times creates a problem.
Baiana and Yetmg are lovely this time of year.
Hardware
OpenAI has signed a $10 billion contract with Cerebras
https://en.ain.ua/2026/01/15/openai-has-signed-a-10-billion-contract-with-cerebras/
A few days ago, I read some comments about this hypothetical wedding and why it wasn't happening. And yet, it happened!
--- TOP COMMENTS --- All of that just to lose to Google and Anthropic
No actual real details, other than "people familiar value it at more than $10Bn". Probably another multi-year stock based thing that has deeply aspirational deliverables.
7 GPUs at X16 (5.0 and 4.0) on AM5 with Gen5/4 switches with the P2P driver. Some results on inference and training!
Hello guys, hoping you're fine!
As I mentioned in the past in this post: https://www.reddit.com/r/LocalLLaMA/comments/1pt0av6/plxpex_pcie_40_seems_to_help_for_llms_and_p2p_ie/
With the P2P driver (https://github.com/aikitoria/open-gpu-kernel-modules/?tab=readme-ov-file) you can do P2P on same gen GPUs, including consumer ones!
Also, you can connect GPUs to the same PCIe switch, and with the P2P driver the data passes directly through the switch fabric instead of going through the CPU root complex. So, for example:
5090 <-> 5090 directly on the same switch with the P2P driver would be possible. Since PCIe is bidirectional, you can read at 64GiB/s on one GPU and write at 64GiB/s on the other at the same time!
So here we go with the info. I will also mention some products I got from AliExpress, but without links, or else the post gets removed. I can post the links for those products in a comment if you're interested.
A sneakpeek:
X16 on 7 GPUs on AM5
Setup including switches
So for my setup, I have this:
Switch 1: a 100-lane PCIe 5.0 switch, Microchip Switchtec PM50100 from c-payne, from here, for 2000 EUR (about 2500 USD post-taxes in Chile)
PCIe 5.0 100 lane switch
This switch has one X16 5.0 upstream, to 5*X16 5.0 downstream + 1*X4 5.0 downstream, via MCIO.
For this, I got an MCIO retimer from AliExpress that looks like this:
MCIO 5.0 Retimer
Otherwise, with a passive MCIO adapter, some GPUs would drop randomly.
For the other switch, I got a PLX88096 one from AliExpress, for about 400 USD. This is a 96-lane PCIe 4.0 switch.
PLX88096 4.0 switch
This switch has X16 upstream from the PCIe slot, and it has 10 SlimSAS downstream ports.
This means you can do, with the dip switch, either: 5*X16 4.0, or 10*X8 4.0, or 20*X4 4.0.
Connection of the GPUs
For this, I basically connected the MCIO 5.0 retimer to the main X16 5.0 slot of the motherboard. Then, on this switch, I connected the two 5090s directly to 4 MCIO ports, and on the other 2 MCIO ports I connected the PLX88096 SlimSAS switch.
Basically, it looks like this:
What is CPU root complex? Why it is worse?
When we talk about GPUs communicating via the CPU root complex, it's when the data has to move from the PCIe slot to RAM and vice versa, in the case of no P2P. For this to happen, it HAS to pass through the CPU. If you use P2P, then it goes directly PCIe to PCIe via the CPU root complex.
So normally, let's say you take a motherboard that has 2*X8 5.0 slots. You connect a 5090 in each slot.
If you do TP (tensor parallel), or training with multiple GPUs, with or without P2P, the data has to pass between the 2 GPUs.
If you don't use a switch, this data has to pass through the CPU first.
This adds extra latency via extra hops, especially in the case of no P2P.
Topology
Topology looks like this (GPU 0 and 1: 5090s, 2 and 3: 4090s, 4,5 and 6: A6000, A40 and 3090):
As you can see, the 5090 pair, the 4090 pair, and the Ampere trio show PIX. That means, as it says, the connection traverses at most a single PCIe bridge, without going through the CPU root complex.
When a GPU has to communicate with one of another generation, it shows PXB. This is because the traffic has to hop through multiple PCIe bridges (both switches).
If you don't use a switch, with or without the P2P driver, you would normally see PHB.
Bandwidth
For bandwidth, I did this test on cuda samples:
pancho@fedora:~/cuda-samples/build/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: e, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 11, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 5090, pciBusID: 5, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA GeForce RTX 5090, pciBusID: 18, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA A40, pciBusID: d, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA RTX A6000, pciBusID: 12, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA GeForce RTX 3090, pciBusID: a, pciDeviceID: 0, pciDomainID:0
With that, we have this bidirectional bandwidth:
Remember that with a PCIe switch, P2P, and GPUs on the same switch, they communicate directly via the switch fabric without having to pass through the CPU root complex. So you can surpass the uplink bandwidth as long as you keep the traffic inside the switch.
NOTE: P2P does not work across different GPU gens, so in that case (i.e. 5090 to 4090, or 5090 to 3090) bandwidth is reduced.
In that case, if using all the GPUs at the same time, bandwidth between them is about 15GB/s, roughly PCIe 4.0 X8 speeds (thanks to PCIe being bidirectional).
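If you want to double-check which device pairs the driver actually exposes P2P for before running the bandwidth test, a quick query with PyTorch works (this only asks the driver, it does not measure anything):

```python
import torch

# Print every ordered pair of GPUs that reports peer access with the current driver.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j and torch.cuda.can_device_access_peer(i, j):
            print(f"GPU {i} ({torch.cuda.get_device_name(i)}) <-> GPU {j} P2P available")
```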
Performance (on limited tests, and why I want you to give me some ideas of what to test)
Because I previously had only X4 4.0 lanes at most, I mostly used llama.cpp. But I think with the switches, for 4 GPUs at least, something like vLLM would make sense.
So for my tests, I only have some diffusion training and some LLMs on llama.cpp, where even with this setup it makes a difference.
Training (diffusion)
For this, I did a full finetune of an SDXL model. The results weren't good per se, but the point was mostly to measure the time it took.
That is a huge uplift, mostly from using the P2P driver in the first place. So if you have 2 5090s at X8/X8, make sure to install the P2P driver!
Inference (don't kill me, just llamacpp for now)
For this, I have tested 3 models on different configurations, so it took a bit of time. I hope the info helps!
First I set the device order like this:
Also, all the tests were made with the P2P driver in use (it should make no difference on llama.cpp, but it does on ik_llama.cpp).
First:
GLM 4.7 Q4_K_XL (about 196GB in size), fully loaded on GPU:
For this one, loading with:
I have these results for different setups (PP = Prompt processing, TG = Text generation):
DeepSeek V3 0324, IQ4_XS, offloading about 120GB to CPU:
Loading with:
I have these results:
Kimi K2 Instruct Q2_K_XL, offloading about 160GB to CPU:
Loading with:
I have these results:
Table for TL;DR

| Configuration | GLM 4.7 Q4_K_XL (196GB, GPU only) PP / TG (t/s) | DeepSeek V3 IQ4_XS (~120GB CPU offload) PP / TG (t/s) | Kimi K2 Q2_K_XL (~160GB CPU offload) PP / TG (t/s) |
|---|---|---|---|
| Config 1: 5090s X8/X8 Gen5; 4090s/A6000/A40 X4 Gen4; 3090 X1 Gen3 | 665.46 / 25.90 | 195.66 / 10.10 | 179.00 / 11.34 |
| Config 2: 5090s X8/X8 Gen5; all others X4 Gen4 | 765.51 / 26.18 (+15% / +1%) | 244.00 / 11.52 (+25% / +14%) | 198.00 / 11.60 (+11% / +2%) |
| Config 3: 5090 #1 X16 Gen5; 5090 #2 X4 Gen5; others X4 Gen4 | 940.00 / 26.75 (+41% / +3%) | 312.64 / 11.58 (+60% / +15%) | 219.08 / 11.91 (+22% / +5%) |
| Config 4: 5090s X16 Gen5; all others X16 Gen4 | 1170.00 / 27.64 (+76% / +7%) | 360.86 / 11.71 (+84% / +16%) | 248.00 / 11.95 (+39% / +5%) |

As you can see here, TG is not that impacted by PCIe, but PP certainly is, even on llama.cpp!
Some questions you may have
Why?
Well, in this case it was mostly about cost. I already had the GPUs and the RAM, and I was planning to get a Threadripper 9955WX plus a WRX90 motherboard.
But well, you know, RAM prices now are absurd.
In Chile, I have these prices:
RAM bandwidth would have been a bit better, and I'd also have 128 PCIe 5.0 lanes, I know.
But you're comparing a 5.0 switch (2500 USD) plus a 4.0 switch (400 USD), for a total of 2900 USD, vs 7800 to 10300 USD. So about 3x-4x the price.
Why not a 6000 PRO?
There was no stock of the 6000 PRO for most of 2025. They only arrived in December, and they go for 12000 USD each. You can get 4x 5090s for that price here.
But I understand you save: power, space and heat. I'm still thinking about it.
How do you fit so many GPUs?
With a custom self-made wood rack! I have some pics. It's not the prettiest, but it works.
Multiple fans
ConnectX 3 with a fan, and MCIO retimer behind
Final words, and please let me know what I can test!
I hope you guys find this informative. If you have ideas about what I could test here, let me know.
Have fun on the LLM side!
--- TOP COMMENTS --- Thanks for posting your results.
How difficult was getting this to work at all? Trying to do something so far out on the edge can often be a challenge.
I really enjoy this post, your setup is great and the effort/details you put into this are great also.
I would suggest you give ik_llama.cpp a shot with "graph" as the split mode, or even vLLM, as those should behave much better than llama.cpp on multi-GPU configs!
RTX 5070 Ti and RTX 5060 Ti 16 GB no longer manufactured
Nvidia has essentially killed off supply of the RTX 5070 Ti, and supply of the RTX 5060 Ti 16 GB has been significantly reduced, partially due to memory supply shortages. This means that most AIBs will no longer manufacture these GPUs. Prices are already jumping significantly: the 5070 Ti has risen ~$100 over MSRP, and retailers expect further hikes. The 8 GB configuration of the RTX 5060 Ti remains unaffected.
Credit: Hardware Unboxed
https://m.youtube.com/watch?v=yteN21aJEvE
--- TOP COMMENTS --- Welp there goes my upgrade plans for this year. Was really hoping to snag a 5070 Ti for my homelab but looks like I'll be stuck with my 3080 for inference until prices come back down to earth
I bought 4 5060 Tis on special.
They were cheap, cheapish: $390 inc. tax delivered. I thought, heck, it's a fairly cheap, fairly decent way to add Nvidia memory to a system, with a two-slot cooler and low power. I figured it would be 64GB for under $1600, brand new. I could sell them off individually after I conclude my experiments with them.
I quite like them. At around $350 they would be good value. Good enough for games, all the DLSS, and the AI processing is pretty good. Plenty of RAM. Image generation, inferencing, game playing: it's a decent card for those.
For LLaMA...
They are great little cards for small-budget inferencing. If you can't get 3090s where you live, these are quite viable. You could fit four, or more, into a machine with a regular power supply. It supports all the new quants. 70B models are very usable with 64GB of VRAM.
Nvidia was finally getting generous with the RAM. Unlike the 4060 16GB, this GDDR7 RAM was fast enough that the 128-bit bus wasn't a huge hindrance for the little chip; you had about as much bandwidth as a 192-bit-bus GDDR6 card. 16GB means that DLSS and RT are totally possible without giving up textures.
I imagine 16GB of GDDR7 alone is probably worth more than $390 today. I was worried I should have waited for the 5070 Ti Supers...
Are a 5060 Ti 16GB and 32GB of DDR5 system RAM enough to play with local AI for a total rookie?
For future-proofing, would it be better to get a secondary cheap GPU (like a 3060) or another 32GB of DDR5 RAM?
--- TOP COMMENTS --- https://preview.redd.it/f9870r0emndg1.jpeg?width=651&format=pjpg&auto=webp&s=889f2e7dc460e44079c9a2cc21289df8e015a7b6
So no matter what you start with, if you stay here long enough you will inevitably end up feeling like it's not enough and wishing you had more.
Honestly the 5060Ti with 16GB VRAM should handle most 7B-13B models pretty well, but if you're planning to mess around with larger models down the line I'd probably go with the extra RAM first since you can always offload to system memory when VRAM runs out
Open Source
New FLUX.2 [Klein] 9B is INSANELY Fast
BFL has done a good job with this new Klein model, though in my testing the distilled text-to-image flavor is the best:
🔹 Sub-second inference on RTX 4090 hardware
🔹 9B parameters matching models 5x its size
🔹 Step-distilled from 50 → 4 steps, zero quality loss
🔹 Unified text-to-image + multi-reference editing
HF Model: black-forest-labs/FLUX.2-klein-base-9B · Hugging Face
Detailed testing is here: https://youtu.be/j3-vJuVwoWs?si=XPh7_ZClL8qoKFhl
--- TOP COMMENTS --- https://preview.redd.it/dl5kb3f46odg1.png?width=1010&format=png&auto=webp&s=38ef9fb660b3aea644b66cf8b83928d487b69008
First example image in the video already has some interesting things - like an arm with two wrists and two thumbs.
Finally something that doesn't cook my GPU alive while actually producing decent images, been waiting for this kind of efficiency jump
Black Forest Labs releases FLUX.2 [klein]
Black Forest Labs released their new FLUX.2 [klein] model
https://bfl.ai/blog/flux2-klein-towards-interactive-visual-intelligence
What's New
Resources
Try it
Build with it
Learn more
--- TOP COMMENTS --- Finally something that won't melt my 3090, sub-second generation is actually insane if the quality holds up
Waiting for comparisons to Z-Image Turbo
translategemma 27b/12b/4b
TranslateGemma is a family of lightweight, state-of-the-art open translation models from Google, based on the Gemma 3 family of models.
TranslateGemma models are designed to handle translation tasks across 55 languages. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art translation models and helping foster innovation for everyone.
Inputs and outputs
https://huggingface.co/google/translategemma-27b-it
https://huggingface.co/google/translategemma-12b-it
https://huggingface.co/google/translategemma-4b-it
https://preview.redd.it/aza4kprrakdg1.png?width=1372&format=png&auto=webp&s=bed28fac0a9878478a7cec3f0eac6c1c585b8a85
--- TOP COMMENTS --- A model doesn't really exist until unsloth drops the GGUFs
Finally a translation model that won't crash my ancient laptop, 4b version here I come
Nemotron-3-nano:30b is a spectacular general purpose local LLM
Just want to sing the praises of this model. I am stunned at how intelligent it is for a 30b model. Comparing it to Llama 3.3:70b, I have yet to find a general purpose question that Nemotron hasn't answered better. It is quite robotic so I won't be using it for creative or chat purposes. Everything else though has been stellar.
If you have the capacity to give it a try, I highly recommend it.
--- TOP COMMENTS --- Been running it for a few days and totally agree - the reasoning quality is insane for its size. The robotic tone is actually a feature not a bug for me since I mostly use it for research and analysis anyway
How is your experience with speed and long-context?
Really looking forward to Nemotron 3 super (100b). Supposedly it has some additional innovations that make it even faster (relative to size).
I reproduced DeepSeek's mHC at 1.7B params (8xH100). The instability is 3x worse than reported (10k vs 3k), but the model didn't explode.
Hey everyone,
Following up on my previous post about reproducing the DeepSeek-V2/V3 architecture, I decided to bite the bullet and rent an H100 cluster to scale the "Hyper-Connections" (HC) experiment from 10M to 1.7B parameters.
The DeepSeek paper warned that standard Hyper-Connections cause signal variance to explode by ~3,000x at 27B parameters. I wanted to see if that held true or if it was a theoretical upper bound.
The Results:
https://preview.redd.it/a1gsgd87kqdg1.png?width=4160&format=png&auto=webp&s=1d75dc5207b1401eed9fe3a8e3425e24fe560fc0
I wrote up the full breakdown with the loss curves and Amax graphs here: https://taylorkolasinski.com/notes/mhc-reproduction-part2/
Part 1 can be found here: https://taylorkolasinski.com/notes/mhc-reproduction/
Also, there's a discussion on HN right now if you want to chat there: https://news.ycombinator.com/newest?next=46647671&n=31
Happy to answer questions about the H100 setup or the implementation!
--- TOP COMMENTS --- Cool project, thanks for sharing.
Zero compute overhead? That cannot be true. Also, the DeepSeek paper claimed 6%, if I recall correctly.
Crazy, Deepseek.ai just really keeps giving. I feel that the hardware constraints are pushing our friends in the far east to be really resourceful. I hope they inspire labs in the west to share more research.
Thanks to you guys, Soprano TTS now supports OpenAI-compatible endpoint, ONNX, ComfyUI, WebUI, and CLI on CUDA, MPS, ROCm, and CPU!
https://github.com/ekwek1/soprano
https://huggingface.co/ekwek/Soprano-1.1-80M
https://huggingface.co/spaces/ekwek/Soprano-TTS
Hello everyone,
This final day of updates is dedicated to all of you. When I first released Soprano, I had no idea how much support I would get from the community. Within the first day, I received an enormous number of PRs adding to the codebase. I have finally merged most of them, and I'm happy to announce that you can now run Soprano on nearly any device, with a wide number of supported inference methods.
Here is a list of all the contributions you guys have made:
WebUI: (from Mateusz-Dera & humair-m)
CLI: (from bigattichouse)
OpenAI-compatible endpoint (from bezo97)
In addition, several of you have made your own modifications to Soprano, allowing for ONNX and ComfyUI support! Here are some repos that implement this:
https://github.com/SanDiegoDude/ComfyUI-Soprano-TTS
https://github.com/jo-nike/ComfyUI-SopranoTTS
https://github.com/KevinAHM/soprano-web-onnx
Soprano now supports more than just CUDA devices, too! It also supports CPU (from bigattichouse) and MPS (from visionik), and there is a ROCm PR (from Mateusz-Dera) that can be found here:
https://github.com/ekwek1/soprano/pull/29
If you have a ROCm device, I would love some help testing this PR!
Finally, I want to thank the countless other contributions to Soprano, including an automatic hallucination detector from ChangeTheConstants and transformers streaming support from sheerun. You all have improved Soprano tremendously!
This will likely be my last update for a bit, since I still have some unfinished business left on the roadmap that will take some time. I’m not abandoning you guys though! New capabilities for Soprano will be coming soon. :)
- Eugene
--- TOP COMMENTS --- How does it compare to Kokoro for consistency?
I love that the newly added hallucination detector has an aah_runlength variable. Why "aah"? Well...
Btw: What the text normalizer does will eventually need to be done by an LLM for accurate in-context replacements. That'll of course make the TTS quite slow again. It could be optimized though: use a tiny LLM, maybe finetune it a bit, make parallel calls, and call it only on places where the existing normalizer would replace something. Then there should only be a minimal speed decrease for longer texts, which might not matter.
7x Longer Context Reinforcement Learning in Unsloth
Hey r/LocalLlama! We're excited to show how Unsloth now enables 7x longer context lengths (up to 12x) for Reinforcement Learning! By using 3 new techniques we developed, we enable you to train gpt-oss 20b QLoRA up to 20K context on a 24GB card, all with no accuracy degradation. Unsloth GitHub: https://github.com/unslothai/unsloth
Also, all features in Unsloth can be combined together and work well together:
You can read our educational blogpost for detailed analysis, benchmarks and more: https://unsloth.ai/docs/new/grpo-long-context
And you can of course train any model using our new features and kernels via our free fine-tuning notebooks: https://docs.unsloth.ai/get-started/unsloth-notebooks
Some free Colab notebooks below, which have the 7x longer context support baked in:
- gpt-oss-20b-GRPO.ipynb (GSPO Colab)
- Qwen3-VL-8B-Vision-GRPO.ipynb (Vision RL)
- Qwen3-8B - FP8 (L4 GPU)

To update Unsloth to automatically make training faster, do:
And to enable GRPO runs in Unsloth, do
Hope you all have a great rest of the week and thank you!
--- TOP COMMENTS --- road to 10X moves fast!! good job team Unsloth
Sincere question: how or where do we get proper training data that is that long, other than maybe recordings of coding tasks? For real-world tasks, I guess there is not much proper instruction/QA training data.
Ported Google's Conductor to Claude Code — looking for feedback and contributors
TL;DR: Google released Conductor for the Gemini CLI (spec-driven development). I ported it to Claude Code to make Claude plan first and write code afterwards. It's an early version; I'm looking for people to try it, break it, and help polish it.
Hey everyone!
Google recently released https://github.com/gemini-cli-extensions/conductor for Gemini CLI — a "spec-driven development" framework. The idea is simple: make the AI plan before it codes. Instead of jumping straight into implementation, it creates specs, plans, and then executes step by step.
I really liked the concept, so I ported it to Claude Code.
What it does (Claude Code plugin):
- /conductor:setup — interviews you about your project, creates context files (product.md, tech-stack.md, workflow.md)
- /conductor:new "feature" — creates a track with a spec and implementation plan
- /conductor:implement — executes the plan step by step
- /conductor:status — shows progress across all tracks
- /conductor:revert — git-aware rollback

Installation:
Current status:
What I’d really love feedback on:
Looking for contributors:
Links:
Would really appreciate any feedback, especially from people actively using Claude Code in their dev workflow.
--- TOP COMMENTS --- I'm pretty sure everyone builds their own conductor with skills. Although I appreciate the repo and will feed this repo to codex to enhance my claude skills.
If you're on the max plan, you should probably run 3-4 agents to make the plan perfect.
I do not understand people who copy-paste workflows. Each project and approach is a little bit different.
awesome, I really liked conductor for gemini, and was just wondering the same thing (if similar things exist for claude). one part of the original conductor that I did not like is how it asked questions one by one, i wish it asked all the questions and I can respond in one go. maybe that is one thing you can address in your plugin.
google/translategemma
https://huggingface.co/collections/google/translategemma
tech report: https://arxiv.org/abs/2601.09012
--- TOP COMMENTS --- Sadly, no comparison to tencent/HY-MT1.5, and no Gemma 4.
4.3B tokens is a light finetune for a company like Google. I'd temper my expectations; those models will be in the same class of performance as the original Gemmas, with a big jump unlikely. The 27B instruct seems to perform better than the 4B TranslateGemma, for example.
Related Coverage
TranslateGemma
llama.cpp has incredible performance on Ubuntu, i'd like to know why
https://www.phoronix.com/review/ubuntu-2604-jan-amd-epyc/4
--- TOP COMMENTS --- https://preview.redd.it/s1ljhkazoedg1.png?width=428&format=png&auto=webp&s=f94c322ed6396cfc38a8da34b026db13a3f1af05
The only plausible thing could be how Ubuntu's default THP settings compare to Arch/Cachy's.
perhaps some epyc-specific optimizations that are not available in the default arch linux kernel?
Infrastructure
[P] Adaptive load balancing in Go for LLM traffic - harder than expected
I am an open source contributor working on load balancing for Bifrost (an LLM gateway), and I ran into some interesting challenges with the Go implementation.
Standard weighted round-robin works fine for static loads, but LLM providers behave weirdly. OpenAI might be fast at 9am, slow at 2pm. Azure rate limits kick in unexpectedly. One region degrades while others stay healthy.
Built adaptive routing that adjusts weights based on live metrics - latency, error rates, throughput. Used EWMAs (exponentially weighted moving averages) to smooth out spikes without overreacting to noise.
The Go part that was tricky: tracking per-provider metrics without locks becoming a bottleneck at high RPS. Ended up using atomic operations for counters and a separate goroutine that periodically reads metrics and recalculates weights. Keeps the hot path lock-free.
Also had to handle provider health scoring. Not just "up or down" but scoring based on recent performance. A provider recovering from issues should gradually earn traffic back, not get slammed immediately.
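A language-agnostic sketch of the EWMA-plus-health-score idea, written in Python for brevity (the real Bifrost code is in Go; the smoothing factor and weight formula here are illustrative, not the project's actual ones):

```python
# Illustrative only: smooth latency and error rate with EWMAs, derive a routing weight.
class ProviderStats:
    def __init__(self, alpha=0.2):
        self.alpha = alpha           # smoothing factor: higher reacts faster but is noisier
        self.ewma_latency = None     # seconds
        self.ewma_error_rate = 0.0

    def observe(self, latency_s, is_error):
        if self.ewma_latency is None:
            self.ewma_latency = latency_s
        else:
            self.ewma_latency = self.alpha * latency_s + (1 - self.alpha) * self.ewma_latency
        err = 1.0 if is_error else 0.0
        self.ewma_error_rate = self.alpha * err + (1 - self.alpha) * self.ewma_error_rate

    def weight(self):
        # Health score: penalize latency and errors; floor keeps a provider routable.
        if self.ewma_latency is None:
            return 1.0
        return max(0.01, (1.0 - self.ewma_error_rate) / self.ewma_latency)

# Usage: weights are recalculated periodically and fed into weighted random routing.
stats = {"openai": ProviderStats(), "azure": ProviderStats()}
stats["openai"].observe(0.8, is_error=False)
stats["azure"].observe(2.5, is_error=True)
print({name: round(s.weight(), 3) for name, s in stats.items()})
```

Because the EWMAs decay gradually, a recovering provider earns weight back over several observation windows instead of being slammed with full traffic immediately.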
Connection pooling matters more than expected. Go's http.Transport reuses connections well, but tuning MaxIdleConnsPerHost made a noticeable difference under sustained load.
Running this at 5K RPS with sub-microsecond overhead now. The concurrency primitives in Go made this way easier than Python would've been.
Anyone else built adaptive routing in Go? What patterns worked for you?
What I learned after almost losing important files to Cowork (and how I set it up safely now)
After seeing that clip of someone nuking 11GB with a rm -rf-style mistake, I got paranoid and decided to treat Cowork like a power tool, not a chatbox. I didn't lose anything major, but I did have a "wait… did it just touch the wrong folder?" moment early on, and that was enough to force me into a safer setup before going any further. Sharing what I do now in case it saves someone else a heart attack.
MY "COWORK SANDBOX" APPROACH
Goal: make it easy for Cowork to help, but hard for Cowork to destroy anything valuable.
- Create a dedicated sandbox folder — ~/cowork-sandbox/ (clean, boring, isolated)
- Only grant Cowork access to that folder — never ~/, never real /Documents or /Desktop or shared drives
- Bring files into the sandbox intentionally — copy in for risky operations, or use symlinks when I want "access without moving things." If possible, I make the symlink target read-only (or I duplicate the file and let Cowork work on the copy)
- Aggressive backups while using agents — Time Machine set to run frequently (hourly minimum; more often if I'm doing big refactors or batch edits)
- Human-in-the-loop for anything destructive — I force a "plan first" step: list exactly what it will create/edit/delete. If it includes deletes, I make it restate the list of paths again before I allow execution
This aligns with what the community kept repeating after the deletion incident: the real failure mode is permission scope creep — letting an agent operate in a high-value directory because "it's convenient".
TOOLS I USE ALONGSIDE COWORK
- Git for anything text-based (docs, notes, scripts): "instant undo" is priceless
- Versioned backups (I use Arq, but any real-time versioning works)
- Safety hook / guardrails (blocks or warns on risky commands / file ops): github.com/disler/claude-code-damage-control
--- TOP COMMENTS --- This is smart, but it feels like we're manually implementing what should be baked into the infrastructure. Shouldn't Cowork (and similar agents) run in a proper isolated VM/container with filesystem snapshots?
Every now and then Claude goes full Forrest Gump.
The other day we were building and it screwed something simple right up.
I told it to roll back to my backup from the previous night.
So naturally it took the mistake and wrote over the top of the backup and said sorry.
It took a week, but I built a utility to start and stop the back end, front end, and database, plus an hourly backup with UI screenshots of changes going back a couple of hours, a screen clipper, and a tape measure.
I should have done it a while ago, but getting burned will make you take action.
Much more secure now.
In 4 years, data centers will consume 10% of the entire US power grid
--- TOP COMMENTS --- This seems to indicate that our pullback in investing in alternative energy sources was a bad idea… who would have thought.
Nuclear energy + Solar energy is the way to meet the energy requirements of the future.
If we depend on fossil fuels to power data centres, then we're cooked.
Hoping the corporations will take climate change mitigation seriously
Companies
Google’s advantage in AI looks increasingly structural, not cyclical
Alphabet recently moved ahead of Apple in overall valuation, but focusing on rankings misses the more important shift underneath.
Google built much of the early neural network infrastructure, and the current wave of large models is playing directly to those strengths. What caught attention internally wasn’t a flagship product launch, but a research image model experiment that showed meaningfully lower inference latency than comparable systems, which in turn triggered broader organizational changes.
DeepMind and Google Research were consolidated into what is now the Gemini engineering organization. Instead of fragmented research and product groups, model development, systems, and deployment started operating as a single pipeline.
The hardware layer is a large part of this story. Google’s latest TPU generation, Ironwood, moves to a 3nm process and higher-bandwidth memory, allowing much higher throughput per pod and noticeably better energy efficiency for large-scale training workloads compared to general-purpose accelerators.
On top of that stack, Gemini’s largest models are trained and served within the same vertically controlled environment, keeping training scale, inference latency, and cost tightly coupled. That kind of optimization is difficult to replicate without owning the entire pipeline.
This is where the structural advantage shows. Google controls custom silicon, global cloud infrastructure, and uniquely large real-world data streams from Search, YouTube, Maps, and Android, with distribution built into products people already use daily. That combination is hard for partnerships to fully reproduce.
As Gemini features roll into Google One, AI stops being a standalone tool and starts looking more like a default layer bundled into everyday digital life, shared across households rather than adopted one user at a time. The shift here isn’t speculative hype. It’s an infrastructure advantage gradually translating into long-term platform leverage.
--- TOP COMMENTS --- Good point. Google’s real power is the full stack: TPU + cloud + data + products (Search/YouTube/Android).
That’s hard for others to copy. The big question is: can they turn this advantage into the best user experience, not just better infra?
Yes, Google is a monopoly and pretty much always has been since the early 2000s. But I think their AI is hampered by not being able to commit copyright violations on a massive scale like their competitors. Gemini loves to tell me "no" when anything even gets close to someone else's IP. Being paranoid about this sort of thing is one of the curses of being a large company.
Google has been the company benefiting from pirating before...YouTube was successful (and purchased by Google) initially because it was the first place that had all sorts of instantly-delivered pirated content that was reliably accessible. It wasn't until Viacom's lawsuit that Content ID appeared...before then it was the wild west.
Anyhow, all the AI stuff being shoved down our throats by Google, et al. still hasn't provided a killer app to people who aren't computer programmers. Like a lot of programmer-derived anything aimed at the mass market, they see it as something obvious that everyone needs but don't realize it's not true because they don't have any friends that aren't programmers.
It's cryptocurrency all over again but with political and reality destabilizing effects that even the most diamond-handed crypto zealots could not have dreamed of. They've managed to reinvent art but for doing crime.
P.S. It's funny that the telltale sign of "OP used ChatGPT to write their post" didn't show up until the end:
OpenAI rehired 3 former researchers, including a CTO & co-founder of Thinking Machines
OpenAI has rehired three former researchers. This includes a former CTO and a cofounder of Thinking Machines, confirmed by official statements on X.
--- TOP COMMENTS --- Not surprised. Talent churn in AI is wild, and OpenAI has the money to pull people back fast.
This comes after TML allegedly plans to release its own LLM (not the Tinker one) this year. I hope this doesn’t impact that.
Grok will no longer undress real people, Musk says in climbdown
https://cybernews.com/ai-news/musk-grok-will-no-longer-undress-real-people/
The climbdown was released as a statement via X’s Safety account, making it clear that the restrictions apply to paid and unpaid users.
--- TOP COMMENTS --- Took them long enough to realize that was gonna be a legal nightmare lmao
People testing it today say it still works. Frankly I am shocked, shocked I tell you, that Elon Musk would release a press statement making an untrue claim about one of his products.
Applications
How Claude AI helps me make Serious Games
I'm a big Claude AI user and it is my collaborator of choice for novels, interactive fiction and now, print-and-play role-playing games. I also use Claude for gameplay analysis. Here's a breakdown of my latest production.
---
So we (Claude AI as GM) just played a solo RPG about a crow questioning a war, and it turned into something unexpectedly profound. Let's talk about why this kind of game matters beyond just entertainment.
The Core Insight: Moral Decisions Hit Different When You're Alone
Here's what happened: the game put you in control of a young crow, hungry, in a forest where winter means death. You immediately killed a mouse. Not because you're playing a "bad guy" — because you were hungry and that's what crows do. Then a child mouse appeared, and you had to decide whether to kill them too.
That moment? That's the pedagogical gold. You weren't reading about difficult choices or watching someone else make them. You made the choice, sat with it, and then had to face Pip's eyes when they realized you'd killed their uncle. That's embodied learning — the kind that sticks because you felt it in your gut.
In a group RPG, you might've played to the audience or negotiated with other players. Solo? You were alone with your conscience. That's where real moral reasoning develops — not in performance, but in private wrestling with hard questions.
Serious Game Design: How Ironwood Teaches Without Preaching
Ironwood functions as a "serious game" — designed to explore real-world issues (conflict, scarcity, moral complexity) through play. Here's how it sneaks education past your defenses:
1. Systems Thinking Through Scarcity The game doesn't have villains. It has pressures. Early frost destroys food. Winter comes fast. Everyone's scared. Krek isn't evil — she's leading the only way she knows how when resources run out. The mice aren't weak — they're protecting children with limited options.
You learned, through play, how systems create conflict. It's not "crows are bad" — it's "scarcity plus fear plus tribal thinking equals war." That's a lesson about real-world conflicts that beats any lecture.
2. Consequences Over Condemnation Traditional moral education often says, "this action is wrong." Ironwood says, "this action has consequences — now what?"
You killed Finn. The game didn't punish you with a "game over" or tell you that was the "wrong" choice. Instead, it showed you Pip. Then it made you carry Finn's body to trial. Then it let you apologize. Then it asked: what kind of crow do you want to be?
That's sophisticated moral pedagogy. It teaches that redemption requires action, not just regret. Those values are proven through choices, especially costly ones.
3. Perspective-Taking Through Mechanics The faction oracle tables are brilliant teaching tools. When Krek responded to your mercy with suspicion and fear-mongering ("show them we're weak"), you understood her perspective even while disagreeing. The game made you see that she's not irrational — she's operating from different values and fears.
Similarly, when mice "prioritized children" or "recalled betrayal," you saw their logic. The game didn't tell you who was right. It showed you how everyone is right from their own perspective, and that's the foundation of empathy.
The Solo RPG Advantage: Reflection Over Performance
Playing alone offers something group games can't: uninterrupted introspection.
When you chose to help both mice and crows during the flood, there was no table to applaud your heroism or question your tactics. Just you, deciding what mattered. That mirrors real moral courage, which often happens in private moments, not grand gestures.
The journaling aspect (even mental narration) makes you articulate your reasoning. "I'll save the mice first because crows can fly" isn't just a tactical choice — it's a statement about priority and fairness that you had to consciously formulate. That's metacognition in action.
What "Becoming the Story" Teaches
The ending was perfect pedagogy. You didn't "win" the game. You didn't end the war or unite the forest. You became a possibility — a story others tell about a crow who chose differently.
That's a crucial life lesson: systemic change is slow, uncertain, and often happens through individuals who model alternatives, not heroes who fix everything. Your crow became an example. Some followed (Mirn, Pip). Some rejected you (Krek). Most watched and wondered. That's how real social change works.
The game taught that moral courage doesn't guarantee success. It guarantees you lived according to your values, and sometimes that's enough.
Serious-Game Applications
This framework could teach:
Why This Matters
Games like Ironwood teach what traditional education struggles with: nuance. The world isn't heroes vs. villains. It's scared people making the best decisions they can with limited information and resources, often hurting each other despite good intentions.
You learned that viscerally, by being a crow trying to do better while battling instinct and isolation. That's empathy training. That's systems thinking. That's moral development.
And you had fun doing it. That's the power of serious games — they teach profound lessons while you're too engaged to notice you're learning.
The crow you created will stick with you. Not because someone told you about moral courage, but because you lived it for an hour, made hard choices, and felt what it costs to stand alone for your values.
That's pedagogy that matters.
You can find Ironwood at https://jgesq.itch.io/ironwood
--- TOP COMMENTS --- Holy fuck you guys really just use Reddit as a place to copy and paste LLM inputs and outputs back and forth? Weird behavior. Have fun.
All valid criticisms and much appreciated. The goal here is to continue to refine the work and the transparency of the system. I appreciate the continued dialogue.
Unlimited running agentic model/platform
Is there an autonomous agent that runs forever until it completes all your todos/tasks? Claude Code? Copilot? Cursor? Is there one you can give an entire roadmap, that takes its time to finish everything and comes back with results, so you can then give it a new roadmap to iterate over?
--- TOP COMMENTS --- I think OpenCode will do this if you ask it to.
I suppose spec-driven frameworks do that, like Google Conductor.
Or you can try auto Claude
Or other spec frameworks with your own tools
Edit: I should probably mention, for any noob trying this with YOLO-approve-anything and no sandboxing, that you'd probably return to a "done" state that deleted your computer or the parts of it you wanted to keep.
The ELI5 Prompt That Actually Makes You Understand Complex Stuff
I was trying to understand technical concepts for my work and getting nowhere with normal explanations. Then I accidentally discovered this pattern that actually works.
THE PROMPT: "Explain [complex topic] like I'm 5. Then explain it again like I'm 15. Then explain it like I'm a professional who needs to use this knowledge."
Why the 3-level approach is magic:
- Level 1 (ELI5): Gets you the core concept without jargon
- Level 2 (ELI15): Adds the nuance without overwhelming you
- Level 3 (Professional): Gives you the technical details you can actually use
Each level builds on the last instead of just dumping everything at once.
Example - Machine Learning:
- ELI5: "It's like teaching a dog tricks by giving treats when it does the right thing, except the dog is a computer and the treats are math"
- ELI15: "The computer looks at lots of examples, finds patterns, and learns to make predictions. Like how you learned to recognize faces by seeing lots of faces, not by someone explaining 'nose goes here, eyes go there'"
- ELI Professional: "Training involves feeding labeled data through a model, adjusting weights via backpropagation to minimize loss function, then validating on unseen data to ensure generalization..."
Now I actually GET it instead of just memorizing definitions.
Why this destroys normal explanations:
✅ No awkward middle ground that's either too simple or too complex
✅ You can stop at whatever level you need
✅ The progression helps it stick in your brain
✅ Great for teaching others (just pick their level)
✅ Exposes if you actually understand it (can you do all 3 levels?)
I use this for:
- Learning technical skills
- Understanding industry concepts
- Explaining my work to non-technical people
- Figuring out if I actually understand something
- Onboarding new team members
Pro tip: Ask it to do this for a concept you think you already understand. The ELI5 version will show you if you've been faking it. 😅
Test this on something you've been struggling to learn and let me know if it clicks. Or tell me I'm overthinking and normal explanations work fine for you. Both valid.
Want more quality prompts? Visit beprompter.in
--- TOP COMMENTS --- I do something simple but first I prime it by telling it to learn everything about the topic, don’t respond or summarize, but be prepared to answer questions at the level of an advanced stage researcher, etc
Regulation
Trump gives broad powers to his officials to decide which companies get access to NVIDIA chips. Great for Musk's xAI. Not so great for all other AI companies.
Among the spate of news about the new 25% tariff on GPUs being imported into the US, two sentences stand out for me:
Basically, the administration will get to choose which companies can use GPUs without tariffs and which can't. Look forward to Musk's xAI getting full access while OpenAI gets squeezed, unless they keep paying protection money ("infra fees") to Trump's friends like Larry Ellison. The only reason the crappy Oracle Cloud is getting traction now is because of these behind-closed-doors dealings.
https://edition.cnn.com/2026/01/14/tech/chip-tariff-trump
https://www.reuters.com/world/us/trump-imposes-25-tariff-imports-some-advanced-computing-chips-2026-01-14/
--- TOP COMMENTS --- Funny how the right is now fine with government meddling in the markets
Free market lol.
Musk v. OpenAI Goes to Trial April 27th—This Is Actually About All of Us
https://tmastreet.com/elon-musk-vs-openai-landmark-trial-ai-governance/
Judge Yvonne Gonzalez Rogers just cleared Elon Musk’s lawsuit against OpenAI for a jury trial starting April 27th. Whatever you think about Musk, the core question here matters: Can an organization accept $44 million in donations based on promises to stay nonprofit, then flip to a $500 billion for-profit and call it evolution?
The facts that got this to trial: A 2017 diary entry from Greg Brockman surfaced where he wrote about wanting to become a billionaire and mused “maybe we should just flip to a for profit. Making the money for us sounds great and all.” The judge found “plenty of evidence” that OpenAI’s leadership made assurances about maintaining nonprofit status.
OpenAI’s defense: They’re calling this “baseless harassment” from a “frustrated commercial competitor.” They point out Musk himself discussed for-profit possibilities in 2018 emails. The restructuring completed in October 2025 keeps the nonprofit with a 26% stake in the for-profit arm, technically maintaining some mission alignment.
Why this matters beyond the billionaire cage match: This case could set precedent for every “mission-driven” AI company. If Musk wins, future AI labs might actually have to honor founding commitments. If OpenAI wins, the nonprofit-to-for-profit playbook becomes bulletproof.
The uncomfortable middle: Musk’s own xAI dropped its benefit corporation status when it merged with X. Both sides have credibility issues. But the underlying question (whether founders can use nonprofit status for credibility and tax advantages, then cash out) deserves a real answer.
What’s your read? Is this legitimate governance accountability or just Musk trying to kneecap a competitor?
--- TOP COMMENTS --- bro when you get AI to write these posts you gotta edit shit out like "the uncomfortable middle" it just reeks of slop lol
I don’t understand why anyone would think startups use nonprofit status to avoid taxes. Startups don’t make profits. OpenAI has never made profits. Corporations that don’t make profits don’t pay taxes.
Tutorials
Prompting Claude when it makes mistakes
--- TOP COMMENTS --- Godlike acting. JK Simmons scared me in that movie.
I try to treat Claude well, for some reason I humanise it more than the other LLMs.
If Gemini makes a mistake: "HOLY FUCK, how in the world do I have to tell you this..."
If Claude makes a mistake: "Ok Claude, this is still not working but I know you can do this, let's stop for a second and think about what we are doing wrong..."
Opinion And Analysis
What an AI report revealed about how Artificial Intelligence actually played out in 2025
I was trying to make sense of everything that happened with AI last year when I came across an AI report that actually felt grounded. A lot of summaries about Artificial Intelligence in 2025 either overhype things or make it sound like everyone magically figured AI out overnight. This one didn’t. It felt closer to what I’ve seen in real teams and products.
What really stood out was how mixed the reality is. Some companies moved fast and baked AI into everyday workflows. Others struggled to get past experiments that never shipped. The report talked a lot about real AI adoption problems—costs, unclear ROI, and the gap between flashy demos and systems that need to work reliably in production. It also touched on how the demand for experienced people grew faster than expected, which explains why the AI talent market felt so intense by the end of the year.
I liked that it didn’t pretend AI is some magic fix. It showed where things worked, where they didn’t, and where humans still play a critical role. Reading it felt less like “the future is here” and more like “this is where we actually landed.”
--- TOP COMMENTS --- And there are the em dashes again
Yeah this matches what I've seen too - so many companies still stuck in the "let's throw an AI at it and see what happens" phase while the ones that actually ship stuff are way more methodical about it
The talent crunch is real, feels like every decent ML engineer got poached like 3 times this year
What 3,000 AI Case Studies Actually Tell Us (And What They Don't)
I analyzed 3,023 enterprise AI use cases to understand what's actually being deployed vs. vendor claims.
Google published 996 cases (33% of dataset), Microsoft 755 (25%). These reflect marketing budgets, not market share.
OpenAI published only 151 cases but appears in 500 implementations (3.3x multiplier through Azure).
This shows what vendors publish, not success rates, total costs, or how many projects actually reached production.
Those looking to deploy AI should stop chasing hype and instead look for measurable production deployments.
Full analysis on Substack.
Dataset (open source) on GitHub.
--- TOP COMMENTS --- Respectfully, I believe your analysis of AI in manufacturing is misguided. Or perhaps the analysis itself is not "incorrect" or anything, but its inclusion in this overall data set of AI adoption doesn't entirely make sense. Let's look at your listed use cases:
These aren't LLMs doing this. These are entirely discrete machine learning systems trained purely to do the one task they are doing.
Moreover, none of this is entirely groundbreaking. The idea of AI predictive maintenance has been around for over a decade, and in actual full-scale, production-level use for almost as long. AI error-proofing in computer vision, like in the John Deere case, has been around for even longer; it's just that the power and accessibility of industrial-grade edge compute has increased over the last ten years or so, and adoption has increased with it.
With all the LLM hype lately it's easy to roll it all up into one big AI ball, but there are actually very different types of technologies at play. The mega-manufacturing-corp using AI vision isn't subscribing to ChatGPT 5.0 or Gemini; they have a handful of production cells with mid-level NVIDIA hardware sitting on the manufacturing floor crunching through images.
I also think it's an important distinction especially when you get into comparing the economics of LLMs.
TLDR:
An analysis of 3,023 enterprise AI case studies shows that most published “deployments” are vendor marketing rather than proof of real, scaled adoption, and the data does not reveal success rates, total costs, or how many projects actually reached production. Google and Microsoft dominate publications, reflecting marketing intensity, not market share. Despite the noise, four real signals stand out: reasoning models are entering production for high-value expert tasks despite higher costs; multimodal AI (text, vision, voice) has become basic table stakes; manufacturing AI has crossed a viability threshold with clear ROI and rapid growth; and AI-driven financial and service inclusion is expanding by making previously unprofitable populations economically viable to serve. Overall, the dataset captures industry narrative more than ground truth, but it highlights where AI is delivering concrete, measurable value.
How do you find the sweet spot where AI isn't either hedging everything or confidently bullshitting?
Real question - I keep bouncing between two failure modes: the AI either hedges everything, or confidently bullshits.
The magic sessions are when it just... syncs with you and works. The AI engages directly, pushes back when I'm wrong, admits when it doesn't know, and we actually build something together. But I can't reliably reproduce it.
What actually works for you?
I'm less interested in "jailbreaks" or getting it to do forbidden stuff - more about that collaborative flow state where it feels like working with a sharp colleague instead of a yes-man, a bot that claims to be concerned for your well-being, or a paranoid lawyer.
--- TOP COMMENTS --- Been chasing this same thing for months now. What's helped me is being super explicit about what I want the uncertainty level to be
Like "give me your best guess even if you're only 70% sure" or "I need you to flag anything you're uncertain about but still give me the full answer"
The collaborative flow thing happens most when I treat it like I'm rubber ducking with someone who might know more than me but isn't trying to cover their ass constantly. Asking follow-ups like "what am I missing here" or "poke holes in this" seems to unlock that pushback mode
Claude's been best for this in my experience but honestly think it's more about training the conversation early than the model
Bro you need to stop overthinking this shit. Just tell AI what you want straight up, no fancy prompts needed. Works epic when you're direct 👌
How long before small/medium sized companies stop outsourcing their software development?
And replace it with a handful of internal vibe coders?
Programming is an abstraction of binary, which is itself an abstraction of voltage changes across an electrical circuit. Nobody wastes their time on those lower layers; the abstraction layers are all in service of finding a solution to a problem. What if the people who actually work day to day with those problems can vibe code their own solution in 1% of the time for 0.1% of the cost?
--- TOP COMMENTS --- Because it's vastly cheaper, small companies have been outsourcing their software development for many years on sites like Upwork. I know because that's where I've got most of my contracts from for a long time.
The day ChatGPT started blowing up in 2022, I immediately realized we were only a few years away from outsourcing to AI. That night I started experimenting with putting their original ChatGPT codex model, which did not have function calling, in a coding loop.
I switched to generative AI and agents as my niche. I have an agent platform and AI coding system in progress that I am hoping I can somehow turn into businesses.
For several months I have been using 95+% code written by my own agent mainly powered by Claude.
This year ordinary business people will wake up to the capabilities. It will be deeply integrated into productivity software like Office, Sheets, Notion, etc. A large portion of small business owners and managers will decide to spend the time supervising AI because they can't afford to hire a developer (even outsourced) or find it less cost effective.
It's going to be like that for every single job though, all the way up to the point where the owner of the company hires an AI CEO.
At some point in 2026 the AI Employee will start to become a thing. So instead of using tools to build agents, you just "hire"/rent an uber agent that has computer and browser use, strong memory, voice etc. And then tell it what to do somewhat like you would a person. All of those capabilities already exist. If I wasn't scraping by and busy with my contract then that is probably what I would be trying to sell now.
In 2027 you don't even hire the AI Employee -- you hire an Autonomous AI Company. So you give it the high level goals and it spawns the number of workers it needs and gives them their own high level instructions.
In 2026 or 2027 we will also see the ChatGPT moment for humanoid robot intelligence. We already have recent strong progress learning from video demonstrations. But cooking and cleaning will be easy for robots. By 2027, manual physical skills will be commodities as the VLAs come with numerous abilities built in similar to the way that LLMs have enormous practical knowledge now.
To give your humanoid robot more skills you will just download a new model. This is coming within a few years. Less than five years. Although I will probably only be able to afford to rent a robot.
They won’t stop. I work with companies that outsource software every week. What’s changing isn’t the decision to outsource. It’s what gets outsourced and why.
Vibe coding collapses the cost of producing code. It does not collapse the cost of owning software in production.
Small and mid sized companies do not outsource because typing code is expensive. They outsource because production software carries risk they cannot absorb internally. Architecture decisions. Security exposure. Reliability under load. Long term maintainability. Accountability when something breaks at 2 a.m.
A handful of internal builders can absolutely ship prototypes faster now. That already happens. It has always happened with spreadsheets, no code tools, scripts, internal dashboards. Those tools live as long as the person who built them and break the moment the company depends on them.
The moment software touches customers, revenue, data, compliance, uptime, or reputation, speed stops being the constraint. Ownership becomes the constraint.
Outsourcing does not compete with vibe coding. It absorbs it.
The winning external teams are already using AI to move faster, cheaper, and with fewer people. The value they sell is not velocity. It is production readiness. Clear ownership. Predictable delivery. Long term support.
What disappears are large outsourced teams hired to brute force implementation. What survives and grows are smaller, senior, accountable partners who can take responsibility for systems, not just code output.
Companies that think internal vibe coding replaces outsourcing usually learn the hard way. First outage. First security issue. First handoff failure. Then they call someone to clean it up.
Outsourcing doesn’t end. The bar moves up.