DeepSeek did not just publish another model checkpoint. In a span of weeks, the Chinese lab pushed out DeepEP, DeepGEMM, FlashMLA and EPLB, four repositories that expose how it handles expert-parallel communication, matrix math, attention kernels and load balancing inside large-scale mixture-of-experts systems.

The individual numbers are the sort that infrastructure engineers notice immediately: DeepEP says its intranode dispatch reaches 153 GB/s on H800 NVLink, its internode dispatch hits 58 GB/s over RDMA at 32-way expert parallelism, and its low-latency path can keep dispatch time at 194 microseconds even at 256-way expert parallelism. DeepGEMM says it can reach 1,550 TFLOPS on H800. FlashMLA says its updated kernels hit 660 TFLOPS on H800 SXM5. On GitHub, developers treated the releases less as research curiosities than as usable building blocks: FlashMLA had more than 12,500 stars, DeepEP about 9,200 and DeepGEMM nearly 7,000 as of April 24.

That is why this story matters. DeepSeek is turning internal systems craft into public infrastructure. For rivals, startups and open labs, that offers a clearer path to reproducing frontier-scale efficiency outside the walls of OpenAI, Anthropic and Google. For Nvidia, the threat is subtler: not a sudden loss of chip demand, but the first credible signs that the software habits tying AI builders to CUDA can be loosened from the outside.
DeepEP Turns MoE Networking Into a Reusable Product

DeepEP packages 153 GB/s intranode bandwidth and 194-microsecond dispatch into software other labs can actually deploy.
The mechanical significance of DeepSeek's open-sourcing push starts with DeepEP, because MoE systems live or die on the cost of moving tokens to the right experts and getting results back without wasting GPU cycles. DeepEP is built for all-to-all communication in expert parallelism, the part of the training and inference loop where activations have to be dispatched across GPUs and then combined. In practice, that is often where ambitious cluster designs become expensive bottlenecks. By publishing a library tuned for high-throughput and low-latency paths, DeepSeek is handing the market something far more valuable than a benchmark slide: an implementation.
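To make that all-to-all round trip concrete, here is a minimal sketch of the dispatch-expert-combine pattern using plain PyTorch collectives. This is not DeepEP's API: the function name, the top-1 routing, and the reliance on torch.distributed.all_to_all_single are illustrative assumptions, and the sketch presumes an already-initialized process group (e.g. NCCL). DeepEP's pitch is precisely that it replaces these generic collectives with dispatch and combine kernels tuned for NVLink within a node and RDMA across nodes.

```python
# Minimal sketch of the MoE dispatch -> expert -> combine round trip using
# plain PyTorch collectives (NOT DeepEP's API). Assumes an initialized
# torch.distributed process group (e.g. NCCL) and top-1 routing, so each
# token goes to exactly one expert-owning rank.
import torch
import torch.distributed as dist

def dispatch_expert_combine(tokens, dest_rank, expert_fn, group=None):
    """tokens: [n, d] local activations; dest_rank: [n] int64 rank owning
    each token's expert; expert_fn: local expert, must preserve shape."""
    world = dist.get_world_size(group)

    # Sort tokens by destination rank so each rank's slice is contiguous.
    order = torch.argsort(dest_rank)
    send_buf = tokens[order]
    send_counts = torch.bincount(dest_rank, minlength=world)

    # Exchange counts so every rank knows how many tokens it will receive.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts, group=group)
    in_splits, out_splits = send_counts.tolist(), recv_counts.tolist()

    # Dispatch: the variable-size all-to-all DeepEP's kernels accelerate.
    recv_buf = tokens.new_empty((sum(out_splits), tokens.shape[1]))
    dist.all_to_all_single(recv_buf, send_buf,
                           output_split_sizes=out_splits,
                           input_split_sizes=in_splits, group=group)

    # Run the locally hosted expert on the tokens this rank now owns.
    expert_out = expert_fn(recv_buf)

    # Combine: reverse all-to-all returns results to their origin ranks.
    combined = torch.empty_like(send_buf)
    dist.all_to_all_single(combined, expert_out,
                           output_split_sizes=in_splits,
                           input_split_sizes=out_splits, group=group)

    # Undo the sort so outputs line up with the original token order.
    out = torch.empty_like(tokens)
    out[order] = combined
    return out
```

The count exchange before the payload transfer is what lets the splits vary token by token as routing decisions shift; that metadata hop plus the two payload all-to-alls is the kind of hot path DeepEP's 153 GB/s and 194-microsecond figures are measuring.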