
Chinese AI Firm’s Breakthrough Creates Top AI Model, Defies US Sanctions with Less Compute!


By Harper Westfield

DeepSeek, a trailblazing AI enterprise from China, has announced an AI model that it says rivals top-tier models from industry giants such as OpenAI, Meta, and Anthropic, while requiring roughly 11 times less GPU compute and cost. Although these claims have yet to be independently verified, the announcement highlights the ingenuity of Chinese researchers in squeezing maximum performance from limited hardware amid US sanctions that restrict China’s access to advanced AI chips. The company has released the model and its weights publicly, paving the way for independent evaluation.

DeepSeek’s latest model, DeepSeek-V3, is a Mixture-of-Experts (MoE) design with 671 billion parameters, trained on a cluster of 2,048 Nvidia H800 GPUs over roughly two months for a total of 2.8 million GPU hours. By contrast, Meta spent 30.8 million GPU hours training its 405-billion-parameter Llama 3 model on a significantly larger cluster of 16,384 H100 GPUs over 54 days.
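The headline ratio follows directly from these reported figures:

```python
# Sanity check of the compute comparison, using the GPU-hour figures
# reported above (rounded to one decimal place in the source).
llama3_gpu_hours = 30.8e6    # Meta's Llama 3 405B
deepseek_gpu_hours = 2.8e6   # DeepSeek-V3

ratio = llama3_gpu_hours / deepseek_gpu_hours
print(f"Llama 3 used roughly {ratio:.0f}x the GPU hours of DeepSeek-V3")
# prints "Llama 3 used roughly 11x the GPU hours of DeepSeek-V3"
```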

DeepSeek asserts that it has drastically cut down the computing and memory demands generally associated with such large-scale models through the use of sophisticated pipeline algorithms, an optimized communication framework, and the implementation of FP8 low-precision computation and communication techniques.

The company used a cluster of Nvidia H800 GPUs, with GPUs inside each node connected via NVLink and nodes linked by InfiniBand. In such configurations, intra-node GPU-to-GPU communication is fast, but node-to-node communication is much slower, so communication optimizations are crucial for performance and efficiency. DeepSeek introduced numerous optimization strategies to lower the computational demands of DeepSeek-V3, with several key techniques playing a pivotal role in achieving its results.


One significant optimization is the DualPipe algorithm, which overlaps computation and communication phases both within and across micro-batches in the forward and backward directions, effectively reducing pipeline stalls. Specifically, operations like dispatch (routing tokens to experts) and combine (merging results) were executed in parallel with computation through custom PTX (Parallel Thread Execution) instructions tailored for Nvidia CUDA GPUs. This optimization minimized training bottlenecks, especially in the cross-node expert parallelism required by the MoE architecture, allowing the system to process 14.8 trillion tokens during pre-training with negligible communication overhead, according to DeepSeek’s report.
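The benefit of overlap can be seen with a toy scheduling model. This is not DeepSeek’s DualPipe implementation; the micro-batch count and per-phase timings below are made up purely to illustrate why hiding communication behind computation shortens the pipeline:

```python
# Toy model: one engine does computation, another does cross-node
# communication. In the overlapped schedule, the communication for
# micro-batch i runs while micro-batch i+1 is being computed.

def naive_schedule(n_micro_batches, compute_ms, comm_ms):
    """Compute and communication run strictly back-to-back."""
    return n_micro_batches * (compute_ms + comm_ms)

def overlapped_schedule(n_micro_batches, compute_ms, comm_ms):
    """Each micro-batch's communication hides behind later computation."""
    compute_free = 0  # when the compute engine next becomes idle
    comm_free = 0     # when the communication engine next becomes idle
    for _ in range(n_micro_batches):
        compute_free += compute_ms                 # compute batch i
        comm_start = max(compute_free, comm_free)  # comm needs both ready
        comm_free = comm_start + comm_ms
    return comm_free

print(naive_schedule(8, 5, 3))       # → 64 ms
print(overlapped_schedule(8, 5, 3))  # → 43 ms: only the last comm is exposed
```

With eight micro-batches, all but the final communication phase is hidden behind computation, so total time approaches the pure-compute lower bound.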

Furthermore, DeepSeek limited each token to a maximum involvement of four nodes, which curtailed traffic and enabled more effective overlap of communication and computation tasks.
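A node-capped routing rule of this kind can be sketched as follows. The scoring, the expert-to-node mapping, and the function shape are all hypothetical, not DeepSeek’s actual router; the sketch only shows how a per-token node budget constrains expert selection:

```python
# Hypothetical MoE routing sketch: pick the top-k experts for a token
# while touching at most `max_nodes` distinct nodes, mirroring the
# article's "maximum involvement of four nodes" per token.

def route_token(expert_scores, expert_to_node, top_k=8, max_nodes=4):
    """Greedily take the highest-scoring experts within the node budget."""
    ranked = sorted(range(len(expert_scores)),
                    key=lambda e: expert_scores[e], reverse=True)
    chosen, nodes = [], set()
    for e in ranked:
        node = expert_to_node[e]
        if node not in nodes and len(nodes) == max_nodes:
            continue  # this expert would exceed the node budget; skip it
        chosen.append(e)
        nodes.add(node)
        if len(chosen) == top_k:
            break
    return chosen, nodes

# 16 experts spread round-robin across 8 nodes; scores favor high indices.
scores = list(range(16))
expert_to_node = {e: e % 8 for e in range(16)}
chosen, nodes = route_token(scores, expert_to_node)
print(chosen, nodes)  # experts confined to 4 nodes despite 8 being available
```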

A key factor in reducing both computational and communication demands was the adoption of an FP8 mixed precision training framework. This approach allowed for faster computations and lower memory usage while maintaining numerical stability. Crucial operations like matrix multiplications were performed in FP8, though more sensitive components such as embeddings and normalization layers were kept at higher precision levels (BF16 or FP32) to maintain accuracy. This strategy effectively reduced memory demands while keeping the relative training loss error consistently below 0.25%.
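The kind of rounding error FP8 introduces can be illustrated with a small emulation. This is not DeepSeek’s training framework: it simply rounds values to a 3-bit mantissa (as in the E4M3 format), ignoring exponent-range limits and the real bit layout, then compares a low-precision matrix multiply against the exact result. (Note the 0.25% figure above refers to training loss, not per-operation error.)

```python
import math

def quantize_fp8ish(x, mantissa_bits=3):
    """Round x to the nearest value with `mantissa_bits` mantissa bits
    (illustrative only; real FP8 also clamps the exponent range)."""
    if x == 0.0:
        return 0.0
    exponent = math.floor(math.log2(abs(x)))
    step = 2.0 ** (exponent - mantissa_bits)  # spacing of representable values
    return round(x / step) * step

def matmul(a, b, quantize=lambda v: v):
    """Plain matrix multiply, optionally quantizing every input element."""
    qa = [[quantize(v) for v in row] for row in a]
    qb = [[quantize(v) for v in row] for row in b]
    n, k, m = len(qa), len(qb), len(qb[0])
    return [[sum(qa[i][p] * qb[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

a = [[0.3, 1.7], [2.5, 0.9]]
b = [[1.1, 0.4], [0.6, 2.2]]
exact = matmul(a, b)
approx = matmul(a, b, quantize=quantize_fp8ish)

worst = max(abs(approx[i][j] - exact[i][j]) / abs(exact[i][j])
            for i in range(2) for j in range(2))
print(f"worst relative error: {worst:.4f}")  # single-digit percent here
```

The single-digit-percent per-operation error shown here is why sensitive components like embeddings and normalization layers are kept in BF16 or FP32 while bulk matrix multiplications run in FP8.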

In terms of performance, DeepSeek claims that the DeepSeek-V3 MoE language model matches or surpasses models such as GPT-4x, Claude-3.5-Sonnet, and Llama-3.1, depending on the benchmark. Third-party verification is still awaited, but the model and its weights are now open-sourced and available for testing.

Though DeepSeek-V3 may lag behind cutting-edge models such as GPT-4o or o3 in parameter count or reasoning capability, its development demonstrates that an advanced MoE language model can be trained with relatively limited resources. Achieving this, however, required extensive optimization and sophisticated low-level programming; the results are nonetheless promising.


The DeepSeek team acknowledges certain limitations with deploying the DeepSeek-V3 model, especially due to its need for advanced hardware and a complex deployment strategy that separates the prefilling and decoding stages. This might be challenging for smaller companies with limited resources. “While acknowledging its strong performance and cost-effectiveness, we also recognize that DeepSeek-V3 has some limitations, especially on the deployment,” the company’s paper noted. “Firstly, to ensure efficient inference, the recommended deployment unit for DeepSeek-V3 is relatively large, which might pose a burden for small-sized teams. Secondly, although our deployment strategy for DeepSeek-V3 has achieved an end-to-end generation speed of more than two times that of DeepSeek-V2, there still remains potential for further enhancement. Fortunately, these limitations are expected to be naturally addressed with the development of more advanced hardware.”
