This episode of The AI Hardware Show reviews recent advances in data center AI inference hardware from Intel, IBM, Qualcomm, Tensordyne, and Tenstorrent, highlighting Intel’s Gaudi 3 accelerator, IBM’s Spyre AIU, Qualcomm’s next-generation AI200 and AI250 chips, Tensordyne’s logarithmic number system approach, and Tenstorrent’s scalable Blackhole system. These developments emphasize improved efficiency, scalability, and specialized architectures aimed at the evolving demands of AI workloads, though challenges remain in software integration, market adoption, and ecosystem support.
In this episode of The AI Hardware Show, Ian Cutress and Sally Ward-Foxton discuss the latest developments in data center AI inference hardware from several key players. Intel’s Gaudi 3 AI accelerator, built on TSMC’s 5nm process, offers strong theoretical performance with 64 tensor processor cores and 128 GB of HBM2e memory, along with 24 ports of 200 Gb Ethernet for scale-out. Despite its technical capabilities and positioning as a cost-efficient alternative to Nvidia, Gaudi 3 has seen limited market adoption, partly due to software challenges and weak integration with Intel’s broader ecosystem. Intel’s roadmap points to the Jaguar Shores chip in 2027 to address these limitations.
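Those 24 Ethernet ports are the basis of Gaudi 3’s scale-out story; a quick back-of-envelope calculation (illustrative only, derived from the port counts quoted above) shows the aggregate interconnect bandwidth per accelerator:

```python
# Aggregate scale-out bandwidth from the quoted Gaudi 3 port configuration.
ports = 24
gbit_per_port = 200

aggregate_tbit_s = ports * gbit_per_port / 1_000   # 4.8 Tb/s total
aggregate_gbyte_s = ports * gbit_per_port / 8      # 600 GB/s total

print(f"{aggregate_tbit_s} Tb/s ({aggregate_gbyte_s:.0f} GB/s) per accelerator")
```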
IBM’s Spyre AIU is a fully enabled data center inference chip designed for enterprise needs, built on a 5nm process with 32 AI cores in a 75-watt power envelope. It supports ultra-low-precision formats down to 2-bit integers, improving inference efficiency while maintaining accuracy. Spyre is integrated into IBM’s cloud offerings and targets mainstream inference tasks such as summarization and fraud detection. IBM continues to invest in its AI Hardware Center, developing chips, software stacks, and deployment frameworks, though wider availability outside IBM remains uncertain.
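To make the 2-bit claim concrete, here is a minimal sketch of symmetric 2-bit weight quantization. This is a generic illustration, not IBM’s actual scheme, which relies on quantization-aware training and finer-grained scaling to hold accuracy at such low bit widths:

```python
import numpy as np

def quantize_int2(w: np.ndarray):
    """Symmetric per-tensor quantization to the 2-bit levels {-1, 0, 1}."""
    qmax = 2 ** (2 - 1) - 1                    # = 1 for a signed 2-bit format
    scale = np.abs(w).max() / qmax             # map the largest weight to +/-1
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int2(w)
print(np.abs(w - dequantize(q, s)).max())      # per-tensor quantization error
```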
Qualcomm has announced its next-generation AI200 and AI250 chips, successors to the older Cloud AI 100 family originally designed for convolutional neural networks rather than modern transformer models. The new chips, optimized for large language and multimodal models, will ship in rack-scale deployments with up to 768 GB of LPDDR memory per card and substantial per-rack power draw. Qualcomm’s approach emphasizes system-level deployment and cost efficiency, though detailed architectural information remains limited at this stage.
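The headline memory capacity matters because inference working sets are dominated by weights plus KV cache. As a rough sizing exercise (the function and model size below are illustrative assumptions, not Qualcomm’s figures), 768 GB comfortably holds very large models at reduced precision:

```python
def serving_footprint_gb(params_billion: float, bits_per_weight: int,
                         kv_cache_gb: float = 0.0) -> float:
    """Rough LLM serving footprint: weight storage plus KV cache."""
    weight_gb = params_billion * bits_per_weight / 8  # 1e9 params * bits/8 bytes, in GB
    return weight_gb + kv_cache_gb

# A hypothetical 400B-parameter model at 8-bit weights needs ~400 GB,
# leaving ~368 GB of a 768 GB card for KV cache and activations.
print(serving_footprint_gb(400, 8))        # -> 400.0
print(768 - serving_footprint_gb(400, 8))  # -> 368.0
```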
Tensordyne takes a novel approach, using a logarithmic number system (LNS) instead of traditional floating-point math, which significantly reduces power and area requirements. The company claims up to four-fold savings in power and area, delivering 16-bit inference accuracy at the power cost of roughly 4-bit compute. Its upcoming chip, built on TSMC’s 3nm node with HBM3 memory and a high-speed mesh interconnect, targets large-scale inference workloads with impressive efficiency claims. Adoption challenges remain, however, since the non-standard math approach requires robust tooling and developer trust.
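The core idea behind an LNS is that multiplication, the dominant operation in neural network inference, becomes simple addition of exponents, while addition requires a correction term that hardware typically approximates with small lookup tables. Below is a minimal sketch of generic LNS arithmetic, not Tensordyne’s actual implementation; zero and mixed-sign handling are omitted:

```python
import math

def to_lns(x: float):
    """Encode a nonzero value as (sign, log2 |x|); zero needs a special code."""
    return math.copysign(1.0, x), math.log2(abs(x))

def lns_mul(a, b):
    # Multiplication in the log domain is just an addition of exponents:
    # log2(x * y) = log2(x) + log2(y). No multiplier circuit needed.
    return a[0] * b[0], a[1] + b[1]

def lns_add(a, b):
    # Same-sign addition needs a correction term:
    # log2(2^p + 2^q) = max(p, q) + log2(1 + 2^(min - max)),
    # where the log2(1 + 2^d) term is what hardware keeps in a small LUT.
    (sa, pa), (sb, pb) = a, b
    assert sa == sb, "mixed-sign addition omitted in this sketch"
    hi, lo = max(pa, pb), min(pa, pb)
    return sa, hi + math.log2(1.0 + 2.0 ** (lo - hi))

def from_lns(v) -> float:
    return v[0] * 2.0 ** v[1]

x, y = to_lns(3.0), to_lns(2.0)
print(from_lns(lns_mul(x, y)))  # -> 6.0
print(from_lns(lns_add(x, y)))  # -> 5.0
```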
Tenstorrent’s Blackhole system emphasizes uniform scalability from chip to data center, using a native Ethernet mesh for low-latency, high-bandwidth interconnects without traditional switches. Each chip combines 140 Tensix AI cores with 16 RISC-V cores, enabling flexible workload handling without offloading to host CPUs. The Galaxy server packs 32 chips delivering 24 petaflops of compute and 16 terabytes of memory; a rough per-chip breakdown of the compute figure follows below. The fully open-source software stack, together with licensable IP, aims to challenge proprietary ecosystems like Nvidia’s CUDA.

Additionally, Kalray, a French AI chipmaker, offers a distinctive Massively Parallel Processor Array (MPPA) architecture optimized for real-time, low-latency AI workloads, with products deployed in telecom and industrial sectors, underscoring regional supply chain security and sovereign chip initiatives.
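As referenced above, dividing the rack-level compute claim across the Galaxy’s 32 chips gives a rough sense of per-chip throughput (a figure derived from the episode’s quoted numbers, not an official per-chip specification):

```python
# Back-of-envelope split of the quoted Galaxy compute across its 32 chips.
chips = 32
rack_petaflops = 24

tflops_per_chip = rack_petaflops / chips * 1000
print(f"~{tflops_per_chip:.0f} TFLOPS per Blackhole chip")  # -> ~750 TFLOPS
```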