Groq, Etched, SambaNova, Taalas // The AI Hardware Show S2E4

The episode surveys cutting-edge AI hardware companies, including Groq, Etched, SambaNova, NEUCHIPS, Taalas, and Positron, each offering a distinct architecture for large language model inference, from highly specialized chips to flexible, vertically integrated systems. It highlights the trade-offs among specialization, scalability, efficiency, and adaptability in the rapidly evolving AI inference landscape, emphasizing ongoing innovation and the diversity of approaches to enterprise and data center needs.

The AI Hardware Show episode focuses on the rapidly evolving sector of large language model (LLM) inference at data center scale, highlighting several innovative companies and their distinct approaches to AI hardware. Groq, founded by former Google TPU designer Jonathan Ross, emphasizes a single-core architecture called the Language Processing Unit (LPU), designed for fast, predictable inference with no caches and no off-chip memory. Its current chip, built on a 14nm process, relies on networking many chips together to handle large models like LLaMA 70B; a next-generation chip on Samsung's 4nm node is planned to address the memory limitation. Groq has also transitioned into a cloud service provider, with significant deployments such as a 19,000-chip cluster in Saudi Arabia.
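
Because the LPU keeps all weights in on-chip SRAM, the scale-out requirement follows from simple capacity arithmetic. A back-of-envelope sketch, assuming Groq's published figure of roughly 230 MB of SRAM per current-generation chip and 8-bit weights (both figures are assumptions, not stated in the episode):

```python
# Why an SRAM-only LPU deployment networks hundreds of chips per model.
# Assumed figures (not from the episode): 230 MB on-chip SRAM per LPU,
# 8-bit quantized weights; KV cache and activations are ignored.

PARAMS = 70e9            # LLaMA 70B parameter count
BYTES_PER_PARAM = 1      # 8-bit weights
SRAM_PER_LPU = 230e6     # on-chip SRAM per LPU, in bytes

weight_bytes = PARAMS * BYTES_PER_PARAM
min_lpus = weight_bytes / SRAM_PER_LPU
print(f"Minimum LPUs just to hold the weights: {min_lpus:.0f}")  # ~304
```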

Etched is a startup focused exclusively on transformer inference with its Sohu chip, which dedicates nearly all of its transistors to dense matrix math, eliminating caches and general control flow to maximize throughput. Fabricated on TSMC's 4nm node, the chip is claimed to deliver around 500,000 LLaMA 70B tokens per second at 8-bit precision, an order of magnitude more than Nvidia's Blackwell B200. That specialization carries high ASIC risk: the chip only runs transformers, leaving it exposed if the industry shifts to other architectures.
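
To put the throughput claim in perspective, a dense decoder pass costs roughly two FLOPs per parameter per generated token (one multiply-accumulate per weight). A rough sanity-check sketch under that assumption, ignoring attention and batching overheads:

```python
# Translate Etched's claimed token rate into implied effective compute.
# Assumption: ~2 FLOPs per parameter per token for a dense decoder pass;
# attention, KV-cache traffic, and batching effects are ignored.

PARAMS = 70e9            # LLaMA 70B
TOKENS_PER_SEC = 500e3   # claimed aggregate token rate
FLOPS_PER_TOKEN = 2 * PARAMS

effective_flops = FLOPS_PER_TOKEN * TOKENS_PER_SEC
print(f"Implied compute: {effective_flops / 1e15:.0f} PFLOP/s of 8-bit math")  # ~70
```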

NEUCHIPS, which built its reputation on recommendation accelerators, targets enterprise inference with its Raptor N3000 chip, balancing flexibility and specialization. Available as an M.2 module or a PCIe card with 32GB of LPDDR5 memory, it is optimized for deployments where latency, power, and cost are critical. The chip supports a novel 8-bit floating-point format called FFP8, which aims to preserve near-FP32 accuracy while enabling high-speed computation. NEUCHIPS positions Raptor for human-facing edge clusters and enterprise racks, prioritizing tight latency over raw throughput.
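
The appeal of a flexible 8-bit float is the tunable split between exponent bits (range) and mantissa bits (precision). The exact FFP8 encoding is not detailed in the episode; the sketch below is a generic illustration of the idea, with 1 sign bit and a configurable exponent/mantissa split:

```python
import math

# Minimal sketch of a configurable 8-bit float in the spirit of an FFP8-style
# format (illustrative; not the actual NEUCHIPS encoding). Layout: 1 sign bit,
# exp_bits exponent bits, (7 - exp_bits) mantissa bits. Subnormals and NaN/Inf
# are omitted; out-of-range values are clamped.

def encode_fp8(x: float, exp_bits: int = 4) -> int:
    man_bits = 7 - exp_bits
    bias = (1 << (exp_bits - 1)) - 1
    sign = 1 if x < 0 else 0
    x = abs(x)
    if x == 0.0:
        return sign << 7
    e = max(min(math.floor(math.log2(x)), bias), 1 - bias)   # clamp exponent
    frac = x / (2.0 ** e) - 1.0                               # normal fraction
    m = max(0, min(round(frac * (1 << man_bits)), (1 << man_bits) - 1))
    return (sign << 7) | ((e + bias) << man_bits) | m

def decode_fp8(b: int, exp_bits: int = 4) -> float:
    man_bits = 7 - exp_bits
    bias = (1 << (exp_bits - 1)) - 1
    if b & 0x7F == 0:                                         # signed zero
        return 0.0
    sign = -1.0 if (b >> 7) & 1 else 1.0
    e = ((b >> man_bits) & ((1 << exp_bits) - 1)) - bias
    m = b & ((1 << man_bits) - 1)
    return sign * (1.0 + m / (1 << man_bits)) * (2.0 ** e)

# Round trip: a wider exponent buys range, a narrower one buys precision.
print(decode_fp8(encode_fp8(3.14159, exp_bits=4), exp_bits=4))  # ~3.25
print(decode_fp8(encode_fp8(3.14159, exp_bits=2), exp_bits=2))  # ~3.125
```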

SambaNova takes a distinctive approach with its SN40L chip, based on a coarse-grained reconfigurable array architecture and integrated with a tiered memory hierarchy that supports models of up to five trillion parameters. Unlike the others, SambaNova sells vertically integrated AI systems rather than chips alone, targeting government labs and enterprises that require data sovereignty and model ownership. Its full-stack solution eliminates the need for CUDA rewrites and supports rapid switching between multiple models, emphasizing flexibility and security over raw token throughput.
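
The tiered-memory claim can be made concrete with a toy placement rule: hold the weights in the smallest tier that fits, and spill to DDR across sockets otherwise. The capacities below are assumptions for the sketch, roughly in the ballpark of figures SambaNova has published for the SN40L, not numbers from the episode:

```python
# Illustrative tiered-memory placement for a three-tier hierarchy
# (on-chip SRAM, HBM, DDR). All capacities are assumed for this example.

TIERS = {                 # bytes available per socket
    "sram": 520e6,        # on-chip SRAM: hot weights and activations
    "hbm":  64e9,         # HBM: working set of the active model
    "ddr":  1.5e12,       # DDR: full parameter store for model switching
}

def placement(param_count: float, bytes_per_param: int = 2) -> str:
    """Return the smallest tier that holds the full weight set on one socket."""
    need = param_count * bytes_per_param
    for tier, cap in TIERS.items():
        if need <= cap:
            return tier
    sockets = -(-need // TIERS["ddr"])   # ceiling division
    return f"ddr across {int(sockets)} sockets"

print(placement(7e9))    # a 7B model fits in HBM
print(placement(5e12))   # a five-trillion-parameter model spills across DDR
```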

Finally, Taalas and Positron represent the two extremes of specialization and adaptability. Taalas aims to compile specific AI models directly into silicon, promising massive efficiency gains at the risk of obsolescence if model architectures change. Positron, founded by former Groq engineers, uses FPGA-based PCIe cards for flexible LLM inference and plans a DDR-only ASIC for cost-effective scaling; its FPGA design achieves high memory-bandwidth utilization and, the company claims, significant performance and efficiency advantages over Nvidia's Hopper systems.
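
Positron's emphasis on bandwidth utilization follows from the economics of decode-phase inference, which is memory-bound: every generated token streams the full weight set. A minimal sketch of that relation, with all hardware numbers chosen purely for illustration rather than taken from the episode:

```python
# Why memory-bandwidth utilization dominates decode throughput (illustrative).
# Per-replica decode rate ~= effective_bandwidth / model_bytes, since each
# generated token reads every weight once. All figures below are assumptions.

def decode_tokens_per_sec(bandwidth_bytes_s: float, utilization: float,
                          params: float, bytes_per_param: int = 2) -> float:
    model_bytes = params * bytes_per_param
    return bandwidth_bytes_s * utilization / model_bytes

HBM_BW = 3.35e12   # assumed HBM3 part, ~3.35 TB/s peak
DDR_BW = 1.0e12    # assumed aggregate of many DDR5 channels, ~1 TB/s peak

# High utilization lets a cheaper DDR design close most of the gap to an
# HBM part that only sustains a fraction of its peak bandwidth.
print(decode_tokens_per_sec(DDR_BW, 0.90, 70e9))   # ~6.4 tokens/s per replica
print(decode_tokens_per_sec(HBM_BW, 0.30, 70e9))   # ~7.2 tokens/s per replica
```

The episode concludes by noting the vast and growing landscape of AI inference hardware, with many more innovations to come.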