MAMBA from Scratch: Neural Nets Better and Faster than Transformers

The text discusses Mamba, a new neural net architecture that surpasses transformers in language modelling while using less compute, and that addresses the limitations of traditional recurrent neural networks. Despite being rejected from a prestigious conference over critiques of its evaluation metrics and memory requirements, Mamba shows promise for efficient long-range information processing and has sparked debate about the peer review process within the academic community.

Mamba is a new neural net architecture that outperforms transformers in language modelling, showing promising results at small model sizes. It also uses less compute: for an input sequence of n words, Mamba requires only O(n log n) compute, compared with the transformers' O(n^2). The architecture is presented as an extension of recurrent neural networks (RNNs) rather than through the more complex state-space-model formalism, making it more accessible and easier to understand. By combining linear recurrence with parallel computation, Mamba addresses the main limitations of RNNs, namely slow training and poor parallelization, to achieve efficient and effective long-range information processing.
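To make the recurrence concrete, here is a minimal sketch of a linear RNN layer in the generic form Mamba builds on, evaluated sequentially. The names A, B, C and the helper functions are illustrative, not Mamba's exact parameterisation:

```python
import numpy as np

def linear_rnn_step(h, x, A, B, C):
    """One step of a linear recurrence: no nonlinearity on the state update.

    h_t = A @ h_{t-1} + B @ x_t,   y_t = C @ h_t
    """
    h_new = A @ h + B @ x
    return h_new, C @ h_new

def run_linear_rnn(xs, A, B, C):
    # Naive sequential O(n) evaluation; a parallel version is sketched below.
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        h, y = linear_rnn_step(h, x, A, B, C)
        ys.append(y)
    return np.array(ys)
```

Because the state update is linear in h, consecutive steps can be composed algebraically, which is exactly what makes the parallel evaluation described next possible.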

Linear RNNs, the key building block of Mamba, are designed to be fast and to learn from long sequences. Because the recurrence is linear, the recurrent weight matrix can be diagonalized, and the whole recurrence can be evaluated with a parallel scan in O(log n) parallel time, a significant efficiency gain over step-by-step evaluation. Gradient stability is maintained through weight initialization strategies that keep the recurrent weights close to 1 at the start of training, overcoming the challenges of vanishing and exploding gradients. These advances let linear RNNs achieve state-of-the-art performance on long-range reasoning tasks, showcasing their potential for handling extensive context.
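The parallel evaluation rests on the fact that composing two linear updates h -> a*h + b yields another linear update, so an associative prefix scan applies. Below is a small NumPy sketch (scalar state per channel, illustrative names) comparing a Hillis-Steele-style scan, which needs only O(log n) sequential rounds, against the plain loop:

```python
import numpy as np

def combine(a_prev, b_prev, a_cur, b_cur):
    # Compose affine updates h -> a*h + b: previous segment first, current second.
    return a_cur * a_prev, a_cur * b_prev + b_cur

def parallel_linear_scan(a, b):
    """Compute h_t = a_t * h_{t-1} + b_t (with h_0 = 0) via a Hillis-Steele scan.

    The combine step is associative, so every position can be updated at once;
    only O(log n) sequential rounds are needed.
    """
    a = a.astype(float).copy()
    b = b.astype(float).copy()
    n, shift = len(a), 1
    while shift < n:
        # Identity element (a=1, b=0) pads positions with no left neighbour.
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = combine(a_prev, b_prev, a, b)
        shift *= 2
    return b  # with h_0 = 0, the accumulated offset b_t equals h_t

def sequential_scan(a, b):
    # Plain O(n) loop, used here only to check the parallel version.
    h, out = 0.0, np.empty_like(b, dtype=float)
    for t in range(len(a)):
        h = a[t] * h + b[t]
        out[t] = h
    return out

rng = np.random.default_rng(0)
a = rng.uniform(0.9, 1.0, size=8)   # recurrent weights initialised close to 1
x = rng.normal(size=8)
assert np.allclose(parallel_linear_scan(a, x), sequential_scan(a, x))
```

On real hardware the O(log n) rounds run as wide parallel array operations on the GPU, which is where the speed-up over step-by-step RNN evaluation comes from.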

The linear RNN architecture also forms the basis of state-space models, which are essentially linear RNNs with a specific weight initialization scheme inspired by control theory. By expanding the state (output) vectors while carefully managing memory usage, Mamba enhances its performance on language modelling tasks. Mamba itself is a dynamic linear RNN: it computes different recurrence weights for each input, letting the model decide, token by token, which information to retain and which to forget. This adaptability allows Mamba to excel at language modelling and to outperform other models, including transformers, on a range of downstream tasks.
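A toy sketch of the "different weights for each input" idea: each token produces its own retention, write, and read weights before the linear recurrence is applied. The projection vectors w_a, w_b, w_c are hypothetical and the state is a single scalar channel, so this illustrates the selection mechanism only in spirit, not Mamba's actual parameterisation:

```python
import numpy as np

def selective_linear_rnn(xs, w_a, w_b, w_c):
    """Toy input-dependent ("selective") linear recurrence with a scalar state."""
    h = 0.0
    ys = []
    for x_t in xs:
        a_t = 1.0 / (1.0 + np.exp(-(w_a @ x_t)))  # gate in (0, 1): how much of h to keep
        b_t = w_b @ x_t                            # how strongly to write the new input
        c_t = w_c @ x_t                            # how much of the state to read out
        h = a_t * h + b_t                          # recurrence weights differ per token
        ys.append(c_t * h)
    return np.array(ys)

# Usage: project 4-dimensional token embeddings onto one recurrent channel.
rng = np.random.default_rng(1)
xs = rng.normal(size=(10, 4))
w_a, w_b, w_c = rng.normal(size=(3, 4))
print(selective_linear_rnn(xs, w_a, w_b, w_c))
```

Because the recurrence weights depend on the input, the model can close the gate (a_t near 1) to carry information across many tokens, or open it (a_t near 0) to overwrite the state when a token is more relevant.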

Despite the architecture's success and significance, the Mamba paper was rejected from a prestigious machine learning conference after reviewers questioned its evaluation metrics and memory requirements. The decision prompted debate within the academic community over whether the stated reasons justified rejection and over the effectiveness of current reviewing practices. The episode has fuelled discussion about evaluating models on the tasks that actually matter and about improving the peer review process so that genuinely groundbreaking advances in machine learning are assessed fairly and recognized.