Tiny Aya - Cohere's Mini Multilingual Models

The video introduces Cohere’s Tiny Aya suite—small, general-purpose multilingual language models designed to support over 70 languages, including many low-resource ones, with specialized variants for different global regions. These compact models address the lack of multilingual support in smaller models, making advanced language technology more accessible for underrepresented languages and resource-constrained devices.

The video discusses the challenges of finding good language models for less common, low-resource languages. While most large models now handle major European languages well, many languages around the world lack sufficient online data for effective model training. This scarcity is often due to limited internet presence or inactive Wikipedia communities in those languages, making it difficult for models to learn them. Additionally, poor tokenization—where a model breaks down text into inefficient or overly granular tokens—can further hinder performance in these languages.
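One rough way to see the tokenization problem is to measure "fertility", the average number of tokens a tokenizer emits per word. The sketch below is only an illustration: `byte_tokens` is a hypothetical stand-in for a tokenizer that has fallen all the way back to raw UTF-8 bytes, not the tokenizer of any model mentioned in the video, and splitting on whitespace is an assumption that does not hold for scripts written without spaces, such as Thai or Chinese.

```python
from typing import Callable, List


def byte_tokens(text: str) -> List[bytes]:
    """Hypothetical worst-case tokenizer: one token per UTF-8 byte, spaces dropped."""
    return [bytes([b]) for word in text.split() for b in word.encode("utf-8")]


def fertility(text: str, tokenize: Callable[[str], List[bytes]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


english = "good morning"
hindi = "शुभ प्रभात"  # the same greeting in Devanagari script

# Devanagari characters take several bytes each in UTF-8, so a byte-level
# fallback spends far more tokens per word than it does on ASCII English.
print(f"English fertility: {fertility(english, byte_tokens):.1f}")
print(f"Hindi fertility:   {fertility(hindi, byte_tokens):.1f}")
```

To compare real tokenizers, the same `fertility` function can be passed any callable that maps a string to a list of tokens. Lower values mean more text fits in the same context window and each request costs fewer tokens.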

Recent advancements have improved multilingual capabilities in large models, such as those from the Gemma and Qwen families, which now use larger and more efficient tokenizers and incorporate more diverse training data. However, these improvements are mostly seen in larger, proprietary models, while smaller models often lag behind due to less training data and less focus on multilingual tasks during post-training. Specialized translation models like TranslateGemma have emerged, but they are primarily designed for direct translation rather than general-purpose multilingual use.

Cohere’s recent release of the Tiny Aya suite addresses this gap by offering small, general-purpose multilingual models. The base model, with around 3.3 billion parameters, is pre-trained on over 70 languages, including many low-resource ones. Building on this base, Cohere has released four post-trained models: Global, which covers the widest range of languages, and three regionally specialized models (Earth, Fire, and Water), each fine-tuned for specific language groups based on geographic and linguistic relationships.

The Earth model focuses on West Asian, African, and some European languages, including Arabic, Turkish, Hebrew, and several African languages. The Fire model targets South Asian languages like Hindi, Bengali, Tamil, and Nepali, accounting for common code-switching with English. The Water model covers Asia-Pacific languages such as Tagalog, Bahasa, Vietnamese, Thai, Chinese, and even very low-resource languages like Khmer and Burmese. Cohere has also released multilingual training datasets and benchmarks to facilitate further fine-tuning.
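As a small illustration of how an application might choose among the variants, the sketch below maps a language name to a variant label using only the unambiguous languages named in this summary. The labels `"earth"`, `"fire"`, `"water"`, and `"global"` are placeholders chosen for this example, not official model identifiers, and the fallback to the global variant for unlisted languages is an assumption.

```python
# Variant labels are illustrative placeholders, not official model IDs.
VARIANT_BY_LANGUAGE = {
    # Earth: West Asian and African languages named in the summary
    "arabic": "earth",
    "turkish": "earth",
    "hebrew": "earth",
    # Fire: South Asian languages
    "hindi": "fire",
    "bengali": "fire",
    "tamil": "fire",
    "nepali": "fire",
    # Water: Asia-Pacific languages
    "tagalog": "water",
    "vietnamese": "water",
    "thai": "water",
    "chinese": "water",
    "khmer": "water",
    "burmese": "water",
}


def pick_variant(language: str) -> str:
    """Return the regional variant label for a language, else 'global'."""
    return VARIANT_BY_LANGUAGE.get(language.strip().lower(), "global")


for name in ["Hindi", "Khmer", "Hebrew", "Portuguese"]:
    print(name, "->", pick_variant(name))
```

A real deployment would more likely key on language codes detected from the input text, and would need the full language lists from Cohere's own documentation.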

These models are notable for their small size, making them suitable for mobile applications and devices with limited resources. Cohere developed a custom tokenizer that, in some cases, outperforms those used in other leading models. The video encourages viewers to experiment with these models, especially if they are working with languages not well supported by mainstream models. The presenter invites feedback on model performance in specific languages, emphasizing the importance of strong multilingual models for global accessibility.
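To make the size claim concrete, here is a back-of-the-envelope estimate of the memory needed just to hold the weights of a model with roughly 3.3 billion parameters at different numeric precisions. It assumes the approximate parameter count quoted above and counts weights only; activations, the KV cache, and any per-format quantization overhead are ignored, and the summary does not say which precisions Cohere actually ships.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 3.3e9  # approximate figure from the summary

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions the weights take about 6.6 GB at 16 bits and about 1.65 GB at 4 bits, which is why models of this size are plausible candidates for phones and other constrained devices once quantized.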