Dolphin-Mixtral-1X-22B 🐬: MoE Expert Extraction for Enhanced Performance or Lobotomy?

This article covers the process of extracting the eight experts from a mixture-of-experts (MoE) model, Mixtral 8x22B, and merging them into a single dense expert to create Dolphin 2.9.1 Mixtral-1x22b. Despite concerns that collapsing the experts would cripple the model, the resulting Dolphin checkpoint showed only a 5-8% drop in performance compared to the original, highlighting how resilient the MoE architecture is to this kind of surgery and how much room it leaves for future experimentation.

The project converts a mixture-of-experts model into a dense, monolithic one: the eight experts of Mixtral 8x22B are extracted and merged into a single expert. Lucas Atkins carried out the work, with compute sponsored by Crusoe Energy, and the result is Dolphin 2.9.1 Mixtral-1x22b. The conversion uses SLERP (spherical linear interpolation) to intelligently combine the expert weights into a single set of dense weights while keeping the overall architectural flow consistent with the Mixtral family.
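For intuition, SLERP blends two weight tensors along the arc between them rather than along a straight line, which tends to preserve the overall scale of the weights better than plain averaging. The snippet below is a minimal sketch of that interpolation, not the project's actual merge script; the function name and the flatten-then-reshape treatment are assumptions for illustration.

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors.

    Treats each tensor as one flattened vector and falls back to plain
    linear interpolation when the two vectors are nearly parallel.
    """
    a = w_a.flatten().float()
    b = w_b.flatten().float()
    a_unit = a / (a.norm() + eps)
    b_unit = b / (b.norm() + eps)
    dot = torch.clamp(torch.dot(a_unit, b_unit), -1.0, 1.0)
    omega = torch.acos(dot)                # angle between the two weight vectors
    if omega.abs() < 1e-4:                 # nearly parallel: lerp is numerically safer
        merged = (1 - t) * a + t * b
    else:
        sin_omega = torch.sin(omega)
        merged = (torch.sin((1 - t) * omega) / sin_omega) * a \
               + (torch.sin(t * omega) / sin_omega) * b
    return merged.reshape(w_a.shape).to(w_a.dtype)
```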

Collapsing the eight experts into one was done with a custom script built around SLERP, which interpolates the weights smoothly enough to yield a coherent single model. Despite initial concerns that the surgery would badly degrade quality, the resulting Dolphin model showed only a 5-8% drop in performance compared to the original Mixtral 8x22B. That robustness speaks to how well the mixture-of-experts architecture tolerates this kind of manipulation, and it suggests future experiments that merge only selected experts in search of better performance.
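One plausible way to fold eight experts into a single set of MLP weights is to merge them pairwise with SLERP, halving the set each round so every expert contributes equally. This is a sketch under that assumption, reusing the hypothetical `slerp` helper above; the actual custom script may weight or select experts differently (for example, by router statistics).

```python
import torch  # slerp() from the sketch above is assumed to be in scope

def merge_experts(expert_weights: list[torch.Tensor]) -> torch.Tensor:
    """Fold a list of expert weight tensors into one via repeated pairwise SLERP.

    With eight experts (a power of two), each round merges neighbouring pairs
    at t=0.5, so every expert contributes equally to the final dense weight.
    """
    tensors = list(expert_weights)
    while len(tensors) > 1:
        merged = [slerp(tensors[i], tensors[i + 1], t=0.5)
                  for i in range(0, len(tensors) - 1, 2)]
        if len(tensors) % 2 == 1:          # carry an odd leftover into the next round
            merged.append(tensors[-1])
        tensors = merged
    return tensors[0]

# Illustrative use for one Mixtral-style MoE layer (attribute names assumed):
# dense_w1 = merge_experts([experts[i].w1.weight.data for i in range(8)])
```

Applied per projection (w1, w2, w3) in every MoE layer, this yields a dense MLP of the same shape as a single expert, which is why the merged model keeps the familiar Mixtral layer structure.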

The extraction took approximately 27 GPU hours on eight H100s, a fairly efficient run. The model is available on Hugging Face as Dolphin 2.9.1 Mixtral-1x22b; it retains the base model's 64k context length and was given a full-weight fine-tune at a 16k sequence length. SLERP was central to deciding how the extracted expert weights were merged, keeping the dense model a reasonable reflection of the original's features.
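Since the result is a standard dense causal LM, it should load like any other Hugging Face checkpoint. The snippet below is a usage sketch: the repo id is assumed from the model name in the article (verify it on the Hub), and the prompt and generation settings are illustrative only.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id based on the published model name; confirm on Hugging Face.
model_id = "cognitivecomputations/dolphin-2.9.1-mixtral-1x22b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",     # spread the ~22B parameters across available GPUs
    torch_dtype="auto",
)

prompt = "Explain what a mixture-of-experts layer does."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```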

Even with the performance drop, the experiment offers valuable insight into the flexibility of the mixture-of-experts architecture and the potential for manipulating expert combinations within a model. Converting to a single expert is not primarily about gaining speed or benchmark wins; it offers a different perspective on architecture design. It also prompts discussion of the trade-offs involved in converting between architectural choices, and of just how adaptable and resilient models like Dolphin 2.9.1 Mixtral-1x22b turn out to be.