The project implements sparse multiplication and fuses the up/down projections in the MLP layers through low-rank weight activations. The work is based on Deja Vu and Apple's LLM in a Flash.

This approach avoids loading and computing activations with feed-forward layer weights whose outputs will eventually be zeroed out.

It's a lossless approach, as these weights do not contribute to the current token's prediction anyway. It does, however, need the predictors to be accurate in clustering the weights.
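To make the mechanism concrete, here is a minimal PyTorch sketch of Deja Vu-style contextual sparsity as described above: a cheap low-rank predictor guesses which feed-forward neurons will be active for the current token, and the MLP forward pass then touches only the corresponding rows of the gate/up projections and columns of the down projection. The names (`LowRankPredictor`, `sparse_mlp_forward`), the rank, and the threshold are illustrative assumptions, not the project's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankPredictor(nn.Module):
    """Cheap low-rank head that predicts which FFN neurons will be
    non-zero for the current token (Deja Vu-style contextual sparsity)."""
    def __init__(self, d_model: int, d_ff: int, rank: int = 128):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)  # d_model -> rank
        self.up = nn.Linear(rank, d_ff, bias=False)       # rank -> d_ff

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Positive score => neuron predicted active after the nonlinearity.
        return self.up(self.down(x))

def sparse_mlp_forward(x, w_gate, w_up, w_down, predictor, threshold=0.0):
    """LLaMA-style gated MLP, down(silu(gate(x)) * up(x)), computed only
    over the neurons the predictor marks as active for this token.

    x:      (d_model,)       single-token hidden state
    w_gate: (d_ff, d_model)  gate projection weights
    w_up:   (d_ff, d_model)  up projection weights
    w_down: (d_model, d_ff)  down projection weights
    """
    scores = predictor(x)                                    # (d_ff,)
    active = (scores > threshold).nonzero(as_tuple=True)[0]  # predicted-active neuron indices

    # Gather only the rows/columns that matter for this token; the skipped
    # neurons would have been (near-)zeroed by the gate anyway.
    g = F.silu(w_gate[active] @ x)      # (n_active,)
    u = w_up[active] @ x                # (n_active,)
    return w_down[:, active] @ (g * u)  # (d_model,)
```

In a full implementation the same active-index set would also decide which weight rows are fetched from slow storage or kept resident in fast memory (the LLM in a Flash angle), which is where the memory savings come from; the predictor's clustering accuracy determines how often a truly active neuron is mistakenly skipped.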

The result? We are seeing 5x faster MLP-layer performance in transformers with 50% lower memory consumption by avoiding the sleeping (inactive) neurons in every token prediction. For Llama 3.2, the feed-forward layers accounted for 30% of the total weights and forward-pass computation, resulting in a 1.6-1.8x increase in throughput:

Sparse LLaMA 3.2 3B vs. LLaMA 3.2 3B (on the Hugging Face implementation; a rough measurement sketch follows the list):

- Time to First Token (TTFT):  1.51× faster (1.209s → 0.803s)
- Output Generation Speed:     1.79× faster (0.7 → 1.2 tokens/sec)  
- Total Throughput:            1.78× faster (0.7 → 1.3 tokens/sec)
- Memory Usage:                26.4% reduction (6.125GB → 4.15GB)
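For reference, here is a rough sketch of how numbers like these could be measured with the Hugging Face transformers API. The model ID, prompt, and generation settings are assumptions, and the same loop would be run against both the dense and sparse checkpoints.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"  # hypothetical: swap in the sparse variant to compare
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")
torch.cuda.reset_peak_memory_stats()

# Time to First Token: prefill plus one new token.
start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
ttft = time.perf_counter() - start

# Generation speed: new tokens per second over a longer run.
start = time.perf_counter()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start
tok_per_s = (out.shape[1] - inputs["input_ids"].shape[1]) / elapsed

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"TTFT {ttft:.3f}s | {tok_per_s:.2f} tok/s | peak mem {peak_gb:.2f} GB")
```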