The project implements sparse multiplication and fuses the up/down projections in the MLP layers through low-rank weight activations. The work is based on Deja Vu and Apple's LLM in a Flash.

This approach avoids loading and computing activations with feed-forward layer weights whose outputs will eventually be zeroed out.

It's a lossless approach, as these weights do not contribute to the current token's prediction anyway. It does, however, need the predictors to be accurate in clustering the weights.
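To make the mechanism concrete, here is a minimal PyTorch sketch of Deja Vu-style contextual sparsity as described above: a cheap low-rank predictor guesses which feed-forward neurons will be active for the current token, and the MLP forward pass then touches only the corresponding rows of the gate/up projections and columns of the down projection. The names (`LowRankPredictor`, `sparse_mlp_forward`), the rank, and the threshold are illustrative assumptions, not the project's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankPredictor(nn.Module):
    """Cheap low-rank head that predicts which FFN neurons will be
    non-zero for the current token (Deja Vu-style contextual sparsity)."""
    def __init__(self, d_model: int, d_ff: int, rank: int = 128):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)  # d_model -> rank
        self.up = nn.Linear(rank, d_ff, bias=False)       # rank -> d_ff

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Positive score => neuron predicted active after the nonlinearity.
        return self.up(self.down(x))

def sparse_mlp_forward(x, w_gate, w_up, w_down, predictor, threshold=0.0):
    """LLaMA-style gated MLP, down(silu(gate(x)) * up(x)), computed only
    over the neurons the predictor marks as active for this token.

    x:      (d_model,)       single-token hidden state
    w_gate: (d_ff, d_model)  gate projection weights
    w_up:   (d_ff, d_model)  up projection weights
    w_down: (d_model, d_ff)  down projection weights
    """
    scores = predictor(x)                                    # (d_ff,)
    active = (scores > threshold).nonzero(as_tuple=True)[0]  # predicted-active neuron indices

    # Gather only the rows/columns that matter for this token; the skipped
    # neurons would have been (near-)zeroed by the gate anyway.
    g = F.silu(w_gate[active] @ x)      # (n_active,)
    u = w_up[active] @ x                # (n_active,)
    return w_down[:, active] @ (g * u)  # (d_model,)
```

In a full implementation the same active-index set would also decide which weight rows are fetched from slow storage or kept resident in fast memory (the LLM in a Flash angle), which is where the memory savings come from; the predictor's clustering accuracy determines how often a truly active neuron is mistakenly skipped.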

The result? We are seeing 5x faster MLP-layer performance in transformers with 50% lower memory consumption by avoiding the sleeping (inactive) neurons in every token prediction. For Llama 3.2, the feed-forward layers accounted for 30% of the total weights and forward-pass computation, resulting in a 1.6-1.8x increase in throughput:

Sparse LLaMA 3.2 3B vs. LLaMA 3.2 3B (on the Hugging Face implementation; a rough measurement sketch follows the list):

- Time to First Token (TTFT):  1.51× faster (1.209s → 0.803s)
- Output Generation Speed:     1.79× faster (0.7 → 1.2 tokens/sec)  
- Total Throughput:            1.78× faster (0.7 → 1.3 tokens/sec)
- Memory Usage:                26.4% reduction (6.125GB → 4.15GB)
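For reference, here is a rough sketch of how numbers like these could be measured with the Hugging Face transformers API. The model ID, prompt, and generation settings are assumptions, and the same loop would be run against both the dense and sparse checkpoints.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"  # hypothetical: swap in the sparse variant to compare
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")
torch.cuda.reset_peak_memory_stats()

# Time to First Token: prefill plus one new token.
start = time.perf_counter()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
ttft = time.perf_counter() - start

# Generation speed: new tokens per second over a longer run.
start = time.perf_counter()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start
tok_per_s = (out.shape[1] - inputs["input_ids"].shape[1]) / elapsed

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"TTFT {ttft:.3f}s | {tok_per_s:.2f} tok/s | peak mem {peak_gb:.2f} GB")
```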