Llama 3 Performance and Deployment

Evaluations, Quantization, Fine-tunings

Llama 3 70b Instruct is a very capable openly available model, that is changing the LLM landscape. Function calling support is missing and it would be an advancement. Llama 3 also is an open multilingual LLM along with others.

Meta updated the Llama to version Llama 3.1. Most of the below is relevant but slightly out-dated.

Llama 3 Evaluation Performance

Llama-3 vs GPT-4-Turbo vs GPT-4o Llama 70b Instruct is definitely not as good in complex instruction following as GPT-4-Turbo based on my manual tests as of 2024-05-05. I speculate that cheaper and faster GPT-4o is weaker than GPT-4-Turbo on hard tasks, while being stronger on easier tasks based on these posts: 1 and 2. Claude 3.5 Sonnet may very well be stronger than both GPT-4-Turbo and GPT-4o.

Llama-3 is definitely much weaker non-english (multilingual) generation setting based on simple experiments than both GPT-4-Turbo and GPT-4o.

Llama-3-70b-Instruct may have weaker fine-tuning now, which can still be improved in near future. Llama 8b seems a bit stronger than Mistral 7b various fine-tunes for its size.

There is also LMSYS leaderboard and evals from the original Facebook post below:

  • Llama 3 70b gained impressive 82.0 on MMLU 5-shot.
  • Llama 3 70b gained impressive 81.9 on HumanEval O-shot.
Benchmark/Model Meta Llama 3 8B Gemma 7B - It Measured Mistral 7B Instruct Measured
MMLU 5-shot 68.4 53.3 58.4
GPQA O-shot 34.2 21.4 26.3
HumanEval O-shot 62.2 30.5 36.6
GSM-8K 8-shot, CoT 79.6 30.6 39.9
MATH 4-shot, CoT 30.0 12.2 11.0
Benchmark/Model Meta Llama 3 70B Gemini Pro 1.5 Published Claude 3 Sonnet Published
MMLU 5-shot 82.0 81.9 79.0
GPQA O-shot 39.5 41.5 CoT 38.5 CoT
HumanEval O-shot 81.7 71.9 73.0
GSM-8K 8-shot, CoT 93.0 91.7 11-shot 92.3 O-shot
MATH 4-shot, CoT 50.4 58.5 Minerva prompt 40.5

Llama 3 Deployment Requirements

You can use OpenRouter proxy, or directly buy from a good provider like Fireworks, which likely will deliver native function calling ability soon. There is already one Llama 3 8B fine-tune for function-calling called Llama 3 8B Hermes 2 Pro.

People report that when AWQ-Quantized to 4bit of around 40GB ram the inferring speed is around 30 tokens/second on 2x4090s and probably more on A100 80GB.

At 2bit quantization of AQLM it requires around 21GB VRAM at a cost of quite high performance degradation so far.

Note that there is GPTQ quantization, but it should be slower and worse than AWQ when both are the same 4-bits.

Quantization Comparisons

Full model size seems to be hard to justify, except in cases where you have no options. But the quantization comes with their own trade-offs:

Created on 06 May 2024. Updated on: 24 Aug 2024.
Thank you










About Vaclav Kosar How many days left in this quarter? Twitter Bullet Points to Copy & Paste Averaging Stopwatch Privacy Policy
Copyright © Vaclav Kosar. All rights reserved. Not investment, financial, medical, or any other advice. No guarantee of information accuracy.