Llama 3 70b Instruct is a very capable openly available model that is changing the LLM landscape. Native function calling support is missing, and adding it would be a real advancement. Llama 3 is also one of the openly available multilingual LLMs.
Meta has since updated Llama to version 3.1. Most of what follows remains relevant but is slightly outdated.
Llama 3 Evaluation Performance
Llama-3 vs GPT-4-Turbo vs GPT-4o
Llama 3 70b Instruct is definitely not as good at complex instruction following as GPT-4-Turbo, based on my manual tests as of 2024-05-05. I speculate that the cheaper and faster GPT-4o is weaker than GPT-4-Turbo on hard tasks while being stronger on easier tasks, based on these posts: 1 and 2. Claude 3.5 Sonnet may very well be stronger than both GPT-4-Turbo and GPT-4o.
Based on simple experiments, Llama-3 is definitely much weaker than both GPT-4-Turbo and GPT-4o at non-English (multilingual) generation.
Llama-3-70b-Instruct may have weaker instruction fine-tuning for now, which can still improve in the near future. For its size, Llama 3 8b seems a bit stronger than the various Mistral 7b fine-tunes.
There is also the LMSYS leaderboard, and the evals from the original Facebook post are below:
- Llama 3 70b gained an impressive 82.0 on MMLU 5-shot.
- Llama 3 70b gained an impressive 81.7 on HumanEval 0-shot.
Benchmark/Model | Meta Llama 3 8B | Gemma 7B-It (Measured) | Mistral 7B Instruct (Measured) |
---|---|---|---|
MMLU 5-shot | 68.4 | 53.3 | 58.4 |
GPQA 0-shot | 34.2 | 21.4 | 26.3 |
HumanEval 0-shot | 62.2 | 30.5 | 36.6 |
GSM-8K 8-shot, CoT | 79.6 | 30.6 | 39.9 |
MATH 4-shot, CoT | 30.0 | 12.2 | 11.0 |
Benchmark/Model | Meta Llama 3 70B | Gemini Pro 1.5 (Published) | Claude 3 Sonnet (Published) |
---|---|---|---|
MMLU 5-shot | 82.0 | 81.9 | 79.0 |
GPQA 0-shot | 39.5 | 41.5 (CoT) | 38.5 (CoT) |
HumanEval 0-shot | 81.7 | 71.9 | 73.0 |
GSM-8K 8-shot, CoT | 93.0 | 91.7 (11-shot) | 92.3 (0-shot) |
MATH 4-shot, CoT | 50.4 | 58.5 (Minerva prompt) | 40.5 |
Llama 3 Deployment Requirements
You can use the OpenRouter proxy, or buy directly from a good provider like Fireworks, which will likely deliver native function calling soon. There is already one Llama 3 8B fine-tune for function calling, called Llama 3 8B Hermes 2 Pro. A minimal example of calling the model through such a provider is sketched below.
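Here is a minimal sketch of querying Llama 3 70b Instruct through OpenRouter's OpenAI-compatible endpoint, assuming the base URL and model id from OpenRouter's public docs and a placeholder API key; the same pattern applies to other providers such as Fireworks with their own base URL and model name.

```python
# Minimal sketch: call Llama 3 70b Instruct via OpenRouter's
# OpenAI-compatible API. Base URL and model id are assumptions
# taken from OpenRouter's docs; replace the API key with your own.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # OpenAI-compatible proxy endpoint
    api_key="YOUR_OPENROUTER_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/llama-3-70b-instruct",   # assumed model id on OpenRouter
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the Llama 3 release in two sentences."},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```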
People report that, when AWQ-quantized to 4-bit (around 40GB of VRAM), inference speed is around 30 tokens/second on 2x 4090s, and probably more on an A100 80GB.
At 2-bit AQLM quantization it requires around 21GB of VRAM, at the cost of quite high performance degradation so far.
Note that GPTQ quantization also exists, but it should be slower and worse than AWQ at the same 4-bit width.
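For local deployment, here is a minimal sketch of serving an AWQ 4-bit Llama 3 70b Instruct checkpoint with vLLM across two GPUs, roughly matching the 2x 4090 setup mentioned above; the checkpoint name is an assumption (any community AWQ export of the model should work), and vLLM is only one of several serving options.

```python
# Minimal sketch: serve an AWQ 4-bit Llama 3 70b Instruct checkpoint with vLLM,
# splitting the ~40GB of quantized weights across two GPUs.
# The checkpoint name below is an assumption, not an official release.
from vllm import LLM, SamplingParams

llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",  # assumed community AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,          # shard across 2 GPUs (e.g. 2x 4090)
    gpu_memory_utilization=0.92,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain AWQ quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```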
Quantization Comparisons
Running the full-size model seems hard to justify, except in cases where you have no other option. But quantization comes with its own trade-offs: