Llama 3 Performance and Deployment

Llama 70b Instruct is a very capable openly available model, that is changing the LLM landscape. Function calling support is missing and it would be an advancement.

Llama 3 Evaluation Performance

Llama 70b Instruct is definitely not as good in complex instruction following as GPT-4-Turbo based on my manual tests as of 2024-05-05. This may be due to weaker fine-tuning, which can still be improved. Llama 8b seems a bit stronger than Mistral 7b various fine-tunes.

There is also LMSYS leaderboard and evals from the original Facebook post below:

Llama 3 70b gained impressive 82.0 on MMLU 5-shot.
Llama 3 70b gained impressive 81.9 on HumanEval O-shot.

Benchmark/Model	Meta Llama 3 8B	Gemma 7B - It Measured	Mistral 7B Instruct Measured
MMLU 5-shot	68.4	53.3	58.4
GPQA O-shot	34.2	21.4	26.3
HumanEval O-shot	62.2	30.5	36.6
GSM-8K 8-shot, CoT	79.6	30.6	39.9
MATH 4-shot, CoT	30.0	12.2	11.0

Benchmark/Model	Meta Llama 3 70B	Gemini Pro 1.5 Published	Claude 3 Sonnet Published
MMLU 5-shot	82.0	81.9	79.0
GPQA O-shot	39.5	41.5 CoT	38.5 CoT
HumanEval O-shot	81.7	71.9	73.0
GSM-8K 8-shot, CoT	93.0	91.7 11-shot	92.3 O-shot
MATH 4-shot, CoT	50.4	58.5 Minerva prompt	40.5

Llama 3 Deployment Requirements

You can use OpenRouter proxy, or directly buy from a good provider like Fireworks, which likely will deliver native function calling ability soon. There is already one Llama 3 8B fine-tune for function-calling called Llama 3 8B Hermes 2 Pro.

People report that when AWQ-Quantized to 4bit of around 40GB ram the inferring speed is around 30 tokens/second on 2x4090s and probably more on A100 80GB.

At 2bit quantization of AQLM it requires around 21GB VRAM at a cost of quite high performance degradation so far.

Note that there is GPTQ quantization, but it should be slower and worse than AWQ when both are the same 4-bits.

Vaclav Kosar

Llama 3 Performance and Deployment

Llama 3 Evaluation Performance

Llama 3 Deployment Requirements

Vaclav Kosar

You'll love also...

Vaclav Kosar

Vaclav Kosar