Llama 70b Instruct is a very capable openly available model, that is changing the LLM landscape. Function calling support is missing and it would be an advancement.
Llama 3 Evaluation Performance
Llama 70b Instruct is definitely not as good in complex instruction following as GPT-4-Turbo based on my manual tests as of 2024-05-05. This may be due to weaker fine-tuning, which can still be improved. Llama 8b seems a bit stronger than Mistral 7b various fine-tunes.
There is also LMSYS leaderboard and evals from the original Facebook post below:
- Llama 3 70b gained impressive 82.0 on MMLU 5-shot.
- Llama 3 70b gained impressive 81.9 on HumanEval O-shot.
Benchmark/Model | Meta Llama 3 8B | Gemma 7B - It Measured | Mistral 7B Instruct Measured |
---|---|---|---|
MMLU 5-shot | 68.4 | 53.3 | 58.4 |
GPQA O-shot | 34.2 | 21.4 | 26.3 |
HumanEval O-shot | 62.2 | 30.5 | 36.6 |
GSM-8K 8-shot, CoT | 79.6 | 30.6 | 39.9 |
MATH 4-shot, CoT | 30.0 | 12.2 | 11.0 |
Benchmark/Model | Meta Llama 3 70B | Gemini Pro 1.5 Published | Claude 3 Sonnet Published |
---|---|---|---|
MMLU 5-shot | 82.0 | 81.9 | 79.0 |
GPQA O-shot | 39.5 | 41.5 CoT | 38.5 CoT |
HumanEval O-shot | 81.7 | 71.9 | 73.0 |
GSM-8K 8-shot, CoT | 93.0 | 91.7 11-shot | 92.3 O-shot |
MATH 4-shot, CoT | 50.4 | 58.5 Minerva prompt | 40.5 |
Llama 3 Deployment Requirements
You can use OpenRouter proxy, or directly buy from a good provider like Fireworks, which likely will deliver native function calling ability soon. There is already one Llama 3 8B fine-tune for function-calling called Llama 3 8B Hermes 2 Pro.
People report that when AWQ-Quantized to 4bit of around 40GB ram the inferring speed is around 30 tokens/second on 2x4090s and probably more on A100 80GB.
At 2bit quantization of AQLM it requires around 21GB VRAM at a cost of quite high performance degradation so far.
Note that there is GPTQ quantization, but it should be slower and worse than AWQ when both are the same 4-bits.