For a given input, you want the model to correctly generate the output. There is a natural path from the simplest and crudest techniques to full fine-tuning of the model. This post guides you through these techniques in a simple way.
Large language models (LLMs) like ChatGPT (GPT-3), Claude, and Bard are trained to predict text continuations, with extra tuning to follow conversations and instructions (RLHF). We steer the model with a small amount of additional textual context, so that it learns in context without a large amount of training data. This additional context is called a prompt. Systematic development of prompts using metric evaluation is called prompt engineering.
Trade-offs in Prompting
- Longer prompts are more expensive in terms of latency and compute: the more examples you provide, the longer the prompt. Training a task-specific model or selecting examples intelligently are possible mitigations.
- If the model changes, the prompt may stop being optimal, so there is little point in over-optimizing it. For example, ChatGPT and GPT-4 are frequently updated by OpenAI, and these models are meant to be general, not specific to your problem.
- Control guardrails vs. creative hallucinations: certain prompts are more prone to hallucinations than others.
- Prompts are a crude tool without nuance and can be "over-prompted" (prompt injection) by a user's own instructions, whereas fine-tuning requires more initial investment and data, and is more complicated.
Task Instruction
Also called Zero-Shot Prompting.
Describe the task (a minimal prompt sketch follows this list):
- intent (e.g., detect product review sentiment)
- audience (e.g., a 5-year-old)
- persona (e.g., an expert marketer)
- specific and precise terms, e.g., avoiding the generic word "not".
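A minimal sketch of a zero-shot instruction prompt in Python; `llm_complete` is a hypothetical stand-in for whatever completion API you use:

```python
# A zero-shot instruction prompt: persona, intent, and precise output terms,
# with no input-output examples. `llm_complete` is a hypothetical helper.
def build_sentiment_prompt(review: str) -> str:
    return (
        "You are an expert marketer.\n"                         # persona
        "Classify the sentiment of the product review below "   # intent
        "as exactly one of: positive, negative, neutral.\n"     # precise terms
        f"Review: {review}\n"
        "Sentiment:"
    )

prompt = build_sentiment_prompt("The battery died after two days.")
# answer = llm_complete(prompt)  # expected: "negative"
```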
Input-Output Examples
Also called One-Shot or Few-Shot Prompting.
Provide examples, keeping in mind (a selection sketch follows this list):
- Order matters: changing the order of examples can change results, and the most recent examples are more likely to be reproduced.
- Order examples representatively, or at least randomly. For multiple-choice outputs you may want to debias the model so it does not simply repeat the most recent answer.
- Use examples similar or relevant to the input: for a given input, use k-nearest-neighbor (KNN) search over embeddings to find semantically similar examples to put into the prompt.
- Use examples that are diverse from each other: if you have a static prompt, select diverse examples with clustering instead.
- Prefer difficult examples: select the questions that are hardest for the model to answer.
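A sketch of KNN-based example selection, assuming a hypothetical `embed` function that maps text to a unit-norm vector:

```python
import numpy as np

# Select the k examples most semantically similar to the input via
# nearest-neighbor search over embeddings. `embed` is a hypothetical
# function mapping text to a unit-norm vector.
def select_examples(query: str, examples: list[tuple[str, str]],
                    embed, k: int = 4) -> list[tuple[str, str]]:
    query_vec = embed(query)
    # Cosine similarity reduces to a dot product for unit-norm vectors.
    scores = [np.dot(query_vec, embed(x)) for x, _ in examples]
    top = np.argsort(scores)[-k:][::-1]  # indices of the k best matches
    return [examples[i] for i in top]

def build_few_shot_prompt(query: str, selected: list[tuple[str, str]]) -> str:
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in selected)
    return f"{shots}\nInput: {query}\nOutput:"
```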
Reasoning in Steps
Also called Chain-of-Thought (CoT) Prompting.
A step-by-step instruction creates momentum: the model generates text that guides it toward the correct answer. The reasoning steps also increase interpretability. Append the instruction "Let's think step by step." or provide reasoning examples. Typical applications are multi-step arithmetic and commonsense logical reasoning. A model's ability to use CoT increases with model size (see PaLM and its ability to explain jokes).
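A minimal sketch of the zero-shot CoT trigger:

```python
# Zero-shot chain-of-thought: append the step-by-step trigger so the model
# writes out intermediate reasoning before the final answer.
def cot_prompt(question: str) -> str:
    return f"Q: {question}\nA: Let's think step by step."

prompt = cot_prompt(
    "A juggler has 16 balls. Half are golf balls, and half of the golf "
    "balls are blue. How many blue golf balls are there?"
)
# The model should reason: 16 / 2 = 8 golf balls; 8 / 2 = 4 blue golf balls.
```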
Majority Vote Reasoning Steps
Also called Self-consistency with Chain-of-Thought (CoT-SC).
Generate multiple reasoning paths (chains of thought), then return the most common answer.
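A sketch of self-consistency, assuming hypothetical `llm_complete` and `extract_answer` helpers:

```python
from collections import Counter

# Self-consistency: sample several reasoning paths at a non-zero temperature,
# extract each final answer, and return the majority vote.
# `llm_complete` and `extract_answer` are hypothetical helpers.
def self_consistency(prompt: str, llm_complete, extract_answer,
                     n_samples: int = 10) -> str:
    answers = []
    for _ in range(n_samples):
        reasoning = llm_complete(prompt, temperature=0.7)  # diverse paths
        answers.append(extract_answer(reasoning))
    return Counter(answers).most_common(1)[0][0]
```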
Self-evaluated Reasoning Search
Also called Tree of Thoughts Problem Solving (ToT).
Generate explicitly decomposed thoughts, evaluate the progress of each unfinished thought chain, and efficiently explore with a search algorithm. This has analogies to AlphaZero playing chess.
A criticism is that the evidence is weak, with only 3 toy tasks; that evaluation operations require additional model generations; and that the technique requires additional problem-specific human input.
Thought Decomposition in ToT
Design a problem-specific, meaningful thought size and separation, for example a paragraph or an equation.
Thought and Value Generation in ToT
Design a problem-specific thought-generation prompt. Either sample thoughts independently (for larger thoughts such as paragraphs) or propose them sequentially in one generation (for smaller thoughts such as equations).
Evaluation-prompt in ToT
Design a problem-specific prompt for reflecting on the state of the thoughts. Either:
- Value a state: generate a value for a specific step or chain independently.
- Vote across states: given all candidate states, the model compares them and selects the most promising.
Search Algorithm in ToT
Explore the most promising paths until a solution, a bad state, or the depth limit is reached (a BFS sketch follows this list):
- Breadth-first search (BFS): Keep only the most promising states, generate the next level for all of them, prune, and iterate.
- Depth-first search (DFS): Go deeper until a solution, a bad state, or the depth limit is reached; then backtrack, excluding already visited states.
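A minimal sketch of ToT with value-based BFS; `propose_thoughts`, `evaluate`, and `is_solution` are hypothetical problem-specific helpers backed by model calls:

```python
# Tree-of-Thoughts breadth-first search, sketched with hypothetical helpers:
# `propose_thoughts(state)` extends a partial solution with candidate thoughts,
# `evaluate(state)` scores a state (e.g., in [0, 1]) via an evaluation prompt,
# and `is_solution(state)` checks for success.
def tot_bfs(root: str, propose_thoughts, evaluate, is_solution,
            beam_width: int = 5, max_depth: int = 3):
    frontier = [root]
    for _ in range(max_depth):
        # Expand every kept state by one thought.
        candidates = [s for state in frontier for s in propose_thoughts(state)]
        for state in candidates:
            if is_solution(state):
                return state
        # Prune: keep only the most promising states.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    return None  # no solution within the depth limit
```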
Examples in ToT
Game of 24
- Game of 24 is a mathematical reasoning challenge where the goal is to use 4 numbers and basic arithmetic operations (+ - * /) to obtain 24. For example, given the input "4 9 10 13", a solution output could be "(10 - 4) * (13 - 9) = 24". We decompose the problem into intermediate steps, each combining two of the remaining numbers (a sketch follows below).
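For illustration, a sketch that enumerates the possible thoughts directly (in Tree of Thoughts, the LLM proposes them):

```python
from itertools import combinations

# One possible thought step for Game of 24: pick two remaining numbers and
# combine them with one operation, leaving a shorter list. Three such steps
# reduce four numbers to one, which should equal 24.
def propose_24_thoughts(numbers: list[float]) -> list[tuple[str, list[float]]]:
    thoughts = []
    for (i, a), (j, b) in combinations(enumerate(numbers), 2):
        rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
        ops = [(a + b, f"{a} + {b}"), (a * b, f"{a} * {b}"),
               (a - b, f"{a} - {b}"), (b - a, f"{b} - {a}")]
        if b != 0:
            ops.append((a / b, f"{a} / {b}"))
        if a != 0:
            ops.append((b / a, f"{b} / {a}"))
        for value, expr in ops:
            thoughts.append((expr, rest + [value]))
    return thoughts

# Example: from [4, 9, 10, 13], one thought is ("13 - 9", [4, 10, 4.0]).
```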
Creative Writing
Graph of Thoughts
Graph of Thoughts is Tree of Thoughts with a human-specified ability to combine thoughts, on top of scoring and ranking them. The core idea in both methods is reusing already generated thoughts, but Graph of Thoughts adds the ability to aggregate them.
Generating Optimal Prompts
Models can be used to generate their own optimal prompts. For example, Large Language Models as Optimizers.
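A sketch of such an optimization loop in the spirit of that paper, assuming hypothetical `llm_complete` and `score_on_dev_set` helpers:

```python
# Prompt optimization loop: show the model previously scored instructions and
# ask it to write a better one. `llm_complete` and `score_on_dev_set` are
# hypothetical helpers.
def optimize_prompt(seed_instruction: str, llm_complete, score_on_dev_set,
                    n_rounds: int = 20) -> str:
    scored = [(score_on_dev_set(seed_instruction), seed_instruction)]
    for _ in range(n_rounds):
        history = "\n".join(f"score {s:.2f}: {p}" for s, p in sorted(scored))
        meta_prompt = (
            "Below are instructions with their accuracies on a task, "
            "lowest first:\n"
            f"{history}\n"
            "Write a new instruction that achieves a higher accuracy."
        )
        candidate = llm_complete(meta_prompt, temperature=1.0)
        scored.append((score_on_dev_set(candidate), candidate))
    return max(scored)[1]  # the best-scoring instruction found
```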
Tool Use
Models can use external tools by generating API calls when it is advantageous, for example when a question requires a calculation. With a small training set, the Toolformer method teaches the model to emit a call to a calculator function. When such a call appears, the model briefly stops predicting tokens, the tool is executed, and its result is inserted into the text instead of the model predicting the output itself. This improves performance on dedicated tasks; for instance, you can use retrieval for question answering.
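A sketch of the execute-and-splice step, with a hypothetical `[Calculator(...)]` call format:

```python
import re

# Toolformer-style tool use, sketched: when the model emits a tool call such
# as [Calculator(4 * 7 / 3)], generation pauses, the tool runs, and its result
# is spliced back into the text before generation continues.
CALL_PATTERN = re.compile(r"\[Calculator\(([^)]+)\)\]")

def run_calculator(expression: str) -> str:
    # A real system would use a safe arithmetic parser, not eval().
    allowed = set("0123456789+-*/(). ")
    if not set(expression) <= allowed:
        raise ValueError(f"unsupported expression: {expression}")
    return str(round(eval(expression), 2))

def execute_tool_calls(text: str) -> str:
    return CALL_PATTERN.sub(
        lambda m: m.group(1) + " = " + run_calculator(m.group(1)), text)

print(execute_tool_calls("The total is [Calculator(4 * 7 / 3)] euros."))
# -> "The total is 4 * 7 / 3 = 9.33 euros."
```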
- TODO
Fine-Tuning Training
Nuanced behavior and stronger protection against prompt injection can only be trained via fine-tuning. When we have enough data and compute, we can fine-tune the model weights to increase performance.
Parameter Efficient Methods
Cheaper to train and to switch between, and they help prevent catastrophic forgetting.
Additive:
- Soft prompts: train a small section of continuous input-sequence embeddings.
- Adapters: small trainable layers inserted into the frozen model.
Re-parametrization:
- LoRA: Low-Rank Adaptation (a minimal sketch follows this list).
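A minimal LoRA sketch in PyTorch; the wrapper class and initialization constants are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

# Minimal LoRA sketch: freeze the pretrained linear layer and train only a
# low-rank update B @ A with rank r << min(d_in, d_out), so only
# r * (d_in + d_out) parameters are trained per adapted matrix.
class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)  # keep pretrained weights frozen
        d_out, d_in = linear.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank update.
        return self.linear(x) + self.scale * (x @ self.A.T @ self.B.T)
```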
Other Resources
Get more information from leading model providers: