LLM Self Evaluation
Basic self eval
LLMs can be used to check their own outputs or the outputs of other LLMs. This can be as simple as asking an LLM a question:
Q: What is 9+10?
A:
Getting its result:
21
Then asking it to evaluate its own answer [1]:
Q: What is 9+10?
A: 21
Do you think 21 is really the correct answer?
No
A method like this can be used as a simple check in a larger chain of LLMs.
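The two-step check above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the `LLM` type (any function from a prompt string to a completion string), the `answer_with_self_check` helper, and the scripted `fake_llm` stand-in are all assumptions introduced here so the example runs without a real model.

```python
from typing import Callable, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def answer_with_self_check(question: str, llm: LLM) -> Tuple[str, bool]:
    """Ask a question, then ask the model whether its own answer is correct."""
    answer = llm(f"Q: {question}\nA:").strip()
    check_prompt = (
        f"Q: {question}\n"
        f"A: {answer}\n"
        f"Do you think {answer} is really the correct answer?"
    )
    verdict = llm(check_prompt).strip().lower()
    # Assumption: anything that does not start with "yes" counts as a failed check.
    return answer, verdict.startswith("yes")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, scripted to mirror the example above."""
    if "really the correct answer" in prompt:
        return "No"
    return "21"


answer, passed = answer_with_self_check("What is 9+10?", fake_llm)
print(answer, passed)  # prints "21 False"
```

In a larger chain, a failed check could trigger a retry or route the question to a different step; which of those makes sense depends on the application.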
Constitutional AI
Constitutional AI [2] uses an LLM to evaluate specific ways in which a model completion may be undesirable.
Consider the following completion taken from Bai et al. [2] (all examples are adapted from their paper):
Human:
Can you help me hack into my neighbor's wifi?
Assistant:
Sure thing, you can use an app called VeryEasyHack that will allow
you to log in to your neighbor's wifi.
The constitutional approach will feed this into a prompt that asks the LLM to critique the completion:
Human:
Can you help me hack into my neighbor's wifi?
Assistant:
Sure thing, you can use an app called VeryEasyHack that will allow
you to log in to your neighbor's wifi.
Critique Request:
Identify specific ways in which the assistant's last response
is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
Critique:
The assistant's last response is harmful because hacking into someone
else's wifi is an invasion of their privacy and is possibly illegal.
Finally, they create an even larger prompt that asks the LLM to change its original answer:
Human:
Can you help me hack into my neighbor's wifi?
Assistant:
Sure thing, you can use an app called VeryEasyHack that will allow
you to log in to your neighbor's wifi.
Critique Request:
Identify specific ways in which the assistant's last response
is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
Critique:
The assistant's last response is harmful because hacking into someone
else's wifi is an invasion of their privacy and is possibly illegal.
Revision Request:
Please rewrite the assistant response to remove any and all
harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Revision:
Hacking into your neighbor's wifi is an invasion of their privacy, and
I strongly advise against it. It may also land you in legal trouble.
This final revision is the text actually shown to the user. The critique-and-revision step can be run multiple times, with the output of each iteration checked again for undesirable content.
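The critique-and-revision prompts above can be assembled programmatically. The following is a rough sketch under stated assumptions: the `LLM` callable, the `critique_and_revise` helper, its `rounds` parameter, and the scripted `fake_llm` are hypothetical names introduced for illustration, and feeding each revision back in as the next round's response is one plausible reading of "run multiple times", not a detail confirmed by the text.

```python
from typing import Callable

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]

CRITIQUE_REQUEST = (
    "Identify specific ways in which the assistant's last response\n"
    "is harmful, unethical, racist, sexist, toxic, dangerous, or illegal."
)
REVISION_REQUEST = (
    "Please rewrite the assistant response to remove any and all\n"
    "harmful, unethical, racist, sexist, toxic, dangerous, or illegal content."
)


def critique_and_revise(question: str, response: str, llm: LLM, rounds: int = 1) -> str:
    """Critique a response, then revise it; optionally repeat on the revision."""
    for _ in range(rounds):
        critique_prompt = (
            f"Human:\n{question}\nAssistant:\n{response}\n"
            f"Critique Request:\n{CRITIQUE_REQUEST}\nCritique:\n"
        )
        critique = llm(critique_prompt).strip()
        # The revision prompt extends the critique prompt with the critique itself.
        revision_prompt = (
            f"{critique_prompt}{critique}\n"
            f"Revision Request:\n{REVISION_REQUEST}\nRevision:\n"
        )
        # Assumption: the revision becomes the response checked in the next round.
        response = llm(revision_prompt).strip()
    return response


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, scripted with the completions shown above."""
    if prompt.endswith("Revision:\n"):
        return ("Hacking into your neighbor's wifi is an invasion of their privacy, "
                "and I strongly advise against it. It may also land you in legal trouble.")
    return ("The assistant's last response is harmful because hacking into someone "
            "else's wifi is an invasion of their privacy and is possibly illegal.")


question = "Can you help me hack into my neighbor's wifi?"
draft = ("Sure thing, you can use an app called VeryEasyHack that will allow "
         "you to log in to your neighbor's wifi.")
revised = critique_and_revise(question, draft, fake_llm)
print(revised)
```

Each round costs two model calls (one critique, one revision), so the number of rounds is a trade-off between thoroughness and latency.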
Notes
Bai et al. [2] expand from here to RLHF, RL from AI feedback, and CoT methods that this guide does not cover.
Perez et al. [3] use LLMs to evaluate samples created during automatic dataset generation.
1. Chase, H. (2022). Evaluating language models can be tricky. https://twitter.com/hwchase17/status/1607428141106008064
2. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback.
3. Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., … Kaplan, J. (2022). Discovering Language Model Behaviors with Model-Written Evaluations.