Contrastive Decoding Improves Reasoning in Large Language Models #3278
I was looking at this a few days ago, but it seems pretty complicated. Unlike the other samplers that you can just give the last tokens + current logits to, it seems like contrastive decoding requires a different approach. (Correct me if I'm wrong.) I tried to find a simple example of implementing it but wasn't successful.
Here's what they list in their appendix:
Algorithm 2: Our formulation
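For reference, if I'm reading the appendix right, their version works directly on logits: with expert logits $s_e$, amateur logits $s_a$, a masking ratio $\alpha$, and a contrast strength $\beta$ (where $\beta = 0$ reduces to plain greedy decoding on the expert), the next-token scores are roughly

$$V_{\text{valid}} = \{\, j : s_e(j) \ge \log\alpha + \max_k s_e(k) \,\}$$

$$s_{\text{CD}}(j) = \begin{cases} (1+\beta)\,s_e(j) - \beta\,s_a(j) & j \in V_{\text{valid}} \\ -\infty & \text{otherwise} \end{cases}$$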
And here is GPT 3.5 16k Turbo's take on the approach required:
By following these steps, you can implement contrastive decoding to improve text generation from your language models.

And here's what it has to say about your statement:

Contrastive decoding involves searching for tokens that maximize a weighted difference in likelihood between a stronger expert model and a weaker amateur model. This requires calculating the differences in probabilities between the expert and amateur models, and then applying a masking threshold to filter out low-probability tokens. The resulting contrastive logits are used for text generation. In contrast, other sampling methods like top-k sampling or nucleus sampling only require the last tokens and current logits to select the next token. These methods do not involve comparing probabilities between different models or applying specific masking techniques. Therefore, contrastive decoding requires a distinct approach that accounts for the differences between the expert and amateur models, setting it apart from other sampling techniques.

It seems like something that could be enabled once speculative decoding with smaller models is implemented, @KerfuffleV2 ?
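For what it's worth, the per-token step itself looks small once you have raw logits from both models for the same position (e.g. what llama_get_logits() returns on two contexts fed the same tokens). A minimal sketch, assuming a shared vocabulary and the α = 0.1, β = 0.5 defaults I recall from the paper (not the paper's reference code):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// One contrastive-decoding step over raw logits. `expert` and `amateur` are
// per-position logit arrays of size n_vocab, e.g. obtained via
// llama_get_logits() on two contexts that were fed identical tokens.
static std::vector<float> contrastive_logits(
        const float * expert, const float * amateur, int n_vocab,
        float alpha = 0.1f, float beta = 0.5f) {
    const float neg_inf = -std::numeric_limits<float>::infinity();

    // 1. Plausibility mask: keep only tokens whose expert logit is within
    //    log(alpha) of the best expert logit.
    const float max_logit = *std::max_element(expert, expert + n_vocab);
    const float cutoff    = std::log(alpha) + max_logit;

    // 2. Contrast: (1 + beta) * expert - beta * amateur on the kept tokens,
    //    -inf everywhere else. Greedy decoding then takes the argmax of this.
    std::vector<float> out(n_vocab, neg_inf);
    for (int j = 0; j < n_vocab; ++j) {
        if (expert[j] >= cutoff) {
            out[j] = (1.0f + beta) * expert[j] - beta * amateur[j];
        }
    }
    return out;
}
```

The element-wise contrast only makes sense if both models produce logits over the same vocabulary, which is why the tokenizer mismatch mentioned below is a problem.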
Yes, it does kind of sound like something that could at least reuse parts of the existing speculative stuff. You might not even need a completely separate model: https://arxiv.org/abs/2309.08168

By the way, you might get more responses if you created this as a discussion rather than an issue.
I created a simple example that uses contrastive decoding in #3984
The original paper includes a benchmark against Contrastive Search ("CS" in this table), which HF transformers implements.
The amateur model used in the titular paper is a 1.5B LLaMA model trained on the same data as LLaMA-1, which presents a reproducibility issue, as they haven't provided the dataset or the resulting weights. They find that a fully trained LLaMA-1 7B as an amateur hurts performance, but a partially trained 7B helps. I don't know where to get one of those either... OpenLLaMA 3B has a different tokenizer/vocabulary, so that won't help.
Yes. I also don't have access to the 1.5B LLaMA model, so I cannot reproduce the exact results from the paper. I think it would be interesting to test a quantized LLaMA 7B model as an amateur; the authors didn't try that in the paper.
Just found the contrastive search paper and it looked interesting, which led me to contrastive decoding. It seems like one issue is the lack of verifiability due to not having access to the 1.5B LLaMA model? From what I understand they use a hardcoded amateur model, but for an implementation here, would it not be possible to have llama.cpp load two models given at startup via arguments? As suggested, it would be interesting to test this, and whether models that are too different would still give good results. I guess it's only the logits output that really matters?
Yes, it is possible. I have started implementing it in the PR (it is not finished yet; I plan to look at it during the weekend). You can compile the current code in the PR with g++.
I made an attempt with TinyLlama, but the results were worse than without contrastive decoding. https://github.com/cebtenzzre/llama.cpp/blob/ceb/contrastive/examples/contrastive/contrastive.cpp
This issue was closed because it has been inactive for 14 days since being marked as stale.
This paper describes a method, similar to speculative sampling, that improves models by sampling the lower-quality model for tokens to avoid, thus increasing the quality of the output of the higher-quality model. This allegedly leads to LLaMA-65B outperforming LLaMA 2, GPT-3.5, and PaLM 2-L on the HellaSwag commonsense reasoning benchmark.
https://arxiv.org/abs/2309.09117
"We demonstrate that Contrastive Decoding -- a simple, computationally light, and training-free text generation method proposed by Li et al 2022 -- achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks. Originally shown to improve the perceived quality of long-form text generation, Contrastive Decoding searches for strings that maximize a weighted difference in likelihood between strong and weak models. We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark, and to outperform LLaMA 2, GPT-3.5 and PaLM-540B on the GSM8K math word reasoning benchmark, in addition to improvements on a collection of other tasks. Analysis suggests that Contrastive Decoding improves over existing methods by preventing some abstract reasoning errors, as well as by avoiding simpler modes such as copying sections of the input during chain-of-thought. Overall, Contrastive Decoding outperforms nucleus sampling for long-form generation and greedy decoding for reasoning tasks, making it a powerful general purpose method for generating text from language models."