lookup: use hashmaps, select most frequent tokens, abort draft early if no good candidates #5462
While I was working on #5398 I took a look at the conventional lookup example and noticed that it has some issues. This PR attempts to fix those. The changes are:

- The n-gram -> token statistics are stored in hashmaps, so suitable draft tokens can be looked up directly (see the sketch after this list).
- For a matching n-gram, the most frequently observed continuation token is selected as the draft token.
- The draft is aborted early if there are no good candidates.
- `ngram_min` is fixed. On master with `ngram_min = 1` the minimal n-gram size is actually 2; with this PR it is 1. However, on master drafts based on the occurrence of only a single token are mostly useless anyway because their acceptance rate is very low, and the filters added with this PR greatly increase the acceptance rate of those drafts that pass them.
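As a rough sketch of the idea (simplified and purely illustrative, not the actual implementation in this PR; the `ngram_cache` name, the string-serialized keys, and the `min_count`/`min_frac` thresholds are placeholders):

```cpp
// Sketch of a hashmap-based n-gram cache: for each observed n-gram, count which
// tokens followed it, then draft the most frequent one and bail out early if the
// evidence is too weak.
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

using token = int32_t;

// Serialize the n tokens ending just before position `end` into a string so it
// can be used as an unordered_map key.
static std::string make_key(const std::vector<token> & inp, size_t end, int n) {
    return std::string((const char *) (inp.data() + end - n), n * sizeof(token));
}

struct ngram_cache {
    // n-gram -> (token -> how often that token followed the n-gram)
    std::unordered_map<std::string, std::unordered_map<token, int>> counts;

    // Count, for every n-gram size in [ngram_min, ngram_max], which token followed it.
    void update(const std::vector<token> & inp, int ngram_min, int ngram_max) {
        for (int n = ngram_min; n <= ngram_max; ++n) {
            for (size_t i = n; i < inp.size(); ++i) {
                counts[make_key(inp, i, n)][inp[i]]++;
            }
        }
    }

    // Try to draft one token following the end of inp. Returns -1 if no candidate
    // passes the filters, i.e. the draft should be aborted.
    token draft_one(const std::vector<token> & inp, int ngram_min, int ngram_max,
                    int min_count = 2, float min_frac = 0.5f) const {
        for (int n = ngram_max; n >= ngram_min; --n) { // prefer longer n-grams
            if ((int) inp.size() < n) {
                continue;
            }
            const auto it = counts.find(make_key(inp, inp.size(), n));
            if (it == counts.end()) {
                continue;
            }

            int   total      = 0;
            int   best_count = 0;
            token best       = -1;
            for (const auto & [tok, count] : it->second) {
                total += count;
                if (count > best_count) {
                    best_count = count;
                    best       = tok;
                }
            }

            // filters: only draft if the best candidate was seen often enough and
            // clearly dominates the other observed continuations
            if (best_count >= min_count && best_count >= min_frac * (float) total) {
                return best;
            }
        }
        return -1; // no good candidate -> abort the draft early
    }
};

int main() {
    // toy "context": a repeating pattern, as it might occur in code or structured text
    const std::vector<token> inp = {1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3};

    ngram_cache nc;
    nc.update(inp, /*ngram_min =*/ 1, /*ngram_max =*/ 3);

    const token draft = nc.draft_one(inp, 1, 3);
    std::printf("drafted token: %d\n", (int) draft); // prints 4 for this toy input
    return 0;
}
```

The point is that the hashmap gives direct access to the observed continuations of an n-gram, so picking the most frequent one and aborting when the evidence is weak becomes cheap.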
For testing I ran the lookup example on a prompt that is intentionally chosen in an adversarial way: it contains few token sequences that can be copied verbatim into the generation. The model is Miqu q5_K_M run on 3 P40s. I get these results:
Note: the batch size for lookup decoding is `--draft + 1`, which is why the table only goes up to 7. With this PR ~90-95% as many tokens get correctly drafted as on master, but with ~30-40% fewer incorrectly drafted tokens. As a consequence the total t/s increases. However, because P40s have low compute relative to more modern GPUs, they scale comparatively poorly with batch sizes > 1. So for this hardware, and with a prompt that does not already contain a lot of usable token sequences, there is only a very small speedup, if any. I currently don't have a suitable instruct model on hand to test t/s on my RTX 3090, and non-instruct models (with these settings) tend to repeat themselves a lot, which is good for lookup decoding but bad in terms of output quality.

After this PR I intend to implement lookup decoding based not just on the current context but also on general text statistics and previous user generations. I think the best results will be achieved with a hierarchical system: first look for suitable tokens in the current user session, then in the previous user session, then in a more general text corpus like wikitext. The tradeoff is between sample size and relevance to the current generation. You could potentially use the higher-sample-size statistics to select among multiple candidates with more relevance to the current generation. For this the hashmap data structure will be very useful because it only stores those n-gram -> token mappings that are actually observed and as such needs very little memory: in a prototype, the hashmap built from ~500 MiB of wikitext was only ~1 MiB in size.
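To make the hierarchy concrete, here is a rough sketch building on the `ngram_cache` sketch above (the cache names are placeholders, and the simple first-match fallback is just one possible policy; the statistics could also be combined as described above):

```cpp
// Sketch of the planned hierarchical lookup (not part of this PR): query the most
// relevant statistics first and only fall back to larger but less relevant ones.
token draft_hierarchical(const std::vector<token> & inp,
                         const ngram_cache & nc_context, // current user session
                         const ngram_cache & nc_dynamic, // previous user sessions
                         const ngram_cache & nc_static,  // general corpus, e.g. wikitext
                         int ngram_min, int ngram_max) {
    const ngram_cache * caches[] = {&nc_context, &nc_dynamic, &nc_static};
    for (const ngram_cache * nc : caches) {
        const token draft = nc->draft_one(inp, ngram_min, ngram_max);
        if (draft != -1) {
            return draft; // the most relevant cache with a good candidate wins
        }
    }
    return -1; // no cache has a good candidate -> don't draft at all
}
```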
When it comes to the implementation considerations laid out in #4235, the only issue that should arise with the implementation in this PR is that setting a fixed size for the n-gram cache would skew the statistics used for creating the draft. But I think the caches will be small enough that this will not be necessary in the first place.