* The "context" cache is built using the tokens in the current context of a user generation.
* The "dynamic" cache is built by merging the context caches of previous user generations.
* The "static" cache is built from a large text corpus with no relation to the current context.
-
The key parameters for lookup decoding are `ngram_min`, `ngram_max` and `n_draft`. The first two determine the size of the ngrams to search for in the prompt for a match. The latter specifies how many subsequent tokens to draft if a match is found.
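The tradeoff between these caches lies in relevance to the current context vs. the amount of input data: the context cache is the most relevant but is built from the least data, while the static cache is built from the most data but is unrelated to the current context.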
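
As a rough illustration of the three cache types, the sketch below models an n-gram cache as a mapping from an n-gram (a tuple of token IDs) to counts of the tokens that followed it. The names `build_cache` and `merge_into`, the fixed n-gram size, and the in-memory layout are assumptions made for this example; they are not the data structures used by the actual implementation.

```python
from collections import Counter, defaultdict

# Hypothetical layout: n-gram (tuple of token IDs) -> Counter of following tokens.
def build_cache(tokens, ngram_size=2):
    """Count which token follows each n-gram of length `ngram_size` in `tokens`."""
    cache = defaultdict(Counter)
    for i in range(len(tokens) - ngram_size):
        ngram = tuple(tokens[i:i + ngram_size])
        cache[ngram][tokens[i + ngram_size]] += 1
    return cache

def merge_into(target, source):
    """Add the counts of `source` to `target` in place."""
    for ngram, followers in source.items():
        target[ngram].update(followers)

# Context cache: built from the tokens of the current generation.
context_cache = build_cache([1, 2, 3, 1, 2, 4, 1, 2, 3])

# Dynamic cache: accumulated from the context caches of earlier generations.
dynamic_cache = defaultdict(Counter)
merge_into(dynamic_cache, build_cache([1, 2, 3, 5]))
merge_into(dynamic_cache, context_cache)

# Static cache: same structure, but built once from a large, unrelated corpus
# (a tiny stand-in token list is used here).
static_cache = build_cache([7, 8, 9, 7, 8, 9, 1, 2, 3])
```

In this toy setup the dynamic cache has seen the n-gram `(1, 2)` more often than the context cache alone, which reflects the tradeoff: more data, but less of it tied to the current context.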
When trying to draft a new token using n-gram lookup the algorithm is as follows:
-
More info:
* Try to draft a suitable token from the context cache. If a static cache is available, use it to validate the draft candidates. This is done by simply multiplying the frequencies of the two caches.
* Try to draft a suitable token from the dynamic cache, validate with static cache if available.
* Try to draft a suitable token from the static cache.

Only a single token sequence with the most likely token candidates is drafted.
All tokens must pass thresholds for frequency and sample size in order to be drafted.
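
The drafting order described above can be sketched roughly as follows. This is a simplified illustration, not the actual implementation: the function names, the cache layout (an n-gram tuple mapped to counts of the tokens that followed it), the single fixed n-gram size, and the threshold values `min_sample_size` and `min_percent` are all assumptions chosen for the example. In particular, candidates missing from the static cache simply score 0 here, which is one possible reading of "multiplying the frequencies".

```python
from collections import Counter

def pick_token(primary, ngram, static=None, min_sample_size=2, min_percent=50):
    """Return the most likely follower of `ngram` in `primary`, or None.

    If `static` is given, each candidate's count is multiplied by its count
    in the static cache (assumption: unseen candidates score 0).
    """
    followers = primary.get(ngram)
    if not followers:
        return None
    sample_size = sum(followers.values())
    best_token, best_count, best_score = None, 0, 0
    for token, count in followers.items():
        score = count
        if static is not None:
            score *= static.get(ngram, Counter())[token]
        if score > best_score:
            best_token, best_count, best_score = token, count, score
    if best_token is None:
        return None
    # Thresholds on sample size and relative frequency (hypothetical values).
    if sample_size < min_sample_size:
        return None
    if 100 * best_count < min_percent * sample_size:
        return None
    return best_token

def draft(tokens, context, dynamic, static, n_draft=5, ngram_size=2):
    """Draft up to `n_draft` tokens as a single sequence."""
    drafted = []
    history = list(tokens)
    for _ in range(n_draft):
        ngram = tuple(history[-ngram_size:])
        token = pick_token(context, ngram, static)      # 1. context cache
        if token is None:
            token = pick_token(dynamic, ngram, static)  # 2. dynamic cache
        if token is None and static is not None:
            token = pick_token(static, ngram)           # 3. static cache alone
        if token is None:
            break  # no candidate passed the thresholds
        drafted.append(token)
        history.append(token)
    return drafted

# Toy caches with made-up token IDs.
context = {(1, 2): Counter({3: 3, 4: 1})}
dynamic = {(2, 3): Counter({5: 4})}
static = {(1, 2): Counter({3: 10, 4: 1}),
          (2, 3): Counter({5: 2}),
          (3, 5): Counter({6: 5})}
drafted = draft([1, 2], context, dynamic, static)  # drafted == [3, 5, 6]
```

In the toy run, the first token comes from the context cache, the second from the dynamic cache, and the third from the static cache alone; drafting stops once no cache has a candidate that passes the thresholds.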

Relevant command line arguments:

- `--draft`: maximum number of additional tokens to draft using n-gram lookup. Default: 5. Set to 0 to disable n-gram lookup. **Results are not deterministic with n-gram lookup enabled due to varying batch size.**
- `-lcs FNAME, --lookup-cache-static FNAME`: optional path to a static lookup cache to use for n-gram lookup. Created from a large, unspecific text corpus using `lookup-create`.
- `-lcd FNAME, --lookup-cache-dynamic FNAME`: optional path to a dynamic lookup cache to use for n-gram lookup. Contains data from previous generations. Automatically created and filled while the server is running but by default discarded on server exit. Setting this argument tries to initialize the dynamic cache from a file and saves it to said file on server shutdown.

N-gram lookup caches saved to disk are compatible between models as long as they use the same tokenizer (but for dynamic caches the resulting drafted tokens may be wrong, which means there is no speedup).
Furthermore, the data format for both types of caches is the same, so they can be used interchangeably (but probably not with good results).

## Usage Examples

### `lookup`

Generation using n-gram lookup:

```sh
./lookup --model models/opt/llama_2-7b-q4_0.gguf -ngl 99 --n-predict 256 --ignore-eos --draft 3 --color --prompt "Write a love story about two stars that tragically ends in a type Ia supernova. Use a lot of emotional and dramatic language."
```

The `--color` flag highlights the successfully predicted tokens.
The `--lookup-cache-static` and `--lookup-cache-dynamic` arguments can be set to provide static/dynamic caches.

### `lookup-stats`

Determine n-gram lookup effectiveness for a given text corpus (similar to `perplexity`):