This example program allows you to use various LLaMA language models in an easy and efficient way.

To get started right away, run the following command, making sure to use the correct path for the model you have:

#### Unix-based systems (Linux, macOS, etc.):

```bash
./main -m models/7B/ggml-model.bin --prompt "Once upon a time"
```

#### Windows:

```powershell
main.exe -m models\7B\ggml-model.bin --prompt "Once upon a time"
```

For an interactive experience, try this command:

#### Unix-based systems (Linux, macOS, etc.):

```bash
./main -m models/7B/ggml-model.bin -n -1 --color -r "User:" --in-prefix "" --prompt 'User: Hi\nAI: Hello. I am an AI chatbot. Would you like to talk?\nUser: Sure!\nAI: What would you like to talk about?\nUser:'
```

#### Windows:

```powershell
main.exe -m models\7B\ggml-model.bin -n -1 --color -r "User:" --in-prefix " " --prompt "User: Hi\nAI: Hello. I am an AI chatbot. Would you like to talk?\nUser: Sure!\nAI: What would you like to talk about?\nUser:"
```

The following command generates "infinite" text from a starting prompt (you can use `Ctrl-C` to stop it):

```bash
./main -m models/7B/ggml-model.bin --ignore-eos --n_predict -1 --keep -1 --prompt "Once upon a time"
```

In this section, we cover the most commonly used options for running the `main` program with the LLaMA models:

-   `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
-   `-i, --interactive`: Run the program in interactive mode, allowing you to provide input directly and receive real-time responses.
-   `-ins, --instruct`: Run the program in instruction mode, which is particularly useful when working with Alpaca models.
-   `-n N, --n_predict N`: Set the number of tokens to predict when generating text. Adjusting this value can influence the length of the generated text.
-   `-c N, --ctx_size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.

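For example, the common options above can be combined into a single invocation. This is a minimal sketch rather than a recommended configuration; the model path and numeric values are placeholders to adapt to your setup:

```bash
# Interactive session with a 2048-token context and up to 256 predicted tokens per reply
./main -m models/7B/ggml-model.bin -i -c 2048 -n 256
```
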
Instruction mode is particularly useful when working with Alpaca models, which are designed to follow user instructions for specific tasks:

-   `-ins, --instruct`: Enable instruction mode to leverage the capabilities of Alpaca models in completing tasks based on user-provided instructions.

Technical detail: the user's input is internally prefixed with the reverse prompt (or `### Instruction:` as the default), and followed by `### Response:` (except if you just press Return without any input, to keep generating a longer response).

By understanding and utilizing these interaction options, you can create engaging and dynamic experiences with the LLaMA models, tailoring the text generation process to your specific needs.

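As a sketch, assuming an Alpaca-style model at the placeholder path shown, instruction mode can be combined with colorized output like this:

```bash
# Instruction mode for Alpaca-style models; enter an instruction and press Return
./main -m models/7B/ggml-model.bin -ins --color
```
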
By utilizing context management options like `--ctx_size` and `--keep`, you can maintain a more coherent and consistent interaction with the LLaMA models, ensuring that the generated text remains relevant to the original prompt or conversation.

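A hedged sketch of such a setup, assuming `--keep -1` retains the entire initial prompt when the context window fills (as in the infinite-generation example above):

```bash
# Larger context window, keeping the full initial prompt as the context scrolls
./main -m models/7B/ggml-model.bin -c 2048 --keep -1 --prompt "Once upon a time"
```
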
## Generation Flags

The following options allow you to control the text generation process and fine-tune the diversity, creativity, and quality of the generated text according to your needs. By adjusting these options and experimenting with different combinations of values, you can find the best settings for your specific use case.

### Number of Tokens to Predict

-   `-n N, --n_predict N`: Set the number of tokens to predict when generating text (default: 128, -1 = infinity).

The `--n_predict` option controls the number of tokens the model generates in response to the input prompt. By adjusting this value, you can influence the length of the generated text. A higher value will result in longer text, while a lower value will produce shorter text. A value of -1 will cause text to be generated without limit.

It is important to note that the generated text may be shorter than the specified number of tokens if an End-of-Sequence (EOS) token or a reverse prompt is encountered. In interactive mode text generation will pause and control will be returned to the user. In non-interactive mode, the program will end. In both cases, the text generation may stop before reaching the specified `n_predict` value. If you want the model to keep going without ever producing End-of-Sequence on its own, you can use the `--ignore-eos` parameter.

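For instance, the following sketch (placeholder model path) first caps a completion at 256 tokens and then removes the cap entirely by combining `-n -1` with `--ignore-eos`:

```bash
# Cap generation at 256 tokens
./main -m models/7B/ggml-model.bin --prompt "Once upon a time" -n 256

# Generate without a token limit and without stopping at an EOS token
./main -m models/7B/ggml-model.bin --prompt "Once upon a time" -n -1 --ignore-eos
```
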
### Temperature

-   `--temp N`: Adjust the randomness of the generated text (default: 0.8).

Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The default value is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run.

Example usage: `--temp 0.5`

### Repeat Penalty

-   `--repeat_penalty N`: Control the repetition of token sequences in the generated text (default: 1.1).
-   `--repeat_last_n N`: Last n tokens to consider for penalizing repetition (default: 64, 0 = disabled, -1 = ctx_size).
-   `--no-penalize-nl`: Disable penalization for newline tokens when applying the repeat penalty.

The `repeat_penalty` option helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. The default value is 1.1.

The `repeat_last_n` option controls the number of tokens in the history to consider for penalizing repetition. A larger value will look further back in the generated text to prevent repetitions, while a smaller value will only consider recent tokens. A value of 0 disables the penalty, and a value of -1 sets the number of tokens considered equal to the context size (`ctx_size`).

Use the `--no-penalize-nl` option to disable newline penalization when applying the repeat penalty. This option is particularly useful for generating chat conversations, dialogues, code, poetry, or any text where newline tokens play a significant role in structure and formatting. Disabling newline penalization helps maintain the natural flow and intended formatting in these specific use cases.

Example usage: `--repeat_penalty 1.15 --repeat_last_n 128 --no-penalize-nl`

### Top-K Sampling

-   `--top_k N`: Limit the next token selection to the K most probable tokens (default: 40).

Top-k sampling is a text generation method that selects the next token only from the top k most likely tokens predicted by the model. It helps reduce the risk of generating low-probability or nonsensical tokens, but it may also limit the diversity of the output. A higher value for top_k (e.g., 100) will consider more tokens and lead to more diverse text, while a lower value (e.g., 10) will focus on the most probable tokens and generate more conservative text. The default value is 40.

Example usage: `--top_k 30`

### Top-P Sampling

-   `--top_p N`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.9).

Top-p sampling, also known as nucleus sampling, is another text generation method that selects the next token from a subset of tokens that together have a cumulative probability of at least p. This method provides a balance between diversity and quality by considering both the probabilities of tokens and the number of tokens to sample from. A higher value for top_p (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. The default value is 0.9.

Example usage: `--top_p 0.95`

### Tail Free Sampling (TFS)

-   `--tfs N`: Enable tail free sampling with parameter z (default: 1.0, 1.0 = disabled).

Tail free sampling (TFS) is a text generation technique that aims to reduce the impact of less likely tokens, which may be less relevant, less coherent, or nonsensical, on the output. The method adjusts the logits (token probabilities) by raising them to the power of the parameter z. A higher value of z (e.g., 2.0) will further suppress less likely tokens from the tail of the distribution, while a value of 1.0 disables the effect of TFS. By setting the parameter z, you can control how much the probabilities of less likely tokens are reduced.

Example usage: `--tfs 2.0`

### Locally Typical Sampling

-   `--typical N`: Enable locally typical sampling with parameter p (default: 1.0, 1.0 = disabled).

Locally typical sampling promotes the generation of contextually coherent and diverse text by sampling tokens that are typical or expected based on the surrounding context. By setting the parameter p between 0 and 1, you can control the balance between producing text that is locally coherent and diverse. A value closer to 1 will promote more contextually coherent tokens, while a value closer to 0 will promote more diverse tokens. A value equal to 1 disables locally typical sampling.

Example usage: `--typical 0.9`

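These sampling controls can be combined freely on one command line. The following is an illustrative sketch, not a tuned recommendation; the model path and values are placeholders:

```bash
# Moderately creative sampling with a mild repetition penalty
./main -m models/7B/ggml-model.bin --prompt "Once upon a time" \
    --temp 0.7 --top_k 40 --top_p 0.9 --repeat_penalty 1.1 --repeat_last_n 64
```
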

### Mirostat Sampling

-   `--mirostat N`: Enable Mirostat sampling, controlling perplexity during text generation (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0).
-   `--mirostat_lr N`: Set the Mirostat learning rate, parameter eta (default: 0.1).
-   `--mirostat_ent N`: Set the Mirostat target entropy, parameter tau (default: 5.0).

Mirostat is an algorithm that actively maintains the quality of generated text within a desired range during text generation. It aims to strike a balance between coherence and diversity, avoiding low-quality output caused by excessive repetition (boredom traps) or incoherence (confusion traps).

The `--mirostat_lr` option sets the Mirostat learning rate (eta). The learning rate influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. The default value is `0.1`.

The `--mirostat_ent` option sets the Mirostat target entropy (tau), which represents the desired perplexity value for the generated text. Adjusting the target entropy allows you to control the balance between coherence and diversity in the generated text. A lower value will result in more focused and coherent text, while a higher value will lead to more diverse and potentially less coherent text. The default value is `5.0`.

Example usage: `--mirostat 2 --mirostat_lr 0.05 --mirostat_ent 3.0`

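On the command line this might look like the following sketch; the model path is a placeholder and the values are taken from the example above:

```bash
# Mirostat 2.0 with a lower learning rate and a reduced target entropy
./main -m models/7B/ggml-model.bin --prompt "Once upon a time" \
    --mirostat 2 --mirostat_lr 0.05 --mirostat_ent 3.0
```
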
### Logit Bias

-   `-l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS`: Modify the likelihood of a token appearing in the generated text completion.

The logit bias option allows you to manually adjust the likelihood of specific tokens appearing in the generated text. By providing a token ID and a positive or negative bias value, you can increase or decrease the probability of that token being generated.

For example, use `--logit-bias 15043+1` to increase the likelihood of the token 'Hello', or `--logit-bias 15043-1` to decrease its likelihood. Using a value of negative infinity, `--logit-bias 15043-inf` ensures that the token `Hello` is never produced.

A more practical use case might be to prevent the generation of `\code{begin}` and `\code{end}` by setting the `\` token (29905) to negative infinity with `-l 29905-inf`. (This is due to the prevalence of LaTeX codes that show up in LLaMA model inference.)

Example usage: `--logit-bias 29905-inf`

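In a full command this looks like the sketch below. Note that token IDs such as 29905 depend on the model's tokenizer, so verify the ID for your model before relying on a bias:

```bash
# Suppress the backslash token (ID 29905 in this example) to avoid LaTeX-style output
./main -m models/7B/ggml-model.bin --prompt "Once upon a time" -l 29905-inf
```
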
### RNG Seed

-   `-s SEED, --seed SEED`: Set the random number generator (RNG) seed (default: -1, < 0 = random seed).

The RNG seed is used to initialize the random number generator that influences the text generation process. By setting a specific seed value, you can obtain consistent and reproducible results across multiple runs with the same input and settings. This can be helpful for testing, debugging, or comparing the effects of different options on the generated text to see when they diverge. If the seed is set to a value less than 0, a random seed will be used, which will result in different outputs on each run.

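For example, the sketch below fixes the seed so that repeated runs with identical settings produce the same output; the seed value itself is arbitrary:

```bash
# Reproducible run: 1234 is an arbitrary fixed seed
./main -m models/7B/ggml-model.bin --prompt "Once upon a time" -s 1234
```
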
## Performance Tuning and Memory Options

These options help improve the performance and memory usage of the LLaMA models. By adjusting these settings, you can fine-tune the model's behavior to better suit your system's capabilities and achieve optimal performance for your specific use case.

### Number of Threads

-   `-t N, --threads N`: Set the number of threads to use during computation. For optimal performance, it is recommended to set this value to the number of physical CPU cores your system has (as opposed to the logical number of cores). Using the correct number of threads can greatly improve performance.

### Mlock

-   `--mlock`: Lock the model in memory, preventing it from being swapped out when memory-mapped. This can improve performance but trades away some of the advantages of memory-mapping by requiring more RAM to run and potentially slowing down load times as the model loads into RAM.

### No Memory Mapping

-   `--no-mmap`: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance. Disabling mmap results in slower load times but may reduce pageouts if you're not using `--mlock`. Note that if the model is larger than the total amount of RAM, turning off mmap would prevent the model from loading at all.

### Memory Float 32

-   `--memory_f32`: Use 32-bit floats instead of 16-bit floats for memory key+value, allowing higher quality inference at the cost of higher memory usage.

### Batch Size

-   `-b N, --batch_size N`: Set the batch size for prompt processing (default: 512). This large batch size benefits users who have BLAS installed and enabled it during the build. If you don't have BLAS enabled ("BLAS=0"), you can use a smaller number, such as 8, to see the prompt progress as it's evaluated in some situations.

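As an illustration, the sketch below combines several of these options; the thread count is a placeholder that should match your physical core count, and `--mlock` assumes you have enough RAM to hold the model:

```bash
# 8 physical cores, default batch size, model locked in RAM
./main -m models/7B/ggml-model.bin --prompt "Once upon a time" -t 8 -b 512 --mlock
```
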
### Session Caching

-   `--session FNAME`: Specify a file to load/save the session, which caches the model state after the initial prompt. This can significantly speed up the startup time when you're using longer prompts. The session file is created during the first run and is reused in subsequent runs. If you change your prompt such that 75% or less of the session is reusable, the existing session file will be overwritten with a new, updated version to maintain optimal performance.

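For example (the session filename is arbitrary; the file is created on the first run and reused afterwards):

```bash
# Cache the evaluated prompt state so later runs with the same prompt start faster
./main -m models/7B/ggml-model.bin --prompt "Once upon a time" --session my-session.bin
```
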
### Quantization

For information about 4-bit quantization, which can significantly improve performance and reduce memory usage, please refer to llama.cpp's primary [README](../../README.md#prepare-data--run).