feat(llama.cpp): consolidate options and respect tokenizer template when enabled (#7120)
* feat(llama.cpp): expose env vars as options for consistency. This allows everything to be configured in the model's YAML file rather than through global configuration.
* feat(llama.cpp): respect `use_tokenizer_template` and use the llama.cpp templating system to process messages.
* Detect whether a template exists when the tokenizer template is enabled.
* Better recognition of chat models.
* Fixes to support tool calls while using templates from the tokenizer.
* Fixups.
* Drop template guessing; fix passing tools to the tokenizer.
* Extract grammar and other options from the chat template; add a schema struct.
* Automatically set `use_jinja`.
* Cleanups; identify GGUF models as chat models by default.
* Update docs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Commit 02cc8cbcaa (parent e5e86d0acb), committed by GitHub.
@@ -128,16 +128,44 @@ Models can be also preloaded or downloaded on demand. To learn about model galle

#### YAML configuration

To use the `llama.cpp` backend, specify `llama-cpp` as the backend in the YAML file:

```yaml
name: llama
backend: llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf
```

#### Backend Options

The `llama.cpp` backend supports additional configuration options that can be specified in the `options` field of your model YAML configuration. These options allow fine-tuning of the backend behavior:

| Option | Type | Description | Example |
|--------|------|-------------|---------|
| `use_jinja` or `jinja` | boolean | Enable Jinja2 template processing for chat templates. When enabled, the backend uses Jinja2-based chat templates from the model for formatting messages. | `use_jinja:true` |
| `context_shift` | boolean | Enable context shifting, which allows the model to dynamically adjust context window usage. | `context_shift:true` |
| `cache_ram` | integer | Maximum RAM cache size in MiB for the KV cache. Use `-1` for unlimited (default). | `cache_ram:2048` |
| `parallel` or `n_parallel` | integer | Enable parallel request processing. When set to a value greater than 1, enables continuous batching for handling multiple requests concurrently. | `parallel:4` |
| `grpc_servers` or `rpc_servers` | string | Comma-separated list of gRPC server addresses for distributed inference. Allows distributing workload across multiple llama.cpp workers. | `grpc_servers:localhost:50051,localhost:50052` |

**Example configuration with options:**

```yaml
name: llama-model
backend: llama-cpp
parameters:
  model: model.gguf
options:
- use_jinja:true
- context_shift:true
- cache_ram:4096
- parallel:2
```
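
This commit also makes the backend respect the model's own tokenizer chat template. As a sketch only: the `use_tokenizer_template` key name comes from the commit message above, and its placement under a `template:` section is an assumption, not confirmed by this page:

```yaml
name: llama-model
backend: llama-cpp
parameters:
  model: model.gguf
template:
  # Assumed key name, per the commit message; verify against the LocalAI docs
  use_tokenizer_template: true
```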

**Note:** The `parallel` option can also be set via the `LLAMACPP_PARALLEL` environment variable, and `grpc_servers` can be set via the `LLAMACPP_GRPC_SERVERS` environment variable. Options specified in the YAML file take precedence over environment variables.
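
For distributed inference, which the example above does not show, here is a minimal sketch using `grpc_servers`; the worker addresses are illustrative and assume two llama.cpp workers are already listening on those ports (the same list could instead be passed via `LLAMACPP_GRPC_SERVERS`):

```yaml
name: llama-distributed
backend: llama-cpp
parameters:
  model: model.gguf
options:
  # Illustrative worker addresses; replace with real hosts
  - grpc_servers:localhost:50051,localhost:50052
```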

#### Reference

- [llama.cpp](https://github.com/ggerganov/llama.cpp)