feat: docs revamp (#7313)

* docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Small enhancements

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Enhancements

* Default to zen-dark

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2025-11-19 22:21:20 +01:00
committed by GitHub
parent 77bbeed57e
commit 2cc4809b0d
61 changed files with 1125 additions and 1002 deletions


@@ -0,0 +1,320 @@
+++
disableToc = false
title = "⚡ GPU acceleration"
weight = 9
url = "/features/gpu-acceleration/"
+++
{{% notice context="warning" %}}
Section under construction
{{% /notice %}}
This section contains instructions on how to use LocalAI with GPU acceleration.
{{% notice icon="⚡" context="warning" %}}
Acceleration for AMD and Metal hardware is still in development; for additional details see the [build]({{%relref "installation/build#Acceleration" %}}) documentation.
{{% /notice %}}
## Automatic Backend Detection
When you install a model from the gallery (or a YAML file), LocalAI intelligently detects the required backend and your system's capabilities, then downloads the correct version for you. Whether you're running on a standard CPU, an NVIDIA GPU, an AMD GPU, or an Intel GPU, LocalAI handles it automatically.
For advanced use cases or to override auto-detection, you can use the `LOCALAI_FORCE_META_BACKEND_CAPABILITY` environment variable. Here are the available options:
- `default`: Forces CPU-only backend. This is the fallback if no specific hardware is detected.
- `nvidia`: Forces backends compiled with CUDA support for NVIDIA GPUs.
- `amd`: Forces backends compiled with ROCm support for AMD GPUs.
- `intel`: Forces backends compiled with SYCL/oneAPI support for Intel GPUs.
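For example, a minimal sketch of forcing the CUDA backends when starting LocalAI with Docker (the image tag and volume path follow the examples used elsewhere in this documentation):
```bash
docker run -p 8080:8080 --gpus all \
  -e LOCALAI_FORCE_META_BACKEND_CAPABILITY=nvidia \
  -v $PWD/models:/models \
  localai/localai:latest-gpu-nvidia-cuda-12
```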
## Model configuration
Depending on the model architecture and backend used, there might be different ways to enable GPU acceleration. It is required to configure the model you intend to use with a YAML config file. For example, for `llama.cpp` workloads a configuration file might look like this (where `gpu_layers` is the number of layers to offload to the GPU):
```yaml
name: my-model-name
parameters:
  # Relative to the models path
  model: llama.cpp-model.ggmlv3.q5_K_M.bin
context_size: 1024
threads: 1
f16: true # enable with GPU acceleration
gpu_layers: 22 # GPU Layers (only used when built with cublas)
```
For diffusers, the configuration might instead look like this:
```yaml
name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"
```
## CUDA(NVIDIA) acceleration
### Requirements
Requirement: nvidia-container-toolkit (installation instructions [1](https://www.server-world.info/en/note?os=Ubuntu_22.04&p=nvidia&f=2) [2](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html))
If using a system with SELinux, ensure you have the policies installed, such as those [provided by nvidia](https://github.com/NVIDIA/dgx-selinux/)
To check which CUDA version you need, you can run either `nvidia-smi` or `nvcc --version`.
Alternatively, you can check `nvidia-smi` with Docker:
```bash
docker run --runtime=nvidia --rm nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
```
To use CUDA, use the images with the `gpu-nvidia-cuda` tags. The image list is available on [quay](https://quay.io/repository/go-skynet/local-ai?tab=tags):
- CUDA `11` tags: `master-gpu-nvidia-cuda-11`, `v1.40.0-gpu-nvidia-cuda-11`, ...
- CUDA `12` tags: `master-gpu-nvidia-cuda-12`, `v1.40.0-gpu-nvidia-cuda-12`, ...
In addition to the commands to run LocalAI normally, you need to specify `--gpus all` to docker, for example:
```bash
docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v1.40.0-gpu-nvidia-cuda-12
```
If the GPU inferencing is working, you should be able to see something like:
```
5:22PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla T4
llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 4321.77 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1598 MB
...................................................................................................
llama_init_from_file: kv self size = 512.00 MB
```
## ROCM(AMD) acceleration
There are a limited number of tested configurations for ROCm systems; however, most newer dedicated consumer-grade GPUs appear to be supported under the current ROCm 6 implementation.
Due to the nature of ROCm, it is best to run all implementations in containers, as this limits the number of packages required on the host system. Compatibility and package versions for dependencies across all OS variations must be tested independently if desired; please refer to the [build]({{%relref "installation/build#Acceleration" %}}) documentation.
### Requirements
- `ROCm 6.x.x` compatible GPU/accelerator
- OS: `Ubuntu` (22.04, 20.04), `RHEL` (9.3, 9.2, 8.9, 8.8), `SLES` (15.5, 15.4)
- Installed to host: `amdgpu-dkms` and `rocm` >=6.0.0 as per ROCm documentation.
### Recommendations
- Make sure not to use the GPU assigned for compute for desktop rendering.
- Ensure at least 100GB of free space on the disk hosting the container runtime and storing images prior to installation.
### Limitations
Verification testing of ROCm compatibility with the integrated backends is ongoing.
Please note the following list of verified backends and devices.
LocalAI hipblas images are built against the following targets: gfx900,gfx906,gfx908,gfx940,gfx941,gfx942,gfx90a,gfx1030,gfx1031,gfx1100,gfx1101
If your device is not one of these, you must specify the corresponding `GPU_TARGETS` and set `REBUILD=true`. Otherwise, you do not need to specify these in the commands below.
### Verified
The devices in the following list have been tested with `hipblas` images running `ROCm 6.0.0`
| Backend | Verified | Devices |
| ---- | ---- | ---- |
| llama.cpp | yes | Radeon VII (gfx906) |
| diffusers | yes | Radeon VII (gfx906) |
| piper | yes | Radeon VII (gfx906) |
| whisper | no | none |
| bark | no | none |
| coqui | no | none |
| transformers | no | none |
| exllama | no | none |
| exllama2 | no | none |
| mamba | no | none |
| sentencetransformers | no | none |
| transformers-musicgen | no | none |
| vall-e-x | no | none |
| vllm | no | none |
**You can help by expanding this list.**
### System Prep
1. Check that your GPU LLVM target is compatible with the version of ROCm. This can be found in the [LLVM Docs](https://llvm.org/docs/AMDGPUUsage.html).
2. Check which ROCm version is compatible with your LLVM target and your chosen OS (pay special attention to supported kernel versions). See the following for compatibility: [ROCm 6.0.0](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.0.0/reference/system-requirements.html) or [ROCm 6.0.2](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html)
3. Install your chosen version of `amdgpu-dkms` and `rocm` (it is recommended to use the native package manager for this process on any OS, as version changes are executed more easily via this method if updates are required). Take care to restart after installing `amdgpu-dkms` and before installing `rocm`; for details, see the installation documentation for your chosen OS ([6.0.2](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/native-install/index.html) or [6.0.0](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.0.0/how-to/native-install/index.html)). A rough sketch of this flow is shown after this list.
4. Deploy. Yes, it's that easy.
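As an illustration only, the Ubuntu native-package flow from step 3 looks roughly like the following (repository setup is omitted; always follow the linked AMD documentation for your OS and ROCm version):
```bash
# Install the kernel driver first, then reboot before installing the ROCm stack
sudo apt update
sudo apt install amdgpu-dkms
sudo reboot
# After the reboot, install the ROCm user-space packages
sudo apt install rocm
```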
#### Setup Example (Docker/containerd)
The following are examples of the ROCm-specific configuration elements required.
```yaml
# For full functionality select a non-'core' image; version locking the image is recommended for debugging purposes.
image: quay.io/go-skynet/local-ai:master-aio-gpu-hipblas
environment:
  - DEBUG=true
  # If your GPU is not already included in the current list of default targets, the following build details are required.
  - REBUILD=true
  - BUILD_TYPE=hipblas
  - GPU_TARGETS=gfx906 # Example for Radeon VII
devices:
  # AMD GPUs only require the following devices to be passed through to the container for offloading to occur.
  - /dev/dri
  - /dev/kfd
```
The same can also be executed as a `run` command for your container runtime:
```bash
docker run \
-e DEBUG=true \
-e REBUILD=true \
-e BUILD_TYPE=hipblas \
-e GPU_TARGETS=gfx906 \
--device /dev/dri \
--device /dev/kfd \
quay.io/go-skynet/local-ai:master-aio-gpu-hipblas
```
Please ensure you add all other required environment variables, port forwards, etc. to your `compose` file or `run` command.
The rebuild process will take some time to complete when deploying these containers, and it is recommended that you `pull` the image prior to deployment (for example, as shown below), as depending on the version these images may be ~20GB in size.
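For instance, to pull the image used in the examples above ahead of time:
```bash
docker pull quay.io/go-skynet/local-ai:master-aio-gpu-hipblas
```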
#### Example (k8s) (Advanced Deployment/WIP)
For k8s deployments there is an additional step required before deployment: the installation of the [ROCm/k8s-device-plugin](https://artifacthub.io/packages/helm/amd-gpu-helm/amd-gpu).
For any k8s environment, the documentation provided by AMD for the ROCm project should work. If you use RKE2 or OpenShift, it is recommended that you deploy the SUSE or Red Hat provided version of this resource to ensure compatibility.
After this has been completed, the [helm chart from go-skynet](https://github.com/go-skynet/helm-charts) can be configured and deployed mostly unedited.
The following are details of the changes that should be made to ensure proper function.
While these details may be configurable in the `values.yaml`, development of this Helm chart is ongoing and subject to change.
The following indicates the final state of the LocalAI deployment relevant to GPU function.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {NAME}-local-ai
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - env:
            - name: HIP_VISIBLE_DEVICES
              value: '0'
              # This variable indicates the devices available to the container (0:device1 1:device2 2:device3) etc.
              # For multiple devices (say device 1 and 3) the value would be equivalent to HIP_VISIBLE_DEVICES="0,2"
              # Please take note of this when an iGPU is present in the host system, as compatibility is not assured.
          ...
          resources:
            limits:
              amd.com/gpu: '1'
            requests:
              amd.com/gpu: '1'
```
This configuration has been tested on a 'custom' cluster managed by SUSE Rancher that was deployed on top of Ubuntu 22.04.4; certification of other configurations is ongoing and compatibility is not guaranteed.
### Notes
- When installing the ROCm kernel driver on your system, ensure that you are installing a version equal to or newer than the one currently implemented in LocalAI (6.0.0 at the time of writing).
- AMD documentation indicates that this will ensure functionality; however, your mileage may vary depending on the GPU and distro you are using.
- If you encounter an `Error 413` when attempting to upload an audio file or image for whisper or llava/bakllava on a k8s deployment, note that the ingress for your deployment may require the annotation `nginx.ingress.kubernetes.io/proxy-body-size: "25m"` to allow larger uploads (see the example command after this list). This may be included in future versions of the helm chart.
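As a sketch, the annotation can be applied to an existing ingress with `kubectl` (the ingress name and namespace `local-ai` are hypothetical; adjust them for your deployment):
```bash
kubectl -n local-ai annotate ingress local-ai \
  nginx.ingress.kubernetes.io/proxy-body-size=25m --overwrite
```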
## Intel acceleration (sycl)
### Requirements
If building from source, you need to install [Intel oneAPI Base Toolkit](https://software.intel.com/content/www/us/en/develop/tools/oneapi/base-toolkit/download.html) and have the Intel drivers available in the system.
### Container images
To use SYCL, use the images with `gpu-intel` in the tag, for example `{{< version >}}-gpu-intel`, ...
The image list is on [quay](https://quay.io/repository/go-skynet/local-ai?tab=tags).
#### Example
To run LocalAI with Docker and SYCL, starting `phi-2`, you can use the following command as an example:
```bash
docker run -e DEBUG=true --privileged -ti -v $PWD/models:/models -p 8080:8080 -v /dev/dri:/dev/dri --rm quay.io/go-skynet/local-ai:master-gpu-intel phi-2
```
### Notes
In addition to the commands to run LocalAI normally, you need to specify `--device /dev/dri` to docker, for example:
```bash
docker run --rm -ti --device /dev/dri -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:{{< version >}}-gpu-intel
```
Note also that SYCL has a known issue where it hangs with `mmap: true`. You have to disable it in the model configuration if it is explicitly enabled.
## Vulkan acceleration
### Requirements
If using NVIDIA, follow the steps in the [CUDA](#cudanvidia-acceleration) section to configure your Docker runtime to allow access to the GPU.
### Container images
To use Vulkan, use the images with the `vulkan` tag, for example `{{< version >}}-gpu-vulkan`.
#### Example
To run LocalAI with Docker and Vulkan, you can use the following command as an example:
```bash
docker run -p 8080:8080 -e DEBUG=true -v $PWD/models:/models localai/localai:latest-gpu-vulkan
```
### Notes
In addition to the commands to run LocalAI normally, you need to specify additional flags to pass the GPU hardware to the container.
These flags are the same as the sections above, depending on the hardware, for [nvidia](#cudanvidia-acceleration), [AMD](#rocmamd-acceleration) or [Intel](#intel-acceleration-sycl).
If you have mixed hardware, you can pass flags for multiple GPUs, for example:
```bash
# --gpus=all: NVIDIA passthrough
# --device /dev/dri --device /dev/kfd: AMD/Intel passthrough
docker run -p 8080:8080 -e DEBUG=true -v $PWD/models:/models \
  --gpus=all \
  --device /dev/dri --device /dev/kfd \
  localai/localai:latest-gpu-vulkan
```


@@ -0,0 +1,38 @@
+++
disableToc = false
title = "Features"
weight = 8
icon = "lightbulb"
type = "chapter"
url = "/features/"
+++
LocalAI provides a comprehensive set of features for running AI models locally. This section covers all the capabilities and functionalities available in LocalAI.
## Core Features
- **[Text Generation](text-generation/)** - Generate text with GPT-compatible models using various backends
- **[Image Generation](image-generation/)** - Create images with Stable Diffusion and other diffusion models
- **[Audio Processing](audio-to-text/)** - Transcribe audio to text and generate speech from text
- **[Embeddings](embeddings/)** - Generate vector embeddings for semantic search and RAG applications
- **[GPT Vision](gpt-vision/)** - Analyze and understand images with vision-language models
## Advanced Features
- **[OpenAI Functions](openai-functions/)** - Use function calling and tools API with local models
- **[Constrained Grammars](constrained_grammars/)** - Control model output format with BNF grammars
- **[GPU Acceleration](GPU-acceleration/)** - Optimize performance with GPU support
- **[Distributed Inference](distributed_inferencing/)** - Scale inference across multiple nodes
- **[Model Context Protocol (MCP)](mcp/)** - Enable agentic capabilities with MCP integration
## Specialized Features
- **[Object Detection](object-detection/)** - Detect and locate objects in images
- **[Reranker](reranker/)** - Improve retrieval accuracy with cross-encoder models
- **[Stores](stores/)** - Vector similarity search for embeddings
- **[Model Gallery](model-gallery/)** - Browse and install pre-configured models
- **[Backends](backends/)** - Learn about available backends and how to manage them
## Getting Started
To start using these features, make sure you have [LocalAI installed](/installation/) and have [downloaded some models](/getting-started/models/). Then explore the feature pages above to learn how to use each capability.


@@ -0,0 +1,44 @@
+++
disableToc = false
title = "🔈 Audio to text"
weight = 16
url = "/features/audio-to-text/"
+++
Audio to text models are models that can generate text from an audio file.
The transcription endpoint allows you to convert audio files to text. The endpoint is based on [whisper.cpp](https://github.com/ggerganov/whisper.cpp), a C++ library for audio transcription. The endpoint input supports all the audio formats supported by `ffmpeg`.
## Usage
Once LocalAI is started and whisper models are installed, you can use the `/v1/audio/transcriptions` API endpoint.
For instance, with cURL:
```bash
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@<FILE_PATH>" -F model="<MODEL_NAME>"
```
## Example
Download one of the models from [here](https://huggingface.co/ggerganov/whisper.cpp/tree/main) into the `models` folder, and create a YAML file for your model:
```yaml
name: whisper-1
backend: whisper
parameters:
  model: whisper-en
```
The transcription endpoint can then be tested like so:
```bash
## Get an example audio file
wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg
## Send the example audio file to the transcriptions endpoint
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@$PWD/gb1.ogg" -F model="whisper-1"
## Result
{"text":"My fellow Americans, this day has brought terrible news and great sadness to our country.At nine o'clock this morning, Mission Control in Houston lost contact with our Space ShuttleColumbia.A short time later, debris was seen falling from the skies above Texas.The Columbia's lost.There are no survivors.One board was a crew of seven.Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain DavidBrown, Commander William McCool, Dr. Kultna Shavla, and Elon Ramon, a colonel in the IsraeliAir Force.These men and women assumed great risk in the service to all humanity.In an age when spaceflight has come to seem almost routine, it is easy to overlook thedangers of travel by rocket and the difficulties of navigating the fierce outer atmosphere ofthe Earth.These astronauts knew the dangers, and they faced them willingly, knowing they had a highand noble purpose in life.Because of their courage and daring and idealism, we will miss them all the more.All Americans today are thinking as well of the families of these men and women who havebeen given this sudden shock and grief.You're not alone.Our entire nation agrees with you, and those you loved will always have the respect andgratitude of this country.The cause in which they died will continue.Mankind has led into the darkness beyond our world by the inspiration of discovery andthe longing to understand.Our journey into space will go on.In the skies today, we saw destruction and tragedy.As farther than we can see, there is comfort and hope.In the words of the prophet Isaiah, \"Lift your eyes and look to the heavens who createdall these, he who brings out the starry hosts one by one and calls them each by name.\"Because of his great power and mighty strength, not one of them is missing.The same creator who names the stars also knows the names of the seven souls we mourntoday.The crew of the shuttle Columbia did not return safely to Earth yet we can pray that all aresafely home.May God bless the grieving families and may God continue to bless America.[BLANK_AUDIO]"}
```


@@ -0,0 +1,124 @@
---
title: "⚙️ Backends"
description: "Learn how to use, manage, and develop backends in LocalAI"
weight: 4
url: "/backends/"
---
LocalAI supports a variety of backends that can be used to run different types of AI models. There are core backends which are included, and containerized applications that provide the runtime environment for specific model types, such as LLMs, diffusion models, or text-to-speech models.
## Managing Backends in the UI
The LocalAI web interface provides an intuitive way to manage your backends:
1. Navigate to the "Backends" section in the navigation menu
2. Browse available backends from configured galleries
3. Use the search bar to find specific backends by name, description, or type
4. Filter backends by type using the quick filter buttons (LLM, Diffusion, TTS, Whisper)
5. Install or delete backends with a single click
6. Monitor installation progress in real-time
Each backend card displays:
- Backend name and description
- Type of models it supports
- Installation status
- Action buttons (Install/Delete)
- Additional information via the info button
## Backend Galleries
Backend galleries are repositories that contain backend definitions. They work similarly to model galleries but are specifically for backends.
### Adding a Backend Gallery
You can add backend galleries by specifying the **Environment Variable** `LOCALAI_BACKEND_GALLERIES`:
```bash
export LOCALAI_BACKEND_GALLERIES='[{"name":"my-gallery","url":"https://raw.githubusercontent.com/username/repo/main/backends"}]'
```
The URL needs to point to a valid YAML file, for example:
```yaml
- name: "test-backend"
  uri: "quay.io/image/tests:localai-backend-test"
  alias: "foo-backend"
```
Here, `uri` is the path to an OCI container image.
### Backend Gallery Structure
A backend gallery is a collection of YAML files, each defining a backend. Here's an example structure:
```yaml
name: "llm-backend"
description: "A backend for running LLM models"
uri: "quay.io/username/llm-backend:latest"
alias: "llm"
tags:
  - "llm"
  - "text-generation"
```
## Pre-installing Backends
You can pre-install backends when starting LocalAI using the `LOCALAI_EXTERNAL_BACKENDS` environment variable:
```bash
export LOCALAI_EXTERNAL_BACKENDS="llm-backend,diffusion-backend"
local-ai run
```
## Creating a Backend
To create a new backend, you need to:
1. Create a container image that implements the LocalAI backend interface
2. Define a backend YAML file
3. Publish your backend to a container registry
### Backend Container Requirements
Your backend container should:
1. Implement the LocalAI backend interface (gRPC or HTTP)
2. Handle model loading and inference
3. Support the required model types
4. Include necessary dependencies
5. Have a top-level `run.sh` file that will be used to run the backend
6. Be pushed to a registry so it can be used in a gallery
### Getting started
To get started, see the available backends in LocalAI here: https://github.com/mudler/LocalAI/tree/master/backend .
- For Python based backends there is a template that can be used as starting point: https://github.com/mudler/LocalAI/tree/master/backend/python/common/template .
- For Golang based backends, you can see the `bark-cpp` backend as an example: https://github.com/mudler/LocalAI/tree/master/backend/go/bark-cpp
- For C++ based backends, you can see the `llama-cpp` backend as an example: https://github.com/mudler/LocalAI/tree/master/backend/cpp/llama-cpp
### Publishing Your Backend
1. Build your container image:
```bash
docker build -t quay.io/username/my-backend:latest .
```
2. Push to a container registry:
```bash
docker push quay.io/username/my-backend:latest
```
3. Add your backend to a gallery:
- Create a YAML entry in your gallery repository
- Include the backend definition
- Make the gallery accessible via HTTP/HTTPS
## Backend Types
LocalAI supports various types of backends:
- **LLM Backends**: For running language models
- **Diffusion Backends**: For image generation
- **TTS Backends**: For text-to-speech conversion
- **Whisper Backends**: For speech-to-text conversion


@@ -0,0 +1,72 @@
+++
disableToc = false
title = "✍️ Constrained Grammars"
weight = 15
url = "/features/constrained_grammars/"
+++
## Overview
The `chat` endpoint supports the `grammar` parameter, which allows users to specify a grammar in Backus-Naur Form (BNF). This feature enables the Large Language Model (LLM) to generate outputs adhering to a user-defined schema, such as `JSON`, `YAML`, or any other format that can be defined using BNF. For more details about BNF, see [Backus-Naur Form on Wikipedia](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form).
{{% notice note %}}
**Compatibility Notice:** This feature is only supported by models that use the [llama.cpp](https://github.com/ggerganov/llama.cpp) backend. For a complete list of compatible models, refer to the [Model Compatibility]({{%relref "reference/compatibility-table" %}}) page. For technical details, see the related pull requests: [PR #1773](https://github.com/ggerganov/llama.cpp/pull/1773) and [PR #1887](https://github.com/ggerganov/llama.cpp/pull/1887).
{{% /notice %}}
## Setup
To use this feature, follow the installation and setup instructions on the [LocalAI Functions]({{%relref "features/openai-functions" %}}) page. Ensure that your local setup meets all the prerequisites specified for the llama.cpp backend.
## 💡 Usage Example
The following example demonstrates how to use the `grammar` parameter to constrain the model's output to either "yes" or "no". This can be particularly useful in scenarios where the response format needs to be strictly controlled.
### Example: Binary Response Constraint
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Do you like apples?"}],
"grammar": "root ::= (\"yes\" | \"no\")"
}'
```
In this example, the `grammar` parameter is set to a simple choice between "yes" and "no", ensuring that the model's response adheres strictly to one of these options regardless of the context.
### Example: JSON Output Constraint
You can also use grammars to enforce JSON output format:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Generate a person object with name and age"}],
"grammar": "root ::= \"{\" \"\\\"name\\\":\" string \",\\\"age\\\":\" number \"}\"\nstring ::= \"\\\"\" [a-z]+ \"\\\"\"\nnumber ::= [0-9]+"
}'
```
### Example: YAML Output Constraint
Similarly, you can enforce YAML format:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Generate a YAML list of fruits"}],
"grammar": "root ::= \"fruits:\" newline (\" - \" string newline)+\nstring ::= [a-z]+\nnewline ::= \"\\n\""
}'
```
## Advanced Usage
For more complex grammars, you can define multi-line BNF rules (see the example after this list). The grammar parser supports:
- Alternation (`|`)
- Repetition (`*`, `+`)
- Optional elements (`?`)
- Character classes (`[a-z]`)
- String literals (`"text"`)
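For instance, a sketch of a small multi-line grammar that uses a named rule and alternation, following the same request shape as the examples above:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "What is the sentiment of: I love this product!"}],
  "grammar": "root ::= sentiment\nsentiment ::= \"positive\" | \"negative\" | \"neutral\""
}'
```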
## Related Features
- [OpenAI Functions]({{%relref "features/openai-functions" %}}) - Function calling with structured outputs
- [Text Generation]({{%relref "features/text-generation" %}}) - General text generation capabilities


@@ -0,0 +1,154 @@
+++
disableToc = false
title = "🆕🖧 Distributed Inference"
weight = 15
url = "/features/distribute/"
+++
This functionality enables LocalAI to distribute inference requests across multiple worker nodes, improving efficiency and performance. Nodes are automatically discovered and connect via p2p by using a shared token which makes sure the communication is secure and private between the nodes of the network.
LocalAI supports two modes of distributed inferencing via p2p:
- **Federated Mode**: Requests are shared between the cluster and routed to a single worker node in the network based on the load balancer's decision.
- **Worker Mode** (aka "model sharding" or "splitting weights"): Requests are processed by all the workers, which contribute to the final inference result (by sharing the model weights).
A list of global instances shared by the community is available at [explorer.localai.io](https://explorer.localai.io).
## Usage
Starting LocalAI with `--p2p` generates a shared token for connecting multiple instances, and that's all you need to create AI clusters, eliminating the need for intricate network setups.
Simply navigate to the "Swarm" section in the WebUI and follow the on-screen instructions.
For fully shared instances, start LocalAI with `--p2p --federated` and follow the Swarm section's guidance. This feature, while still experimental, offers a tech preview quality experience.
### Federated mode
Federated mode allows you to launch multiple LocalAI instances and connect them together in a federated network. This mode is useful when you want to distribute the load of inference across multiple nodes but have a single point of entry for the API. In the Swarm section of the WebUI, you can see the instructions to connect multiple instances together.
![346663124-1d2324fd-8b55-4fa2-9856-721a467969c2](https://github.com/user-attachments/assets/19ebd44a-20ff-412c-b92f-cfb8efbe4b21)
To start a LocalAI server in federated mode, run:
```bash
local-ai run --p2p --federated
```
This will generate a token that you can use to connect other LocalAI instances to the network, or that others can use to join the network. If you already have a token, you can specify it using the `TOKEN` environment variable.
To start a load-balanced server that routes requests to the network, run with the `TOKEN` set:
```bash
local-ai federated
```
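For example, with the token exported via the environment variable mentioned above (the token value is a placeholder):
```bash
TOKEN="<token-from-the-first-instance>" local-ai federated
```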
To see all the available options, run `local-ai federated --help`.
The instructions are displayed in the "Swarm" section of the WebUI, guiding you through the process of connecting multiple instances.
### Workers mode
{{% notice note %}}
This feature is available exclusively with llama-cpp compatible models.
This feature was introduced in [LocalAI pull request #2324](https://github.com/mudler/LocalAI/pull/2324) and is based on the upstream work in [llama.cpp pull request #6829](https://github.com/ggerganov/llama.cpp/pull/6829).
{{% /notice %}}
To connect multiple workers to a single LocalAI instance, first start a server in p2p mode:
```bash
local-ai run --p2p
```
Then navigate to the "Swarm" section of the WebUI to see the instructions for connecting multiple workers to the network.
![346663124-1d2324fd-8b55-4fa2-9856-721a467969c2](https://github.com/user-attachments/assets/b8cadddf-a467-49cf-a1ed-8850de95366d)
### Without P2P
To start workers for distributing the computational load, run:
```bash
local-ai worker llama-cpp-rpc --llama-cpp-args="-H <listening_address> -p <listening_port> -m <memory>"
```
And you can specify the address of the workers when starting LocalAI with the `LLAMACPP_GRPC_SERVERS` environment variable:
```bash
LLAMACPP_GRPC_SERVERS="address1:port,address2:port" local-ai run
```
The workload on the LocalAI server will then be distributed across the specified nodes.
Alternatively, you can build the RPC workers/server following the llama.cpp [README](https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/README.md), which is compatible with LocalAI.
## Manual example (worker)
Use the WebUI to guide you in the process of starting new workers. This example shows the manual steps to highlight the process.
1. Start the server with `--p2p`:
```bash
./local-ai run --p2p
```
Copy the token from the WebUI or via API call (e.g., `curl http://localhost:8080/p2p/token`) and save it for later use.
To reuse the same token later, restart the server with `--p2ptoken` or `P2P_TOKEN`, as shown in the sketch below.
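For instance, a minimal sketch of restarting the server while reusing a previously saved token (the token value is a placeholder):
```bash
P2P_TOKEN="<token-from-step-1>" ./local-ai run --p2p
```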
2. Start the workers. Copy the `local-ai` binary to other hosts and run as many workers as needed using the token:
```bash
TOKEN=XXX ./local-ai worker p2p-llama-cpp-rpc --llama-cpp-args="-m <memory>"
```
(Note: You can also supply the token via command-line arguments)
The server logs should indicate that new workers are being discovered.
3. Start inference as usual on the server initiated in step 1.
![output](https://github.com/mudler/LocalAI/assets/2420543/8ca277cf-c208-4562-8929-808b2324b584)
## Environment Variables
There are options that can be tweaked or parameters that can be set using environment variables:
| Environment Variable | Description |
|----------------------|-------------|
| **LOCALAI_P2P** | Set to "true" to enable p2p |
| **LOCALAI_FEDERATED** | Set to "true" to enable federated mode |
| **FEDERATED_SERVER** | Set to "true" to enable federated server |
| **LOCALAI_P2P_DISABLE_DHT** | Set to "true" to disable DHT and enable p2p layer to be local only (mDNS) |
| **LOCALAI_P2P_ENABLE_LIMITS** | Set to "true" to enable connection limits and resources management (useful when running with poor connectivity or want to limit resources consumption) |
| **LOCALAI_P2P_LISTEN_MADDRS** | Set to comma separated list of multiaddresses to override default libp2p 0.0.0.0 multiaddresses |
| **LOCALAI_P2P_DHT_ANNOUNCE_MADDRS** | Set to comma separated list of multiaddresses to override announcing of listen multiaddresses (useful when external address:port is remapped) |
| **LOCALAI_P2P_BOOTSTRAP_PEERS_MADDRS** | Set to comma separated list of multiaddresses to specify custom DHT bootstrap nodes |
| **LOCALAI_P2P_TOKEN** | Set the token for the p2p network |
| **LOCALAI_P2P_LOGLEVEL** | Set the loglevel for the LocalAI p2p stack (default: info) |
| **LOCALAI_P2P_LIB_LOGLEVEL** | Set the loglevel for the underlying libp2p stack (default: fatal) |
## Architecture
LocalAI uses https://github.com/libp2p/go-libp2p under the hood, the same project powering IPFS. Unlike other frameworks, LocalAI uses peer-to-peer without a single master server: it relies on pub/sub (gossip) and ledger functionalities to achieve consensus across different peers.
[EdgeVPN](https://github.com/mudler/edgevpn) is used as a library to establish the network and expose the ledger functionality under a shared token, to ease automatic discovery and provide separate, private peer-to-peer networks.
When running in worker mode, the weights are split proportionally to the available memory; in federated mode, each request is routed to a single node, which has to load the model fully.
## Debugging
To debug, it's often useful to run in debug mode, for instance:
```bash
LOCALAI_P2P_LOGLEVEL=debug LOCALAI_P2P_LIB_LOGLEVEL=debug LOCALAI_P2P_ENABLE_LIMITS=true LOCALAI_P2P_DISABLE_DHT=true LOCALAI_P2P_TOKEN="<TOKEN>" ./local-ai ...
```
## Notes
- If running in p2p mode with container images, make sure you start the container with `--net host` or `network_mode: host` in the docker-compose file.
- Only a single model is supported currently.
- Ensure the server detects new workers before starting inference. Currently, additional workers cannot be added once inference has begun.
- For more details on the implementation, refer to [LocalAI pull request #2343](https://github.com/mudler/LocalAI/pull/2343)


@@ -0,0 +1,76 @@
+++
disableToc = false
title = "🧠 Embeddings"
weight = 13
url = "/features/embeddings/"
+++
LocalAI supports generating embeddings for text or lists of tokens.
For the API documentation you can refer to the OpenAI docs: https://platform.openai.com/docs/api-reference/embeddings
## Model compatibility
The embedding endpoint is compatible with `llama.cpp` models, `bert.cpp` models, and sentence-transformers models available on Hugging Face.
## Manual Setup
Create a `YAML` config file in the `models` directory. Specify the `backend` and the model file.
```yaml
name: text-embedding-ada-002 # The model name used in the API
parameters:
  model: <model_file>
backend: "<backend>"
embeddings: true
```
## Huggingface embeddings
To use `sentence-transformers` models from Hugging Face, you can use the `sentencetransformers` embedding backend.
```yaml
name: text-embedding-ada-002
backend: sentencetransformers
embeddings: true
parameters:
  model: all-MiniLM-L6-v2
```
The `sentencetransformers` backend uses Python [sentence-transformers](https://github.com/UKPLab/sentence-transformers). For a list of all pre-trained models available see here: https://github.com/UKPLab/sentence-transformers#pre-trained-models
{{% notice note %}}
- The `sentencetransformers` backend is an optional backend of LocalAI and uses Python. If you are running `LocalAI` from the container images, you are good to go; it is already configured for use.
- For local execution, you also have to specify the extra backend in the `EXTERNAL_GRPC_BACKENDS` environment variable.
- Example: `EXTERNAL_GRPC_BACKENDS="sentencetransformers:/path/to/LocalAI/backend/python/sentencetransformers/sentencetransformers.py"`
- The `sentencetransformers` backend only supports embeddings of text, not of tokens. If you need to embed tokens, you can use the `bert` backend or `llama.cpp`.
- No models are required to be downloaded before using the `sentencetransformers` backend. The models will be downloaded automatically the first time the API is used.
{{% /notice %}}
## Llama.cpp embeddings
Embeddings with `llama.cpp` are supported with the `llama-cpp` backend; it needs to be enabled by setting `embeddings` to `true`.
```yaml
name: my-awesome-model
backend: llama-cpp
embeddings: true
parameters:
  model: ggml-file.bin
```
Then you can use the API to generate embeddings:
```bash
curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
"input": "My text",
"model": "my-awesome-model"
}' | jq "."
```
## 💡 Examples
- An example that uses LlamaIndex and LocalAI for embeddings: [here](https://github.com/mudler/LocalAI-examples/tree/main/query_data).


@@ -0,0 +1,37 @@
+++
disableToc = false
title = "🥽 GPT Vision"
weight = 14
url = "/features/gpt-vision/"
+++
LocalAI supports understanding images by using [LLaVA](https://llava.hliu.cc/), and implements the [GPT Vision API](https://platform.openai.com/docs/guides/vision) from OpenAI.
![llava](https://github.com/mudler/LocalAI/assets/2420543/cb0a0897-3b58-4350-af66-e6f4387b58d3)
## Usage
OpenAI docs: https://platform.openai.com/docs/guides/vision
To let LocalAI understand and reply with what it sees in an image, use the `/v1/chat/completions` endpoint, for example with curl:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llava",
"messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
```
Grammars and function tools can be used as well in conjunction with vision APIs:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llava", "grammar": "root ::= (\"yes\" | \"no\")",
"messages": [{"role": "user", "content": [{"type":"text", "text": "Is there some grass in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
```
### Setup
All-in-One images already ship the llava model as `gpt-4-vision-preview`, so no setup is needed in this case.
To set up the LLaVA models, follow the full example in the [configuration examples](https://github.com/mudler/LocalAI-examples/blob/main/configurations/llava/llava.yaml).


@@ -0,0 +1,345 @@
+++
disableToc = false
title = "🎨 Image generation"
weight = 12
url = "/features/image-generation/"
+++
![anime_girl](https://github.com/go-skynet/LocalAI/assets/2420543/8aaca62a-e864-4011-98ae-dcc708103928)
(Generated with [AnimagineXL](https://huggingface.co/Linaqruf/animagine-xl))
LocalAI supports generating images with Stable Diffusion, running on CPU using C++ and Python implementations.
## Usage
OpenAI docs: https://platform.openai.com/docs/api-reference/images/create
To generate an image you can send a POST request to the `/v1/images/generations` endpoint with the instruction as the request body:
```bash
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "A cute baby sea otter",
"size": "256x256"
}'
```
Available additional parameters: `mode`, `step`.
Note: To set a negative prompt, you can split the prompt with `|`, for instance: `a cute baby sea otter|malformed`.
```bash
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
"size": "256x256"
}'
```
## Backends
### stablediffusion-ggml
This backend is based on [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp). Every model supported by that backend is also supported by LocalAI.
#### Setup
There are already several models in the gallery that can be installed and run with this backend. For example, you can run Flux by searching for it in the Model gallery (`flux.1-dev-ggml`) or by starting LocalAI with `run`:
```bash
local-ai run flux.1-dev-ggml
```
To use a custom model, you can follow these steps:
1. Create a model file `stablediffusion.yaml` in the models folder:
```yaml
name: stablediffusion
backend: stablediffusion-ggml
parameters:
  model: gguf_model.gguf
step: 25
cfg_scale: 4.5
options:
  - "clip_l_path:clip_l.safetensors"
  - "clip_g_path:clip_g.safetensors"
  - "t5xxl_path:t5xxl-Q5_0.gguf"
  - "sampler:euler"
```
2. Download the required assets to the `models` directory
3. Start LocalAI
### Diffusers
[Diffusers](https://huggingface.co/docs/diffusers/index) is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. LocalAI has a diffusers backend which allows image generation using the `diffusers` library.
![anime_girl](https://github.com/go-skynet/LocalAI/assets/2420543/8aaca62a-e864-4011-98ae-dcc708103928)
(Generated with [AnimagineXL](https://huggingface.co/Linaqruf/animagine-xl))
#### Model setup
The models will be downloaded automatically from Hugging Face the first time you use the backend.
Create a model configuration file in the `models` directory, for instance to use `Linaqruf/animagine-xl` with CPU:
```yaml
name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers
f16: false
diffusers:
  cuda: false # Enable for GPU usage (CUDA)
  scheduler_type: euler_a
```
#### Dependencies
This is an extra backend: in the container images it is already available and there is nothing to do for the setup. Do not use *core* images (ending with `-core`). If you are building manually, see the [build instructions]({{%relref "installation/build" %}}).
#### Model setup (CUDA)
The models will be downloaded automatically from Hugging Face the first time you use the backend.
Create a model configuration file in the `models` directory, for instance to use `Linaqruf/animagine-xl` with CUDA:
```yaml
name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers
cuda: true
f16: true
diffusers:
  scheduler_type: euler_a
```
#### Local models
You can also use local models, or modify some parameters like `clip_skip`, `scheduler_type`, for instance:
```yaml
name: stablediffusion
parameters:
  model: toonyou_beta6.safetensors
backend: diffusers
step: 30
f16: true
cuda: true
diffusers:
  pipeline_type: StableDiffusionPipeline
  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
  scheduler_type: "k_dpmpp_sde"
  clip_skip: 11
cfg_scale: 8
```
#### Configuration parameters
The following parameters are available in the configuration file:
| Parameter | Description | Default |
| --- | --- | --- |
| `f16` | Force the usage of `float16` instead of `float32` | `false` |
| `step` | Number of steps to run the model for | `30` |
| `cuda` | Enable CUDA acceleration | `false` |
| `enable_parameters` | Parameters to enable for the model | `negative_prompt,num_inference_steps,clip_skip` |
| `scheduler_type` | Scheduler type | `k_dpmpp_sde` |
| `cfg_scale` | Configuration scale | `8` |
| `clip_skip` | Clip skip | None |
| `pipeline_type` | Pipeline type | `AutoPipelineForText2Image` |
| `lora_adapters` | A list of lora adapters (file names relative to model directory) to apply | None |
| `lora_scales` | A list of lora scales (floats) to apply | None |
Several types of schedulers are available:
| Scheduler | Description |
| --- | --- |
| `ddim` | DDIM |
| `pndm` | PNDM |
| `heun` | Heun |
| `unipc` | UniPC |
| `euler` | Euler |
| `euler_a` | Euler a |
| `lms` | LMS |
| `k_lms` | LMS Karras |
| `dpm_2` | DPM2 |
| `k_dpm_2` | DPM2 Karras |
| `dpm_2_a` | DPM2 a |
| `k_dpm_2_a` | DPM2 a Karras |
| `dpmpp_2m` | DPM++ 2M |
| `k_dpmpp_2m` | DPM++ 2M Karras |
| `dpmpp_sde` | DPM++ SDE |
| `k_dpmpp_sde` | DPM++ SDE Karras |
| `dpmpp_2m_sde` | DPM++ 2M SDE |
| `k_dpmpp_2m_sde` | DPM++ 2M SDE Karras |
Available pipeline types:
| Pipeline type | Description |
| --- | --- |
| `StableDiffusionPipeline` | Stable diffusion pipeline |
| `StableDiffusionImg2ImgPipeline` | Stable diffusion image to image pipeline |
| `StableDiffusionDepth2ImgPipeline` | Stable diffusion depth to image pipeline |
| `DiffusionPipeline` | Diffusion pipeline |
| `StableDiffusionXLPipeline` | Stable diffusion XL pipeline |
| `StableVideoDiffusionPipeline` | Stable video diffusion pipeline |
| `AutoPipelineForText2Image` | Automatic detection pipeline for text to image |
| `VideoDiffusionPipeline` | Video diffusion pipeline |
| `StableDiffusion3Pipeline` | Stable diffusion 3 pipeline |
| `FluxPipeline` | Flux pipeline |
| `FluxTransformer2DModel` | Flux transformer 2D model |
| `SanaPipeline` | Sana pipeline |
##### Advanced: Additional parameters
Additional arbitrary parameters can be specified in the `options` field as key/value pairs separated by `:`:
```yaml
name: animagine-xl
options:
  - "cfg_scale:6"
```
**Note**: There is no complete parameter list. Any parameter can be passed arbitrarily and is forwarded to the model directly as an argument to the pipeline. Different pipelines/implementations support different parameters.
The example above will result in the following Python code when generating images:
```python
pipe(
    prompt="A cute baby sea otter", # Options passed via API
    size="256x256", # Options passed via API
    cfg_scale=6 # Additional parameter passed via the configuration file
)
```
#### Usage
#### Text to Image
Use the `image` generation endpoint with the `model` name from the configuration file:
```bash
curl http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "<positive prompt>|<negative prompt>",
"model": "animagine-xl",
"step": 51,
"size": "1024x1024"
}'
```
#### Image to Image
https://huggingface.co/docs/diffusers/using-diffusers/img2img
An example model (GPU):
```yaml
name: stablediffusion-edit
parameters:
  model: nitrosocke/Ghibli-Diffusion
backend: diffusers
step: 25
cuda: true
f16: true
diffusers:
  pipeline_type: StableDiffusionImg2ImgPipeline
  enable_parameters: "negative_prompt,num_inference_steps,image"
```
```bash
IMAGE_PATH=/path/to/your/image
(echo -n '{"file": "'; base64 $IMAGE_PATH; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-edit"}') |
curl -H "Content-Type: application/json" -d @- http://localhost:8080/v1/images/generations
```
##### 🖼️ Flux kontext with `stable-diffusion.cpp`
LocalAI supports Flux Kontext and can be used to edit images via the API:
Install with:
```bash
local-ai run flux.1-kontext-dev
```
To test:
```bash
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
"model": "flux.1-kontext-dev",
"prompt": "change 'flux.cpp' to 'LocalAI'",
"size": "256x256",
"ref_images": [
"https://raw.githubusercontent.com/leejet/stable-diffusion.cpp/master/assets/flux/flux1-dev-q8_0.png"
]
}'
```
#### Depth to Image
https://huggingface.co/docs/diffusers/using-diffusers/depth2img
```yaml
name: stablediffusion-depth
parameters:
  model: stabilityai/stable-diffusion-2-depth
backend: diffusers
step: 50
f16: true
cuda: true
diffusers:
  pipeline_type: StableDiffusionDepth2ImgPipeline
  enable_parameters: "negative_prompt,num_inference_steps,image"
cfg_scale: 6
```
```bash
(echo -n '{"file": "'; base64 ~/path/to/image.jpeg; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-depth"}') |
curl -H "Content-Type: application/json" -d @- http://localhost:8080/v1/images/generations
```
#### img2vid
```yaml
name: img2vid
parameters:
  model: stabilityai/stable-video-diffusion-img2vid
backend: diffusers
step: 25
f16: true
cuda: true
diffusers:
  pipeline_type: StableVideoDiffusionPipeline
```
```bash
(echo -n '{"file": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true","size": "512x512","model":"img2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations
```
#### txt2vid
```yaml
name: txt2vid
parameters:
  model: damo-vilab/text-to-video-ms-1.7b
backend: diffusers
step: 25
f16: true
cuda: true
diffusers:
  pipeline_type: VideoDiffusionPipeline
  cuda: true
```
```bash
(echo -n '{"prompt": "spiderman surfing","size": "512x512","model":"txt2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations
```


@@ -0,0 +1,348 @@
+++
title = "🔗 Model Context Protocol (MCP)"
weight = 20
toc = true
description = "Agentic capabilities with Model Context Protocol integration"
tags = ["MCP", "Agents", "Tools", "Advanced"]
categories = ["Features"]
+++
LocalAI now supports the **Model Context Protocol (MCP)**, enabling powerful agentic capabilities by connecting AI models to external tools and services. This feature allows your LocalAI models to interact with various MCP servers, providing access to real-time data, APIs, and specialized tools.
## What is MCP?
The Model Context Protocol is a standard for connecting AI models to external tools and data sources. It enables AI agents to:
- Access real-time information from external APIs
- Execute commands and interact with external systems
- Use specialized tools for specific tasks
- Maintain context across multiple tool interactions
## Key Features
- **🔄 Real-time Tool Access**: Connect to external MCP servers for live data
- **🛠️ Multiple Server Support**: Configure both remote HTTP and local stdio servers
- **⚡ Cached Connections**: Efficient tool caching for better performance
- **🔒 Secure Authentication**: Support for bearer token authentication
- **🎯 OpenAI Compatible**: Uses the familiar `/mcp/v1/chat/completions` endpoint
- **🧠 Advanced Reasoning**: Configurable reasoning and re-evaluation capabilities
- **📋 Auto-Planning**: Break down complex tasks into manageable steps
- **🎯 MCP Prompts**: Specialized prompts for better MCP server interaction
- **🔄 Plan Re-evaluation**: Dynamic plan adjustment based on results
- **⚙️ Flexible Agent Control**: Customizable execution limits and retry behavior
## Configuration
MCP support is configured in your model's YAML configuration file using the `mcp` section:
```yaml
name: my-agentic-model
backend: llama-cpp
parameters:
  model: qwen3-4b.gguf
mcp:
  remote: |
    {
      "mcpServers": {
        "weather-api": {
          "url": "https://api.weather.com/v1",
          "token": "your-api-token"
        },
        "search-engine": {
          "url": "https://search.example.com/mcp",
          "token": "your-search-token"
        }
      }
    }
  stdio: |
    {
      "mcpServers": {
        "file-manager": {
          "command": "python",
          "args": ["-m", "mcp_file_manager"],
          "env": {
            "API_KEY": "your-key"
          }
        },
        "database-tools": {
          "command": "node",
          "args": ["database-mcp-server.js"],
          "env": {
            "DB_URL": "postgresql://localhost/mydb"
          }
        }
      }
    }
agent:
  max_attempts: 3 # Maximum number of tool execution attempts
  max_iterations: 3 # Maximum number of reasoning iterations
  enable_reasoning: true # Enable tool reasoning capabilities
  enable_planning: false # Enable auto-planning capabilities
  enable_mcp_prompts: false # Enable MCP prompts
  enable_plan_re_evaluator: false # Enable plan re-evaluation
```
### Configuration Options
#### Remote Servers (`remote`)
Configure HTTP-based MCP servers:
- **`url`**: The MCP server endpoint URL
- **`token`**: Bearer token for authentication (optional)
#### STDIO Servers (`stdio`)
Configure local command-based MCP servers:
- **`command`**: The executable command to run
- **`args`**: Array of command-line arguments
- **`env`**: Environment variables (optional)
#### Agent Configuration (`agent`)
Configure agent behavior and tool execution:
- **`max_attempts`**: Maximum number of tool execution attempts (default: 3)
- **`max_iterations`**: Maximum number of reasoning iterations (default: 3)
- **`enable_reasoning`**: Enable tool reasoning capabilities (default: false)
- **`enable_planning`**: Enable auto-planning capabilities (default: false)
- **`enable_mcp_prompts`**: Enable MCP prompts (default: false)
- **`enable_plan_re_evaluator`**: Enable plan re-evaluation (default: false)
## Usage
### API Endpoint
Use the MCP-enabled completion endpoint:
```bash
curl http://localhost:8080/mcp/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-agentic-model",
"messages": [
{"role": "user", "content": "What is the current weather in New York?"}
],
"temperature": 0.7
}'
```
### Example Response
```json
{
  "id": "chatcmpl-123",
  "created": 1699123456,
  "model": "my-agentic-model",
  "choices": [
    {
      "text": "The current weather in New York is 72°F (22°C) with partly cloudy skies. The humidity is 65% and there's a light breeze from the west at 8 mph."
    }
  ],
  "object": "text_completion"
}
```
## Example Configurations
### Docker-based Tools
```yaml
name: docker-agent
backend: llama-cpp
parameters:
  model: qwen3-4b.gguf
mcp:
  stdio: |
    {
      "mcpServers": {
        "searxng": {
          "command": "docker",
          "args": [
            "run", "-i", "--rm",
            "quay.io/mudler/tests:duckduckgo-localai"
          ]
        }
      }
    }
agent:
  max_attempts: 5
  max_iterations: 5
  enable_reasoning: true
  enable_planning: true
  enable_mcp_prompts: true
  enable_plan_re_evaluator: true
```
## Agent Configuration Details
The `agent` section controls how the AI model interacts with MCP tools:
### Execution Control
- **`max_attempts`**: Limits how many times a tool can be retried if it fails. Higher values provide more resilience but may increase response time.
- **`max_iterations`**: Controls the maximum number of reasoning cycles the agent can perform. More iterations allow for complex multi-step problem solving.
### Reasoning Capabilities
- **`enable_reasoning`**: When enabled, the agent uses advanced reasoning to better understand tool results and plan next steps.
### Planning Capabilities
- **`enable_planning`**: When enabled, the agent uses auto-planning to break down complex tasks into manageable steps and execute them systematically. The agent will automatically detect when planning is needed.
- **`enable_mcp_prompts`**: When enabled, the agent uses specialized prompts exposed by the MCP servers to interact with the exposed tools.
- **`enable_plan_re_evaluator`**: When enabled, the agent can re-evaluate and adjust its execution plan based on intermediate results.
### Recommended Settings
- **Simple tasks**: `max_attempts: 2`, `max_iterations: 2`, `enable_reasoning: false`, `enable_planning: false`
- **Complex tasks**: `max_attempts: 5`, `max_iterations: 5`, `enable_reasoning: true`, `enable_planning: true`, `enable_mcp_prompts: true`
- **Advanced planning**: `max_attempts: 5`, `max_iterations: 5`, `enable_reasoning: true`, `enable_planning: true`, `enable_mcp_prompts: true`, `enable_plan_re_evaluator: true`
- **Development/Debugging**: `max_attempts: 1`, `max_iterations: 1`, `enable_reasoning: true`, `enable_planning: true`
## How It Works
1. **Tool Discovery**: LocalAI connects to configured MCP servers and discovers available tools
2. **Tool Caching**: Tools are cached per model for efficient reuse
3. **Agent Execution**: The AI model uses the [Cogito](https://github.com/mudler/cogito) framework to execute tools
4. **Response Generation**: The model generates responses incorporating tool results
## Supported MCP Servers
LocalAI is compatible with any MCP-compliant server.
## Best Practices
### Security
- Use environment variables for sensitive tokens
- Validate MCP server endpoints before deployment
- Implement proper authentication for remote servers
### Performance
- Cache frequently used tools
- Use appropriate timeout values for external APIs
- Monitor resource usage for stdio servers
### Error Handling
- Implement fallback mechanisms for tool failures
- Log tool execution for debugging
- Handle network timeouts gracefully
### With External Applications
Use MCP-enabled models in your applications:
```python
import openai
client = openai.OpenAI(
base_url="http://localhost:8080/mcp/v1",
api_key="your-api-key"
)
response = client.chat.completions.create(
model="my-agentic-model",
messages=[
{"role": "user", "content": "Analyze the latest research papers on AI"}
]
)
```
### MCP and adding packages
It might be handy to install packages before starting the container in order to set up the environment. The following example shows how to do that with docker-compose (installing and configuring Docker so that Docker-based MCP servers can be used):
```yaml
services:
local-ai:
image: localai/localai:latest
#image: localai/localai:latest-gpu-nvidia-cuda-12
container_name: local-ai
restart: always
entrypoint: [ "/bin/bash" ]
command: >
-c "apt-get update &&
apt-get install -y docker.io &&
/entrypoint.sh"
environment:
- DEBUG=true
- LOCALAI_WATCHDOG_IDLE=true
- LOCALAI_WATCHDOG_BUSY=true
- LOCALAI_WATCHDOG_IDLE_TIMEOUT=15m
- LOCALAI_WATCHDOG_BUSY_TIMEOUT=15m
- LOCALAI_API_KEY=my-beautiful-api-key
- DOCKER_HOST=tcp://docker:2376
- DOCKER_TLS_VERIFY=1
- DOCKER_CERT_PATH=/certs/client
ports:
- "8080:8080"
volumes:
- /data/models:/models
- /data/backends:/backends
- certs:/certs:ro
# uncomment for nvidia
# deploy:
# resources:
# reservations:
# devices:
# - capabilities: [gpu]
# device_ids: ['7']
# runtime: nvidia
docker:
image: docker:dind
privileged: true
container_name: docker
volumes:
- certs:/certs
healthcheck:
test: ["CMD", "docker", "info"]
interval: 10s
timeout: 5s
volumes:
certs:
```
An example model config (to append to any existing model you have) can be:
```yaml
mcp:
stdio: |
{
"mcpServers": {
"weather": {
"command": "docker",
"args": [
"run", "-i", "--rm",
"ghcr.io/mudler/mcps/weather:master"
]
},
"memory": {
"command": "docker",
"env": {
"MEMORY_FILE_PATH": "/data/memory.json"
},
"args": [
"run", "-i", "--rm", "-v", "/host/data:/data",
"ghcr.io/mudler/mcps/memory:master"
]
},
"ddg": {
"command": "docker",
"env": {
"MAX_RESULTS": "10"
},
"args": [
"run", "-i", "--rm", "-e", "MAX_RESULTS",
"ghcr.io/mudler/mcps/duckduckgo:master"
]
}
}
}
```
### Links
- [Awesome MCPs](https://github.com/punkpeye/awesome-mcp-servers)
- [A list of MCPs by mudler](https://github.com/mudler/MCPs)

View File

@@ -0,0 +1,458 @@
+++
disableToc = false
title = "🖼️ Model gallery"
weight = 18
url = '/models'
+++
The model gallery is a curated collection of model configurations for [LocalAI](https://github.com/go-skynet/LocalAI) that enables one-click installation of models directly from the LocalAI Web interface.
A list of the models available can also be browsed at [the Public LocalAI Gallery](https://models.localai.io).
To ease model installation, LocalAI provides ways to preload models on start and to download and install them at runtime. You can install models manually by copying them into the `models` directory, or use the API or the Web interface to configure, download and verify the model assets for you.
{{% notice note %}}
The models in this gallery are not directly maintained by LocalAI. If you find a model that is not working, please open an issue on the model gallery repository.
{{% /notice %}}
{{% notice note %}}
GPT and text generation models might have a license which is not permissive for commercial use or might be questionable or without any license at all. Please check the model license before using it. The official gallery contains only open licensed models.
{{% /notice %}}
![output](https://github.com/mudler/LocalAI/assets/2420543/7b16676e-d5b1-4c97-89bd-9fa5065c21ad)
## Useful Links and resources
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) - here you can find a list of the most performing models on the Open LLM benchmark. Keep in mind models compatible with LocalAI must be quantized in the `gguf` format.
## How it works
Navigate to the "Models" section of the WebUI from the navbar at the top. Here you can find a list of models that can be installed, and you can install them by clicking the "Install" button.
## Add other galleries
You can add other galleries by setting the `GALLERIES` environment variable. The `GALLERIES` environment variable is a list of JSON objects, where each object has a `name` and a `url` field. The `name` field is the name of the gallery, and the `url` field is the URL of the gallery's index file, for example:
```json
GALLERIES=[{"name":"<GALLERY_NAME>", "url":"<GALLERY_URL"}]
```
The models in the gallery will be automatically indexed and available for installation.
## API Reference
### Model repositories
You can install a model at runtime, while the API is already running, or before starting the API by preloading the models.
To install a model at runtime you will need to use the `/models/apply` LocalAI API endpoint.
By default LocalAI is configured with the `localai` repository.
To use additional repositories you need to start `local-ai` with the `GALLERIES` environment variable:
```
GALLERIES=[{"name":"<GALLERY_NAME>", "url":"<GALLERY_URL"}]
```
For example, to enable the default `localai` repository, you can start `local-ai` with:
```
GALLERIES=[{"name":"localai", "url":"github:mudler/localai/gallery/index.yaml"}]
```
where `github:mudler/localai/gallery/index.yaml` will be expanded automatically to `https://raw.githubusercontent.com/mudler/LocalAI/main/index.yaml`.
Note: URLs are expanded automatically for `github` and `huggingface`, but the `https://` and `http://` prefixes work as well.
{{% notice note %}}
If you want to build your own gallery, there is no documentation yet. However, you can find the source of the default gallery in the [LocalAI repository](https://github.com/mudler/LocalAI/tree/master/gallery).
{{% /notice %}}
### List Models
To list all the available models, use the `/models/available` endpoint:
```bash
curl http://localhost:8080/models/available
```
To search for a model, you can use `jq`:
```bash
curl http://localhost:8080/models/available | jq '.[] | select(.name | contains("replit"))'
curl http://localhost:8080/models/available | jq '.[] | .name | select(contains("localmodels"))'
curl http://localhost:8080/models/available | jq '.[] | .urls | select(. != null) | add | select(contains("orca"))'
```
### How to install a model from the repositories
Models can be installed by passing either the full URL of the YAML config file or an identifier of the model in the gallery. The gallery is a repository of models that can be installed by passing the model name.
To install a model from the gallery repository, you can pass the model name in the `id` field. For instance, to install the `bert-embeddings` model, you can use the following command:
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"id": "localai@bert-embeddings"
}'
```
where:
- `localai` is the repository. It is optional and can be omitted. If the repository is omitted, LocalAI will search the model by name in all the repositories. If the same model name is present in multiple galleries, the first match wins.
- `bert-embeddings` is the model name in the gallery
(read its [config here](https://github.com/mudler/LocalAI/blob/master/gallery/bert-embeddings.yaml)).
### How to install a model not part of a gallery
If you don't want to set any gallery repository, you can still install models by loading a model configuration file.
In the body of the request you must specify the model configuration file URL (`url`), optionally a name to install the model (`name`), extra files to install (`files`), and configuration overrides (`overrides`). When calling the API endpoint, LocalAI will download the model files and write the configuration to the folder used to store models.
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"config_url": "<MODEL_CONFIG_FILE_URL>"
}'
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"id": "<GALLERY>@<MODEL_NAME>"
}'
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "<MODEL_CONFIG_FILE_URL>"
}'
```
An example that installs hermes-2-pro-mistral can be:
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"config_url": "https://raw.githubusercontent.com/mudler/LocalAI/v2.25.0/embedded/models/hermes-2-pro-mistral.yaml"
}'
```
The API will return a job `uuid` that you can use to track the job progress:
```
{"uuid":"1059474d-f4f9-11ed-8d99-c4cbe106d571","status":"http://localhost:8080/models/jobs/1059474d-f4f9-11ed-8d99-c4cbe106d571"}
```
For instance, a small example bash script that waits for a job to complete can be (requires `jq`):
```bash
response=$(curl -s http://localhost:8080/models/apply -H "Content-Type: application/json" -d "{\"url\": \"$model_url\"}")
job_id=$(echo "$response" | jq -r '.uuid')
while [ "$(curl -s http://localhost:8080/models/jobs/"$job_id" | jq -r '.processed')" != "true" ]; do
sleep 1
done
echo "Job completed"
```
To preload models on start instead you can use the `PRELOAD_MODELS` environment variable.
<details>
To preload models on start, set the `PRELOAD_MODELS` environment variable to a JSON array of model URIs:
```bash
PRELOAD_MODELS='[{"url": "<MODEL_URL>"}]'
```
Note: either `url` or `id` must be specified. `url` points to a model gallery configuration file, while `id` refers to a model inside a repository. If both are specified, the `id` will be used.
For example:
```bash
PRELOAD_MODELS=[{"url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"}]
```
or as arg:
```bash
local-ai --preload-models '[{"url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"}]'
```
or in a YAML file:
```bash
local-ai --preload-models-config "/path/to/yaml"
```
YAML:
```yaml
- url: github:mudler/LocalAI/gallery/stablediffusion.yaml@master
```
</details>
{{% notice note %}}
You can already find some openly licensed models in the [LocalAI gallery](https://github.com/mudler/LocalAI/tree/master/gallery).
If you don't find the model in the gallery, you can try to use the "base" model and provide a URL to LocalAI:
<details>
```
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "github:mudler/LocalAI/gallery/base.yaml@master",
"name": "model-name",
"files": [
{
"uri": "<URL>",
"sha256": "<SHA>",
"filename": "model"
}
]
}'
```
</details>
{{% /notice %}}
### Override a model name
To install a model with a different name, specify a `name` parameter in the request body.
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "<MODEL_CONFIG_FILE>",
"name": "<MODEL_NAME>"
}'
```
For example, to install a model as `gpt-3.5-turbo`:
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "github:mudler/LocalAI/gallery/gpt4all-j.yaml",
"name": "gpt-3.5-turbo"
}'
```
### Additional Files
<details>
To download additional files with the model, use the `files` parameter:
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "<MODEL_CONFIG_FILE>",
"name": "<MODEL_NAME>",
"files": [
{
"uri": "<additional_file_url>",
"sha256": "<additional_file_hash>",
"filename": "<additional_file_name>"
}
]
}'
```
</details>
### Overriding configuration files
<details>
To override portions of the configuration file, such as the backend or the model file, use the `overrides` parameter:
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "<MODEL_CONFIG_FILE>",
"name": "<MODEL_NAME>",
"overrides": {
"backend": "llama",
"f16": true,
...
}
}'
```
</details>
## Examples
### Embeddings: Bert
<details>
```bash
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"id": "bert-embeddings",
"name": "text-embedding-ada-002"
}'
```
To test it:
```bash
LOCALAI=http://localhost:8080
curl $LOCALAI/v1/embeddings -H "Content-Type: application/json" -d '{
"input": "Test",
"model": "text-embedding-ada-002"
}'
```
</details>
### Image generation: Stable diffusion
URL: https://github.com/EdVince/Stable-Diffusion-NCNN
{{< tabs >}}
{{% tab name="Prepare the model in runtime" %}}
While the API is running, you can install the model by using the `/models/apply` endpoint and point it to the `stablediffusion` model in the [models-gallery](https://github.com/mudler/LocalAI/tree/master/gallery#image-generation-stable-diffusion):
```bash
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"
}'
```
{{% /tab %}}
{{% tab name="Automatically prepare the model before start" %}}
You can set the `PRELOAD_MODELS` environment variable:
```bash
PRELOAD_MODELS=[{"url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"}]
```
or as arg:
```bash
local-ai --preload-models '[{"url": "github:mudler/LocalAI/gallery/stablediffusion.yaml@master"}]'
```
or in a YAML file:
```bash
local-ai --preload-models-config "/path/to/yaml"
```
YAML:
```yaml
- url: github:mudler/LocalAI/gallery/stablediffusion.yaml@master
```
{{% /tab %}}
{{< /tabs >}}
Test it:
```
curl $LOCALAI/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
"mode": 2, "seed":9000,
"size": "256x256", "n":2
}'
```
### Audio transcription: Whisper
URL: https://github.com/ggerganov/whisper.cpp
{{< tabs >}}
{{% tab name="Prepare the model in runtime" %}}
```bash
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "github:mudler/LocalAI/gallery/whisper-base.yaml@master",
"name": "whisper-1"
}'
```
{{% /tab %}}
{{% tab name="Automatically prepare the model before start" %}}
You can set the `PRELOAD_MODELS` environment variable:
```bash
PRELOAD_MODELS=[{"url": "github:mudler/LocalAI/gallery/whisper-base.yaml@master", "name": "whisper-1"}]
```
or as arg:
```bash
local-ai --preload-models '[{"url": "github:mudler/LocalAI/gallery/whisper-base.yaml@master", "name": "whisper-1"}]'
```
or in a YAML file:
```bash
local-ai --preload-models-config "/path/to/yaml"
```
YAML:
```yaml
- url: github:mudler/LocalAI/gallery/whisper-base.yaml@master
name: whisper-1
```
{{% /tab %}}
{{< /tabs >}}
### Note
LocalAI will create a batch process that downloads the required files from a model definition and automatically reloads itself to include the new model.
Input: `url` or `id` (required), `name` (optional), `files` (optional)
```bash
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
"url": "<MODEL_DEFINITION_URL>",
"id": "<GALLERY>@<MODEL_NAME>",
"name": "<INSTALLED_MODEL_NAME>",
"files": [
{
"uri": "<additional_file>",
"sha256": "<additional_file_hash>",
"filename": "<additional_file_name>"
        }
     ],
     "overrides": { "backend": "...", "f16": true }
   }'
```
An optional list of additional files to download can be specified within `files`. The `name` field allows overriding the installed model name. Finally, it is possible to override fields of the model config file with `overrides`.
The `url` is a full URL, a GitHub URL (`github:org/repo/file.yaml`), or a local file (`file:///path/to/file.yaml`).
The `id` is a string in the form `<GALLERY>@<MODEL_NAME>`, where `<GALLERY>` is the name of the gallery, and `<MODEL_NAME>` is the name of the model in the gallery. Galleries can be specified during startup with the `GALLERIES` environment variable.
Returns a `uuid` and a `url` that can be used to follow up on the state of the process:
```json
{ "uuid":"251475c9-f666-11ed-95e0-9a8a4480ac58", "status":"http://localhost:8080/models/jobs/251475c9-f666-11ed-95e0-9a8a4480ac58"}
```
To see a collection example of curated models definition files, see the [LocalAI repository](https://github.com/mudler/LocalAI/tree/master/gallery).
#### Get model job state `/models/jobs/<uid>`
This endpoint returns the state of the batch job associated with a model installation.
```bash
curl http://localhost:8080/models/jobs/<JOB_ID>
```
Returns a JSON object containing any error, whether the job has been processed, and a status message:
```json
{"error":null,"processed":true,"message":"completed"}
```

View File

@@ -0,0 +1,191 @@
+++
disableToc = false
title = "🔍 Object detection"
weight = 13
url = "/features/object-detection/"
+++
LocalAI supports object detection through various backends. This feature allows you to identify and locate objects within images with high accuracy and real-time performance. Currently, [RF-DETR](https://github.com/roboflow/rf-detr) is available as an implementation.
## Overview
Object detection in LocalAI is implemented through dedicated backends that can identify and locate objects within images. Each backend provides different capabilities and model architectures.
**Key Features:**
- Real-time object detection
- High accuracy detection with bounding boxes
- Support for multiple hardware accelerators (CPU, NVIDIA GPU, Intel GPU, AMD GPU)
- Structured detection results with confidence scores
- Easy integration through the `/v1/detection` endpoint
## Usage
### Detection Endpoint
LocalAI provides a dedicated `/v1/detection` endpoint for object detection tasks. This endpoint is specifically designed for object detection and returns structured detection results with bounding boxes and confidence scores.
### API Reference
To perform object detection, send a POST request to the `/v1/detection` endpoint:
```bash
curl -X POST http://localhost:8080/v1/detection \
-H "Content-Type: application/json" \
-d '{
"model": "rfdetr-base",
"image": "https://media.roboflow.com/dog.jpeg"
}'
```
### Request Format
The request body should contain:
- `model`: The name of the object detection model (e.g., "rfdetr-base")
- `image`: The image to analyze, which can be:
- A URL to an image
- A base64-encoded image
### Response Format
The API returns a JSON response with detected objects:
```json
{
"detections": [
{
"x": 100.5,
"y": 150.2,
"width": 200.0,
"height": 300.0,
"confidence": 0.95,
"class_name": "dog"
},
{
"x": 400.0,
"y": 200.0,
"width": 150.0,
"height": 250.0,
"confidence": 0.87,
"class_name": "person"
}
]
}
```
Each detection includes:
- `x`, `y`: Coordinates of the bounding box top-left corner
- `width`, `height`: Dimensions of the bounding box
- `confidence`: Detection confidence score (0.0 to 1.0)
- `class_name`: The detected object class
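The same request can be sent from Python. The snippet below is a minimal sketch, assuming the `requests` library is installed and LocalAI is listening on `localhost:8080`; it posts an image URL and prints each detection:
```python
import requests

# Call the /v1/detection endpoint with an image URL (model and image follow
# the example above; adjust them to your setup).
resp = requests.post(
    "http://localhost:8080/v1/detection",
    json={
        "model": "rfdetr-base",
        "image": "https://media.roboflow.com/dog.jpeg",
    },
    timeout=120,
)
resp.raise_for_status()

# Print one line per detected object, using the fields described above.
for det in resp.json().get("detections", []):
    print(
        f"{det['class_name']}: confidence={det['confidence']:.2f}, "
        f"box=({det['x']}, {det['y']}, {det['width']}, {det['height']})"
    )
```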
## Backends
### RF-DETR Backend
The RF-DETR backend is implemented as a Python-based gRPC service that integrates seamlessly with LocalAI. It provides object detection capabilities using the RF-DETR model architecture and supports multiple hardware configurations:
- **CPU**: Optimized for CPU inference
- **NVIDIA GPU**: CUDA acceleration for NVIDIA GPUs
- **Intel GPU**: Intel oneAPI optimization
- **AMD GPU**: ROCm acceleration for AMD GPUs
- **NVIDIA Jetson**: Optimized for ARM64 NVIDIA Jetson devices
#### Setup
1. **Using the Model Gallery (Recommended)**
The easiest way to get started is using the model gallery. The `rfdetr-base` model is available in the official LocalAI gallery:
```bash
# Install and run the rfdetr-base model
local-ai run rfdetr-base
```
You can also install it through the web interface by navigating to the Models section and searching for "rfdetr-base".
2. **Manual Configuration**
Create a model configuration file in your `models` directory:
```yaml
name: rfdetr
backend: rfdetr
parameters:
model: rfdetr-base
```
#### Available Models
Currently, the following model is available in the [Model Gallery]({{%relref "features/model-gallery" %}}):
- **rfdetr-base**: Base model with balanced performance and accuracy
You can browse and install this model through the LocalAI web interface or using the command line.
## Examples
### Basic Object Detection
```bash
curl -X POST http://localhost:8080/v1/detection \
-H "Content-Type: application/json" \
-d '{
"model": "rfdetr-base",
"image": "https://example.com/image.jpg"
}'
```
### Base64 Image Detection
```bash
base64_image=$(base64 -w 0 image.jpg)
curl -X POST http://localhost:8080/v1/detection \
-H "Content-Type: application/json" \
-d "{
\"model\": \"rfdetr-base\",
\"image\": \"data:image/jpeg;base64,$base64_image\"
}"
```
## Troubleshooting
### Common Issues
1. **Model Loading Errors**
- Ensure the model file is properly downloaded
- Check available disk space
- Verify model compatibility with your backend version
2. **Low Detection Accuracy**
- Ensure good image quality and lighting
- Check if objects are clearly visible
- Consider using a larger model for better accuracy
3. **Slow Performance**
- Enable GPU acceleration if available
- Use a smaller model for faster inference
- Optimize image resolution
### Debug Mode
Enable debug logging for troubleshooting:
```bash
local-ai run --debug rfdetr-base
```
## Object Detection Category
LocalAI includes a dedicated **object-detection** category for models and backends that specialize in identifying and locating objects within images. This category currently includes:
- **RF-DETR**: Real-time transformer-based object detection
Additional object detection models and backends will be added to this category in the future. You can filter models by the `object-detection` tag in the model gallery to find all available object detection models.
## Related Features
- [🎨 Image generation]({{%relref "features/image-generation" %}}): Generate images with AI
- [📖 Text generation]({{%relref "features/text-generation" %}}): Generate text with language models
- [🔍 GPT Vision]({{%relref "features/gpt-vision" %}}): Analyze images with language models
- [🚀 GPU acceleration]({{%relref "features/GPU-acceleration" %}}): Optimize performance with GPU acceleration

View File

@@ -0,0 +1,264 @@
+++
disableToc = false
title = "🔥 OpenAI functions and tools"
weight = 17
url = "/features/openai-functions/"
+++
LocalAI supports running OpenAI [functions and tools API](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools) with `llama.cpp` compatible models.
![localai-functions-1](https://github.com/ggerganov/llama.cpp/assets/2420543/5bd15da2-78c1-4625-be90-1e938e6823f1)
To learn more about OpenAI functions, see also the [OpenAI API blog post](https://openai.com/blog/function-calling-and-other-api-updates).
LocalAI is also supporting [JSON mode](https://platform.openai.com/docs/guides/text-generation/json-mode) out of the box with llama.cpp-compatible models.
💡 Check out also [LocalAGI](https://github.com/mudler/LocalAGI) for an example on how to use LocalAI functions.
## Setup
OpenAI functions are available only with `ggml` or `gguf` models compatible with `llama.cpp`.
You don't need to do anything specific - just use `ggml` or `gguf` models.
## Usage example
You can configure a model manually with a YAML config file in the models directory, for example:
```yaml
name: gpt-3.5-turbo
parameters:
# Model file name
model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
temperature: 0.1
```
To use the functions with the OpenAI client in python:
```python
from openai import OpenAI
messages = [{"role": "user", "content": "What is the weather like in Beijing now?"}]
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Return the temperature of the specified region specified by the user",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "User specified region",
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "temperature unit"
},
},
"required": ["location"],
},
},
}
]
client = OpenAI(
# This is the default and can be omitted
api_key="test",
base_url="http://localhost:8080/v1/"
)
response = client.chat.completions.create(
    messages=messages,
    tools=tools,
    tool_choice="auto",
model="gpt-4",
)
#...
```
For example, with curl:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "What is the weather like in Beijing now?"}],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Return the temperature of the specified region specified by the user",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "User specified region"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "temperature unit"
}
},
"required": ["location"]
}
}
}
],
"tool_choice":"auto"
}'
```
Return data
```json
{
"created": 1724210813,
"object": "chat.completion",
"id": "16b57014-477c-4e6b-8d25-aad028a5625e",
"model": "gpt-4",
"choices": [
{
"index": 0,
"finish_reason": "tool_calls",
"message": {
"role": "assistant",
"content": "",
"tool_calls": [
{
"index": 0,
"id": "16b57014-477c-4e6b-8d25-aad028a5625e",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\"location\":\"Beijing\",\"unit\":\"celsius\"}"
}
}
]
}
}
],
"usage": {
"prompt_tokens": 221,
"completion_tokens": 26,
"total_tokens": 247
}
}
```
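On the client side, the returned tool call can be read back from the response object. The following lines are a small sketch that continues the Python example above, where `response` is the value returned by `client.chat.completions.create`:
```python
import json

# `response` is the object returned by client.chat.completions.create(...)
# in the Python example above.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name)        # e.g. get_current_weather
    # Arguments are returned as a JSON-encoded string, matching the
    # "arguments" field of the raw response shown above.
    args = json.loads(call.function.arguments)
    print(args)                      # e.g. {'location': 'Beijing', 'unit': 'celsius'}
```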
## Advanced
### Use functions without grammars
Function calls are automatically mapped to grammars, which are currently supported only by llama.cpp. However, it is possible to turn off the use of grammars and instead extract the tool arguments from the LLM response, by specifying `no_grammar` and a regex to map the response in the YAML file:
```yaml
name: model_name
parameters:
# Model file name
model: model/name
function:
# set to true to not use grammars
no_grammar: true
# set one or more regexes used to extract the function tool arguments from the LLM response
response_regex:
- "(?P<function>\w+)\s*\((?P<arguments>.*)\)"
```
The response regex has to be a regex with named capture groups so that the function name and the arguments can be extracted. For instance, consider:
```
(?P<function>\w+)\s*\((?P<arguments>.*)\)
```
will catch
```
function_name({ "foo": "bar"})
```
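As a quick sanity check before putting a regex in the YAML file, you can try it with a standalone Python snippet. This is only a sketch; the sample LLM output string below is made up for illustration:
```python
import re

# The same named-group regex used in the YAML example above.
pattern = re.compile(r"(?P<function>\w+)\s*\((?P<arguments>.*)\)")

llm_output = 'get_current_weather({ "location": "Beijing", "unit": "celsius" })'
match = pattern.search(llm_output)
if match:
    print(match.group("function"))   # get_current_weather
    print(match.group("arguments"))  # { "location": "Beijing", "unit": "celsius" }
```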
### Parallel tools calls
This feature is experimental and has to be configured in the YAML of the model by enabling `function.parallel_calls`:
```yaml
name: gpt-3.5-turbo
parameters:
# Model file name
model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
temperature: 0.1
function:
# set to true to allow the model to call multiple functions in parallel
parallel_calls: true
```
### Use functions with grammar
It is possible to also specify the full function signature (for debugging, or to use with other clients).
The chat endpoint accepts the `grammar_json_functions` additional parameter which takes a JSON schema object.
For example, with curl:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.1,
"grammar_json_functions": {
"oneOf": [
{
"type": "object",
"properties": {
"function": {"const": "create_event"},
"arguments": {
"type": "object",
"properties": {
"title": {"type": "string"},
"date": {"type": "string"},
"time": {"type": "string"}
}
}
}
},
{
"type": "object",
"properties": {
"function": {"const": "search"},
"arguments": {
"type": "object",
"properties": {
"query": {"type": "string"}
}
}
}
}
]
}
}'
```
Grammars and function tools can be used as well in conjunction with vision APIs:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llava", "grammar": "root ::= (\"yes\" | \"no\")",
"messages": [{"role": "user", "content": [{"type":"text", "text": "Is there some grass in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
```
## 💡 Examples
A full e2e example with `docker-compose` is available [here](https://github.com/mudler/LocalAI-examples/tree/main/functions).

View File

@@ -0,0 +1,53 @@
+++
disableToc = false
title = "📈 Reranker"
weight = 11
url = "/features/reranker/"
+++
A **reranking** model, often referred to as a cross-encoder, is a core component in the two-stage retrieval systems used in information retrieval and natural language processing tasks.
Given a query and a set of documents, it will output similarity scores.
We can then use the scores to reorder the documents by relevance in our RAG system, increasing its overall accuracy and filtering out non-relevant results.
![output](https://github.com/mudler/LocalAI/assets/2420543/ede67b25-fac4-4833-ae4f-78290e401e60)
LocalAI supports reranker models, and you can use them by using the `rerankers` backend, which uses [rerankers](https://github.com/AnswerDotAI/rerankers).
## Usage
You can test `rerankers` by using container images with Python support (this does **NOT** work with `core` images) and a model config file like the following, or by installing `cross-encoder` from the gallery in the UI:
```yaml
name: jina-reranker-v1-base-en
backend: rerankers
parameters:
model: cross-encoder
```
and test it with:
```bash
curl http://localhost:8080/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "jina-reranker-v1-base-en",
"query": "Organic skincare products for sensitive skin",
"documents": [
"Eco-friendly kitchenware for modern homes",
"Biodegradable cleaning supplies for eco-conscious consumers",
"Organic cotton baby clothes for sensitive skin",
"Natural organic skincare range for sensitive skin",
"Tech gadgets for smart homes: 2024 edition",
"Sustainable gardening tools and compost solutions",
"Sensitive skin-friendly facial cleansers and toners",
"Organic food wraps and storage solutions",
"All-natural pet food for dogs with allergies",
"Yoga mats made from recycled materials"
],
"top_n": 3
}'
```

View File

@@ -0,0 +1,96 @@
+++
disableToc = false
title = "💾 Stores"
weight = 18
url = '/stores'
+++
Stores are an experimental feature to help with querying data using similarity search. It is
a low level API that consists of only `get`, `set`, `delete` and `find`.
For example, if you have an embedding of some text and want to find text with similar embeddings, you can create embeddings for chunks of all your text and then compare them against the embedding of the text you are searching for.
An embedding here means a vector of numbers that represents some information about the text. The embeddings are created by an AI model such as BERT, or by a more traditional method such as word frequency.
Previously you would have to integrate with an external vector database or library directly.
With the stores feature you can now do it through the LocalAI API.
Note however that doing a similarity search on embeddings is just one way to do retrieval. A higher level
API can take this into account, so this may not be the best place to start.
## API overview
There is an internal gRPC API and an external facing HTTP JSON API. We'll just discuss the external HTTP API,
however the HTTP API mirrors the gRPC API. Consult `pkg/store/client` for internal usage.
Everything is in columnar format, meaning that instead of getting an array of objects with a key and a value each, you get two separate arrays of keys and values.
Keys are arrays of floating point numbers with a maximum width of 32 bits. Values are strings (in gRPC they are bytes).
The key vectors must all be the same length, and search performs best if they are normalized. When adding keys, LocalAI detects whether they are normalized and what length they are.
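For instance, here is a small Python sketch of a helper (not part of the LocalAI API) that normalizes a key vector to unit length before it is stored:
```python
import math

def normalize(vec):
    """Scale a vector to unit length so similarity search behaves consistently."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

print(normalize([3.0, 4.0]))  # [0.6, 0.8]
```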
All endpoints accept a `store` field which specifies which store to operate on. Presently they are created
on the fly and there is only one store backend so no configuration is required.
## Set
To set some keys you can do
```
curl -X POST http://localhost:8080/stores/set \
-H "Content-Type: application/json" \
-d '{"keys": [[0.1, 0.2], [0.3, 0.4]], "values": ["foo", "bar"]}'
```
Setting the same keys again will update their values.
On success 200 OK is returned with no body.
## Get
To get some keys you can do
```
curl -X POST http://localhost:8080/stores/get \
-H "Content-Type: application/json" \
-d '{"keys": [[0.1, 0.2]]}'
```
Both the keys and values are returned, e.g: `{"keys":[[0.1,0.2]],"values":["foo"]}`
The order of the keys is not preserved! If a key does not exist then nothing is returned.
## Delete
To delete keys and values you can do
```
curl -X POST http://localhost:8080/stores/delete \
-H "Content-Type: application/json" \
-d '{"keys": [[0.1, 0.2]]}'
```
If a key doesn't exist then it is ignored.
On success 200 OK is returned with no body.
## Find
To do a similarity search you can do
```
curl -X POST http://localhost:8080/stores/find \
-H "Content-Type: application/json" \
-d '{"topk": 2, "key": [0.2, 0.1]}'
```
`topk` limits the number of results returned. The result value is the same as for `get`,
except that it also includes an array of `similarities`, where `1.0` is the maximum similarity.
They are returned in the order of most similar to least.
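Putting the endpoints together, here is a short Python sketch (assuming the `requests` library and a LocalAI instance on `localhost:8080`) that stores two keys and then searches for the most similar ones:
```python
import requests

BASE = "http://localhost:8080"

# Store two 2-dimensional keys with their values.
requests.post(
    f"{BASE}/stores/set",
    json={"keys": [[0.1, 0.2], [0.3, 0.4]], "values": ["foo", "bar"]},
).raise_for_status()

# Find the 2 keys most similar to [0.2, 0.1].
resp = requests.post(f"{BASE}/stores/find", json={"topk": 2, "key": [0.2, 0.1]})
resp.raise_for_status()
result = resp.json()

# Results are ordered from most to least similar.
for key, value, sim in zip(result["keys"], result["values"], result["similarities"]):
    print(f"{value}: similarity={sim:.3f} key={key}")
```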

View File

@@ -0,0 +1,386 @@
+++
disableToc = false
title = "📖 Text generation (GPT)"
weight = 10
url = "/features/text-generation/"
+++
LocalAI supports generating text with GPT using `llama.cpp` and other backends (such as `rwkv.cpp`). See also the [Model compatibility]({{%relref "reference/compatibility-table" %}}) for an up-to-date list of the supported model families.
Note:
- You can also specify the model name as part of the OpenAI token.
- If only one model is available, the API will use it for all the requests.
## API Reference
### Chat completions
https://platform.openai.com/docs/api-reference/chat
For example, to generate a chat completion, you can send a POST request to the `/v1/chat/completions` endpoint with the instruction as the request body:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
```
Available additional parameters: `top_p`, `top_k`, `max_tokens`
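Since the endpoint is OpenAI-compatible, the official `openai` Python client can also be pointed at LocalAI. A minimal sketch (the model name is the one from the example above; the API key can be any string unless you configured one):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="ggml-koala-7b-model-q4_0-r2.bin",
    messages=[{"role": "user", "content": "Say this is a test!"}],
    temperature=0.7,
)
print(completion.choices[0].message.content)
```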
### Edit completions
https://platform.openai.com/docs/api-reference/edits
To generate an edit completion you can send a POST request to the `/v1/edits` endpoint with the instruction as the request body:
```bash
curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"instruction": "rephrase",
"input": "Black cat jumped out of the window",
"temperature": 0.7
}'
```
Available additional parameters: `top_p`, `top_k`, `max_tokens`.
### Completions
https://platform.openai.com/docs/api-reference/completions
To generate a completion, you can send a POST request to the `/v1/completions` endpoint with the prompt as the request body:
```bash
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
```
Available additional parameters: `top_p`, `top_k`, `max_tokens`
### List models
You can list all the models available with:
```bash
curl http://localhost:8080/v1/models
```
## Backends
### RWKV
RWKV support is available through llama.cpp (see below)
### llama.cpp
[llama.cpp](https://github.com/ggerganov/llama.cpp) is a popular port of Facebook's LLaMA model in C/C++.
{{% notice note %}}
The `ggml` file format has been deprecated. If you are using `ggml` models, use a LocalAI version older than v2.25.0. For `gguf` models, use the `llama` backend. The Go backend is deprecated as well but still available as `go-llama`.
{{% /notice %}}
#### Features
The `llama.cpp` model supports the following features:
- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🧠 Embeddings]({{%relref "features/embeddings" %}})
- [🔥 OpenAI functions]({{%relref "features/openai-functions" %}})
- [✍️ Constrained grammars]({{%relref "features/constrained_grammars" %}})
#### Setup
LocalAI supports `llama.cpp` models out of the box. You can use the `llama.cpp` model in the same way as any other model.
##### Manual setup
It is sufficient to copy the `ggml` or `gguf` model files in the `models` folder. You can refer to the model in the `model` parameter in the API calls.
[You can optionally create an associated YAML]({{%relref "advanced" %}}) model config file to tune the model's parameters or apply a template to the prompt.
Prompt templates are useful for models that are fine-tuned towards a specific prompt.
##### Automatic setup
LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for `ggml` or `gguf` models.
For instance, if you have the galleries enabled and LocalAI already running, you can just start chatting with models in huggingface by running:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.1
}'
```
LocalAI will automatically download and configure the model in the `model` directory.
Models can be also preloaded or downloaded on demand. To learn about model galleries, check out the [model gallery documentation]({{%relref "features/model-gallery" %}}).
#### YAML configuration
To use the `llama.cpp` backend, specify `llama-cpp` as the backend in the YAML file:
```yaml
name: llama
backend: llama-cpp
parameters:
# Relative to the models path
model: file.gguf
```
#### Backend Options
The `llama.cpp` backend supports additional configuration options that can be specified in the `options` field of your model YAML configuration. These options allow fine-tuning of the backend behavior:
| Option | Type | Description | Example |
|--------|------|-------------|---------|
| `use_jinja` or `jinja` | boolean | Enable Jinja2 template processing for chat templates. When enabled, the backend uses Jinja2-based chat templates from the model for formatting messages. | `use_jinja:true` |
| `context_shift` | boolean | Enable context shifting, which allows the model to dynamically adjust context window usage. | `context_shift:true` |
| `cache_ram` | integer | Set the maximum RAM cache size in MiB for KV cache. Use `-1` for unlimited (default). | `cache_ram:2048` |
| `parallel` or `n_parallel` | integer | Enable parallel request processing. When set to a value greater than 1, enables continuous batching for handling multiple requests concurrently. | `parallel:4` |
| `grpc_servers` or `rpc_servers` | string | Comma-separated list of gRPC server addresses for distributed inference. Allows distributing workload across multiple llama.cpp workers. | `grpc_servers:localhost:50051,localhost:50052` |
**Example configuration with options:**
```yaml
name: llama-model
backend: llama-cpp
parameters:
model: model.gguf
options:
- use_jinja:true
- context_shift:true
- cache_ram:4096
- parallel:2
```
**Note:** The `parallel` option can also be set via the `LLAMACPP_PARALLEL` environment variable, and `grpc_servers` can be set via the `LLAMACPP_GRPC_SERVERS` environment variable. Options specified in the YAML file take precedence over environment variables.
#### Reference
- [llama](https://github.com/ggerganov/llama.cpp)
### exllama/2
[Exllama](https://github.com/turboderp/exllama) is a "A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights". Both `exllama` and `exllama2` are supported.
#### Model setup
Download the model as a folder inside the `models` directory and create a YAML file specifying the `exllama` backend. For instance with the `TheBloke/WizardLM-7B-uncensored-GPTQ` model:
```
$ git lfs install
$ cd models && git clone https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ
$ ls models/
.keep WizardLM-7B-uncensored-GPTQ/ exllama.yaml
$ cat models/exllama.yaml
name: exllama
parameters:
model: WizardLM-7B-uncensored-GPTQ
backend: exllama
```
Test with:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "exllama",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.1
}'
```
### vLLM
[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference.
LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out `vllm` performance [here](https://github.com/vllm-project/vllm#performance).
#### Setup
Create a YAML file for the model you want to use with `vllm`.
To setup a model, you need to just specify the model name in the YAML config file:
```yaml
name: vllm
backend: vllm
parameters:
model: "facebook/opt-125m"
```
The backend will automatically download the required files in order to run the model.
#### Usage
Use the `completions` endpoint by specifying the `vllm` backend:
```
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "vllm",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
```
### Transformers
[Transformers](https://huggingface.co/docs/transformers/index) is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.
LocalAI has a built-in integration with Transformers, and it can be used to run models.
This is an extra backend - it is already available in the container images (the `extra` images already contain the Python dependencies for Transformers), and there is nothing to do for the setup.
#### Setup
Create a YAML file for the model you want to use with `transformers`.
To setup a model, you need to just specify the model name in the YAML config file:
```yaml
name: transformers
backend: transformers
parameters:
model: "facebook/opt-125m"
type: AutoModelForCausalLM
quantization: bnb_4bit # One of: bnb_8bit, bnb_4bit, xpu_4bit, xpu_8bit (optional)
```
The backend will automatically download the required files in order to run the model.
#### Parameters
##### Type
| Type | Description |
| --- | --- |
| `AutoModelForCausalLM` | `AutoModelForCausalLM` is a model that can be used to generate sequences. Use it for NVIDIA CUDA and Intel GPU with Intel Extensions for Pytorch acceleration |
| `OVModelForCausalLM` | for Intel CPU/GPU/NPU OpenVINO Text Generation models |
| `OVModelForFeatureExtraction` | for Intel CPU/GPU/NPU OpenVINO Embedding acceleration |
| N/A | Defaults to `AutoModel` |
- `OVModelForCausalLM` requires OpenVINO IR [Text Generation](https://huggingface.co/models?library=openvino&pipeline_tag=text-generation) models from Hugging face
- `OVModelForFeatureExtraction` works with any Safetensors Transformer [Feature Extraction](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers,safetensors) model from Huggingface (Embedding Model)
Please note that streaming is currently not implemented in `AutoModelForCausalLM` for Intel GPU.
AMD GPU support is not implemented.
Although AMD CPU is not officially supported by OpenVINO there are reports that it works: YMMV.
##### Embeddings
Use `embeddings: true` if the model is an embedding model
##### Inference device selection
The Transformers backend tries to automatically select the best device for inference; you can override this decision manually with the `main_gpu` parameter.
| Inference Engine | Applicable Values |
| --- | --- |
| CUDA | `cuda`, `cuda.X` where X is the GPU device like in `nvidia-smi -L` output |
| OpenVINO | Any applicable value from [Inference Modes](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html) like `AUTO`,`CPU`,`GPU`,`NPU`,`MULTI`,`HETERO` |
Example for CUDA:
`main_gpu: cuda.0`
Example for OpenVINO:
`main_gpu: AUTO:-CPU`
This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.
##### Inference Precision
The Transformers backend automatically selects the fastest applicable inference precision according to device support.
The CUDA backend can manually enable *bfloat16*, if your hardware supports it, with the following parameter:
`f16: true`
##### Quantization
| Quantization | Description |
| --- | --- |
| `bnb_8bit` | 8-bit quantization |
| `bnb_4bit` | 4-bit quantization |
| `xpu_8bit` | 8-bit quantization for Intel XPUs |
| `xpu_4bit` | 4-bit quantization for Intel XPUs |
##### Trust Remote Code
Some models, like Microsoft Phi-3, require external code beyond what is provided by the transformers library.
By default this is disabled for security.
It can be manually enabled with:
`trust_remote_code: true`
##### Maximum Context Size
The maximum context size (in tokens) can be specified with the `context_size` parameter. Do not use values higher than what your model supports.
Usage example:
`context_size: 8192`
##### Auto Prompt Template
Usually the chat template is defined by the model author in the `tokenizer_config.json` file.
To enable it use the `use_tokenizer_template: true` parameter in the `template` section.
Usage example:
```
template:
use_tokenizer_template: true
```
##### Custom Stop Words
Stop words are usually defined in the `tokenizer_config.json` file.
They can be overridden with the `stopwords` parameter when needed, as for the llama3-Instruct model.
Usage example:
```
stopwords:
- "<|eot_id|>"
- "<|end_of_text|>"
```
#### Usage
Use the `completions` endpoint by specifying the `transformers` model:
```
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "transformers",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
```
#### Examples
##### OpenVINO
A model configuration file for an OpenVINO Starling model:
```yaml
name: starling-openvino
backend: transformers
parameters:
model: fakezeta/Starling-LM-7B-beta-openvino-int8
context_size: 8192
threads: 6
f16: true
type: OVModelForCausalLM
stopwords:
- <|end_of_turn|>
- <|endoftext|>
prompt_cache_path: "cache"
prompt_cache_all: true
template:
chat_message: |
{{if eq .RoleName "system"}}{{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "assistant"}}<|end_of_turn|>GPT4 Correct Assistant: {{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "user"}}GPT4 Correct User: {{.Content}}{{end}}
chat: |
{{.Input}}<|end_of_turn|>GPT4 Correct Assistant:
completion: |
{{.Input}}
```

View File

@@ -0,0 +1,216 @@
+++
disableToc = false
title = "🗣 Text to audio (TTS)"
weight = 11
url = "/features/text-to-audio/"
+++
## API Compatibility
The LocalAI TTS API is compatible with the [OpenAI TTS API](https://platform.openai.com/docs/guides/text-to-speech) and the [Elevenlabs](https://api.elevenlabs.io/docs) API.
## LocalAI API
The `/tts` endpoint can also be used to generate speech from text.
## Usage
Input: `input`, `model`
For example, to generate an audio file, you can send a POST request to the `/tts` endpoint with the instruction as the request body:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"input": "Hello world",
"model": "tts"
}'
```
Returns an `audio/wav` file.
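The same call from Python, saving the returned audio to disk (a sketch assuming the `requests` library; model name as in the example above):
```python
import requests

# Generate speech and write the returned WAV data to a file.
resp = requests.post(
    "http://localhost:8080/tts",
    json={"input": "Hello world", "model": "tts"},
    timeout=120,
)
resp.raise_for_status()

with open("output.wav", "wb") as f:
    f.write(resp.content)
print("wrote output.wav")
```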
## Backends
### 🐸 Coqui
Required: don't use `LocalAI` images ending with the `-core` tag. Python dependencies are required in order to use this backend.
Coqui works without any configuration, to test it, you can run the following curl command:
```
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "coqui",
"model": "tts_models/en/ljspeech/glow-tts",
"input":"Hello, this is a test!"
}'
```
You can use the environment variable `COQUI_LANGUAGE` to set the language used by the coqui backend.
You can also use config files to configure tts models (see section below on how to use config files).
### Bark
[Bark](https://github.com/suno-ai/bark) allows generating audio from text prompts.
This is an extra backend - it is already available in the container and there is nothing to do for the setup.
#### Model setup
There is nothing to be done for the model setup. You can already start to use bark. The models will be downloaded the first time you use the backend.
#### Usage
Use the `tts` endpoint by specifying the `bark` backend:
```
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "bark",
"input":"Hello!"
}' | aplay
```
To specify a voice from https://github.com/suno-ai/bark#-voice-presets ( https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c ), use the `model` parameter:
```
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "bark",
"input":"Hello!",
"model": "v2/en_speaker_4"
}' | aplay
```
### Piper
To install the `piper` audio models manually:
- Download Voices from https://github.com/rhasspy/piper/releases/tag/v0.0.2
- Extract the `.tar.tgz` files (.onnx,.json) inside `models`
- Run the following command to test the model is working
To use the tts endpoint, run the following command. You can specify a backend with the `backend` parameter. For example, to use the `piper` backend:
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"model":"it-riccardo_fasol-x-low.onnx",
"backend": "piper",
"input": "Ciao, sono Ettore"
}' | aplay
```
Note:
- `aplay` is a Linux command. You can use other tools to play the audio file.
- The model name is the filename with the extension.
- The model name is case sensitive.
- LocalAI must be compiled with the `GO_TAGS=tts` flag.
### Transformers-musicgen
LocalAI also has experimental support for `transformers-musicgen` for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:
```
curl --request POST \
--url http://localhost:8080/tts \
--header 'Content-Type: application/json' \
--data '{
"backend": "transformers-musicgen",
"model": "facebook/musicgen-medium",
"input": "Cello Rave"
}' | aplay
```
Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.
### Vall-E-X
[VALL-E-X](https://github.com/Plachtaa/VALL-E-X) is an open source implementation of Microsoft's VALL-E X zero-shot TTS model.
#### Setup
The backend will automatically download the required files in order to run the model.
This is an extra backend - it is already available in the container and there is nothing to do for the setup. If you are building manually, you need to install Vall-E-X first.
#### Usage
Use the tts endpoint by specifying the vall-e-x backend:
```
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"backend": "vall-e-x",
"input":"Hello!"
}' | aplay
```
#### Voice cloning
In order to use voice cloning capabilities you must create a `YAML` configuration file to set up a model:
```yaml
name: cloned-voice
backend: vall-e-x
parameters:
model: "cloned-voice"
tts:
vall-e:
# The path to the audio file to be cloned
# relative to the models directory
# Max 15s
audio_path: "audio-sample.wav"
```
Then you can specify the model name in the requests:
```
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"model": "cloned-voice",
"input":"Hello!"
}' | aplay
```
## Using config files
You can also use a `config-file` to specify TTS models and their parameters.
In the following example we define a custom config to load the `xtts_v2` model, and specify a voice and language.
```yaml
name: xtts_v2
backend: coqui
parameters:
language: fr
model: tts_models/multilingual/multi-dataset/xtts_v2
tts:
voice: Ana Florence
```
With this config, you can now use the following curl command to generate a text-to-speech audio file:
```bash
curl -L http://localhost:8080/tts \
-H "Content-Type: application/json" \
-d '{
"model": "xtts_v2",
"input": "Bonjour, je suis Ana Florence. Comment puis-je vous aider?"
}' | aplay
```
## Response format
To provide compatibility with the OpenAI API regarding `response_format`, ffmpeg must be installed (or a Docker image that includes ffmpeg must be used) so that the generated wav file can be converted before the API returns its response.
Note that this is a change in behaviour: previously the `response_format` parameter was ignored and a wav file was always returned, which could lead to codec errors later in the integration (such as trying to decode an mp3 file that is actually a wav; mp3 is the default format used by OpenAI).
Formats supported thanks to ffmpeg are `wav`, `mp3`, `aac`, `flac` and `opus`, defaulting to `wav` if an unknown format or no format is provided.
```bash
curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
"input": "Hello world",
"model": "tts",
"response_format": "mp3"
}'
```
If a `response_format` is added in the query (other than `wav`) and ffmpeg is not available, the call will fail.