Mirror of https://github.com/mudler/LocalAI.git (synced 2026-02-04 08:59:05 -06:00)
feat(realtime): Add audio conversations (#6245)
* feat(realtime): Add audio conversations

* chore(realtime): Vendor the updated API and modify for server side

* feat(realtime): Update to the GA realtime API

* chore: Document realtime API and add docs to AGENTS.md

* feat: Filter reasoning from spoken output

* fix(realtime): Send delta and done events for tool calls and audio transcripts

  Ensure that content is sent in both delta and done events for function call
  arguments and audio transcripts. This fixes compatibility with clients that
  rely on delta events for parsing.

* fix(realtime): Improve tool call handling and error reporting

  - Refactor the Model interface to accept []types.ToolUnion and
    *types.ToolChoiceUnion instead of JSON strings, eliminating unnecessary
    marshal/unmarshal cycles
  - Fix Parameters field handling: support both map[string]any and JSON
    string formats
  - Add a PredictConfig() method to the Model interface for accessing model
    configuration
  - Add comprehensive debug logging for tool call parsing and function config
  - Add a missing return statement after a prediction error (critical bug fix)
  - Add warning logs for NoAction function argument parsing failures
  - Improve error visibility throughout the generateResponse function

💘 Generated with Crush
Assisted-by: Claude Sonnet 4.5 via Crush <crush@charm.land>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
committed by GitHub

parent 48e08772f3
commit dd8e74a486
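As a quick orientation before the diffs: a client drives the new audio-conversation support by opening a websocket and exchanging the realtime events vendored in this commit. Below is a minimal sketch; the host, port, endpoint path, and model name are assumptions for illustration, not taken from this commit, and it uses gorilla/websocket rather than anything shipped here.

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/gorilla/websocket"
)

func main() {
	// Endpoint path and model name are placeholders, not confirmed by this commit.
	c, _, err := websocket.DefaultDialer.Dial(
		"ws://localhost:8080/v1/realtime?model=my-realtime-pipeline", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Update the session, then stream a base64-encoded audio chunk.
	update := map[string]any{
		"type":    "session.update",
		"session": map[string]any{"instructions": "You are a helpful voice assistant."},
	}
	if err := c.WriteJSON(update); err != nil {
		log.Fatal(err)
	}
	chunk := map[string]any{"type": "input_audio_buffer.append", "audio": "UklGRg=="}
	if err := c.WriteJSON(chunk); err != nil {
		log.Fatal(err)
	}

	// Read server events (session.updated, response deltas, ...) as raw JSON.
	for {
		var ev json.RawMessage
		if err := c.ReadJSON(&ev); err != nil {
			log.Fatal(err)
		}
		log.Printf("server event: %s", ev)
	}
}
```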
@@ -280,3 +280,11 @@ Always check `llama.cpp` for new model configuration options that should be supp
- `llama.cpp/common/chat-parser.cpp` - Format presets and model-specific handlers
- `llama.cpp/common/chat.h` - Format enums and parameter structures
- `llama.cpp/tools/server/server-context.cpp` - Server configuration options

# Documentation

The project documentation is located in `docs/content`. When adding new features or changing existing functionality, it is crucial to update the documentation to reflect these changes. This helps users understand how to use the new capabilities and ensures the documentation stays relevant.

- **Feature Documentation**: If you add a new feature (like a new backend or API endpoint), create a new markdown file in `docs/content/features/` explaining what it is, how to configure it, and how to use it.
- **Configuration**: If you modify configuration options, update the relevant sections in `docs/content/`.
- **Examples**: Providing concrete examples (like YAML configuration blocks) is highly encouraged to help users get started quickly.

@@ -239,6 +239,7 @@ Roadmap items: [List of issues](https://github.com/mudler/LocalAI/issues?q=is%3A
- 🔈 [Audio to Text](https://localai.io/features/audio-to-text/) (Audio transcription with `whisper.cpp`)
- 🎨 [Image generation](https://localai.io/features/image-generation)
- 🔥 [OpenAI-alike tools API](https://localai.io/features/openai-functions/)
- ⚡ [Realtime API](https://localai.io/features/openai-realtime/) (Speech-to-speech)
- 🧠 [Embeddings generation for vector databases](https://localai.io/features/embeddings/)
- ✍️ [Constrained grammars](https://localai.io/features/constrained_grammars/)
- 🖼️ [Download Models directly from Huggingface](https://localai.io/models/)

(File diff suppressed because it is too large)
@@ -2,20 +2,23 @@ package openai

import (
    "context"
    "encoding/json"
    "fmt"

    "github.com/mudler/LocalAI/core/backend"
    "github.com/mudler/LocalAI/core/config"
    grpcClient "github.com/mudler/LocalAI/pkg/grpc"
    "github.com/mudler/LocalAI/core/http/endpoints/openai/types"
    "github.com/mudler/LocalAI/core/schema"
    "github.com/mudler/LocalAI/core/templates"
    "github.com/mudler/LocalAI/pkg/functions"
    "github.com/mudler/LocalAI/pkg/grpc/proto"
    model "github.com/mudler/LocalAI/pkg/model"
    "github.com/mudler/xlog"
    "google.golang.org/grpc"
)

var (
    _ Model = new(wrappedModel)
    _ Model = new(anyToAnyModel)
    _ Model = new(transcriptOnlyModel)
)

// wrappedModel represents a model which does not support Any-to-Any operations
@@ -25,12 +28,12 @@ type wrappedModel struct {
    TTSConfig           *config.ModelConfig
    TranscriptionConfig *config.ModelConfig
    LLMConfig           *config.ModelConfig
    TTSClient           grpcClient.Backend
    TranscriptionClient grpcClient.Backend
    LLMClient           grpcClient.Backend

    VADConfig *config.ModelConfig
    VADClient grpcClient.Backend

    appConfig   *config.ApplicationConfig
    modelLoader *model.ModelLoader
    confLoader  *config.ModelConfigLoader
    evaluator   *templates.Evaluator
}

// anyToAnyModel represents a model which supports Any-to-Any operations
@@ -38,71 +41,158 @@ type wrappedModel struct {
// In the future there could be models that accept continuous audio input only so this design will be useful for that
type anyToAnyModel struct {
    LLMConfig *config.ModelConfig
    LLMClient grpcClient.Backend

    VADConfig *config.ModelConfig
    VADClient grpcClient.Backend

    appConfig   *config.ApplicationConfig
    modelLoader *model.ModelLoader
    confLoader  *config.ModelConfigLoader
}

type transcriptOnlyModel struct {
    TranscriptionConfig *config.ModelConfig
    TranscriptionClient grpcClient.Backend
    VADConfig           *config.ModelConfig
    VADClient           grpcClient.Backend

    appConfig   *config.ApplicationConfig
    modelLoader *model.ModelLoader
    confLoader  *config.ModelConfigLoader
}

func (m *transcriptOnlyModel) VAD(ctx context.Context, in *proto.VADRequest, opts ...grpc.CallOption) (*proto.VADResponse, error) {
    return m.VADClient.VAD(ctx, in)
func (m *transcriptOnlyModel) VAD(ctx context.Context, request *schema.VADRequest) (*schema.VADResponse, error) {
    return backend.VAD(request, ctx, m.modelLoader, m.appConfig, *m.VADConfig)
}

func (m *transcriptOnlyModel) Transcribe(ctx context.Context, in *proto.TranscriptRequest, opts ...grpc.CallOption) (*proto.TranscriptResult, error) {
    return m.TranscriptionClient.AudioTranscription(ctx, in, opts...)
func (m *transcriptOnlyModel) Transcribe(ctx context.Context, audio, language string, translate bool, diarize bool, prompt string) (*schema.TranscriptionResult, error) {
    return backend.ModelTranscription(audio, language, translate, diarize, prompt, m.modelLoader, *m.TranscriptionConfig, m.appConfig)
}

func (m *transcriptOnlyModel) Predict(ctx context.Context, in *proto.PredictOptions, opts ...grpc.CallOption) (*proto.Reply, error) {
func (m *transcriptOnlyModel) Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error) {
    return nil, fmt.Errorf("predict operation not supported in transcript-only mode")
}

func (m *transcriptOnlyModel) PredictStream(ctx context.Context, in *proto.PredictOptions, f func(reply *proto.Reply), opts ...grpc.CallOption) error {
    return fmt.Errorf("predict stream operation not supported in transcript-only mode")
func (m *transcriptOnlyModel) TTS(ctx context.Context, text, voice, language string) (string, *proto.Result, error) {
    return "", nil, fmt.Errorf("TTS not supported in transcript-only mode")
}

func (m *wrappedModel) VAD(ctx context.Context, in *proto.VADRequest, opts ...grpc.CallOption) (*proto.VADResponse, error) {
    return m.VADClient.VAD(ctx, in)
func (m *transcriptOnlyModel) PredictConfig() *config.ModelConfig {
    return nil
}

func (m *anyToAnyModel) VAD(ctx context.Context, in *proto.VADRequest, opts ...grpc.CallOption) (*proto.VADResponse, error) {
    return m.VADClient.VAD(ctx, in)
func (m *wrappedModel) VAD(ctx context.Context, request *schema.VADRequest) (*schema.VADResponse, error) {
    return backend.VAD(request, ctx, m.modelLoader, m.appConfig, *m.VADConfig)
}

func (m *wrappedModel) Transcribe(ctx context.Context, in *proto.TranscriptRequest, opts ...grpc.CallOption) (*proto.TranscriptResult, error) {
    return m.TranscriptionClient.AudioTranscription(ctx, in, opts...)
func (m *wrappedModel) Transcribe(ctx context.Context, audio, language string, translate bool, diarize bool, prompt string) (*schema.TranscriptionResult, error) {
    return backend.ModelTranscription(audio, language, translate, diarize, prompt, m.modelLoader, *m.TranscriptionConfig, m.appConfig)
}

func (m *anyToAnyModel) Transcribe(ctx context.Context, in *proto.TranscriptRequest, opts ...grpc.CallOption) (*proto.TranscriptResult, error) {
    // TODO: Can any-to-any models transcribe?
    return m.LLMClient.AudioTranscription(ctx, in, opts...)
func (m *wrappedModel) Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error) {
    input := schema.OpenAIRequest{
        Messages: messages,
    }

    var predInput string
    var funcs []functions.Function
    if !m.LLMConfig.TemplateConfig.UseTokenizerTemplate {
        if len(tools) > 0 {
            for _, t := range tools {
                if t.Function != nil {
                    var params map[string]any

                    switch p := t.Function.Parameters.(type) {
                    case map[string]any:
                        params = p
                    case string:
                        if err := json.Unmarshal([]byte(p), &params); err != nil {
                            xlog.Warn("Failed to parse parameters JSON string", "error", err, "function", t.Function.Name)
                        }
                    }

                    funcs = append(funcs, functions.Function{
                        Name:        t.Function.Name,
                        Description: t.Function.Description,
                        Parameters:  params,
                    })
                }
            }
        }

        predInput = m.evaluator.TemplateMessages(input, input.Messages, m.LLMConfig, funcs, len(funcs) > 0)

        xlog.Debug("Prompt (after templating)", "prompt", predInput)
        if m.LLMConfig.Grammar != "" {
            xlog.Debug("Grammar", "grammar", m.LLMConfig.Grammar)
        }
    }

    // Generate grammar for function calling if tools are provided and grammar generation is enabled
    shouldUseFn := len(tools) > 0 && m.LLMConfig.ShouldUseFunctions()

    if !m.LLMConfig.FunctionsConfig.GrammarConfig.NoGrammar && shouldUseFn {
        // Allow the user to set custom actions via config file
        noActionName := "answer"
        noActionDescription := "use this action to answer without performing any action"

        if m.LLMConfig.FunctionsConfig.NoActionFunctionName != "" {
            noActionName = m.LLMConfig.FunctionsConfig.NoActionFunctionName
        }
        if m.LLMConfig.FunctionsConfig.NoActionDescriptionName != "" {
            noActionDescription = m.LLMConfig.FunctionsConfig.NoActionDescriptionName
        }

        noActionGrammar := functions.Function{
            Name:        noActionName,
            Description: noActionDescription,
            Parameters: map[string]interface{}{
                "properties": map[string]interface{}{
                    "message": map[string]interface{}{
                        "type":        "string",
                        "description": "The message to reply the user with",
                    },
                },
            },
        }

        if !m.LLMConfig.FunctionsConfig.DisableNoAction {
            funcs = append(funcs, noActionGrammar)
        }

        // Force picking one of the functions by the request
        if m.LLMConfig.FunctionToCall() != "" {
            funcs = functions.Functions(funcs).Select(m.LLMConfig.FunctionToCall())
        }

        // Generate grammar from function definitions
        jsStruct := functions.Functions(funcs).ToJSONStructure(m.LLMConfig.FunctionsConfig.FunctionNameKey, m.LLMConfig.FunctionsConfig.FunctionNameKey)
        g, err := jsStruct.Grammar(m.LLMConfig.FunctionsConfig.GrammarOptions()...)
        if err == nil {
            m.LLMConfig.Grammar = g
            xlog.Debug("Generated grammar for function calling", "grammar", g)
        } else {
            xlog.Error("Failed generating grammar", "error", err)
        }
    }

    var toolsJSON string
    if len(tools) > 0 {
        b, _ := json.Marshal(tools)
        toolsJSON = string(b)
    }

    var toolChoiceJSON string
    if toolChoice != nil {
        b, _ := json.Marshal(toolChoice)
        toolChoiceJSON = string(b)
    }

    return backend.ModelInference(ctx, predInput, messages, images, videos, audios, m.modelLoader, m.LLMConfig, m.confLoader, m.appConfig, tokenCallback, toolsJSON, toolChoiceJSON, logprobs, topLogprobs, logitBias)
}

func (m *wrappedModel) Predict(ctx context.Context, in *proto.PredictOptions, opts ...grpc.CallOption) (*proto.Reply, error) {
    // TODO: Convert with pipeline (audio to text, text to llm, result to tts, and return it)
    // sound.BufferAsWAV(audioData, "audio.wav")

    return m.LLMClient.Predict(ctx, in)
func (m *wrappedModel) TTS(ctx context.Context, text, voice, language string) (string, *proto.Result, error) {
    return backend.ModelTTS(text, voice, language, m.modelLoader, m.appConfig, *m.TTSConfig)
}

func (m *wrappedModel) PredictStream(ctx context.Context, in *proto.PredictOptions, f func(reply *proto.Reply), opts ...grpc.CallOption) error {
    // TODO: Convert with pipeline (audio to text, text to llm, result to tts, and return it)

    return m.LLMClient.PredictStream(ctx, in, f)
}

func (m *anyToAnyModel) Predict(ctx context.Context, in *proto.PredictOptions, opts ...grpc.CallOption) (*proto.Reply, error) {
    return m.LLMClient.Predict(ctx, in)
}

func (m *anyToAnyModel) PredictStream(ctx context.Context, in *proto.PredictOptions, f func(reply *proto.Reply), opts ...grpc.CallOption) error {
    return m.LLMClient.PredictStream(ctx, in, f)
func (m *wrappedModel) PredictConfig() *config.ModelConfig {
    return m.LLMConfig
}

func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, *config.ModelConfig, error) {
@@ -116,12 +206,6 @@ func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfig
        return nil, nil, fmt.Errorf("failed to validate config: %w", err)
    }

    opts := backend.ModelOptions(*cfgVAD, appConfig)
    VADClient, err := ml.Load(opts...)
    if err != nil {
        return nil, nil, fmt.Errorf("failed to load VAD model: %w", err)
    }

    cfgSST, err := cl.LoadModelConfigFileByName(pipeline.Transcription, ml.ModelPath)
    if err != nil {
@@ -132,22 +216,19 @@ func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfig
        return nil, nil, fmt.Errorf("failed to validate config: %w", err)
    }

    opts = backend.ModelOptions(*cfgSST, appConfig)
    transcriptionClient, err := ml.Load(opts...)
    if err != nil {
        return nil, nil, fmt.Errorf("failed to load SST model: %w", err)
    }

    return &transcriptOnlyModel{
        VADConfig: cfgVAD,
        VADClient: VADClient,
        TranscriptionConfig: cfgSST,
        TranscriptionClient: transcriptionClient,
        VADConfig: cfgVAD,

        confLoader:  cl,
        modelLoader: ml,
        appConfig:   appConfig,
    }, cfgSST, nil
}

// returns and loads either a wrapped model or a model that supports audio-to-audio
func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, error) {
func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig, evaluator *templates.Evaluator) (Model, error) {
    xlog.Debug("Creating new model pipeline model", "pipeline", pipeline)

    cfgVAD, err := cl.LoadModelConfigFileByName(pipeline.VAD, ml.ModelPath)
    if err != nil {
@@ -159,12 +240,6 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
        return nil, fmt.Errorf("failed to validate config: %w", err)
    }

    opts := backend.ModelOptions(*cfgVAD, appConfig)
    VADClient, err := ml.Load(opts...)
    if err != nil {
        return nil, fmt.Errorf("failed to load VAD model: %w", err)
    }

    // TODO: Do we always need a transcription model? It can be disabled. Note that any-to-any instruction following models don't transcribe as such, so if transcription is required it is a separate process
    cfgSST, err := cl.LoadModelConfigFileByName(pipeline.Transcription, ml.ModelPath)
    if err != nil {
@@ -176,38 +251,24 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
        return nil, fmt.Errorf("failed to validate config: %w", err)
    }

    opts = backend.ModelOptions(*cfgSST, appConfig)
    transcriptionClient, err := ml.Load(opts...)
    if err != nil {
        return nil, fmt.Errorf("failed to load SST model: %w", err)
    }

    // TODO: Decide when we have a real any-to-any model
    if false {

        cfgAnyToAny, err := cl.LoadModelConfigFileByName(pipeline.LLM, ml.ModelPath)
        if err != nil {

            return nil, fmt.Errorf("failed to load backend config: %w", err)
        }

        if valid, _ := cfgAnyToAny.Validate(); !valid {
            return nil, fmt.Errorf("failed to validate config: %w", err)
        }

        opts := backend.ModelOptions(*cfgAnyToAny, appConfig)
        anyToAnyClient, err := ml.Load(opts...)
        if err != nil {
            return nil, fmt.Errorf("failed to load tts model: %w", err)
        }

        return &anyToAnyModel{
            LLMConfig: cfgAnyToAny,
            LLMClient: anyToAnyClient,
            VADConfig: cfgVAD,
            VADClient: VADClient,
        }, nil
    }
    // if false {
    //
    // 	cfgAnyToAny, err := cl.LoadModelConfigFileByName(pipeline.LLM, ml.ModelPath)
    // 	if err != nil {
    //
    // 		return nil, fmt.Errorf("failed to load backend config: %w", err)
    // 	}
    //
    // 	if valid, _ := cfgAnyToAny.Validate(); !valid {
    // 		return nil, fmt.Errorf("failed to validate config: %w", err)
    // 	}
    //
    // 	return &anyToAnyModel{
    // 		LLMConfig: cfgAnyToAny,
    // 		VADConfig: cfgVAD,
    // 	}, nil
    // }

    xlog.Debug("Loading a wrapped model")

@@ -232,27 +293,15 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
        return nil, fmt.Errorf("failed to validate config: %w", err)
    }

    opts = backend.ModelOptions(*cfgTTS, appConfig)
    ttsClient, err := ml.Load(opts...)
    if err != nil {
        return nil, fmt.Errorf("failed to load tts model: %w", err)
    }

    opts = backend.ModelOptions(*cfgLLM, appConfig)
    llmClient, err := ml.Load(opts...)
    if err != nil {
        return nil, fmt.Errorf("failed to load LLM model: %w", err)
    }

    return &wrappedModel{
        TTSConfig:           cfgTTS,
        TranscriptionConfig: cfgSST,
        LLMConfig:           cfgLLM,
        TTSClient:           ttsClient,
        TranscriptionClient: transcriptionClient,
        LLMClient:           llmClient,

        VADConfig: cfgVAD,
        VADClient: VADClient,

        confLoader:  cl,
        modelLoader: ml,
        appConfig:   appConfig,
        evaluator:   evaluator,
    }, nil
}
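To make the reworked interface concrete, a hypothetical caller might drive the new Predict signature as sketched below. This is not code from this commit: the literal for types.ToolUnion's function member is assumed (the diff only shows it accessed as t.Function.Name, .Description, and .Parameters), so treat the field and type names marked in comments as illustrative.

```go
// Hypothetical sketch (not part of this commit): calling the reworked
// Predict with a typed tool list instead of pre-marshalled JSON.
func predictWithTools(ctx context.Context, m Model, messages schema.Messages) (backend.LLMResponse, error) {
	tools := []types.ToolUnion{{
		Function: &types.ToolFunction{ // assumed type name for illustration
			Name:        "get_weather",
			Description: "Look up the current weather for a location",
			// Parameters may be a map[string]any or a JSON string; both are handled.
			Parameters: map[string]any{"type": "object"},
		},
	}}

	// No images/videos/audios, no tool_choice, logprobs, or logit bias.
	pred, err := m.Predict(ctx, messages, nil, nil, nil,
		func(token string, usage backend.TokenUsage) bool { return true }, // stream tokens
		tools, nil, nil, nil, nil)
	if err != nil {
		return backend.LLMResponse{}, err
	}
	return pred() // runs the actual inference
}
```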
core/http/endpoints/openai/types/client_events.go (new file, 413 lines)
@@ -0,0 +1,413 @@
package types

import "encoding/json"

// ClientEventType is the type of client event. See https://platform.openai.com/docs/guides/realtime/client-events
type ClientEventType string

const (
    ClientEventTypeSessionUpdate            ClientEventType = "session.update"
    ClientEventTypeInputAudioBufferAppend   ClientEventType = "input_audio_buffer.append"
    ClientEventTypeInputAudioBufferCommit   ClientEventType = "input_audio_buffer.commit"
    ClientEventTypeInputAudioBufferClear    ClientEventType = "input_audio_buffer.clear"
    ClientEventTypeConversationItemCreate   ClientEventType = "conversation.item.create"
    ClientEventTypeConversationItemRetrieve ClientEventType = "conversation.item.retrieve"
    ClientEventTypeConversationItemTruncate ClientEventType = "conversation.item.truncate"
    ClientEventTypeConversationItemDelete   ClientEventType = "conversation.item.delete"
    ClientEventTypeResponseCreate           ClientEventType = "response.create"
    ClientEventTypeResponseCancel           ClientEventType = "response.cancel"
    ClientEventTypeOutputAudioBufferClear   ClientEventType = "output_audio_buffer.clear"
)

// ClientEvent is the interface for client events.
type ClientEvent interface {
    ClientEventType() ClientEventType
}

// EventBase is the base struct for all client events.
type EventBase struct {
    Type string `json:"type"`
    // Optional client-generated ID used to identify this event.
    EventID string `json:"event_id,omitempty"`
}

// Send this event to update the session's configuration. The client may send this event at any time to update any field except for voice and model. voice can be updated only if there have been no other audio outputs yet.
//
// When the server receives a session.update, it will respond with a session.updated event showing the full, effective configuration. Only the fields that are present in the session.update are updated. To clear a field like instructions, pass an empty string. To clear a field like tools, pass an empty array. To clear a field like turn_detection, pass null.
//
// See https://platform.openai.com/docs/api-reference/realtime-client-events/session/update
type SessionUpdateEvent struct {
    EventBase
    // Session configuration to update.
    Session SessionUnion `json:"session"`
}

func (m SessionUpdateEvent) ClientEventType() ClientEventType {
    return ClientEventTypeSessionUpdate
}

func (m SessionUpdateEvent) MarshalJSON() ([]byte, error) {
    type typeAlias SessionUpdateEvent
    type typeWrapper struct {
        typeAlias
        Type ClientEventType `json:"type"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.ClientEventType(),
    }
    return json.Marshal(shadow)
}

type NoiseReductionType string

const (
    NoiseReductionNearField NoiseReductionType = "near_field"
    NoiseReductionFarField  NoiseReductionType = "far_field"
)

// Send this event to append audio bytes to the input audio buffer. The audio buffer is temporary storage you can write to and later commit. A "commit" will create a new user message item in the conversation history from the buffer content and clear the buffer. Input audio transcription (if enabled) will be generated when the buffer is committed.
//
// If VAD is enabled the audio buffer is used to detect speech and the server will decide when to commit. When Server VAD is disabled, you must commit the audio buffer manually. Input audio noise reduction operates on writes to the audio buffer.
//
// The client may choose how much audio to place in each event up to a maximum of 15 MiB; for example, streaming smaller chunks from the client may allow the VAD to be more responsive. Unlike most other client events, the server will not send a confirmation response to this event.
//
// See https://platform.openai.com/docs/api-reference/realtime-client-events/input_audio_buffer/append
type InputAudioBufferAppendEvent struct {
    EventBase
    Audio string `json:"audio"` // Base64-encoded audio bytes.
}

func (m InputAudioBufferAppendEvent) ClientEventType() ClientEventType {
    return ClientEventTypeInputAudioBufferAppend
}

func (m InputAudioBufferAppendEvent) MarshalJSON() ([]byte, error) {
    type typeAlias InputAudioBufferAppendEvent
    type typeWrapper struct {
        typeAlias
        Type ClientEventType `json:"type"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.ClientEventType(),
    }
    return json.Marshal(shadow)
}

// Send this event to commit the user input audio buffer, which will create a new user message item in the conversation. This event will produce an error if the input audio buffer is empty. When in Server VAD mode, the client does not need to send this event; the server will commit the audio buffer automatically.
//
// Committing the input audio buffer will trigger input audio transcription (if enabled in session configuration), but it will not create a response from the model. The server will respond with an input_audio_buffer.committed event.
//
// See https://platform.openai.com/docs/api-reference/realtime-client-events/input_audio_buffer/commit
type InputAudioBufferCommitEvent struct {
    EventBase
}

func (m InputAudioBufferCommitEvent) ClientEventType() ClientEventType {
    return ClientEventTypeInputAudioBufferCommit
}

func (m InputAudioBufferCommitEvent) MarshalJSON() ([]byte, error) {
    type typeAlias InputAudioBufferCommitEvent
    type typeWrapper struct {
        typeAlias
        Type ClientEventType `json:"type"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.ClientEventType(),
    }
    return json.Marshal(shadow)
}

// Send this event to clear the audio bytes in the buffer. The server will respond with an input_audio_buffer.cleared event.
//
// See https://platform.openai.com/docs/api-reference/realtime-client-events/input_audio_buffer/clear
type InputAudioBufferClearEvent struct {
    EventBase
}

func (m InputAudioBufferClearEvent) ClientEventType() ClientEventType {
    return ClientEventTypeInputAudioBufferClear
}

func (m InputAudioBufferClearEvent) MarshalJSON() ([]byte, error) {
    type typeAlias InputAudioBufferClearEvent
    type typeWrapper struct {
        typeAlias
        Type ClientEventType `json:"type"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.ClientEventType(),
    }
    return json.Marshal(shadow)
}

// Send this event to clear the audio bytes in the output buffer. The server will respond with an output_audio_buffer.cleared event.
//
// See https://platform.openai.com/docs/api-reference/realtime-client-events/output_audio_buffer/clear
type OutputAudioBufferClearEvent struct {
    EventBase
}

func (m OutputAudioBufferClearEvent) ClientEventType() ClientEventType {
    return ClientEventTypeOutputAudioBufferClear
}

func (m OutputAudioBufferClearEvent) MarshalJSON() ([]byte, error) {
    type typeAlias OutputAudioBufferClearEvent
    type typeWrapper struct {
        typeAlias
        Type ClientEventType `json:"type"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.ClientEventType(),
    }
    return json.Marshal(shadow)
}

// Add a new Item to the Conversation's context, including messages, function calls, and function call responses. This event can be used both to populate a "history" of the conversation and to add new items mid-stream, but has the current limitation that it cannot populate assistant audio messages.
//
// If successful, the server will respond with a conversation.item.created event, otherwise an error event will be sent.
//
// See https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/create
type ConversationItemCreateEvent struct {
    EventBase
    // The ID of the preceding item after which the new item will be inserted.
    PreviousItemID string `json:"previous_item_id,omitempty"`
    // The item to add to the conversation.
    Item MessageItemUnion `json:"item"`
}

func (m ConversationItemCreateEvent) ClientEventType() ClientEventType {
    return ClientEventTypeConversationItemCreate
}

func (m ConversationItemCreateEvent) MarshalJSON() ([]byte, error) {
    type typeAlias ConversationItemCreateEvent
    type typeWrapper struct {
        typeAlias
        Type ClientEventType `json:"type"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.ClientEventType(),
    }
    return json.Marshal(shadow)
}

// Send this event when you want to retrieve the server's representation of a specific item in the conversation history. This is useful, for example, to inspect user audio after noise cancellation and VAD. The server will respond with a conversation.item.retrieved event, unless the item does not exist in the conversation history, in which case the server will respond with an error.
//
// See https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/retrieve
type ConversationItemRetrieveEvent struct {
    EventBase
    // The ID of the item to retrieve.
    ItemID string `json:"item_id"`
}

func (m ConversationItemRetrieveEvent) ClientEventType() ClientEventType {
    return ClientEventTypeConversationItemRetrieve
}

func (m ConversationItemRetrieveEvent) MarshalJSON() ([]byte, error) {
    type typeAlias ConversationItemRetrieveEvent
    type typeWrapper struct {
        typeAlias
        Type ClientEventType `json:"type"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.ClientEventType(),
    }
    return json.Marshal(shadow)
}

// Send this event to truncate a previous assistant message's audio. The server will produce audio faster than realtime, so this event is useful when the user interrupts to truncate audio that has already been sent to the client but not yet played. This will synchronize the server's understanding of the audio with the client's playback.
//
// Truncating audio will delete the server-side text transcript to ensure there is not text in the context that hasn't been heard by the user.
//
// If successful, the server will respond with a conversation.item.truncated event.
//
// See https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/truncate
type ConversationItemTruncateEvent struct {
    EventBase
    // The ID of the assistant message item to truncate.
    ItemID string `json:"item_id"`
    // The index of the content part to truncate.
    ContentIndex int `json:"content_index"`
    // Inclusive duration up to which audio is truncated, in milliseconds.
    AudioEndMs int `json:"audio_end_ms"`
}

func (m ConversationItemTruncateEvent) ClientEventType() ClientEventType {
    return ClientEventTypeConversationItemTruncate
}

func (m ConversationItemTruncateEvent) MarshalJSON() ([]byte, error) {
    type typeAlias ConversationItemTruncateEvent
    type typeWrapper struct {
        typeAlias
        Type ClientEventType `json:"type"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.ClientEventType(),
    }
    return json.Marshal(shadow)
}

// Send this event when you want to remove any item from the conversation history. The server will respond with a conversation.item.deleted event, unless the item does not exist in the conversation history, in which case the server will respond with an error.
//
// See https://platform.openai.com/docs/api-reference/realtime-client-events/conversation/item/delete
type ConversationItemDeleteEvent struct {
    EventBase
    // The ID of the item to delete.
    ItemID string `json:"item_id"`
}

func (m ConversationItemDeleteEvent) ClientEventType() ClientEventType {
    return ClientEventTypeConversationItemDelete
}

func (m ConversationItemDeleteEvent) MarshalJSON() ([]byte, error) {
    type typeAlias ConversationItemDeleteEvent
    type typeWrapper struct {
        typeAlias
        Type ClientEventType `json:"type"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.ClientEventType(),
    }
    return json.Marshal(shadow)
}

// This event instructs the server to create a Response, which means triggering model inference. When in Server VAD mode, the server will create Responses automatically.
//
// A Response will include at least one Item, and may have two, in which case the second will be a function call. These Items will be appended to the conversation history by default.
//
// The server will respond with a response.created event, events for Items and content created, and finally a response.done event to indicate the Response is complete.
//
// The response.create event includes inference configuration like instructions and tools. If these are set, they will override the Session's configuration for this Response only.
//
// Responses can be created out-of-band of the default Conversation, meaning that they can have arbitrary input, and it's possible to disable writing the output to the Conversation. Only one Response can write to the default Conversation at a time, but otherwise multiple Responses can be created in parallel. The metadata field is a good way to disambiguate multiple simultaneous Responses.
//
// Clients can set conversation to none to create a Response that does not write to the default Conversation. Arbitrary input can be provided with the input field, which is an array accepting raw Items and references to existing Items.
//
// See https://platform.openai.com/docs/api-reference/realtime-client-events/response/create
type ResponseCreateEvent struct {
    EventBase
    // Configuration for the response.
    Response ResponseCreateParams `json:"response"`
}

func (m ResponseCreateEvent) ClientEventType() ClientEventType {
    return ClientEventTypeResponseCreate
}

func (m ResponseCreateEvent) MarshalJSON() ([]byte, error) {
    type typeAlias ResponseCreateEvent
    type typeWrapper struct {
        typeAlias
        Type ClientEventType `json:"type"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.ClientEventType(),
    }
    return json.Marshal(shadow)
}

// Send this event to cancel an in-progress response. The server will respond with a response.done event with a status of response.status=cancelled. If there is no response to cancel, the server will respond with an error. It's safe to call response.cancel even if no response is in progress; an error will be returned and the session will remain unaffected.
//
// See https://platform.openai.com/docs/api-reference/realtime-client-events/response/cancel
type ResponseCancelEvent struct {
    EventBase
    // A specific response ID to cancel - if not provided, will cancel an in-progress response in the default conversation.
    ResponseID string `json:"response_id,omitempty"`
}

func (m ResponseCancelEvent) ClientEventType() ClientEventType {
    return ClientEventTypeResponseCancel
}

func (m ResponseCancelEvent) MarshalJSON() ([]byte, error) {
    type typeAlias ResponseCancelEvent
    type typeWrapper struct {
        typeAlias
        Type ClientEventType `json:"type"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.ClientEventType(),
    }
    return json.Marshal(shadow)
}

type ClientEventInterface interface {
    SessionUpdateEvent |
        InputAudioBufferAppendEvent |
        InputAudioBufferCommitEvent |
        InputAudioBufferClearEvent |
        OutputAudioBufferClearEvent |
        ConversationItemCreateEvent |
        ConversationItemRetrieveEvent |
        ConversationItemTruncateEvent |
        ConversationItemDeleteEvent |
        ResponseCreateEvent |
        ResponseCancelEvent
}

func unmarshalClientEvent[T ClientEventInterface](data []byte) (T, error) {
    var t T
    err := json.Unmarshal(data, &t)
    if err != nil {
        return t, err
    }
    return t, nil
}

// UnmarshalClientEvent unmarshals the client event from the given JSON data.
func UnmarshalClientEvent(data []byte) (ClientEvent, error) {
    var eventType struct {
        Type ClientEventType `json:"type"`
    }
    err := json.Unmarshal(data, &eventType)
    if err != nil {
        return nil, err
    }

    switch eventType.Type {
    case ClientEventTypeSessionUpdate:
        return unmarshalClientEvent[SessionUpdateEvent](data)
    case ClientEventTypeInputAudioBufferAppend:
        return unmarshalClientEvent[InputAudioBufferAppendEvent](data)
    case ClientEventTypeInputAudioBufferCommit:
        return unmarshalClientEvent[InputAudioBufferCommitEvent](data)
    case ClientEventTypeInputAudioBufferClear:
        return unmarshalClientEvent[InputAudioBufferClearEvent](data)
    case ClientEventTypeOutputAudioBufferClear:
        return unmarshalClientEvent[OutputAudioBufferClearEvent](data)
    case ClientEventTypeConversationItemCreate:
        return unmarshalClientEvent[ConversationItemCreateEvent](data)
    case ClientEventTypeConversationItemRetrieve:
        return unmarshalClientEvent[ConversationItemRetrieveEvent](data)
    case ClientEventTypeConversationItemTruncate:
        return unmarshalClientEvent[ConversationItemTruncateEvent](data)
    case ClientEventTypeConversationItemDelete:
        return unmarshalClientEvent[ConversationItemDeleteEvent](data)
    case ClientEventTypeResponseCreate:
        return unmarshalClientEvent[ResponseCreateEvent](data)
    case ClientEventTypeResponseCancel:
        return unmarshalClientEvent[ResponseCancelEvent](data)
    default:
        // We should probably return a generic event or error here, but for now just nil.
        // Or maybe an "UnknownEvent" struct?
        // For now matching the existing pattern
        return nil, nil
    }
}
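The decode path above dispatches on the type discriminator and hands back the concrete event as a ClientEvent. A minimal, self-contained usage sketch (the import path is from this commit; the payload is illustrative):

```go
package main

import (
	"fmt"

	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
)

func main() {
	// A raw event as it would arrive over the realtime websocket.
	raw := []byte(`{"type":"input_audio_buffer.append","audio":"UklGRg=="}`)

	ev, err := types.UnmarshalClientEvent(raw)
	if err != nil {
		panic(err)
	}

	// Type-switch on the concrete event to handle it.
	switch e := ev.(type) {
	case types.InputAudioBufferAppendEvent:
		fmt.Println("got audio chunk of", len(e.Audio), "base64 chars")
	default:
		fmt.Printf("other event: %T\n", ev)
	}
}
```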
core/http/endpoints/openai/types/int_or_inf.go (new file, 39 lines)
@@ -0,0 +1,39 @@
package types

import (
    "encoding/json"
    "math"
)

const (
    // Inf is the maximum value for an IntOrInf.
    Inf IntOrInf = math.MaxInt
)

// IntOrInf is a type that can be either an int or "inf".
type IntOrInf int

// IsInf returns true if the value is "inf".
func (m IntOrInf) IsInf() bool {
    return m == Inf
}

// MarshalJSON marshals the IntOrInf to JSON.
func (m IntOrInf) MarshalJSON() ([]byte, error) {
    if m == Inf {
        return []byte("\"inf\""), nil
    }
    return json.Marshal(int(m))
}

// UnmarshalJSON unmarshals the IntOrInf from JSON.
func (m *IntOrInf) UnmarshalJSON(data []byte) error {
    if string(data) == "\"inf\"" {
        *m = Inf
        return nil
    }
    if len(data) == 0 {
        return nil
    }
    return json.Unmarshal(data, (*int)(m))
}
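IntOrInf lets a session field carry either a plain number or the JSON string "inf" (mapped to the math.MaxInt sentinel). A quick round-trip sketch of the behavior defined above:

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
)

func main() {
	// "inf" decodes to the sentinel; plain numbers decode normally.
	var a, b types.IntOrInf
	_ = json.Unmarshal([]byte(`"inf"`), &a)
	_ = json.Unmarshal([]byte(`42`), &b)
	fmt.Println(a.IsInf(), int(b)) // true 42

	// The sentinel marshals back to the string form.
	out, _ := json.Marshal(a)
	fmt.Println(string(out)) // "inf"
}
```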
core/http/endpoints/openai/types/message_item.go (new file, 628 lines)
@@ -0,0 +1,628 @@
package types

import (
    "encoding/json"
    "errors"
    "fmt"
)

type MessageItemType string

const (
    MessageItemTypeMessage             MessageItemType = "message"
    MessageItemTypeFunctionCall        MessageItemType = "function_call"
    MessageItemTypeFunctionCallOutput  MessageItemType = "function_call_output"
    MessageItemTypeMCPApprovalResponse MessageItemType = "mcp_approval_response"
    MessageItemTypeMCPListTools        MessageItemType = "mcp_list_tools"
    MessageItemTypeMCPCall             MessageItemType = "mcp_call"
    MessageItemTypeMCPApprovalRequest  MessageItemType = "mcp_approval_request"
)

type MessageContentType string

const (
    MessageContentTypeText        MessageContentType = "text"
    MessageContentTypeAudio       MessageContentType = "audio"
    MessageContentTypeTranscript  MessageContentType = "transcript"
    MessageContentTypeInputText   MessageContentType = "input_text"
    MessageContentTypeInputAudio  MessageContentType = "input_audio"
    MessageContentTypeOutputText  MessageContentType = "output_text"
    MessageContentTypeOutputAudio MessageContentType = "output_audio"
)

type MessageContentText struct {
    Text string `json:"text,omitempty"`
}

type MessageContentAudio struct {
    Type  MessageContentType `json:"type,omitempty"`
    Audio string             `json:"audio,omitempty"`
}

type MessageContentTranscript struct {
    Type       MessageContentType `json:"type,omitempty"`
    Transcript string             `json:"transcript,omitempty"`
}

type MessageContentImage struct {
    Type     MessageContentType `json:"type,omitempty"`
    ImageURL string             `json:"image_url,omitempty"`
    Detail   ImageDetail        `json:"detail,omitempty"`
}

type MessageContentSystem MessageContentText

type MessageItemSystem struct {
    // The unique ID of the item. This may be provided by the client or generated by the server.
    ID string `json:"id,omitempty"`

    // The content of the message.
    Content []MessageContentSystem `json:"content,omitempty"`

    // Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
    Object string `json:"object,omitempty"`

    // The status of the item. Has no effect on the conversation.
    Status ItemStatus `json:"status,omitempty"`
}

func (m MessageItemSystem) MessageItemType() MessageItemType {
    return MessageItemTypeMessage
}

func (m MessageItemSystem) Role() MessageRole {
    return MessageRoleSystem
}

func (m MessageItemSystem) MarshalJSON() ([]byte, error) {
    type typeAlias MessageItemSystem
    type typeWrapper struct {
        typeAlias
        Type MessageItemType `json:"type"`
        Role MessageRole     `json:"role"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.MessageItemType(),
        Role:      m.Role(),
    }
    return json.Marshal(shadow)
}

type MessageItemUser struct {
    // The unique ID of the item. This may be provided by the client or generated by the server.
    ID string `json:"id,omitempty"`

    // The content of the message.
    Content []MessageContentInput `json:"content,omitempty"`

    // Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
    Object string `json:"object,omitempty"`

    // The status of the item. Has no effect on the conversation.
    Status ItemStatus `json:"status,omitempty"`
}

func (m MessageItemUser) MessageItemType() MessageItemType {
    return MessageItemTypeMessage
}

func (m MessageItemUser) Role() MessageRole {
    return MessageRoleUser
}

func (m MessageItemUser) MarshalJSON() ([]byte, error) {
    type typeAlias MessageItemUser
    type typeWrapper struct {
        typeAlias
        Type MessageItemType `json:"type"`
        Role MessageRole     `json:"role"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.MessageItemType(),
        Role:      m.Role(),
    }
    return json.Marshal(shadow)
}

type MessageItemAssistant struct {
    // The unique ID of the item. This may be provided by the client or generated by the server.
    ID string `json:"id,omitempty"`

    // The content of the message.
    Content []MessageContentOutput `json:"content,omitempty"`

    // Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
    Object string `json:"object,omitempty"`

    // The status of the item. Has no effect on the conversation.
    Status ItemStatus `json:"status,omitempty"`
}

func (m MessageItemAssistant) MessageItemType() MessageItemType {
    return MessageItemTypeMessage
}

func (m MessageItemAssistant) Role() MessageRole {
    return MessageRoleAssistant
}

func (m MessageItemAssistant) MarshalJSON() ([]byte, error) {
    type typeAlias MessageItemAssistant
    type typeWrapper struct {
        typeAlias
        Type MessageItemType `json:"type"`
        Role MessageRole     `json:"role"`
    }
    shadow := typeWrapper{
        typeAlias: typeAlias(m),
        Type:      m.MessageItemType(),
        Role:      m.Role(),
    }
    return json.Marshal(shadow)
}

type MessageContentInput struct {
    // The content type (input_text, input_audio, or input_image).
    Type MessageContentType `json:"type"`

    // Base64-encoded audio bytes (for input_audio), these will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
    Audio string `json:"audio,omitempty"`

    // The detail level of the image (for input_image). auto will default to high.
    Detail ImageDetail `json:"detail,omitempty"`

    // Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
    ImageURL string `json:"image_url,omitempty"`

    // The text content (for input_text).
    Text string `json:"text,omitempty"`

    // Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
    Transcript string `json:"transcript,omitempty"`
}

type MessageContentOutput struct {
    // The content type (output_text or output_audio).
    Type MessageContentType `json:"type,omitempty"`

    // Base64-encoded audio bytes (for output_audio), these will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
    Audio string `json:"audio,omitempty"`

    // The text content (for output_text).
    Text string `json:"text,omitempty"`

    // Transcript of the audio (for output_audio). This is not sent to the model, but will be attached to the message item for reference.
    Transcript string `json:"transcript,omitempty"`
}
|
||||
type MessageItemFunctionCall struct {
|
||||
// The unique ID of the item. This may be provided by the client or generated by the server.
|
||||
ID string `json:"id,omitempty"`
|
||||
|
||||
// The ID of the function call.
|
||||
CallID string `json:"call_id,omitempty"`
|
||||
|
||||
// The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
|
||||
Arguments string `json:"arguments,omitempty"`
|
||||
|
||||
// The name of the function being called.
|
||||
Name string `json:"name,omitempty"`
|
||||
|
||||
// Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
|
||||
Object string `json:"object,omitempty"`
|
||||
|
||||
// The status of the item. Has no effect on the conversation.
|
||||
Status ItemStatus `json:"status,omitempty"`
|
||||
}
|
||||
|
||||
func (m MessageItemFunctionCall) MessageItemType() MessageItemType {
|
||||
return MessageItemTypeFunctionCall
|
||||
}
|
||||
|
||||
func (m MessageItemFunctionCall) MarshalJSON() ([]byte, error) {
|
||||
type typeAlias MessageItemFunctionCall
|
||||
type typeWrapper struct {
|
||||
typeAlias
|
||||
Type MessageItemType `json:"type"`
|
||||
}
|
||||
shadow := typeWrapper{
|
||||
typeAlias: typeAlias(m),
|
||||
Type: m.MessageItemType(),
|
||||
}
|
||||
return json.Marshal(shadow)
|
||||
}
|
||||
|
||||
type MessageItemFunctionCallOutput struct {
|
||||
// The unique ID of the item. This may be provided by the client or generated by the server.
|
||||
ID string `json:"id,omitempty"`
|
||||
|
||||
// The ID of the function call this output is for.
|
||||
CallID string `json:"call_id,omitempty"`
|
||||
|
||||
// The output of the function call, this is free text and can contain any information or simply be empty.
|
||||
Output string `json:"output,omitempty"`
|
||||
|
||||
// Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
|
||||
Object string `json:"object,omitempty"`
|
||||
|
||||
// The status of the item. Has no effect on the conversation.
|
||||
Status ItemStatus `json:"status,omitempty"`
|
||||
}
|
||||
|
||||
func (m MessageItemFunctionCallOutput) MessageItemType() MessageItemType {
|
||||
return MessageItemTypeFunctionCallOutput
|
||||
}
|
||||
|
||||
func (m MessageItemFunctionCallOutput) MarshalJSON() ([]byte, error) {
|
||||
type typeAlias MessageItemFunctionCallOutput
|
||||
type typeWrapper struct {
|
||||
typeAlias
|
||||
Type MessageItemType `json:"type"`
|
||||
}
|
||||
shadow := typeWrapper{
|
||||
typeAlias: typeAlias(m),
|
||||
Type: m.MessageItemType(),
|
||||
}
|
||||
return json.Marshal(shadow)
|
||||
}
|
||||
|
||||
type MessageItemMCPApprovalResponse struct {
|
||||
// The unique ID of the approval response.
|
||||
ID string `json:"id,omitempty"`
|
||||
|
||||
// The ID of the approval request being answered.
|
||||
ApprovalRequestID string `json:"approval_request_id,omitempty"`
|
||||
|
||||
// Whether the request was approved.
|
||||
Approve bool `json:"approve,omitempty"`
|
||||
|
||||
// Optional reason for the decision.
|
||||
Reason string `json:"reason,omitempty"`
|
||||
}
|
||||
|
||||
func (m MessageItemMCPApprovalResponse) MessageItemType() MessageItemType {
|
||||
return MessageItemTypeMCPApprovalResponse
|
||||
}
|
||||
|
||||
func (m MessageItemMCPApprovalResponse) MarshalJSON() ([]byte, error) {
|
||||
type typeAlias MessageItemMCPApprovalResponse
|
||||
type typeWrapper struct {
|
||||
typeAlias
|
||||
Type MessageItemType `json:"type"`
|
||||
}
|
||||
shadow := typeWrapper{
|
||||
typeAlias: typeAlias(m),
|
||||
Type: m.MessageItemType(),
|
||||
}
|
||||
return json.Marshal(shadow)
|
||||
}
|
||||
|
||||
type MCPTool struct {
|
||||
// JSON schema describing the tool's expected input shape.
|
||||
InputSchema string `json:"input_schema,omitempty"`
|
||||
|
||||
// The name of the MCP tool.
|
||||
Name string `json:"name,omitempty"`
|
||||
|
||||
// A human-readable description of what the tool does.
|
||||
Description string `json:"description,omitempty"`
|
||||
|
||||
// Additional metadata or annotations supplied by the server.
|
||||
Annotations any `json:"annotations,omitempty"`
|
||||
}
|
||||
|
||||
type MessageItemMCPListTools struct {
|
||||
// The unique ID of the list.
|
||||
ID string `json:"id,omitempty"`
|
||||
|
||||
// The label of the MCP server.
|
||||
ServerLabel string `json:"server_label,omitempty"`
|
||||
|
||||
// The tools available on the server.
|
||||
Tools []MCPTool `json:"tools,omitempty"`
|
||||
}
|
||||
|
||||
func (m MessageItemMCPListTools) MessageItemType() MessageItemType {
|
||||
return MessageItemTypeMCPListTools
|
||||
}
|
||||
func (m MessageItemMCPListTools) MarshalJSON() ([]byte, error) {
	type typeAlias MessageItemMCPListTools
	type typeWrapper struct {
		typeAlias
		Type MessageItemType `json:"type"`
	}
	shadow := typeWrapper{
		typeAlias: typeAlias(m),
		Type:      m.MessageItemType(),
	}
	return json.Marshal(shadow)
}

type MCPErrorType string

const (
	MCPErrorTypeProtocolError MCPErrorType = "protocol_error"
	MCPErrorTypeToolExecution MCPErrorType = "tool_execution_error"
	MCPErrorTypeHTTPError     MCPErrorType = "http_error"
)

type MCPProtocolError struct {
	// Numeric error code (protocol-specific).
	Code int `json:"code,omitempty"`

	// Human-readable error message.
	Message string `json:"message,omitempty"`
}

func (m MCPProtocolError) ErrorType() MCPErrorType {
	return MCPErrorTypeProtocolError
}

func (m MCPProtocolError) MarshalJSON() ([]byte, error) {
	type typeAlias MCPProtocolError
	type typeWrapper struct {
		typeAlias
		Type MCPErrorType `json:"type"`
	}
	shadow := typeWrapper{
		typeAlias: typeAlias(m),
		Type:      m.ErrorType(),
	}
	return json.Marshal(shadow)
}

type MCPToolExecutionError struct {
	// Human-readable error message from tool execution.
	Message string `json:"message,omitempty"`
}

func (m MCPToolExecutionError) ErrorType() MCPErrorType {
	return MCPErrorTypeToolExecution
}

func (m MCPToolExecutionError) MarshalJSON() ([]byte, error) {
	type typeAlias MCPToolExecutionError
	type typeWrapper struct {
		typeAlias
		Type MCPErrorType `json:"type"`
	}
	shadow := typeWrapper{
		typeAlias: typeAlias(m),
		Type:      m.ErrorType(),
	}
	return json.Marshal(shadow)
}

type MCPHTTPError struct {
	// HTTP status code returned by the upstream call.
	Code int `json:"code,omitempty"`

	// Human-readable HTTP error message.
	Message string `json:"message,omitempty"`
}

func (m MCPHTTPError) ErrorType() MCPErrorType {
	return MCPErrorTypeHTTPError
}

func (m MCPHTTPError) MarshalJSON() ([]byte, error) {
	type typeAlias MCPHTTPError
	type typeWrapper struct {
		typeAlias
		Type MCPErrorType `json:"type"`
	}
	shadow := typeWrapper{
		typeAlias: typeAlias(m),
		Type:      m.ErrorType(),
	}
	return json.Marshal(shadow)
}

type MCPError struct {
	// Details when type is protocol_error.
	Protocol *MCPProtocolError `json:",omitempty"`

	// Details when type is tool_execution_error.
	ToolExecution *MCPToolExecutionError `json:",omitempty"`

	// Details when type is http_error.
	HTTP *MCPHTTPError `json:",omitempty"`
}

func (m MCPError) MarshalJSON() ([]byte, error) {
	if m.Protocol != nil {
		return json.Marshal(m.Protocol)
	}
	if m.ToolExecution != nil {
		return json.Marshal(m.ToolExecution)
	}
	return json.Marshal(m.HTTP)
}

func (m *MCPError) UnmarshalJSON(data []byte) error {
	if isNull(data) {
		return nil
	}
	var u typeStruct
	if err := json.Unmarshal(data, &u); err != nil {
		return err
	}
	switch MCPErrorType(u.Type) {
	case MCPErrorTypeProtocolError:
		return json.Unmarshal(data, &m.Protocol)
	case MCPErrorTypeToolExecution:
		return json.Unmarshal(data, &m.ToolExecution)
	case MCPErrorTypeHTTPError:
		return json.Unmarshal(data, &m.HTTP)
	default:
		return errors.New("unknown error type: " + u.Type)
	}
}
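// Illustrative usage (assumed, not part of the vendored source): MCPError
// round-trips on the "type" discriminator, so exactly one variant pointer is
// populated after unmarshalling:
//
//	var e MCPError
//	_ = json.Unmarshal([]byte(`{"type":"http_error","code":502,"message":"bad gateway"}`), &e)
//	// e.HTTP is now non-nil; e.Protocol and e.ToolExecution stay nil.
//	out, _ := json.Marshal(e) // re-emits {"type":"http_error",...} via MCPHTTPError.MarshalJSON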
type MessageItemMCPToolCall struct {
	// The unique ID of the tool call.
	ID string `json:"id,omitempty"`

	// The label of the MCP server running the tool.
	ServerLabel string `json:"server_label,omitempty"`

	// A JSON string of the arguments passed to the tool.
	Arguments string `json:"arguments,omitempty"`

	// The name of the tool that was run.
	Name string `json:"name,omitempty"`

	// The ID of an associated approval request, if any.
	ApprovalRequestID string `json:"approval_request_id,omitempty"`

	// The error from the tool call, if any.
	Error *MCPProtocolError `json:"error,omitempty"`

	// The output from the tool call.
	Output string `json:"output,omitempty"`
}

func (m MessageItemMCPToolCall) MessageItemType() MessageItemType {
	return MessageItemTypeMCPCall
}

func (m MessageItemMCPToolCall) MarshalJSON() ([]byte, error) {
	type typeAlias MessageItemMCPToolCall
	type typeWrapper struct {
		typeAlias
		Type MessageItemType `json:"type"`
	}
	shadow := typeWrapper{
		typeAlias: typeAlias(m),
		Type:      m.MessageItemType(),
	}
	return json.Marshal(shadow)
}

type MessageItemMCPApprovalRequest struct {
	// The unique ID of the approval request.
	ID string `json:"id,omitempty"`

	// The name of the tool to run.
	Name string `json:"name,omitempty"`

	// A JSON string of arguments for the tool.
	Arguments string `json:"arguments,omitempty"`

	// The label of the MCP server making the request.
	ServerLabel string `json:"server_label,omitempty"`
}

func (m MessageItemMCPApprovalRequest) MessageItemType() MessageItemType {
	return MessageItemTypeMCPApprovalRequest
}

func (m MessageItemMCPApprovalRequest) MarshalJSON() ([]byte, error) {
	type typeAlias MessageItemMCPApprovalRequest
	type typeWrapper struct {
		typeAlias
		Type MessageItemType `json:"type"`
	}
	shadow := typeWrapper{
		typeAlias: typeAlias(m),
		Type:      m.MessageItemType(),
	}
	return json.Marshal(shadow)
}

type MessageItemUnion struct {
	// A system message in a Realtime conversation can be used to provide
	// additional context or instructions to the model. This is similar but
	// distinct from the instruction prompt provided at the start of a
	// conversation, as system messages can be added at any point in the
	// conversation. For major changes to the conversation's behavior, use
	// instructions, but for smaller updates (e.g. "the user is now asking
	// about a different topic"), use system messages.
	System *MessageItemSystem `json:",omitempty"`

	// A user message item in a Realtime conversation.
	User *MessageItemUser `json:",omitempty"`

	// An assistant message item in a Realtime conversation.
	Assistant *MessageItemAssistant `json:",omitempty"`

	// A function call item in a Realtime conversation.
	FunctionCall *MessageItemFunctionCall `json:",omitempty"`

	// A function call output item in a Realtime conversation.
	FunctionCallOutput *MessageItemFunctionCallOutput `json:",omitempty"`

	// A Realtime item responding to an MCP approval request.
	MCPApprovalResponse *MessageItemMCPApprovalResponse `json:",omitempty"`

	// A Realtime item listing tools available on an MCP server.
	MCPListTools *MessageItemMCPListTools `json:",omitempty"`

	// A Realtime item representing an invocation of a tool on an MCP server.
	MCPToolCall *MessageItemMCPToolCall `json:",omitempty"`

	// A Realtime item requesting human approval of a tool invocation.
	MCPApprovalRequest *MessageItemMCPApprovalRequest `json:",omitempty"`
}

func (m MessageItemUnion) MarshalJSON() ([]byte, error) {
	switch {
	case m.System != nil:
		return json.Marshal(m.System)
	case m.User != nil:
		return json.Marshal(m.User)
	case m.Assistant != nil:
		return json.Marshal(m.Assistant)
	case m.FunctionCall != nil:
		return json.Marshal(m.FunctionCall)
	case m.FunctionCallOutput != nil:
		return json.Marshal(m.FunctionCallOutput)
	case m.MCPApprovalResponse != nil:
		return json.Marshal(m.MCPApprovalResponse)
	case m.MCPListTools != nil:
		return json.Marshal(m.MCPListTools)
	case m.MCPToolCall != nil:
		return json.Marshal(m.MCPToolCall)
	case m.MCPApprovalRequest != nil:
		return json.Marshal(m.MCPApprovalRequest)
	default:
		return nil, errors.New("unknown message item type")
	}
}

func (m *MessageItemUnion) UnmarshalJSON(data []byte) error {
	if isNull(data) {
		return nil
	}
	var t struct {
		Type string `json:"type"`
		Role string `json:"role"`
	}
	if err := json.Unmarshal(data, &t); err != nil {
		return err
	}
	switch MessageItemType(t.Type) {
	case MessageItemTypeMessage:
		switch MessageRole(t.Role) {
		case MessageRoleUser:
			return json.Unmarshal(data, &m.User)
		case MessageRoleAssistant:
			return json.Unmarshal(data, &m.Assistant)
		case MessageRoleSystem:
			return json.Unmarshal(data, &m.System)
		default:
			return fmt.Errorf("unknown message role: %s", t.Role)
		}
	case MessageItemTypeFunctionCall:
		return json.Unmarshal(data, &m.FunctionCall)
	case MessageItemTypeFunctionCallOutput:
		return json.Unmarshal(data, &m.FunctionCallOutput)
	case MessageItemTypeMCPApprovalResponse:
		return json.Unmarshal(data, &m.MCPApprovalResponse)
	case MessageItemTypeMCPListTools:
		return json.Unmarshal(data, &m.MCPListTools)
	case MessageItemTypeMCPCall:
		return json.Unmarshal(data, &m.MCPToolCall)
	case MessageItemTypeMCPApprovalRequest:
		return json.Unmarshal(data, &m.MCPApprovalRequest)
	default:
		return fmt.Errorf("unknown message item type: %s", t.Type)
	}
}
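// Illustrative usage (assumed, not part of the vendored source): the union
// dispatches on "type" and, for messages, additionally on "role":
//
//	var item MessageItemUnion
//	data := []byte(`{"type":"message","role":"user","content":[]}`)
//	if err := json.Unmarshal(data, &item); err == nil {
//		// item.User is now non-nil; MarshalJSON serializes it back with the
//		// same "type" and "role" fields via MessageItemUser's marshaller.
//	}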
core/http/endpoints/openai/types/server_events.go (new file, 1500 lines) — diff suppressed because it is too large
core/http/endpoints/openai/types/types.go (new file, 1196 lines) — diff suppressed because it is too large
@@ -476,7 +476,7 @@ reasoning:
 
 ## Pipeline Configuration
 
-Define pipelines for audio-to-audio processing:
+Define pipelines for audio-to-audio processing and the [Realtime API]({{%relref "features/openai-realtime" %}}):
 
 | Field | Type | Description |
 |-------|------|-------------|
@@ -20,6 +20,7 @@ LocalAI provides a comprehensive set of features for running AI models locally.
 ## Advanced Features
 
 - **[OpenAI Functions](openai-functions/)** - Use function calling and tools API with local models
+- **[Realtime API](openai-realtime/)** - Low-latency multi-modal conversations (voice+text) over WebSocket
 - **[Constrained Grammars](constrained_grammars/)** - Control model output format with BNF grammars
 - **[GPU Acceleration](GPU-acceleration/)** - Optimize performance with GPU support
 - **[Distributed Inference](distributed_inferencing/)** - Scale inference across multiple nodes
docs/content/features/openai-realtime.md (new file, 42 lines)
@@ -0,0 +1,42 @@
---
title: "Realtime API"
weight: 60
---

# Realtime API

LocalAI supports the [OpenAI Realtime API](https://platform.openai.com/docs/guides/realtime), which enables low-latency, multi-modal conversations (voice and text) over WebSocket.

To use the Realtime API, configure a pipeline model that defines the components for Voice Activity Detection (VAD), transcription (STT), the language model (LLM), and text-to-speech (TTS).

## Configuration

Create a model configuration file (e.g., `gpt-realtime.yaml`) in your models directory. For a complete reference of configuration options, see [Model Configuration]({{%relref "advanced/model-configuration" %}}).

```yaml
name: gpt-realtime
pipeline:
  vad: silero-vad-ggml
  transcription: whisper-large-turbo
  llm: qwen3-4b
  tts: tts-1
```

This configuration links the following components:

- **vad**: The Voice Activity Detection model (e.g., `silero-vad-ggml`) that detects when the user is speaking.
- **transcription**: The speech-to-text model (e.g., `whisper-large-turbo`) that transcribes user audio.
- **llm**: The large language model (e.g., `qwen3-4b`) that generates responses.
- **tts**: The text-to-speech model (e.g., `tts-1`) that synthesizes the audio response.

Make sure all referenced models (`silero-vad-ggml`, `whisper-large-turbo`, `qwen3-4b`, `tts-1`) are also installed or defined in your LocalAI instance.

## Usage

Once configured, you can connect to the Realtime API endpoint via WebSocket:

```
ws://localhost:8080/v1/realtime?model=gpt-realtime
```

The API follows the OpenAI Realtime API protocol for handling sessions, audio buffers, and conversation items.
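As a minimal client sketch (illustrative only: it assumes the `github.com/gorilla/websocket` package, and the event payloads follow the standard Realtime protocol shapes), you might open a session and request a response like this:

```go
package main

import (
	"log"

	"github.com/gorilla/websocket"
)

func main() {
	// Connect to the LocalAI Realtime endpoint configured above.
	conn, _, err := websocket.DefaultDialer.Dial("ws://localhost:8080/v1/realtime?model=gpt-realtime", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Update the session, then ask the model to respond (standard Realtime
	// event types; the session fields shown here are an assumed example).
	events := []map[string]any{
		{"type": "session.update", "session": map[string]any{"instructions": "You are a helpful assistant."}},
		{"type": "response.create"},
	}
	for _, ev := range events {
		if err := conn.WriteJSON(ev); err != nil {
			log.Fatal(err)
		}
	}

	// Read server events (session.created, response deltas, etc.) as they arrive.
	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("server event: %s", msg)
	}
}
```

In a real voice client you would also stream microphone audio to the server with `input_audio_buffer.append` events and play back the audio deltas the server returns.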