feat: add vision/image support to agent pipeline#555
as3k wants to merge 2 commits into sipeed:main
…es for vision API support
- Add Media []string field to Message struct for image/media URLs
- Implement serializeMessages() to format messages with image_url content parts
- Enables OpenAI-compatible vision APIs to receive image attachments

- Add Media field to processOptions struct
- Pass msg.Media from InboundMessage to processOptions in processMessage
- Update BuildMessages to attach media to user message
- Pass opts.Media to BuildMessages instead of nil in runAgentLoop
- Enables vision-capable models (Gemini 2.0 Flash, etc.) to receive image_url content parts
Pull request overview
This PR adds end-to-end vision/image support to the picoclaw agent pipeline, enabling vision-capable models like Gemini 2.0 Flash to process image attachments from Discord messages. The implementation follows the OpenAI-compatible vision API format with image_url content parts.
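To make the target wire format concrete, here is a minimal sketch of a user message with one text part and one `image_url` part, encoded the way OpenAI-compatible chat endpoints generally expect. The struct types and the `encodeVisionMessage` helper are hypothetical and exist only for this illustration; the PR itself builds the payload from `map[string]interface{}` values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical types for illustration only; the PR uses
// map[string]interface{} values rather than structs.
type imageURL struct {
	URL string `json:"url"`
}

type contentPart struct {
	Type     string    `json:"type"`
	Text     string    `json:"text,omitempty"`
	ImageURL *imageURL `json:"image_url,omitempty"`
}

type chatMessage struct {
	Role    string        `json:"role"`
	Content []contentPart `json:"content"`
}

// encodeVisionMessage returns the JSON for a user message carrying an
// optional text part followed by one image_url part per media URL.
func encodeVisionMessage(text string, media []string) (string, error) {
	parts := make([]contentPart, 0, 1+len(media))
	if text != "" {
		parts = append(parts, contentPart{Type: "text", Text: text})
	}
	for _, u := range media {
		parts = append(parts, contentPart{Type: "image_url", ImageURL: &imageURL{URL: u}})
	}
	b, err := json.Marshal(chatMessage{Role: "user", Content: parts})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeVisionMessage("What is in this image?", []string{"https://example.com/cat.png"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

The key point is that `content` changes from a plain string to an array of typed parts once media is attached, which is why messages without media can keep the simpler string form.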
Changes:
- Added `Media []string` field to the `Message` struct to store image/media URLs
- Implemented `serializeMessages()` function to format messages with both text and image content for OpenAI-compatible APIs
- Wired the Media field through the agent pipeline from `InboundMessage` to `BuildMessages`
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/providers/protocoltypes/types.go | Added Media field to Message struct with appropriate JSON tag and documentation |
| pkg/providers/openai_compat/provider.go | Implemented serializeMessages to format messages with image_url content parts for vision APIs |
| pkg/agent/loop.go | Added Media field to processOptions and wired it through the message processing pipeline |
| pkg/agent/context.go | Updated BuildMessages to accept and attach media to user messages, with proper condition handling |
    result = append(result, map[string]interface{}{
        "role":    m.Role,
        "content": parts,
    })
When a message contains Media, the tool_calls and tool_call_id fields are not serialized. If a message has both media attachments and tool calls (or is a tool response with media), this information would be lost. Consider handling these fields even when Media is present, or document if this combination is intentionally not supported.
Suggested change:

    - result = append(result, map[string]interface{}{
    -     "role":    m.Role,
    -     "content": parts,
    - })
    + msg := map[string]interface{}{
    +     "role":    m.Role,
    +     "content": parts,
    + }
    + if m.ToolCallID != "" {
    +     msg["tool_call_id"] = m.ToolCallID
    + }
    + if len(m.ToolCalls) > 0 {
    +     msg["tool_calls"] = m.ToolCalls
    + }
    + result = append(result, msg)
    func serializeMessages(messages []Message) []map[string]interface{} {
        result := make([]map[string]interface{}, 0, len(messages))
        for _, m := range messages {
            if len(m.Media) == 0 {
                msg := map[string]interface{}{
                    "role":    m.Role,
                    "content": m.Content,
                }
                if m.ToolCallID != "" {
                    msg["tool_call_id"] = m.ToolCallID
                }
                if len(m.ToolCalls) > 0 {
                    msg["tool_calls"] = m.ToolCalls
                }
                result = append(result, msg)
                continue
            }

            parts := make([]map[string]interface{}, 0, 1+len(m.Media))
            if m.Content != "" {
                parts = append(parts, map[string]interface{}{
                    "type": "text",
                    "text": m.Content,
                })
            }
            for _, mediaURL := range m.Media {
                parts = append(parts, map[string]interface{}{
                    "type": "image_url",
                    "image_url": map[string]interface{}{
                        "url": mediaURL,
                    },
                })
            }
            result = append(result, map[string]interface{}{
                "role":    m.Role,
                "content": parts,
            })
        }
        return result
    }
The serializeMessages function lacks test coverage. Given the comprehensive test suite for other provider functionality (e.g., TestProviderChat_ParsesToolCalls, TestProviderChat_StripsMoonshotPrefixAndNormalizesKimiTemperature), consider adding tests for message serialization with media attachments to ensure correct OpenAI API format compliance and prevent regressions. Test cases should cover: messages with media only, messages with both text and media, and messages without media (backward compatibility).
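As a rough sketch of the three cases listed above, the checks could look like the following. To stay self-contained, this uses a trimmed-down stand-in for `Message` (tool-call fields omitted) and a local copy of the serialization logic; `partTypes` is a hypothetical helper. In the repository these checks would more naturally live in a `_test.go` file using the `testing` package.

```go
package main

import "fmt"

// Message is a trimmed-down, hypothetical stand-in for the PR's Message
// type; tool-call fields are omitted for brevity.
type Message struct {
	Role    string
	Content string
	Media   []string
}

// serializeMessages is a local copy of the logic under review, kept here
// only so the example is self-contained.
func serializeMessages(messages []Message) []map[string]interface{} {
	result := make([]map[string]interface{}, 0, len(messages))
	for _, m := range messages {
		if len(m.Media) == 0 {
			result = append(result, map[string]interface{}{"role": m.Role, "content": m.Content})
			continue
		}
		parts := make([]map[string]interface{}, 0, 1+len(m.Media))
		if m.Content != "" {
			parts = append(parts, map[string]interface{}{"type": "text", "text": m.Content})
		}
		for _, u := range m.Media {
			parts = append(parts, map[string]interface{}{
				"type":      "image_url",
				"image_url": map[string]interface{}{"url": u},
			})
		}
		result = append(result, map[string]interface{}{"role": m.Role, "content": parts})
	}
	return result
}

// partTypes returns the "type" of each content part, or nil when the
// content is a plain string.
func partTypes(msg map[string]interface{}) []string {
	parts, ok := msg["content"].([]map[string]interface{})
	if !ok {
		return nil
	}
	types := make([]string, 0, len(parts))
	for _, p := range parts {
		types = append(types, p["type"].(string))
	}
	return types
}

func main() {
	out := serializeMessages([]Message{
		{Role: "user", Content: "plain text"},                                                    // no media
		{Role: "user", Content: "look", Media: []string{"https://example.com/a.png"}},            // text + media
		{Role: "user", Media: []string{"https://example.com/a.png", "https://example.com/b.png"}}, // media only
	})
	for i, msg := range out {
		fmt.Println(i, partTypes(msg))
	}
}
```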
      summary,
      opts.UserMessage,
    - nil,
    + opts.Media,
The Media field contains URLs that are passed directly to external LLM APIs without validation. Consider adding validation to ensure URLs are from trusted sources (e.g., Discord CDN for Discord messages) and don't expose internal network resources. While the URLs originate from InboundMessage.Media which likely comes from Discord, explicit validation would prevent potential SSRF vulnerabilities if the source changes or if there are bugs in upstream URL extraction.
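One possible shape for such validation is a host allowlist applied before media reaches `BuildMessages`. This is a sketch under stated assumptions: `validateMediaURL`, `filterMedia`, and the contents of `allowedMediaHosts` are hypothetical, and the right set of trusted hosts depends on which channels populate `InboundMessage.Media`.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// allowedMediaHosts is an assumed allowlist for illustration; adjust it to
// the channels that actually feed InboundMessage.Media.
var allowedMediaHosts = map[string]bool{
	"cdn.discordapp.com":   true,
	"media.discordapp.net": true,
}

// validateMediaURL (hypothetical helper) accepts only https URLs whose
// host is on the allowlist.
func validateMediaURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse media URL: %w", err)
	}
	if u.Scheme != "https" {
		return errors.New("media URL must use https")
	}
	// Hostname() strips any port and ignores userinfo, so a URL such as
	// https://cdn.discordapp.com@evil.example/ is judged by evil.example.
	host := strings.ToLower(u.Hostname())
	if !allowedMediaHosts[host] {
		return fmt.Errorf("media host %q is not allowed", host)
	}
	return nil
}

// filterMedia keeps the URLs that pass validation and drops the rest.
func filterMedia(urls []string) []string {
	kept := make([]string, 0, len(urls))
	for _, raw := range urls {
		if validateMediaURL(raw) == nil {
			kept = append(kept, raw)
		}
	}
	return kept
}

func main() {
	fmt.Println(filterMedia([]string{
		"https://cdn.discordapp.com/attachments/1/2/cat.png",
		"http://169.254.169.254/latest/meta-data/",
		"https://example.com/dog.png",
	}))
}
```

Dropping rejected URLs silently is only one option; logging them or returning an error to the caller may be preferable depending on how the agent should react to untrusted attachments.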
Summary
This PR adds end-to-end vision/image support to the picoclaw agent pipeline, enabling vision-capable models like Gemini 2.0 Flash to receive and process image attachments.
Problem
Images extracted from Discord messages into `InboundMessage.Media` were being dropped (nil was passed to `BuildMessages`), preventing vision-capable models from accessing image content.
Solution
Add Media field to Message struct (`pkg/providers/protocoltypes/types.go`)
- Add `Media []string` field to store image/media URLs
Implement serializeMessages() (`pkg/providers/openai_compat/provider.go`)
- Format messages with `image_url` content parts for OpenAI-compatible vision APIs
Wire Media through agent pipeline (`pkg/agent/context.go` and `pkg/agent/loop.go`)
- Add `Media` field to `processOptions` struct
- Pass `msg.Media` from `InboundMessage` through the pipeline
- Update `BuildMessages` to attach media to user message
- Pass `opts.Media` to `BuildMessages` instead of nil
Testing
The changes enable:
- `image_url` content parts in the correct OpenAI format
Commits
- `42725e0` - feat: add Media field to Message struct and implement serializeMessages for vision API support
- `4b58912` - feat: wire Media through agent pipeline to enable vision model support
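The wiring described under Solution (carrying `msg.Media` from `InboundMessage` through `processOptions` into `BuildMessages`) can be sketched roughly as follows. All types here are simplified, hypothetical versions of the ones named in the PR, and the lowercase `buildMessages` and its signature are assumptions made for illustration, not the repository's actual function.

```go
package main

import "fmt"

// Simplified, hypothetical versions of the types named in the PR; the real
// structs carry more fields.
type InboundMessage struct {
	Content string
	Media   []string
}

type processOptions struct {
	UserMessage string
	Media       []string
}

type Message struct {
	Role    string
	Content string
	Media   []string
}

// processMessage copies msg.Media into processOptions so later stages can
// forward it instead of passing nil.
func processMessage(msg InboundMessage) processOptions {
	return processOptions{UserMessage: msg.Content, Media: msg.Media}
}

// buildMessages keeps history unchanged and attaches media only to the new
// user message. The "text or media present" condition is an assumption.
func buildMessages(history []Message, userMessage string, media []string) []Message {
	msgs := make([]Message, 0, len(history)+1)
	msgs = append(msgs, history...)
	if userMessage != "" || len(media) > 0 {
		msgs = append(msgs, Message{Role: "user", Content: userMessage, Media: media})
	}
	return msgs
}

func main() {
	opts := processMessage(InboundMessage{
		Content: "describe this",
		Media:   []string{"https://example.com/cat.png"},
	})
	msgs := buildMessages(nil, opts.UserMessage, opts.Media)
	fmt.Printf("%d message(s), media on last: %v\n", len(msgs), msgs[len(msgs)-1].Media)
}
```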