Skip to content

Comments

feat: add vision/image support to agent pipeline#555

Open
as3k wants to merge 2 commits intosipeed:mainfrom
as3k:fix/vision-image-support
Open

feat: add vision/image support to agent pipeline#555
as3k wants to merge 2 commits intosipeed:mainfrom
as3k:fix/vision-image-support

Conversation

@as3k
Copy link

@as3k as3k commented Feb 20, 2026

Summary

This PR adds end-to-end vision/image support to the picoclaw agent pipeline, enabling vision-capable models like Gemini 2.0 Flash to receive and process image attachments.

Problem

Images extracted from Discord messages into InboundMessage.Media were being dropped (nil passed to BuildMessages), preventing vision-capable models from accessing image content.

Solution

  1. Add Media field to Message struct (pkg/providers/protocoltypes/types.go)

    • Added Media []string field to store image/media URLs
  2. Implement serializeMessages() (pkg/providers/openai_compat/provider.go)

    • Formats messages with image_url content parts for OpenAI-compatible vision APIs
    • Handles both text and image content in a single message
  3. Wire Media through agent pipeline (pkg/agent/context.go and pkg/agent/loop.go)

    • Add Media field to processOptions struct
    • Pass msg.Media from InboundMessage through the pipeline
    • Update BuildMessages to attach media to user message
    • Pass opts.Media to BuildMessages instead of nil

Testing

The changes enable:

  • Discord image attachments to flow through the entire agent pipeline
  • Vision models to receive image_url content parts in the correct OpenAI format
  • Backward compatibility (messages without media work as before)

Commits

  • 42725e0 - feat: add Media field to Message struct and implement serializeMessages for vision API support
  • 4b58912 - feat: wire Media through agent pipeline to enable vision model support

as3k added 2 commits February 20, 2026 14:32
…es for vision API support

- Add Media []string field to Message struct for image/media URLs
- Implement serializeMessages() to format messages with image_url content parts
- Enables OpenAI-compatible vision APIs to receive image attachments
- Add Media field to processOptions struct
- Pass msg.Media from InboundMessage to processOptions in processMessage
- Update BuildMessages to attach media to user message
- Pass opts.Media to BuildMessages instead of nil in runAgentLoop
- Enables vision-capable models (Gemini 2.0 Flash, etc.) to receive image_url content parts
Copilot AI review requested due to automatic review settings February 20, 2026 22:33
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds end-to-end vision/image support to the picoclaw agent pipeline, enabling vision-capable models like Gemini 2.0 Flash to process image attachments from Discord messages. The implementation follows the OpenAI-compatible vision API format with image_url content parts.

Changes:

  • Added Media []string field to the Message struct to store image/media URLs
  • Implemented serializeMessages() function to format messages with both text and image content for OpenAI-compatible APIs
  • Wired the Media field through the agent pipeline from InboundMessage to BuildMessages

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
pkg/providers/protocoltypes/types.go Added Media field to Message struct with appropriate JSON tag and documentation
pkg/providers/openai_compat/provider.go Implemented serializeMessages to format messages with image_url content parts for vision APIs
pkg/agent/loop.go Added Media field to processOptions and wired it through the message processing pipeline
pkg/agent/context.go Updated BuildMessages to accept and attach media to user messages, with proper condition handling

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +171 to +174
result = append(result, map[string]interface{}{
"role": m.Role,
"content": parts,
})
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a message contains Media, the tool_calls and tool_call_id fields are not serialized. If a message has both media attachments and tool calls (or is a tool response with media), this information would be lost. Consider handling these fields even when Media is present, or document if this combination is intentionally not supported.

Suggested change
result = append(result, map[string]interface{}{
"role": m.Role,
"content": parts,
})
msg := map[string]interface{}{
"role": m.Role,
"content": parts,
}
if m.ToolCallID != "" {
msg["tool_call_id"] = m.ToolCallID
}
if len(m.ToolCalls) > 0 {
msg["tool_calls"] = m.ToolCalls
}
result = append(result, msg)

Copilot uses AI. Check for mistakes.
Comment on lines +138 to +177
func serializeMessages(messages []Message) []map[string]interface{} {
result := make([]map[string]interface{}, 0, len(messages))
for _, m := range messages {
if len(m.Media) == 0 {
msg := map[string]interface{}{
"role": m.Role,
"content": m.Content,
}
if m.ToolCallID != "" {
msg["tool_call_id"] = m.ToolCallID
}
if len(m.ToolCalls) > 0 {
msg["tool_calls"] = m.ToolCalls
}
result = append(result, msg)
continue
}

parts := make([]map[string]interface{}, 0, 1+len(m.Media))
if m.Content != "" {
parts = append(parts, map[string]interface{}{
"type": "text",
"text": m.Content,
})
}
for _, mediaURL := range m.Media {
parts = append(parts, map[string]interface{}{
"type": "image_url",
"image_url": map[string]interface{}{
"url": mediaURL,
},
})
}
result = append(result, map[string]interface{}{
"role": m.Role,
"content": parts,
})
}
return result
}
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serializeMessages function lacks test coverage. Given the comprehensive test suite for other provider functionality (e.g., TestProviderChat_ParsesToolCalls, TestProviderChat_StripsMoonshotPrefixAndNormalizesKimiTemperature), consider adding tests for message serialization with media attachments to ensure correct OpenAI API format compliance and prevent regressions. Test cases should cover: messages with media only, messages with both text and media, and messages without media (backward compatibility).

Copilot uses AI. Check for mistakes.
summary,
opts.UserMessage,
nil,
opts.Media,
Copy link

Copilot AI Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Media field contains URLs that are passed directly to external LLM APIs without validation. Consider adding validation to ensure URLs are from trusted sources (e.g., Discord CDN for Discord messages) and don't expose internal network resources. While the URLs originate from InboundMessage.Media which likely comes from Discord, explicit validation would prevent potential SSRF vulnerabilities if the source changes or if there are bugs in upstream URL extraction.

Copilot uses AI. Check for mistakes.
@xiaket xiaket requested a review from yinwm February 22, 2026 05:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant