feat: add vision/image support to agent pipeline by as3k · Pull Request #555 · sipeed/picoclaw

as3k · 2026-02-20T22:33:38Z

Summary

This PR adds end-to-end vision/image support to the picoclaw agent pipeline, enabling vision-capable models like Gemini 2.0 Flash to receive and process image attachments.

Problem

Images extracted from Discord messages into InboundMessage.Media were dropped before reaching the model: the agent loop passed nil to BuildMessages in place of the media list, so vision-capable models never saw the image content.

Solution

1. Add Media field to Message struct (pkg/providers/protocoltypes/types.go)
   - Added Media []string field to store image/media URLs
2. Implement serializeMessages() (pkg/providers/openai_compat/provider.go)
   - Formats messages with image_url content parts for OpenAI-compatible vision APIs
   - Handles both text and image content in a single message
3. Wire Media through agent pipeline (pkg/agent/context.go and pkg/agent/loop.go)
   - Add Media field to processOptions struct
   - Pass msg.Media from InboundMessage through the pipeline
   - Update BuildMessages to attach media to user message
   - Pass opts.Media to BuildMessages instead of nil
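
For readers who want to see the wiring in one place, here is a minimal sketch of the idea. It is not the code from this PR: `buildMessages` is a hypothetical, simplified stand-in for BuildMessages, the `Message` struct is reduced to three fields, the JSON tags are assumptions, and the example URL is a placeholder.

```go
package main

import "fmt"

// Message is a reduced, illustrative version of the struct touched by this PR.
// The real type has more fields; the JSON tags here are assumptions.
type Message struct {
	Role    string   `json:"role"`
	Content string   `json:"content"`
	Media   []string `json:"media,omitempty"`
}

// buildMessages is a hypothetical stand-in for BuildMessages. It copies the
// history, appends the new user turn, and attaches media to that turn only.
func buildMessages(history []Message, userMessage string, media []string) []Message {
	out := make([]Message, 0, len(history)+1)
	out = append(out, history...)

	user := Message{Role: "user", Content: userMessage}
	if len(media) > 0 {
		// Before the fix the caller passed nil here, so this branch never ran.
		user.Media = media
	}
	return append(out, user)
}

func main() {
	history := []Message{{Role: "system", Content: "You are a helpful assistant."}}
	media := []string{"https://example.com/cat.png"} // placeholder URL

	for _, m := range buildMessages(history, "What is in this picture?", media) {
		fmt.Printf("%s: %q (media items: %d)\n", m.Role, m.Content, len(m.Media))
	}
}
```

Because media is attached only to the newly appended user turn, a call with nil media produces the same messages as before the change, which is the backward-compatibility property the PR describes.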

Testing

The changes enable:

- Discord image attachments to flow through the entire agent pipeline
- Vision models to receive image_url content parts in the correct OpenAI format
- Backward compatibility (messages without media work as before)

Commits

42725e0 - feat: add Media field to Message struct and implement serializeMessages for vision API support
4b58912 - feat: wire Media through agent pipeline to enable vision model support

…es for vision API support
- Add Media []string field to Message struct for image/media URLs
- Implement serializeMessages() to format messages with image_url content parts
- Enables OpenAI-compatible vision APIs to receive image attachments

- Add Media field to processOptions struct
- Pass msg.Media from InboundMessage to processOptions in processMessage
- Update BuildMessages to attach media to user message
- Pass opts.Media to BuildMessages instead of nil in runAgentLoop
- Enables vision-capable models (Gemini 2.0 Flash, etc.) to receive image_url content parts

Copilot

Pull request overview

This PR adds end-to-end vision/image support to the picoclaw agent pipeline, enabling vision-capable models like Gemini 2.0 Flash to process image attachments from Discord messages. The implementation follows the OpenAI-compatible vision API format with image_url content parts.

Changes:

- Added Media []string field to the Message struct to store image/media URLs
- Implemented serializeMessages() function to format messages with both text and image content for OpenAI-compatible APIs
- Wired the Media field through the agent pipeline from InboundMessage to BuildMessages

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
pkg/providers/protocoltypes/types.go	Added Media field to Message struct with appropriate JSON tag and documentation
pkg/providers/openai_compat/provider.go	Implemented serializeMessages to format messages with image_url content parts for vision APIs
pkg/agent/loop.go	Added Media field to processOptions and wired it through the message processing pipeline
pkg/agent/context.go	Updated BuildMessages to accept and attach media to user messages, with proper condition handling


Copilot · 2026-02-20T22:37:27Z

pkg/providers/openai_compat/provider.go

+		result = append(result, map[string]interface{}{
+			"role":    m.Role,
+			"content": parts,
+		})


When a message contains Media, the tool_calls and tool_call_id fields are not serialized. If a message has both media attachments and tool calls (or is a tool response with media), this information would be lost. Consider handling these fields even when Media is present, or document if this combination is intentionally not supported.

Suggested change

-		result = append(result, map[string]interface{}{
-			"role":    m.Role,
-			"content": parts,
-		})
+		msg := map[string]interface{}{
+			"role":    m.Role,
+			"content": parts,
+		}
+		if m.ToolCallID != "" {
+			msg["tool_call_id"] = m.ToolCallID
+		}
+		if len(m.ToolCalls) > 0 {
+			msg["tool_calls"] = m.ToolCalls
+		}
+		result = append(result, msg)

Copilot · 2026-02-20T22:37:28Z

pkg/providers/openai_compat/provider.go

+func serializeMessages(messages []Message) []map[string]interface{} {
+	result := make([]map[string]interface{}, 0, len(messages))
+	for _, m := range messages {
+		if len(m.Media) == 0 {
+			msg := map[string]interface{}{
+				"role":    m.Role,
+				"content": m.Content,
+			}
+			if m.ToolCallID != "" {
+				msg["tool_call_id"] = m.ToolCallID
+			}
+			if len(m.ToolCalls) > 0 {
+				msg["tool_calls"] = m.ToolCalls
+			}
+			result = append(result, msg)
+			continue
+		}
+
+		parts := make([]map[string]interface{}, 0, 1+len(m.Media))
+		if m.Content != "" {
+			parts = append(parts, map[string]interface{}{
+				"type": "text",
+				"text": m.Content,
+			})
+		}
+		for _, mediaURL := range m.Media {
+			parts = append(parts, map[string]interface{}{
+				"type": "image_url",
+				"image_url": map[string]interface{}{
+					"url": mediaURL,
+				},
+			})
+		}
+		result = append(result, map[string]interface{}{
+			"role":    m.Role,
+			"content": parts,
+		})
+	}
+	return result
+}


The serializeMessages function lacks test coverage. Given the comprehensive test suite for other provider functionality (e.g., TestProviderChat_ParsesToolCalls, TestProviderChat_StripsMoonshotPrefixAndNormalizesKimiTemperature), consider adding tests for message serialization with media attachments to ensure correct OpenAI API format compliance and prevent regressions. Test cases should cover: messages with media only, messages with both text and media, and messages without media (backward compatibility).
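
As a rough illustration of the three cases named above, the following standalone sketch exercises a condensed copy of the serializer. Several things here are assumptions rather than code from the PR: `ToolCall` and `Message` are simplified stand-ins for the real types, `partTypes` is a hypothetical helper, the URL is a placeholder, and the serializer copy also applies the tool-field handling suggested in the earlier review comment. In the repository, checks like these would more naturally live in a `_test.go` file using the `testing` package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolCall and Message are simplified, hypothetical stand-ins for the real types.
type ToolCall struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

type Message struct {
	Role       string
	Content    string
	Media      []string
	ToolCalls  []ToolCall
	ToolCallID string
}

// serializeMessages is a condensed variant of the function under review. Unlike
// the PR version, it sets the tool fields on media messages as well.
func serializeMessages(messages []Message) []map[string]interface{} {
	result := make([]map[string]interface{}, 0, len(messages))
	for _, m := range messages {
		msg := map[string]interface{}{"role": m.Role}
		if len(m.Media) == 0 {
			msg["content"] = m.Content // plain string, as before the change
		} else {
			parts := make([]map[string]interface{}, 0, 1+len(m.Media))
			if m.Content != "" {
				parts = append(parts, map[string]interface{}{"type": "text", "text": m.Content})
			}
			for _, mediaURL := range m.Media {
				parts = append(parts, map[string]interface{}{
					"type":      "image_url",
					"image_url": map[string]interface{}{"url": mediaURL},
				})
			}
			msg["content"] = parts
		}
		if m.ToolCallID != "" {
			msg["tool_call_id"] = m.ToolCallID
		}
		if len(m.ToolCalls) > 0 {
			msg["tool_calls"] = m.ToolCalls
		}
		result = append(result, msg)
	}
	return result
}

// partTypes returns the "type" of each content part, or nil when the content
// is a plain string rather than a list of parts.
func partTypes(msg map[string]interface{}) []string {
	parts, ok := msg["content"].([]map[string]interface{})
	if !ok {
		return nil
	}
	types := make([]string, 0, len(parts))
	for _, p := range parts {
		types = append(types, p["type"].(string))
	}
	return types
}

func main() {
	img := "https://example.com/a.png" // placeholder URL
	cases := []struct {
		name string
		in   Message
	}{
		{"media only", Message{Role: "user", Media: []string{img}}},
		{"text and media", Message{Role: "user", Content: "What is this?", Media: []string{img}}},
		{"no media", Message{Role: "user", Content: "hello"}},
	}
	for _, c := range cases {
		out := serializeMessages([]Message{c.in})
		fmt.Printf("%s: part types %v\n", c.name, partTypes(out[0]))
	}

	// Print one payload to inspect the wire format by eye.
	b, err := json.MarshalIndent(serializeMessages([]Message{cases[1].in}), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```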

Copilot · 2026-02-20T22:37:28Z

pkg/agent/loop.go

 		summary,
 		opts.UserMessage,
-		nil,
+		opts.Media,


The Media field contains URLs that are passed directly to external LLM APIs without validation. Consider adding validation to ensure URLs are from trusted sources (e.g., Discord CDN for Discord messages) and don't expose internal network resources. While the URLs originate from InboundMessage.Media which likely comes from Discord, explicit validation would prevent potential SSRF vulnerabilities if the source changes or if there are bugs in upstream URL extraction.
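
One possible shape for such validation is a host allowlist applied before media reaches BuildMessages. The sketch below is an assumption-laden illustration, not part of this PR: `filterTrustedMedia` and `trustedMediaHosts` are hypothetical names, and the two Discord CDN hostnames are an assumed allowlist that would need to be confirmed and extended per channel.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// trustedMediaHosts is an assumed allowlist of hosts that may serve attachments.
var trustedMediaHosts = map[string]bool{
	"cdn.discordapp.com":   true,
	"media.discordapp.net": true,
}

// filterTrustedMedia keeps only HTTPS URLs whose exact host is on the
// allowlist. Unparseable URLs and everything else are silently dropped.
func filterTrustedMedia(media []string) []string {
	kept := make([]string, 0, len(media))
	for _, raw := range media {
		u, err := url.Parse(raw)
		if err != nil {
			continue
		}
		if u.Scheme != "https" {
			continue
		}
		// Exact match on the hostname, so look-alike hosts with a trusted
		// prefix (for example "cdn.discordapp.com.evil.example") are rejected.
		if !trustedMediaHosts[strings.ToLower(u.Hostname())] {
			continue
		}
		kept = append(kept, raw)
	}
	return kept
}

func main() {
	media := []string{
		"https://cdn.discordapp.com/attachments/1/2/cat.png",
		"http://169.254.169.254/latest/meta-data/",
		"https://cdn.discordapp.com.evil.example/cat.png",
	}
	for _, u := range filterTrustedMedia(media) {
		fmt.Println("kept:", u)
	}
}
```

Dropping untrusted URLs silently is a design choice made here for brevity; logging the rejected entries would make upstream extraction bugs easier to notice.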

as3k added 2 commits February 20, 2026 14:32

Copilot AI review requested due to automatic review settings February 20, 2026 22:33

Copilot started reviewing on behalf of as3k February 20, 2026 22:34

Copilot AI reviewed Feb 20, 2026

xiaket requested a review from yinwm February 22, 2026 05:18

as3k wants to merge 2 commits into sipeed:main from as3k:fix/vision-image-support
