feat: simplify call-actor tool and add long-running task support, #387

MQ37 · 2026-01-08T15:40:09Z

closes #365

This is quite a large PR, but the core changes are not that huge. It was quite painful, though: there were a lot of circular dependencies, and I had to summon Opus to solve them 😆

🔧 Changes 🔧

- Simplified call-actor tool:
  - Removed the step param
  - Added long-running task support
  - Moved the step=info logic to the fetch-actor-details tool
- Updated fetch-actor-details tool:
  - Moved the step=info logic here via the new output param (supports granular output)
  - Added the mcp-tools output option, which reuses the previous logic from call-actor (connects to MCP server Actors and lists their tools)
  - It is backwards compatible
- Improved workflow evals and added related test cases
- Moved a few functions and consts to solve circular dependencies (it was really painful)
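
To make the new flow concrete, here is a minimal sketch of the two tool calls a client might issue after this change. The `ToolCall` type, the helper names, the `actor` and `input` argument names on call-actor, and the sample `query` input are assumptions for illustration, not code from this PR.

```typescript
// Hypothetical shape of an MCP tool-call request (illustration only).
type ToolCall = { name: string; arguments: Record<string, unknown> };

// Optional first step: fetch only the input schema to keep the response small.
function buildSchemaRequest(actor: string): ToolCall {
    return { name: 'fetch-actor-details', arguments: { actor, output: ['input-schema'] } };
}

// Second step: call the Actor directly. There is no `step` param anymore.
// The `actor` and `input` argument names are assumptions for this sketch.
function buildCallRequest(actor: string, input: Record<string, unknown>): ToolCall {
    return { name: 'call-actor', arguments: { actor, input } };
}

const schemaRequest = buildSchemaRequest('apify/rag-web-browser');
const callRequest = buildCallRequest('apify/rag-web-browser', { query: 'apify mcp server' });
```

The point of the split is that schema lookup and execution are now independent tools, so a client that already knows the input schema can skip the first request.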

✅ Testing ✅

Also tested fetch-actor-details and call-actor manually with OpenCode, and it works well. 🚀

commit d1f7dc7
Author: Jakub Kopecký <[email protected]>
Date: Wed Jan 7 14:03:21 2026 +0100

fix: update @modelcontextprotocol/sdk to version 1.25.1 in package.json and package-lock.json (#384)

* fix: update @modelcontextprotocol/sdk to version 1.25.1 in package.json and package-lock.json
* fix: remove pollInterval from task creation in tool call request

commit 4270b02
Author: Jakub Kopecký <[email protected]>
Date: Wed Jan 7 12:10:14 2026 +0100

feat(evals): add llm driven workflow evals with llm as a judge (#383)

* feat(evals): add llm driven workflow evals with llm as a judge

Add workflow evaluation system for testing AI agents in multi-turn conversations using Apify MCP tools, with LLM-based evaluation.

Core Components:
- Multi-turn conversation executor with dynamic tool discovery
- LLM judge for evaluating agent performance against requirements
- Isolated MCP server per test (prevents state contamination)
- OpenRouter integration (agent + judge models)
- Configurable tool timeout (default: 60s, MCP SDK integration)

Architecture:
• MCP server spawned fresh per test → test isolation
• Tools refreshed after each turn → supports dynamic registration (add-actor)
• Strict pass/fail → all tests must pass for CI success
• Raw error propagation → LLM receives MCP SDK errors unchanged

CLI Usage:
npm run evals:workflow
npm run evals:workflow -- --tool-timeout 300 --category search

CLI Options:
--tool-timeout <seconds> Tool call timeout (default: 60)
--agent-model <model> Agent model (default: claude-haiku-4.5)
--judge-model <model> Judge model (default: grok-4.1-fast)
--category <name> Filter by category
--id <id> Run specific test
--verbose Show full conversations

Environment:
APIFY_TOKEN - Required for MCP server
OPENROUTER_API_KEY - Required for LLM calls

This enables systematic testing of MCP tools, agent tool-calling behavior, and automated quality evaluation without manual verification.

* refactor(evals): extract shared utilities and unify test case format

This commit refactors the evaluation system to eliminate code duplication and standardize test case formats across both tool selection and workflow evaluation systems.

- types.ts: Unified type definitions for test cases and tools
- config.ts: Shared OpenRouter configuration and environment validation
- openai-tools.ts: Consolidated tool transformation utilities
- test-case-loader.ts: Unified test case loading and filtering functions

- Standardized on 'query' (previously 'prompt' in workflows)
- Standardized on 'reference' (previously 'requirements' in workflows)
- Added version tracking to workflows/test-cases.json
- Maintains backwards compatibility through type exports

Removed 7 duplicate functions across the codebase:
- Test case loading (evaluation-utils.ts vs test-cases-loader.ts)
- Test case filtering (filterById, filterByCategory, filterTestCases)
- OpenAI tool transformation (transformToolsToOpenAIFormat vs mcpToolsToOpenAiTools)
- OpenRouter configuration (OPENROUTER_CONFIG duplicated)
- Environment validation (validateEnvVars duplicated)

- OPENROUTER_BASE_URL is now optional (defaults to https://openrouter.ai/api/v1)
- Created Phoenix-specific validation (validatePhoenixEnvVars)
- Separated concerns between shared and system-specific config

- Updated 11 existing files to use shared utilities
- Deleted evals/workflows/convert-mcp-tools.ts (replaced by shared)
- All imports updated to reference shared modules

- Reduced config code by ~37%
- Eliminated 100% of duplicate functions
- Improved maintainability and consistency
- No breaking changes to external APIs

- TypeScript compilation: ✓
- Project build: ✓
- All imports verified: ✓

* feat(evals): add parallel execution and fix linting for workflows

- Add --concurrency/-c flag to run workflow evals in parallel (default: 4)
- Add p-limit dependency for concurrency control
- Enable ESLint for evals/workflows/ and evals/shared/ directories
- Fix all linting issues (117 errors):
  - Convert interfaces to types per project convention
  - Fix import ordering with simple-import-sort
  - Remove trailing spaces
  - Fix comma-dangle, arrow-parens, operator-linebreak
  - Prefer node: protocol for built-in imports
  - Fix nested ternary in output-formatter.ts
- Add logWithPrefix() helper for prefixed live output
- Extract runSingleTest() function from main evaluation loop
- Remove empty line after test completion in output

Breaking changes: None (all changes backward compatible)

Usage:
npm run evals:workflow -- -c 10 # Run 10 tests in parallel
npm run evals:workflow -- -c 1 # Sequential mode

* feat(evals): use structured output for judge LLM and fix test filtering

- Refactor judge to use OpenAI's structured output (JSON schema) for robust evaluation
- Replace fragile text parsing with guaranteed JSON validation
- Fix test case filtering to support wildcard patterns (--category) and regex (--id)
- Add responseFormat parameter to LLM client for structured outputs
- Update judge prompt to remove manual format instructions
- Add test case for weather MCP Actor

* feat(evals): MCP instructions, test tracking, and expanded test coverage

commit 6dd3b10
Author: Apify Release Bot <[email protected]>
Date: Tue Jan 6 14:28:55 2026 +0000

chore(release): Update changelog, package.json, manifest.json and server.json versions [skip ci]

commit eaeb57b
Author: Jiří Spilka <[email protected]>
Date: Tue Jan 6 15:27:51 2026 +0100

fix: Improve README for clarity and MCP clients info at the top (#382)
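
The commit log mentions wildcard patterns for --category and regex for --id. The following is a hedged sketch of how such filtering could work; the `TestCase` shape, the function names, and the choice to treat only `*` as a wildcard are assumptions, not the repository's implementation.

```typescript
// Hypothetical test case shape for this sketch.
type TestCase = { id: string; category: string };

// Convert a wildcard pattern such as "search*" into an anchored RegExp.
// Only `*` is treated as a wildcard; every other character is matched literally.
function wildcardToRegExp(pattern: string): RegExp {
    const escaped = pattern
        .split('*')
        .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
        .join('.*');
    return new RegExp(`^${escaped}$`);
}

// Keep the test cases whose category matches the wildcard pattern.
function matchCategory(cases: TestCase[], pattern: string): TestCase[] {
    const matcher = wildcardToRegExp(pattern);
    return cases.filter((testCase) => matcher.test(testCase.category));
}

// Keep the test cases whose id matches the given regular expression source.
function matchId(cases: TestCase[], idPattern: string): TestCase[] {
    const matcher = new RegExp(idPattern);
    return cases.filter((testCase) => matcher.test(testCase.id));
}
```

Anchoring the wildcard pattern means a bare `search` matches only that exact category, while `search*` matches every category that starts with it.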

…e fetch-actor-details

## Summary

This commit simplifies the Actor calling workflow from a mandatory two-step process to a more intuitive single-step approach, while enhancing tool capabilities and improving evaluation infrastructure.

## Core Changes

### call-actor Tool Simplification (src/tools/actor.ts)

- **Removed**: Mandatory two-step workflow (step='info' then step='call')
- **Changed**: Input is now required (was optional in step='call')
- **Updated**: Tool description to guide users to fetch-actor-details first
- **Simplified**: Removed conditional logic for step parameter
- **Added**: Support for MCP server Actors using 'actorName:toolName' format
- **Added**: taskSupport: 'optional' execution annotation for long-running tasks

### fetch-actor-details Enhancement (src/tools/fetch-actor-details.ts)

- **Added**: 'output' parameter to control response content (description, stats, pricing, input-schema, readme, mcp-tools)
- **Added**: Support for listing available MCP tools directly (output=['mcp-tools'])
- **Enhanced**: Better token efficiency with output=['input-schema'] for minimal responses
- **Updated**: Documentation with usage examples and parameter descriptions
- **Added**: MCP client connection and tool listing logic

### Tool Architecture Refactoring

- **Created**: src/tools/categories.ts - Separate module for tool categories to avoid circular dependencies
- **Created**: src/utils/tool-categories-helpers.ts - Helper functions for category operations
- **Refactored**: src/tools/index.ts - Now uses string constants and imports from categories.ts
- **Fixed**: Circular dependency: tools/index.ts → utils/tools.ts → tools/categories.ts → tools/index.ts

### Evaluation System Improvements

- **Updated**: evals/config.ts - Tool selection guidelines reflect new single-step workflow
- **Removed**: evals/run-evaluation.ts - Normalization of call-actor step='info' to fetch-actor-details (tools now independent)
- **Enhanced**: evals/workflows/mcp-client.ts:
  - Better error handling with error field population
  - Timeout-based cleanup (2s default) to prevent indefinite waiting
  - Force kill of transport process if graceful shutdown fails
  - More robust state cleanup

### Test Infrastructure Enhancements

- **Created**: evals/shared/line-range-parser.ts - Parse line ranges from strings
- **Created**: evals/shared/line-range-filter.ts - Filter test cases by line numbers
- **Enhanced**: evals/workflows/test-cases-loader.ts - New loadTestCasesWithLineNumbers() function
- **Added**: evals/workflows/run-workflow-evals.ts - Line range filtering support (--lines flag)
- **Added**: evals/workflows/output-formatter.ts - Tool result display in verbose mode
- **Updated**: evals/workflows/README.md - Documentation for line range filtering

### Documentation Updates

- **Updated**: README.md - call-actor description to emphasize fetch-actor-details requirement
- **Updated**: AGENTS.md - Added quick validation workflow section

## Breaking Changes

- **call-actor**: No longer supports step='info' parameter. Use fetch-actor-details instead.
- **call-actor**: Input parameter is now required (was optional before).

## Benefits

1. **Simpler workflow**: Users call the Actor directly without intermediate schema fetch
2. **Clearer tool division**: fetch-actor-details handles all documentation/schema needs
3. **Better UX**: fetch-actor-details with output=['input-schema'] provides token-efficient schema retrieval
4. **MCP tool discovery**: fetch-actor-details with output=['mcp-tools'] lists available tools
5. **Cleaner code**: Removed two-step orchestration logic from call-actor
6. **Better testing**: Line range filtering for test case evaluation
7. **Robust cleanup**: Timeout-based cleanup prevents hanging processes
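
The 'actorName:toolName' format mentioned above can be illustrated with a small parser. This is only a sketch under assumptions: the `ActorTarget` type, the `parseActorTarget` name, and the decision to split on the first colon are hypothetical and not taken from src/tools/actor.ts.

```typescript
// Hypothetical result type: either a plain Actor or a tool on an MCP server Actor.
type ActorTarget =
    | { kind: 'actor'; actorName: string }
    | { kind: 'mcp-tool'; actorName: string; toolName: string };

// Split "actorName:toolName" on the first colon; a value without a colon is a plain Actor.
function parseActorTarget(value: string): ActorTarget {
    const separatorIndex = value.indexOf(':');
    if (separatorIndex === -1) {
        return { kind: 'actor', actorName: value };
    }
    const actorName = value.slice(0, separatorIndex);
    const toolName = value.slice(separatorIndex + 1);
    if (actorName === '' || toolName === '') {
        throw new Error(`Expected "actorName:toolName", got "${value}"`);
    }
    return { kind: 'mcp-tool', actorName, toolName };
}
```

Modeling the result as a discriminated union lets the caller branch on `kind` and get the tool name only when it actually exists.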

commit c1c415f
Author: Apify Release Bot <[email protected]>
Date: Thu Jan 8 09:53:59 2026 +0000

chore(release): Update changelog, package.json, manifest.json and server.json versions [skip ci]

commit 31c3bdd
Author: Jakub Kopecký <[email protected]>
Date: Thu Jan 8 10:53:06 2026 +0100

fix: update @modelcontextprotocol/sdk to version 1.25.2 in package.json and package-lock.json (#385)

Implement granular control over Actor card output to enable token-efficient information retrieval. Users can now request specific sections (description, stats, pricing, rating, metadata) independently instead of receiving all information bundled together.

Changes:
- Add ActorCardOptions type with 5 boolean flags for granular control
- Update formatActorToActorCard() and formatActorToStructuredCard() with conditional rendering based on options
- Fix rating and bookmarkCount to check both ActorStoreList and Actor.stats locations for better compatibility
- Add comprehensive unit tests (32 tests) using real apify/rag-web-browser data
- Update integration tests to validate granular output functionality
- Update fetch-actor-details tool schema to include 'rating' and 'metadata' as separate output options

Benefits:
- Reduces token usage by allowing users to request only needed information
- Maintains backwards compatibility (all options default to true)
- Improves flexibility for different use cases (e.g., pricing-only queries)
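
As a rough illustration of option-driven rendering with all flags defaulting to true, consider the sketch below. The `CardOptions` and `ActorInfo` types and the `formatCard` function are hypothetical stand-ins; the real ActorCardOptions field names and the output of formatActorToActorCard() are not shown in this PR page.

```typescript
// The five sections named in the commit message.
type CardSection = 'description' | 'stats' | 'pricing' | 'rating' | 'metadata';

// Hypothetical options type: one optional boolean per section, missing means true.
type CardOptions = Partial<Record<CardSection, boolean>>;

// Hypothetical, pre-rendered text for each section of an Actor card.
type ActorInfo = Record<CardSection, string>;

const SECTION_ORDER: CardSection[] = ['description', 'stats', 'pricing', 'rating', 'metadata'];

// Render only the sections that were not explicitly disabled.
function formatCard(actor: ActorInfo, options: CardOptions = {}): string {
    return SECTION_ORDER
        .filter((section) => options[section] ?? true)
        .map((section) => `${section}: ${actor[section]}`)
        .join('\n');
}
```

Because every flag defaults to true, existing callers that pass no options still get the full card, which is what keeps the change backwards compatible.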

jirispilka

I tested it in claude-desktop. It works great!

With the actor-mcp I had more problems, but I don't think they were introduced here.
I tried playwright and tavily, and they both failed.

I made two suggestions.

jirispilka · 2026-01-09T11:13:27Z

src/tools/fetch-actor-details.ts

+    output: z.array(z.enum(['description', 'stats', 'pricing', 'rating', 'metadata', 'input-schema', 'readme', 'mcp-tools']))
+        .min(1)
+        .optional()
+        .default(['description', 'stats', 'pricing', 'rating', 'metadata', 'readme', 'input-schema'])


Instead of using an array of enums for selecting output fields, a more idiomatic and user-friendly approach in Zod is to use an object with boolean flags. This change simplifies the internal tool logic because you no longer need to check for inclusion in an array using .includes().

const fetchActorDetailsToolArgsSchema = z.object({
    actor: z.string()
        .min(1)
        .describe(`Actor ID or full name in the format "username/name", e.g., "apify/rag-web-browser".`),
    output: z.object({
        description: z.boolean().default(true).describe("Include Actor description text only."),
        stats: z.boolean().default(true).describe("Include usage statistics (users, runs, success rate)."),
        pricing: z.boolean().default(true).describe("Include pricing model and costs."),
        ....
    })
        .optional()
        .default({
            description: true,
            stats: true,
            pricing: true,
            .....
        })
        .describe("Specify which information to include in the response to save tokens."),
});

Another option is to use a fields parameter, as is standard in REST APIs. We already use it when fetching datasets, and it has worked fine.

const fetchActorDetailsToolArgsSchema = z.object({
    actor: z.string()
        .min(1)
        .describe(`Actor ID or full name in the format "username/name", e.g., "apify/rag-web-browser".`),
    fields: z.string()
        .optional()
        .describe(`Comma-separated list of fields to include in the response. Available fields: 'description', 'stats', 'pricing', 'rating', 'metadata', 'input-schema', 'readme', 'mcp-tools'. Default: all fields except 'mcp-tools'.`),
});

AI conclusion :) "Use fields if you want to match the API convention. Use the Object/Boolean flags if you want the best performance and reliability from the LLM."
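
To compare the two suggestions, here is a hedged sketch that parses a comma-separated fields string and maps it onto boolean flags, so the same internal logic could serve either input style. It deliberately avoids Zod to stay self-contained; `parseFields`, `toFlags`, and the error wording are assumptions, while the field names and the "all except mcp-tools" default come from the comment above.

```typescript
const ALL_FIELDS = [
    'description', 'stats', 'pricing', 'rating', 'metadata', 'input-schema', 'readme', 'mcp-tools',
] as const;
type Field = (typeof ALL_FIELDS)[number];

// Default: all fields except 'mcp-tools'.
const DEFAULT_FIELDS: Field[] = ALL_FIELDS.filter((field) => field !== 'mcp-tools');

function isField(value: string): value is Field {
    return (ALL_FIELDS as readonly string[]).includes(value);
}

// Parse "pricing, readme" into a validated list; an empty or missing value means the default.
function parseFields(fields?: string): Field[] {
    if (fields === undefined || fields.trim() === '') {
        return [...DEFAULT_FIELDS];
    }
    const requested = fields.split(',').map((part) => part.trim()).filter((part) => part !== '');
    const unknown = requested.filter((part) => !isField(part));
    if (unknown.length > 0) {
        throw new Error(`Unknown fields: ${unknown.join(', ')}`);
    }
    return requested.filter(isField);
}

// Convert a field list into the object-with-boolean-flags form.
function toFlags(selected: Field[]): Record<Field, boolean> {
    const flags = {} as Record<Field, boolean>;
    for (const field of ALL_FIELDS) {
        flags[field] = selected.includes(field);
    }
    return flags;
}
```

With flags as the internal representation, the tool logic reads `flags.pricing` instead of calling `.includes()` on an array, whichever input style ends up in the schema.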

jirispilka · 2026-01-09T11:14:24Z

src/tools/fetch-actor-details.ts

    description: `Get detailed information about an Actor by its ID or full name (format: "username/name", e.g., "apify/rag-web-browser").
-This returns the Actor's title, description, URL, README (documentation), input schema, pricing/usage information, and basic stats.
-Present the information in a user-friendly Actor card.
+


This description is super long now :(

jirispilka · 2026-01-09T11:28:50Z

src/tools/actor.ts

                // Missing input is most likely an LLM error, so NOT marking as a soft-fail to track potential issues with tool description
                return buildMCPResponse({
-                    texts: [`Input is required when step="call". Please provide the input parameter based on the Actor's input schema.`],
+                    texts: [`Input is required. Please provide the input parameter based on the Actor's input schema. Use fetch-actor-details tool with output=['input-schema'] to get the Actor's input schema first.`],


Here I would use the same logic as below, where we return the input-schema if the input is incorrect. Let us also return the input-schema if the input is empty.

I would write it something like ... [fetch-actor-details was called and schema retrieved, "input-schema"]. Just to tell the LLM how the schema was retrieved (but that's not that important)
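
A minimal sketch of what this suggestion could look like, assuming a simplified response shape. `ToolResponse`, `buildMissingInputResponse`, and the exact wording of the note are hypothetical and only illustrate returning the schema together with the error instead of asking for a second tool call.

```typescript
// Hypothetical, simplified response shape for this sketch.
type ToolResponse = { isError: boolean; texts: string[] };

// Return the error together with the Actor's input schema so the LLM can retry immediately.
function buildMissingInputResponse(actorName: string, inputSchema: Record<string, unknown>): ToolResponse {
    return {
        isError: true,
        texts: [
            `Input is required. Please provide the input parameter based on the Actor's input schema.`,
            // Tell the LLM where the schema came from, as suggested in the comment above.
            `Input schema for ${actorName} (same data as fetch-actor-details with output=['input-schema']):`,
            JSON.stringify(inputSchema, null, 2),
        ],
    };
}
```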

MQ37 added 7 commits January 6, 2026 14:12

plan.md

fa518e2

remove plan.md

9a76311

Merge branch 'master' into simplify-call-actor

48f0bea

github-actions bot assigned MQ37 Jan 8, 2026

MQ37 requested a review from jirispilka January 8, 2026 15:45

github-actions bot added t-ai Issues owned by the AI team. tested Temporary label used only programatically for some analytics. labels Jan 8, 2026

MQ37 mentioned this pull request Jan 8, 2026

feat(gpt-apps): add tools and widget descriptors #375

Open

jirispilka approved these changes Jan 9, 2026
