What optimizes performance? [Working with PrefectHQ] #3246
Replies: 1 comment
@jlowin I found an old forum post describing some split between Anthropic and FastMCP, there is a lack of clarity elsewhere about whether the FastMCP bundled into the MCP SDK is supported by the FastMCP project, and then only a day ago FastMCP migrated to Prefect. It is all very worrisome. I had hoped you had replied to this discussion when I saw a notification in my inbox. Why do you (Prefect) not let me (StarlightHQ) help you (FastMCP)? We know what we are on to here, and we have both the time and the skill set. To all others who reach this post: please contact me privately via email to work with CloudNode.
Enhancement
Hey folks. The MCP Protocol is not yet an agreed-upon industry standard, despite early adoption by the Linux Foundation (along with AGENTS.md, etc.), and I consistently apply a critical eye to certain of its design choices and limitations. Specifically, I question the choice to add file transfer beneath the LLM within the client, so that REST APIs can be better accepted as native MCP Tools in the explicit case of simplified, text-restricted LLMs, which remain the industry expectation.
With those caveats in mind, and knowing PrefectHQ from its Prefect work (re: toothpick in the sandwich), one more big question mark still looms over the implementation goals of a standard MCP enterprise team: what actually works?
I had quite a row with the PydanticAI team about even basic questions of what their "AI" does under the hood, between the prompt and its request to the LLM, to coerce the prompt before transmission. Their loggers for even day-one, intern-level questions were purposefully non-existent, and their higher-level loggers were proprietary. It took several days to build loggers and hooks into the underlying httpx layer to get basic transmission statistics to the LLM, and in doing so I discovered that PydanticAI does not actually do much in the way of proprietary magic at all. It configures one field in the LLM request to mandate a tool call to a constructed final_result tool, which accepts its data model as the arguments to that function. That works, and we should all be thrilled that such a simple (clever) mechanism works, but the experience was quite unsatisfactory.
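For the curious, the mechanism described above can be sketched in a few lines. This is my reconstruction of the pattern, not PydanticAI's actual code: an OpenAI-style chat request whose `tool_choice` field forces the model to call a constructed `final_result` tool, with the desired output schema as that tool's parameters.

```python
# Minimal sketch (not PydanticAI's actual code) of the "one field" mechanism:
# the request forces the model to call a final_result tool whose parameters
# mirror the structured output schema we want back.
import json

def build_structured_request(prompt: str, result_schema: dict) -> dict:
    """Build an OpenAI-style chat request that mandates a final_result call."""
    return {
        "model": "gpt-4o-mini",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "final_result",
                "description": "Return the final structured answer.",
                "parameters": result_schema,
            },
        }],
        # This single field does the coercion: the model must call final_result.
        "tool_choice": {"type": "function", "function": {"name": "final_result"}},
    }

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}
request = build_structured_request("What is the largest city in France?", schema)
print(json.dumps(request["tool_choice"]))
```

The model's "answer" then arrives as the arguments of that forced tool call, which the client validates against the schema.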
Caveats in mind, I feel the same way about MCPs, and there are not going to be any good answers to the next question until an empirical framework of benchmarks is created and run for the purpose of a white paper. So be it; that is why I am here. For instance, a number of these questions are top-of-mind obvious to everyone.
Similarly, one of the first experiments I needed to run was a simple monolith v. servlet comparison.
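On one reading of that experiment (my assumption: "monolith v. servlet" means one server advertising every tool versus small servers advertising one tool each), the first measurable difference is the tool-schema overhead that rides along in every LLM request. A toy sketch, with invented tool names and schemas:

```python
# Hypothetical sketch: compare per-request schema overhead when a "monolith"
# server advertises all tools versus a "servlet" advertising only the
# relevant one. Tool names and schemas here are invented examples.
import json

TOOLS = {
    "search_docs": {"type": "object", "properties": {"query": {"type": "string"}}},
    "summarize": {"type": "object", "properties": {"text": {"type": "string"}}},
    "tag_item": {"type": "object", "properties": {"tags": {"type": "array"}}},
}

def schema_bytes(tool_names) -> int:
    """Bytes of tool-schema JSON that would accompany each LLM request."""
    payload = [{"name": n, "parameters": TOOLS[n]} for n in tool_names]
    return len(json.dumps(payload).encode())

monolith_cost = schema_bytes(TOOLS)           # every tool, every request
servlet_cost = schema_bytes(["summarize"])    # only the relevant tool
print(monolith_cost, servlet_cost)
```

The real experiment would of course measure tokens and task accuracy rather than bytes, but the shape of the comparison is the same.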
Luckily, these experiments all have robust, simple, common-sense empirical answers we can set up.
And automate. And I would like volunteers to build that with me.
We would use the PydanticAI system and small GPUs to run common-sense, small benchmark queries. To interested interns and engineers: the auto-construction of these experiments, across various configurations of MCPs, is the valuable asset. It seems common sense that any enterprise company will want to create its own version.
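To make "auto-construction" concrete, here is one way it could look. The configuration axes below are my invented examples, not an agreed list: take the cartesian product of the MCP settings under test and emit one runnable experiment per combination.

```python
# Hypothetical sketch of auto-constructing benchmark experiments: enumerate
# the cartesian product of MCP configuration axes, one experiment config per
# combination. The axes and their values are invented examples.
from itertools import product

AXES = {
    "server_layout": ["monolith", "servlet"],
    "docstring_style": ["purpose+design_choice", "args_only"],
    "return_type": ["dataclass", "primitive"],
}

def build_experiments(axes: dict) -> list[dict]:
    """One experiment config dict per combination of axis values."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]

experiments = build_experiments(AXES)
print(len(experiments))  # 2 * 2 * 2 = 8 configurations
```

Each config would then be handed to a harness that builds the corresponding MCP setup, runs the benchmark queries, and records accuracy and token counts; adding a new axis automatically multiplies out the grid.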
To start with, in the immediate get-things-done sense, I would love to discuss the "cookbooks" others are using, and their verification methods. There are certain to be many debates, because the differences in performance will in many cases be small and insignificant. But, c'mon, the idea that our AI agent systems will not be "robotically exploring this space in small-batch A/B tests" within ten years (ten weeks) and finding long lists of 10-20% gains is a special kind of poppycock. That is the value of these frameworks, and I would like to see people make progress.
For right now, our StarlightHQ cookbook is:
1. typed inputs,
2. dataclasses for non-primitives,
3. descriptive variable names,
4. dataclasses for return values when there is more than one value,
5. docstrings with LLM-facing thought logic and Google-style descriptions of arguments, including descriptions of their validators,
6. a robust "purpose" field in every docstring that describes why we created the function,
7. a robust "design choice" field that describes what internal business logic the function is mechanically constrained by,
8. forgoing Field descriptions, and
9. noting that lower bounds in validators (e.g., at least 6 tags) are unusually taxing on the smaller GPUs that will be the workhorse for a lot of basic feature creation (e.g., summaries to make content searchable).
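A hypothetical tool written in that cookbook style might look like the following. The function, its names, and its toy logic are invented for illustration; the point is the shape: typed inputs, a dataclass return bundle, and a docstring carrying Purpose and Design choice sections plus Google-style argument descriptions.

```python
# Invented example of a tool authored in the cookbook style above: typed
# inputs (rule 1), dataclass return for multiple values (rule 4), Google-style
# docstring with Purpose (rule 6) and Design choice (rule 7) sections.
from dataclasses import dataclass

@dataclass
class TagSummary:
    """Return bundle: more than one value, so a dataclass (rule 4)."""
    summary: str
    tags: list[str]

def summarize_ticket(ticket_text: str, max_tags: int = 6) -> TagSummary:
    """Summarize a support ticket and propose searchable tags.

    Purpose:
        We created this function to make tickets searchable by producing a
        short summary plus a small tag set.

    Design choice:
        Tag count is capped rather than floored; minimum-count validators
        proved unusually taxing on small GPUs (rule 9), so none is enforced.

    Args:
        ticket_text: Raw ticket body; assumed non-empty, no other validator.
        max_tags: Upper bound on proposed tags; no lower bound enforced.
    """
    words = ticket_text.split()
    summary = " ".join(words[:12])
    tags = sorted({w.lower().strip(".,") for w in words if len(w) > 6})[:max_tags]
    return TagSummary(summary=summary, tags=tags)

result = summarize_ticket("Checkout intermittently fails after applying discount codes")
print(result.tags)
```

Whether each of these choices actually moves tool-selection accuracy is exactly what the benchmark harness above would be built to answer.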
Not exactly groundbreaking or earth-shattering, but it took time, and I still do not know what I want to write when I sit down to write a function; I also know the LLM should figure this part out next month anyway. Happy to chat, and anyone who arrives here from a search engine should feel free to track down my email.
Thanks.