What optimizes performance? [Working with PrefectHQ] #3246
Replies: 1 comment
@jlowin I found an old forum post describing some split between Anthropic and FastMCP, there is a lack of clarity elsewhere about whether the FastMCP bundled into the MCP SDK is supported by the FastMCP project, and then only a day ago FastMCP migrated to Prefect. It is all very worrisome. I had hoped you had replied to this discussion when I saw a notification in my inbox. Why do you (Prefect) not let me (StarlightHQ) help you (FastMCP)? We know what we are on to here, and we have both the time and the skill set. To all others who reach this post: please contact me privately via email to work with CloudNode.
Enhancement
Hey folks. The MCP Protocol is not yet an agreed-upon industry standard, despite early adoption by the Linux Foundation (along with AGENTS.md, etc.), and I consistently apply a critical eye to certain of its design choices and limitations. Specifically, I question the choice to add file transfer beneath the LLM within the client, so that REST APIs can be better accepted as native MCP Tools in the explicit case of simplified, text-restricted LLMs, which remain the industry expectation.
With those caveats in mind, and knowing PrefectHQ from its Prefect work (re: toothpick in the sandwich), one more big question mark still looms over the implementation goals of a standard MCP enterprise team: what actually works?
I had quite a row with the PydanticAI team about even basic questions of what their "AI" does under the hood, between the prompt and its request to the LLM, to coerce the prompt before transmission. Their loggers for even day-one, intern-level questions were purposefully non-existent, and their higher-level loggers were proprietary. It took several days to build loggers and hooks into the underlying httpx layer to get basic transmission statistics to the LLM, and in doing so I discovered that PydanticAI does not actually do much in the way of proprietary magic at all. It configures one field in the LLM request to mandate a tool call to a constructed final_result tool, which accepts its data model as the arguments to that function. That works, and we should all be thrilled that such a simple (clever) mechanism works, but the experience was quite unsatisfactory.
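For the curious, the mechanism described above can be sketched in a few lines. This is my reconstruction of the pattern, not PydanticAI's actual code: an OpenAI-style chat request whose `tool_choice` field forces the model to call a constructed `final_result` tool, with the desired output schema as that tool's parameters.

```python
# Minimal sketch (not PydanticAI's actual code) of the "one field" mechanism:
# the request forces the model to call a final_result tool whose parameters
# mirror the structured output schema we want back.
import json

def build_structured_request(prompt: str, result_schema: dict) -> dict:
    """Build an OpenAI-style chat request that mandates a final_result call."""
    return {
        "model": "gpt-4o-mini",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "final_result",
                "description": "Return the final structured answer.",
                "parameters": result_schema,
            },
        }],
        # This single field does the coercion: the model must call final_result.
        "tool_choice": {"type": "function", "function": {"name": "final_result"}},
    }

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}
request = build_structured_request("What is the largest city in France?", schema)
print(json.dumps(request["tool_choice"]))
```

The model's "answer" then arrives as the arguments of that forced tool call, which the client validates against the schema.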
Caveats in mind, I feel the same way about MCPs, and there are not going to be any good answers to the next question until an empirical framework of benchmarks is created and run for the purpose of a white paper. So be it; that is why I am here. For instance, a number of these questions are top-of-mind obvious to everyone.
Similarly, one of the first experiments I needed to run was a simple monolith v. servlet comparison.
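On one reading of that experiment (my assumption: "monolith v. servlet" means one server advertising every tool versus small servers advertising one tool each), the first measurable difference is the tool-schema overhead that rides along in every LLM request. A toy sketch, with invented tool names and schemas:

```python
# Hypothetical sketch: compare per-request schema overhead when a "monolith"
# server advertises all tools versus a "servlet" advertising only the
# relevant one. Tool names and schemas here are invented examples.
import json

TOOLS = {
    "search_docs": {"type": "object", "properties": {"query": {"type": "string"}}},
    "summarize": {"type": "object", "properties": {"text": {"type": "string"}}},
    "tag_item": {"type": "object", "properties": {"tags": {"type": "array"}}},
}

def schema_bytes(tool_names) -> int:
    """Bytes of tool-schema JSON that would accompany each LLM request."""
    payload = [{"name": n, "parameters": TOOLS[n]} for n in tool_names]
    return len(json.dumps(payload).encode())

monolith_cost = schema_bytes(TOOLS)           # every tool, every request
servlet_cost = schema_bytes(["summarize"])    # only the relevant tool
print(monolith_cost, servlet_cost)
```

The real experiment would of course measure tokens and task accuracy rather than bytes, but the shape of the comparison is the same.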
Luckily, these experiments all have robust, simple, common-sense empirical answers we can set up.
And automate. And I would like volunteers to build that with me.
We would use the PydanticAI system and small GPUs to run common-sense, small benchmark queries. To interested interns and engineers: the auto-construction of these experiments, across various configurations of MCPs, is the valuable asset. It seems common sense that any enterprise company will want to create its own version.
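To make "auto-construction" concrete, here is one way it could look. The configuration axes below are my invented examples, not an agreed list: take the cartesian product of the MCP settings under test and emit one runnable experiment per combination.

```python
# Hypothetical sketch of auto-constructing benchmark experiments: enumerate
# the cartesian product of MCP configuration axes, one experiment config per
# combination. The axes and their values are invented examples.
from itertools import product

AXES = {
    "server_layout": ["monolith", "servlet"],
    "docstring_style": ["purpose+design_choice", "args_only"],
    "return_type": ["dataclass", "primitive"],
}

def build_experiments(axes: dict) -> list[dict]:
    """One experiment config dict per combination of axis values."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]

experiments = build_experiments(AXES)
print(len(experiments))  # 2 * 2 * 2 = 8 configurations
```

Each config would then be handed to a harness that builds the corresponding MCP setup, runs the benchmark queries, and records accuracy and token counts; adding a new axis automatically multiplies out the grid.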
To start with, in the immediate get-things-done sense, I would love to discuss the "cookbooks" others are using, and their verification methods. There are certain to be many debates, because the differences in performance will in many cases be small and insignificant. But, c'mon, the idea that our AI agent systems will not be "robotically exploring this space in small-batch A/B tests" within ten years (ten weeks) and finding long lists of 10-20% gains is a special kind of poppycock. That is the value of these frameworks, and I would like to see people make progress.
For right now, our StarlightHQ cookbook is:
1. typed inputs,
2. dataclasses for non-primitives,
3. descriptive variable names,
4. dataclasses for return values when there is more than one value,
5. docstrings with LLM-facing thought logic and Google-style descriptions of arguments, including descriptions of their validators,
6. a robust "purpose" field in every docstring that describes why we created the function,
7. a robust "design choice" field that describes what internal business logic the function is mechanically constrained by,
8. forgoing Field descriptions, and
9. noting that lower bounds in validators (e.g., at least 6 tags) are unusually taxing on the smaller GPUs that will be the workhorse for a lot of basic feature creation (e.g., summaries to make content searchable).
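A hypothetical tool written in that cookbook style might look like the following. The function, its names, and its toy logic are invented for illustration; the point is the shape: typed inputs, a dataclass return bundle, and a docstring carrying Purpose and Design choice sections plus Google-style argument descriptions.

```python
# Invented example of a tool authored in the cookbook style above: typed
# inputs (rule 1), dataclass return for multiple values (rule 4), Google-style
# docstring with Purpose (rule 6) and Design choice (rule 7) sections.
from dataclasses import dataclass

@dataclass
class TagSummary:
    """Return bundle: more than one value, so a dataclass (rule 4)."""
    summary: str
    tags: list[str]

def summarize_ticket(ticket_text: str, max_tags: int = 6) -> TagSummary:
    """Summarize a support ticket and propose searchable tags.

    Purpose:
        We created this function to make tickets searchable by producing a
        short summary plus a small tag set.

    Design choice:
        Tag count is capped rather than floored; minimum-count validators
        proved unusually taxing on small GPUs (rule 9), so none is enforced.

    Args:
        ticket_text: Raw ticket body; assumed non-empty, no other validator.
        max_tags: Upper bound on proposed tags; no lower bound enforced.
    """
    words = ticket_text.split()
    summary = " ".join(words[:12])
    tags = sorted({w.lower().strip(".,") for w in words if len(w) > 6})[:max_tags]
    return TagSummary(summary=summary, tags=tags)

result = summarize_ticket("Checkout intermittently fails after applying discount codes")
print(result.tags)
```

Whether each of these choices actually moves tool-selection accuracy is exactly what the benchmark harness above would be built to answer.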
Not exactly groundbreaking or earth-shattering, but it took time, and I still do not know what I want to write when I sit down to write a function; I also know the LLM should figure this part out next month anyway. Happy to chat, and anyone who arrives here from a search engine should feel free to track down my email.
Thanks.