test: new fake OpenAI server for shared, deterministic tests #914
Conversation
tests/extproc/README.md
```diff
@@ -0,0 +1,55 @@
+# AI Gateway ExtProc Tests
```
FYI once this is merged, I will progress to do similar for aigw (standalone), so that we can easily test ad-hoc with the same config used in unit tests, ideally port-collision free* 
*(there is a little bit of dancing in cmd/aigw we'll need to do because of duplicate prom registrations, but we'll get through)
could you also mention the other existing tests using testupstream and how they differ from this https://github.com/envoyproxy/ai-gateway/blob/main/tests/extproc/testupstream_test.go? Or maybe cutting a subdirectory for all the new files in this PR is better for separation, since the existing tests are not using the ones added here?
actually I wanted to remove the openai -> openai tests from testupstream_test.go once the approach was OK, as we don't want redundancy, I think.
internal/testing/fakeopenai/sse.go
```go
}

// ReadChatCompletionStream reads and parses OpenAI chat completion chunks from an SSE stream.
func ReadChatCompletionStream(r io.Reader) ([]ChatCompletionChunk, string, error) {
```
this is a utility to verify OpenInference-formatted OTel span data, which isn't in this change. I can remove it or leave it.
left some comments, but before going forward I would love to understand whether it's possible to deprecate the testupstream_test.go tests, where we already use a "fake" homemade server without real providers; in other words, this is not the first extproc test that does not rely on external credentials. I think those tests and these fakeopenai tests overlap in the sense of using a fake server (not the replay portion, which I like about this one), so I would like to migrate the existing ones to this recording style. Or, would it be possible to refactor the existing testupstream to use the YAML recorded by go-vcr, which wouldn't require us to write fake servers for all the translation targets (AWS, Anthropic, and Google Vertex)?
Meanwhile, to move things forward without too much migration, I would slightly prefer keeping one envoy config in tests/extproc rather than two different envoy configs in the same directory, each used by different tests (testupstream_test.go, realproviders_test.go, custom_extproc_test.go vs chat_completions_test.go). Maybe we can have another test class, tests/extproc/go-vcr or tests/extproc-vcr or whatever new dir, until we migrate the existing ones (if possible?); then we won't need to modify the current extproc/README.md you added to mention both different classes of tests (existing ones vs VCR). No strong preference, but I would like it to be clear which file does what in the tests/extproc dir. Maybe prefixing the new files here with "vcr_" would be another way... what do you think...
cmd/extproc/mainlib/main.go
```go
@@ -160,6 +160,10 @@ func Main(ctx context.Context, args []string, stderr io.Writer) (err error) {
			l.Error("Failed to shutdown health check server gracefully", "error", err)
		}
	}()

	// Emit startup message to stderr when all listeners are ready.
	fmt.Fprintln(stderr, "AI Gateway External Processor is ready")
```
would it be possible to use the logger l for consistency, or is there any reason not to do it? Otherwise, only this line would be non-structured text while all the other logs are structured JSON.
yeah, this was intentionally using non-buffered stderr (like Envoy does), which allows processes to reliably block on it. Should I put comments to that effect, or should we just hope the buffering doesn't impact barriers? You can push your preference if I'm not around.
the logger is using the same stderr io.Writer variable: l := slog.New(slog.NewTextHandler(stderr, &slog.HandlerOptions{Level: flags.logLevel})), and I think it works exactly the same in terms of blocking? I will push the change.
```go
@@ -205,3 +208,54 @@ func TestStartHealthCheckServer(t *testing.T) {
	})
}

// TestExtProcStartupMessage ensures other programs can rely on the startup message to STDERR.
func TestExtProcStartupMessage(t *testing.T) {
```
+1
tests/extproc/envoy.yaml
```yaml
request_attributes:
  - xds.upstream_host_metadata
```
I don't think we need this at the router-level extproc, so could you try removing it? Otherwise there seems to be a bug.
I can try removing it; in my OTel tests there were errors in the log about missing attributes, and I finally hunted them down to this. It might not be the case anymore, or not triggered with these files.
tests/extproc/envoy_aigw_local.yaml
```yaml
"@type": type.googleapis.com/envoy.config.route.v3.FilterConfig
disabled: true
http_filters:
  # Simulate real config injected via EnvoyExtensionPolicy
```
so unfortunately this is totally different from the actual Envoy config generated by both aigw and the k8s controller mode. I think this is accidentally working, and I have a slight idea of why it works if it's already passing tests, but it differs from how things actually work. In that sense aigw translate is useless, since the translation result does not reflect reality. That is because (unfortunately in my opinion, due to the lack of many APIs in EG) the final Envoy resources are fine-tuned via the extension server mechanism in the internal/extensionserver package, which makes the translation result different from this yaml.

If you take a look at the original envoy.yaml in this directory, you will see two extproc filters configured (not two processes, just filters), and that's intentional: it enables fallback across multiple providers, like falling back from a local model to AWS/OpenAI etc. https://aigateway.envoyproxy.io/docs/latest/concepts/architecture/data-plane. In other words, the existing envoy.yaml is aligned with the actual config generated by aigw run and the control plane.

I am OK to go with this as-is, since I see the main purpose of this config is to have a minimal working setup, not exact alignment with how it works. On the other hand, this might cause trouble: things could work at the extproc layer with these tests, yet not work in k8s or with the "aigw run" command, since the configuration setups differ between here and there. wdyt?
where I started was OTel, and I couldn't find any yaml that worked without errors. So I pared this down to the minimum that would work with extproc (ack, I still have only a partial understanding). I do prefer parity with intent over what I was able to get working, so happy to try to match, especially as I have tests now!
The existing envoy.yaml is tightly coupled with many clouds and the slow testupstream, which makes it difficult to progress before a "stop the world" refactor of everything. I will make the openai one as similar as possible, yet still runnable with docker against ollama, which is a very important goal of this PR. We need to be able to test ad-hoc quickly, especially when adding tracing to this, and there will be overlap between the "testupstream" thing and VCR until other things migrate. That can happen in later PRs and isn't urgent.
OK, I think I get what you are saying more. Our destination is parity with the real Envoy config aigw would make, and a simplified envoy.yaml might pass for the wrong reasons. Yet, challenges remain.

It's a chicken-and-egg issue at this low layer of extproc to even know the config it might need to run, because the owner of that generation is aigw -> eg, etc. We have envoy.yaml which, while hand-edited, is likely very close, and we have good reason to be cautious about deviations.

Future idea: run aigw as a library with a startup hook to func-e, capture the real Envoy config, and have Go tests compare it with the test config in this directory. For example, it could have a working aigw yaml for ollama and simply splat out the Envoy config. There is a chicken-and-egg problem on this part, too, but something like this would reduce drift and misconfiguration anxiety. Also, for newcomers like me, it makes it a lot more transparent what we actually must have for testing vs all the stuff that happens to be in a big file ;)

Tactically in this PR: align the new config closely to the existing one (e.g., dual filters), even if the new tests don't use those features, because it is closer to prod.

Hope this works, but you can feel free to revise if not!
I cleaned up the fakeopenai server, but I still have some work to do: try to use the old envoy yaml, remove redundant tests, and try to use internal/apischema/openai to describe the recordings. Should be done in an hour or two.
(force-pushed from e73853d to 25272d7)
ok, I was unable to get tests/extproc/envoy.yaml working with my test fixtures while also having it pass with ollama locally, etc. It is a "kitchen sink" config, containing many different things used by all tests, and I don't think that is sustainable, for reasons including the several hours I lost trying to re-use it despite it being mostly not about local openai connections. Solving the entire surface of all tests is too much to do in this PR, and I'm not the right one to do that (yet, at least). However, with this in, I can progress tracing, which is something I'm a specialist at. Then, as I gain experience, maybe I can fail less at this configuration. TL;DR: I put the new tests into a "vcr" directory so that we can at least begin the process of less monolithic test execution, free of port conflicts and without risk of accidentally copy/pasting wrong responses, etc. Feel free to revise this branch without my permission. Thanks for your attention to this matter ;)
This PR introduces a fake OpenAI API server using `go-vcr` to enable fast, credential-free, and reproducible integration tests for the AI Gateway external processor (`extproc`).

Key changes:

- **Fake Server Implementation**: New `internal/testing/fakeopenai` package with pre-recorded "cassettes" (YAML) for common OpenAI endpoints like chat completions (basic, streaming, tools, multimodal, etc.). Supports automatic recording of new interactions when `OPENAI_API_KEY` is set.
- **ExtProc Testing Enhancements**:
  - On-demand `extproc` binary build in tests (via `EXTPROC_BIN` env or auto-compile).
  - New `tests/extproc` suite using the fake server, covering various request scenarios with assertions on responses and status codes.
  - Startup message ("AI Gateway External Processor is ready") emitted to stderr for reliable readiness detection.
- **Config & Tooling Updates**: Added `.env.ollama` for model defaults; updated Makefile, licenses, and lint ignores; new Docker Compose setup for manual testing with Ollama and Envoy.
- **Dependencies**: Added `gopkg.in/dnaeon/go-vcr.v4` for cassette recording/replay.

**Benefits**: Tests now run offline, deterministically, and without API costs/keys. Reduces flakiness and speeds up CI. No breaking changes: existing tests remain compatible.

To record new cassettes: set `OPENAI_API_KEY` and run `go test -run TestNewRequest` in `fakeopenai`.

**Notes**: Cassettes were recorded based on analysis of OpenInference instrumentation and the edge cases that often break traces. The design is inspired by work done by @anuraaga in different language families in OpenTelemetry. The motivation is preparation for OpenTelemetry work, as well as later sharing the same cassettes with aigw and e2e tests.

Signed-off-by: Adrian Cole <[email protected]>
(force-pushed from a183cea to 33cde19)
added a footnote to the description about future work, taken from the fakeopenai README. Signing off, and enjoy!
TestChatCompletions raced during the test... let me take a look...
Signed-off-by: Takeshi Yoneda <[email protected]>
💯! the exciting future of smooth e2e testing experience has begun
why are only the ubuntu extproc jobs failing...

can't repro on my local ubuntu, but it seems not to be a flake on CI...
Signed-off-by: Takeshi Yoneda <[email protected]>
```go
@@ -129,6 +129,8 @@ func requireRunEnvoy(t *testing.T, accessLogPath string) {
	"-c", envoyYamlPath,
	"--log-level", "warn",
	"--concurrency", strconv.Itoa(max(runtime.NumCPU(), 2)),
	// This allows multiple Envoy instances to run in parallel.
	"--base-id", strconv.Itoa(time.Now().Nanosecond()),
```
all your base are belong to us
Thanks for helping this in safely, @mathetake, like the safe buffers; I also didn't know about the base-id thing. TIL!
**Description**

This adds openai recordings for all features supported by OpenInference spans, in preparation for the OpenTelemetry pull request.

**Related Issues/PRs (if applicable)**

supports #791; follow-up to #914

**Special notes for reviewers (if applicable)**

It was pretty tricky to trigger the audio and image features with concise requests!

Signed-off-by: Adrian Cole <[email protected]>
Co-authored-by: Adrian Cole <[email protected]>
This PR refactors the extproc startup process to make it more robust around network listeners. It also starts consolidating the new (VCR) test infrastructure by migrating custom metrics and the base types of existing tests to it. Finally, it improves the README of the custom metrics example; before, it wasn't obvious what it did and what its limitations were. The net result is a win on test infrastructure, as nothing uses static ports (conflicts) anymore.

Key changes include:

- **Startup Refinements**: Introduced a unified `listen` helper using `net.ListenConfig` for creating listeners (extproc, metrics, health). This provides better error context (e.g., named failures like "failed to listen for metrics") and simplifies server initialization. Updated metrics and health servers to accept listeners directly, reducing duplication and potential races.
- **Custom Metrics Example**: Expanded the `examples/extproc_custom_metrics` README with detailed explanations of the default metrics flow, example behavior, and use cases (e.g., enhanced performance tracking, cost optimization). The example now logs events and returns fixed TTFT/ITL values for easier verification.
- **Testing Improvements**:
  - Moved binary-building logic (extproc and testupstream) to shared helpers that build on demand, supporting custom binaries (e.g., for examples).
  - Refactored tests to use a centralized `TestEnvironment` that captures stdout/stderr for all components (Envoy, extproc, upstream), enabling parallel runs and easier debugging via unified logs.
  - Switched Envoy access logs to stdout for consistency in tests.
  - Added a new test for custom metrics integration, verifying dynamic metadata in access logs.
- **Other Cleanups**: Adjusted timeouts, fixed minor test flakes (e.g., stale sockets), and improved log handling in test servers.

**Related Issues/PRs (if applicable)**

Follow-up to envoyproxy#914 and envoyproxy#944.

Signed-off-by: Adrian Cole <[email protected]>
Signed-off-by: Adrian Cole <[email protected]>
Co-authored-by: Takeshi Yoneda <[email protected]>
Description

This PR introduces a fake OpenAI API server using `go-vcr` to enable fast, credential-free, and reproducible integration tests for the AI Gateway external processor (`extproc`). The changes enhance testing reliability, eliminate dependency on live API calls, and reduce CI flakiness and costs. Key changes include:

- New `internal/testing/fakeopenai` package with pre-recorded YAML cassettes for common OpenAI endpoints (e.g., chat completions, streaming, tools, multimodal). Supports automatic recording of new interactions when `OPENAI_API_KEY` is set.
- On-demand `extproc` binary build in tests (via `EXTPROC_BIN` env or auto-compile).
- New `tests/extproc` suite using the fake server, covering various request scenarios with assertions on responses and status codes.
- `.env.ollama` and a working docker-compose configuration for ad-hoc testing of extproc.
- `gopkg.in/dnaeon/go-vcr.v4` for cassette recording/replay.

Related Issues/PRs (if applicable)

Special notes for reviewers (if applicable)

Cassettes were recorded based on analysis of OpenInference instrumentation and edge cases that often break traces.
Future work
OpenAI is not the only inference API supported, but it is special as it is
the most common frontend and backend for AI Gateway. This is why we expose the
requests, as we will often proxy these even if the backend is not OpenAI
compatible.
The recording process would remain consistent for other cloud services, such as
Anthropic or Bedrock, though there could be variations in how requests are
scrubbed for secrets or handled for request signing. In a future refactoring,
we could extract the core recording infrastructure into a separate package,
reducing this one to just cassette constants and OpenAI-specific request
recording and handling details. Most of the code could be reused for other
backends.
For additional insights, refer to OpenTelemetry instrumentation, which often
employs VCR for LLM frameworks as well.
Here are key parts of the OpenTelemetry Botocore Bedrock instrumentation that deal with request signing and recording:
Here are key parts of OpenInference Anthropic instrumentation, which handles
their endpoint.