feat: add request logging and admin audit foundations#48
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 045e04d372
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| app_error_status_code(&error), | ||
| Some(&state.route_alias), | ||
| state.context.provider.as_deref(), | ||
| Some(error_code.as_str()), |
There was a problem hiding this comment.
Keep request_log error_code aligned with logged status
When a streamed upstream chunk fails and then billing::record_usage also fails, this call logs status from the billing error (app_error_status_code(&error)) but keeps error_code from the earlier provider error (error_code). That produces inconsistent request log rows (e.g., 500 with provider_error), which can corrupt failure-rate analytics and incident triage for this path. Use an error code derived from the same error used for status, or log the original provider status/code pair together.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request implements observability foundations by introducing a new logging module for request and admin audit logs, updating the database schema, and integrating these logs into the gateway and provider management flows. The implementation includes new structs for logging inputs and helper functions for database persistence. Feedback suggests optimizing the implementation by offloading synchronous database writes to background tasks to reduce API latency, improving error handling types, optimizing database bindings for Uuid to reduce allocations, increasing the reusability of the logging helper by parameterizing hardcoded values, and adopting Tower middleware to avoid repetitive manual logging calls.
| #[allow(clippy::too_many_arguments)] | ||
| async fn record_gateway_request_completion( | ||
| database: &AnyPool, | ||
| database_backend: DatabaseBackend, | ||
| context: &RequestContext, | ||
| request_started_at: &Instant, | ||
| status: StatusCode, | ||
| route_alias: Option<&str>, | ||
| provider_alias: Option<&str>, | ||
| error_code: Option<&str>, | ||
| ) { | ||
| let route = context | ||
| .route | ||
| .clone() | ||
| .or_else(|| route_alias.map(|value| value.to_string())); | ||
| let provider = context | ||
| .provider | ||
| .clone() | ||
| .or_else(|| provider_alias.map(|value| value.to_string())); | ||
| let latency_ms = request_started_at.elapsed().as_millis() as u64; | ||
|
|
||
| if let Err(error) = record_request_log( | ||
| database, | ||
| database_backend, | ||
| &RequestLogInput { | ||
| request_id: context.request_id, | ||
| user_id: context.user_id, | ||
| api_key_id: context.api_key_id, | ||
| provider_id: context.provider_id, | ||
| route_id: context.route_id, | ||
| method: "POST".to_string(), | ||
| path: "/v1/chat/completions".to_string(), | ||
| route, | ||
| provider, | ||
| status_code: status, | ||
| latency_ms, | ||
| error_code: error_code.map(|value| value.to_string()), | ||
| }, | ||
| ) | ||
| .await | ||
| { | ||
| warn!(error = %error, "failed to persist gateway request log"); | ||
| } |
There was a problem hiding this comment.
Recording request logs synchronously in the gateway handler adds significant latency to every API response, as it blocks on a database INSERT. For a high-performance gateway, consider offloading these writes to a background task (e.g., using tokio::spawn or a dedicated worker channel) to avoid blocking the client response.
| AppError::invalid_config( | ||
| "request_log.latency_ms", | ||
| format!("latency_ms must fit into i64: {error}"), | ||
| ) | ||
| })?; | ||
|
|
There was a problem hiding this comment.
The error type AppError::invalid_config is misleading here. An overflow during u64 to i64 conversion for latency is an internal infrastructure or data validation issue, not a user-provided configuration error. Consider using AppError::infrastructure or simply using as i64 since millisecond latency will not realistically exceed i64::MAX (~292 million years).
let latency_ms = i64::try_from(input.latency_ms).map_err(|error| {
AppError::infrastructure(format!("latency_ms overflow: {error}"))
})?;| .bind(input.request_id.to_string()) | ||
| .bind(input.user_id.map(|value| value.to_string())) | ||
| .bind(input.api_key_id.map(|value| value.to_string())) | ||
| .bind(input.provider_id.map(|value| value.to_string())) | ||
| .bind(input.route_id.map(|value| value.to_string())) | ||
| .bind(&input.method) |
There was a problem hiding this comment.
Converting Uuid to String before binding to the query causes unnecessary allocations. sqlx supports binding Uuid directly for both SQLite and PostgreSQL backends if the uuid feature is enabled. This improves efficiency in the request logging path.
| .bind(input.request_id.to_string()) | |
| .bind(input.user_id.map(|value| value.to_string())) | |
| .bind(input.api_key_id.map(|value| value.to_string())) | |
| .bind(input.provider_id.map(|value| value.to_string())) | |
| .bind(input.route_id.map(|value| value.to_string())) | |
| .bind(&input.method) | |
| .bind(Uuid::now_v7()) | |
| .bind(input.request_id) | |
| .bind(input.user_id) | |
| .bind(input.api_key_id) | |
| .bind(input.provider_id) | |
| .bind(input.route_id) |
| route_id: context.route_id, | ||
| method: "POST".to_string(), |
There was a problem hiding this comment.
The HTTP method and path are hardcoded in this helper. While currently only used for chat completions, this limits the reusability of the logging foundation. Consider passing these values as arguments to record_gateway_request_completion or extracting them from the request context.
| route_id: context.route_id, | |
| method: "POST".to_string(), | |
| method: method.to_string(), | |
| path: path.to_string(), |
| record_gateway_request_completion( | ||
| &state.database, | ||
| state.database_backend(), | ||
| &context, | ||
| &request_started_at, | ||
| status, | ||
| Some(public_model), | ||
| None, | ||
| Some(&error_code), | ||
| ) | ||
| .await; |
There was a problem hiding this comment.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dcaa7297a2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| curl -fsS "${BASE_URL}/v1/models" -H "authorization: Bearer ${api_key}" >/dev/null | ||
| echo "Syncing upstream model list" | ||
| synced_models="$( | ||
| MODEL_SYNC_BASE_URL="${MOCK_URL}" \ |
There was a problem hiding this comment.
Point smoke model sync at /v1 models
In the default smoke path, MOCK_URL is the unversioned mock origin and the mock only serves models at /v1/models (this script already waits on ${MOCK_URL}/v1/models). Passing ${MOCK_URL} here makes model-sync.sh build ${MOCK_URL}/models, so curl receives a 404 and set -e aborts the alpha smoke before the chat/usage checks. Use ${MOCK_URL}/v1 here or make the sync helper normalize unversioned OpenAI-compatible base URLs.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 98ebf2c919
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "${url}" | ||
| )" | ||
|
|
||
| jq -e '.object == "list" and (.data | type == "array")' >/dev/null <<<"${response}" |
There was a problem hiding this comment.
Add jq to smoke prerequisites or avoid it
In environments that follow the documented docs/ALPHA_RUNBOOK.md prerequisites (Rust, curl, python3, Redis) but do not have jq installed, scripts/alpha-smoke.sh now always invokes this helper and this first jq call exits 127 under set -e, aborting the smoke before the chat/usage checks. Either document/install jq as a required dependency or parse with the already-required python3 so the documented alpha smoke path remains reproducible.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7e32c7233c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| latency_ms, | ||
| state.route_price, | ||
| ); | ||
| let _ = record_gateway_request_completion( |
There was a problem hiding this comment.
Record streamed requests when clients disconnect early
For streaming requests, this request-log write is only reached after the upstream stream returns None and the code is about to emit [DONE]. If the client disconnects before that point, Axum drops StreamBuildState and its Drop implementation only releases the limit lease, so those aborted streaming requests never get a request_logs row. That leaves a gap in the new durable request-log coverage for common SSE cancellations; consider logging from the drop/abort path as well.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef8b39353e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if let Err(error) = record_admin_audit(&state.database, state.database_backend(), &audit).await | ||
| { | ||
| warn!(error = %error, "failed to record credit grant audit"); |
There was a problem hiding this comment.
Make admin audit writes atomic with mutations
When the admin_audit_logs insert fails here (for example because migrations were disabled or the audit table/schema is unavailable), the credit grant has already committed and this handler still returns success, leaving a balance-changing admin action with no audit row. Since the audit log is being introduced as durable admin audit coverage, record it in the same transaction as the credit ledger/wallet update or fail the mutation when the audit write cannot be persisted.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4644b0973a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| State(state): State<AppState>, | ||
| Extension(identity): Extension<ApiKeyIdentity>, | ||
| headers: HeaderMap, | ||
| Json(payload): Json<ChatCompletionsRequest>, |
There was a problem hiding this comment.
Support the Responses API schema on /v1/responses
When a client uses the OpenAI Responses API payload (for example { "model": "...", "input": "..." }), this extractor deserializes it as ChatCompletionsRequest, which requires messages, so Axum rejects the request before the gateway can route or request-log it; successful calls also continue through the chat-completions mapper and return a chat.completion object rather than a Responses API object. Since this PR advertises /v1/responses as OpenAI-compatible, add a responses-specific request/response mapping or keep the route disabled until that contract is implemented.
Useful? React with 👍 / 👎.
| if let Err(error) = record_admin_audit(&state.database, state.database_backend(), &audit).await | ||
| { | ||
| warn!(error = %error, "failed to record provider connection creation audit"); | ||
| } |
There was a problem hiding this comment.
Make provider audit writes atomic with mutations
If this audit insert fails after the provider connection insert has already committed (for example with migrations disabled or the audit table unavailable), the handler still returns success and leaves an admin credential/configuration change with no audit row. Since this PR is adding durable admin audit coverage, record the audit row in the same transaction as the provider/model-route mutation or fail the mutation when the audit write cannot be persisted; the same post-commit pattern appears in the delete and model-route admin handlers.
Useful? React with 👍 / 👎.
Summary
0002_request_log_foundations.sqlmizan-core/v1/modelscargo run -p mizan-apibuild is less likely to false-failmethod/paththrough request context instead of hardcoding them in the persistence helperjqdependency fromscripts/model-sync.shby parsing model sync JSON withpython3Testing
cargo fmt --all -- --checkcargo test --workspacezsh -n scripts/model-sync.shzsh -n scripts/alpha-smoke.shhttp://127.0.0.1:18180confirmed:GET /v1/modelsPOST /v1/chat/completionsPOST /v1/chat/completionsGET /v1/usageGET /v1/creditsGET /metricsscripts/model-sync.shagainst a mock OpenAI-compatible upstreamscripts/model-sync.shnow works withoutjqNotes