
feat(docker): add support for a docker build image of datahub-mcp-server#92

Open
nwadams wants to merge 8 commits into main from na--docker-build-mcp-server

Conversation


nwadams commented Mar 11, 2026

Add support for running datahub-mcp-server in HTTP mode via docker-compose, and build a Docker image for public consumption.


Note

Medium Risk
Changes add a new container build/publish pipeline that relies on registry credentials and release tagging; misconfiguration could publish incorrect images or fail releases, but runtime code is unchanged.

Overview
Adds first-class Docker deployment support for running the server in HTTP mode, including a new Dockerfile (Python 3.11-slim + uv) and a docker-compose.yml that wires required DataHub env vars and exposes port 8000.

Introduces a GitHub Actions workflow that, on GitHub Release publish, builds the image and pushes versioned + latest tags to both Docker Hub (acryldata/mcp-server-datahub) and GHCR, and updates the README with Docker/Compose run instructions and documented endpoints (/mcp, /health).

Written by Cursor Bugbot for commit 2180f41. This will update automatically on new commits.


cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Docker image always reports version 0.0.0
    • I passed a release-derived build arg through the Docker workflow and used it as SETUPTOOLS_SCM_PRETEND_VERSION during project install so container builds no longer fall back to 0.0.0 without .git.
  • ✅ Fixed: Unpinned uv:latest tag risks breaking Docker builds
    • I replaced ghcr.io/astral-sh/uv:latest with the pinned major tag ghcr.io/astral-sh/uv:0.6 to avoid unbounded upgrades from latest.


Or push these changes by commenting:

@cursor push 793c95037d
Preview (793c95037d)
diff --git a/.github/workflows/docker.yml b/.github/workflows/docker.yml
--- a/.github/workflows/docker.yml
+++ b/.github/workflows/docker.yml
@@ -16,7 +16,9 @@
       - name: Extract version from release tag
         id: version
         run: |
-          echo "version=${{ github.event.release.tag_name }}" >> "$GITHUB_OUTPUT"
+          tag="${{ github.event.release.tag_name }}"
+          echo "version=${tag}" >> "$GITHUB_OUTPUT"
+          echo "package_version=${tag#v}" >> "$GITHUB_OUTPUT"
 
       - name: Log in to Docker Hub
         uses: docker/login-action@v3
@@ -39,6 +41,8 @@
         with:
           context: .
           push: true
+          build-args: |
+            MCP_SERVER_DATAHUB_VERSION=${{ steps.version.outputs.package_version }}
           tags: |
             acryldata/mcp-server-datahub:${{ steps.version.outputs.version }}
             acryldata/mcp-server-datahub:latest

diff --git a/Dockerfile b/Dockerfile
--- a/Dockerfile
+++ b/Dockerfile
@@ -3,8 +3,11 @@
 WORKDIR /app
 
 # Install uv
-COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
+COPY --from=ghcr.io/astral-sh/uv:0.6 /uv /usr/local/bin/uv
 
+# Build-time version override for setuptools-scm when .git is unavailable
+ARG MCP_SERVER_DATAHUB_VERSION=0.0.0
+
 # Copy dependency files
 COPY pyproject.toml uv.lock ./
 
@@ -15,7 +18,7 @@
 COPY src/ ./src/
 
 # Install the project itself
-RUN uv sync --frozen --no-dev
+RUN SETUPTOOLS_SCM_PRETEND_VERSION=${MCP_SERVER_DATAHUB_VERSION} uv sync --frozen --no-dev
 
 ENV PATH="/app/.venv/bin:$PATH"


COPY src/ ./src/

# Install the project itself
RUN uv sync --frozen --no-dev


Docker image always reports version 0.0.0

Medium Severity

The project uses setuptools-scm for versioning, which derives the version from git tags and writes _version.py. Since _version.py is in .gitignore (not tracked in git) and the Dockerfile never copies the .git directory, setuptools-scm cannot determine the version when uv sync runs during the build. It falls back to fallback_version = "0.0.0" from pyproject.toml. This means every Docker image — even those tagged with a real release version by the CI workflow — will report __version__ as "0.0.0", affecting the --version CLI output and the telemetry datahub_component string.


Author


fixed this

WORKDIR /app

# Install uv
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv


Unpinned uv:latest tag risks breaking Docker builds

Medium Severity

The COPY --from=ghcr.io/astral-sh/uv:latest uses an unpinned :latest tag, making the Docker build non-reproducible. If uv releases a breaking change (e.g., moving the binary path from /uv, or changing CLI behavior), builds will silently break. The existing wheels.yml workflow pins astral-sh/setup-uv@v6, but this Dockerfile has no version constraint at all. Pinning to a specific version or major version tag (e.g., uv:0.6) would prevent unexpected build failures.


Author


ignoring this as other use cases use uv:latest

Contributor


why not pin? where are the other use cases that use uv:latest?


cursor bot commented Mar 11, 2026

You have run out of free Bugbot PR reviews for this billing cycle. This will reset on March 27.


nwadams requested a review from alexsku on March 11, 2026 16:31
README.md Outdated
docker build -t mcp-server-datahub .
docker run -p 8000:8000 \
-e DATAHUB_GMS_URL=https://your-datahub-instance \
-e DATAHUB_GMS_TOKEN=your-token \
Contributor


this is a problem, right? we are running the datahub mcp server publicly available without any authentication.

Contributor

alexsku left a comment


I think we should authenticate customer requests, I don't think it is ok for us to release docker files that promote unsecured services


nwadams commented Mar 11, 2026

I think we should authenticate customer requests, I don't think it is ok for us to release docker files that promote unsecured services

I'll add support for token auth as a query param or header as default options


alexsku commented Mar 11, 2026

This PR exposes the MCP server over HTTP with no authentication. Anyone who can reach port 8000 gets full access to all tools using the configured DATAHUB_GMS_TOKEN — including search, lineage, schema inspection, and (if enabled) mutation tools like tag/owner/description changes. Since the README and docker-compose encourage shared/team deployment, this is a meaningful risk.

I'd recommend requiring an Authorization: Bearer header on the /mcp endpoint before merge. The simplest approach: validate the incoming bearer token against the already-configured DATAHUB_GMS_TOKEN, so there's no extra credential to manage. This is well-supported across MCP clients — Claude Code has native bearer_token_env_var support, Claude Desktop works via mcp-remote --header, and Cursor/LiteLLM/OpenAI agents all support static auth headers on Streamable HTTP transports. The server should return 401 with WWW-Authenticate: Bearer on auth failure per the MCP spec. A few lines of Starlette middleware would cover it. The README should also note that HTTPS is strongly recommended when deploying over a network.
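The bearer gate described above can be sketched framework-free; `check_bearer` is an illustrative helper, not code from the PR, and in the actual server the same logic would sit in Starlette middleware:

```python
# Sketch of the suggested auth check: compare the incoming Authorization
# header against the configured DATAHUB_GMS_TOKEN and return a 401 with
# WWW-Authenticate: Bearer on failure, per the MCP spec.
from typing import Optional, Tuple


def check_bearer(auth_header: Optional[str], expected_token: str) -> Tuple[int, dict]:
    """Return (status, extra_headers) for a request to the /mcp endpoint."""
    if auth_header is None or not auth_header.startswith("Bearer "):
        return 401, {"WWW-Authenticate": "Bearer"}
    token = auth_header[len("Bearer "):]
    if token != expected_token:
        return 401, {"WWW-Authenticate": 'Bearer error="invalid_token"'}
    return 200, {}
```

Reusing the already-configured token means there is no extra credential to manage, at the cost of every client sharing one identity.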



def __init__(self, client: DataHubClient) -> None:
    self._client = client

def __init__(self, server_url: str, default_client: Optional[DataHubClient]) -> None:
Contributor


I wonder if we need two separate modes: one for HTTP server mode (which would require authentication) and another for the rest. We would then have two mutually exclusive middlewares: the old _DataHubClientMiddleware and a new one that constructs a DataHubClient every time (it could use a cache function that takes the token and the URL as parameters).



nwadams commented Mar 11, 2026

@alexsku
Updated to use fastmcp TokenAuth middleware to validate the token and correctly return 401

mcp-server-datahub-1  | INFO:     172.67.72.239:51770 - "POST /mcp HTTP/1.1" 401 Unauthorized
mcp-server-datahub-1  |                     INFO     Auth error returned: invalid_token middleware.py:92
mcp-server-datahub-1  |                              (status=401)   

nwadams force-pushed the na--docker-build-mcp-server branch from e1adc70 to 0c74fc4 on March 11, 2026 18:46


alexsku commented Mar 11, 2026

Nice progress on the auth — the TokenVerifier approach using FastMCP's BearerAuthBackend is the right call. A few thoughts on the middleware design:

Right now _DataHubClientMiddleware handles both stdio (fixed client) and HTTP (per-request token with fallback to default client) in a single class with conditional branching. This is where things get muddled — when DATAHUB_GMS_TOKEN is set, the verifier isn't installed (main() line ~201) and the middleware silently falls back to the global client, effectively disabling auth for HTTP mode.

I think we'd be better off with two mutually exclusive middlewares selected by transport mode:

stdio/SSE — keep the original _DataHubClientMiddleware as-is. Single client from env, no auth needed (the process is local to the user).

HTTP with DATAHUB_GMS_TOKEN set (local CLI usage, e.g. mcp-server-datahub --transport http) — same as stdio, use the env token as the fixed client. Auth verifier is optional here since the user is running it locally and already has the token. This is essentially the existing behavior.

HTTP without DATAHUB_GMS_TOKEN (shared/Docker deployment) — a new _PerRequestClientMiddleware that:

  • Requires a Bearer token on every request (no fallback, no silent pass-through)
  • _DataHubTokenVerifier is always enabled as the upstream auth gate
  • Builds a DataHubClient per token, cached with cachetools.TTLCache keyed on (server_url, token) — avoids reconstructing the client + hitting the getMe query on every single request for the same user

The distinction is simple: if DATAHUB_GMS_TOKEN is present, the operator has a token and trusts the network (local use). If it's absent, every caller must authenticate (shared deployment). create_app() picks the right middleware based on transport + whether a global token is configured.

This gives us:

  • Clear separation of responsibility — no conditional branching mixing two different auth models in one middleware
  • Local HTTP usage (--transport http with env token) still works without clients needing Bearer headers
  • Shared/Docker deployment has no path to run unauthenticated
  • Better performance for the per-request path — the TTL cache (cachetools is already a dependency) means repeated calls from the same token reuse the client instead of doing a GraphQL round-trip every time
  • Simpler testing — each middleware has a single concern
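The mode selection described above can be sketched as a small dispatch function; `pick_middleware` and the returned labels are illustrative names, not part of the PR:

```python
# Sketch of the transport-based middleware selection proposed above:
# stdio/SSE and local HTTP (env token present) keep the fixed-client
# middleware; shared/Docker HTTP (no env token) gets per-request auth.
import os


def pick_middleware(transport: str) -> str:
    global_token = os.environ.get("DATAHUB_GMS_TOKEN")
    if transport in ("stdio", "sse"):
        return "fixed-client"      # original _DataHubClientMiddleware
    if global_token:
        return "fixed-client"      # local HTTP CLI usage, env token trusted
    return "per-request"           # shared deployment: Bearer token required
```

In this scheme `create_app()` would call something like this once at startup and install the matching middleware.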


alexsku commented Mar 11, 2026

One more thing — I'd remove DATAHUB_GMS_TOKEN from docker-compose.yml entirely. If it's not there, anyone running via Docker gets per-request auth by default with no way to accidentally disable it. If someone really needs a static token for a trusted network setup, they can always add -e DATAHUB_GMS_TOKEN=... themselves — but we shouldn't be surfacing that option in the default config.

Same for the README Docker examples — only show DATAHUB_GMS_URL, no mention of DATAHUB_GMS_TOKEN in the Docker context. Keep the secure path as the path of least resistance.



nwadams commented Mar 11, 2026

@alexsku refactored to simplify the middleware. I didn't end up setting up two different middlewares because I made it so only the client is created in middleware and validation is done elsewhere.


    global_token = os.environ.get("DATAHUB_GMS_TOKEN")
    if global_token:
        _verify_client(_build_client(server_url, global_token))
Contributor


why do we need to do this here? not that it is a big deal either way, I'm just confused how this is related to the PR

Author


I want to error at startup if the token isn't valid since nothing will work without a valid token.

_GET_ME_QUERY = "query getMe { me { corpUser { urn username } } }"


def _build_client(server_url: str, token: str) -> DataHubClient:
Contributor


do we want to have a cache for this?

Author


I don't think we need a cache for this. There are no HTTP calls here.

Contributor


it results in a new connection for every call, no? I'm pretty certain we need to cache

        super().__init__()
        self._server_url = server_url

    async def verify_token(self, token: str) -> Optional[AccessToken]:
Contributor


do we want to have this cacheable?

            return AccessToken(
                client_id=f"mcp-server-datahub/{__version__}", scopes=[], token=token
            )
        except Exception:
Contributor


if we are caching, perhaps we should be more precise in what we catch here; we don't want to cache None if there was a 500 server error from the server

Author


I think we would want to avoid caching errors in general

        try:
            request = get_http_request()
        except RuntimeError:
            return None
Contributor


when would a RuntimeError be thrown? should we propagate the exception up instead of returning None here?

Author


If this is called outside of an http context. It should never happen in the current implementation.

Contributor


we should propagate the exception then; why change it to None?

Contributor

alexsku left a comment


PR Summary

PR: #92 — feat(docker): add support for a docker build image of datahub-mcp-server
Author: nwadams
Base: main ← na--docker-build-mcp-server
Type: Mixed — new infra (Docker/CI), behavior change (auth), API change (per-request tokens)
Size: +555 / -11 across 6 files

Adds Docker deployment support for running the MCP server in HTTP mode with per-request Bearer token authentication. Each HTTP request now carries its own DataHub token, validated against the DataHub GMS getMe query. When DATAHUB_GMS_TOKEN is set, it acts as a fallback (local/CLI usage); when absent, every request must authenticate (shared/Docker deployment). Includes a GitHub Actions workflow to build and publish Docker images on release, a Dockerfile, docker-compose, README docs, and unit tests.

Key invariants:

  1. HTTP requests without a valid Bearer token must be rejected with 401 when no global DATAHUB_GMS_TOKEN is configured
  2. Per-request tokens must be used to build a DataHubClient scoped to that request's ContextVar
  3. stdio/SSE transport must continue working unchanged with env-based configuration
  4. Docker images must be tagged with the release version and published to both Docker Hub and GHCR

Risk Assessment

Risk: Medium
Blast radius: All HTTP-mode deployments; stdio users affected by new DATAHUB_GMS_URL requirement
Rollback: Safe — no migrations or data changes
Rollout: Ship after fixes — Docker is new surface area so no existing users affected, but stdio regression needs attention
CI: Passing (lint + integration), but new unit tests are NOT run in CI

Blocking Issues

[BLOCKER] Tests don't match implementation — will fail if actually run

Category: tests | Confidence: high
Location: tests/test_mcp/test_auth_middleware.py:80-88, test_falls_back_to_default_client_when_no_request_token()

The middleware constructor is __init__(self, server_url: str, default_token: Optional[str]) — it stores a token string and builds a client per-request via _build_client(). But the test passes a MagicMock() as default_token and asserts result is default_client, expecting the middleware to return a pre-built client object directly. Since _build_client is not patched in this test, it will attempt to construct a real DataHubClient with a MagicMock token and crash, or at minimum the is assertion will fail.

Same issue in test_create_app_with_token_builds_default_client (line ~185): it asserts args[1] is mock_build.return_value, but create_app() passes global_token (the string "globaltoken") as the second arg to the middleware, not the built client.

These tests are not run in CI — the CI workflow only runs tests/test_mcp_integration.py, so the failures are invisible.

Suggested fix: Either:

  • (a) Add test_auth_middleware.py to CI, then fix the tests to match the actual code (patch _build_client where needed, assert on token strings not client objects), or
  • (b) Refactor the middleware to accept an optional pre-built default client (as the tests expect), which would also avoid rebuilding it every request when a global token is configured.

Option (b) is better — it fixes both the test mismatch and the performance issue below.


High-Priority Issues

[HIGH] No client caching — new DataHubClient + GraphQL roundtrip on every request

Category: performance | Confidence: high
Location: src/mcp_server_datahub/__main__.py:85-91, verify_token() and _client_for_request()

Every HTTP request triggers two _build_client() calls:

  1. _DataHubTokenVerifier.verify_token() builds a client and runs a getMe GraphQL query to validate the token
  2. _DataHubClientMiddleware._client_for_request() builds another client for the same token

For the verifier path, this means every MCP request does a synchronous GraphQL roundtrip just for auth validation — before the actual tool call even starts. There's no caching of verified tokens or clients.

Suggested fix: Add a cachetools.TTLCache (already a project dependency) keyed on (server_url, token) to cache validated clients for a short TTL (e.g. 5 minutes). The verifier can populate the cache, and the middleware can read from it.
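A stdlib-only sketch of the suggested cache; in the PR, cachetools.TTLCache (already a dependency) would play this role, and ClientCache plus the build callable are illustrative stand-ins for _build_client:

```python
# Sketch of a TTL cache keyed on (server_url, token): a hit within the TTL
# returns the previously built client, avoiding a fresh DataHubClient and
# getMe round-trip for every request from the same caller.
import time
from typing import Any, Callable, Dict, Tuple


class ClientCache:
    def __init__(self, ttl: float = 300.0) -> None:  # 5-minute TTL
        self._ttl = ttl
        self._entries: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def get_or_build(self, server_url: str, token: str,
                     build: Callable[[str, str], Any]) -> Any:
        key = (server_url, token)
        hit = self._entries.get(key)
        now = time.monotonic()
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # cached client, skip rebuild + auth round-trip
        client = build(server_url, token)
        self._entries[key] = (now, client)
        return client
```

The verifier would populate this cache on successful validation and the middleware would read from it, so a token is only verified against GMS once per TTL window.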

[HIGH] Blocking synchronous I/O in async verify_token

Category: performance | Confidence: high
Location: src/mcp_server_datahub/__main__.py:85-91, _DataHubTokenVerifier.verify_token()

verify_token is an async method but calls _verify_client(), which invokes client._graph.execute_graphql(), synchronous blocking I/O. This blocks the event loop for every incoming HTTP request during token verification.

Suggested fix: Run the blocking call in a thread via asyncio.to_thread() or anyio.to_thread.run_sync() (the project already uses asyncer which wraps anyio).
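A minimal sketch of that offloading, with verify_sync standing in for the blocking _verify_client call (illustrative, not the PR's code):

```python
# Sketch of moving a blocking verification call off the event loop with
# asyncio.to_thread, as suggested: the coroutine awaits a worker thread
# instead of running the synchronous I/O inline.
import asyncio


def verify_sync(token: str) -> bool:
    # Stand-in for the synchronous execute_graphql round-trip to GMS.
    return token == "valid"


async def verify_token(token: str) -> bool:
    # The event loop stays free to serve other requests while the
    # blocking call runs in a thread.
    return await asyncio.to_thread(verify_sync, token)
```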

[HIGH] Breaking change for stdio users — DATAHUB_GMS_URL now strictly required

Category: correctness | Confidence: medium
Location: src/mcp_server_datahub/__main__.py:159-161, create_app()

Previously, create_app() used DataHubClient.from_env() which may have had default values (e.g. http://localhost:8080). Now it does a hard os.environ.get("DATAHUB_GMS_URL") check and raises RuntimeError if missing. Existing stdio users who relied on DataHubClient.from_env() defaults (or who set the URL via other means like ~/.datahubenv) will break.

Suggested fix: For stdio/SSE transport, consider falling back to DataHubClient.from_env() when DATAHUB_GMS_URL is not set. Only require explicit DATAHUB_GMS_URL for HTTP mode.


Other Issues

  • [MEDIUM / security] Dockerfile:6, COPY --from=ghcr.io/astral-sh/uv:latest is unpinned. A breaking uv release will silently break Docker builds. Fix: pin to a specific version like ghcr.io/astral-sh/uv:0.6.
  • [MEDIUM / operability] __main__.py:88, verify_token catches all exceptions with a bare except Exception: return None. A network timeout, DNS failure, or DataHub outage will silently return 401 to all users with no logging. Fix: add logger.warning("Token verification failed", exc_info=True) before returning None.
  • [MEDIUM / correctness] __main__.py:196-199, main() reads DATAHUB_GMS_URL a second time after create_app() already validated it. The auth verifier decision is split between main() and create_app(). Fix: move the mcp.auth assignment into create_app() based on transport mode.
  • [LOW / dx] docker-compose.yml — No restart policy. Docker deployments will silently stop on crash. Fix: add restart: unless-stopped.
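The narrower exception handling the operability bullet asks for could look like this sketch; AuthError, TransportError, and do_verify are illustrative stand-ins for whatever exception types the DataHub SDK actually raises:

```python
# Sketch of distinguishing auth failures (return False -> 401) from
# infrastructure failures (log with traceback, then re-raise so an outage
# is not silently reported to callers as an invalid token).
import logging

logger = logging.getLogger("mcp_server_datahub")


class AuthError(Exception):
    pass


class TransportError(Exception):
    pass


def verify(token: str, do_verify) -> bool:
    try:
        do_verify(token)
        return True
    except AuthError:
        return False  # genuinely bad token: a 401 is correct
    except TransportError:
        logger.warning("Token verification failed", exc_info=True)
        raise  # don't mask a GMS outage as an auth failure
```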

What's Missing

  • Unit tests not in CI: test_auth_middleware.py is never executed. Add it to the CI workflow or add a separate unit test job.
  • No HTTPS guidance: The README Docker section doesn't mention HTTPS. Per-request Bearer tokens over plain HTTP are trivially interceptable. Add a note recommending a TLS-terminating reverse proxy.
  • No rate limiting on token verification: Each verify_token call does a GraphQL roundtrip. An attacker can DoS the DataHub GMS by sending many requests with different invalid tokens.

Test Plan

| Invariant | Covered? | Gap |
| --- | --- | --- |
| 401 on missing/invalid token (HTTP, no global token) | Yes — tests exist | Not run in CI |
| Valid token accepted (HTTP) | Yes — test exists | Not in CI |
| Fallback to global token when DATAHUB_GMS_TOKEN set | Partially — test is buggy (blocker above) | Fix test, add to CI |
| stdio transport unaffected | No | Add test verifying stdio init without DATAHUB_GMS_URL |
| Docker image builds with correct version | No | Add CI job for test build |

Questions for Author

  1. Was the DataHubClient.from_env() removal intentional? Are there stdio users who configure the client via ~/.datahubenv or other mechanisms that from_env() supports but direct os.environ.get doesn't?
  2. What's the plan for getting the new unit tests into CI? Currently they're invisible — the CI workflow only runs integration tests.
  3. Have you considered caching verified clients (TTL cache keyed on token)? The current design does a GraphQL roundtrip on every single request, which could add significant latency under load.

    client = DataHubClient.from_env(
        client_mode=ClientMode.SDK,
        datahub_component=f"mcp-server-datahub/{__version__}",
    )
Contributor


Claude is right - this is a breaking change

Author


will fix.
