Local evals tutorial fails with generic 'Engine core initialization failed' when Python dev headers are missing #180

@kamran-rapidfireAI

Summary

Running tutorial_notebooks/rag-contexteng/rf-tutorial-rag-fiqa.ipynb locally (Linux, non-Colab) fails at experiment.run_evals(...) with:

  • RayTaskError(RuntimeError)
  • Failed to initialize pipeline: RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

The top-level error is not actionable and hides the real root cause.

Repro

  1. Open and run tutorial_notebooks/rag-contexteng/rf-tutorial-rag-fiqa.ipynb locally.
  2. Execute:
results = experiment.run_evals(
    config_group=config_group,
    dataset=fiqa_dataset,
    num_actors=1,
    num_shards=4,
    seed=42,
)
  3. Observe actor init failure with generic engine-core message.

Actual Root Cause (from Ray worker logs)

In Ray worker stderr, the underlying failure is:

fatal error: Python.h: No such file or directory

and then Triton/vLLM fails while compiling runtime CUDA helper code:

CalledProcessError ... /usr/bin/gcc ... -I/usr/include/python3.12 ... returned non-zero exit status 1

The machine was missing the Python dev headers package (python3.12-dev / python3-dev).
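As a rough illustration of how the root cause could be pulled out of worker stderr, here is a minimal sketch. The names KNOWN_ROOT_CAUSES and summarize_root_cause are hypothetical and not part of the project; the only signature in the table is the Python.h error quoted above, and the function takes stderr as a plain string rather than reading Ray log files.

```python
import re

# Hypothetical table mapping a stderr signature to an actionable hint.
# Only the Python.h failure reported in this issue is listed.
KNOWN_ROOT_CAUSES = [
    (
        re.compile(r"fatal error: Python\.h: No such file or directory"),
        "Python dev headers are missing; install python3-dev "
        "(or the versioned package, e.g. python3.12-dev).",
    ),
]


def summarize_root_cause(stderr_text):
    """Return the first matching stderr line plus a hint, or None if nothing matches."""
    for line in stderr_text.splitlines():
        for pattern, hint in KNOWN_ROOT_CAUSES:
            if pattern.search(line):
                return f"{line.strip()}\nHint: {hint}"
    return None


if __name__ == "__main__":
    sample = "some earlier output\nfatal error: Python.h: No such file or directory\n"
    print(summarize_root_cause(sample))
```

A helper along these lines could be called wherever the generic message is currently raised, so the matched line and hint are appended to it.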

Why this happens

vllm + torch._inductor + Triton can compile native/CUDA helper modules at runtime. On Linux, this requires the Python C headers (Python.h) and a compiler toolchain. If the headers are missing, model engine init fails.

Confirmed Fix

Installing Python dev headers resolved the issue:

sudo apt-get update
sudo apt-get install -y python3.12-dev

(Equivalent distro package names apply, e.g. python3-dev.)

Requested Improvements

  1. Preflight dependency checks before actor/model init in evals mode:
    • verify Python.h exists (sysconfig.get_paths()["include"]/Python.h)
    • verify compiler exists (gcc/cc)
    • fail fast with actionable install instructions.
  2. Improve surfaced error message in QueryProcessingActor.initialize_for_pipeline path:
    • include root cause snippet from worker stderr (not only generic “Engine core initialization failed”).
  3. Docs update for local/tutorial setup (non-Colab):
    • Linux prerequisites should explicitly include Python dev headers and build essentials.
  4. Add CI smoke test for missing headers scenario:
    • assert user-facing error is clear and actionable.
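
The preflight check in item 1 could look roughly like the sketch below. The function names find_preflight_problems and run_preflight are hypothetical, as are the parameters (added so the check can be exercised against a temporary directory); the install hints assume a Debian/Ubuntu-style system. Only standard-library calls are used.

```python
import shutil
import sysconfig
from pathlib import Path


def find_preflight_problems(include_dir=None, compilers=("gcc", "cc")):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []

    # Python.h sits in the interpreter's include directory when dev headers are installed.
    if include_dir is None:
        include_dir = sysconfig.get_paths()["include"]
    header = Path(include_dir) / "Python.h"
    if not header.is_file():
        problems.append(
            f"Python.h not found at {header}. Install the Python dev headers, "
            "e.g. `sudo apt-get install -y python3-dev`."
        )

    # Runtime compilation needs a C compiler on PATH; any one of the candidates is enough.
    if not any(shutil.which(name) for name in compilers):
        problems.append(
            f"No C compiler found on PATH (looked for: {', '.join(compilers)}). "
            "Install one, e.g. `sudo apt-get install -y build-essential`."
        )

    return problems


def run_preflight():
    """Raise a single actionable error before any actor or model initialization."""
    problems = find_preflight_problems()
    if problems:
        details = "\n".join(f"- {p}" for p in problems)
        raise RuntimeError(f"Local evals preflight failed:\n{details}")


if __name__ == "__main__":
    run_preflight()
    print("Preflight checks passed.")
```

Collecting all problems before raising means a user missing both the headers and the compiler sees both install hints in one run.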

Environment

  • OS: Linux
  • Python: 3.12
  • vLLM: 0.10.2
  • GPU: NVIDIA L4

User Impact

This is a first-run blocker for local users and can be mistaken for GPU/vLLM incompatibility. Better guardrails and messaging would significantly improve onboarding and reduce support load.
