Open-source readiness: cleanup, governance, test fixes#54

Closed
drunkcoding wants to merge 498 commits into main from dev

Conversation

@drunkcoding

Collaborator

Summary

End-to-end open-source readiness pass for the EdgeSys '26 companion release. This PR brings dev to a state where it can be made public: no hardcoded personal paths, no unreviewed third-party code, complete governance metadata, a clean test suite, and a documented contributor experience.

123 commits / +7,772 / −27,181 lines / 160 files changed.

Highlights

  • Paper pipeline removed from the public release. All plot scripts, table generators, aggregators, microbench drivers, paper-experiment shell wrappers, and the morphling/evaluation/ package are deleted (with full off-tree backup) because they required a private baselines repo and per-device measurement data that cannot ship publicly. See docs/paper.md for the data-availability statement.
  • Governance + licensing. New SECURITY.md, CODE_OF_CONDUCT.md (Contributor Covenant v2.1), NOTICE, THIRD_PARTY_LICENSES.md, and external/muduo_base/LICENSE. README gains a Licensing section.
  • All hardcoded /home/eren paths in the C-side intercept layer fixed. Build is now portable; intercept log directory honours MORPHLING_INTERCEPT_LOG_DIR.
  • Protobuf pin aligned. pyproject.toml build-system and requirements.txt runtime now both pin protobuf>=4.21.6,<7, matching the runtime 6.x that ships in the Docker image.
  • .env.example moved under .taskmaster/ — the LLM API-key template was misleading as a top-level file in a non-LLM project.
  • Contributor docs reconciled. CONTRIBUTING.md now accurately describes what CI does (Build Sanity: hadolint + validate-pyproject + pip --dry-run + MANIFEST.in coverage + cffconvert + community-files check) and what it does NOT do (no docker build, no pytest, no GPU). PR template adds a needs-gpu-verification label workflow for CPU-only contributors.
  • 6 high-visibility TODO/FIXME markers now link to tracking issues (Device measurement: trust model needs revisit #45; Test placeholder: FlatBuffers comparison never implemented #50).
  • Test infrastructure unblocked. make docker-test went from "0 collected (fatal)" before this PR to "100 passed, 12 skipped, 0 failed, 0 errors".

What changed (by area)

Repo hygiene & release scope

  • chore(release): purge paper figures, plot scripts, and morphling.evaluation — 59 files removed (−13,126 lines); local backup at ~/morphling-figures-backup/<timestamp>/.
  • chore(paths): replace hardcoded /home/eren paths in intercept layer: csrc/intercept/interceptor.h, csrc/memory/shared_memory_{initializer,manager}.c; new env var documented in docs/troubleshooting.md.
  • chore(env): move .env.example under .taskmaster.
  • chore(tests): move tests/cpp/echo_* into unit/network/ — enforces the documented tests/cpp/{unit,bench,integration}/ layout.
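
The environment-variable override for the intercept log directory is implemented in the C intercept layer. The Python sketch below only illustrates the lookup order (override first, fallback second). The function name and the fallback directory are hypothetical; only the variable name MORPHLING_INTERCEPT_LOG_DIR comes from this PR.

```python
import os
import tempfile
from pathlib import Path

ENV_VAR = "MORPHLING_INTERCEPT_LOG_DIR"  # variable name taken from this PR


def resolve_intercept_log_dir(env=None):
    """Return the log directory, preferring the environment override.

    The fallback (a subdirectory of the system temp dir) is an assumption
    made for this sketch, not the project's documented default.
    """
    env = os.environ if env is None else env
    override = env.get(ENV_VAR, "").strip()
    if override:
        return Path(override).expanduser()
    return Path(tempfile.gettempdir()) / "morphling_intercept_logs"
```

Passing the environment as a parameter keeps the lookup testable without mutating os.environ.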

Dependencies & build

  • chore(deps): align protobuf version between pyproject and requirements — consistent >=4.21.6,<7 pin.
  • chore(packaging): expand pyproject metadata — description, requires-python, license, 10 classifiers, [project.urls]. No PyPI publishing workflow added.

Governance & docs

  • docs(security): add SECURITY.md and CODE_OF_CONDUCT.md — GitHub Security Advisories channel; Contributor Covenant v2.1 verbatim.
  • docs(licensing): add NOTICE, THIRD_PARTY_LICENSES, muduo LICENSE — vendored, submodule, and runtime third-party components inventoried.
  • docs(paper): rewrite figure inventory after pipeline purge — all figure rows now declare [data not public]; runtime-API pointers retained for re-implementation.
  • docs(contrib): reconcile dev-loop docs and CI scope wording.
  • docs(prd): archive open-source readiness PRD under docs/.

Test fixes (made under this PR after CI unblocked)

  • fix(tests): unblock pytest collection on pytest 9 + protobuf 6 — moves pytest_plugins to a top-level conftest.py; switches BuildPackageProtos to python3 -m grpc_tools.protoc (libprotoc 31) instead of the torch-bundled protoc 3.13.
  • fix(tests): quarantine torch-monkey-patching demo from pytest collection: test_pytorch_decorator.py had no test_* functions but globally replaced torch.* and torch.Tensor.* at import time, breaking 40+ downstream tests. Renamed to _demo_pytorch_decorator.py.
  • fix(greenctx): re-export trace helpers from green_context_backends shim: import * does not propagate private names; added explicit re-export.
  • fix(tests): point swap_timing source-check at canonical cpp_backend.py.
  • fix(tests): prefer real morphling import over sys.modules stub — three _bootstrap_morphling helpers were installing bare types.ModuleType stubs that poisoned sys.modules['morphling'] for the rest of the session.
  • fix(tests): guard tests blocked by pre-existing API drift — module-level pytest.skip for three pre-existing API gaps tracked in Test: rewrite test_pytorch_autograd against current autograd.py API #51, Test: rewrite gpt2_training_test for transformers AdamW removal #52, C extension: morphling._C missing ArcherTensorHandle, MemoryManagerClient #53.
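
The greenctx fix above rests on a Python rule: without an `__all__` list, a star import skips names that begin with an underscore. The sketch below reproduces that with two throwaway in-memory modules. The module and function names (demo_impl, demo_shim, trace_event, _trace_helper) are hypothetical and are not the project's real identifiers.

```python
import sys
import types

# Hypothetical stand-in for the real implementation module.
impl = types.ModuleType("demo_impl")
exec(
    "def trace_event():\n"
    "    return 'public'\n"
    "def _trace_helper():\n"
    "    return 'private'\n",
    impl.__dict__,
)
sys.modules["demo_impl"] = impl

# A shim that relies on the star import alone.
shim = types.ModuleType("demo_shim")
exec("from demo_impl import *", shim.__dict__)
star_only = (hasattr(shim, "trace_event"), hasattr(shim, "_trace_helper"))

# The fix described above: name the private helper explicitly.
exec("from demo_impl import _trace_helper", shim.__dict__)
after_fix = hasattr(shim, "_trace_helper")

print(star_only, after_fix)
```

After the star import only the public name is visible on the shim; the private helper appears once it is imported by name.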

Code traceability

Test plan

Built and verified inside the canonical Docker image at every step.

```bash
make docker-build # green
make docker-test # 100 passed, 12 skipped, 0 failed, 0 errors
```

Skipped tests are gated by pytest.skip(..., allow_module_level=True) with clear reasons and issue links:
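
A module-level guard of this kind might look like the following sketch. The helper names and the stand-in extension object are hypothetical; `pytest.skip(..., allow_module_level=True)` is the real pytest call, imported lazily here so the helper has no hard dependency until a skip is actually needed.

```python
import types


def missing_symbols(module, names):
    """Return the subset of `names` that `module` does not define."""
    return [name for name in names if not hasattr(module, name)]


def skip_module_if_missing(module, names, issue):
    """Skip the calling test module when required symbols are absent."""
    missing = missing_symbols(module, names)
    if missing:
        import pytest  # lazy import: only needed when a skip is raised

        pytest.skip(
            f"{module.__name__} lacks {', '.join(missing)} (see {issue})",
            allow_module_level=True,
        )


# Hypothetical stand-in for a compiled extension with one symbol missing.
fake_ext = types.SimpleNamespace(__name__="fake_ext", ArcherTensorHandle=object)
print(missing_symbols(fake_ext, ["ArcherTensorHandle", "MemoryManagerClient"]))
```

In a real test file the guard would be called at import time, before any test function is defined, so the whole module is reported as skipped with the reason and issue link.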

C++ test suite (tests/cpp/build/test_xtgemm_worker, etc.) passes.

Acceptance gates

| Gate | Status | Evidence |
| --- | --- | --- |
| A1: Hardcoded paths gone | | rg over csrc/morphling/scripts/tests returns 0 |
| A2: Governance + licensing files present | | SECURITY.md, CODE_OF_CONDUCT.md, NOTICE, THIRD_PARTY_LICENSES.md, external/muduo_base/LICENSE |
| A3: Protobuf pin consistent | | both >=4.21.6,<7 |
| A4: .env.example not at root | | only .taskmaster/.env.example |
| A5: No TODO(owner) in public docs | | 0 hits in docs/paper.md, README.md, CITATION.cff |
| A6: Dev-loop ↔ CI scope reconciled | | DEV_README banner, CONTRIBUTING describes actual CI, PR template asks for needs-gpu-verification |
| A7: tests/cpp layout clean | | no .cc/.cpp outside {unit,bench,integration}/ |
| A8: figures policy | | only figures/README.md tracked; figures/evaluation/ gitignored |
| A8-EXT: plot pipeline purged | | 0 hits across deletion patterns; morphling/evaluation/ empty |
| A9: 6 audit-flagged TODOs issue-linked | | #45, #50 |
| A10: pyproject metadata; no release workflow | | description / classifiers / [project.urls]; .github/workflows/{publish,release}*.yml absent |
| A11: Docker build + test pass | | make docker-build green; make docker-test = 100 passed |

Follow-up issues opened (not blocking this PR)

Local backup of removed assets

All deleted figures, plot scripts, generators, paper-experiment drivers, and the morphling.evaluation package are preserved at:

```
~/morphling-figures-backup/20260523T181313Z/
├── tracked/ # ex-tracked figures + comparison/
├── evaluation/ # untracked experiment outputs
├── scripts/ # 37 plot/aggregator/microbench/driver scripts
├── morphling-evaluation/ # entire package
├── tests-cpp-bench-intercept/
└── tests-deleted/ # 4 dependent tests
```

Not in this PR (out of scope per plan)

xly and others added 30 commits March 5, 2026 16:20
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…re, remove dead code

- Move diagnostic scripts to scripts/diagnostics/ (analyze_vtime.py, diagnose_events.py, diagnose_filtered.py)
- Move loose docs to docs/ (DEV_README.md, DOCKER.md, EARLIEST_vs_LATEST.md, GEMM_ID_ISSUES.md, LOG_FORMAT.md)
- Harden .gitignore: add *.patch, .tmp_*/, !tests/**/*.log, PRD_*.md patterns
- Remove dead csrc/worker/ directory (C-style legacy code with wrong hardcoded paths)
- Remove dead csrc/launch_processes.sh (no active references found)

All file moves use git mv to preserve history. Zero behavior changes.
- Removed commented-out debug flags (lines ~21-23)
- Removed Umpire external project block (lines ~37-51)
- Removed commented proto generation foreach loop (lines ~157-185)
- Removed commented morphling_interceptor/morphling libraries (lines ~191-203)
- Removed commented morphling_server/morphling_worker_server (lines ~212-258)
- Removed commented _intercept extension target (lines ~366-387)
- Removed commented morphling_allocator extension target (lines ~389-427)

Active targets preserved: _C, _Msg, _GreenCtx

Lines reduced: 435 → 296 (139 lines removed)
…MakeLists.txt

Pre-existing copy-paste bug causing Docker build failure. Lines 180-195 were
an exact duplicate of the bench_trace_switch target defined at lines 164-178.
- csrc/utils/ → csrc/core/ (logger.cpp, cuda_utils.cpp)
- csrc/common/generator.cc → csrc/core/generator.cpp
- csrc/base/*.cc → external/muduo_base/*.cc
- csrc/backend/server_base.cpp → split files
…package

- morphling/common/: config, types, logging, keywords, decorators
- morphling/utils/: hfparser, checkpoints, save_load
- morphling/runtime/: model_emulator, green_context, ldpc_trace_adapter
- morphling/hooks/: autograd, timer, comm
- morphling/backend/: base (BaseBackend), rabbitmq
- morphling/entrypoint/: run_device, cmdline, emulator, generate_device_config
- morphling/checkpoint/: save_and_load
- morphling/simulator/: events, network, profiles
- morphling/__init__.py, morphling/proto/__init__.py
xyf2002 self-requested a review June 1, 2026 20:11
xly added 7 commits June 3, 2026 14:47
Previously `run_tests_from` ran each binary with `|| true`, so even when a
test exited non-zero the script returned 0. CI gating on this script was
silently impossible. Now the function tracks the worst exit code seen,
continues through the remaining binaries, and returns non-zero if any
test failed.
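
The script in question is a shell script. The Python sketch below only illustrates the same control flow: run every binary, keep going after a failure, and return the worst exit code. Treating "worst" as the largest absolute return code is an assumption of this sketch, and the `fake_test` helper is hypothetical.

```python
import subprocess
import sys


def run_tests_from(commands):
    """Run every command, keep going on failure, return the worst exit code."""
    worst = 0
    for cmd in commands:
        code = subprocess.run(cmd).returncode
        if code != 0:
            print(f"FAILED ({code}): {' '.join(cmd)}", file=sys.stderr)
        # abs() also covers negative codes reported for signal-killed children.
        worst = max(worst, abs(code))
    return worst


def fake_test(exit_code):
    """Hypothetical stand-in for a test binary that exits with a given code."""
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]


result = run_tests_from([fake_test(0), fake_test(3), fake_test(1)])
print("worst exit code:", result)
```

The failing binary in the middle no longer hides behind a forced success, and the binary after it still runs.
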
The numerical_consistency CI workflow runs `pytest ... -m smoke` over
a hand-listed file set, but five of those files carried zero `smoke`
markers — they were collected and silently contributed nothing. Add a
module-level `pytestmark = pytest.mark.smoke` to each so CI signal
matches the workflow's intent.

Files now actually exercised by the smoke filter:
  - test_determinism_utils.py
  - test_numerical_utils.py
  - test_golden_generation.py
  - test_deep_verification_script.py
  - test_convergence_regression.py

Smoke collection grows from 7 to 24 tests.
Replace the hand-listed file glob with `pytest tests/python -m smoke`
plus an `--ignore=` set that documents what needs GPU or the morphling
C extension (both unavailable on hosted runners). New tests marked
`@pytest.mark.smoke` are picked up automatically.

Adds a second CPU-only job `cpu-entrypoint` that runs the CLI tests
under `tests/python/unit/entrypoint/`. Those tests already stub out
torch, morphling._C, and huggingface_hub via monkeypatch, so they only
need pytest + pytest-timeout.

Also bumps the runner Python to 3.10 to match `pyproject.toml`'s
`requires-python`.
Two files under tests/cpp/unit/ were never wired into CMakeLists.txt
and contained only `int main()` print demos with no assertions:

  - test_uuid.cpp        — generates a UUID and prints it
  - ml/test_torch_layout.cpp — prints a tensor before/after from_blob

Neither exercises any Morphling code path. Remove them, drop the empty
`ml/` directory entry from tests/cpp/README.md, and clean up the
dead commented-out `test_torch_layout` block in cmake/tests.cmake.
Weekly pip updates with grouped lint tools (ruff, pre-commit,
clang-format) so style-only bumps land as a single PR. Weekly GitHub
Actions version bumps. Monthly Docker base-image checks (the PyTorch
CUDA-devel image is high-impact; keep it manual-review-friendly).
Static analysis on PRs, pushes to main/dev, and a weekly cron. The C++
build uses the test tree with every optional suite (CUDA, XtGemm, green
context, zerocopy, checkpoint) turned OFF — enough for CodeQL DB
extraction without requiring CUDA on a hosted runner.

`external/**` is excluded so vendored protobuf and muduo_base aren't
analyzed.
CLAUDE.md §5 declares proto/** public API requiring confirmation before
edits. This was previously unenforced. The new workflow runs on PRs and
pushes that touch proto/** or the workflow itself:

  - `buf lint` against the DEFAULT rule set, with PACKAGE_VERSION_SUFFIX,
    FIELD_LOWER_SNAKE_CASE, and PACKAGE_DIRECTORY_MATCH grandfathered
    via the `except` list (cleanup deferred).
  - `buf breaking --against` the PR base branch using the FILE rule,
    catching wire-incompatible changes before merge.
@github-advanced-security


You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

  • The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
  • Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
  • You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

xly added 10 commits June 3, 2026 16:11
buf config v1 only accepts ignore_unstable_packages under `breaking:`,
not `lint:` (it was a v1beta1 lint field). The misplacement made buf
reject the whole config before linting, failing the Proto Compatibility
workflow with a decode error.
Two collection-time failures in the Python Smoke workflow:

1. cpu-smoke: five files (test_ldpc_adapter, test_matrix_sum,
   test_pytorch_mempool, test_tenseal, test_torch_cpu) import pandas /
   psutil / tqdm at module scope. pytest imports every module during
   collection before reading markers, so -m smoke deselecting them did
   not prevent the import error. Add them to the --ignore list.

2. cpu-entrypoint: the repo-root conftest.py registers
   tests.python.testutils.determinism as a global plugin, which imports
   numpy + torch. The lean install (pytest only) broke collection. Add
   numpy + torch to that job.
The always-on C++ test targets include csrc/core/types_and_defs.h, which
hard-includes rapidjson/document.h. The minimal CodeQL apt install omitted
rapidjson-dev (the Dockerfile installs it at line 43), so the cpp analysis
build failed with a fatal 'rapidjson/document.h: No such file' error.
…ntracts

The Proto Compatibility workflow failed because buf could not build the
module. Root causes, all fixed here:

1. Two orphan protos were unparseable AND semantically broken:
   - collective.proto referenced AllReduceRequest/Response, types that
     have never existed in the tree or git history.
   - matmul.proto duplicated the live global_api.proto ComputeGemm
     contract and collided on extension tags 101/102.
   Both were uncompiled and unreferenced by any C++/Python source and
   were never compiled in history. Deleted them; global_api.proto already
   carries the live ComputeGemm contract.

2. Bare-filename imports (import "morphling.proto") could not resolve
   from the repo root. Introduce a buf.work.yaml workspace rooted at
   proto/ so the import root matches the protoc IMPORT_DIRS, and move
   buf.yaml to proto/buf.yaml as the module config.

3. buf config v1 rejected ignore_unstable_packages under lint:; it belongs
   under breaking:. Grandfather pre-existing legacy lint conventions on the
   three live proto2 contracts (DIRECTORY_SAME_PACKAGE, ENUM_FIRST_VALUE_ZERO,
   ENUM_VALUE_PREFIX) rather than restyle wire contracts.

4. The breaking-change check now probes base-branch buildability first and
   skips with a notice when the base predates the guard (e.g. main still has
   the malformed protos), instead of failing on an unbuildable baseline.

Update workflow path triggers from buf.yaml to buf.work.yaml.
EmulationEngine.__enter__ only ever reached a RuntimeError: its
from_pretrained decorator depends on MemoryManagerClient, which has no
C++ implementation (removed during the open-source pass, see #53). A
repo-wide audit found zero live consumers — no entry point or script
reaches it, and the only tests skip themselves on the missing symbol.

Delete the unreachable module chain (model_emulator, patching,
shm_mapping) and strip its guarded re-export from runtime/__init__.

The binding-test case test_memory_manager_client_absence_is_handled
imported model_emulator and asserted EmulationEngine exists; it is
dropped in this same commit so the change stays independently testable.
The two live morphling._C binding checks (ArcherTensorHandle,
set_tensor_shm) are kept.
test_param_offload.py and test_loaded_lib.py both skip at module level
(allow_module_level=True) on the absence of MemoryManagerClient in
morphling._C, so they never ran in any environment. They imported
EmulationEngine / InitEmptyModel, removed in the previous commit.
… deltas (#60)

#55 landed server-side device measurement but the measured_* fields are
write-only for decision-making: stored, serialized, and logged ad hoc,
yet never read by any scheduling path (the vtime calculator reads only
the device-reported legacy fields). #60 cannot pick a reconciliation
policy without a machine-readable record of measured-vs-reported skew.

Emit one PROFILE_DELTA row to perf_server.log on every measured-profile
update, capturing reported vs measured vs ratio per field. The row-format
logic lives in a pure FormatProfileDeltaRow() seam so it is unit-testable
without the tracker's scheduler/network link set.

Observability only: no reconciliation decision is made, the vtime model
still reads legacy fields verbatim, and with the shipped default (probes
off) no rows are emitted. Latency has no ratio column on purpose --
reported is microseconds, measured is nanoseconds; raw columns are
emitted side by side so analysis normalizes rather than inheriting a
1000x error.
…els (#60)

Unit-tests the pure FormatProfileDeltaRow() seam in the zerocopy suite:
column order/count, measured/reported ratios, the -1 sentinel when a
reported field is 0, and the zero-ratio case when measured is absent.
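
The row-format rules described in these commits can be sketched as a pure function. The real seam is the C++ FormatProfileDeltaRow(); the Python version below is an illustration only, and its column names, column order, and separator are hypothetical. What it takes from the commit messages: ratio is measured over reported, the ratio is -1 when the reported value is 0, an absent measurement yields a ratio of 0, and latency is emitted as raw values with no ratio because the units differ.

```python
def ratio(measured, reported):
    """measured / reported, or the -1 sentinel when reported is 0."""
    if reported == 0:
        return -1.0
    return measured / reported


def format_profile_delta_row(device_id, reported, measured):
    """Build one comma-separated row; field names here are hypothetical."""
    cols = ["PROFILE_DELTA", str(device_id)]
    for field in ("bandwidth", "flops"):
        rep = reported.get(field, 0.0)
        mea = measured.get(field, 0.0)  # absent measurement -> 0 -> ratio 0
        cols += [f"{rep:g}", f"{mea:g}", f"{ratio(mea, rep):g}"]
    # Latency: raw values side by side, no ratio (microseconds vs nanoseconds).
    cols += [
        f"{reported.get('latency_us', 0.0):g}",
        f"{measured.get('latency_ns', 0.0):g}",
    ]
    return ",".join(cols)


row = format_profile_delta_row(
    7,
    reported={"bandwidth": 100.0, "flops": 0.0, "latency_us": 5.0},
    measured={"bandwidth": 80.0, "latency_ns": 5200.0},
)
print(row)
```

Keeping the formatter free of I/O is what makes it testable without the tracker's scheduler and network dependencies.
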
…orkflow (#60)

Adds a Profile delta log subsection: the PROFILE_DELTA schema, the
microsecond-vs-nanosecond latency caveat, and the grep-based workflow
for collecting measured-vs-reported ratios to inform the #60 policy.
Open-source readiness follow-up: remove maintainer-specific paths,
names, and internal planning artifacts that gate A1 missed (it only
scanned csrc/morphling/scripts/tests, not docs/).

- rm docs/internal/vtime-data-inventory.md: private 64-device experiment
  inventory with hardcoded /home/xly result paths; gitignore docs/internal/
- CLAUDE.md: strip personal header (handle, non-English directive, creed);
  keep numbered sections referenced by README/CONTRIBUTING/Makefile/MANIFEST
- docs/opensource-readiness.md: drop 'Owner' line and private ~/batchgen
  clone path
- figures/README.md: drop personal ~/morphling-figures-backup path
- untrack .taskmaster/.env.example (LLM dev-tool template, not project config)
- .gitignore: ignore docs/internal/, remove .env.example exception
drunkcoding marked this pull request as ready for review June 4, 2026 08:05
xly added 8 commits June 5, 2026 08:25
The 'main' branch protection requires a status check named exactly
'smoke-tests', but the job published as 'pytest -m smoke (CPU)', leaving
PRs blocked on a check that never reported. Rename the display name to
match the required context.
The morphling_emulator entrypoint exec'd a standalone C++ morphling_server
binary whose build target was removed in e5bfbef (gRPC server no longer
needed), so the documented README Quick Start failed with FileNotFoundError.
It also used parse_args(), which returns a bare Namespace and never runs
EmulatorConfig.__post_init__, leaving the checkpoint env vars unset.

Rewire the entrypoint to start the proxy backend (ProxySvr.initialize/start)
via parse_args_into_dataclasses(), matching scripts/run_devices.py. The
server loads the checkpoint, binds 0.0.0.0:39000 (overridable via
--listen_ip/--listen_port), and serves until Ctrl-C. Update the README and
quickstart docs to describe the long-running proxy-server behavior.
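
The parse_args() pitfall described in this commit can be reproduced with the standard library alone. In the sketch below, the dataclass fields and the environment variable name are hypothetical, and plain argparse plus a manual dataclass construction stands in for whatever parser class the project uses for parse_args_into_dataclasses(). Only the default port 39000 comes from the commit message.

```python
import argparse
import os
from dataclasses import dataclass, fields

ENV_KEY = "DEMO_CHECKPOINT_DIR"  # hypothetical variable name


@dataclass
class EmulatorConfig:
    checkpoint_dir: str = "/tmp/ckpt"  # hypothetical field
    listen_port: int = 39000

    def __post_init__(self):
        # Side effect that only happens when the dataclass is constructed.
        os.environ[ENV_KEY] = self.checkpoint_dir


def build_parser():
    parser = argparse.ArgumentParser()
    for f in fields(EmulatorConfig):
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return parser


os.environ.pop(ENV_KEY, None)
argv = ["--checkpoint_dir", "/data/ckpt"]

namespace = build_parser().parse_args(argv)   # bare Namespace
env_after_namespace = os.environ.get(ENV_KEY)  # __post_init__ never ran

config = EmulatorConfig(**vars(namespace))     # runs __post_init__
env_after_dataclass = os.environ.get(ENV_KEY)

print(env_after_namespace, env_after_dataclass, config.listen_port)
```

The Namespace carries the parsed values but none of the dataclass behaviour, which is why the environment variable stays unset until the dataclass itself is built.
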
docs/deployment.md mounts docker-nginx/nginx.conf and
docker-nginx/morphling_stream.conf for the physical-device deployment and
asserts they exist in the repo root, but the files were absent, so the
documented Nginx stream-proxy step was not reproducible.

Add both in composable form: nginx.conf includes stream_conf.d/*.conf (no
inline stream block) and morphling_stream.conf carries the stream block
forwarding :443 to the local proxy on :39000. Validated with `nginx -t`
using the exact mount layout from deployment.md.
…d test

docs/GEMM_ID_ISSUES.md was a resolved debugging postmortem: all three issues
it described (gemm_id stuck at 0, missing log header comments, a merge-script
syntax error) are already fixed in the current tree, and it referenced stale
.cc paths that no longer exist.

Rewrite it as the canonical "Performance Log Formats" reference (VTIME,
Throughput, PROFILE_DELTA schemas + the gemm_id field), grounded in the
csrc/backend/device_tracker.cpp emitters, and update the README link text.
Add tests/python/unit/test_merge_perf_logs.py to lock in the documented
behavior: header preservation, gemm_id field positions/increments, and
timestamp-sorted merge output.
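
Two of the behaviors locked in by that test, header preservation and timestamp-sorted output, can be sketched as follows. This is not the project's merge script: the function below is hypothetical, and it assumes that header lines start with '#' and that the first comma-separated field of a data row is a numeric timestamp. Neither assumption is confirmed by this PR.

```python
def merge_perf_logs(logs):
    """Merge log texts: keep each distinct '#' header once, sort rows by time.

    Assumes, for this sketch only, that header lines start with '#' and that
    the first comma-separated field of a data row is a numeric timestamp.
    """
    headers, rows = [], []
    for text in logs:
        for line in text.splitlines():
            if not line.strip():
                continue
            if line.startswith("#"):
                if line not in headers:
                    headers.append(line)
            else:
                rows.append(line)
    # list.sort is stable, so rows with equal timestamps keep input order.
    rows.sort(key=lambda row: float(row.split(",", 1)[0]))
    return "\n".join(headers + rows) + "\n"


log_a = "# ts,gemm_id,value\n3.0,1,c\n1.0,0,a\n"
log_b = "# ts,gemm_id,value\n2.0,0,b\n"
merged = merge_perf_logs([log_a, log_b])
print(merged, end="")
```
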
docs/opensource-readiness.md was a fully-executed 425-line planning PRD
(all W1-W4 deliverables landed, removal targets gone) that had been
archived under docs/ with no README link and no live consumer. Delete it;
git history preserves the plan.

docs/EARLIEST_vs_LATEST.md is a useful conceptual reference but baked in
one-off "from the provided logs (97 GEMMs)" numbers from a past debugging
session. Drop the non-reproducible Data Overview and replace the captured
GEMM-0 figures with a round, clearly-illustrative worked example; keep the
strategy comparison, guidance, and verified sync_virtual_time/analyze_sync
workflow.

2 participants