Open-source readiness: cleanup, governance, test fixes #54
Closed
drunkcoding wants to merge 498 commits into
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
…re, remove dead code
- Move diagnostic scripts to scripts/diagnostics/ (analyze_vtime.py, diagnose_events.py, diagnose_filtered.py)
- Move loose docs to docs/ (DEV_README.md, DOCKER.md, EARLIEST_vs_LATEST.md, GEMM_ID_ISSUES.md, LOG_FORMAT.md)
- Harden .gitignore: add *.patch, .tmp_*/, !tests/**/*.log, PRD_*.md patterns
- Remove dead csrc/worker/ directory (C-style legacy code with wrong hardcoded paths)
- Remove dead csrc/launch_processes.sh (no active references found)
All file moves use git mv to preserve history. Zero behavior changes.
- Removed commented-out debug flags (lines ~21-23)
- Removed Umpire external project block (lines ~37-51)
- Removed commented proto generation foreach loop (lines ~157-185)
- Removed commented morphling_interceptor/morphling libraries (lines ~191-203)
- Removed commented morphling_server/morphling_worker_server (lines ~212-258)
- Removed commented _intercept extension target (lines ~366-387)
- Removed commented morphling_allocator extension target (lines ~389-427)
Active targets preserved: _C, _Msg, _GreenCtx
Lines reduced: 435 → 296 (139 lines removed)
…MakeLists.txt Pre-existing copy-paste bug causing Docker build failure. Lines 180-195 were an exact duplicate of the bench_trace_switch target defined at lines 164-178.
- csrc/utils/ → csrc/core/ (logger.cpp, cuda_utils.cpp)
- csrc/common/generator.cc → csrc/core/generator.cpp
- csrc/base/*.cc → external/muduo_base/*.cc
- csrc/backend/server_base.cpp → split files
…sts after T11 rename
…sts for muduo_base headers
…ArcherTensorHandle)
…nneeded checkpoint_handle
…runtime symbol resolution
…package
- morphling/common/: config, types, logging, keywords, decorators
- morphling/utils/: hfparser, checkpoints, save_load
- morphling/runtime/: model_emulator, green_context, ldpc_trace_adapter
- morphling/hooks/: autograd, timer, comm
- morphling/backend/: base (BaseBackend), rabbitmq
- morphling/entrypoint/: run_device, cmdline, emulator, generate_device_config
- morphling/checkpoint/: save_and_load
- morphling/simulator/: events, network, profiles
- morphling/__init__.py, morphling/proto/__init__.py
added 7 commits
June 3, 2026 14:47
Previously `run_tests_from` ran each binary with `|| true`, so even when a test exited non-zero the script returned 0. CI gating on this script was silently impossible. Now the function tracks the worst exit code seen, continues through the remaining binaries, and returns non-zero if any test failed.
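The fix above boils down to remembering the worst exit status instead of discarding it with `|| true`. Below is a minimal Python sketch of that pattern, for illustration only: the real `run_tests_from` is a shell function, and the helper name `run_all` and the stand-in commands are hypothetical.

```python
import subprocess
import sys


def run_all(commands):
    """Run every command, keep going after failures, return the worst exit code."""
    worst = 0
    for cmd in commands:
        code = subprocess.run(cmd).returncode
        if code < 0:
            # Killed by signal N: map to 128 + N, following the usual shell convention.
            code = 128 - code
        if code != 0:
            print(f"FAILED ({code}): {' '.join(cmd)}", file=sys.stderr)
        # Remember the highest status seen instead of masking it.
        worst = max(worst, code)
    return worst
```

A caller would finish with `sys.exit(run_all(binaries))`, so a CI job gating on the script fails whenever any single binary fails, while every binary still gets a chance to run.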
The numerical_consistency CI workflow runs `pytest ... -m smoke` over a hand-listed file set, but five of those files carried zero `smoke` markers, so they were collected and silently contributed nothing. Add a module-level `pytestmark = pytest.mark.smoke` to each so CI signal matches the workflow's intent. Files now actually exercised by the smoke filter:
- test_determinism_utils.py
- test_numerical_utils.py
- test_golden_generation.py
- test_deep_verification_script.py
- test_convergence_regression.py
Smoke collection grows from 7 to 24 tests.
Replace the hand-listed file glob with `pytest tests/python -m smoke` plus an `--ignore=` set that documents what needs GPU or the morphling C extension (both unavailable on hosted runners). New tests marked `@pytest.mark.smoke` are picked up automatically. Adds a second CPU-only job `cpu-entrypoint` that runs the CLI tests under `tests/python/unit/entrypoint/`. Those tests already stub out torch, morphling._C, and huggingface_hub via monkeypatch, so they only need pytest + pytest-timeout. Also bumps the runner Python to 3.10 to match `pyproject.toml`'s `requires-python`.
Two files under tests/cpp/unit/ were never wired into CMakeLists.txt and contained only `int main()` print demos with no assertions:
- test_uuid.cpp: generates a UUID and prints it
- ml/test_torch_layout.cpp: prints a tensor before/after from_blob
Neither exercises any Morphling code path. Remove them, drop the empty `ml/` directory entry from tests/cpp/README.md, and clean up the dead commented-out `test_torch_layout` block in cmake/tests.cmake.
Weekly pip updates with grouped lint tools (ruff, pre-commit, clang-format) so style-only bumps land as a single PR. Weekly GitHub Actions version bumps. Monthly Docker base-image checks (the PyTorch CUDA-devel image is high-impact; keep it manual-review-friendly).
Static analysis on PRs, pushes to main/dev, and a weekly cron. The C++ build uses the test tree with every optional suite (CUDA, XtGemm, green context, zerocopy, checkpoint) turned OFF — enough for CodeQL DB extraction without requiring CUDA on a hosted runner. `external/**` is excluded so vendored protobuf and muduo_base aren't analyzed.
CLAUDE.md §5 declares proto/** public API requiring confirmation before
edits. This was previously unenforced. The new workflow runs on PRs and
pushes that touch proto/** or the workflow itself:
- `buf lint` against the DEFAULT rule set, with PACKAGE_VERSION_SUFFIX,
FIELD_LOWER_SNAKE_CASE, and PACKAGE_DIRECTORY_MATCH grandfathered
via the `except` list (cleanup deferred).
- `buf breaking --against` the PR base branch using the FILE rule,
catching wire-incompatible changes before merge.
added 10 commits
June 3, 2026 16:11
buf config v1 only accepts ignore_unstable_packages under `breaking:`, not `lint:` (it was a v1beta1 lint field). The misplacement made buf reject the whole config before linting, failing the Proto Compatibility workflow with a decode error.
Two collection-time failures in the Python Smoke workflow:
1. cpu-smoke: five files (test_ldpc_adapter, test_matrix_sum, test_pytorch_mempool, test_tenseal, test_torch_cpu) import pandas / psutil / tqdm at module scope. pytest imports every module during collection before reading markers, so -m smoke deselecting them did not prevent the import error. Add them to the --ignore list.
2. cpu-entrypoint: the repo-root conftest.py registers tests.python.testutils.determinism as a global plugin, which imports numpy + torch. The lean install (pytest only) broke collection. Add numpy + torch to that job.
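The first failure follows from import order: a module has to be imported before its markers can be read, so a module-scope import of a missing package fails first. The sketch below reproduces that with the standard library only. It is not pytest itself, and the module and package names (`test_needs_extra`, `not_installed_pkg_xyz`) are made up for the example.

```python
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical test module: it would be deselected by a marker filter, but it
# imports an absent package at module scope.
MODULE_SOURCE = '''
import not_installed_pkg_xyz  # runs before anything else in the file

pytestmark = "not-smoke"  # stand-in for a marker; never reached

def test_something():
    pass
'''


def try_collect(path):
    """Import a file the way a collector must, returning (module, error)."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)  # executes module-scope code, imports included
    except ImportError as exc:
        return None, exc
    return module, None


with tempfile.TemporaryDirectory() as tmp:
    test_file = Path(tmp) / "test_needs_extra.py"
    test_file.write_text(MODULE_SOURCE)
    module, error = try_collect(test_file)

# The import error surfaces before any marker could be inspected.
print(type(error).__name__, "-", error)
```

This is why the commit adds the files to `--ignore` rather than relying on `-m smoke`: ignoring skips the import entirely, whereas marker filtering happens only after it.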
The always-on C++ test targets include csrc/core/types_and_defs.h, which hard-includes rapidjson/document.h. The minimal CodeQL apt install omitted rapidjson-dev (the Dockerfile installs it at line 43), so the cpp analysis build failed with a fatal 'rapidjson/document.h: No such file' error.
…ntracts
The Proto Compatibility workflow failed because buf could not build the
module. Root causes, all fixed here:
1. Two orphan protos were unparseable AND semantically broken:
- collective.proto referenced AllReduceRequest/Response, types that
have never existed in the tree or git history.
- matmul.proto duplicated the live global_api.proto ComputeGemm
contract and collided on extension tags 101/102.
Both were uncompiled and unreferenced by any C++/Python source and
were never compiled in history. Deleted them; global_api.proto already
carries the live ComputeGemm contract.
2. Bare-filename imports (import "morphling.proto") could not resolve
from the repo root. Introduce a buf.work.yaml workspace rooted at
proto/ so the import root matches the protoc IMPORT_DIRS, and move
buf.yaml to proto/buf.yaml as the module config.
3. buf config v1 rejected ignore_unstable_packages under lint:; it belongs
under breaking:. Grandfather pre-existing legacy lint conventions on the
three live proto2 contracts (DIRECTORY_SAME_PACKAGE, ENUM_FIRST_VALUE_ZERO,
ENUM_VALUE_PREFIX) rather than restyle wire contracts.
4. The breaking-change check now probes base-branch buildability first and
skips with a notice when the base predates the guard (e.g. main still has
the malformed protos), instead of failing on an unbuildable baseline.
Update workflow path triggers from buf.yaml to buf.work.yaml.
EmulationEngine.__enter__ only ever reached a RuntimeError: its from_pretrained decorator depends on MemoryManagerClient, which has no C++ implementation (removed during the open-source pass, see #53). A repo-wide audit found zero live consumers — no entry point or script reaches it, and the only tests skip themselves on the missing symbol. Delete the unreachable module chain (model_emulator, patching, shm_mapping) and strip its guarded re-export from runtime/__init__. The binding-test case test_memory_manager_client_absence_is_handled imported model_emulator and asserted EmulationEngine exists; it is dropped in this same commit so the change stays independently testable. The two live morphling._C binding checks (ArcherTensorHandle, set_tensor_shm) are kept.
test_param_offload.py and test_loaded_lib.py both skip at module level (allow_module_level=True) on the absence of MemoryManagerClient in morphling._C, so they never ran in any environment. They imported EmulationEngine / InitEmptyModel, removed in the previous commit.
… deltas (#60) #55 landed server-side device measurement but the measured_* fields are write-only for decision-making: stored, serialized, and logged ad hoc, yet never read by any scheduling path (the vtime calculator reads only the device-reported legacy fields). #60 cannot pick a reconciliation policy without a machine-readable record of measured-vs-reported skew. Emit one PROFILE_DELTA row to perf_server.log on every measured-profile update, capturing reported vs measured vs ratio per field. The row-format logic lives in a pure FormatProfileDeltaRow() seam so it is unit-testable without the tracker's scheduler/network link set. Observability only: no reconciliation decision is made, the vtime model still reads legacy fields verbatim, and with the shipped default (probes off) no rows are emitted. Latency has no ratio column on purpose -- reported is microseconds, measured is nanoseconds; raw columns are emitted side by side so analysis normalizes rather than inheriting a 1000x error.
…els (#60) Unit-tests the pure FormatProfileDeltaRow() seam in the zerocopy suite: column order/count, measured/reported ratios, the -1 sentinel when a reported field is 0, and the zero-ratio case when measured is absent.
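The ratio conventions described in the two commits above can be sketched as follows. This is an illustrative Python rendering, not the C++ `FormatProfileDeltaRow()` itself: the field names (`flops`, `bandwidth`, `latency_us`, `latency_ns`), the column order, and the choice to let the reported-is-zero case take precedence are all assumptions.

```python
from typing import Optional

RATIO_UNDEFINED = -1.0  # sentinel emitted when the reported value is 0


def ratio(measured: Optional[float], reported: float) -> float:
    """measured / reported, with the two edge cases named in the commit message."""
    if reported == 0:
        return RATIO_UNDEFINED  # cannot divide; flag it instead of raising
    if measured is None:
        return 0.0  # measurement absent: zero ratio
    return measured / reported


def format_profile_delta_row(device_id, reported, measured):
    """Build one comma-separated row (hypothetical layout)."""
    cols = ["PROFILE_DELTA", str(device_id)]
    for field in ("flops", "bandwidth"):  # hypothetical ratio-bearing fields
        rep = reported.get(field, 0.0)
        mea = measured.get(field)
        cols += [f"{rep:g}", f"{0.0 if mea is None else mea:g}", f"{ratio(mea, rep):g}"]
    # Latency: reported is microseconds, measured is nanoseconds. Emit both raw
    # values side by side and no ratio, so analysis normalizes the units itself.
    lat_ns = measured.get("latency_ns")
    cols += [f"{reported.get('latency_us', 0.0):g}", f"{0.0 if lat_ns is None else lat_ns:g}"]
    return ",".join(cols)
```

Keeping the formatter a pure function of its inputs is what makes it testable without the tracker's scheduler and network dependencies, which is the point of the seam.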
Open-source readiness follow-up: remove maintainer-specific paths, names, and internal planning artifacts that gate A1 missed (it only scanned csrc/morphling/scripts/tests, not docs/).
- rm docs/internal/vtime-data-inventory.md: private 64-device experiment inventory with hardcoded /home/xly result paths; gitignore docs/internal/
- CLAUDE.md: strip personal header (handle, non-English directive, creed); keep numbered sections referenced by README/CONTRIBUTING/Makefile/MANIFEST
- docs/opensource-readiness.md: drop 'Owner' line and private ~/batchgen clone path
- figures/README.md: drop personal ~/morphling-figures-backup path
- untrack .taskmaster/.env.example (LLM dev-tool template, not project config)
- .gitignore: ignore docs/internal/, remove .env.example exception
added 8 commits
June 5, 2026 08:25
The 'main' branch protection requires a status check named exactly 'smoke-tests', but the job published as 'pytest -m smoke (CPU)', leaving PRs blocked on a check that never reported. Rename the display name to match the required context.
The morphling_emulator entrypoint exec'd a standalone C++ morphling_server binary whose build target was removed in e5bfbef (gRPC server no longer needed), so the documented README Quick Start failed with FileNotFoundError. It also used parse_args(), which returns a bare Namespace and never runs EmulatorConfig.__post_init__, leaving the checkpoint env vars unset. Rewire the entrypoint to start the proxy backend (ProxySvr.initialize/start) via parse_args_into_dataclasses(), matching scripts/run_devices.py. The server loads the checkpoint, binds 0.0.0.0:39000 (overridable via --listen_ip/--listen_port), and serves until Ctrl-C. Update the README and quickstart docs to describe the long-running proxy-server behavior.
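The `parse_args()` pitfall described above can be reproduced with the standard library alone. The sketch below is an assumption-laden stand-in: `EmulatorConfigSketch`, its `checkpoint_dir` field, and the `SKETCH_CHECKPOINT_DIR` variable are hypothetical, and plain `argparse` is used in place of the project's parser and its `parse_args_into_dataclasses()`.

```python
import argparse
import os
from dataclasses import dataclass, fields


@dataclass
class EmulatorConfigSketch:  # hypothetical stand-in for EmulatorConfig
    checkpoint_dir: str = "/tmp/ckpt"

    def __post_init__(self):
        # Side effect that only happens when the dataclass is constructed.
        os.environ["SKETCH_CHECKPOINT_DIR"] = self.checkpoint_dir


def build_parser():
    parser = argparse.ArgumentParser()
    for f in fields(EmulatorConfigSketch):
        parser.add_argument(f"--{f.name}", type=str, default=f.default)
    return parser


def parse_bare(argv):
    """Old behaviour: a plain Namespace, so __post_init__ never runs."""
    return build_parser().parse_args(argv)


def parse_into_dataclass(argv):
    """New behaviour in spirit: build the dataclass from the parsed values."""
    namespace = build_parser().parse_args(argv)
    return EmulatorConfigSketch(**vars(namespace))
```

With `parse_bare`, the value is parsed correctly but the environment variable stays unset; with `parse_into_dataclass`, constructing the dataclass triggers `__post_init__` and the variable is exported.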
docs/deployment.md mounts docker-nginx/nginx.conf and docker-nginx/morphling_stream.conf for the physical-device deployment and asserts they exist in the repo root, but the files were absent, so the documented Nginx stream-proxy step was not reproducible. Add both in composable form: nginx.conf includes stream_conf.d/*.conf (no inline stream block) and morphling_stream.conf carries the stream block forwarding :443 to the local proxy on :39000. Validated with `nginx -t` using the exact mount layout from deployment.md.
…d test docs/GEMM_ID_ISSUES.md was a resolved debugging postmortem: all three issues it described (gemm_id stuck at 0, missing log header comments, a merge-script syntax error) are already fixed in the current tree, and it referenced stale .cc paths that no longer exist. Rewrite it as the canonical "Performance Log Formats" reference (VTIME, Throughput, PROFILE_DELTA schemas + the gemm_id field), grounded in the csrc/backend/device_tracker.cpp emitters, and update the README link text. Add tests/python/unit/test_merge_perf_logs.py to lock in the documented behavior: header preservation, gemm_id field positions/increments, and timestamp-sorted merge output.
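Two of the behaviors that test locks in, header preservation and timestamp-sorted output, can be sketched as below. The row format here is an assumption made for illustration (comma-separated rows whose first column is a numeric timestamp, `#`-prefixed header comments, duplicate headers collapsed); the actual merge script and log schemas may differ.

```python
def merge_perf_logs(logs):
    """Merge log texts: keep '#' header lines once, sort data rows by timestamp."""
    headers, rows = [], []
    for text in logs:
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith("#"):
                if line not in headers:  # preserve each header comment once
                    headers.append(line)
            else:
                rows.append(line)
    # list.sort is stable, so rows with equal timestamps keep their input order.
    rows.sort(key=lambda row: float(row.split(",", 1)[0]))
    return "\n".join(headers + rows) + "\n"
```

For example, merging one log with rows at timestamps 3.0 and 1.0 and another with a row at 2.0 yields a single header followed by the rows in 1.0, 2.0, 3.0 order.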
docs/opensource-readiness.md was a fully-executed 425-line planning PRD (all W1-W4 deliverables landed, removal targets gone) that had been archived under docs/ with no README link and no live consumer. Delete it; git history preserves the plan. docs/EARLIEST_vs_LATEST.md is a useful conceptual reference but baked in one-off "from the provided logs (97 GEMMs)" numbers from a past debugging session. Drop the non-reproducible Data Overview and replace the captured GEMM-0 figures with a round, clearly-illustrative worked example; keep the strategy comparison, guidance, and verified sync_virtual_time/analyze_sync workflow.
… undeclared license)
…ed personal conda path
Summary
End-to-end open-source readiness pass for the EdgeSys '26 companion release. This PR brings `dev` to a state where it can be made public: no hardcoded personal paths, no unreviewed third-party code, complete governance metadata, a clean test suite, and a documented contributor experience.
123 commits / +7,772 / −27,181 lines / 160 files changed.
Highlights
- Paper figures, plot scripts, and the `morphling/evaluation/` package are deleted (with full off-tree backup) because they required a private baselines repo and per-device measurement data that cannot ship publicly. See `docs/paper.md` for the data-availability statement.
- Added `SECURITY.md`, `CODE_OF_CONDUCT.md` (Contributor Covenant v2.1), `NOTICE`, `THIRD_PARTY_LICENSES.md`, and `external/muduo_base/LICENSE`. README gains a Licensing section.
- Hardcoded `/home/eren` paths in the C-side intercept layer fixed. Build is now portable; intercept log directory honours `MORPHLING_INTERCEPT_LOG_DIR`.
- `pyproject.toml` build-system and `requirements.txt` runtime now both pin `protobuf>=4.21.6,<7`, matching the runtime 6.x that ships in the Docker image.
- `.env.example` moved under `.taskmaster/`: the LLM API-key template was misleading as a top-level file in a non-LLM project.
- `CONTRIBUTING.md` now accurately describes what CI does (Build Sanity: hadolint + validate-pyproject + pip --dry-run + MANIFEST.in coverage + cffconvert + community-files check) and what it does NOT do (no `docker build`, no pytest, no GPU). PR template adds a `needs-gpu-verification` label workflow for CPU-only contributors.
- `TODO`/`FIXME` markers now link to tracking issues (#45, "Device measurement: trust model needs revisit", through #50, "Test placeholder: FlatBuffers comparison never implemented").
- `make docker-test` went from "0 collected (fatal)" before this PR to "100 passed, 12 skipped, 0 failed, 0 errors".
What changed (by area)
Repo hygiene & release scope
- `chore(release): purge paper figures, plot scripts, and morphling.evaluation`: 59 files removed (−13,126 lines); local backup at `~/morphling-figures-backup/<timestamp>/`.
- `chore(paths): replace hardcoded /home/eren paths in intercept layer`: `csrc/intercept/interceptor.h`, `csrc/memory/shared_memory_{initializer,manager}.c`, new env var documented in `docs/troubleshooting.md`.
- `chore(env): move .env.example under .taskmaster`.
- `chore(tests): move tests/cpp/echo_* into unit/network/`: enforces the documented `tests/cpp/{unit,bench,integration}/` layout.
Dependencies & build
- `chore(deps): align protobuf version between pyproject and requirements`: consistent `>=4.21.6,<7` pin.
- `chore(packaging): expand pyproject metadata`: description, requires-python, license, 10 classifiers, [project.urls]. No PyPI publishing workflow added.
Governance & docs
- `docs(security): add SECURITY.md and CODE_OF_CONDUCT.md`: GitHub Security Advisories channel; Contributor Covenant v2.1 verbatim.
- `docs(licensing): add NOTICE, THIRD_PARTY_LICENSES, muduo LICENSE`: vendored, submodule, and runtime third-party components inventoried.
- `docs(paper): rewrite figure inventory after pipeline purge`: all figure rows now declare `[data not public]`; runtime-API pointers retained for re-implementation.
- `docs(contrib): reconcile dev-loop docs and CI scope wording`.
- `docs(prd): archive open-source readiness PRD under docs/`.
Test fixes (made under this PR after CI unblocked)
- `fix(tests): unblock pytest collection on pytest 9 + protobuf 6`: moves `pytest_plugins` to a top-level `conftest.py`; switches `BuildPackageProtos` to `python3 -m grpc_tools.protoc` (libprotoc 31) instead of the torch-bundled `protoc 3.13`.
- `fix(tests): quarantine torch-monkey-patching demo from pytest collection`: `test_pytorch_decorator.py` had no `test_*` functions but globally replaced `torch.*` and `torch.Tensor.*` at import time, breaking 40+ downstream tests. Renamed to `_demo_pytorch_decorator.py`.
- `fix(greenctx): re-export trace helpers from green_context_backends shim`: `import *` does not propagate private names; added explicit re-export.
- `fix(tests): point swap_timing source-check at canonical cpp_backend.py`.
- `fix(tests): prefer real morphling import over sys.modules stub`: three `_bootstrap_morphling` helpers were installing bare `types.ModuleType` stubs that poisoned `sys.modules['morphling']` for the rest of the session.
- `fix(tests): guard tests blocked by pre-existing API drift`: module-level `pytest.skip` for three pre-existing API gaps tracked in #51 (Test: rewrite test_pytorch_autograd against current autograd.py API), #52 (Test: rewrite gpt2_training_test for transformers AdamW removal), and #53 (C extension: morphling._C missing ArcherTensorHandle, MemoryManagerClient).
Code traceability
- `chore(code): link 6 high-visibility TODOs to tracking issues`: issues #45 (Device measurement: trust model needs revisit) through #50 (Test placeholder: FlatBuffers comparison never implemented).
Test plan
Built and verified inside the canonical Docker image at every step.
```bash
make docker-build # green
make docker-test # 100 passed, 12 skipped, 0 failed, 0 errors
```
Skipped tests are gated by `pytest.skip(..., allow_module_level=True)` with clear reasons and issue links:
- `morphling._C` C-extension symbol drift (#53, C extension: morphling._C missing ArcherTensorHandle, MemoryManagerClient): 2 files
- `morphling.hooks.autograd` removed symbols (#51, Test: rewrite test_pytorch_autograd against current autograd.py API): 1 file
- `transformers.AdamW` upstream removal (#52, Test: rewrite gpt2_training_test for transformers AdamW removal): 1 file
C++ test suite (`tests/cpp/build/test_xtgemm_worker`, etc.) passes.
Acceptance gates
- `rg` over csrc/morphling/scripts/tests returns 0
- `>=4.21.6,<7`
- `.env.example` not at root
- `.taskmaster/.env.example`
- `TODO(owner)` in public docs
- `needs-gpu-verification`
- `.cc`/`.cpp` outside `{unit,bench,integration}/`
- `figures/README.md` tracked; `figures/evaluation/` gitignored
- `morphling/evaluation/` empty
- `.github/workflows/{publish,release}*.yml` absent
- `make docker-build` green; `make docker-test` = 100 passed
Follow-up issues opened (not blocking this PR)
Local backup of removed assets
All deleted figures, plot scripts, generators, paper-experiment drivers, and the `morphling.evaluation` package are preserved at:
```
~/morphling-figures-backup/20260523T181313Z/
├── tracked/ # ex-tracked figures + comparison/
├── evaluation/ # untracked experiment outputs
├── scripts/ # 37 plot/aggregator/microbench/driver scripts
├── morphling-evaluation/ # entire package
├── tests-cpp-bench-intercept/
└── tests-deleted/ # 4 dependent tests
```
Not in this PR (out of scope per plan)