retrace-cpython is the CPython runtime used by Retrace when it needs native
execution probes for deterministic record and replay.
It produces patched CPython executables that behave like normal Python
interpreters, but expose a small built-in _retrace module. Retrace uses that
module to observe exact execution coordinates, bytecode thread-switch points,
and call_at callbacks with much lower overhead than Python-level monitoring
hooks.
This repository is the build and release recipe for those interpreters. It does not vendor CPython.
Full project documentation lives under docs/ and can be built with MkDocs.
Release builds generate llms.txt and llms-full.txt from those docs for
AI-readable package context.
- patched CPython executables for supported upstream CPython releases
- a built-in _retrace module for native probe access
- fast execution coordinate snapshots and deltas
- bytecode-level thread-switch callbacks for scheduling telemetry
- call_at callbacks at exact Python execution coordinates
- build, test, package, and PyPI release infrastructure
The result is intended to be used by higher-level Retrace packages. End users
should not normally have to think about this project directly; installing
Retrace should select the matching retrace-cpython artifact for their Python
version and platform.
Release artifacts contain the patched Python executable and any required
CPython runtime dynamic libraries. They do not include a full copy of the
standard library, tests, headers, static archives, or ensurepip bundles.
The patched executable can run against a vanilla installation of the same
CPython version by setting PYTHONHOME. That keeps artifacts small while still
using the exact CPython runtime version expected by the target environment.
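A minimal sketch of launching a patched executable against a vanilla installation from Python. The helper names patched_env and run_patched, and any paths passed to them, are hypothetical. The only mechanism taken from this document is setting PYTHONHOME to the prefix of an unpatched CPython of the same version.

```python
import os
import subprocess


def patched_env(vanilla_home, base_env=None):
    """Return a copy of the environment that points the patched interpreter
    at the standard library of a vanilla install of the same CPython version."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTHONHOME"] = vanilla_home  # e.g. sys.base_prefix of the vanilla install
    return env


def run_patched(patched_exe, vanilla_home, *args):
    # Hypothetical inputs: patched_exe comes from a release artifact,
    # vanilla_home from a normal CPython installation.
    return subprocess.run(
        [patched_exe, *args],
        env=patched_env(vanilla_home),
        check=True,            # raise CalledProcessError on a non-zero exit
        capture_output=True,
        text=True,
    )
```

Copying the environment rather than mutating os.environ keeps the caller's own interpreter unaffected.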
On unpatched CPython, the built-in _retrace module is absent. Consumers should
probe for support at runtime:
import importlib.util
if importlib.util.find_spec("_retrace") is None:
    # native probes unavailable; use fallback behavior or fail early
    ...
Coordinates are always enabled in the native layer. Higher-level Retrace cursor code is responsible for deciding which coordinates belong to application code and which belong to control-plane code.
Patched interpreters expose a public retrace module backed by the native
_retrace builtin:
retrace.coordinates(thread_id=None, drop=0)
retrace.thread_delta()
retrace.hash()
retrace.exclude(callable)
retrace.disable(callable)
retrace.include(callable)
retrace.enable(callable)
retrace.CoordinateSpace().wrap(callable)
retrace.run_disabled(callable, *args, **kwargs)
retrace.with_new_coordinates(callable, *args, **kwargs)
retrace.root_space.thread_switch = callback_or_none
retrace.CoordinateSpace().thread_switch = callback_or_none
retrace.CoordinateSpace().monitoring.register_callback(
tool_id, event, callback_or_none, target_space=retrace.root_space
)
retrace.CoordinateSpace().settrace(callback_or_none, target_space=retrace.root_space)
retrace.CoordinateSpace().setprofile(callback_or_none, target_space=retrace.root_space)
retrace.call_at(coordinates, callback, overshoot_callback=None)
retrace.call_at(None)
retrace.ThreadHandoff(timeout=None)
coordinates() returns a tuple of Python int values ordered from oldest
visible Python frame to current frame. Each visible frame contributes two
values:
(call_ordinal, instruction_coordinate)
The call_ordinal is usually 0; it becomes non-zero only when one parent
instruction enters multiple visible Python frames before that parent advances.
thread_id is the integer returned by _thread.get_ident() and selects which
thread to inspect; unknown thread ids raise LookupError. drop omits leading
coordinate words from the returned tuple.
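Because drop counts coordinate words rather than frames, it helps to see the flat tuple regrouped into per-frame pairs. This sketch works on a hypothetical sample tuple instead of calling the native module; frame_pairs and drop_words are illustrative helpers, not part of the retrace API.

```python
from typing import List, Sequence, Tuple


def frame_pairs(words: Sequence[int]) -> List[Tuple[int, int]]:
    """Regroup flat coordinate words into (call_ordinal, instruction_coordinate) pairs."""
    if len(words) % 2:
        raise ValueError("expected an even number of coordinate words")
    return list(zip(words[0::2], words[1::2]))


def drop_words(words: Sequence[int], drop: int) -> Tuple[int, ...]:
    """Mimic the drop argument: omit leading (outermost) coordinate words."""
    return tuple(words[drop:])


# Hypothetical coordinates() result for three visible frames, oldest first.
sample = (0, 14, 0, 6, 2, 30)
print(frame_pairs(sample))                 # [(0, 14), (0, 6), (2, 30)]
print(frame_pairs(drop_words(sample, 2)))  # [(0, 6), (2, 30)]
```

In this sketch an odd drop value would split a frame's pair, so a caller that wants whole frames passes an even number.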
thread_delta() is the fast current-thread path for frequent thread scheduling
events. It returns:
[common_prefix_count, *new_suffix]
common_prefix_count is measured in coordinate words, not frames. Consumers
update their materialized stack like this:
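import _retrace
stack = []  # consumer-owned mirror of root-first coordinate words, kept across calls
delta = _retrace.thread_delta()
common_count = delta[0]
del stack[common_count:]  # keep only the shared prefix
stack.extend(delta[1:])  # append the changed suffix
hash() returns the current thread's 64-bit coordinate-location hash as a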
Python int. It hashes the same logical pair stream returned by
coordinates().
RETRACE_ROOT_SEED can be set to any stable string before interpreter startup
to seed the main thread id. If it is unset, Retrace uses the literal
seed string retrace.
Use retrace for Python code; it exposes the native helpers and wraps
thread-switch registrations with retrace.exclude. _retrace remains the
lower-level builtin substrate for capability probes and native-oriented tests.
root_space.thread_switch stores a Python callback called as
callback(previous_delta, next_thread_id) for the root space. Use
space.thread_switch = callback to target another coordinate
space, or assign None to clear a callback.
Thread switch callbacks are namespace-local. For each registered coordinate space, Retrace tracks the last thread that executed visible bytecode in that space. Other spaces are invisible to that callback. If thread A runs in space S, thread B runs outside S, then thread A runs in S again, no S callback fires. If thread A runs in S, thread B runs outside S, then thread C runs in S, the S callback fires once with A's delta in S and C's thread id.
The switch check runs before each bytecode instruction. The hot path is an interpreter-wide pointer comparison; if the bytecode thread has not changed, no namespace work is performed. When the bytecode thread has changed, Retrace examines only registered spaces with switch callbacks. A callback is considered only when the instruction about to execute is visible in that callback's space. If the current thread is already that space's last visible bytecode thread, the callback is a same-thread no-op and is dropped without producing a callback delta.
For a real namespace switch, Retrace computes previous_delta for the previous
visible thread in that same space before advancing the space's last-thread
cursor. The callback is then delivered on the new/current thread before the
current bytecode instruction executes. previous_delta has the same shape as
thread_delta(space): (common_prefix_count, *new_suffix). next_thread_id
is the new/current thread's stable Retrace thread id as a Python int. The
first visible thread in a space initializes that space's cursor and does not
fire a callback, because there is no previous visible thread or delta. If the
previous visible thread completed naturally, previous_delta is None. This
is a distinct completion sentinel, not an empty/root coordinate.
call_at(coordinates, callback, overshoot_callback=None) arms one callback per
interpreter for the current thread. If coordinates is already in the past for
the current thread, arming raises ValueError. When the current thread reaches
the exact coordinate tuple, CPython clears the callback and calls callback()
on that thread. If the thread passes the target coordinate without hitting it
exactly, CPython clears the callback and calls overshoot_callback() when one
was supplied.
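The three outcomes of an armed call_at (pending, exact hit, overshoot) can be modeled with plain tuples. This is a sketch under a loudly stated assumption: it treats "in the past" as lexicographic order on root-first coordinate words, which this document does not spell out. classify and arm are hypothetical helpers, not the native implementation.

```python
from typing import Sequence, Tuple


def classify(current: Sequence[int], target: Sequence[int]) -> str:
    """Classify the current coordinate against an armed call_at target.

    ASSUMPTION: coordinates are ordered lexicographically on root-first words,
    so a parent that has not yet entered the target child sorts before it.
    """
    cur, tgt = tuple(current), tuple(target)
    if cur == tgt:
        return "hit"        # callback() would run here
    if cur > tgt:
        return "overshoot"  # overshoot_callback() would run, if one was supplied
    return "pending"        # keep checking before the next instruction


def arm(current: Sequence[int], target: Sequence[int]) -> Tuple[int, ...]:
    # Mirrors the documented ValueError for targets already in the past.
    if classify(current, target) == "overshoot":
        raise ValueError("call_at target is already in the past")
    return tuple(target)


# Hypothetical target: the second child entered from instruction 5, at instruction 2.
target = (0, 5, 1, 2)
print(classify((0, 5), target))        # pending
print(classify((0, 5, 1, 2), target))  # hit
print(classify((0, 7), target))        # overshoot
```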
Passing coordinates=None arms a one-shot completion callback for the current
thread instead of a coordinate checkpoint. No coordinate checkpoint or overshoot
callback is armed for a completion target.
CoordinateSpace.monitoring mirrors sys.monitoring for space-filtered event
callbacks on Python 3.12 and newer. space.monitoring.register_callback(tool_id, event, callback, target_space=target) routes events observed while
space is active into callback running in target; events in other spaces
hit a native no-op dispatch path. target_space defaults to root space. The
set_events and set_local_events controls still operate at CPython's
tool/code level, so they enable event production globally for that tool; the
Retrace wrapper only filters callback delivery by coordinate space. On Python
versions without sys.monitoring, the monitoring proxy raises RuntimeError.
CoordinateSpace.settrace(callback, target_space=target) and
CoordinateSpace.setprofile(callback, target_space=target) provide the same
source-space filter and target-space callback routing for sys.settrace and
sys.setprofile. Events in other spaces hit the same native no-op dispatch
path. Trace callbacks that return another local trace callable have that
callable wrapped back into target_space.
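The wrapping rule for local trace callables can be shown with the standard sys.settrace protocol. In this sketch the hypothetical is_source predicate stands in for the native "is this frame visible in the source space" check, and the target-space routing is not modeled: the callback simply runs directly. space_filtered_tracer is an illustrative name, not a retrace API.

```python
import sys


def space_filtered_tracer(callback, is_source):
    """Return a global trace function that only forwards frames accepted by is_source."""

    def rewrap(local):
        # A local tracer returned by the callback is wrapped again so the same
        # routing applies to later line/return events in that frame.
        if local is None:
            return None

        def local_dispatch(frame, event, arg):
            return rewrap(local(frame, event, arg))

        return local_dispatch

    def global_dispatch(frame, event, arg):
        if not is_source(frame):
            return None  # no-op path: frames outside the source space get no tracing
        return rewrap(callback(frame, event, arg))

    return global_dispatch


def target():
    x = 1
    return x + 1


def other():
    return 0


events = []


def record(frame, event, arg):
    events.append((frame.f_code.co_name, event))
    return record


previous = sys.gettrace()
sys.settrace(space_filtered_tracer(record, lambda f: f.f_code.co_name == "target"))
try:
    target()
    other()  # rejected by is_source, so record never sees it
finally:
    sys.settrace(previous)
```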
Thread switch callbacks and call_at callbacks are stored as
coordinate-transparent wrappers. If a call_at callback needs to run
application-visible work under the target frame, it can call a no-argument
callable wrapped with retrace.include. While one of
those transparent callbacks is running,
coordinates(), thread_delta(), and hash() describe the pinned application
instruction boundary that caused the callback, not the Python frames entered by
the callback itself. Any Python frame created under the callback is marked
transparent and skipped by coordinate walks; if a callback-created generator or
coroutine is resumed later, that frame remains transparent. Transparent frames
do not consume root activation counters or parent child-call ordinals.
with_new_coordinates(callable, *args, **kwargs) calls callable as the root
of a fresh coordinate space. Outer frames are hidden, the first visible frame
starts with call ordinal 0, and coordinate hashes use the default root hash.
When the callable returns or raises, Retrace restores the parent thread's root
ordinal and delta state. This helper is intended for same-process record/replay
tests that need literal thread ids and coordinate roots.
CoordinateSpace().wrap(callable) returns a native callable wrapper that runs
the callable in that coordinate space whenever the wrapper is called. The
module-level exclude and disable decorators are implemented as wrappers for
the disabled space; include and enable are the corresponding root-space
wrappers.
run_disabled(callable, *args, **kwargs) is the immediate-run helper for
executing a callable in the disabled space.
ThreadHandoff(timeout=None) creates a replay handoff gate. handoff.start()
registers the current stable _thread.get_ident() id, then parks that thread
with the GIL released. handoff.to(thread_id) marks the target id runnable,
then parks the current thread until a later transfer marks it runnable. Transfer
tokens are durable: the target thread does not need to be asleep yet. When
timeout is not None, parked waits raise TimeoutError after that many
seconds without a transfer. handoff.close() wakes any sleeping threads and
causes future start() or to() calls to raise RuntimeError.
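A pure-Python model of the handoff semantics described above: durable transfer tokens, optional timeouts, and close() waking parked threads. It is only a sketch built on threading.Condition; the native gate parks with the GIL released and uses Retrace's stable thread ids, and HandoffModel and demo are hypothetical names.

```python
import threading


class HandoffModel:
    """Model of ThreadHandoff: to(thread_id) leaves a durable token, then parks."""

    def __init__(self, timeout=None):
        self._cond = threading.Condition()
        self._runnable = set()  # durable transfer tokens keyed by thread id
        self._closed = False
        self._timeout = timeout

    def _park(self, me):
        # Caller holds self._cond. Wait for a token (or close), then consume it.
        ok = self._cond.wait_for(
            lambda: self._closed or me in self._runnable, timeout=self._timeout
        )
        if not ok:
            raise TimeoutError("no transfer within timeout")
        self._runnable.discard(me)

    def _check_open(self):
        if self._closed:
            raise RuntimeError("handoff is closed")

    def start(self):
        with self._cond:
            self._check_open()
            self._park(threading.get_ident())

    def to(self, thread_id):
        with self._cond:
            self._check_open()
            self._runnable.add(thread_id)  # survives even if the target is not asleep yet
            self._cond.notify_all()
            self._park(threading.get_ident())

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()  # wake every parked thread


def demo():
    handoff = HandoffModel(timeout=5)
    main_id = threading.get_ident()
    order = []

    def worker_body():
        handoff.start()        # parked until main hands off
        order.append("worker")
        handoff.to(main_id)    # wake main, then park until close()

    worker = threading.Thread(target=worker_body)
    worker.start()
    order.append("main-1")
    handoff.to(worker.ident)   # the token is durable, so ordering with start() is safe
    order.append("main-2")
    handoff.close()
    worker.join()
    return order
```

Calling demo() returns ["main-1", "worker", "main-2"]: each thread runs only while it holds the handoff.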
Apply the patch stack to a CPython release:
scripts/apply-patches 3.12.13
Build and install:
scripts/build-release 3.12.13
Package:
scripts/package 3.12.13
By default:
- patched sources go under build/src/
- installed interpreters go under build/install/
- release archives go under build/dist/
Set CPYTHON_REPO_URL to use a CPython mirror or fork instead of
https://github.com/python/cpython.git.
Build the documentation and refresh AI-readable artifacts:
python3 -m pip install -r requirements-docs.txt
python3 scripts/build-docs
The docs build writes the MkDocs site to build/site/ and refreshes
repository-root llms.txt and llms-full.txt. The release workflow includes
those generated files inside each retracesoftware-cpython wheel.
Run the probe smoke test against a built interpreter:
build/src/cpython-3.12.13+retrace/python.exe tests/smoke/probe_capability.py
Run a focused CPython regression test:
build/src/cpython-3.12.13+retrace/python.exe -m test test_sys -q
The full CPython test suite is the stronger validation gate for release builds.
The GitHub workflow can run it unless skip_tests is enabled for a fast
packaging smoke run.
The GitHub workflow builds platform wheels for the retracesoftware-cpython
PyPI project. Wheels contain the minimal runtime overlay plus a
retrace-python launcher. The package version is Retrace's version and is
independent of the CPython version being embedded.
Release versions live in the tracked VERSION file. To release, update
VERSION, commit it, create and push a tag such as v0.4.3, then run the
workflow against that tag. The workflow checks out the tag before reading patch
manifests or building wheels, so later platform builds can be run
retroactively against the same release tree.
Built wheels are uploaded as GitHub Release assets. Uploads are additive: if a
wheel with the same filename already exists on the release, the workflow leaves
it alone. This makes it cheap to build one missing platform later without
rebuilding or re-uploading the whole matrix. PyPI publishing downloads the wheel
assets from the GitHub Release and uses skip-existing, so rerunning publish is
also additive.
Workflow inputs:
- release_tag=v0.4.3 builds from that Git tag and uploads to that GitHub Release. If omitted, the workflow uses the current tag, or v<VERSION> when manually dispatched from a branch.
- python_version=manifest-all builds every CPython release listed in patch manifests.
- python_version=manifest-latest builds the latest listed release per series.
- python_version=3.12.13 builds one exact CPython release.
- target=all builds every supported platform; target=macos-arm64 builds only that platform; target=none skips builds and only publishes existing GitHub Release wheel assets when publish_pypi=true.
- package_version=0.4.3 overrides the version read from VERSION; leave this empty for normal tagged releases.
- skip_tests skips CPython test-suite runs for faster smoke publishing.
- upload_release_assets uploads missing wheel assets to the GitHub Release.
- publish_pypi opts into PyPI Trusted Publishing from GitHub Release assets.
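The manifest-latest selection can be sketched as "newest micro release per major.minor series". This is an illustration only: latest_per_series is a hypothetical helper, the input list is sample data rather than a real manifest, and it assumes plain X.Y.Z versions without pre-release suffixes.

```python
from typing import Dict, Iterable, List, Tuple


def latest_per_series(releases: Iterable[str]) -> List[str]:
    """Pick the newest micro release for each (major, minor) series."""
    best: Dict[Tuple[int, int], Tuple[int, int, int]] = {}
    for text in releases:
        major, minor, micro = (int(part) for part in text.split("."))
        series = (major, minor)
        # Compare numerically: as strings, "3.12.8" would sort after "3.12.13".
        if series not in best or micro > best[series][2]:
            best[series] = (major, minor, micro)
    return [".".join(map(str, best[s])) for s in sorted(best)]


print(latest_per_series(["3.11.9", "3.12.8", "3.12.13", "3.11.12"]))
# ['3.11.12', '3.12.13']
```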
For a quick one-platform release fill-in, rerun the workflow with the same
release_tag, the exact missing target, and publish_pypi=true. The workflow
will upload the new wheel, download all release wheel assets, and publish only
the PyPI files that do not already exist.
Repository layout:
patches/
  3.11/
  3.12/
cpython-overlay/
  Include/internal/
  Lib/
  Modules/
  Python/
scripts/
  apply-patches
  build-release
  package
  package-runtime
  package-wheel
  test-against-vanilla
docs/
  index.md
  api.md
  runtime-package.md
  build-release.md
  patching.md
  testing.md
  probe-abi.md
tests/
  smoke/
Patch directories may be keyed by exact release, such as patches/3.12.8/, or
by minor series, such as patches/3.12/. scripts/apply-patches prefers an
exact release directory, then falls back to the minor-series directory.
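The lookup order can be expressed in a few lines of Python. This is a re-expression for illustration, not the actual scripts/apply-patches implementation; resolve_patch_dir and demo_resolution are hypothetical names.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_patch_dir(patches_root: Path, version: str) -> Optional[Path]:
    """Prefer patches/<exact release>, then fall back to patches/<major.minor>."""
    exact = patches_root / version
    if exact.is_dir():
        return exact
    series = ".".join(version.split(".")[:2])
    fallback = patches_root / series
    if fallback.is_dir():
        return fallback
    return None  # no patch stack for this release


def demo_resolution():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "3.12").mkdir()
        (root / "3.12.8").mkdir()
        picks = [resolve_patch_dir(root, v) for v in ("3.12.8", "3.12.13", "3.11.9")]
        return [p.name if p else None for p in picks]
```

Here demo_resolution() yields ["3.12.8", "3.12", None]: the exact directory wins for 3.12.8, 3.12.13 falls back to the series, and 3.11.9 has no match.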
If a patch directory has a series.toml, the manifest declares the supported
version range, patch order, and releases the stack is expected to apply to.
This section describes how the probes work internally. It is useful when editing the patch stack, but it is not the public shape of the project.
Upstream CPython changes live in patches/. Retrace-owned source files live in
cpython-overlay/ and are copied into the CPython checkout after the patch
stack applies.
This keeps the patch files focused on CPython injection points and build-system changes. Most probe implementation code lives in new compilation units.
The patch changes these CPython-owned structure layouts. Each touched core type
gets one retrace field whose type is defined in
cpython-overlay/Include/cpython/retrace_state.h. That keeps the CPython patch
small: adding Retrace-owned state to these structs usually means editing the
overlay header, not changing every CPython-version patch.
_PyInterpreterFrame gets:
_PyRetraceFrameState retrace;
where _PyRetraceFrameState (together with its nested _PyRetraceFrameDeltaState) is:
typedef struct {
    uint64_t last_call_ordinal;
    uint64_t last_instruction_counter;
    _PyRetraceThreadSpaceState *last_delta_space;
} _PyRetraceFrameDeltaState;
typedef struct {
    int64_t coordinate_bias;
    uint64_t current_call_ordinal;
    uint64_t *previous_call_ordinal_ptr;
    _PyRetraceThreadSpaceState *space;
    _PyRetraceFrameDeltaState delta;
    uint64_t coordinate_hash;
} _PyRetraceFrameState;
coordinate_bias is the running adjustment that makes bias + f_lasti a
logical coordinate and survives frame suspension. current_call_ordinal is the
child-activation counter for the frame's current instruction.
previous_call_ordinal_ptr points back to the parent or space-root ordinal slot
while the frame is active. space records the active thread-local coordinate
space. delta groups the coordinate-delta cache fields. coordinate_hash
caches the frame coordinate hash contribution.
PyThreadState gets:
_PyRetraceThreadState retrace;
where _PyRetraceThreadState is:
typedef struct {
    uint64_t thread_id;
    PyObject *thread_id_object;
    unsigned long cpython_thread_ident;
    _PyRetraceThreadSpaceState root_space;
    _PyRetraceThreadSpaceState *current_space;
    _PyRetraceThreadSpaceState *last_space;
    uint64_t *call_ordinal_ptr;
    uint32_t inherited_space_id;
    int thread_callback_active;
    struct _PyInterpreterFrame *thread_callback_frame;
} _PyRetraceThreadState;
thread_id is the deterministic 64-bit Retrace identity exposed through
Python's thread-ident APIs, and thread_id_object is its cached Python int
object. cpython_thread_ident records CPython's native
_thread.start_new_thread() ident for bridge lookups. The top 16 bits of
thread_id hold (space_id & 0xffff) for the space inherited at thread
creation; the lower 48 bits hold the deterministic hashed id. The space fields
track the thread-local coordinate spaces, current root ordinal slot, and
inherited space id. The callback fields track active callback delivery, pin the
application frame observed by callback code, and prevent recursive callback
delivery.
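The 16/48-bit split of thread_id can be checked with plain integer arithmetic. In this sketch hashed_id is an arbitrary input: how Retrace derives the hashed part is not described here, and masking a wider value to 48 bits is an assumption. pack_thread_id and unpack_thread_id are hypothetical helpers.

```python
SPACE_SHIFT = 48
LOW_MASK = (1 << 48) - 1


def pack_thread_id(space_id: int, hashed_id: int) -> int:
    """Top 16 bits: inherited space id (space_id & 0xffff); low 48 bits: hashed id."""
    return ((space_id & 0xFFFF) << SPACE_SHIFT) | (hashed_id & LOW_MASK)


def unpack_thread_id(thread_id: int):
    """Split a 64-bit thread id back into (space bits, hashed bits)."""
    return thread_id >> SPACE_SHIFT, thread_id & LOW_MASK


tid = pack_thread_id(3, 0x123456789ABC)
print(unpack_thread_id(tid) == (3, 0x123456789ABC))  # True
```

Only the low 16 bits of the space id survive, so space ids that differ above bit 15 share the same prefix in thread_id.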
PyInterpreterState gets:
_PyRetraceInterpreterState retrace;
where _PyRetraceInterpreterState is:
typedef struct {
    _PyRetraceIdentityHashTable *identity_hashes;
    PyThreadState *last_bytecode_thread;
    uint64_t completed_last_bytecode_thread_id;
    int thread_switch_armed;
    _PyRetraceSpaceCallbackState *space_callbacks;
    int call_at_armed;
    int call_at_extra_armed;
    uint64_t call_at_owner_thread_id;
    PyObject *call_at_coordinates;
    PyObject *call_at_callback;
    PyObject *call_at_overshoot_callback;
    int completion_at_armed;
    int completion_at_extra_armed;
    PyObject *completion_at_callback;
} _PyRetraceInterpreterState;
The identity hash table stores coordinate-derived synthetic object hashes. The
interpreter last_bytecode_thread is the cheap global bytecode-thread switch
predicate; it does not define namespace semantics. thread_switch_armed lets
the eval loop skip the namespace callback list when no thread-switch callback
is registered. Per-space callback registrations, including root-space
thread-switch callbacks, live in space_callbacks, a linked list keyed by
coordinate-space id. The root call_at fields hold one armed root-space
coordinate or completion target, plus the exact-hit and overshoot callbacks;
call_at_extra_armed lets the eval loop notice the rare non-root target path
without replacing the root fast path. call_at_owner_thread_id is an internal owner
guard captured from the arming thread, not a public argument.
Each _PyRetraceSpaceCallbackState stores callback state for one coordinate
space:
typedef struct _PyRetraceSpaceCallbackState {
    uint32_t space_id;
    PyThreadState *last_bytecode_thread;
    uint64_t completed_last_bytecode_thread_id;
    PyObject *thread_switch_callback;
    int call_at_armed;
    uint64_t call_at_owner_thread_id;
    PyObject *call_at_coordinates;
    PyObject *call_at_callback;
    PyObject *call_at_overshoot_callback;
    int completion_at_armed;
    PyObject *completion_at_callback;
    struct _PyRetraceSpaceCallbackState *next;
} _PyRetraceSpaceCallbackState;
For thread-switch callbacks, last_bytecode_thread is the last thread that
executed bytecode visible in space_id. It is separate from the interpreter
global fast-path pointer so each namespace can ignore bytecode in other
namespaces.
No existing public object layout such as PyObject, PyTypeObject, or
PyFrameObject is extended directly.
The frame coordinate is:
frame->retrace.coordinate_bias + f_lasti
Fallthrough bytecode execution does not write the coordinate. The bias changes when dispatch jumps, so the logical coordinate remains monotonic across fallthrough and branch paths without paying a per-opcode increment.
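A small simulation of the bias idea. The exact update rule is not given in this document; the sketch assumes one rule consistent with it: on a jump, rewrite the bias so that bias + f_lasti continues one past the previous logical coordinate. Offsets are abstract instruction indices rather than CPython code units, and logical_coordinates is a hypothetical helper.

```python
from typing import Iterable, List, Optional


def logical_coordinates(trace: Iterable[int]) -> List[int]:
    """Map executed instruction offsets to logical coordinates (bias + lasti).

    A step to previous + 1 is fallthrough and leaves the bias alone; any other
    step is treated as a jump, and the bias is rewritten (ASSUMED rule) so the
    logical coordinate still advances by exactly one.
    """
    bias = 0
    previous: Optional[int] = None
    out: List[int] = []
    for lasti in trace:
        if previous is not None and lasti != previous + 1:
            bias = (bias + previous + 1) - lasti  # jump: keep coordinates monotonic
        out.append(bias + lasti)
        previous = lasti
    return out


# A loop body at offsets 2-4 runs twice, then execution falls through to 5.
print(logical_coordinates([0, 1, 2, 3, 4, 2, 3, 4, 5]))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Backward jumps push the bias up and forward jumps can push it below zero, which fits coordinate_bias being a signed int64_t.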
The exported frame path is the pair stream:
(frame_coordinate, current_call_ordinal)
For a child frame, previous_call_ordinal_ptr points at the nearest visible
parent's active ordinal slot. When the child returns, suspends, or unwinds, that
saved parent or space-root slot is incremented. The parent counter is reset
before each parent instruction starts. That gives repeated C-driven callbacks,
such as map(callback, values) or
sorted(values, key=lambda value: ...), distinct coordinates even when
their Python caller is parked at one bytecode offset. Ordinary child calls do
not bump or bias the parent coordinate.
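The ordinal bookkeeping can be modeled with plain objects. This sketch uses f_lasti-style offsets directly as instruction coordinates (no bias) and groups each frame as (call_ordinal, instruction_coordinate); Space and Frame are hypothetical model classes standing in for the native thread-space and frame state.

```python
from typing import List, Optional


class Space:
    def __init__(self) -> None:
        self.root_ordinal = 0  # internal root activation counter


class Frame:
    def __init__(self, space: Space, parent: Optional["Frame"] = None) -> None:
        self.space, self.parent = space, parent
        self.lasti = 0
        self.current_call_ordinal = 0  # children started at the current instruction
        # A new frame takes its ordinal from the saved parent or space-root slot.
        self.call_ordinal = parent.current_call_ordinal if parent else space.root_ordinal

    def step(self, lasti: int) -> None:
        """Start a new instruction: reset this frame's child-activation counter."""
        self.lasti = lasti
        self.current_call_ordinal = 0

    def exit(self) -> None:
        """Return, suspend, or unwind: bump the saved parent or space-root slot."""
        if self.parent is not None:
            self.parent.current_call_ordinal += 1
        else:
            self.space.root_ordinal += 1

    def words(self) -> List[int]:
        prefix = self.parent.words() if self.parent else []
        return prefix + [self.call_ordinal, self.lasti]


space = Space()
caller = Frame(space)
caller.step(8)  # caller parked at instruction 8, e.g. inside map(callback, values)
paths = []
for _ in range(3):  # three C-driven callback invocations
    child = Frame(space, caller)
    paths.append(child.words())
    child.exit()
caller.exit()
next_root = Frame(space)
```

The three callback frames get paths [0, 8, 0, 0], [0, 8, 1, 0] and [0, 8, 2, 0], while the caller's own instruction coordinate stays at 8.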
The native layer does not hide ordinary Python frames. Retrace callback delivery pins the application frame that caused the callback, so callbacks observe that application coordinate rather than their own callback implementation frames.
Each thread-space has an internal root activation counter. When a visible frame starts without a visible parent in that space, activation saves that counter as the frame's parent slot. When the frame returns, suspends, or unwinds, Retrace writes the incremented value back to the space. The exported coordinate vector therefore stays as the visible frame path without a separate synthetic root word or thread-id prefix.
Coordinate snapshot code walks visible frames directly. Normal execution does not pay for depth maintenance.
frame->retrace.delta holds the delta stream's remembered instruction and call
coordinate words, or unset sentinels. Normal frame initialization stores the
unset sentinels, and linking a frame into the active chain refreshes them.
frame->retrace.coordinate_hash is a lazy cache slot for the frame coordinate
hash contribution. The exported _retrace.hash() value is computed from the
same visible frame coordinate stream returned by _retrace.coordinates().
A simple per-thread bytecode counter would make every later coordinate depend on every earlier bytecode executed by that thread. That is brittle for replay. Local effects such as cache warmup, adaptive interpreter state, or a leaf helper taking a slightly different internal path can change how much bytecode runs inside one small region without changing the surrounding application control flow.
With a global counter, that local difference shifts all later call_at targets and thread scheduling coordinates. The replay timeline is then displaced even if the program returns to the same parent frame and continues through the same visible application path.
Frame coordinates intentionally localize that damage. A leaf frame can have a different internal coordinate without automatically changing the coordinates of older visible frames. Parent coordinates move when visible control crosses a frame boundary or when that parent executes its own jumps, not because arbitrary child bytecode happened to run. This gives replay a coordinate space that can resynchronize at non-local boundaries instead of treating a harmless local instruction-count difference as a global clock skew.
This does not make semantic divergence safe. If a local difference changes an observable result, replay still has to fail. The point is narrower: coordinate numbering should not amplify contained execution-detail differences into unrelated call_at, scheduling, or stack-position failures.
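A toy comparison makes the localization argument concrete. The trace is hypothetical and ignores call ordinals: a parent runs offsets 0 to 2, the call at offset 2 enters a leaf of varying length, and the parent resumes at offsets 3 and 4. post_call_positions is an illustrative helper.

```python
def post_call_positions(leaf_steps: int):
    """Return (global_clock, frame_local_offset) for the parent's offset 4."""
    trace = [("parent", 0), ("parent", 1), ("parent", 2)]
    trace += [("leaf", i) for i in range(leaf_steps)]
    trace += [("parent", 3), ("parent", 4)]
    target = ("parent", 4)
    global_clock = trace.index(target)  # one counter over every executed instruction
    frame_local = target[1]             # coordinate inside the parent's own frame
    return global_clock, frame_local


print(post_call_positions(3))  # (7, 4)
print(post_call_positions(5))  # (9, 4)
```

Two extra leaf instructions shift the global clock for every later instruction, while the parent's frame-local coordinate is unchanged.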
The native side stores no previous vector and no previous stack size. Each visible frame remembers its last emitted call ordinal and instruction coordinate. A delta call compares the current root-first coordinate word path with remembered frame coordinates. It returns the common prefix length plus the changed suffix, so callers replace everything after that prefix.
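A model of delta computation that, like the native side, keeps no previous vector: each frame remembers the words it last emitted, and None plays the role of the unset sentinel. VisibleFrame, thread_delta, and apply_delta are hypothetical model names; the real delta state also tracks the coordinate space, which is omitted here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VisibleFrame:
    call_ordinal: int
    coordinate: int
    remembered: Optional[List[int]] = None  # None stands in for the unset sentinel


def thread_delta(stack: List[VisibleFrame]) -> List[int]:
    """Return [common_prefix_count, *new_suffix] using per-frame memory only."""
    words: List[int] = []
    prefix: Optional[int] = None
    for frame in stack:  # root-first walk of visible frames
        current = [frame.call_ordinal, frame.coordinate]
        if prefix is None:
            for i, word in enumerate(current):
                if frame.remembered is None or frame.remembered[i] != word:
                    prefix = len(words) + i  # first differing coordinate word
                    break
        words.extend(current)
        frame.remembered = current  # remember what this delta emitted
    if prefix is None:
        prefix = len(words)  # consumer's del drops any frames that have returned
    return [prefix, *words[prefix:]]


def apply_delta(stack_words: List[int], delta: List[int]) -> None:
    del stack_words[delta[0]:]
    stack_words.extend(delta[1:])


a, b = VisibleFrame(0, 5), VisibleFrame(0, 3)
mirror: List[int] = []
first = thread_delta([a, b])   # [0, 0, 5, 0, 3]
apply_delta(mirror, first)     # mirror == [0, 5, 0, 3]
a.coordinate = 7               # b returned and a advanced
second = thread_delta([a])     # [1, 7]
apply_delta(mirror, second)    # mirror == [0, 7]
```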
Thread execution-order telemetry is modeled as bytecode-thread switches. The switch callback is registered per coordinate space and is called as:
callback(previous_delta, next_thread_id)
The eval loop checks call_at and then checks for a thread switch before
executing each bytecode instruction. Keeping call_at at the same bytecode
boundary, immediately before the switch check, lets replay yield points line up
with the exact instruction that is about to run. The switch hot path is one
interpreter-level pointer comparison: if the bytecode thread has not changed
since the previous instruction, Retrace does no namespace work. If the thread
has changed and at least one namespace has a switch callback armed, Retrace
walks the registered namespace callback states.
Each namespace callback state has its own last_bytecode_thread. A namespace
sees only bytecode whose current frame is visible in that namespace. Bytecode in
other namespaces does not update that namespace cursor and cannot by itself
cause that namespace's callback to run.
The exact per-namespace algorithm is:
if current instruction is not visible in space S:
    ignore S
elif S.last_bytecode_thread is current thread:
    assert no callback delta has been produced
    ignore S
elif S.last_bytecode_thread is NULL:
    if S.completed_last_bytecode_thread_id is set:
        S.last_bytecode_thread = current thread
        clear S.completed_last_bytecode_thread_id
        callback(None, current.thread_id)
    else:
        S.last_bytecode_thread = current thread
        do not call back
else:
    previous = S.last_bytecode_thread
    previous_delta = thread_delta(previous, S)
    S.last_bytecode_thread = current thread
    if previous_delta is meaningful:
        callback(previous_delta, current.thread_id)
previous_delta is computed before the namespace cursor is advanced. It is the
coordinate delta for the previous visible thread in that same namespace, with
the same (common_prefix_count, *new_suffix) shape returned by
thread_delta(space), or None when the previous visible thread completed
naturally. current.thread_id is passed as the cached Python int for the new
thread's stable Retrace id. The callback runs on the new/current thread before
the bytecode instruction that caused the switch is executed.
A same-thread observation is never emitted and does not consume the thread's
delta state; debug builds assert that no callback delta object exists on that
path.
This makes other namespaces invisible. If A executes in S, B executes outside S, then A executes in S again, S observes no switch. If A executes in S, B executes outside S, then C executes in S, S observes one switch from A to C and receives A's delta in S.
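The per-namespace algorithm translates directly into Python for checking scenarios like these. In this sketch thread ids are strings, callbacks are recorded in a list, delta_of stands in for thread_delta(previous, S), and two details are assumptions: "meaningful" is modeled as "not None", and thread_completed is a guess at how a natural exit feeds completed_last_bytecode_thread_id. The interpreter-wide fast-path pointer check is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class SpaceCursor:
    last_thread: Optional[str] = None
    completed_thread: Optional[str] = None
    calls: List[Tuple[object, str]] = field(default_factory=list)  # recorded callbacks


def observe(space: SpaceCursor, current: str, visible: bool,
            delta_of: Callable[[str], object]) -> None:
    """Run the switch check before one instruction executed by thread current."""
    if not visible:
        return  # bytecode outside S never touches S's cursor
    if space.last_thread == current:
        return  # same-thread no-op: no delta consumed
    if space.last_thread is None:
        space.last_thread = current
        if space.completed_thread is not None:
            space.completed_thread = None
            space.calls.append((None, current))  # previous thread completed naturally
        return  # otherwise: first visible thread, no callback
    previous_delta = delta_of(space.last_thread)  # computed before the cursor moves
    space.last_thread = current
    if previous_delta is not None:  # ASSUMPTION: "meaningful" means not None
        space.calls.append((previous_delta, current))


def thread_completed(space: SpaceCursor, thread: str) -> None:
    """ASSUMPTION: a natural exit clears the cursor and records the completed id."""
    if space.last_thread == thread:
        space.last_thread = None
        space.completed_thread = thread


def delta_of(thread: str):
    return ("delta-of", thread)


s1 = SpaceCursor()
for thread, visible in [("A", True), ("B", False), ("A", True)]:
    observe(s1, thread, visible, delta_of)

s2 = SpaceCursor()
for thread, visible in [("A", True), ("B", False), ("C", True)]:
    observe(s2, thread, visible, delta_of)

s3 = SpaceCursor()
observe(s3, "A", True, delta_of)
thread_completed(s3, "A")
observe(s3, "B", True, delta_of)
```

s1 records no callback, s2 records one switch carrying A's delta and C's id, and s3 reports the completion sentinel None for B.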
Callback return values are ignored, callback exceptions are reported through
PyErr_WriteUnraisable(), and reentrant delivery is suppressed while the
current thread is already inside a Retrace callback. Switch callbacks are
coordinate-transparent: the current coordinate inside the callback is the
pinned application boundary for the instruction about to execute on the new
thread. Python helper calls made by the callback, and any generator/coroutine
frame created by the callback, are skipped by coordinates, deltas, and
coordinate hashes.
The eval loop checks a cheap root armed flag and root thread id, or the rare
non-root armed flag, then frame visibility. Only after those match does it
compare the full root-first frame coordinate tuple for the selected space.
Exact matches run the hit callback. If the current coordinate has passed the
target without an exact match, the call_at is cleared and the optional
overshoot callback runs instead. call_at callbacks are
coordinate-transparent; a no-argument callable wrapped with retrace.include
runs visibly as a child of the target frame, then returns to the
transparent callback. call_at targets are armed by the current thread for the
current thread; callback transparency does not suppress delivery after
execution returns to application bytecode.
The public CoordinateSpace.monitoring proxy wraps sys.monitoring callback
registration without changing CPython's tool-level event switches. The native
SpaceDispatch callable handles the active-space check; when no case matches
and no default callable was supplied, it returns None directly from C without
entering Python.
space.monitoring.register_callback(
    tool_id, event, callback, target_space=retrace.root_space)
space.monitoring.set_events(tool_id, event_set)
Here space is the source coordinate space whose events should be
observed. target_space selects where the Python callback executes. Passing
callback=None removes that source-space callback; when no source spaces
remain for a (tool_id, event) pair, Retrace unregisters the underlying
sys.monitoring callback. On Python versions without sys.monitoring, use of
the proxy raises RuntimeError.
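The registration bookkeeping can be modeled without sys.monitoring itself. In this sketch the injected register and unregister hooks stand in for sys.monitoring.register_callback(tool_id, event, func) and register_callback(tool_id, event, None), active_space stands in for the native active-space check, and target-space execution is not modeled. SpaceMonitoringModel is a hypothetical name and the tool and event numbers are arbitrary.

```python
from typing import Callable, Dict, Hashable, Tuple

Key = Tuple[int, int]  # (tool_id, event)


class SpaceMonitoringModel:
    """One underlying callback per (tool_id, event), routed by active source space."""

    def __init__(self, register, unregister, active_space):
        self._register = register
        self._unregister = unregister
        self._active_space = active_space
        self._routes: Dict[Key, Dict[Hashable, Callable]] = {}

    def register_callback(self, source_space, tool_id, event, callback):
        key = (tool_id, event)
        routes = self._routes.get(key)
        if callback is None:
            if routes is not None:
                routes.pop(source_space, None)
                if not routes:  # last source space gone: drop the real callback
                    del self._routes[key]
                    self._unregister(tool_id, event)
            return
        if routes is None:
            routes = self._routes[key] = {}
            self._register(tool_id, event, self._dispatcher(key))
        routes[source_space] = callback

    def _dispatcher(self, key):
        def dispatch(*args):
            callback = self._routes.get(key, {}).get(self._active_space())
            if callback is None:
                return None  # events from other spaces: no-op
            return callback(*args)
        return dispatch


installed = {}
current = {"space": "app"}
model = SpaceMonitoringModel(
    register=lambda t, e, f: installed.__setitem__((t, e), f),
    unregister=lambda t, e: installed.pop((t, e)),
    active_space=lambda: current["space"],
)
seen = []
model.register_callback("app", 2, 1, lambda *args: seen.append(args))
dispatch = installed[(2, 1)]
dispatch("code", 0)              # "app" is active: delivered
current["space"] = "control"
dispatch("code", 4)              # another space is active: no-op
model.register_callback("app", 2, 1, None)  # last source removed: unregistered
```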
CoordinateSpace.settrace(callback, target_space=target) and
CoordinateSpace.setprofile(callback, target_space=target) use the same
dispatch model for the current thread's sys.settrace and sys.setprofile
hooks. The source space selects which events run the hook; target_space
selects where the hook's Python frames execute. Passing None clears that
source-space hook. Trace callbacks that return a local trace function have the
returned callable wrapped into the same target space.
- Keep CPython injection points minimal and obvious.
- Put Retrace-owned implementation code in cpython-overlay/.
- Keep generated patches reviewable and release-specific.
- Do not add compatibility shims for old probe APIs or trace formats.
- Treat the probe ABI as private to a CPython version + retrace_probe_abi pairing.
- Preserve graceful degradation on vanilla CPython.
More detail lives in docs/probe-abi.md.