A Python command-line tool for running repeated Open Traceability Assessments against an open-source project, open-science project, report, dashboard, or other public sustainability-related evidence artifact.
The tool uses the OpenAI API and/or the Anthropic (Claude) API to assess how externally inspectable the evidence chain behind a project or claim is. It can run the same assessment multiple times, capture score variation across runs, preserve references used for scoring, show derivations for each score, and produce both structured JSON and a Markdown report. You can run against a single provider or against both at once to compare how different models score the same project.
Rather than asking whether an environmental statement, insight, report, or number is true or false, the Open Traceability Concept asks:
How open, linked, and externally inspectable is the evidence chain behind a sustainability or environmental claim?
The concept is developed by the Open Traceability Initiative, which describes Open Traceability as the externally inspectable connection between an environmental claim and the specific evidence, methods, assumptions, and publications from which that claim was derived.
The project responds to a common weakness in sustainability decision-making: claims may be presented as evidence-based, but the chain linking evidence to the claim is often difficult to inspect. Data, models, assumptions, workflows, review processes, and publications may exist, but they are not always connected in ways that allow meaningful external scrutiny.
Open Traceability therefore shifts attention from openness of isolated artifacts to the inspectability of the claim-support chain. A dataset, repository, report, or paper may be public, but it is only traceable when the links between inputs, methods, execution, review, and outputs are explicit enough for others to examine.
This repository provides a reusable assessment runner that:
- Fetches an Open Traceability definition and project evidence.
- Supports GitHub repositories, web pages, and PDF reports.
- Runs the assessment multiple independent times.
- Works with OpenAI models, Anthropic (Claude) models, or both providers in one run.
- Scores six Open Traceability dimensions from 0 to 100.
- Optionally computes an overall total score.
- Captures score derivations for every stage and run.
- Preserves references used by the model for each score and flags references that were not found in the collected evidence bundle.
- Writes results incrementally and stores every run in its own timestamped, project-named folder.
- Produces a structured JSON file for downstream analysis.
- Produces a Markdown report with tables, consolidated references, limitations, a per-model score comparison (when more than one model is used), and a single-paragraph summary.
The tool is intended as an assessment assistant. It does not prove that a claim is true, unbiased, or scientifically valid. Instead, it helps identify whether the evidence, assumptions, methods, limitations, uncertainty, and possible errors behind a claim can be inspected by others.
The assessment uses six dimensions derived from the Open Traceability definition.
Assesses whether the relevant inputs are identifiable, documented, attributable, reusable, verifiable, and versioned. Strong traceability means that external actors can inspect where the data came from, how it was collected or produced, how it was processed, what uncertainty or quality controls apply, and under what conditions it can be reused.
Assesses whether the analytical logic is visible through code, models, methods, dependencies, documentation, configuration, and licensing. Strong traceability normally requires version-controlled source code, clear methods, dependency information, and a recognized open-source license.
Assesses whether workflows, scripts, parameters, computational environments, outputs, and provenance make the path from inputs to outputs inspectable. Strong execution traceability exists when an external actor can understand and, ideally, repeat the computation that produced the result.
Assesses whether critique, issue tracking, review, correction processes, and responses to challenge are visible. Strong review traceability means users can inspect not only the final claim, but also how it was questioned, tested, corrected, or improved.
Assesses whether reports, papers, dashboards, policy outputs, or explanatory materials are accessible and clearly documented. Strong publication traceability means public outputs state the claim clearly, describe the methods and evidence base, cite supporting artifacts, and preserve enough context for external scrutiny.
Assesses whether the full chain across data, methods, execution, review, and publications is explicit, specific, versioned, and externally verifiable. This dimension is critical because openness without linkage does not produce traceability. Public artifacts are not enough if they cannot be connected to the claim they support.
The broader Open Traceability framework proposes using open digital infrastructure to support assessment, including:
- OpenAlex for publication-layer evidence, citation networks, open-access status, licensing signals, and correction or retraction markers.
- ecosyste.ms for software metadata, repository health, dependencies, licensing, maintenance, and governance signals.
- OpenSustain.tech as a catalog of open sustainability technology.
- Large language models as assessment assistants that can identify candidate claims, surface relevant artifacts, classify evidence types, and summarize likely gaps for human review.
This runner implements the LLM-assisted part of that architecture. It collects a bounded evidence bundle and asks the model to produce structured, reference-backed assessments.
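A minimal sketch of what "bounded" could mean in practice is shown below. It assumes a simple character budget; the `EvidenceItem` class, the `bound_bundle` function, and the `max_chars` parameter are hypothetical names used for illustration and are not taken from the runner itself.

```python
from dataclasses import dataclass


@dataclass
class EvidenceItem:
    url: str   # where the text was fetched from
    text: str  # extracted plain text


def bound_bundle(items: list[EvidenceItem], max_chars: int) -> list[EvidenceItem]:
    """Keep items in order until the character budget is used up (assumed policy)."""
    bundle: list[EvidenceItem] = []
    remaining = max_chars
    for item in items:
        if remaining <= 0:
            break
        clipped = item.text[:remaining]  # truncate the item that crosses the budget
        bundle.append(EvidenceItem(item.url, clipped))
        remaining -= len(clipped)
    return bundle


items = [
    EvidenceItem("https://example.org/README", "a" * 60),
    EvidenceItem("https://example.org/LICENSE", "b" * 60),
    EvidenceItem("https://example.org/docs", "c" * 60),
]
bundle = bound_bundle(items, max_chars=100)
print([(item.url, len(item.text)) for item in bundle])
```

With a budget of 100 characters, the first item is kept whole, the second is clipped to 40 characters, and the third is dropped. A real collector might prioritize files or count tokens instead of characters.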
Create a virtual environment and install the dependencies:

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

Example requirements.txt:

    openai>=1.99.0
    anthropic>=0.69.0
    requests>=2.32.0
    beautifulsoup4>=4.12.0
    pydantic>=2.8.0
    pypdf>=4.3.0

The provider SDKs are imported lazily, so you only need the one(s) you actually use: openai for --provider openai, anthropic for --provider anthropic, or both for --provider both.
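One way lazy imports like this can be arranged is sketched below. The `PROVIDER_MODULES` mapping, the `load_sdks` function, and its `importer` parameter are assumptions made for illustration; the runner's own code may be organized differently.

```python
import importlib

# Hypothetical mapping from the --provider value to the SDK modules it needs.
PROVIDER_MODULES = {
    "openai": ["openai"],
    "anthropic": ["anthropic"],
    "both": ["openai", "anthropic"],
}


def load_sdks(provider: str, importer=importlib.import_module) -> dict:
    """Import only the SDKs required by the selected provider."""
    modules = {}
    for name in PROVIDER_MODULES[provider]:
        try:
            modules[name] = importer(name)
        except ImportError as exc:
            raise RuntimeError(
                f"--provider {provider} needs the '{name}' package (pip install {name})"
            ) from exc
    return modules
```

Because the import happens inside the function, calling `load_sdks("anthropic")` never touches the openai package, so a missing SDK only matters for the provider that needs it.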
Provide an API key for each provider you intend to use. Create the key in the relevant platform, then expose it as an environment variable:

    # Required for --provider openai (and --provider both)
    export OPENAI_API_KEY="your_openai_api_key_here"
    # Required for --provider anthropic (and --provider both)
    export ANTHROPIC_API_KEY="your_anthropic_api_key_here"

The runner only checks for the keys it needs: OPENAI_API_KEY for OpenAI runs, ANTHROPIC_API_KEY for Anthropic runs, and both when --provider both is selected.
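The key check described above might look roughly like the following sketch. The environment variable names come from this README; the `REQUIRED_KEYS` mapping and the `missing_keys` helper are hypothetical.

```python
import os

# Environment variables required for each --provider value.
REQUIRED_KEYS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "both": ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"],
}


def missing_keys(provider: str, env=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS[provider] if not env.get(key)]


# Example with an explicit mapping instead of the real environment.
print(missing_keys("both", env={"OPENAI_API_KEY": "sk-test"}))  # ['ANTHROPIC_API_KEY']
```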
If you are assessing GitHub repositories and expect to fetch many files, you can also provide a GitHub token to reduce rate-limit issues:

    export GITHUB_TOKEN="your_github_token_here"

Run the assessment against the default example repository:

    python ota.py \
      --project-url https://github.com/natcap/invest \
      --runs 5 \
      --include-total \
      --out-prefix invest_open_traceability

Run the assessment against another project, report, or web page:

    python ota.py \
      --project-url https://example.org/report.pdf \
      --runs 3 \
      --include-total \
      --out-prefix example_report_traceability

Omit the overall total score while still scoring the six dimensions:

    python ota.py \
      --project-url https://github.com/natcap/invest \
      --runs 5 \
      --no-include-total

Use a different OpenAI model:

    python ota.py \
      --project-url https://github.com/natcap/invest \
      --runs 3 \
      --model gpt-5.5 \
      --reasoning-effort medium

Assess with Anthropic (Claude) instead of OpenAI:

    python ota.py \
      --project-url https://github.com/natcap/invest \
      --runs 3 \
      --provider anthropic \
      --anthropic-model claude-opus-4-8

Assess with both providers at once and compare them in one report:

    python ota.py \
      --project-url https://github.com/natcap/invest \
      --runs 3 \
      --provider both \
      --model gpt-5.5 \
      --anthropic-model claude-opus-4-8

With --provider both, --runs applies to each provider, so the example above produces 6 runs in total (3 per model). The report then includes an "Average score by model" table comparing the two.
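The run arithmetic can be sketched as follows. The `plan_runs` function is hypothetical, and the provider order and consecutive run numbering are assumptions; only the "runs per provider" rule comes from the text above.

```python
def plan_runs(provider: str, runs: int) -> list[tuple[int, str]]:
    """Return (run_number, provider_name) pairs for one invocation."""
    providers = ["openai", "anthropic"] if provider == "both" else [provider]
    plan = []
    for name in providers:
        for _ in range(runs):
            # Number runs consecutively across providers (assumed scheme).
            plan.append((len(plan) + 1, name))
    return plan


plan = plan_runs("both", 3)
print(len(plan))          # 6
print(plan[0], plan[-1])  # (1, 'openai') (6, 'anthropic')
```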
| Option | Default | Description |
|---|---|---|
| `--provider` | `openai` | Which provider(s) to assess with: openai, anthropic, or both. |
| `--model` | `gpt-5.5` | OpenAI model id (used for openai and both). |
| `--anthropic-model` | `claude-opus-4-8` | Anthropic (Claude) model id (used for anthropic and both). |
| `--reasoning-effort` | `medium` | none, low, medium, high, or xhigh. For OpenAI this maps to the reasoning parameter; for Anthropic it maps to adaptive thinking plus the effort parameter. Use none to disable. |
| `--runs` | `3` | Number of runs per selected provider. |
| `--output-dir` | `reports` | Base directory for the per-run output folder. |
| `--out-prefix` | `open_traceability_assessment` | Filename prefix for the JSON and Markdown outputs. |
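For readers who want to see how these options fit together, here is a hedged `argparse` sketch. The option names and the defaults in the table come from this README; the default for `--project-url` and the default state of `--include-total` are assumptions, and the actual parser in ota.py may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ota.py")
    # Assumed default: the example repository used throughout this README.
    parser.add_argument("--project-url", default="https://github.com/natcap/invest")
    parser.add_argument("--provider", choices=["openai", "anthropic", "both"], default="openai")
    parser.add_argument("--model", default="gpt-5.5")
    parser.add_argument("--anthropic-model", default="claude-opus-4-8")
    parser.add_argument(
        "--reasoning-effort",
        choices=["none", "low", "medium", "high", "xhigh"],
        default="medium",
    )
    parser.add_argument("--runs", type=int, default=3)
    parser.add_argument("--output-dir", default="reports")
    parser.add_argument("--out-prefix", default="open_traceability_assessment")
    # BooleanOptionalAction (Python 3.9+) creates --include-total and
    # --no-include-total together; the default value here is an assumption.
    parser.add_argument("--include-total", action=argparse.BooleanOptionalAction, default=True)
    return parser


args = build_parser().parse_args(["--provider", "both", "--runs", "5", "--no-include-total"])
print(args.provider, args.runs, args.include_total)  # both 5 False
```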
⚠️ Warning: this tool sends a large evidence bundle to the model on every run, so it consumes a significant number of tokens. A default assessment can cost up to around $1, depending on the model and plan you use.
Each invocation writes its results into a dedicated folder named with a timestamp and the assessed project, under --output-dir (default reports). For example:

    reports/20260612-101648_invest_open_traceability/
    ├── invest_open_traceability.runs.json
    └── invest_open_traceability.report.md

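The layout above could be produced along the lines of this sketch. The timestamp format is inferred from the example folder name; the `run_paths` function and its `label` parameter are hypothetical, and how the runner derives the project part of the folder name is not specified here.

```python
from datetime import datetime
from pathlib import Path


def run_paths(output_dir: str, label: str, out_prefix: str, now: datetime):
    """Return the run folder plus the JSON and Markdown paths inside it."""
    stamp = now.strftime("%Y%m%d-%H%M%S")  # e.g. 20260612-101648
    folder = Path(output_dir) / f"{stamp}_{label}"
    return folder, folder / f"{out_prefix}.runs.json", folder / f"{out_prefix}.report.md"


folder, json_path, md_path = run_paths(
    "reports",
    "invest_open_traceability",
    "invest_open_traceability",
    datetime(2026, 6, 12, 10, 16, 48),
)
print(folder.as_posix())  # reports/20260612-101648_invest_open_traceability
print(json_path.name)     # invest_open_traceability.runs.json
print(md_path.name)       # invest_open_traceability.report.md
```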
Results are written incrementally: the JSON is saved after every successful run, so a transient failure on a later run does not discard the runs that already completed.
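A minimal sketch of this incremental-save pattern follows. The `human_review`, `approved`, and `runs` keys are named in this README; the `save_runs` and `run_all` helpers, the temporary-file swap, and the choice to continue after a failed run are assumptions for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def save_runs(path: Path, runs: list) -> None:
    """Write all completed runs, replacing the previous file in one step."""
    payload = {"human_review": {"approved": False}, "runs": runs}
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    os.replace(tmp, path)  # swap in the new file so readers never see a partial write


def run_all(path: Path, n_runs: int, assess) -> list:
    """Call assess(run_number) n_runs times, saving after each success."""
    completed = []
    for number in range(1, n_runs + 1):
        try:
            result = assess(number)
        except Exception as exc:  # a failed run is reported and skipped
            print(f"run {number} failed: {exc}")
            continue
        completed.append(result)
        save_runs(path, completed)
    return completed


def flaky_assess(number: int) -> dict:
    # Stand-in for a model call; run 2 fails to show that earlier results survive.
    if number == 2:
        raise RuntimeError("transient API error")
    return {"run": number}


out_path = Path(tempfile.mkdtemp()) / "demo.runs.json"
done = run_all(out_path, 3, flaky_assess)
print([r["run"] for r in done])  # [1, 3]
```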
The JSON output is an object with a top-level human_review entry (an approved flag that starts as false, plus reviewer instructions) and a runs array. Once a human reviewer has validated all claims against the references provided, they set approved to true. Each entry in runs contains the full structured assessment data for that run, including:
- Run number.
- Project name and URL.
- The model that produced the run.
- Six stage scores.
- Score derivations.
- Evidence references.
- Uncertainty level (low/medium/high) and a one-line reason.
- Optional total score.
- Per-run summary paragraph.
- Limitations.
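Downstream code can use the approval flag as a gate before consuming the runs. The sketch below relies only on the `human_review`, `approved`, and `runs` keys described above; the `load_approved_runs` function is a hypothetical helper, not part of the tool.

```python
import json


def load_approved_runs(raw_json: str) -> list:
    """Parse a runs file and return its runs only if a reviewer approved it."""
    document = json.loads(raw_json)
    review = document.get("human_review", {})
    if review.get("approved") is not True:
        raise ValueError("human_review.approved is not true; validate the references first")
    return document["runs"]


# Example with an in-memory document instead of a file on disk.
approved = json.dumps({"human_review": {"approved": True}, "runs": [{"run": 1}]})
print(load_approved_runs(approved))  # [{'run': 1}]
```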
The Markdown report summarizes across runs rather than repeating each run verbatim, and contains:
- A human-reviewer approval checkbox at the top, to be checked once all claims have been validated against the references provided.
- The provider model(s) used, with the run numbers each produced.
- A final single-paragraph summary.
- A score table across runs, with average and standard deviation by dimension.
- An "Average score by model" comparison table (only when more than one model is used).
- An optional total score table.
- Consolidated references by stage, deduplicated across runs, with the modal reported uncertainty and a ⚠️ marker on any reference whose URL was not part of the collected evidence bundle.
- Consolidated, deduplicated limitations across runs.
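The aggregation behind these report sections can be approximated as in the sketch below. The run structure (`model`, `uncertainty`, and `scores` keys), the dimension names, and all function names are hypothetical, and the use of the sample standard deviation is an assumption; the tool's own formulas may differ.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical run records with two of the six dimensions, for brevity.
runs = [
    {"model": "model-a", "uncertainty": "medium", "scores": {"data": 60, "linkage": 40}},
    {"model": "model-a", "uncertainty": "high", "scores": {"data": 70, "linkage": 50}},
    {"model": "model-b", "uncertainty": "medium", "scores": {"data": 80, "linkage": 60}},
]


def summarize_scores(runs: list) -> dict:
    """Return {dimension: (mean, sample standard deviation)} across runs."""
    summary = {}
    for dim in runs[0]["scores"]:
        values = [run["scores"][dim] for run in runs]
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[dim] = (mean(values), spread)
    return summary


def average_by_model(runs: list, dim: str) -> dict:
    """Return {model: mean score} for one dimension."""
    by_model = {}
    for run in runs:
        by_model.setdefault(run["model"], []).append(run["scores"][dim])
    return {model: mean(values) for model, values in by_model.items()}


def modal_uncertainty(runs: list) -> str:
    """Return the most frequently reported uncertainty level."""
    return Counter(run["uncertainty"] for run in runs).most_common(1)[0][0]


def flag_references(references: list, bundle_urls: set) -> list:
    """Pair each reference URL with True when it was NOT in the evidence bundle."""
    return [(url, url not in bundle_urls) for url in references]


print(summarize_scores(runs)["data"])   # (70, 10.0)
print(average_by_model(runs, "data"))   # {'model-a': 65, 'model-b': 80}
print(modal_uncertainty(runs))          # medium
```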
The default scoring scale is:
| Score range | Interpretation |
|---|---|
| 0-20 | Little or no public evidence for this dimension |
| 21-40 | Partial, fragmentary, or hard-to-verify evidence |
| 41-60 | Moderate evidence, but important gaps remain |
| 61-80 | Strong public evidence with some limitations |
| 81-100 | Excellent, explicit, versioned, reusable, externally verifiable evidence chain |
Scores should be interpreted as evidence-bundle-based traceability estimates, not as a definitive judgment of scientific truth or project quality.
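The scale can be expressed as a small lookup, sketched here. The band boundaries and labels are copied from the table above; the `BANDS` list and the `interpret` function are hypothetical helpers.

```python
# (inclusive upper bound, interpretation) pairs from the scoring table.
BANDS = [
    (20, "Little or no public evidence for this dimension"),
    (40, "Partial, fragmentary, or hard-to-verify evidence"),
    (60, "Moderate evidence, but important gaps remain"),
    (80, "Strong public evidence with some limitations"),
    (100, "Excellent, explicit, versioned, reusable, externally verifiable evidence chain"),
]


def interpret(score: int) -> str:
    """Map a 0-100 score to its interpretation band."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for upper, label in BANDS:
        if score <= upper:
            return label
    raise AssertionError("unreachable: every valid score falls in a band")


print(interpret(72))  # Strong public evidence with some limitations
```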
- Select a bounded project, report, dashboard, or claim.
- Run the assessment with at least three independent runs.
- Inspect the references and derivations, not only the scores.
- Identify where missing links reduce traceability.
- Manually validate important findings before publication or decision use.
- Use the report as a draft traceability profile, not as a final audit.
Open Traceability can be applied to:
- Open-source sustainability software.
- Scientific reports and assessment outputs.
- Environmental dashboards.
- Climate and energy policy evidence.
- Sustainability claims in journalism.
- Monitoring systems based on geospatial or operational data.
- Research outputs that have been corrected, retracted, or contested.
This tool has important limitations:
- It depends on the evidence it can fetch or is given.
- It may miss relevant artifacts that are not linked from the target URL.
- It cannot independently verify every scientific or technical claim.
- It may over- or under-score dimensions where evidence is ambiguous.
- It should be paired with human review, especially for policy-relevant or high-stakes assessments.
- Repeated runs expose variation, but they do not eliminate model uncertainty.
