GitHub - chrisx9z/DataJourney: Design first Open Source Data Management Toolkit

Design AI-native data products with open-source building blocks.
DataJourney teaches how data, AI, retrieval, dashboards, agents, evaluation, and packaging fit together as one usable system.

Recipient: GitHub Secure Open Source Fund
Sponsor DataJourneyHQ | Official announcement

What Is DataJourney?

DataJourney is a design-first open-source toolkit for learning how to assemble AI-powered data products.

Most data and AI examples teach one tool at a time. DataJourney focuses on the system around the tool: how data is discovered, profiled, retrieved, analyzed by models, turned into an interface, evaluated, and packaged so another person can actually use it.

The project is both:

a learning environment for understanding AI and data system design
a practical toolkit for composing open-source workflows into runnable examples

Start With Explain

The first user-facing layer is the DataJourney CLI. It explains what each workflow does before you run it.

pixi run DJ_explain

After installing the local package:

datajourney explain

Try the AI-first path:

datajourney explain DJ_RAG_without_memory
datajourney explain --category RAG
datajourney explain --list

The explain command answers the questions new users usually have:

What does this workflow do?
Why is it useful?
What prerequisites do I need?
What output should I expect?
What should I run next?
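
As a rough illustration of how one catalog entry could answer these five questions, here is a minimal Python sketch. It is not DataJourney's actual implementation: the key names (`summary`, `benefit`, `prerequisites`, `expected_output`, `next_steps`), the sample entry, and the `render_explanation` helper are all assumptions made for the example.

```python
# Hypothetical entry. The key names and values are assumed for illustration
# and are not copied from the real workflow catalog.
EXAMPLE_ENTRY = {
    "summary": "Runs a hypothetical example workflow.",
    "benefit": "Shows how an entry maps to the explain output.",
    "prerequisites": ["Pixi environment activated"],
    "expected_output": "A short report printed to the terminal.",
    "next_steps": ["another_example_workflow"],
}

# Each question the explain command answers, paired with the (assumed) key
# that would hold the answer.
QUESTIONS = [
    ("What does this workflow do?", "summary"),
    ("Why is it useful?", "benefit"),
    ("What prerequisites do I need?", "prerequisites"),
    ("What output should I expect?", "expected_output"),
    ("What should I run next?", "next_steps"),
]


def render_explanation(name, entry):
    """Return a plain-text explanation for one workflow entry."""
    lines = [name]
    for question, key in QUESTIONS:
        value = entry.get(key, "(not documented)")
        if isinstance(value, list):
            # Join list-valued fields so each answer fits on one line.
            value = ", ".join(value) if value else "(none)"
        lines.append(f"{question}\n  {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_explanation("example_workflow", EXAMPLE_ENTRY))
```

Keeping the question-to-key mapping in one list means a missing field shows up as "(not documented)" rather than as an error, which is one way to make gaps in the metadata visible to contributors.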

The Design Model

DataJourney treats a data product as a set of connected design layers.

Design Philosophy

DataJourney follows a LEGO-like design philosophy: small open-source capabilities should be understandable on their own, but more powerful when composed into a coherent system.

The toolkit is built with additive and subtractive layers:

P0 Base: static homes and project anchors that keep the system visible, such as GitHub and documentation
P1 Tooling: open-source building blocks for data, AI, retrieval, apps, and dashboards
P2 Maintenance: environments, automation, quality checks, and monitoring through Pixi and GitHub Actions
P3 Abstraction: user-facing layers such as the CLI, task runner, workflow metadata, and agents

Each layer should communicate clearly with the layer above it. That is the design goal: not just to run tools, but to make the system explainable, extensible, and beautiful to work with.

The design layers of a data product, with the DataJourney examples that illustrate each one:

Layer	Purpose	DataJourney Examples
Source	Bring data into a visible catalog	Intake, CSV datasets, source metadata
Understanding	Inspect shape, schema, and meaning	profiling, EDA, dataset previews
Intelligence	Add AI reasoning and retrieval	LLM analysis, RAG, ChromaDB, prompt enhancement
Interface	Give users a surface to interact with	FastHTML, Flask, Panel dashboards, generated apps
Orchestration	Make workflows repeatable	Pixi, Dagster, reusable tasks
Evaluation	Check whether AI behavior is trustworthy	tracing and LLM evaluation examples
Packaging	Make the system installable and explainable	CLI, `setup.py`, workflow metadata

The important idea: each workflow is not a random demo. It is a piece of a larger product journey.

Quick Start

Fork or clone the repository, then enter the project:

cd DataJourney

Install Pixi from prefix.dev, then activate the environment:

pixi shell

Install DataJourney locally:

pixi run DJ_package

Explore the toolkit:

pixi run DJ_explain
pixi run DJ_explain -- DJ_RAG_without_memory
pixi run DJ_explain -- --category RAG
pixi run DJ_list

For model-backed workflows, configure a GITHUB_TOKEN or the relevant model-provider credentials before running AI tasks.

pixi run GIT_TOKEN_CHECK
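
What the `GIT_TOKEN_CHECK` task does internally is not shown here. A minimal Python sketch of this kind of pre-flight check, assuming the credential is read from the `GITHUB_TOKEN` environment variable, might look like the following. The helper name `has_credential` is hypothetical.

```python
import os


def has_credential(name="GITHUB_TOKEN", env=None):
    """Return True if the named variable is set to a non-blank value.

    `env` defaults to the process environment; passing a dict makes the
    check easy to test without touching real credentials.
    """
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())


if __name__ == "__main__":
    if has_credential():
        print("GITHUB_TOKEN is set.")
    else:
        print("GITHUB_TOKEN is missing; model-backed workflows will likely fail.")
```

The check only confirms that a value is present. It does not verify that the token is valid or has the right permissions.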

AI-First Example Path

If you want to see DataJourney as an AI data-product toolkit, start here:

datajourney explain DJ_RAG_without_memory
datajourney explain DJ_prompt_enhancer
datajourney explain DJ_llm_analysis_gpt_4o

Then run the workflows when your credentials and prerequisites are ready:

pixi run DJ_chromadb_gen_embedding
pixi run DJ_RAG_without_memory
pixi run DJ_prompt_enhancer

This path shows the core design idea:

prepare context
retrieve relevant information
ask a model to reason over grounded data
turn the behavior into something explainable and reusable
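
The first three steps above can be sketched in plain Python. This is a toy stand-in, not the DataJourney workflow: it swaps ChromaDB embeddings for simple word-count vectors compared by cosine similarity, and it stops at building a grounded prompt instead of calling a model. All names here (`DOCUMENTS`, `vectorize`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math
import re
from collections import Counter

# Step 1: prepare context. A tiny in-memory corpus stands in for a vector store.
DOCUMENTS = [
    "Pixi manages environments and runs project tasks.",
    "ChromaDB stores embeddings for similarity search.",
    "Panel builds interactive dashboards in Python.",
]


def vectorize(text):
    """Turn text into a bag-of-words count vector (a stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors; 0.0 if either is empty."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=1):
    """Step 2: return the k documents most similar to the question."""
    query = vectorize(question)
    ranked = sorted(documents, key=lambda d: cosine(query, vectorize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, context):
    """Step 3: ground the model by placing retrieved text in the prompt."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


if __name__ == "__main__":
    question = "Which tool stores embeddings?"
    print(build_prompt(question, retrieve(question, DOCUMENTS)))
```

The resulting prompt would then be sent to a model. Wrapping these functions behind a named, documented task corresponds to the fourth step: making the behavior explainable and reusable.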

Workflow Metadata

The CLI is powered by one extensible JSON catalog:

analytics_framework/workflows/workflow_catalog.json

Each workflow entry describes:

summary
benefit
design principle
command
prerequisites
expected output
common errors
next steps
related files

Validate the catalog after editing it:

datajourney explain --validate

This makes DataJourney easier to extend because contributors can improve the user experience without changing workflow code.
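
How `datajourney explain --validate` works internally is not documented here. As a hedged sketch, a catalog check could confirm that every entry carries the fields listed above. The key names in `REQUIRED_FIELDS`, the catalog shape (a JSON object mapping workflow names to entries), and the sample entry are assumptions for illustration.

```python
import json

# Assumed key names, derived from the field list above. The real catalog
# may spell them differently.
REQUIRED_FIELDS = [
    "summary",
    "benefit",
    "design_principle",
    "command",
    "prerequisites",
    "expected_output",
    "common_errors",
    "next_steps",
    "related_files",
]


def validate_catalog(catalog):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, entry in catalog.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry is not an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in entry:
                problems.append(f"{name}: missing '{field}'")
    return problems


# A deliberately incomplete, hypothetical entry to show what gets reported.
SAMPLE_CATALOG = json.loads("""
{
  "example_workflow": {
    "summary": "Runs a hypothetical example workflow.",
    "command": "pixi run example_workflow"
  }
}
""")

if __name__ == "__main__":
    for problem in validate_catalog(SAMPLE_CATALOG):
        print(problem)
```

Returning a list of problems instead of raising on the first one lets a contributor fix every gap in a single pass.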

Current Building Blocks

DataJourney uses open-source tools as composable teaching material:

Capability	Tools
Environment and task running	Pixi
Data catalog and ingestion	Intake
Pipelines	Dagster
Dashboards and apps	Panel, FastHTML, Flask
AI and agent workflows	LangChain, Google ADK-style agents, GitHub Models
Retrieval	ChromaDB
Prompt design	GPT-OSS examples
Evaluation and tracing	DeepEval examples
Packaging and discovery	Click, Rich, `setup.py`, workflow metadata

Repository Map

analytics_framework/
  ai_modeling/       LLM-backed analysis examples
  build_agents/      DataJourney demo agents and agent tools
  dashboard/         Panel dashboard examples
  gpt_oss/           Prompt enhancement examples
  intake/            Cataloged datasets and web UIs
  pipeline/          Dagster pipeline example
  rag_system/        ChromaDB and RAG examples
  workflows/         CLI explanation metadata and utilities
usage_guide/         Notebook-based learning guides
assets/              Images used by docs and demos

Contributing

The easiest contribution path is improving the explanation layer.

Good first contributions:

add clearer prerequisites for a workflow
add common errors and fixes
improve expected output descriptions
connect workflows with better next steps
improve workflow categories and descriptions

Before opening a pull request, run:

datajourney explain --validate
pixi run DJ_pre_commit

Read CONTRIBUTING.md, CODE_OF_CONDUCT.md, and SECURITY.md before contributing.

License

DataJourney is licensed under the Apache License 2.0.

