Design AI-native data products with open-source building blocks.
DataJourney teaches how data, AI, retrieval, dashboards, agents, evaluation, and packaging fit together as one usable system.
Recipient: GitHub Secure Open Source Fund
Sponsor DataJourneyHQ | Official announcement
DataJourney is a design-first open-source toolkit for learning how to assemble AI-powered data products.
Most data and AI examples teach one tool at a time. DataJourney focuses on the system around the tool: how data is discovered, profiled, retrieved, analyzed by models, turned into an interface, evaluated, and packaged so another person can actually use it.
The project is both:
- a learning environment for understanding AI and data system design
- a practical toolkit for composing open-source workflows into runnable examples
The first user-facing layer is the DataJourney CLI. It explains what each workflow does before you run it.
pixi run DJ_explain
After installing the local package:
datajourney explain
Try the AI-first path:
datajourney explain DJ_RAG_without_memory
datajourney explain --category RAG
datajourney explain --list
The explain command answers the questions new users usually have:
- What does this workflow do?
- Why is it useful?
- What prerequisites do I need?
- What output should I expect?
- What should I run next?
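To make the question-and-answer idea concrete, here is a minimal sketch of how a metadata entry could be turned into that kind of explanation. It is an illustration only, not the DataJourney CLI's implementation: the workflow name `DJ_example_workflow`, the field names, and the `render_explanation` helper are all hypothetical.

```python
# Hypothetical workflow entry. The name, field names, and values are
# illustrative assumptions, not the real DataJourney catalog schema.
entry = {
    "name": "DJ_example_workflow",
    "summary": "Illustrative summary of what the workflow does.",
    "benefit": "Illustrative reason the workflow is useful.",
    "prerequisites": ["An activated project environment"],
    "expected_output": "Illustrative description of the output.",
    "next_steps": ["Another workflow to try afterwards"],
}

# Each question a new user asks is paired with the field that answers it.
QUESTIONS = [
    ("What does this workflow do?", "summary"),
    ("Why is it useful?", "benefit"),
    ("What prerequisites do I need?", "prerequisites"),
    ("What output should I expect?", "expected_output"),
    ("What should I run next?", "next_steps"),
]


def render_explanation(entry: dict) -> str:
    """Format one entry as question/answer text, flagging undocumented fields."""
    lines = [entry["name"]]
    for question, field in QUESTIONS:
        value = entry.get(field, "(not documented)")
        if isinstance(value, list):
            value = ", ".join(value) if value else "(none)"
        lines.append(question)
        lines.append(f"  {value}")
    return "\n".join(lines)


print(render_explanation(entry))
```

Keeping the questions in one table means a missing answer shows up as "(not documented)" instead of being silently skipped, which makes gaps in the metadata easy to spot.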
DataJourney treats a data product as a set of connected design layers.
DataJourney follows a LEGO-like design philosophy: small open-source capabilities should be understandable on their own, but more powerful when composed into a coherent system.
The toolkit is built with additive and subtractive layers:
- P0 Base: static homes and project anchors that keep the system visible, such as GitHub and documentation
- P1 Tooling: open-source building blocks for data, AI, retrieval, apps, and dashboards
- P2 Maintenance: environments, automation, quality checks, and monitoring through Pixi and GitHub Actions
- P3 Abstraction: user-facing layers such as the CLI, task runner, workflow metadata, and agents
Each layer should communicate clearly with the layer above it. That is the design goal: not just to run tools, but to make the system explainable, extensible, and beautiful to work with.
| Layer | Purpose | DataJourney Examples |
|---|---|---|
| Source | Bring data into a visible catalog | Intake, CSV datasets, source metadata |
| Understanding | Inspect shape, schema, and meaning | profiling, EDA, dataset previews |
| Intelligence | Add AI reasoning and retrieval | LLM analysis, RAG, ChromaDB, prompt enhancement |
| Interface | Give users a surface to interact with | FastHTML, Flask, Panel dashboards, generated apps |
| Orchestration | Make workflows repeatable | Pixi, Dagster, reusable tasks |
| Evaluation | Check whether AI behavior is trustworthy | tracing and LLM evaluation examples |
| Packaging | Make the system installable and explainable | CLI, setup.py, workflow metadata |
The important idea: each workflow is not a random demo. It is a piece of a larger product journey.
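The layered view can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions, not DataJourney code: the inline CSV is made-up data, and `load_source`, `profile`, and `render` are hypothetical stand-ins for the Source, Understanding, and Interface layers.

```python
import csv
import io

# Source layer: a small inline CSV stands in for a cataloged dataset.
# The values are made up for illustration.
RAW_CSV = """city,temperature_c
Oslo,4.5
Lima,19.0
Cairo,28.5
"""


def load_source(text: str) -> list:
    """Source: read rows from CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def profile(rows: list) -> dict:
    """Understanding: report the basic shape of the data."""
    columns = list(rows[0].keys()) if rows else []
    return {"row_count": len(rows), "columns": columns}


def render(summary: dict) -> str:
    """Interface: turn the profile into text a person can read."""
    return f"{summary['row_count']} rows, columns: {', '.join(summary['columns'])}"


# Each layer consumes only the output of the layer before it.
rows = load_source(RAW_CSV)
print(render(profile(rows)))
```

Because each function depends only on the previous layer's output, any one of them can be swapped for a richer tool without rewriting the others.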
Fork or clone the repository, then enter the project:
cd DataJourney
Install Pixi from prefix.dev, then activate the environment:
pixi shell
Install DataJourney locally:
pixi run DJ_package
Explore the toolkit:
pixi run DJ_explain
pixi run DJ_explain -- DJ_RAG_without_memory
pixi run DJ_explain -- --category RAG
pixi run DJ_list
For model-backed workflows, configure a GITHUB_TOKEN or the relevant model-provider credentials before running AI tasks.
pixi run GIT_TOKEN_CHECK
If you want to see DataJourney as an AI data-product toolkit, start here:
datajourney explain DJ_RAG_without_memory
datajourney explain DJ_prompt_enhancer
datajourney explain DJ_llm_analysis_gpt_4o
Then run the workflows when your credentials and prerequisites are ready:
pixi run DJ_chromadb_gen_embedding
pixi run DJ_RAG_without_memory
pixi run DJ_prompt_enhancer
This path shows the core design idea:
- prepare context
- retrieve relevant information
- ask a model to reason over grounded data
- turn the behavior into something explainable and reusable
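The four steps above can be sketched without any external services. This is a toy stand-in, not the DataJourney workflows themselves: word-count vectors with cosine similarity replace real embeddings and ChromaDB, the model call is omitted, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter

# Step 1, prepare context: a tiny in-memory corpus stands in for
# documents that a real workflow would embed and store.
DOCUMENTS = [
    "Pixi manages environments and runs project tasks.",
    "ChromaDB stores embeddings for similarity search.",
    "Panel builds interactive dashboards in Python.",
]


def vectorize(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: list, k: int = 1) -> list:
    """Step 2, retrieve: return the k documents most similar to the question."""
    query = vectorize(question)
    ranked = sorted(documents, key=lambda d: cosine(query, vectorize(d)), reverse=True)
    return ranked[:k]


def build_grounded_prompt(question: str, context: list) -> str:
    """Step 3, ground the model: restrict the answer to retrieved context."""
    bullet_list = "\n".join(f"- {item}" for item in context)
    return f"Answer using only this context:\n{bullet_list}\n\nQuestion: {question}"


# Step 4, reusable behavior: the pipeline is two plain function calls.
question = "Which tool stores embeddings?"
context = retrieve(question, DOCUMENTS)
print(build_grounded_prompt(question, context))
```

In a model-backed workflow the printed prompt would be sent to a model provider; here it is only printed so the retrieval and grounding steps stay visible.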
The CLI is powered by one extensible JSON catalog:
analytics_framework/workflows/workflow_catalog.json
Each workflow entry describes:
- summary
- benefit
- design principle
- command
- prerequisites
- expected output
- common errors
- next steps
- related files
Validate the catalog after editing it:
datajourney explain --validate
This makes DataJourney easier to extend because contributors can improve the user experience without changing workflow code.
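As a rough picture of what a metadata-only contribution involves, the sketch below parses a small catalog and reports entries with missing fields. It is not the logic behind `datajourney explain --validate`: the JSON key names are guesses derived from the field list above, the two entries are invented, and `find_missing_fields` is a hypothetical helper.

```python
import json

# Assumed key names based on the documented field list; the real
# workflow_catalog.json may name or nest these differently.
REQUIRED_FIELDS = [
    "summary", "benefit", "design_principle", "command", "prerequisites",
    "expected_output", "common_errors", "next_steps", "related_files",
]

# Hypothetical catalog text: one complete entry and one incomplete entry.
CATALOG_TEXT = """
{
  "DJ_complete_example": {
    "summary": "Illustrative summary.",
    "benefit": "Illustrative benefit.",
    "design_principle": "Illustrative principle.",
    "command": "pixi run DJ_complete_example",
    "prerequisites": [],
    "expected_output": "Illustrative output.",
    "common_errors": [],
    "next_steps": [],
    "related_files": []
  },
  "DJ_incomplete_example": {
    "summary": "Illustrative summary.",
    "benefit": "Illustrative benefit.",
    "design_principle": "Illustrative principle.",
    "command": "pixi run DJ_incomplete_example",
    "prerequisites": [],
    "expected_output": "Illustrative output.",
    "next_steps": []
  }
}
"""


def find_missing_fields(catalog: dict) -> dict:
    """Map each incomplete workflow name to the fields it lacks."""
    problems = {}
    for name, entry in catalog.items():
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            problems[name] = missing
    return problems


catalog = json.loads(CATALOG_TEXT)
for name, missing in find_missing_fields(catalog).items():
    print(f"{name}: missing {', '.join(missing)}")
```

A check like this only reads metadata, so a contributor can fix a reported gap by editing the JSON entry alone.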
DataJourney uses open-source tools as composable teaching material:
| Capability | Tools |
|---|---|
| Environment and task running | Pixi |
| Data catalog and ingestion | Intake |
| Pipelines | Dagster |
| Dashboards and apps | Panel, FastHTML, Flask |
| AI and agent workflows | LangChain, Google ADK-style agents, GitHub Models |
| Retrieval | ChromaDB |
| Prompt design | GPT-OSS examples |
| Evaluation and tracing | DeepEval examples |
| Packaging and discovery | Click, Rich, setup.py, workflow metadata |
analytics_framework/
ai_modeling/ LLM-backed analysis examples
build_agents/ DataJourney demo agents and agent tools
dashboard/ Panel dashboard examples
gpt_oss/ Prompt enhancement examples
intake/ Cataloged datasets and web UIs
pipeline/ Dagster pipeline example
rag_system/ ChromaDB and RAG examples
workflows/ CLI explanation metadata and utilities
usage_guide/ Notebook-based learning guides
assets/ Images used by docs and demos
The easiest contribution path is improving the explanation layer.
Good first contributions:
- add clearer prerequisites for a workflow
- add common errors and fixes
- improve expected output descriptions
- connect workflows with better next steps
- improve workflow categories and descriptions
Before opening a pull request, run:
datajourney explain --validate
pixi run DJ_pre_commit
Read CONTRIBUTING.md, CODE_OF_CONDUCT.md, and SECURITY.md before contributing.
DataJourney is licensed under the Apache License 2.0.


