
chrisx9z/DataJourney


Design AI-native data products with open-source building blocks.
DataJourney teaches how data, AI, retrieval, dashboards, agents, evaluation, and packaging fit together as one usable system.

Recipient: GitHub Secure Open Source Fund
Sponsor DataJourneyHQ  |  Official announcement

What Is DataJourney?

DataJourney is a design-first open-source toolkit for learning how to assemble AI-powered data products.

Most data and AI examples teach one tool at a time. DataJourney focuses on the system around the tool: how data is discovered, profiled, retrieved, analyzed by models, turned into an interface, evaluated, and packaged so another person can actually use it.

The project is both:

  • a learning environment for understanding AI and data system design
  • a practical toolkit for composing open-source workflows into runnable examples

Start With Explain

The first user-facing layer is the DataJourney CLI. It explains what each workflow does before you run it.

[Screenshot: DataJourney explain overview CLI output]

[Screenshot: DataJourney RAG workflow explanation CLI output]

pixi run DJ_explain

After installing the local package:

datajourney explain

Try the AI-first path:

datajourney explain DJ_RAG_without_memory
datajourney explain --category RAG
datajourney explain --list

The explain command answers the questions new users usually have:

  • What does this workflow do?
  • Why is it useful?
  • What prerequisites do I need?
  • What output should I expect?
  • What should I run next?
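
As a rough illustration of how one catalog entry could map onto those five questions, the Python sketch below renders a hypothetical entry as plain text. The key names, the sample values, and the `render_explanation` helper are all assumptions made for illustration; the actual CLI may structure and format its output differently.

```python
# Hypothetical catalog entry. The key names and values are assumptions for
# illustration, not the actual schema or contents of the DataJourney catalog.
EXAMPLE_ENTRY = {
    "name": "DJ_RAG_without_memory",
    "summary": "(placeholder) one-sentence description of the workflow",
    "benefit": "(placeholder) why a user would run it",
    "prerequisites": ["(placeholder) credentials", "(placeholder) prior task"],
    "expected_output": "(placeholder) what appears when it succeeds",
    "next_steps": ["DJ_prompt_enhancer"],
}

# Each question from the list above, paired with the assumed key that answers it.
QUESTIONS = [
    ("What does this workflow do?", "summary"),
    ("Why is it useful?", "benefit"),
    ("What prerequisites do I need?", "prerequisites"),
    ("What output should I expect?", "expected_output"),
    ("What should I run next?", "next_steps"),
]


def render_explanation(entry):
    """Return a plain-text explanation of one workflow entry."""
    title = entry["name"]
    lines = [title, "=" * len(title)]
    for question, key in QUESTIONS:
        value = entry.get(key, "(not documented)")
        if isinstance(value, list):
            # Lists are shown on one line; an empty list is made explicit.
            value = ", ".join(value) or "(none)"
        lines.append(question)
        lines.append(f"  {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_explanation(EXAMPLE_ENTRY))
```

Keeping the question-to-key mapping in one table means a missing answer shows up as "(not documented)" instead of silently disappearing, which is one way to make gaps in the explanation layer visible.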

The Design Model

DataJourney treats a data product as a set of connected design layers.

[Figure: DataJourney design vision]

Design Philosophy

DataJourney follows a LEGO-like design philosophy: small open-source capabilities should be understandable on their own, but more powerful when composed into a coherent system.

The toolkit is built with additive and subtractive layers:

  • P0 Base: static homes and project anchors that keep the system visible, such as GitHub and documentation
  • P1 Tooling: open-source building blocks for data, AI, retrieval, apps, and dashboards
  • P2 Maintenance: environments, automation, quality checks, and monitoring through Pixi and GitHub Actions
  • P3 Abstraction: user-facing layers such as the CLI, task runner, workflow metadata, and agents

Each layer should communicate clearly with the layer above it. That is the design goal: not just to run tools, but to make the system explainable, extensible, and beautiful to work with.

Layer          Purpose                                      DataJourney Examples
Source         Bring data into a visible catalog            Intake, CSV datasets, source metadata
Understanding  Inspect shape, schema, and meaning           profiling, EDA, dataset previews
Intelligence   Add AI reasoning and retrieval               LLM analysis, RAG, ChromaDB, prompt enhancement
Interface      Give users a surface to interact with        FastHTML, Flask, Panel dashboards, generated apps
Orchestration  Make workflows repeatable                    Pixi, Dagster, reusable tasks
Evaluation     Check whether AI behavior is trustworthy     tracing and LLM evaluation examples
Packaging      Make the system installable and explainable  CLI, setup.py, workflow metadata

The important idea: no workflow is a random demo. Each one is a piece of a larger product journey.

Quick Start

Fork or clone the repository, then enter the project:

cd DataJourney

Install Pixi from prefix.dev, then activate the environment:

pixi shell

Install DataJourney locally:

pixi run DJ_package

Explore the toolkit:

pixi run DJ_explain
pixi run DJ_explain -- DJ_RAG_without_memory
pixi run DJ_explain -- --category RAG
pixi run DJ_list

For model-backed workflows, configure a GITHUB_TOKEN or the relevant model-provider credentials before running AI tasks.

pixi run GIT_TOKEN_CHECK
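
For readers who want to see what a credential pre-check can look like, here is a minimal Python sketch that reports which required environment variables are unset or blank. It is not the implementation of the `GIT_TOKEN_CHECK` task; the `missing_credentials` helper and its parameters are hypothetical, and only the `GITHUB_TOKEN` variable name comes from the text above.

```python
import os


def missing_credentials(required=("GITHUB_TOKEN",), env=None):
    """Return the names of required variables that are unset or blank.

    `env` defaults to the process environment; passing a dict makes the
    check easy to exercise without touching real credentials.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials: " + ", ".join(missing))
    else:
        print("All required credentials are set.")
```

The check only confirms that a value is present. It does not contact any provider, so an expired or under-scoped token would still pass.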

AI-First Example Path

If you want to see DataJourney as an AI data-product toolkit, start here:

datajourney explain DJ_RAG_without_memory
datajourney explain DJ_prompt_enhancer
datajourney explain DJ_llm_analysis_gpt_4o

Then run the workflows when your credentials and prerequisites are ready:

pixi run DJ_chromadb_gen_embedding
pixi run DJ_RAG_without_memory
pixi run DJ_prompt_enhancer

This path shows the core design idea:

  1. prepare context
  2. retrieve relevant information
  3. ask a model to reason over grounded data
  4. turn the behavior into something explainable and reusable
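
The four steps can be sketched in plain Python without any external services. This is a toy illustration under stated assumptions, not the DataJourney workflow: it swaps in word-count cosine similarity where the real path uses ChromaDB embeddings, it stops at building the prompt and does not call a model, and every function name and sample sentence is hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative sample sentences, not data from the repository.
DOCUMENTS = [
    "Pixi manages environments and runs tasks.",
    "ChromaDB stores embeddings for retrieval.",
    "Panel builds dashboards.",
]
QUESTION = "Which tool stores embeddings?"


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def prepare_context(documents):
    """Step 1: index each document as (text, word counts)."""
    return [(doc, Counter(tokenize(doc))) for doc in documents]


def retrieve(index, question, k=2):
    """Step 2: return the k documents most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_grounded_prompt(question, passages):
    """Step 3: build a prompt that restricts the model to retrieved context."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    index = prepare_context(DOCUMENTS)
    print(build_grounded_prompt(QUESTION, retrieve(index, QUESTION, k=1)))
```

Step 4 is reflected in the shape of the code: each stage is a small named function, so it can be explained on its own and reused with a different retriever or prompt template.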

Workflow Metadata

The CLI is powered by one extensible JSON catalog:

analytics_framework/workflows/workflow_catalog.json

Each workflow entry describes:

  • summary
  • benefit
  • design principle
  • command
  • prerequisites
  • expected output
  • common errors
  • next steps
  • related files

Validate the catalog after editing it:

datajourney explain --validate

This makes DataJourney easier to extend because contributors can improve the user experience without changing workflow code.
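
To make the idea of catalog validation concrete, the sketch below checks that every entry in a small in-memory catalog carries the fields listed above. It is an assumption-laden illustration, not the logic behind `datajourney explain --validate`: the snake_case key names, the name-to-entry layout, the sample entries, and the `validate_catalog` helper are all hypothetical.

```python
import json

# Assumed key names derived from the field list above; the real catalog may
# spell or nest them differently.
REQUIRED_FIELDS = (
    "summary",
    "benefit",
    "design_principle",
    "command",
    "prerequisites",
    "expected_output",
    "common_errors",
    "next_steps",
    "related_files",
)

# Hypothetical catalog: one complete entry and one that is missing fields.
SAMPLE = json.loads("""
{
  "DJ_complete_example": {
    "summary": "placeholder",
    "benefit": "placeholder",
    "design_principle": "placeholder",
    "command": "pixi run DJ_complete_example",
    "prerequisites": [],
    "expected_output": "placeholder",
    "common_errors": [],
    "next_steps": [],
    "related_files": []
  },
  "DJ_incomplete_example": {
    "summary": "placeholder",
    "benefit": "placeholder"
  }
}
""")


def validate_catalog(catalog):
    """Return a list of problems; an empty list means every entry is complete."""
    problems = []
    for name, entry in catalog.items():
        for field in REQUIRED_FIELDS:
            if field not in entry:
                problems.append(f"{name}: missing '{field}'")
    return problems


if __name__ == "__main__":
    for problem in validate_catalog(SAMPLE):
        print(problem)
```

Returning a list of problems instead of raising on the first one lets a contributor fix every gap in a single pass.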

Current Building Blocks

DataJourney uses open-source tools as composable teaching material:

Capability                    Tools
Environment and task running  Pixi
Data catalog and ingestion    Intake
Pipelines                     Dagster
Dashboards and apps           Panel, FastHTML, Flask
AI and agent workflows        LangChain, Google ADK-style agents, GitHub Models
Retrieval                     ChromaDB
Prompt design                 GPT-OSS examples
Evaluation and tracing        DeepEval examples
Packaging and discovery       Click, Rich, setup.py, workflow metadata

Repository Map

analytics_framework/
  ai_modeling/       LLM-backed analysis examples
  build_agents/      DataJourney demo agents and agent tools
  dashboard/         Panel dashboard examples
  gpt_oss/           Prompt enhancement examples
  intake/            Cataloged datasets and web UIs
  pipeline/          Dagster pipeline example
  rag_system/        ChromaDB and RAG examples
  workflows/         CLI explanation metadata and utilities
usage_guide/         Notebook-based learning guides
assets/              Images used by docs and demos

Contributing

The easiest contribution path is improving the explanation layer.

Good first contributions:

  • add clearer prerequisites for a workflow
  • add common errors and fixes
  • improve expected output descriptions
  • connect workflows with better next steps
  • improve workflow categories and descriptions

Before opening a pull request, run:

datajourney explain --validate
pixi run DJ_pre_commit

Read CONTRIBUTING.md, CODE_OF_CONDUCT.md, and SECURITY.md before contributing.

License

DataJourney is licensed under the Apache License 2.0.
