

FoJin 佛津

The World's Encyclopedic Buddhist Digital Text Platform

503 sources. 30 languages. 30 countries. 23,500+ full-text volumes. One search.

FoJin aggregates the world's Buddhist digital heritage: 10,500+ texts, with 23,500+ volumes of full content in Pali, Classical Chinese, Tibetan, and Sanskrit, drawn from 503 data sources. It is the first LLM-driven trilingual cross-canon parallel reading platform (CBETA × SuttaCentral × 84000), with chunk-level alignments verified by an LLM. It also offers CBETA-style reading; AI-powered Q&A with 8 Buddhist master personas (RAG, tradition-scoped retrieval, and citations); a knowledge graph with 31K+ entities and 28K+ relations; a Deck.GL geo map of 50K+ entities; 32 dictionaries with 748K entries across 6 languages; timeline visualization; an activity feed; collections; citation export; annotations; bookmarks; and multi-language parallel reading.

Live Demo  ·  API Docs  ·  中文文档  ·  Discussions  ·  Discord  ·  Report Bug




Why FoJin?

Buddhist texts are scattered across hundreds of databases worldwide — CBETA, SuttaCentral, BDRC, SAT, 84000, GRETIL, and many more. Each has different interfaces, languages, and data formats. Researchers spend more time finding texts than reading them.

FoJin solves this. It aggregates 503 sources into a single, searchable platform with features no other tool provides:

| What you need | How FoJin helps |
| --- | --- |
| Find a sutra across databases | Multi-dimensional search across 10,500+ texts from 503 sources |
| Read the full text online | 8,900+ texts with 23,500+ volumes of full content, CBETA-style layout |
| Compare translations | Parallel reading in 30 languages side by side |
| Compare sutras across Buddhist canons | Trilingual cross-canon parallel reading: 5 MVP sutras with 142 LLM-verified chunk alignments across Chinese / Pali / Tibetan (Heart Sutra, Satipaṭṭhāna, Dhammacakka, Dhammapada, Vimalakīrti) |
| Look up Buddhist terms | 32 dictionaries, 748K entries (Chinese/Sanskrit/Pali/Tibetan/English) |
| Explore relationships | Knowledge graph with 31K+ entities and 28K+ relations (23K lineage chains) |
| Discover similar texts | Semantic similarity powered by 678K+ embedding vectors (pgvector + HNSW) |
| Ask questions about texts | AI Q&A ("XiaoJin") with RAG, reranking, clickable citations, multi-language citation drawer, and follow-up suggestions |
| Learn from a specific master | Master Persona Mode: 8 historical Buddhist masters, each with tradition-specific RAG scope |
| Explore Buddhist geography | Knowledge Graph Map: 50K+ geo entities, monastery locations, lineage arcs on Deck.GL |
| Track source updates | Activity Feed: real-time updates from 503 data sources |
| Explore history visually | Timeline & Dashboard: dynasty charts, translation trends, category analytics |
| Save and organize | Collections, bookmarks, annotations for personal study |
| Cite in research | Citation export (BibTeX, RIS, APA) for academic use |

Quick Start

git clone https://github.com/xr843/fojin.git
cd fojin
cp .env.example .env        # edit POSTGRES_PASSWORD before starting
docker compose up -d         # database migrations run automatically

Then visit: http://localhost:3000

API docs at http://localhost:8000/docs

After first startup, the platform has the database schema and source metadata but no text content. To import texts from public data sources:

# Import CBETA catalog (auto-scans local xml-p5 directory or fetches from remote)
docker exec fojin-backend python scripts/import_catalog.py

# Import CBETA full text content (requires xml-p5 repository)
docker exec fojin-backend python scripts/import_content.py --all --xml-dir /data/xml-p5

# Generate embeddings for AI Q&A (supports incremental processing)
docker exec fojin-backend python -m scripts.generate_embeddings --source cbeta

# Import SuttaCentral Early Buddhist Texts
docker exec fojin-backend python scripts/import_suttacentral.py

# See all available importers
ls backend/scripts/import_*.py

Each importer downloads data directly from the original source (CBETA, SuttaCentral, etc.) — no data is bundled in this repository.

Features

Multi-Dimensional Search

Search across Buddhist canons by title, translator, catalog number, or full-text keyword. Powered by Elasticsearch with ICU tokenizer for multi-language support.
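As a rough illustration of the search dimensions described above, the sketch below builds an Elasticsearch index-settings fragment and a query body as plain Python dicts. The analyzer name `icu_text`, the field names (`title`, `translator`, `catalog_number`, `content`), and the boost values are assumptions made for the example, not FoJin's actual index mapping. The `icu_tokenizer` and `icu_folding` components come from Elasticsearch's analysis-icu plugin.

```python
# Index settings fragment: a custom analyzer built on the ICU tokenizer.
# Requires the analysis-icu plugin; the analyzer name is hypothetical.
ICU_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "icu_text": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["icu_folding"],
                }
            }
        }
    }
}


def build_search_query(keyword: str, size: int = 10) -> dict:
    """Build a multi_match query body over several (hypothetical) fields."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": keyword,
                # Field names and boosts are illustrative assumptions.
                "fields": ["title^3", "translator^2", "catalog_number^2", "content"],
                "type": "best_fields",
            }
        },
    }
```

The resulting dict can be passed as the request body of a `_search` call. Boosting `title` above `content` is one plausible way to rank title hits first, not a documented FoJin setting.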


Full-Text Reading

Read 8,900+ Buddhist texts with 23,500+ volumes of full content online. CBETA-style typography with intelligent verse/prose detection, paragraph reflow, and adjustable font size. Navigate by volume, scroll through content, and jump between related texts.

Parallel Reading (30 Languages)

Compare translations side by side: Classical Chinese, Sanskrit, Pali, Tibetan, English, Japanese, Korean, Gandhari, and 22 more languages.

Dictionary Lookup

32 authoritative dictionaries with 748,000+ entries across Chinese, Pali, Sanskrit, Tibetan, and English:

Chinese Buddhist Dictionaries (14)

  • NTI Reader (佛学辞典) — 161K entries, Chinese↔English
  • Suihan Lu (新集藏經音義隨函錄) — 72K entries, Tang dynasty phonetic glossary
  • Fo Guang (佛光大辭典) — 32K entries
  • Ding Fubao (丁福保佛学大辞典) — 31K entries
  • Yiqiejing Yinyi (一切經音義, 慧琳音義) — 23K entries, Buddhist scriptural phonetics
  • Faxiang Dictionary (法相辭典, 朱芾煌) — 15K entries, Yogācāra terminology
  • Zhonghua Encyclopedia (中華佛教百科全書) — 6K entries
  • Common Buddhist Terms (佛學常見詞彙, 陳義孝) — 6K entries
  • Agama Dictionary (阿含辭典, 莊春江) — 5K entries
  • Fanfanyu (翻梵語) — 4K entries, Sanskrit-Chinese translation glossary
  • Xu Yinyi (續一切經音義, 希麟) — 2K entries
  • Yogācāra Glossary (唯識名詞白話新解) — 2K entries
  • Sanzang Fashu (三藏法數) — 1K entries
  • Buddhist Origins of Idioms (俗語佛源) — 567 entries

Pali Dictionaries (5)

  • Digital Pali Dictionary (DPD) — 89K entries, grammar + etymology + examples
  • NCPED (New Concise Pali-English Dictionary) — 21K entries
  • PTS PED (Pali Text Society) — 16K entries
  • Buddhadatta (巴利語辭典, 達摩比丘中譯) — 11K entries, Pali→Chinese
  • SuttaCentral Glossary — 6K entries

Sanskrit Dictionaries (4)

  • Apte (Practical Sanskrit-English Dictionary) — 35K entries
  • Monier-Williams (Sanskrit-English Dictionary) — 32K entries
  • Edgerton BHS (Buddhist Hybrid Sanskrit Dictionary) — 18K entries
  • Fanyi Mingyi Ji (翻譯名義集) — 1K entries

Tibetan Dictionaries (2)

  • Rangjung Yeshe (Tibetan-English Dictionary) — 74K entries
  • Hopkins (Tibetan-Sanskrit-English Dictionary) — 18K entries

Multilingual Reference (4)

  • Soothill-Hodous (Chinese Buddhist Terms, Chinese↔English) — 17K entries
  • Mahāvyutpatti (翻譯名義大集, Sanskrit↔Tibetan↔Chinese) — 9K entries
  • Nanshan Vinaya (南山律学辞典) — 3K entries
  • Pentaglot (五體清文鑑, Manchu-Mongolian-Tibetan-Chinese-Sanskrit) — 1K entries

Specialized (3)

  • Abhidharma Dictionary (阿毗達磨辭典) — 1K entries
  • Tiantai Dictionary (天台教學辭典) — 1K entries
  • DDB (Digital Dictionary of Buddhism) — CJK Buddhist terminology

Knowledge Graph

31,000+ entities (persons, monasteries, texts, schools, concepts) and 28,000+ relationships — including 23,000 teacher-student lineage chains from the DILA Authority Database — visualized as an interactive force-directed graph. Click any node to explore connections.

Trilingual Cross-Canon Parallel Reading (三语对读)

The first LLM-driven cross-canon parallel reading system for Buddhist texts. No other platform provides this: CBETA (Chinese), SuttaCentral (Pali), and 84000 (Tibetan) each operate in their own language silo. FoJin bridges them with LLM-verified chunk-level alignment.

MVP (first 5 classics, 142 alignments):

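| Sutra | Source | Target | Pairs | Type |
| --- | --- | --- | --- | --- |
| 《般若波羅蜜多心經》Heart Sutra | T0252 (Chinese) | Toh 21 Kangyur (Tibetan) | 6 | Chinese ↔ Tibetan |
| 《維摩詰所說經》Vimalakīrti | T0475 (Chinese, Kumārajīva's translation) | Toh 176 (Tibetan) | 20 | Chinese ↔ Tibetan |
| Mahāsatipaṭṭhāna Sutta 念处经 | MN 10 (Pali) | T0026 Madhyama Āgama 中阿含 (Chinese) | 50 | Pali ↔ Chinese |
| Dhammacakkappavattana 转法轮经 | SN 56.11 (Pali) | T0099 Saṃyukta Āgama 杂阿含 (Chinese) | 17 | Pali ↔ Chinese |
| Dhammapada 法句经 | T0210 (Chinese) | SC 26 vaggas (Pali) | 49 | Chinese ↔ Pali |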

Hand-verified precision on a random 10-pair sample: 100% (all 10 pairs correctly identified across Chinese, Pali, and Tibetan).

How to use:

  1. In AI Q&A: when XiaoJin cites one of the 5 MVP sutras, the citation drawer shows the tabs [ 汉文 ] (Chinese), [ 巴利 (5) ] (Pali), and [ 藏文 (3) ] (Tibetan). Click a tab to see the corresponding passage in another canon, rendered with proper Devanagari / Tibetan fonts.

  2. In the reader: click the 🌐 「他藏对读」 (cross-canon parallel reading) button in the toolbar to open an inline side panel listing all aligned segments in the current juan. Expand each entry to see the Chinese source passage and all cross-canon parallels with LLM confidence scores. The panel sits to the left of the AI reading panel; both can be open at the same time and resized independently.

Pipeline (backend/scripts/build_alignments.py):

  • pgvector top-20 candidate recall within target text's embeddings
  • LLM verification (DeepSeek V3) returns {is_parallel, confidence, reason} JSON per candidate
  • Pairs with confidence ≥ 0.75 persisted to alignment_pairs with unique (text_a, text_b) chunk tuple for idempotent re-runs
  • $50 cost ceiling guard (actual MVP spend: ~$0.15)
  • Multi-target resolver supports cases where target is split across rows (e.g., SC Dhammapada's 26 separate vagga texts)
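The verification and persistence steps listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of build_alignments.py: the function names, the `verify` callable, and the per-call cost parameter are hypothetical, while the 0.75 threshold, the $50 ceiling, and the JSON shape come from the list above.

```python
import json
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.75  # from the pipeline description
COST_CEILING_USD = 50.0      # from the pipeline description


@dataclass
class Verdict:
    is_parallel: bool
    confidence: float
    reason: str


def parse_verdict(raw: str) -> Verdict:
    """Parse the {is_parallel, confidence, reason} JSON returned per candidate."""
    data = json.loads(raw)
    return Verdict(
        bool(data["is_parallel"]),
        float(data["confidence"]),
        str(data.get("reason", "")),
    )


def select_pairs(source_id, candidates, verify, cost_per_call_usd):
    """Keep accepted candidates, stopping before the budget would be exceeded.

    candidates: target chunk ids, e.g. the top-20 from vector recall.
    verify: hypothetical callable (source_id, target_id) -> raw JSON string.
    """
    kept = {}
    spent = 0.0
    for target_id in candidates:
        if spent + cost_per_call_usd > COST_CEILING_USD:
            break  # cost ceiling guard
        verdict = parse_verdict(verify(source_id, target_id))
        spent += cost_per_call_usd
        if verdict.is_parallel and verdict.confidence >= CONFIDENCE_THRESHOLD:
            # Keyed by the chunk pair, so a re-run overwrites instead of duplicating.
            kept[(source_id, target_id)] = verdict
    return kept, spent


def stub_verify(source_id, target_id):
    """Stand-in for the LLM call, used only to exercise the sketch."""
    confidence = {"b1": 0.9, "b2": 0.6, "b3": 0.8}[target_id]
    return json.dumps(
        {"is_parallel": target_id != "b3", "confidence": confidence, "reason": "stub"}
    )
```

With the stub, `select_pairs("a1", ["b1", "b2", "b3"], stub_verify, 0.001)` keeps only the `("a1", "b1")` pair: `b2` falls below the threshold and `b3` is rejected as non-parallel. Keying results by the chunk pair mirrors the unique-tuple constraint that makes re-runs idempotent.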

The RAG layer automatically includes parallel_chunks in the LLM context when a retrieved chunk has alignments, so answers can reference "the Pali version says…" from retrieved text rather than from the model's own recall.
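One way to picture how parallel_chunks could be folded into the prompt context is the sketch below. The dictionary keys (`id`, `title`, `text`, `canon`) and the output layout are assumptions for illustration; the actual context format used by FoJin is not specified here.

```python
def build_context(retrieved, alignments):
    """Render retrieved chunks plus any aligned parallels into prompt text.

    retrieved: list of dicts with hypothetical keys "id", "title", "text".
    alignments: dict mapping a chunk id to a list of parallel dicts
                with hypothetical keys "canon" and "text".
    """
    lines = []
    for chunk in retrieved:
        lines.append(f"[{chunk['title']}] {chunk['text']}")
        # Chunks without alignments simply contribute no parallel lines.
        for parallel in alignments.get(chunk["id"], []):
            lines.append(f"  parallel ({parallel['canon']}): {parallel['text']}")
    return "\n".join(lines)
```

Because each parallel is labeled with its canon, the model can attribute a statement to the Pali or Tibetan version only when that passage is actually present in the context.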

AI Q&A — "XiaoJin"

Ask questions in natural language. XiaoJin answers based on canonical Buddhist texts using RAG (Retrieval-Augmented Generation) with 678K+ embedding vectors and HNSW index for fast semantic search. Features include:

  • Multi-turn conversation with context awareness
  • Keyword + optional API cross-encoder reranking for higher answer quality
  • Clickable citations in 【《经名》第N卷】 format — click to open a side drawer with surrounding context, plus multi-language tabs for cross-canon parallels when available (see Trilingual section above)
  • GFM markdown tables — comparative answers (e.g., "Madhyamaka vs Yogācāra") render as proper tables instead of raw pipe syntax
  • Progressive follow-up suggestions (concept → related texts → practice)
  • Smart data source recommendations — when users ask about finding databases, AI automatically recommends relevant sources from 503 data sources via semantic similarity
  • Meta-question handling — detects self-introduction queries ("who are you" / "what can you do") and skips RAG to give a clean functional overview, instead of randomly citing scriptures
  • Anti-hallucination citation rules — the system prompt strictly forbids wrapping a text name in 【…】 unless that exact source appeared in the retrieved context, preventing broken citation links
  • Inline split-view in reader — AI panel opens by default beside the text with a draggable divider; independent scrolling on each side, resizable width persisted to localStorage
  • "Ask XiaoJin" button on the reader page — select text to ask about it
  • Tab key cycles through suggested questions in the input box
  • BYOK (Bring Your Own Key) support for multiple LLM providers
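The citation format and the anti-hallucination rule above lend themselves to a simple check. The sketch below is a hypothetical post-hoc validator, not FoJin's mechanism (the text describes the rule as enforced through the system prompt). It assumes the juan number N is written in Arabic digits.

```python
import re

# Matches citations of the form 【《经名》第N卷】, assuming N is in Arabic digits.
CITATION_RE = re.compile(r"【《([^》]+)》第(\d+)卷】")


def find_unsupported_citations(answer: str, retrieved_titles: set) -> list:
    """Return (title, juan) citations whose title was not in the retrieved context."""
    return [
        (title, int(juan))
        for title, juan in CITATION_RE.findall(answer)
        if title not in retrieved_titles
    ]
```

For example, with `retrieved_titles = {"金刚经"}`, an answer containing both 【《金刚经》第1卷】 and 【《心经》第1卷】 would flag only the second citation, which is the kind of link that would otherwise open an empty drawer.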


Master Persona Mode (法师模式)

Select a specific Buddhist master to receive answers in their teaching style, grounded in their tradition's core scriptures. 8 historical masters available:

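| Master | Tradition | Core Teachings |
| --- | --- | --- |
| 智顗 Zhiyi | Tiantai (天台宗) | Three thousand realms in a single thought, perfect interfusion of the three truths, five periods and eight teachings, joint practice of calming and insight |
| 慧能 Huineng | Chan (禅宗) | Pointing directly at the mind, seeing one's nature and becoming a Buddha, no-thought / no-form / non-abiding |
| 玄奘 Xuanzang | Faxiang / Yogācāra (法相唯识宗) | Eight consciousnesses, three natures, the hundred dharmas in five categories, transforming consciousness into wisdom |
| 法藏 Fazang | Huayan (华严宗) | Dependent origination of the dharma realm, four dharma realms, ten mysterious gates, perfect interfusion of the six characteristics |
| 鸠摩罗什 Kumarajiva | Sanlun / Madhyamaka (三论宗/中观) | Middle Way of the eight negations, dependent origination and emptiness of inherent nature, the Dharma gate of non-duality |
| 印光 Yinguang | Pure Land (净土宗) | Faith, vow, and practice; reciting the Buddha's name; fulfilling one's ethical duties |
| 蕅益 Ouyi | Tiantai / Pure Land, cross-school (天台/净土·跨宗派) | Doctrine grounded in Tiantai with practice directed to the Pure Land, the six kinds of faith, integration of the nature and characteristics schools |
| 虚云 Xuyun | Chan, heir to all five houses (禅宗·五宗兼嗣) | Investigating the huatou, arousing the sense of doubt, honest and steady practice |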

Each master has a 100-150 line enriched system prompt with lineage, core doctrines, speaking style, teaching methods, key allusions, and terminology table. When a master is selected, RAG retrieval is scoped to their core scriptures (e.g., selecting Zhiyi only searches 《摩诃止观》《法华玄义》 etc.), providing more precise citations.
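Tradition-scoped retrieval can be pictured as a filter over candidate chunks. The sketch below is an assumption-laden illustration: the `MASTER_SCOPES` mapping, the `zhiyi` key, and the `title` field are hypothetical, and only the two Zhiyi titles named above are included.

```python
MASTER_SCOPES = {
    # Only the two titles named in the text; a real scope would list more.
    "zhiyi": {"摩诃止观", "法华玄义"},
}


def scope_chunks(chunks, master=None):
    """Drop retrieved chunks that fall outside the selected master's core scriptures.

    chunks: list of dicts with a hypothetical "title" key.
    With no master (or an unknown one), retrieval stays unscoped.
    """
    if master is None or master not in MASTER_SCOPES:
        return list(chunks)
    allowed = MASTER_SCOPES[master]
    return [chunk for chunk in chunks if chunk["title"] in allowed]
```

A production system would more likely push this restriction into the vector query itself, so that the top-k results are drawn from the scoped texts rather than filtered afterwards.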

Powered by Master-skill — the open-source Buddhist master AI persona framework.

Knowledge Graph Map (知识图谱地图)

Visualize 50,000+ geo-enabled Buddhist entities on an interactive world map — monasteries, historical places, persons, and schools. Built with Deck.GL + MapLibre.

  • Entity types: Monasteries (green), Places (purple), Persons (red), Schools (blue)
  • Lineage arcs: Toggle 8,000+ teacher-student lineage relations as animated arcs on the map
  • Chinese-only filter: Quickly filter to show only Chinese-named entities
  • Entity search: Find entities by name with simplified/traditional Chinese conversion (OpenCC)
  • Interactive tooltips: Hover to see metadata, country flags, and source attribution

Activity Feed (佛学动态)

Track real-time updates from 503 data sources — new texts added, translation releases, manuscript scans, and schema changes. Includes academic content aggregation and platform-wide activity summary.

Similar Passages Discovery

When reading any text, the sidebar automatically finds semantically similar passages from other texts using pgvector cosine similarity. Discover cross-textual parallels, related commentaries, and thematic connections across the entire canon.
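The ranking idea behind this feature can be shown in plain Python. This is only a sketch of cosine-similarity ranking over in-memory vectors: in the platform itself the comparison runs inside PostgreSQL via pgvector, whose `<=>` operator returns cosine distance (1 minus similarity). The tuple layout and function names below are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vec, passages, k=5, exclude_text_id=None):
    """Rank passages by similarity to query_vec, skipping the text being read.

    passages: iterable of (text_id, chunk_id, vector) tuples.
    Returns up to k (score, text_id, chunk_id) tuples, best first.
    """
    scored = [
        (cosine_similarity(query_vec, vec), text_id, chunk_id)
        for text_id, chunk_id, vec in passages
        if text_id != exclude_text_id
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

Excluding the current text keeps the sidebar focused on parallels from other works, which matches the cross-textual purpose described above.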

Timeline & Statistics Dashboard

Visualize Buddhist textual history with interactive D3 charts — dynasty distribution, translation trends, language breakdown, category treemap, and top translators. Toggle between scholarly and popular presentation modes.

Collections, Bookmarks & Annotations

Save texts to personal collections, bookmark specific passages, and add annotations for study and research.

Citation Export

Export citations in BibTeX, RIS, and APA formats for academic papers and reference managers.
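For readers curious what a BibTeX export might look like, here is a minimal sketch. The field selection and the `@misc` entry type are assumptions, not FoJin's actual export template. Note that `translator` is a biblatex field rather than a classic BibTeX one.

```python
def to_bibtex(key, title, translator="", collection="", catalog_number="", url=""):
    """Format one text as a @misc entry, omitting empty fields."""
    fields = {
        "title": title,
        "translator": translator,
        "series": collection,
        "number": catalog_number,
        "url": url,
    }
    body = ",\n".join(
        f"  {name} = {{{value}}}" for name, value in fields.items() if value
    )
    return f"@misc{{{key},\n{body}\n}}"
```

Calling `to_bibtex("example", "Example Sutra", translator="Example Translator")` yields an entry with only the `title` and `translator` fields, since empty values are skipped.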

Multi-Language UI

Available in 9 languages: Simplified Chinese, Traditional Chinese, English, Japanese, Korean, Thai, Vietnamese, Sinhala, and Burmese.

Data Sources

503 data sources from 30 countries

FoJin aggregates data from major Buddhist digital projects worldwide. Sources are categorized by research field (Han, Theravada, Tibetan, Sanskrit, Dunhuang, Art, Dictionary, Digital Humanities) and filterable by region, language, and type:

| Source | Content | Languages |
| --- | --- | --- |
| CBETA | Chinese Buddhist Canon | Classical Chinese |
| SuttaCentral | Early Buddhist Texts | Pali, Chinese, English |
| 84000 | Tibetan Buddhist Canon | Tibetan, English, Sanskrit |
| BDRC | Tibetan manuscripts (IIIF) | Tibetan |
| SAT | Taisho Tripitaka | Chinese, Japanese |
| DILA | Authority databases (persons, places, catalogs) | Multi-language |
| GRETIL | Sanskrit e-texts | Sanskrit |
| DSBC | Digital Sanskrit Buddhist Canon | Sanskrit |
| Gandhari.org | Gandhari manuscripts | Gandhari |
| VRI Tipitaka | Pali Canon (Chattha Sangayana) | Pali |
| Korean Tripitaka | Goryeo Tripitaka | Chinese, Korean |

+ 492 more...

Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | React 18, TypeScript, Vite, Ant Design 5, Zustand, TanStack Query, D3.js, Deck.GL + MapLibre (geo map) |
| Backend | FastAPI, SQLAlchemy (async), Pydantic v2, SSE streaming |
| Database | PostgreSQL 15 + pgvector (HNSW index) + pg_trgm |
| Search | Elasticsearch 8 (ICU tokenizer) |
| Cache | Redis 7 |
| AI | RAG (678K+ vectors, BGE-M3, HNSW) + 8 master personas + multi-provider LLM (OpenAI/Anthropic/DeepSeek/DashScope/Gemini/+10 more) |
| Deploy | Docker Compose, Nginx (gzip, security headers), Cloudflare CDN |
| CI | GitHub Actions (lint, test, security scan) |

Architecture

                  +-------------+
                  | Cloudflare  |  (CDN, SSL, DDoS protection)
                  +------+------+
                         |
                  +------+------+
                  |   Nginx     |  (gzip, security headers, static cache)
                  +------+------+
                         |
             +-----------+-----------+
             |                       |
       +-----+------+         +-----+------+
       |  React 18   |         |  FastAPI    |
       |  Vite + D3  |         |  async SSE  |
       +-------------+         +------+------+
                                      |
                   +--------+---------+---------+
                   |        |         |         |
             +-----+--+ +--+----+ +--+---+ +---+--------+
             | PG 15   | | ES 8  | |Redis | | LLM APIs   |
             | pgvector | | ICU   | |cache | | (multi-    |
             | HNSW idx | |       | |      | |  provider) |
             +---------+ +-------+ +------+ +------------+

Development

# Backend
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements-dev.txt
alembic upgrade head
uvicorn app.main:app --reload

# Frontend
cd frontend
npm install
npm run dev

# Tests
cd backend && pytest tests/ -q

Security

  • Non-root containers (backend: app, frontend: nginx)
  • Multi-stage Docker builds (no build tools in production)
  • Internal services bound to 127.0.0.1 only
  • Memory/CPU limits per container
  • CSP, X-Frame-Options, X-Content-Type-Options headers
  • Query length limits on all search parameters
  • JWT with 8h expiry; production requires a strong secret

Contributing

Contributions are welcome! Whether it's adding a new data source, improving search, fixing bugs, or translating the UI — we'd love your help.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feat/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feat/amazing-feature)
  5. Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.

Roadmap

  • Citation export (BibTeX, RIS, APA)
  • Mobile-responsive reader
  • Public REST API with rate limiting
  • User annotations
  • Community-contributed data sources
  • Internationalization (i18n) — 9 UI languages
  • Embedding-based semantic search (678K+ vectors, HNSW index)
  • AI Q&A with RAG, multi-turn context, and streaming
  • Similar passages discovery (cross-text semantic matching)
  • Timeline visualization and statistics dashboard
  • User feedback system and notification center
  • Admin dashboard (user management, platform analytics)
  • API documentation (OpenAPI/Swagger at /docs, ReDoc at /redoc)
  • AI answer reranking (keyword + optional API cross-encoder)
  • Clickable citation links in AI answers
  • Progressive follow-up suggestions after AI answers
  • "Ask XiaoJin" floating button on reader page
  • Tab key to cycle through suggested questions
  • CBETA-style text layout with verse/prose detection
  • Auto database migration on Docker startup
  • AI answer rating (thumbs up/down) for quality tracking
  • Research field filtering for data sources (8 categories)
  • Admin feedback reply with notification system
  • AI-powered data source recommendations in chat (semantic similarity)
  • DILA Authority lineage import (23K teacher-student relations)
  • DILA catalog associations (contributors, places for 2,300+ texts)
  • Nanshan Vinaya Dictionary (3,200+ Buddhist precept terms)
  • CBETA full-text import — Taishō (T) + Xuzangjing (X): 3,600+ texts, 143M characters, 432K embedding vectors
  • Dictionary expansion — 32 dictionaries, 748K entries (DPD, Apte, Mahāvyutpatti, Buddhadatta, Pentaglot, buddhaspace 7 dicts)
  • Master Persona Mode — 8 Buddhist masters with tradition-scoped RAG (powered by Master-skill)
  • Knowledge Graph Map — 50K+ geo entities, Deck.GL + MapLibre, lineage arcs
  • Activity Feed — real-time source update tracking, academic feeds
  • Inline split-view AI reader panel with draggable divider and independent scrolling
  • AI panel auto-open in reader for one-click access to interpretation
  • Meta-question detection in chat — recognizes "who are you / what can you do" and skips RAG
  • Trilingual cross-canon parallel reading (MVP) — 5 sutras × 142 LLM-verified chunk alignments across CBETA / SuttaCentral / 84000
  • Chat citation drawer with multi-language tabs (Chinese / Pali / Tibetan side by side)
  • Reader "他藏对读" inline sidebar — juan-level alignment index, coexists with AI panel, independent drag-resize
  • GFM markdown tables in AI answers (remark-gfm) — comparative responses render as proper tables
  • Anti-hallucination citation rules (Rule 4/4b) — forbid wrapping non-retrieved sutra names in 【…】
  • Server-side SEO meta injection for /texts/{id} pages — proper <title> / <description> per sutra for search engines (SPA route crawlability)
  • Trilingual MVP v1.1 — expand to 20+ sutras (Lotus, Avataṃsaka, Madhyamakakārikā, Laṅkāvatāra, full Āgama↔Nikāya)
  • Topic ontology browsing page
  • Cross-lingual search (query in Chinese, find Sanskrit/Pali/Tibetan results)
  • Open data export (JSON/CSV for researchers)
  • MCP Server for AI assistant integration
  • OCR pipeline for scanned texts
  • Collaborative annotation sharing
  • Integration with Zotero and reference managers

License

Apache License 2.0 — applies to FoJin source code only. Third-party data sources retain their own licenses (CC BY-NC-SA, CC0, CC BY-NC-ND, etc.). See NOTICE for details.

Acknowledgments

FoJin is built on the generous work of the global Buddhist digital humanities community. Special thanks to:

  • CBETA — Chinese Buddhist Electronic Text Association
  • SuttaCentral — Early Buddhist Texts
  • BDRC — Buddhist Digital Resource Center
  • 84000 — Translating the Words of the Buddha
  • SAT — SAT Daizokyo Text Database
  • All other data source providers listed in the Sources page

Community

  • LINUX DO — Thanks to the LINUX DO community for support and feedback


If FoJin is useful for your research, please consider giving it a star!

Discussions  ·  Issues  ·  Contributing  ·  contact@fojin.app

Made with care for the Buddhist studies community.