
DiagramBank

DiagramBank: A Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation (RAG).

DiagramBank is a large-scale, retrieval-ready collection of scientific schematic diagrams mined from top-tier AI/ML publications, paired with rich paper metadata and figure-local context. It is designed to support diagram retrieval, exemplar-driven scientific figure authoring, and broader multimodal research beyond generation.


Overview

DiagramBank Overview


Motivation

Autonomous “AI scientist” pipelines can draft text and code, but producing a publication-grade teaser/overview diagram is still a major bottleneck. Unlike routine data plots, a good scientific diagram requires conceptual synthesis, layout planning, consistent topology (arrows/relations), and readable annotations.

DiagramBank addresses this gap by providing a large bank of real, high-quality diagram exemplars with multiple levels of grounding text (paper title/abstract, figure caption, and in-text reference spans) so you can retrieve relevant designs at different granularities and use them for retrieval-augmented authoring.


What’s in DiagramBank

Each diagram record is enriched with both figure-level and paper-level information. Depending on availability, metadata can include:

  • Diagram image + caption
  • Figure context: paragraphs that cite the figure in the paper body (how authors explain the figure)
  • Paper title + abstract (paper intent / domain)
  • Additional OpenReview metadata such as decision status, reviewer scores, keywords/subject areas, URLs, BibTeX, etc.
  • A CLIP-based label and confidence score (to support controllable filtering)
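To make the record structure concrete, here is an illustrative sketch of one diagram record as a Python dict. The field names below are hypothetical, chosen only to mirror the bullets above; the actual schema lives in the DuckDB databases shipped with the dataset.

```python
# Illustrative shape of one DiagramBank record.
# NOTE: field names are hypothetical -- consult the shipped DuckDB
# databases (research.db) for the real column names and types.
record = {
    "image_path": "OpenReview/ICLR/figures/2024/...",   # diagram image
    "caption": "Figure 1: Overview of the proposed framework ...",
    "context": ["As shown in Figure 1, the encoder ..."],  # paragraphs citing the figure
    "title": "An Example Paper Title",                  # paper-level metadata
    "abstract": "We propose ...",
    "venue": "ICLR",
    "year": 2024,
    "decision": "Accept",                               # OpenReview metadata
    "scores": [6, 8, 7],
    "clip_label": "schematic_diagram",                  # CLIP-based label
    "clip_confidence": 0.91,                            # for controllable filtering
}

print(sorted(record))
```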

The repository also provides core retrieval artifacts (e.g., FAISS indices + DuckDB) so you can get started quickly.


Example use cases

DiagramBank is intentionally broader than “diagram generation”:

Retrieval-augmented diagram authoring (RAG)

  • Retrieve exemplar diagrams similar to your paper’s title/abstract/caption
  • Use retrieved exemplars to guide layout, style, grouping, iconography, and composition for teaser-style figures

Multimodal retrieval and benchmarking

  • Build and evaluate retrieval systems that operate on diagram-centric scientific content
  • Explore coarse-to-fine retrieval (paper-level → figure-level)

Diagram understanding / classification / clustering

  • Train or evaluate figure-type classifiers, diagram style classifiers, topic/style clustering, etc.
  • Study diagram conventions across venues/years

Paper-level analytics & scientometrics with visual signals

Because records are linked to paper metadata, you can explore questions like:

  • How do diagram properties correlate with acceptance decisions, review scores, or venue/year?
  • How does diagram density or caption verbosity evolve over time?
  • What diagram styles dominate specific subject areas?
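As a toy illustration of the first question, the sketch below correlates a per-paper diagram count with mean reviewer scores using only the standard library. All numbers are fabricated for demonstration; real values would come from the DuckDB metadata.

```python
from statistics import mean

# Toy, fabricated numbers purely for illustration: per-paper diagram
# counts and mean reviewer scores. Real values come from research.db.
diagram_counts = [1, 2, 2, 3, 4, 5]
mean_scores    = [4.0, 5.5, 5.0, 6.0, 6.5, 7.5]

def pearson(xs, ys):
    """Pearson correlation, computed by hand to stay stdlib-only."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(diagram_counts, mean_scores)
print(f"toy diagram-count vs. score correlation: r = {r:.2f}")
```

On real data you would group by venue/year and control for confounders before reading anything into such a correlation.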

Context-aware tasks

Figure context spans enable tasks that require more than captions:

  • Context-aware diagram retrieval
  • Diagram-to-text alignment and grounding
  • Studying how authors narrate and operationalize diagrams in scientific writing

Prerequisites

conda

conda env create --file environment.yml

Download the DiagramBank dataset

1) Set the target folder

The default download is large. Make sure you have enough disk space.

# The default subset (diagrams from accepted papers) is ~60 GB
# Set the target folder using the FIG_RAG_DIR environment variable
export FIG_RAG_DIR=<a scratch folder with at least 60 GB of space>

2) Download options

# 1. Default: Download Accepted papers + Core files (DBs/FAISS)
python huggingface/download_diagrambank.py

# 2. Download Everything: All papers (Accept + Reject) + Core files
# python huggingface/download_diagrambank.py --subset all

# 3. Download Rejected papers only + Core files
# python huggingface/download_diagrambank.py --subset reject

# 4. Skip Core Files: Download only images (no DBs or FAISS)
# python huggingface/download_diagrambank.py --no-core

# 5. Combine Flags: Download all images but skip core files
# python huggingface/download_diagrambank.py --subset all --no-core

The script automatically downloads and extracts the diagram folders, FAISS indices, and DuckDB databases to $FIG_RAG_DIR. The process can take 15–30 minutes depending on network speed.


Check installation

du -sh $FIG_RAG_DIR
51G 
tree -L 4 $FIG_RAG_DIR

├── faiss
│   ├── abstract_index
│   │   ├── index.faiss
│   │   └── index.pkl
│   ├── caption_index
│   │   ├── index.faiss
│   │   └── index.pkl
│   ├── research.db
│   └── title_index
│       ├── index.faiss
│       └── index.pkl
└── OpenReview
    ├── ICLR
    │   ├── figures
    │   │   ├── 2017
    │   │   ├── 2018
    │   │   ├── 2019
    │   │   ├── 2020
    │   │   ├── 2021
    │   │   ├── 2022
    │   │   ├── 2023
    │   │   ├── 2024
    │   │   ├── 2025
    │   │   └── 2026
    │   └── research.db
    ├── ICML
    │   ├── figures
    │   │   ├── 2023
    │   │   ├── 2024
    │   │   └── 2025
    │   └── research.db
    ├── NeurIPS
    │   ├── figures
    │   │   ├── 2021
    │   │   ├── 2022
    │   │   ├── 2024
    │   │   └── 2025
    │   └── research.db
    └── TMLR
        ├── figures
        │   ├── 2022
        │   ├── 2023
        │   ├── 2024
        │   ├── 2025
        │   └── 2026
        └── research.db

What are these indices?

  • title_index: coarse paper-level filtering (topic/domain alignment)
  • abstract_index: paper-level refinement (problem/method alignment)
  • caption_index: figure-level matching (diagram content alignment)

Usage

Set your OpenAI API key

export OPENAI_API_KEY=<your openai api key>

The key is only used to embed the query text at runtime, so the cost is very low: $0.13 per 1M tokens with text-embedding-3-large (https://costgoat.com/pricing/openai-embeddings). A paper title is 5–25 words, an abstract 150–250 words, and a caption 10–100 words. Taking an upper bound of 500 words at roughly 1.33 tokens per word gives about 665 tokens; call it 1,000 tokens per query as a generous upper bound. At that rate, 1,000 queries cost about $0.13.

Note: pricing can change; treat the above as a back-of-the-envelope estimate.
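The arithmetic above can be packaged as a small helper (the price constant reflects the rate quoted above and may go stale):

```python
# Back-of-the-envelope embedding cost. Pricing may change; check
# OpenAI's current rates for text-embedding-3-large.
PRICE_USD_PER_MTOK = 0.13

def embedding_cost(tokens_per_query: int, n_queries: int) -> float:
    """Estimated USD cost of embedding query text."""
    return tokens_per_query * n_queries / 1_000_000 * PRICE_USD_PER_MTOK

# Generous upper bound from the text: ~1,000 tokens per query.
print(f"1,000 queries cost about ${embedding_cost(1000, 1000):.2f}")
```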

Retrieve similar diagrams for your figure

To retrieve similar diagrams for your figures, open demo/query-diagram.ipynb. Set the title, abstract, and caption for your paper, then rerun the fourth cell to fetch similar diagrams.

  • t1: number of candidates retrieved by title similarity (coarse paper-level filter)
  • t2: number of candidates retained by abstract similarity (paper-level refinement)
  • k: number of diagrams returned by caption similarity (final figure-level ranking)

hierarchical_retrieval() will retrieve the top-k similar diagrams based on your title, abstract, and caption.
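The coarse-to-fine logic behind hierarchical_retrieval() can be sketched in pure Python. The toy 2-D vectors and helper names below are illustrative only, not the repository's actual implementation, which uses the FAISS indices and OpenAI embeddings.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top_k(query, items, key, k):
    """Rank items by cosine similarity between query and items[i][key]."""
    return sorted(items, key=lambda it: cosine(query, it[key]), reverse=True)[:k]

def hierarchical_retrieval(q_title, q_abs, q_cap, records, t1=50, t2=20, k=5):
    """Coarse-to-fine: title filter -> abstract filter -> caption ranking."""
    stage1 = top_k(q_title, records, "title_vec", t1)   # paper-level, coarse
    stage2 = top_k(q_abs, stage1, "abstract_vec", t2)   # paper-level, refined
    return top_k(q_cap, stage2, "caption_vec", k)       # figure-level, final

# Toy 2-D embeddings for three diagram records.
records = [
    {"id": "A", "title_vec": [1, 0],     "abstract_vec": [1, 0],     "caption_vec": [1, 0]},
    {"id": "B", "title_vec": [0.9, 0.1], "abstract_vec": [0.8, 0.2], "caption_vec": [0, 1]},
    {"id": "C", "title_vec": [0, 1],     "abstract_vec": [0, 1],     "caption_vec": [0, 1]},
]
best = hierarchical_retrieval([1, 0], [1, 0], [1, 0], records, t1=2, t2=2, k=1)
print(best[0]["id"])  # record A survives all three stages
```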


Hugging Face

The dataset and dataset card are hosted at: https://huggingface.co/datasets/zhangt20/DiagramBank


Reproduce this work

To reproduce this work, see reproduce/README.md. The full pipeline may take a few days up to a week.


Notes on responsible use

  • DiagramBank is mined from publicly accessible scientific PDFs and includes metadata for attribution and traceability.
  • Figures may be subject to the original authors’/publishers’ licenses and terms. Please use responsibly and cite the relevant sources.
  • If you use DiagramBank for generative authoring, we recommend provenance tracking and disclosure for AI-generated figures where appropriate.

Citation

If you use DiagramBank in your research, please cite our paper (and consider citing the original papers for any retrieved exemplars you use directly).

@misc{zhang2026diagrambank,
  title        = {DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation},
  author       = {Zhang, Tingwen and Yue, Ling and Xu, Zhen and Pan, Shaowu},
  year         = {2026},
  month        = feb,
  howpublished = {Research Square},
  note         = {Preprint (Version 1)},
  doi          = {10.21203/rs.3.rs-8917857/v1},
  url          = {https://doi.org/10.21203/rs.3.rs-8917857/v1}
}
