seungbinshin/drive-og

drive_og

Archive your Google Drive documents and generate an interconnected Obsidian knowledge graph. The personal counterpart to opsidian_graph — together they map two sides of yourself: personal (Drive) and professional (Git/JIRA/Confluence).

What It Does

drive_og connects to your Google Drive, extracts text from documents (Docs, Sheets, Slides, PDFs), categorizes them by topic, and generates an Obsidian vault where everything is linked by folder, topic, and time.

Google Drive                         Obsidian Vault
┌──────────────────┐                 ┌──────────────────────────┐
│ University/      │    drive_og     │ GoogleDrive/University/  │
│   2023/          │  ──────────>    │ Folders/University--2023 │
│     ML-Notes.doc │      sync       │ Topics/Machine-Learning  │
│     Textbook.pdf │                 │ Weekly/2024-W03          │
│ Finance/         │                 │ Monthly/2024-01          │
│   Budget.xlsx    │                 │ Dashboard.md             │
└──────────────────┘                 │ Expertise.md             │
                                     └──────────────────────────┘

Open the vault in Obsidian and explore the graph view. Install the 3D Graph community plugin for a 3D visualization.

Quick Start

1. Set Up Google API Credentials

  1. Go to Google Cloud Console
  2. Create a project (or select existing), then enable Google Drive API under APIs & Services > Library
  3. Go to APIs & Services > OAuth consent screen:
    • Choose "External" user type
    • Fill in app name and your email
    • Add scope: https://www.googleapis.com/auth/drive.readonly
    • Important: Under "Test users", click + ADD USERS and add your own Gmail address (the app is in Testing mode, so only listed test users can log in)
  4. Go to APIs & Services > Credentials:
    • Click + CREATE CREDENTIALS > OAuth client ID
    • Application type: Desktop app
    • Copy the Client ID and Client Secret

2. Configure

# Copy example configs
cp config/drives.yaml.example config/drives.yaml
cp config/topics.yaml.example config/topics.yaml

# Add your credentials
cat > .env << 'EOF'
GOOGLE_CLIENT_ID=your-client-id-here
GOOGLE_CLIENT_SECRET=your-client-secret-here
ANTHROPIC_API_KEY=sk-ant-...   # Optional: enables LLM topic categorization
EOF

Edit config/drives.yaml to specify which folders to scan:

folders:
  - id: "root"                    # Scan entire My Drive
    name: "My Drive"
    color: "#50C878"
  - id: "1aBcDeFgHiJkLmNoPqR"    # Or specific folder IDs
    name: "University"
    color: "#4A90D9"

extraction:
  shallow_threshold_mb: 1         # Files above this get partial extraction
  shallow_max_pages: 5            # Pages to extract from large files (e.g., textbooks)

Define topic patterns in config/topics.yaml:

topics:
  Machine Learning:
    - "\\bml\\b"
    - "\\bdeep.learning"
  Finance:
    - "\\bbudget\\b"
    - "\\brevenue\\b"
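
As a rough illustration of how patterns like these could be applied, the sketch below compiles nothing ahead of time and simply searches each pattern against the document text. The real matcher lives in opsidian_core and may differ; the name match_topics and the case-insensitive flag are assumptions made for this example. Note that in a double-quoted YAML string the doubled backslash loads as a single one, so "\\bml\\b" becomes the regex \bml\b.

```python
import re

# Patterns as they would look after YAML loading ("\\b" in the file becomes "\b").
TOPIC_PATTERNS = {
    "Machine Learning": [r"\bml\b", r"\bdeep.learning"],
    "Finance": [r"\bbudget\b", r"\brevenue\b"],
}


def match_topics(text, topic_patterns=TOPIC_PATTERNS):
    """Return every topic with at least one matching pattern (hypothetical helper)."""
    matched = []
    for topic, patterns in topic_patterns.items():
        # Case-insensitive search is an assumption, not confirmed by the project.
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            matched.append(topic)
    return matched
```

The word boundaries matter: "ML" inside "HTML" does not match \bml\b, so a document about HTML is not tagged as Machine Learning.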

3. Install and Run

pip install -e ../opsidian_core   # Install shared library
pip install -e .                   # Install drive_og

drive-og init                      # Authenticate with Google (opens browser)
drive-og sync                      # Fetch docs, extract text, generate vault

4. Open in Obsidian

Open the vault/ directory as an Obsidian vault.

Commands

| Command | Description |
|---|---|
| drive-og init [--reauth] | OAuth authentication + discover top-level folders |
| drive-og sync [--full] [--no-llm] | Fetch documents, cache, and generate vault |
| drive-og generate [--no-llm] | Rebuild vault from local cache (offline, no API calls) |

Flags:

  • --full — re-fetch all documents (ignores incremental sync state)
  • --no-llm — skip Claude-based topic categorization (keyword-only)
  • --reauth — force re-authentication even if token exists
  • -v / --verbose — debug logging

Generated Vault Structure

vault/
├── Dashboard.md                 # Stats, recent activity, quick links
├── Expertise.md                 # Auto-generated interest/knowledge profile
│
├── GoogleDrive/                 # One .md per document, mirroring folder structure
│   ├── University/
│   │   ├── 2023/
│   │   │   ├── ML-Lecture-Notes.md
│   │   │   └── Deep-Learning-Textbook.md   (shallow extraction)
│   │   └── 2024/
│   │       └── Thesis-Draft.md
│   ├── Projects/
│   │   └── Side-Project-Spec.md
│   └── Finance/
│       └── Budget-2024.md
│
├── Folders/                     # Map of Content per folder
│   ├── University.md
│   ├── University--2023.md      # Nested folders use -- separator
│   ├── Projects.md
│   └── Finance.md
│
├── Topics/                      # Cross-cutting topic MOCs
│   ├── Machine-Learning.md      # Links ALL ML docs across folders
│   ├── Finance.md
│   └── Research.md
│
├── Weekly/                      # Activity by ISO week
│   └── 2024-W03.md
├── Monthly/                     # Monthly rollups
│   └── 2024-01.md
│
└── .obsidian/
    └── graph.json               # Color-coding for graph view

Graph Connections

Three dimensions of wikilinks create a rich knowledge graph:

  1. Folder links ([[University--2023]]): preserve your original Drive hierarchy
  2. Topic links ([[Machine Learning]]): connect related docs across folders
  3. Temporal links ([[2024-W03]]): show when you worked on what

A document about ML in University/2023/ and another in Projects/ both link to [[Machine Learning]], connecting them even though they live in different folders.
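
The three link types can be sketched as a small function that derives them from a document's folder path, topics, and modification date. This is a minimal illustration, not the project's vault writer; build_wikilinks and its parameters are hypothetical names.

```python
from datetime import date


def build_wikilinks(folder_path, topics, modified):
    """Return folder, topic, and temporal wikilinks for one document."""
    # Nested folders are joined with "--", as in Folders/University--2023.md.
    folder_link = "[[" + "--".join(p for p in folder_path.split("/") if p) + "]]"
    topic_links = [f"[[{t}]]" for t in topics]
    # ISO week label such as 2024-W03; the ISO year can differ from the
    # calendar year around New Year, so take both values from isocalendar().
    iso = modified.isocalendar()
    week_link = f"[[{iso[0]}-W{iso[1]:02d}]]"
    return [folder_link, *topic_links, week_link]
```

Two documents in different folders that share a topic produce the same topic link, which is what joins them in the graph view.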

Content Extraction

| File Type | Method | Notes |
|---|---|---|
| Google Docs | Export as plain text | Full text via Drive API |
| Google Sheets | Export as CSV | Tabular data preserved |
| Google Slides | Export as plain text | Slide content extracted |
| PDFs | pypdf text extraction | Handles multi-page documents |
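
One way to express this table in code is a lookup from a file's Drive MIME type to the export format to request. The MIME type strings are the standard Google Workspace ones; the function name extraction_plan and its return shape are assumptions for this sketch, not the layout of content_extractor.py.

```python
# Google Workspace MIME types mapped to the export format named in the table.
EXPORT_MIME_TYPES = {
    "application/vnd.google-apps.document": "text/plain",      # Google Docs
    "application/vnd.google-apps.spreadsheet": "text/csv",     # Google Sheets
    "application/vnd.google-apps.presentation": "text/plain",  # Google Slides
}
PDF_MIME_TYPE = "application/pdf"


def extraction_plan(mime_type):
    """Return (method, export_mime) for a Drive file, or None if unsupported."""
    if mime_type in EXPORT_MIME_TYPES:
        return ("export", EXPORT_MIME_TYPES[mime_type])
    if mime_type == PDF_MIME_TYPE:
        # PDFs are downloaded as-is and parsed locally (pypdf in this project).
        return ("download", None)
    return None
```

Keeping the mapping in a single dictionary makes it easy to see which file types are covered and to add another export format later.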

Size-Tiered Extraction

Extraction depth depends on file size, so large files (e.g., university textbooks) do not bloat the vault:

  • Small files (at or below shallow_threshold_mb, 1 MB by default): full text extraction
  • Large files (above the threshold): first N pages only (N = shallow_max_pages in config)
  • Near-empty extraction (<50 chars, e.g., scanned PDFs): metadata only

The extraction field in each note's frontmatter records which mode was used (full, shallow, or metadata_only).
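
The tiering rules can be sketched as a single decision function. This is an illustration under stated assumptions rather than the code in content_extractor.py: the name choose_extraction_mode is hypothetical, 1 MB is taken to be 1024 * 1024 bytes, and surrounding whitespace is ignored when counting characters.

```python
def choose_extraction_mode(size_bytes, extracted_text,
                           shallow_threshold_mb=1, min_chars=50):
    """Pick the value recorded in a note's `extraction` frontmatter field."""
    # Near-empty text (e.g. a scanned PDF) falls back to metadata only.
    if len(extracted_text.strip()) < min_chars:
        return "metadata_only"
    # Files above the threshold only had their first pages extracted.
    if size_bytes > shallow_threshold_mb * 1024 * 1024:
        return "shallow"
    return "full"
```

The near-empty check runs first here so that a large scanned PDF is recorded as metadata_only rather than shallow.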

Topic Categorization

Priority stack (highest wins):

  1. Keyword match: regex patterns from config/topics.yaml
  2. LLM fallback: Claude Haiku classifies unmatched text (requires ANTHROPIC_API_KEY)
  3. Default: "Uncategorized"

Documents can have multiple topics. When keyword and LLM disagree, keyword wins (user-defined intent).
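
The priority stack can be sketched as follows. The function resolve_topics and the llm_classify callable are hypothetical stand-ins (the latter for the Claude call); passing llm_classify=None roughly corresponds to running with --no-llm. The actual categorizer in opsidian_core may be structured differently.

```python
def resolve_topics(keyword_topics, text, llm_classify=None):
    """Resolve topics with keyword > LLM > default priority (hypothetical)."""
    # 1. Keyword matches come from user-defined patterns, so they win outright
    #    and the LLM is never consulted.
    if keyword_topics:
        return list(keyword_topics)
    # 2. Only unmatched text reaches the LLM, and only if one is configured.
    if llm_classify is not None:
        llm_topics = llm_classify(text)
        if llm_topics:
            return list(llm_topics)
    # 3. Nothing matched anywhere.
    return ["Uncategorized"]
```

Because the LLM is consulted only when no keyword matched, a disagreement between the two can never override a pattern the user wrote.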

Incremental Sync

drive-og sync only fetches documents modified since the last sync. State is tracked in .drive_og_state.json.

  • First run: fetches everything
  • Subsequent runs: only changed documents
  • --full: re-fetches everything regardless

The local cache (cache/) is the source of truth for vault generation. You can edit cache JSON files manually, then run drive-og generate to rebuild.
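
A minimal sketch of the incremental filter is shown below. It assumes the state file stores a single last_sync timestamp under that key, which is a guess at the schema of .drive_og_state.json; modifiedTime is the field the Drive API returns for each file. The function names are hypothetical.

```python
import json
from datetime import datetime
from pathlib import Path


def parse_ts(value):
    """Parse an RFC 3339 timestamp such as 2024-01-17T09:30:00.000Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def load_last_sync(state_path):
    """Return the last sync time, or None on a first run (no state file)."""
    path = Path(state_path)
    if not path.exists():
        return None
    # The "last_sync" key is an assumed schema, not confirmed by the project.
    return parse_ts(json.loads(path.read_text())["last_sync"])


def select_changed(files, last_sync, full=False):
    """Keep files modified after the last sync; keep all on first run or --full."""
    if full or last_sync is None:
        return list(files)
    return [f for f in files if parse_ts(f["modifiedTime"]) > last_sync]
```

In practice the same cutoff could also be pushed into the Drive API list query so unchanged files are never returned at all; the client-side filter above just makes the rule explicit.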

Project Structure

drive_og/
├── src/drive_og/
│   ├── cli.py                # CLI entry point (init, sync, generate)
│   ├── models.py             # GoogleDriveDocument dataclass
│   ├── config.py             # drives.yaml + topics.yaml loading
│   ├── auth.py               # OAuth 2.0 desktop flow
│   ├── gdrive_client.py      # Drive API: list files, resolve paths
│   └── content_extractor.py  # Size-tiered text extraction
├── config/
│   ├── drives.yaml.example   # Template for user configuration
│   └── topics.yaml.example   # Template for topic patterns
├── templates/
│   ├── gdrive_note.md.j2     # Per-document note template
│   └── folder_moc.md.j2      # Folder Map of Content template
├── tests/                     # 17 tests (unit + integration)
├── docs/superpowers/
│   ├── specs/                 # Design specification
│   └── plans/                 # Implementation plan
└── pyproject.toml

Dependencies

  • opsidian_core — shared library (cache, categorizer, vault writer, etc.)
  • google-api-python-client — Google Drive API
  • google-auth-oauthlib — OAuth 2.0 authentication
  • pypdf — PDF text extraction
  • anthropic — Claude API for LLM categorization (optional)

Ecosystem

drive_og is part of the opsidian knowledge graph ecosystem:

                    ┌─────────────────────┐
                    │   opsidian_core      │
                    │   (shared library)   │
                    └──┬────────┬────────┬─┘
                       │        │        │
              ┌────────┘        │        └────────┐
              v                 v                 v
     opsidian_graph         drive_og         opsidian_meta
     (work self)       (personal self)    (unified analysis)
     Git/JIRA/Confluence  Google Drive    reads both caches
              │                 │                 │
              v                 v                 v
     work vault/          personal vault/    meta vault/
                                            (timeline, focus
                                             reports, gaps)

| Project | What it does |
|---|---|
| opsidian_core | Shared library for all graph generators |
| opsidian_graph | Work knowledge graph (GitHub PRs, JIRA, Confluence) |
| drive_og | Personal knowledge graph (Google Drive) |
| opsidian_meta | Unified productivity analysis (timeline, focus reports, gap detection) |

After syncing with drive_og, you can run opsidian_meta to generate cross-domain productivity reports that combine your work and personal activity.

Tests

pip install -e ../opsidian_core
pip install -e .
python -m pytest tests/ -v

17 tests covering models, auth, config, client, extractor, CLI, and end-to-end integration.

License

MIT
