Skip to content

Conversation

@cbc3929
Copy link

@cbc3929 cbc3929 commented Dec 31, 2025

Summary

This PR adds comprehensive PostGIS documentation support to pg-aiguide, enabling AI coding assistants to provide better guidance for spatial database operations.

New Features

  • PostGIS Documentation Scraper (ingest/postgis_docs.py)
    • Dedicated scraper for PostGIS manual (DocBook HTML format)
    • Supports both file and database storage modes
    • Header-based markdown chunking with token counting
  • Search APIs
    • semantic_search_postgis_docs - Vector similarity search for PostGIS documentation
    • keyword_search_postgis_docs - BM25 keyword search for PostGIS documentation
  • Database Migration
    • postgis_pages and postgis_chunks tables
    • HNSW index for fast vector similarity search

Enhanced Embedding Configuration

Added support for custom embedding providers (beyond OpenAI):

  • OPENAI_BASE_URL - Custom OpenAI-compatible API endpoint (e.g., Ollama, SiliconFlow)
  • EMBEDDING_MODEL - Custom embedding model name
  • EMBEDDING_DIMENSIONS - Configurable vector dimensions
    This allows users to use alternative embedding services while maintaining compatibility with the existing database schema.

Testing

  • ✅ TypeScript build passes
  • ✅ Python syntax validation passes
  • ✅ Database migration tested
  • ✅ Scraper tested with file mode (5 pages)
  • ✅ Scraper tested with database mode (3 pages, 43 chunks)
  • ✅ Semantic search verified with vector similarity queries

Usage

# Scrape PostGIS documentation
cd ingest
uv run python postgis_docs.py --version 3.5 --storage-type database
# With custom embedding provider
export OPENAI_BASE_URL=https://api.siliconflow.cn/v1
export EMBEDDING_MODEL=Qwen/Qwen3-Embedding-8B
uv run python postgis_docs.py --version 3.5 --storage-type database
Checklist
- Code follows project conventions
- All comments in English
- Documentation updated (README.md, .env.sample)
- Database migration included
- Tests performed locally

@CLAassistant
Copy link

CLAassistant commented Dec 31, 2025

CLA assistant check
All committers have signed the CLA.

Add comprehensive PostGIS documentation scraping and semantic search capabilities:

New features:
- PostGIS manual scraper (postgis_docs.py) for DocBook HTML documentation
- Semantic search API (semantic_search_postgis_docs)
- Keyword search API (keyword_search_postgis_docs)
- Database migration for postgis_pages and postgis_chunks tables
- HNSW vector index for fast similarity search

Enhanced embedding configuration:
- Add OPENAI_BASE_URL for custom OpenAI-compatible endpoints
- Add EMBEDDING_MODEL for custom embedding model selection
- Add EMBEDDING_DIMENSIONS for configurable vector dimensions
- Support third-party embedding services (e.g., SiliconFlow, Ollama)

Updated files:
- README.md: Add PostGIS to supported extensions
- .env.sample: Document new embedding configuration options
- tiger_docs.py, postgres_docs.py: Add custom embedding support
@cbc3929 cbc3929 force-pushed the feat/postgis-support branch from b716454 to e5d7b77 Compare December 31, 2025 06:30
Copy link
Member

@murrayju murrayju left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the submission! A few minor requests, if you don't mind fixing them.

conn.execute("DROP TABLE IF EXISTS docs.postgis_chunks_tmp CASCADE")
conn.execute("DROP TABLE IF EXISTS docs.postgis_pages_tmp CASCADE")

# 创建页面表
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use English for all comments, please?

)
""")

# 创建块表
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too


args = parser.parse_args()

# 验证数据库存储需求
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here as well.

, sub_chunk_index INTEGER NOT NULL DEFAULT 0
, content TEXT NOT NULL
, metadata JSONB
, embedding vector(1536)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema here (and in the existing pg and tiger schemas) hard codes the vector size to 1536. I don't think adding EMBEDDING_DIMENSIONS as an environment variable adds value, unless the schema would be updated/migrated to match. Otherwise, inserts will just fail.

I'd be inclined to keep this fixed as 1536 for now.

pg_database = os.environ.get("PGDATABASE")

if all([pg_user, pg_password, pg_host, pg_port, pg_database]):
return f"postgresql://{pg_user}:{pg_password}@{pg_host}:{pg_port}/{pg_database}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The password is not being url-encoded, so a password containing e.g. @ would break the connection string. I can see this is also an issue in the existing code, so we could file a ticket to address this later if you'd prefer.

"""

import argparse
from dataclasses import dataclass, field
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

field is not used

from psycopg.sql import SQL, Identifier
import re
import requests
from urllib.parse import urljoin, urlparse
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

urlparse is not used

import time

THIS_DIR = Path(__file__).parent.resolve()
load_dotenv(dotenv_path=os.path.join(THIS_DIR, "..", ".env"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
load_dotenv(dotenv_path=os.path.join(THIS_DIR, "..", ".env"))
load_dotenv(dotenv_path=THIS_DIR.parent / ".env")

- Replace Chinese comments with English
- Remove unused imports (field, urlparse)
- Use pathlib style for dotenv path
- Fix EMBEDDING_DIMENSIONS to 1536 to match database schema
- Add URL encoding for password in connection string
Comprehensive reference covering:
- Geometry vs Geography selection guide
- Coordinate systems (SRID) best practices
- Spatial indexing (GiST, BRIN, SP-GiST)
- Table design examples (POI, parcels, GPS tracking)
- Performance optimization patterns
- Data validation techniques
@cbc3929
Copy link
Author

cbc3929 commented Jan 7, 2026

Hi @murrayju, thank you for the thorough review!

I've addressed all the feedback in my latest commits:

  • ✅ Replaced all Chinese comments with English - apologies for that oversight!
  • ✅ Removed unused imports (field, urlparse)
  • ✅ Updated load_dotenv to use pathlib style
  • ✅ Fixed EMBEDDING_DIMENSIONS to 1536 to match the database schema
  • ✅ Added URL encoding for password to handle special characters

Regarding EMBEDDING_DIMENSIONS: My original intent was to support alternative embedding providers like Qwen3-Embedding-8B (which defaults to 4096 dimensions) for users who prefer local/self-hosted models. However, I understand that without updating the database schema to match, this would just cause insert failures. I've reverted it to a fixed value of 1536 as you suggested. If there's interest in supporting configurable dimensions in the future, we could address the schema side as well.

I've also added a new skills/design-postgis-tables/SKILL.md - a comprehensive PostGIS spatial table design reference that complements PR #62's coding style guide.

Thanks again for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants