feat(postgis): add PostGIS documentation support #59
base: main
Add comprehensive PostGIS documentation scraping and semantic search capabilities:

New features:
- PostGIS manual scraper (postgis_docs.py) for DocBook HTML documentation
- Semantic search API (semantic_search_postgis_docs)
- Keyword search API (keyword_search_postgis_docs)
- Database migration for postgis_pages and postgis_chunks tables
- HNSW vector index for fast similarity search

Enhanced embedding configuration:
- Add OPENAI_BASE_URL for custom OpenAI-compatible endpoints
- Add EMBEDDING_MODEL for custom embedding model selection
- Add EMBEDDING_DIMENSIONS for configurable vector dimensions
- Support third-party embedding services (e.g., SiliconFlow, Ollama)

Updated files:
- README.md: Add PostGIS to supported extensions
- .env.sample: Document new embedding configuration options
- tiger_docs.py, postgres_docs.py: Add custom embedding support
b716454 to
e5d7b77
Compare
murrayju
left a comment
Thanks for the submission! A few minor requests, if you don't mind fixing them.
ingest/postgis_docs.py
Outdated
    conn.execute("DROP TABLE IF EXISTS docs.postgis_chunks_tmp CASCADE")
    conn.execute("DROP TABLE IF EXISTS docs.postgis_pages_tmp CASCADE")

    # 创建页面表
Can we use English for all comments, please?
ingest/postgis_docs.py
Outdated
    )
    """)

    # 创建块表
Here too.
ingest/postgis_docs.py
Outdated
    args = parser.parse_args()

    # 验证数据库存储需求
Here as well.
    , sub_chunk_index INTEGER NOT NULL DEFAULT 0
    , content TEXT NOT NULL
    , metadata JSONB
    , embedding vector(1536)
The schema here (and in the existing pg and tiger schemas) hard codes the vector size to 1536. I don't think adding EMBEDDING_DIMENSIONS as an environment variable adds value, unless the schema would be updated/migrated to match. Otherwise, inserts will just fail.
I'd be inclined to keep this fixed as 1536 for now.
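The mismatch described above can be caught before the insert is attempted. The following is only a minimal sketch, not code from this PR: the function name `check_embedding_dimensions` and the constant `SCHEMA_VECTOR_DIMENSIONS` are hypothetical, and the only fact taken from the thread is that the column is declared as `vector(1536)`.

```python
# Hypothetical constant: mirrors the `embedding vector(1536)` column in the schema.
SCHEMA_VECTOR_DIMENSIONS = 1536


def check_embedding_dimensions(embedding, expected=SCHEMA_VECTOR_DIMENSIONS):
    """Return the embedding unchanged, or raise if its length does not match the schema.

    Hypothetical helper: fails fast with a readable message instead of
    letting the database reject the row at insert time.
    """
    actual = len(embedding)
    if actual != expected:
        raise ValueError(
            f"embedding has {actual} dimensions, but the schema column expects {expected}"
        )
    return embedding
```

A guard like this would sit between the embedding call and the insert, so a misconfigured model is reported once, up front, rather than on every row.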
ingest/postgis_docs.py
Outdated
    pg_database = os.environ.get("PGDATABASE")

    if all([pg_user, pg_password, pg_host, pg_port, pg_database]):
        return f"postgresql://{pg_user}:{pg_password}@{pg_host}:{pg_port}/{pg_database}"
The password is not being url-encoded, so a password containing e.g. @ would break the connection string. I can see this is also an issue in the existing code, so we could file a ticket to address this later if you'd prefer.
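One way to address this, shown here only as a sketch and not as the fix that landed in the PR, is to percent-encode the credentials with `urllib.parse.quote` before interpolating them. The function name `build_connection_string` is hypothetical; `safe=""` is passed so that `/`, which `quote` leaves unescaped by default, is encoded as well.

```python
from urllib.parse import quote


def build_connection_string(user, password, host, port, database):
    """Build a postgresql:// URI with percent-encoded user, password, and database.

    Hypothetical helper: reserved characters such as '@', ':' and '/' in the
    credentials would otherwise be read as URI delimiters.
    """
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )
```

For example, a password of `p@ss:w/rd` becomes `p%40ss%3Aw%2Frd` in the resulting URI, so the only unescaped `@` left is the one separating the credentials from the host.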
ingest/postgis_docs.py
Outdated
    """

    import argparse
    from dataclasses import dataclass, field
field is not used.
ingest/postgis_docs.py
Outdated
    from psycopg.sql import SQL, Identifier
    import re
    import requests
    from urllib.parse import urljoin, urlparse
urlparse is not used.
ingest/postgis_docs.py
Outdated
    import time

    THIS_DIR = Path(__file__).parent.resolve()
    load_dotenv(dotenv_path=os.path.join(THIS_DIR, "..", ".env"))
Suggested change:

    - load_dotenv(dotenv_path=os.path.join(THIS_DIR, "..", ".env"))
    + load_dotenv(dotenv_path=THIS_DIR.parent / ".env")
- Replace Chinese comments with English
- Remove unused imports (field, urlparse)
- Use pathlib style for dotenv path
- Fix EMBEDDING_DIMENSIONS to 1536 to match database schema
- Add URL encoding for password in connection string
Comprehensive reference covering:
- Geometry vs Geography selection guide
- Coordinate systems (SRID) best practices
- Spatial indexing (GiST, BRIN, SP-GiST)
- Table design examples (POI, parcels, GPS tracking)
- Performance optimization patterns
- Data validation techniques
Hi @murrayju, thank you for the thorough review! I've addressed all the feedback in my latest commits:
Regarding I've also added a new Thanks again for the review!
Summary
This PR adds comprehensive PostGIS documentation support to pg-aiguide, enabling AI coding assistants to provide better guidance for spatial database operations.
New Features
- PostGIS manual scraper (ingest/postgis_docs.py)
- semantic_search_postgis_docs - Vector similarity search for PostGIS documentation
- keyword_search_postgis_docs - BM25 keyword search for PostGIS documentation
- Database migration for the postgis_pages and postgis_chunks tables

Enhanced Embedding Configuration
Added support for custom embedding providers (beyond OpenAI):
- OPENAI_BASE_URL - Custom OpenAI-compatible API endpoint (e.g., Ollama, SiliconFlow)
- EMBEDDING_MODEL - Custom embedding model name
- EMBEDDING_DIMENSIONS - Configurable vector dimensions

This allows users to use alternative embedding services while maintaining compatibility with the existing database schema.
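As a rough illustration of how these three variables might be read, here is a minimal sketch. It is not the code in this PR: `EmbeddingConfig`, `load_embedding_config`, and the default model name `text-embedding-3-small` are assumptions made for the example; only the variable names and the 1536 dimension come from this page.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed defaults for illustration only; the PR's actual defaults may differ.
DEFAULT_EMBEDDING_MODEL = "text-embedding-3-small"
DEFAULT_EMBEDDING_DIMENSIONS = 1536  # matches the vector(1536) schema column


@dataclass(frozen=True)
class EmbeddingConfig:
    base_url: Optional[str]  # None means "use the client's default endpoint"
    model: str
    dimensions: int


def load_embedding_config(env: Mapping[str, str] = os.environ) -> EmbeddingConfig:
    """Read the embedding settings from an environment-like mapping (hypothetical helper)."""
    return EmbeddingConfig(
        base_url=env.get("OPENAI_BASE_URL") or None,  # treat empty string as unset
        model=env.get("EMBEDDING_MODEL") or DEFAULT_EMBEDDING_MODEL,
        dimensions=int(env.get("EMBEDDING_DIMENSIONS") or DEFAULT_EMBEDDING_DIMENSIONS),
    )
```

Passing the mapping as a parameter keeps the function easy to exercise without touching the real environment, for example `load_embedding_config({"OPENAI_BASE_URL": "http://localhost:11434/v1"})` for a local OpenAI-compatible server.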
Testing
Usage