431 changes: 431 additions & 0 deletions code/cli-app.py

Large diffs are not rendered by default.

243 changes: 144 additions & 99 deletions code/config/config.py

Large diffs are not rendered by default.

17 changes: 10 additions & 7 deletions code/requirements.txt
Expand Up @@ -27,7 +27,7 @@ seaborn>=0.13.0
# anthropic>=0.18.1

# For OpenAI (including Azure OpenAI, Llama Azure, DeepSeek Azure):
# openai>=1.12.0
openai>=1.12.0

# For Google Gemini/Vertex AI:
# google-genai>=0.7.1
Expand All @@ -37,8 +37,8 @@ seaborn>=0.13.0
# huggingface_hub>=0.31.0

# For Azure AI Inference:
# azure-ai-inference>=1.0.0b9
# azure-core>=1.30.0
azure-ai-inference>=1.0.0b9
azure-core>=1.30.0

# Optional Retrieval Backend dependencies
# NOTE: These packages will also be installed AUTOMATICALLY at runtime when you first use a backend.
Expand All @@ -48,15 +48,18 @@ seaborn>=0.13.0
# If you prefer to pre-install specific backends, you can uncomment the lines below:

# For Azure AI Search:
# azure-core>=1.30.0
# azure-search-documents>=11.4.0
azure-core>=1.30.0
azure-search-documents>=11.4.0

# For Milvus:
# pymilvus>=1.1.0
# numpy

# For Qdrant:
# qdrant-client>=1.14.0
qdrant-client>=1.14.0

# For OpenSearch or Snowflake (both use httpx):
# httpx>=0.28.1
# httpx>=0.28.1

# For Java Interface strategy:
javalang>=0.13.0
341 changes: 341 additions & 0 deletions code/tools/verb_ingress/README.md
@@ -0,0 +1,341 @@
# Various data types => Vector Database

A modular, extensible data ingestion system for NLWeb using the Strategy Pattern + Factory Pattern. This system supports multiple data types for ingestion, each with its own processing logic, and outputs standardized documents for storage in a vector database.

## How to use

### Quick Start with Examples

Run the examples script to see the ingress system in action:

```bash
# From the code/ directory
python -m tools.verb_ingress.examples
```

### A Bit Longer Quick Start

```bash
python -m venv myenv
.\myenv\Scripts\activate
python -m pip install -r code\requirements.txt
cd code
python -m tools.verb_ingress.db_load ../demo/verb_demo/alaskaair_com.java alaskaair_com-api
python cli-app.py -q "what api should I use to book an alaska airline flight?" --num-results 10 --format json
```


### Loading Real Files with db_load.py

The `db_load.py` script provides a CLI that is compatible with the legacy loader but uses the new ingress system for supported file types (OpenAPI JSON, Java interfaces).

**Load OpenAPI JSON files:**
```bash
# Load a local OpenAPI specification
python -m tools.verb_ingress.db_load ../demo/verb_demo/github.openapi.json github-api

# Delete existing data
python -m tools.verb_ingress.db_load github-api --only-delete
```

**Load Java interface files:**
```bash
# Load a Java interface
python -m tools.verb_ingress.db_load ../demo/verb_demo/alaskaair_com.java alaskaair_com-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/amazon_com.java amazon_com-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/booking_com.java booking_com-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/costco_com.java costco_com-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/Wikimedia.java wikimedia-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/GitHub.java GitHub-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/maps_google_com.java maps_google_com-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/Nasa.java Nasa-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/News.java News-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/OpenLibrary.java OpenLibrary-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/OpenWeather.java OpenWeather-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/redfin_com.java redfin_com-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/Spotify.java Spotify-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/teams_microsoft_com.java teams_microsoft_com-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/Wikimedia.java Wikimedia-api
python -m tools.verb_ingress.db_load ../demo/verb_demo/youtube_com.java youtube_com-api

python cli-app.py -q "1. Search for hotels in Paris for the dates 2025-07-10 to 2025-07-15. Find a list of top-rated hotels. 2. Find the direction to the Eiffel Tower. 3. On Amazon, buy a travel adaptor for France. 4. Send a Microsoft Teams message to [email protected] that contains the hotel and the purchase information." --num-results 20
```

**Use specific database endpoint:**
```bash
# Load to a specific vector database
python -m tools.verb_ingress.db_load ../demo/verb_demo/github.openapi.json --database qdrant_local
```

## Architecture

The ingress system uses two key design patterns:

1. **Strategy Pattern**: Each data type (OpenAPI JSON, Java Interface, etc.) has its own strategy class that implements the `BaseIngressStrategy` interface
2. **Factory Pattern**: The `IngressFactory` automatically selects the appropriate strategy based on file extension or data content analysis

### Core Components

```
verb_ingress/
├── __init__.py # Module exports
├── base_strategy.py # Abstract base strategy interface
├── factory.py # Factory for strategy selection
├── openapi_strategy.py # OpenAPI/Swagger JSON strategy
├── java_strategy.py # Java interface strategy
└── README.md # This documentation
```

## Design Principles

### 1. Standardized Document Format

All strategies convert their input data to a standardized document format for vector database storage:

```python
{
    "id": str,           # Unique identifier (hash of URL)
    "schema_json": str,  # JSON string of processed content
    "url": str,          # Source URL or synthetic identifier
    "name": str,         # Human-readable name
    "site": str          # Site identifier for categorization
}
```
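
As a rough illustration of this shape, the sketch below assembles one document from an already-processed record. `stable_id` and `make_document` are hypothetical helpers, not part of the module. The strategy template later in this README imports `int64_hash` from `tools.db_load_utils`, whose algorithm is not shown here, so a SHA-256-based stand-in is used instead.

```python
import hashlib
import json
from typing import Any, Dict


def stable_id(url: str) -> str:
    """Hypothetical stand-in for the project's hash helper: a stable id derived from the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Keep the first 8 bytes and mask them to a non-negative 63-bit integer.
    return str(int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF)


def make_document(content: Dict[str, Any], url: str, name: str, site: str) -> Dict[str, str]:
    """Assemble a document carrying the five standardized fields."""
    return {
        "id": stable_id(url),
        "schema_json": json.dumps(content, sort_keys=True),
        "url": url,
        "name": name,
        "site": site,
    }


doc = make_document(
    {"@type": "APIReference", "name": "searchFlights"},
    url="java://com.example.Flights#searchFlights",  # synthetic identifier, made up for this example
    name="searchFlights",
    site="example-api",
)
```

Deriving the id from the URL means that reloading the same source yields the same id, which is presumably why the format specifies a hash of the URL.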

### 2. Strategy Interface

Every ingress strategy implements the `BaseIngressStrategy` interface:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class BaseIngressStrategy(ABC):
    @abstractmethod
    def validate_input(self, data: Any) -> bool:
        """Validate input data for this strategy"""

    @abstractmethod
    def process_data(self, data: Any, source_url: str, site: str) -> Tuple[List[Dict[str, Any]], List[str]]:
        """Process data and return (documents, texts_for_embedding)"""

    @abstractmethod
    def get_supported_extensions(self) -> List[str]:
        """Return supported file extensions"""

    @abstractmethod
    def get_strategy_name(self) -> str:
        """Return human-readable strategy name"""
```

### 3. Factory-Based Selection

The factory automatically selects strategies based on:
- **File extension**: `.java` → Java strategy, `.json` → OpenAPI strategy
- **Content analysis**: Validates data structure to determine compatibility
- **Explicit specification**: Direct strategy name specification
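
The three selection routes can be sketched as follows. This is a simplified illustration, not the code in `factory.py`: `select_strategy_name`, `looks_like_openapi`, the extension table, and the priority order (explicit name first, then extension, then content) are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Dict, Optional

# Hypothetical tables for illustration only.
KNOWN_STRATEGIES = {"openapi", "java"}
EXTENSION_TO_STRATEGY: Dict[str, str] = {".java": "java", ".json": "openapi"}


def looks_like_openapi(text: str) -> bool:
    """Crude content check: a JSON object with an 'openapi' or 'swagger' version key."""
    try:
        data = json.loads(text)
    except ValueError:
        return False
    return isinstance(data, dict) and ("openapi" in data or "swagger" in data)


def select_strategy_name(
    file_path: Optional[str] = None,
    content: Optional[str] = None,
    explicit: Optional[str] = None,
) -> Optional[str]:
    """Return a strategy name, or None when nothing matches."""
    # 1. An explicit name wins, provided it is a known strategy.
    if explicit is not None:
        return explicit if explicit in KNOWN_STRATEGIES else None
    # 2. Otherwise try the (case-insensitive) file extension.
    if file_path is not None:
        name = EXTENSION_TO_STRATEGY.get(Path(file_path).suffix.lower())
        if name is not None:
            return name
    # 3. Finally fall back to inspecting the content itself.
    if content is not None and looks_like_openapi(content):
        return "openapi"
    return None


print(select_strategy_name(file_path="MyInterface.java"))
print(select_strategy_name(content='{"openapi": "3.0.0", "paths": {}}'))
```

Returning `None` rather than raising mirrors the `if strategy:` checks used in the usage examples in this README.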

## Supported Data Types

### OpenAPI/Swagger JSON (`openapi_strategy.py`)

**Supported formats:**
- OpenAPI 3.x specifications
- Swagger 2.x specifications
- JSON format from URLs or files

**Output format:** Schema.org APIReference objects representing API endpoints

**Extensions:** `.json`, `.yaml`, `.yml`

**Example usage:**
```python
# Run from the code/ directory so that the tools package is importable
from tools.verb_ingress.factory import auto_select_strategy

# Auto-detect strategy from file
strategy = auto_select_strategy(file_path="api-spec.json")

# Process OpenAPI data
with open("api-spec.json") as f:
    data = f.read()

documents, texts = strategy.process_data(data, "https://api.example.com", "example-api")
```

### Java Interfaces (`java_strategy.py`)

**Supported formats:**
- Java interface files (`.java`)
- Parses methods, parameters, return types, documentation

**Output format:** Schema.org APIReference objects representing Java methods

**Extensions:** `.java`

**Example usage:**
```python
# Run from the code/ directory so that the tools package is importable
from tools.verb_ingress.factory import auto_select_strategy

# Auto-detect strategy from file
strategy = auto_select_strategy(file_path="MyInterface.java")

# Process Java interface
with open("MyInterface.java") as f:
    java_code = f.read()

documents, texts = strategy.process_data(java_code, "java://com.example.MyInterface", "example-java")
```

## Usage Examples

### Basic Usage

```python
# Run from the code/ directory so that the tools package is importable
from tools.verb_ingress.factory import auto_select_strategy

# Automatic strategy selection
strategy = auto_select_strategy(file_path="data.json")

if strategy:
    print(f"Using strategy: {strategy.get_strategy_name()}")

    # Load and process data
    with open("data.json") as f:
        data = f.read()

    documents, texts = strategy.process_data(data, "https://example.com/data.json", "example-site")
    print(f"Generated {len(documents)} documents")
else:
    print("No suitable strategy found")
```

### Explicit Strategy Selection

```python
# Run from the code/ directory so that the tools package is importable
from tools.verb_ingress.factory import create_strategy

# Create specific strategy
openapi_strategy = create_strategy("openapi")

# Process data
documents, texts = openapi_strategy.process_data(json_data, source_url, site_name)
```

### Complete Processing Pipeline

```python
import asyncio
# Run from the code/ directory so that these packages are importable
from tools.verb_ingress.factory import auto_select_strategy
from embedding.embedding import batch_get_embeddings
from retrieval.retriever import get_vector_db_client

async def process_file(file_path: str, site: str):
    """Complete pipeline: detect strategy, process data, embed, store"""

    # Auto-select strategy
    strategy = auto_select_strategy(file_path=file_path)
    if not strategy:
        print(f"No strategy found for {file_path}")
        return

    # Load and process data
    with open(file_path) as f:
        data = f.read()

    documents, texts = strategy.process_data(data, file_path, site)

    if not documents:
        print("No documents generated")
        return

    # Generate embeddings
    embeddings = await batch_get_embeddings(texts)

    # Add embeddings to documents
    for doc, embedding in zip(documents, embeddings):
        doc["embedding"] = embedding

    # Store in vector database
    client = get_vector_db_client()
    await client.upload_documents(documents)

    print(f"Processed {len(documents)} documents from {file_path}")

# Usage
asyncio.run(process_file("api-spec.json", "my-api"))
```

## Adding New Strategies

To add support for a new data type:

### 1. Create Strategy Class

Create a new file `your_strategy.py`:

```python
from typing import List, Dict, Any, Tuple
from .base_strategy import BaseIngressStrategy
from tools.db_load_utils import int64_hash

class YourStrategy(BaseIngressStrategy):
    def validate_input(self, data: Any) -> bool:
        # Implement validation logic
        return True

    def process_data(self, data: Any, source_url: str, site: str) -> Tuple[List[Dict[str, Any]], List[str]]:
        # Process data and convert to standard format
        documents = []
        texts = []

        # Your processing logic here...

        return documents, texts

    def get_supported_extensions(self) -> List[str]:
        return ['.your_extension']

    def get_strategy_name(self) -> str:
        return "Your Data Type"
```
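
To make the template concrete, here is a small filled-in example under stated assumptions. `VerbListStrategy`, its `name: description` line format, and the `.verbs` extension are invented for illustration. The base class is redefined locally so the block runs on its own, and a SHA-256 hex digest stands in for the project's `int64_hash`.

```python
import hashlib
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class BaseIngressStrategy(ABC):
    """Local stand-in mirroring the documented interface (assumption, not the real import)."""

    @abstractmethod
    def validate_input(self, data: Any) -> bool: ...

    @abstractmethod
    def process_data(self, data: Any, source_url: str, site: str) -> Tuple[List[Dict[str, Any]], List[str]]: ...

    @abstractmethod
    def get_supported_extensions(self) -> List[str]: ...

    @abstractmethod
    def get_strategy_name(self) -> str: ...


class VerbListStrategy(BaseIngressStrategy):
    """Hypothetical strategy: one 'name: description' pair per line."""

    def validate_input(self, data: Any) -> bool:
        return isinstance(data, str) and any(":" in line for line in data.splitlines())

    def process_data(self, data: Any, source_url: str, site: str) -> Tuple[List[Dict[str, Any]], List[str]]:
        documents: List[Dict[str, Any]] = []
        texts: List[str] = []
        if not self.validate_input(data):
            return documents, texts
        for line in data.splitlines():
            name, sep, description = line.partition(":")
            name, description = name.strip(), description.strip()
            if not sep or not name:
                continue  # skip lines that are not 'name: description'
            url = f"{source_url}#{name}"
            content = {"@type": "APIReference", "name": name, "description": description}
            documents.append({
                "id": hashlib.sha256(url.encode("utf-8")).hexdigest(),  # stand-in for int64_hash
                "schema_json": json.dumps(content),
                "url": url,
                "name": name,
                "site": site,
            })
            texts.append(f"{name}: {description}")  # text that would be embedded
        return documents, texts

    def get_supported_extensions(self) -> List[str]:
        return [".verbs"]

    def get_strategy_name(self) -> str:
        return "Verb List"


documents, texts = VerbListStrategy().process_data(
    "book_flight: Book a flight\nnot a verb line\ncancel: Cancel a booking",
    "file://example.verbs",
    "example-site",
)
print(len(documents), texts)
```

The two returned lists are kept index-aligned, which is what lets the pipeline example pair each document with its embedding via `zip`.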

### 2. Register in Factory

Update `factory.py`:

```python
from .your_strategy import YourStrategy

class IngressFactory:
    def __init__(self):
        self._strategies: Dict[str, Type[BaseIngressStrategy]] = {
            'openapi': OpenAPIStrategy,
            'java': JavaStrategy,
            'your_type': YourStrategy,  # Add your strategy
        }
        # ... rest of init
```

### 3. Update Module Exports

Update `__init__.py`:

```python
from .your_strategy import YourStrategy

__all__ = [
    'BaseIngressStrategy',
    'IngressFactory',
    'OpenAPIStrategy',
    'JavaStrategy',
    'YourStrategy'  # Add your strategy
]
```

## Dependencies

- **Core**: No additional dependencies for the base system
- **OpenAPI Strategy**: `aiohttp` for URL fetching
- **Java Strategy**: `javalang` for parsing Java code

Install optional dependencies:
```bash
pip install aiohttp javalang
```
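
Before running a strategy, it can help to check which optional packages are importable. The snippet below is a minimal sketch using only the standard library; `check_optional_dependencies` is a hypothetical helper, not something the module provides.

```python
import importlib.util
from typing import Dict, Iterable


def check_optional_dependencies(modules: Iterable[str] = ("aiohttp", "javalang")) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


for name, available in check_optional_dependencies().items():
    hint = "installed" if available else f"missing (pip install {name})"
    print(f"{name}: {hint}")
```

`find_spec` only locates the module without importing it, so the check stays cheap and has no side effects.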