A high-performance graph database library with Python bindings written in Rust.
- Installation
- Introduction
- Basic Usage
- Important: Node Property Mapping
- Working with Nodes
- Working with Dates
- Creating Connections
- Filtering and Querying
- Traversing the Graph
- Set Operations on Selections
- Path Finding and Graph Algorithms
- Pattern Matching
- Subgraph Extraction
- Spatial and Geometry Operations
- Schema Definition and Validation
- Index Management
- Export Formats
- Statistics and Calculations
- Saving and Loading
- Operation Reports
- Performance Tips
- Performance Model
- Lightweight Methods Reference
## Installation

pip install rusty-graph
# upgrade
pip install rusty-graph --upgrade

## Introduction

Rusty Graph is a Rust-based library for building high-performance knowledge graphs in Python. It is designed for aggregating and merging data from SQL databases, making it straightforward to turn relational data into structured knowledge graphs. By combining Rust's efficiency with Python's flexibility, Rusty Graph gives data scientists and developers a practical way to use knowledge graphs in data-driven applications.
Key features:

- Efficient Data Integration: Easily import and merge data from SQL databases to construct knowledge graphs, optimized for performance and scalability.
- High-Performance Operations: Utilize Rust's performance capabilities to handle graph operations, making Rusty Graph ideal for working with large-scale data.
- Python Compatibility: Directly integrate Rusty Graph into Python projects, allowing for a smooth workflow within Python-based data analysis and machine learning pipelines.
- Flexible Graph Manipulation: Create, modify, and query knowledge graphs with a rich set of features, catering to complex data structures and relationships.
- Graph Algorithms: Built-in shortest path, all paths, and connected components algorithms powered by petgraph.
- Pattern Matching: Cypher-like query syntax for expressive multi-hop graph traversals.
- Spatial Operations: Geographic queries including bounding box, distance (Haversine), and WKT geometry intersection.
- Export Formats: Export to GraphML, GEXF, D3 JSON, and CSV for visualization and interoperability.
## Basic Usage

import rusty_graph
import pandas as pd
# Create a new knowledge graph
graph = rusty_graph.KnowledgeGraph()
# Create some data using pandas
users_df = pd.DataFrame({
'user_id': [1001, 1002, 1003],
'name': ['Alice', 'Bob', 'Charlie'],
'age': [28, 35, 42]
})
# Add nodes to the graph
graph.add_nodes(
data=users_df,
node_type='User',
unique_id_field='user_id',
node_title_field='name'
)
# View graph schema
print(graph.get_schema())

## Important: Node Property Mapping

When you add nodes to the graph, the column names are mapped to internal property names. Understanding this mapping is critical for correct filtering and querying.
| Your DataFrame Column | Stored As | Why |
|---|---|---|
| `unique_id_field` (e.g., `user_id`) | `id` | Canonical node identifier for lookups and filtering |
| `node_title_field` (e.g., `name`) | `title` | Display/label field (renamed, not preserved) |
| All other columns | Same name | Properties are preserved as-is |

Important: The `unique_id_field` and `node_title_field` columns are renamed, not copied. The original column names (`user_id`, `name`) do not exist as properties on the stored node.
# Your DataFrame
users_df = pd.DataFrame({
'user_id': [1001, 1002],
'name': ['Alice', 'Bob'],
'age': [28, 35]
})
# Adding nodes
graph.add_nodes(
data=users_df,
node_type='User',
unique_id_field='user_id', # → stored as 'id'
node_title_field='name' # → stored as 'title'
)
# What the node actually looks like:
# {'id': 1001, 'title': 'Alice', 'type': 'User', 'age': 28}
#
# NOTE: There is NO 'user_id' or 'name' property!

# ❌ WRONG - These properties don't exist:
graph.type_filter('User').filter({'user_id': 1001}) # Returns 0 nodes!
graph.type_filter('User').filter({'name': 'Alice'}) # Returns 0 nodes!
# ✅ CORRECT - Use the mapped property names:
graph.type_filter('User').filter({'id': 1001}) # Works!
graph.type_filter('User').filter({'title': 'Alice'}) # Works!
graph.type_filter('User').filter({'age': 28})    # Works (not mapped)

Always use explain() when developing queries to verify node counts at each step:
result = graph.type_filter('User').filter({'id': 1001})
print(result.explain())
# Output: TYPE_FILTER User (1000 nodes) -> FILTER (1 nodes)
#                                                  ^^^^^^^^
#                                                  Verify this is > 0!

If you see (0 nodes) after a filter, you're likely using the wrong property name.
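If you need the original column names available as properties, one workaround is to duplicate those columns in the DataFrame before calling add_nodes(). This is a sketch using plain pandas; it works because only the unique_id_field and node_title_field columns are renamed, while every other column is stored as-is:

```python
import pandas as pd

users_df = pd.DataFrame({
    'user_id': [1001, 1002],
    'name': ['Alice', 'Bob'],
})

# Duplicate the columns you want to keep under their original names.
# 'user_id' is consumed as unique_id_field, but 'user_id_original' is
# an ordinary column and is stored as a regular property.
users_df['user_id_original'] = users_df['user_id']
users_df['name_original'] = users_df['name']

graph.add_nodes(
    data=users_df,
    node_type='User',
    unique_id_field='user_id',   # stored as 'id'
    node_title_field='name'      # stored as 'title'
)

# Resulting node:
# {'id': 1001, 'title': 'Alice', 'type': 'User',
#  'user_id_original': 1001, 'name_original': 'Alice'}
```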
## Working with Nodes

# Add products to graph
products_df = pd.DataFrame({
'product_id': [101, 102, 103],
'title': ['Laptop', 'Phone', 'Tablet'],
'price': [999.99, 699.99, 349.99],
'stock': [45, 120, 30]
})
graph.add_nodes(
data=products_df,
node_type='Product',
unique_id_field='product_id',
node_title_field='title',
# Optional: specify which columns to include
columns=['product_id', 'title', 'price', 'stock', 'category'],
# Optional: how to handle conflicts with existing nodes
conflict_handling='update' # Options: 'update', 'replace', 'skip', 'preserve'
)

# Get all products
products = graph.type_filter('Product')
# Get node information
product_nodes = products.get_nodes()
print(product_nodes)
# Get specific properties
prices = products.get_properties(['price', 'stock'])
print(prices)
# Get only titles
titles = products.get_titles()
print(titles)

## Working with Dates

Rusty Graph supports native DateTime values for date-based filtering and operations.
When adding nodes, use the column_types parameter to specify which columns should be parsed as dates:
import pandas as pd
# Create data with date columns
estimates_df = pd.DataFrame({
'estimate_id': [1, 2, 3],
'name': ['Estimate A', 'Estimate B', 'Estimate C'],
'valid_from': ['2020-01-01', '2020-06-15', '2021-01-01'],
'valid_to': ['2020-12-31', '2021-06-14', '2021-12-31'],
'value': [100.5, 250.3, 180.0]
})
# Add nodes with date columns specified
graph.add_nodes(
data=estimates_df,
node_type='Estimate',
unique_id_field='estimate_id',
node_title_field='name',
column_types={'valid_from': 'datetime', 'valid_to': 'datetime'}
)

Date fields can be filtered using comparison operators. ISO format strings (YYYY-MM-DD) work correctly for date comparisons:
# Find estimates valid after a specific date
recent_estimates = graph.type_filter('Estimate').filter({
'valid_from': {'>=': '2020-06-01'}
})
# Find estimates within a date range
active_in_2020 = graph.type_filter('Estimate').filter({
'valid_from': {'<=': '2020-12-31'},
'valid_to': {'>=': '2020-01-01'}
})

For entities with validity periods (like estimates, contracts, or versions), Rusty Graph provides convenient methods to query based on time:
# Find entities valid at a specific point in time
# Default field names: 'date_from' and 'date_to'
valid_estimates = graph.type_filter('Estimate').valid_at('2020-06-15')
# Use custom field names if your data uses different column names
active_contracts = graph.type_filter('Contract').valid_at(
'2021-03-01',
date_from_field='start_date',
date_to_field='end_date'
)
# Find entities valid during a date range (overlapping periods)
overlapping = graph.type_filter('Estimate').valid_during('2020-01-01', '2020-06-30')
# Chain with other operations
high_value_valid = (
graph.type_filter('Estimate')
.valid_at('2020-06-15')
.filter({'value': {'>=': 100.0}})
)

Note: valid_at(date) finds nodes where date_from <= date <= date_to. valid_during(start, end) finds nodes whose validity period overlaps with the given range.
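Because both comparisons are inclusive, a node still matches on the first and last day of its validity period. A small illustration using the estimates added above (their IDs and dates follow that example):

```python
# Estimate 1 is valid 2020-01-01 .. 2020-12-31 (see estimates_df above)

# Inclusive boundary: the last day of the period still matches
on_last_day = graph.type_filter('Estimate').valid_at(
    '2020-12-31',
    date_from_field='valid_from',
    date_to_field='valid_to'
)

# One day later, Estimate 1 no longer matches, but Estimate 3
# (valid from 2021-01-01) now does
next_day = graph.type_filter('Estimate').valid_at(
    '2021-01-01',
    date_from_field='valid_from',
    date_to_field='valid_to'
)
```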
Update properties on multiple nodes at once based on a selection:
# Select nodes and update them in batch
result = graph.type_filter('Prospect').filter({'status': 'Inactive'}).update({
'is_active': False,
'deactivation_reason': 'status_inactive'
})
# Access the updated graph and count
updated_graph = result['graph']
nodes_updated = result['nodes_updated']
print(f"Updated {nodes_updated} nodes")
# Use keep_selection=True to preserve the selection for chaining
result = selection.update({'processed': True}, keep_selection=True)
# Update with different value types
graph.type_filter('Node').update({
'count': 42, # Integer
'ratio': 3.14159, # Float
'active': True, # Boolean
'category': 'updated' # String
})

Note: The update() method returns a dictionary with graph (the updated KnowledgeGraph), nodes_updated (count of updated nodes), and report_index (index of the operation report). By default, the selection is cleared after update; use keep_selection=True to preserve it.
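The report_index ties the update back to the operation history described under Operation Reports. A sketch, assuming report_index is a position in the list returned by get_report_history() and that the entry carries the common report fields:

```python
result = graph.type_filter('Prospect').filter({'status': 'Inactive'}).update({
    'is_active': False
})

# Assumption: report_index indexes into the report history list
history = result['graph'].get_report_history()
report = history[result['report_index']]
print(report['operation'], report['timestamp'])
```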
Get insight into how your queries are executed with the explain() method:
# Build a query chain
result = (
graph.type_filter('Prospect')
.filter({'region': 'North'})
.traverse('HAS_ESTIMATE')
)
# See the execution plan
print(result.explain())
# Output: TYPE_FILTER Prospect (6775 nodes) -> FILTER (3200 nodes) -> TRAVERSE HAS_ESTIMATE (10954 nodes)
# Works with temporal queries too
valid_estimates = graph.type_filter('Estimate').valid_at('2020-06-15')
print(valid_estimates.explain())
# Output: TYPE_FILTER Estimate (1000 nodes) -> VALID_AT (450 nodes)

Note: The explain() method shows each operation in the query chain with the actual number of nodes at each step. This helps you understand query performance and optimize your queries.
## Creating Connections

# Purchase data
purchases_df = pd.DataFrame({
'user_id': [1001, 1001, 1002],
'product_id': [101, 103, 102],
'date': ['2023-01-15', '2023-02-10', '2023-01-20'],
'quantity': [1, 2, 1]
})
# Create connections
graph.add_connections(
data=purchases_df,
connection_type='PURCHASED',
source_type='User',
source_id_field='user_id',
target_type='Product',
target_id_field='product_id',
# Optional additional fields to include
columns=['date', 'quantity']
)
# Create connections from currently selected nodes
users = graph.type_filter('User')
products = graph.type_filter('Product')
# This would connect all users to all products with a 'VIEWED' connection
users.selection_to_new_connections(connection_type='VIEWED')

## Filtering and Querying

# Filter by exact match
expensive_products = graph.type_filter('Product').filter({'price': 999.99})
# Filter using operators
affordable_products = graph.type_filter('Product').filter({
'price': {'<': 500.0}
})
# Multiple conditions
popular_affordable = graph.type_filter('Product').filter({
'price': {'<': 500.0},
'stock': {'>': 50}
})
# In operator (note: product_id was the unique_id_field, so use 'id')
selected_products = graph.type_filter('Product').filter({
'id': {'in': [101, 103]}
})

You can filter nodes based on whether a field is null (missing) or not null:
# Find nodes where a field is null or missing
nodes_without_category = graph.type_filter('Product').filter({
'category': {'is_null': True}
})
# Find nodes where a field exists and is not null
nodes_with_category = graph.type_filter('Product').filter({
'category': {'is_not_null': True}
})
# Combine with other conditions
incomplete_products = graph.type_filter('Product').filter({
'description': {'is_null': True},
'price': {'>': 0}
})

Orphan nodes are nodes that have no connections (no incoming or outgoing edges). You can filter to include or exclude orphan nodes:
# Get only orphan nodes
orphans = graph.filter_orphans(include_orphans=True)
# Get only nodes that have at least one connection
connected = graph.filter_orphans(include_orphans=False)
# Filter orphans with sorting and limits
recent_orphans = graph.filter_orphans(
include_orphans=True,
sort_spec='created_date',
max_nodes=100
)
# Chain with other operations
product_orphans = graph.type_filter('Product').filter_orphans(include_orphans=True)

Rusty Graph offers flexible options for sorting nodes based on their properties. The sort_spec parameter can be used in various methods including type_filter(), filter(), filter_orphans(), traverse(), and the standalone sort() method.
- **Single field string:** Sorts by the specified field in ascending order.

```python
# Sort products by price (lowest to highest)
sorted_products = graph.type_filter('Product').sort('price')

# Can also be used in other methods
cheap_products = graph.type_filter('Product').filter(
    {'stock': {'>': 10}},
    sort_spec='price'
)
```

- **Field with direction:** Explicitly specify ascending or descending order.

```python
# Sort products by price (highest to lowest)
expensive_first = graph.type_filter('Product').sort('price', ascending=False)
```

- **List of tuples:** For multi-field sorting with different directions.

```python
# First sort by stock (descending), then by price (ascending).
# This prioritizes high-stock items and, for items with equal stock,
# shows the cheapest ones first.
complex_sort = graph.type_filter('Product').sort([
    ('stock', False),  # False = descending order
    ('price', True)    # True = ascending order
])
```

- **Dictionary with field and direction:** Alternative format for single-field sorting.

```python
# Sort by rating in descending order
top_rated = graph.type_filter('Product').filter(
    {},
    sort_spec={'field': 'rating', 'ascending': False}
)
```
Sort specifications work consistently across methods:
# In type_filter
latest_users = graph.type_filter('User', sort_spec='creation_date', max_nodes=10)
# In filter
new_expensive = graph.type_filter('Product').filter(
{'price': {'>': 500.0}},
sort_spec=[('creation_date', False), ('price', True)]
)
# In traversal
alice_recent_purchases = graph.type_filter('User').filter({'title': 'Alice'}).traverse(
connection_type='PURCHASED',
sort_target='date',
max_nodes=5
)
# In filter_orphans
recent_orphans = graph.filter_orphans(
include_orphans=True,
sort_spec='last_modified',
max_nodes=20
)
# In children_properties_to_list
expensive_products = graph.type_filter('User').traverse('PURCHASED').children_properties_to_list(
property='title',
sort_spec='price', # Sort children by price before creating the list
max_nodes=3,
store_as='top_expensive_purchases'
)

# Get at most 5 nodes per group
limited_products = graph.type_filter('Product').max_nodes(5)

## Traversing the Graph

# Find products purchased by a specific user
alice = graph.type_filter('User').filter({'title': 'Alice'})
alice_products = alice.traverse(
connection_type='PURCHASED',
direction='outgoing'
)
# Access the resulting products
alice_product_data = alice_products.get_nodes()
# Filter the traversal target nodes
expensive_purchases = alice.traverse(
connection_type='PURCHASED',
filter_target={'price': {'>=': 500.0}},
sort_target='price',
max_nodes=10
)
# Get connection information
connection_data = alice.get_connections(include_node_properties=True)

You can filter traversals based on properties stored on the connections themselves:
# Traverse only through connections with specific property values
high_share_blocks = graph.type_filter('Discovery').traverse(
connection_type='EXTENDS_INTO',
filter_connection={'share_pct': {'>=': 50.0}}
)
# Combine connection and target filters
result = graph.type_filter('Discovery').traverse(
connection_type='EXTENDS_INTO',
filter_connection={'year': 2021},
filter_target={'status': 'active'}
)
# Filter connections with null/not-null checks
discounted = user.traverse(
connection_type='PURCHASED',
filter_connection={'discount': {'is_not_null': True}}
)

## Set Operations on Selections

Rusty Graph supports set operations to combine, intersect, or subtract selections. These operations create new selections without modifying the originals.
Combines all nodes from both selections (logical OR):
# Select prospects from different geoprovinces
n3_prospects = graph.type_filter('Prospect').filter({'geoprovince': 'N3'})
m3_prospects = graph.type_filter('Prospect').filter({'geoprovince': 'M3'})
# Combine both selections
combined = n3_prospects.union(m3_prospects)
print(f"Total prospects: {len(combined.get_nodes())}")

Keeps only nodes present in both selections (logical AND):
# Select large discoveries and discoveries in a specific block
large_discoveries = graph.type_filter('Discovery').filter({'oil_reserves': {'>=': 100.0}})
block_34_discoveries = graph.type_filter('Block').filter({'block_id': 34}).traverse('CONTAINS', direction='incoming')
# Get large discoveries in block 34
result = large_discoveries.intersection(block_34_discoveries)

Keeps nodes in the first selection but not in the second (subtraction):
# Get all prospects
all_prospects = graph.type_filter('Prospect')
# Get prospects that have estimates
with_estimates = graph.type_filter('ProspectEstimate').traverse('BELONGS_TO', direction='incoming')
# Get prospects WITHOUT estimates
without_estimates = all_prospects.difference(with_estimates)

Keeps nodes that are in exactly one selection but not both (exclusive OR):
# Nodes in category A or B but not both
exclusive_nodes = category_a.symmetric_difference(category_b)

Set operations can be chained for complex queries:
# (A union B) intersection C
result = selection_a.union(selection_b).intersection(selection_c)
# A difference (B intersection C)
b_inter_c = selection_b.intersection(selection_c)
result = selection_a.difference(b_inter_c)

## Path Finding and Graph Algorithms

Rusty Graph provides efficient implementations of common graph algorithms powered by petgraph.
Find the shortest path between two nodes:
# Find shortest path between two nodes
path = graph.shortest_path(
source_type='Person',
source_id=1,
target_type='Person',
target_id=100
)
# Path is a list of node dictionaries
for node in path:
print(f"{node['node_type']}: {node['title']}")

Find all paths between nodes up to a maximum number of hops:
# Find all paths up to 4 hops
paths = graph.all_paths(
source_type='Play',
source_id=1,
target_type='Wellbore',
target_id=100,
max_hops=4
)
# Returns a list of paths, each path is a list of nodes
print(f"Found {len(paths)} paths")
for i, path in enumerate(paths):
print(f"Path {i+1}: {' -> '.join(n['title'] for n in path)}")

Identify connected components in the graph:
# Get all connected components
components = graph.connected_components()
# Returns a list of components, each component is a list of node IDs
print(f"Found {len(components)} connected components")
for i, component in enumerate(components):
print(f"Component {i+1}: {len(component)} nodes")

## Pattern Matching

Query the graph using Cypher-like pattern syntax for expressive multi-hop queries:
# Simple pattern: Find plays with prospects that became discoveries
results = graph.match_pattern(
'(p:Play)-[:HAS_PROSPECT]->(pr:Prospect)-[:BECAME_DISCOVERY]->(d:Discovery)'
)
# Access matched variables
for match in results:
print(f"Play: {match['p']['title']}")
print(f"Prospect: {match['pr']['title']}")
print(f"Discovery: {match['d']['title']}")
# Pattern with property conditions
results = graph.match_pattern(
'(u:User)-[:PURCHASED]->(p:Product {category: "Electronics"})'
)
# Limit results for performance on large graphs
results = graph.match_pattern(
'(a:Person)-[:KNOWS]->(b:Person)',
max_matches=100
)

Supported pattern syntax:

- Node patterns: `(variable:NodeType)` or `(variable:NodeType {property: "value"})`
- Relationship patterns: `-[:CONNECTION_TYPE]->`
- Multiple hops: chain patterns like `(a)-[:REL1]->(b)-[:REL2]->(c)`
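Putting these together, a multi-hop pattern can combine property conditions and a result limit. In this sketch the MADE_BY relationship and Manufacturer node type are illustrative, not part of the earlier examples:

```python
results = graph.match_pattern(
    '(u:User)-[:PURCHASED]->(p:Product {category: "Electronics"})-[:MADE_BY]->(m:Manufacturer)',
    max_matches=50
)

for match in results:
    print(f"{match['u']['title']} -> {match['p']['title']} -> {match['m']['title']}")
```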
## Subgraph Extraction

Extract a portion of the graph for isolated analysis or export:
# Start with a selection and expand to include neighbors
subgraph = (
graph.type_filter('Company')
.filter({'title': 'Acme Corp'})
.expand(hops=2) # Include all nodes within 2 hops
.to_subgraph() # Create independent subgraph
)
# The subgraph is a fully functional KnowledgeGraph
print(f"Subgraph has {subgraph.node_count()} nodes")
# Save the subgraph
subgraph.save('acme_network.bin')
# Export to visualization format
subgraph.export('acme_network.graphml', format='graphml')

The expand() method uses breadth-first search to include neighboring nodes:
# Expand selection by 1 hop (immediate neighbors)
expanded = graph.type_filter('Person').filter({'title': 'Alice'}).expand(hops=1)
# Expand by 3 hops for broader context
broad_context = selection.expand(hops=3)

## Spatial and Geometry Operations

Query nodes based on geographic location and geometry. Useful for GIS applications and location-based analysis.
Find nodes within a rectangular geographic area:
# Find discoveries within a bounding box
north_sea_discoveries = graph.type_filter('Discovery').within_bounds(
lat_field='latitude',
lon_field='longitude',
min_lat=58.0,
max_lat=62.0,
min_lon=1.0,
max_lon=5.0
)

Find nodes within a radius of a point using great-circle distance:
# Find wellbores within 50km of a location
nearby_wellbores = graph.type_filter('Wellbore').near_point_km(
center_lat=60.5,
center_lon=3.2,
max_distance_km=50.0,
lat_field='latitude',
lon_field='longitude'
)

Find nodes whose geometry intersects with a WKT polygon:
# Define a polygon in WKT format
search_area = 'POLYGON((1 58, 5 58, 5 62, 1 62, 1 58))'
# Find fields that intersect the polygon
fields_in_area = graph.type_filter('Field').intersects(
geometry_field='wkt_geometry',
wkt=search_area
)

Find nodes within a radius of a reference point, using the centroid of WKT geometries for distance calculation. This eliminates the need for external dependencies like shapely:
# Find fields within 100km of a location, using polygon centroids
nearby_fields = graph.type_filter('Field').near_point_km_from_wkt(
center_lat=60.5,
center_lon=3.2,
max_distance_km=100.0,
geometry_field='wkt_geometry' # Field containing WKT polygons
)
# The centroid is automatically extracted from WKT geometries
# Supports: POLYGON, MULTIPOLYGON, LINESTRING, MULTILINESTRING, POINT, MULTIPOINT

Find nodes whose WKT geometry contains a specific point:
# Find which blocks contain a specific coordinate
containing_blocks = graph.type_filter('Block').contains_point(
lat=60.5,
lon=3.2,
geometry_field='wkt_geometry'
)
# This is useful for determining which regions/areas a point falls within

## Schema Definition and Validation

Define expected structure and validate your graph data:
# Define schema for node types and connections
graph.define_schema({
'nodes': {
'Prospect': {
'required': ['npdid_prospect', 'prospect_name'],
'optional': ['prospect_status', 'prospect_geoprovince'],
'types': {
'npdid_prospect': 'integer',
'prospect_name': 'string',
'prospect_ns_dec': 'float'
}
},
'ProspectEstimate': {
'required': ['estimate_id'],
'types': {
'estimate_id': 'integer',
'value': 'float'
}
}
},
'connections': {
'HAS_ESTIMATE': {
'source': 'Prospect',
'target': 'ProspectEstimate'
}
}
})

# Validate the graph against the defined schema
errors = graph.validate_schema()
if errors:
print("Validation errors found:")
for error in errors:
print(f" - {error}")
else:
print("Graph validates successfully!")

# View the current schema (auto-generated from data)
schema = graph.get_schema()
print(schema)

Note: get_schema() returns a formatted string representation of the schema for display purposes, not a Python dictionary.
## Index Management

Create indexes for faster filtering on frequently queried properties:
# Create an index on a property
graph.create_index('Prospect', 'prospect_geoprovince')
# Indexed properties get O(1) lookup for equality filters
# This query will be much faster with an index:
north_prospects = graph.type_filter('Prospect').filter({
'prospect_geoprovince': 'North Sea'
})

# List all indexes
indexes = graph.list_indexes()
for idx in indexes:
print(f"Index on {idx['node_type']}.{idx['property']}")
# Drop an index
graph.drop_index('Prospect', 'prospect_geoprovince')

Performance Note: Benchmarks show ~3.3x speedup for equality filters on indexed properties. Create indexes on properties you frequently filter by exact value.
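You can verify the effect on your own data by timing the same equality filter before and after creating the index. A minimal sketch using only methods shown above (absolute numbers will vary):

```python
import time

def time_filter():
    start = time.perf_counter()
    graph.type_filter('Prospect').filter({'prospect_geoprovince': 'North Sea'}).node_count()
    return time.perf_counter() - start

before = time_filter()                                   # full scan
graph.create_index('Prospect', 'prospect_geoprovince')   # build the index
after = time_filter()                                    # indexed equality lookup

print(f"without index: {before*1000:.2f} ms, with index: {after*1000:.2f} ms")
```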
## Export Formats

Export your graph to various formats for visualization and interoperability:
# GraphML format (compatible with Gephi, yEd, etc.)
graph.export('my_graph.graphml', format='graphml')
# GEXF format (Gephi native format)
graph.export('my_graph.gexf', format='gexf')
# D3.js JSON format (for web visualization)
graph.export('my_graph.json', format='d3')
# CSV format (nodes and edges as CSV)
graph.export('my_graph.csv', format='csv')

Get export data as a string for programmatic use:
# Get GraphML as string
graphml_string = graph.export_string(format='graphml')
# Get D3 JSON as string
d3_json = graph.export_string(format='d3')
# Export only current selection
selected_json = graph.type_filter('Person').export_string(
format='d3',
selection_only=True
)

Combine with subgraph extraction for partial exports:
# Export just a portion of the graph
subgraph = (
graph.type_filter('Company')
.filter({'region': 'Europe'})
.expand(hops=2)
.to_subgraph()
)
subgraph.export('europe_companies.graphml', format='graphml')

## Statistics and Calculations

# Get statistics for a property
price_stats = graph.type_filter('Product').statistics('price')
print(price_stats)
# Calculate unique values
unique_categories = graph.type_filter('Product').unique_values(
property='category',
# Store result in node property
store_as='category_list',
max_length=10
)
# Convert children properties to a comma-separated list in parent nodes
# Option 1: Store results in parent nodes
users_with_products = graph.type_filter('User').traverse('PURCHASED').children_properties_to_list(
property='title', # Default is 'title' if not specified
filter={'price': {'<': 500.0}}, # Optional filtering of children
sort_spec='price', # Optional sorting of children
max_nodes=5, # Optional limit of children per parent
store_as='purchased_products', # Property name to store the list in parent
max_length=100, # Optional maximum string length (adds "..." if truncated)
keep_selection=False # Whether to keep the current selection
)
# Option 2: Get results as a dictionary without storing them
product_names = graph.type_filter('User').traverse('PURCHASED').children_properties_to_list(
property='title',
sort_spec='price',
max_nodes=5
)
print(product_names)  # Returns {'User1': 'Product1, Product2', 'User2': 'Product3, Product4, Product5'}

# Simple calculation: tax-inclusive price
with_tax = graph.type_filter('Product').calculate(
expression='price * 1.1',
store_as='price_with_tax'
)
# Aggregate calculations per group
user_spending = graph.type_filter('User').traverse('PURCHASED').calculate(
expression='sum(price * quantity)',
store_as='total_spent'
)
# Count operations
products_per_user = graph.type_filter('User').traverse('PURCHASED').count(
store_as='product_count',
group_by_parent=True
)

Aggregate properties stored on connections (edges) rather than nodes. This is useful when you have data like ownership percentages, weights, or quantities stored on the connections themselves.
# Sum connection properties
# For each Discovery, sum the share_pct on its EXTENDS_INTO connections
total_shares = graph.type_filter('Discovery').traverse('EXTENDS_INTO').calculate(
expression='sum(share_pct)',
aggregate_connections=True # Key parameter for connection aggregation
)
print(total_shares) # Returns {'Discovery A': 100.0, 'Discovery B': 100.0}
# Average connection properties
avg_ownership = graph.type_filter('Company').traverse('OWNS').calculate(
expression='avg(ownership_pct)',
aggregate_connections=True
)
# Count connections
connection_count = graph.type_filter('Parent').traverse('HAS_CHILD').calculate(
expression='count(any_property)', # Use any property that exists on connections
aggregate_connections=True
)
# Store aggregated results on parent nodes
updated_graph = graph.type_filter('Prospect').traverse('HAS_ESTIMATE').calculate(
expression='sum(weight)',
aggregate_connections=True,
store_as='total_weight' # Stores result on parent Prospect nodes
)

Supported aggregate functions for connections:

- `sum(property)`: Sum of property values
- `avg(property)` / `mean(property)`: Average of property values
- `min(property)`: Minimum value
- `max(property)`: Maximum value
- `count(property)`: Count of connections (with non-null property values)
- `std(property)`: Standard deviation
Note: Connection aggregation requires a traversal before calculate(). The results are grouped by the parent (source) node of the traversal.
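The remaining aggregate functions follow the same pattern. A short sketch reusing the OWNS traversal from above:

```python
# Smallest and largest ownership stakes per company
min_stake = graph.type_filter('Company').traverse('OWNS').calculate(
    expression='min(ownership_pct)',
    aggregate_connections=True
)
max_stake = graph.type_filter('Company').traverse('OWNS').calculate(
    expression='max(ownership_pct)',
    aggregate_connections=True
)

# Spread of ownership stakes, stored back on the parent Company nodes
graph.type_filter('Company').traverse('OWNS').calculate(
    expression='std(ownership_pct)',
    aggregate_connections=True,
    store_as='ownership_std'
)
```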
## Saving and Loading

# Save graph to file
graph.save("my_graph.bin")
# Load graph from file
loaded_graph = rusty_graph.load("my_graph.bin")
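A quick round-trip sanity check, using node_count() (shown earlier on subgraphs):

```python
# The loaded graph should contain the same number of nodes
assert loaded_graph.node_count() == graph.node_count()
```

## Operation Reports

Rusty Graph provides detailed reports for operations that modify the graph, helping you track what changed and diagnose issues.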
# Add nodes and get the report
report = graph.add_nodes(
data=df,
node_type='Product',
unique_id_field='product_id'
)
print(f"Created {report['nodes_created']} nodes in {report['processing_time_ms']}ms")
# Check for errors
if report['has_errors']:
print(f"Errors: {report['errors']}")

Node operation reports include:
- `operation`: Type of operation performed
- `timestamp`: When the operation occurred
- `nodes_created`: Number of new nodes created
- `nodes_updated`: Number of existing nodes updated
- `nodes_skipped`: Number of nodes skipped (e.g., due to conflicts)
- `processing_time_ms`: Time taken in milliseconds
- `has_errors`: Boolean indicating if errors occurred
- `errors`: List of error messages (if any)
Connection operation reports include:
- `connections_created`: Number of new connections created
- `connections_skipped`: Number of connections skipped
- `property_fields_tracked`: Number of property fields on connections
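A sketch reading these fields, assuming add_connections() returns a report dictionary in the same way add_nodes() does:

```python
report = graph.add_connections(
    data=purchases_df,
    connection_type='PURCHASED',
    source_type='User',
    source_id_field='user_id',
    target_type='Product',
    target_id_field='product_id'
)

print(f"created: {report['connections_created']}, "
      f"skipped: {report['connections_skipped']}, "
      f"properties tracked: {report['property_fields_tracked']}")
```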
# Get the most recent operation report
last_report = graph.get_last_report()
# Get the operation index (sequential counter)
op_index = graph.get_operation_index()
# Get full operation history
history = graph.get_report_history()
for report in history:
print(f"{report['operation']}: {report['timestamp']}")

## Performance Tips
- **Batch Operations:** Add nodes and connections in batches rather than individually.
- **Specify Columns:** When adding nodes or connections, explicitly specify which columns to include to reduce memory usage.
- **Use Indexing:** Filter on node type first before applying other filters.
- **Avoid Overloading:** Keep node property count reasonable; too many properties per node will increase memory usage.
- **Conflict Handling:** Choose the appropriate conflict handling strategy (see the sketch after this list):
  - Use `'update'` to merge new properties with existing ones
  - Use `'replace'` for a complete overwrite
  - Use `'skip'` to avoid any changes to existing nodes
  - Use `'preserve'` to only add missing properties
- **Connection Direction:** Specify direction in traversals when possible to improve performance.
- **Limit Results:** Use `max_nodes()` to limit result size when working with large datasets.
- **Create Indexes:** Use `create_index()` on frequently filtered properties for ~3.3x speedup on equality filters.
- **Use Pattern Matching Limits:** When using `match_pattern()`, set `max_matches` to avoid scanning the entire graph.
- **Use Lightweight Methods:** For counting or index-only operations, use `node_count()` or `indices()` instead of `get_nodes()`; they're 50-1000x faster because they skip property materialization.
- **Use get_node_by_id():** For single-node lookups by ID, use `get_node_by_id('NodeType', id)` instead of `type_filter().filter()`; it's O(1) after the first call.
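A sketch of the conflict-handling strategies above, re-adding a row that overlaps an existing node (the property values are illustrative):

```python
import pandas as pd

# Suppose node 101 already exists with {'price': 999.99, 'stock': 45}
updates_df = pd.DataFrame({
    'product_id': [101],
    'title': ['Laptop Pro'],
    'price': [1099.99]
})

# 'update': merges properties -> price becomes 1099.99, stock stays 45
graph.add_nodes(
    data=updates_df,
    node_type='Product',
    unique_id_field='product_id',
    node_title_field='title',
    conflict_handling='update'
)

# 'skip':     node 101 is left completely untouched
# 'replace':  node 101 keeps only the new properties (stock is dropped)
# 'preserve': only properties missing on the node are added
```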
## Performance Model

Rusty Graph is optimized for knowledge graph workloads: complex multi-step queries on heterogeneous, property-rich graphs. It is NOT optimized for micro-benchmarks of raw graph algorithms.
It is designed for:

- Complex multi-hop traversals with filtering at each step
- Property-rich nodes with many attributes
- Schema-aware queries with validation
- Aggregations and calculations across traversals
- Combining data from multiple sources
Rusty Graph operations have overhead compared to raw graph algorithms because they provide additional features:
- Building and tracking selections at each step (enables explain() and chaining)
- Schema validation and type checking
- Supporting the full query API (undo, reports, etc.)
When benchmarking, distinguish between two modes:
| Mode | Methods | What's Measured |
|---|---|---|
| Engine-only | `node_count()`, `indices()`, `get_node_by_id()`, `explain()` | Pure graph traversal speed |
| End-to-end | `get_nodes()`, `get_properties()` | Includes Python dict materialization |
For fair performance comparisons, use engine-only methods. Use end-to-end methods when measuring the full Python-facing workload including property access.
# SLOW: Materializes all properties for every node
nodes = graph.type_filter('User').get_nodes()
count = len(nodes)
# FAST: Just counts without materialization (1000x faster)
count = graph.type_filter('User').node_count()
# SLOW: Scans all nodes of type
user = graph.type_filter('User').filter({'id': 12345}).get_nodes()[0]
# FAST: O(1) hash lookup after first call
user = graph.get_node_by_id('User', 12345)

## Lightweight Methods Reference

For performance-critical code, use these methods that skip full property materialization:
| Method | Returns | Use When |
|---|---|---|
| `node_count()` | Integer count | You only need to know how many nodes matched |
| `indices()` | List of node indices | You need indices for internal processing |
| `id_values()` | List of ID values | You need IDs for external lookups or joins |
| `get_ids()` | List of `{id, title, type}` dicts | You need basic node identity without properties |
| `get_nodes()` | List of full node dicts | You need all node properties |
# Examples
count = graph.type_filter('User').node_count() # 5000
indices = graph.type_filter('User').indices() # [0, 1, 2, ...]
ids = graph.type_filter('User').id_values() # [1001, 1002, 1003, ...]
id_info = graph.type_filter('User').get_ids() # [{'id': 1001, 'title': 'Alice', 'type': 'User'}, ...]
nodes = graph.type_filter('User').get_nodes()   # [{'id': 1001, 'title': 'Alice', 'type': 'User', 'age': 28, ...}, ...]

The shortest-path API offers the same tiers:

| Method | Returns | Use When |
|---|---|---|
| `shortest_path_length()` | Integer hop count | You only need the distance |
| `shortest_path_indices()` | List of node indices | You need the path for internal processing |
| `shortest_path_ids()` | List of (type, id) tuples | You need node identities along the path |
| `shortest_path()` | List of full node dicts | You need all properties for nodes in path |
# Examples
length = graph.shortest_path_length('User', 1, 'User', 100) # 3
indices = graph.shortest_path_indices('User', 1, 'User', 100) # [0, 42, 87, 156]
ids = graph.shortest_path_ids('User', 1, 'User', 100) # [('User', 1), ('Company', 5), ('User', 100)]
path = graph.shortest_path('User', 1, 'User', 100)          # [{'id': 1, ...}, {'id': 5, ...}, {'id': 100, ...}]

# O(1) lookup by type and ID (much faster than type_filter + filter)
user = graph.get_node_by_id('User', 12345)