
Resolved issue with upsert of PG entity and relation #1127

Conversation

OxidBurn
Contributor

Description

This PR complements PR #1120 "fix: correct chunk_ids as array type and remove incorrect filepath type conversion" with a few additional improvements.

Related Issues

Addresses the "entity and relation upsert error" as well as missing "metadata" key errors encountered during the insertion of entities and relationships into the PostgreSQL database.

Changes Made

  • Fixed SQL template for entity upsert by adding CURRENT_TIMESTAMP for update_time column
  • Added conditional check for metadata and file_path keys in graph node extraction
  • Fixed parameter type casting for content_vector in relationship insertion query
  • Ensured proper alignment between SQL parameter placeholders and data dictionary values
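
For illustration, a minimal sketch of what the fixed upsert template could look like; the table and column names here are simplified assumptions, not the project's exact schema:

```sql
-- Sketch only: simplified entity upsert with explicit casts and an update_time refresh
INSERT INTO entity (id, workspace, content, content_vector, chunk_ids, update_time)
VALUES ($1, $2, $3, $4::vector, $5::varchar[], CURRENT_TIMESTAMP)
ON CONFLICT (id) DO UPDATE SET
    content        = EXCLUDED.content,
    content_vector = EXCLUDED.content_vector,
    chunk_ids      = EXCLUDED.chunk_ids,
    update_time    = CURRENT_TIMESTAMP;
```

The data dictionary passed to the driver then has to supply exactly one value per placeholder ($1 through $5), in the same order as the column list.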

Checklist

  • Changes tested locally
  • Code reviewed
  • Documentation updated (if necessary)
  • Unit tests added (if applicable)

Additional Notes

The fix ensures data integrity during entity and relationship extraction operations by resolving SQL syntax errors that were preventing proper document processing.

```diff
@@ -1570,7 +1571,7 @@ def namespace_to_table_name(namespace: str) -> str:
     content_vector VECTOR,
     create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
     update_time TIMESTAMP,
-    chunk_id TEXT NULL,
+    chunk_ids VARCHAR(255)[] NULL,
```
Collaborator

This fix introduces another issue: when there are too many associated chunks, the chunk_id exceeds the limit. Do you have any good solutions for this?


This is a VARCHAR(255), so each element can hold up to 255 characters. The code uses an md5 hash, which is only 32 characters wide. It is unlikely that there will ever be such a large ID.

Contributor

In PostgreSQL, using a join table is an efficient way to store large amounts of related data. By creating an intermediary table with foreign keys, you can establish one-to-many or many-to-many relationships between the main table and associated data, overcoming the 1 GB array limit. Join tables provide better query performance and data management capabilities, with flexible indexing options that optimize search, insert, and update operations. This approach is particularly suitable for scenarios where you need to store large quantities of data like "chunk-ecc952703fbaabc97fee8f328ae8de56"
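
As a rough sketch of that idea (purely illustrative table and column names, not something this PR implements):

```sql
-- Sketch only: a join table relating entities to chunks instead of a chunk_ids array column
CREATE TABLE entity_chunk (
    entity_id VARCHAR(255) NOT NULL,
    chunk_id  VARCHAR(255) NOT NULL,
    PRIMARY KEY (entity_id, chunk_id)
);

-- Looking up the chunks for one entity then becomes an indexed query
SELECT chunk_id FROM entity_chunk WHERE entity_id = $1;
```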

Collaborator

@LarFii LarFii Mar 20, 2025

> This is a VARCHAR(255), so each element can hold up to 255 characters. The code uses an md5 hash, which is only 32 characters wide. It is unlikely that there will ever be such a large ID.

It was mentioned in issue #1088.

Collaborator

> This is a VARCHAR(255), so each element can hold up to 255 characters. The code uses an md5 hash, which is only 32 characters wide. It is unlikely that there will ever be such a large ID.

I understand the difference now. I'm not very familiar with PostgreSQL—sorry about that!

Collaborator

Previously, the field in the PostgreSQL table was chunk_id. Modifying it this way will make the old data unavailable. Is there a way to add a data migration to transfer the old data to the new table?
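
One hedged possibility for such a migration, assuming the old chunk_id values were concatenated with a known separator (written here as '<SEP>' purely for illustration) and with entity standing in for the actual table name:

```sql
-- Sketch only: move concatenated chunk_id text into a new chunk_ids array column
ALTER TABLE entity ADD COLUMN chunk_ids VARCHAR(255)[] NULL;

UPDATE entity
SET chunk_ids = string_to_array(chunk_id, '<SEP>')::varchar(255)[]
WHERE chunk_id IS NOT NULL;

-- Drop the old column only after the migrated data has been verified
-- ALTER TABLE entity DROP COLUMN chunk_id;
```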

```python
if "metadata" in already_node and "file_path" in already_node["metadata"]:
    already_file_paths.extend(
        split_string_by_multi_markers(
            already_node["metadata"]["file_path"], [GRAPH_FIELD_SEP]
        )
    )
```
Collaborator

In PR #1138, I removed the metadata-related code (as it had no actual effect).

Contributor Author

Perfect, thank you for that update.

@LarFii
Collaborator

LarFii commented Mar 20, 2025

To avoid changing the table structure, could we consider modifying chunk_ids in the upsert to correspond to chunk_id? Since chunk_id is currently concatenated directly, the TEXT type should be sufficient.

@OxidBurn
Contributor Author

> To avoid changing the table structure, could we consider modifying chunk_ids in the upsert to correspond to chunk_id? Since chunk_id is currently concatenated directly, the TEXT type should be sufficient.

I reviewed the Postgres solution again and here are my thoughts:

  • The current code (main branch) already uses chunk_ids as an array in both insertion logic and query logic
  • The SQL templates specifically cast the values to varchar[] with $6::varchar[] syntax
  • The JOIN conditions in queries use the PostgreSQL ANY operator: c.chunk_id = ANY(r.chunk_ids)
  • The data processing in _upsert_entities and _upsert_relationships splits values into arrays

Schema vs. implementation mismatch:
- If the schema has chunk_id TEXT NULL but the code is working with arrays, there's a fundamental mismatch.

Using chunk_ids VARCHAR(255)[] NULL is correct and aligns the schema with the actual code usage:
- It makes the database reflect what the code is actually doing.
- There should be no migration issues as long as:
  1. Existing data in the database is already stored in array format.
  2. The name used in code consistently matches the name used in the schema.

Conclusions:
- The change from chunk_id TEXT NULL to chunk_ids VARCHAR(255)[] NULL actually fixes a schema-code mismatch.
- Without this change, the code would be treating a text field as an array, which is incorrect.
- This is not a breaking change if the code was already working with arrays (as it appears to be).
- Performance will be better with proper array types than with implicit casting.
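
To make the points above concrete, a minimal sketch of the array-based usage being described (assumed table names and aliases, not the exact queries from the codebase):

```sql
-- Sketch only: parameter cast to varchar[] on insert, matching the $6::varchar[] syntax noted above
INSERT INTO relation (id, source_id, target_id, content, content_vector, chunk_ids)
VALUES ($1, $2, $3, $4, $5::vector, $6::varchar[]);

-- Sketch only: joining chunk rows against the array column with the ANY operator
SELECT c.content
FROM chunks c
JOIN relation r ON c.chunk_id = ANY(r.chunk_ids);
```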

@Howe829
Contributor

Howe829 commented Mar 20, 2025

> To avoid changing the table structure, could we consider modifying chunk_ids in the upsert to correspond to chunk_id? Since chunk_id is currently concatenated directly, the TEXT type should be sufficient.

> I reviewed the Postgres solution again and here are my thoughts:
>
>   • The current code (main branch) already uses chunk_ids as an array in both insertion logic and query logic
>   • The SQL templates specifically cast the values to varchar[] with $6::varchar[] syntax
>   • The JOIN conditions in queries use the PostgreSQL ANY operator: c.chunk_id = ANY(r.chunk_ids)
>   • The data processing in _upsert_entities and _upsert_relationships splits values into arrays
>
> Schema vs. implementation mismatch: if the schema has chunk_id TEXT NULL but the code is working with arrays, there's a fundamental mismatch.
>
> Using chunk_ids VARCHAR(255)[] NULL is correct and aligns the schema with the actual code usage: it makes the database reflect what the code is actually doing, and there should be no migration issues as long as (1) existing data in the database is already stored in array format and (2) the name used in code consistently matches the name used in the schema.
>
> Conclusions: the change from chunk_id TEXT NULL to chunk_ids VARCHAR(255)[] NULL actually fixes a schema-code mismatch. Without this change, the code would be treating a text field as an array, which is incorrect. This is not a breaking change if the code was already working with arrays (as it appears to be), and performance will be better with proper array types than with implicit casting.

Hey, I was just wondering—what if we use a join table to store the relations between entities and chunks instead of a chunk_ids column?

@CoachbotAI

> To avoid changing the table structure, could we consider modifying chunk_ids in the upsert to correspond to chunk_id? Since chunk_id is currently concatenated directly, the TEXT type should be sufficient.

> I reviewed the Postgres solution again and here are my thoughts:
>
>   • The current code (main branch) already uses chunk_ids as an array in both insertion logic and query logic
>   • The SQL templates specifically cast the values to varchar[] with $6::varchar[] syntax
>   • The JOIN conditions in queries use the PostgreSQL ANY operator: c.chunk_id = ANY(r.chunk_ids)
>   • The data processing in _upsert_entities and _upsert_relationships splits values into arrays
>
> Schema vs. implementation mismatch: if the schema has chunk_id TEXT NULL but the code is working with arrays, there's a fundamental mismatch.
>
> Using chunk_ids VARCHAR(255)[] NULL is correct and aligns the schema with the actual code usage: it makes the database reflect what the code is actually doing, and there should be no migration issues as long as (1) existing data in the database is already stored in array format and (2) the name used in code consistently matches the name used in the schema.
>
> Conclusions: the change from chunk_id TEXT NULL to chunk_ids VARCHAR(255)[] NULL actually fixes a schema-code mismatch. Without this change, the code would be treating a text field as an array, which is incorrect. This is not a breaking change if the code was already working with arrays (as it appears to be), and performance will be better with proper array types than with implicit casting.

> Hey, I was just wondering—what if we use a join table to store the relations between entities and chunks instead of a chunk_ids column?

I like your suggestion.
It would give us flexibility and potential performance boosts, since we would no longer need to keep this data in the same table.

@LarFii
Collaborator

LarFii commented Mar 20, 2025

I have merged PR #1138. I think file_path can also be converted to an array, since an entity may be associated with multiple file_paths.
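
A hedged sketch of what that could look like, again with entity as an illustrative table name and '<SEP>' as an assumed separator:

```sql
-- Sketch only: convert a concatenated file_path column into an array, mirroring chunk_ids
ALTER TABLE entity
    ALTER COLUMN file_path TYPE VARCHAR(255)[]
    USING string_to_array(file_path, '<SEP>')::varchar(255)[];
```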

@OxidBurn
Contributor Author

The PR is no longer relevant since we now have both fixes. I can release it later after refactoring with separators.

@OxidBurn OxidBurn closed this Mar 20, 2025