
Resolved issue with upsert of PG entity and relation #1127

Conversation

OxidBurn
Contributor

Description

This PR complements PR #1120 "fix: correct chunk_ids as array type and remove incorrect filepath type conversion" with a few additional improvements.

Related Issues

Addresses the "entity and relation upsert error" as well as missing "metadata" key errors encountered during the insertion of entities and relationships into the PostgreSQL database.

Changes Made

  • Fixed SQL template for entity upsert by adding CURRENT_TIMESTAMP for update_time column
  • Added conditional check for metadata and file_path keys in graph node extraction
  • Fixed parameter type casting for content_vector in relationship insertion query
  • Ensured proper alignment between SQL parameter placeholders and data dictionary values
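
For illustration, a minimal sketch of what the fixed upsert template could look like; the table and column names here are simplified assumptions, not the project's exact schema:

```sql
-- Sketch only: simplified entity upsert with explicit casts and an update_time refresh
INSERT INTO entity (id, workspace, content, content_vector, chunk_ids, update_time)
VALUES ($1, $2, $3, $4::vector, $5::varchar[], CURRENT_TIMESTAMP)
ON CONFLICT (id) DO UPDATE SET
    content        = EXCLUDED.content,
    content_vector = EXCLUDED.content_vector,
    chunk_ids      = EXCLUDED.chunk_ids,
    update_time    = CURRENT_TIMESTAMP;
```

The data dictionary passed to the driver then has to supply exactly one value per placeholder ($1 through $5), in the same order as the column list.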

Checklist

  • Changes tested locally
  • Code reviewed
  • Documentation updated (if necessary)
  • Unit tests added (if applicable)

Additional Notes

The fix ensures data integrity during entity and relationship extraction operations by resolving SQL syntax errors that were preventing proper document processing.

```diff
@@ -1570,7 +1571,7 @@ def namespace_to_table_name(namespace: str) -> str:
     content_vector VECTOR,
     create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
     update_time TIMESTAMP,
-    chunk_id TEXT NULL,
+    chunk_ids VARCHAR(255)[] NULL,
```
Collaborator

This fix introduces another issue: when there are too many associated chunks, the chunk_id exceeds the limit. Do you have any good solutions for this?


This is a VARCHAR(255), so each element can hold up to 255 characters. The code uses an md5 hash, which is only 32 characters wide. It is unlikely that there will ever be such a large ID.

Contributor

In PostgreSQL, using a join table is an efficient way to store large amounts of related data. By creating an intermediary table with foreign keys, you can establish one-to-many or many-to-many relationships between the main table and associated data, overcoming the 1 GB array limit. Join tables provide better query performance and data management capabilities, with flexible indexing options that optimize search, insert, and update operations. This approach is particularly suitable for scenarios where you need to store large quantities of data like "chunk-ecc952703fbaabc97fee8f328ae8de56"
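
As a rough sketch of that idea (purely illustrative table and column names, not something this PR implements):

```sql
-- Sketch only: a join table relating entities to chunks instead of a chunk_ids array column
CREATE TABLE entity_chunk (
    entity_id VARCHAR(255) NOT NULL,
    chunk_id  VARCHAR(255) NOT NULL,
    PRIMARY KEY (entity_id, chunk_id)
);

-- Looking up the chunks for one entity then becomes an indexed query
SELECT chunk_id FROM entity_chunk WHERE entity_id = $1;
```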

Collaborator

@LarFii LarFii Mar 20, 2025

> This is a VARCHAR(255), so each element can hold up to 255 characters. The code uses an md5 hash, which is only 32 characters wide. It is unlikely that there will ever be such a large ID.

It was mentioned in issue #1088.

Collaborator

> This is a VARCHAR(255), so each element can hold up to 255 characters. The code uses an md5 hash, which is only 32 characters wide. It is unlikely that there will ever be such a large ID.

I understand the difference now. I'm not very familiar with PostgreSQL—sorry about that!

Collaborator

Previously, the field in the PostgreSQL table was chunk_id. Modifying it this way will make the old data unavailable. Is there a way to add a data migration to transfer the old data to the new table?
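
One hedged possibility for such a migration, assuming the old chunk_id values were concatenated with a known separator (written here as '<SEP>' purely for illustration) and with entity standing in for the actual table name:

```sql
-- Sketch only: move concatenated chunk_id text into a new chunk_ids array column
ALTER TABLE entity ADD COLUMN chunk_ids VARCHAR(255)[] NULL;

UPDATE entity
SET chunk_ids = string_to_array(chunk_id, '<SEP>')::varchar(255)[]
WHERE chunk_id IS NOT NULL;

-- Drop the old column only after the migrated data has been verified
-- ALTER TABLE entity DROP COLUMN chunk_id;
```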

```python
if "metadata" in already_node and "file_path" in already_node["metadata"]:
    already_file_paths.extend(
        split_string_by_multi_markers(
            already_node["metadata"]["file_path"], [GRAPH_FIELD_SEP]
        )
    )
```
Collaborator

In PR #1138, I removed the metadata-related code (as it had no actual effect).

Contributor Author

Perfect, thank you for that update.

@LarFii
Collaborator

LarFii commented Mar 20, 2025

To avoid changing the table structure, could we consider modifying chunk_ids in the upsert to correspond to chunk_id? Since chunk_id is currently concatenated directly, the TEXT type should be sufficient.

@OxidBurn
Contributor Author

> To avoid changing the table structure, could we consider modifying chunk_ids in the upsert to correspond to chunk_id? Since chunk_id is currently concatenated directly, the TEXT type should be sufficient.

I reviewed the Postgres solution again and here are my thoughts:

  • The current code (main branch) already uses chunk_ids as an array in both insertion logic and query logic
  • The SQL templates specifically cast the values to varchar[] with $6::varchar[] syntax
  • The JOIN conditions in queries use the PostgreSQL ANY operator: c.chunk_id = ANY(r.chunk_ids)
  • The data processing in _upsert_entities and _upsert_relationships splits values into arrays

Schema vs. implementation mismatch:
- If the schema has chunk_id TEXT NULL but the code is working with arrays, there's a fundamental mismatch.

Using chunk_ids VARCHAR(255)[] NULL is correct and aligns the schema with the actual code usage:
- It makes the database reflect what the code is actually doing.
- There should be no migration issues as long as:
  1. Existing data in the database is already stored in array format.
  2. The name used in code consistently matches the name used in the schema.

Conclusions:
- The change from chunk_id TEXT NULL to chunk_ids VARCHAR(255)[] NULL actually fixes a schema-code mismatch.
- Without this change, the code would be treating a text field as an array, which is incorrect.
- This is not a breaking change if the code was already working with arrays (as it appears to be).
- Performance will be better with proper array types than with implicit casting.
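
To make the points above concrete, a minimal sketch of the array-based usage being described (assumed table names and aliases, not the exact queries from the codebase):

```sql
-- Sketch only: parameter cast to varchar[] on insert, matching the $6::varchar[] syntax noted above
INSERT INTO relation (id, source_id, target_id, content, content_vector, chunk_ids)
VALUES ($1, $2, $3, $4, $5::vector, $6::varchar[]);

-- Sketch only: joining chunk rows against the array column with the ANY operator
SELECT c.content
FROM chunks c
JOIN relation r ON c.chunk_id = ANY(r.chunk_ids);
```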

@Howe829
Contributor

Howe829 commented Mar 20, 2025

> To avoid changing the table structure, could we consider modifying chunk_ids in the upsert to correspond to chunk_id? Since chunk_id is currently concatenated directly, the TEXT type should be sufficient.

> I reviewed the Postgres solution again and here are my thoughts:
>
>   • The current code (main branch) already uses chunk_ids as an array in both insertion logic and query logic
>   • The SQL templates specifically cast the values to varchar[] with $6::varchar[] syntax
>   • The JOIN conditions in queries use the PostgreSQL ANY operator: c.chunk_id = ANY(r.chunk_ids)
>   • The data processing in _upsert_entities and _upsert_relationships splits values into arrays
>
> Schema vs. implementation mismatch: if the schema has chunk_id TEXT NULL but the code is working with arrays, there's a fundamental mismatch.
>
> Using chunk_ids VARCHAR(255)[] NULL is correct and aligns the schema with the actual code usage: it makes the database reflect what the code is actually doing, and there should be no migration issues as long as (1) existing data in the database is already stored in array format and (2) the name used in code consistently matches the name used in the schema.
>
> Conclusions: the change from chunk_id TEXT NULL to chunk_ids VARCHAR(255)[] NULL actually fixes a schema-code mismatch. Without this change, the code would be treating a text field as an array, which is incorrect. This is not a breaking change if the code was already working with arrays (as it appears to be), and performance will be better with proper array types than with implicit casting.

Hey, I was just wondering—what if we use a join table to store the relations between entities and chunks instead of a chunk_ids column?

@CoachbotAI

> To avoid changing the table structure, could we consider modifying chunk_ids in the upsert to correspond to chunk_id? Since chunk_id is currently concatenated directly, the TEXT type should be sufficient.

> I reviewed the Postgres solution again and here are my thoughts:
>
>   • The current code (main branch) already uses chunk_ids as an array in both insertion logic and query logic
>   • The SQL templates specifically cast the values to varchar[] with $6::varchar[] syntax
>   • The JOIN conditions in queries use the PostgreSQL ANY operator: c.chunk_id = ANY(r.chunk_ids)
>   • The data processing in _upsert_entities and _upsert_relationships splits values into arrays
>
> Schema vs. implementation mismatch: if the schema has chunk_id TEXT NULL but the code is working with arrays, there's a fundamental mismatch.
>
> Using chunk_ids VARCHAR(255)[] NULL is correct and aligns the schema with the actual code usage: it makes the database reflect what the code is actually doing, and there should be no migration issues as long as (1) existing data in the database is already stored in array format and (2) the name used in code consistently matches the name used in the schema.
>
> Conclusions: the change from chunk_id TEXT NULL to chunk_ids VARCHAR(255)[] NULL actually fixes a schema-code mismatch. Without this change, the code would be treating a text field as an array, which is incorrect. This is not a breaking change if the code was already working with arrays (as it appears to be), and performance will be better with proper array types than with implicit casting.

> Hey, I was just wondering—what if we use a join table to store the relations between entities and chunks instead of a chunk_ids column?

I like your suggestion.
It would give us flexibility and potential performance boosts, since we would no longer need to keep this data in the same table.

@LarFii
Collaborator

LarFii commented Mar 20, 2025

I have merged PR #1138. I think file_path can also be converted to an array, since an entity may be associated with multiple file_paths.
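
A hedged sketch of what that could look like, again with entity as an illustrative table name and '<SEP>' as an assumed separator:

```sql
-- Sketch only: convert a concatenated file_path column into an array, mirroring chunk_ids
ALTER TABLE entity
    ALTER COLUMN file_path TYPE VARCHAR(255)[]
    USING string_to_array(file_path, '<SEP>')::varchar(255)[];
```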

@OxidBurn
Contributor Author

The PR is no longer relevant since we now have both fixes. I can release it later after refactoring with separators.

@OxidBurn OxidBurn closed this Mar 20, 2025