Resolved issue with upsert of PG entity and relation #1127
Conversation
resolved issue with upsert of pg entity and relation;
@@ -1570,7 +1571,7 @@ def namespace_to_table_name(namespace: str) -> str:
     content_vector VECTOR,
     create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
     update_time TIMESTAMP,
-    chunk_id TEXT NULL,
+    chunk_ids VARCHAR(255)[] NULL,
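For context on the new column type: VARCHAR(255)[] is a PostgreSQL array, so a single row stores a list of chunk IDs, and membership checks use ANY. A minimal sketch (the table name LIGHTRAG_VDB_ENTITY is an assumption, not taken from this diff):

-- Find the entity rows that reference a given chunk:
SELECT * FROM LIGHTRAG_VDB_ENTITY
WHERE 'chunk-ecc952703fbaabc97fee8f328ae8de56' = ANY(chunk_ids);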
This fix introduces another issue: when there are too many associated chunks, the chunk_id exceeds the limit. Do you have any good solutions for this?
This is a VARCHAR(255), so up to 255 characters per element. The code uses an MD5 hash, which is only 32 characters wide, so it is unlikely there will ever be such a large ID.
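A quick way to check this in PostgreSQL itself (illustrative only):

-- md5() returns a 32-character hex digest, well under VARCHAR(255):
SELECT length(md5('example chunk content'));  -- 32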
In PostgreSQL, using a join table is an efficient way to store large amounts of related data. By creating an intermediary table with foreign keys, you can establish one-to-many or many-to-many relationships between the main table and associated data, overcoming the 1 GB array limit. Join tables also provide better query performance and data management, with flexible indexing options that optimize search, insert, and update operations. This approach is particularly suitable for scenarios where you need to store large quantities of IDs like "chunk-ecc952703fbaabc97fee8f328ae8de56" (see the sketch below).
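A minimal sketch of such a join table (all table and column names here are hypothetical, not taken from this PR):

-- One row per (entity, chunk) pair; the composite key prevents duplicates.
CREATE TABLE IF NOT EXISTS LIGHTRAG_ENTITY_CHUNKS (
    workspace VARCHAR(255) NOT NULL,
    entity_id VARCHAR(255) NOT NULL,
    chunk_id  VARCHAR(255) NOT NULL,
    PRIMARY KEY (workspace, entity_id, chunk_id)
);
-- An index on chunk_id keeps reverse lookups (chunk -> entities) cheap.
CREATE INDEX IF NOT EXISTS idx_entity_chunks_chunk_id
    ON LIGHTRAG_ENTITY_CHUNKS (chunk_id);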
> This is a VARCHAR(255), so up to 255 characters per element. The code uses an MD5 hash, which is only 32 characters wide, so it is unlikely there will ever be such a large ID.

It was mentioned in issue #1088.
> This is a VARCHAR(255), so up to 255 characters per element. The code uses an MD5 hash, which is only 32 characters wide, so it is unlikely there will ever be such a large ID.

I understand the difference now. I'm not very familiar with PostgreSQL, sorry about that!
Previously, the field in the PostgreSQL table was chunk_id. Modifying it this way will make the old data unavailable. Is there a way to add a data migration to transfer the old data to the new table?
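For reference, a migration along these lines might work (a minimal sketch; the table name LIGHTRAG_VDB_ENTITY and the '<SEP>' separator are assumptions, the real separator being GRAPH_FIELD_SEP in the code):

-- Add the new array column alongside the old TEXT column:
ALTER TABLE LIGHTRAG_VDB_ENTITY ADD COLUMN IF NOT EXISTS chunk_ids VARCHAR(255)[] NULL;
-- Split the concatenated TEXT value into the array, only where not yet migrated:
UPDATE LIGHTRAG_VDB_ENTITY
SET chunk_ids = string_to_array(chunk_id, '<SEP>')::VARCHAR(255)[]
WHERE chunk_id IS NOT NULL AND chunk_ids IS NULL;
-- After verifying the data, the old column can be dropped:
-- ALTER TABLE LIGHTRAG_VDB_ENTITY DROP COLUMN chunk_id;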
# Guard added so nodes without metadata/file_path no longer raise KeyError:
if "metadata" in already_node and "file_path" in already_node["metadata"]:
    already_file_paths.extend(
        split_string_by_multi_markers(
            already_node["metadata"]["file_path"], [GRAPH_FIELD_SEP]
        )
    )
In PR #1138, I removed the metadata-related code (it had no actual effect).
Perfect, thank you for that update.
To avoid changing the table structure, could we consider modifying chunk_ids in the upsert to correspond to chunk_id? Since chunk_id is currently concatenated directly, the TEXT type should be sufficient.
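In plain SQL terms, a concatenated TEXT value can round-trip through a separator (the code uses GRAPH_FIELD_SEP; '<SEP>' below is a stand-in):

SELECT array_to_string(ARRAY['chunk-a','chunk-b'], '<SEP>');  -- 'chunk-a<SEP>chunk-b'
SELECT string_to_array('chunk-a<SEP>chunk-b', '<SEP>');       -- {chunk-a,chunk-b}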
I reviewed the Postgres solution again and here are my thoughts:
Schema vs. implementation mismatch: using chunk_ids VARCHAR(255)[] NULL is correct and aligns the schema with the actual code usage.
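For completeness, an upsert against the array column could look like this (a sketch only; the table name, key columns, and the unique constraint on (workspace, id) are assumptions):

INSERT INTO LIGHTRAG_VDB_ENTITY (workspace, id, chunk_ids)
VALUES ($1, $2, $3)
ON CONFLICT (workspace, id)
DO UPDATE SET chunk_ids   = EXCLUDED.chunk_ids,
              update_time = CURRENT_TIMESTAMP;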
Hey, I was just wondering: what if we use a join table to store the relations between entities and chunks instead of a chunk_ids column?
I like your suggestion.
I have merged PR #1138. I think
The PR is no longer relevant since we now have both fixes. I can release it later after refactoring with separators.
Description
This PR complements PR #1120 ("fix: correct chunk_ids as array type and remove incorrect filepath type conversion") with a few additional improvements.
Related Issues
Addresses the "entity and relation upsert error" as well as missing "metadata" key errors encountered during the insertion of entities and relationships into the PostgreSQL database.
Changes Made
Checklist
Additional Notes
The fix ensures data integrity during entity and relationship extraction operations by resolving SQL syntax errors that were preventing proper document processing.