I do not understand the partition error: ValueError: Could not find in old schema: 2: {field}: identity(2) #1100
Comments
Hello @cfrancois7, thank you for your report. I can't test it right now, but I believe the issue might be because you're using the Arrow schema to create the table, and we don't have the field_ids. Have you tried using the PyIceberg Schema/NestedField definition instead?
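Something along these lines, for example (a rough sketch: the field ids, types and names here are illustrative, and `catalog` is assumed to be an already loaded catalog):

```python
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import IdentityTransform
from pyiceberg.types import DoubleType, NestedField, StringType, TimestampType

# Explicit field ids, so the PartitionSpec below can reference campaign_id by id.
ts_schema = Schema(
    NestedField(field_id=1, name="timestamp", field_type=TimestampType(), required=True),
    NestedField(field_id=2, name="campaign_id", field_type=StringType(), required=True),
    NestedField(field_id=3, name="value", field_type=DoubleType(), required=False),
)

partition_spec = PartitionSpec(
    PartitionField(source_id=2, field_id=1000, transform=IdentityTransform(), name="campaign_id"),
)

table = catalog.create_table(
    identifier="my_namespace.time_series",
    schema=ts_schema,
    partition_spec=partition_spec,
)
```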
@ndrluis I remembered why I used the Arrow schema. It is because of the typing and requirement alignment between the data I want to append and the expected schema. For instance, the following code raised an error.
@cfrancois7 You can take the schema and call as_arrow():

ts_df = pa.Table.from_pydict(ts_list, schema=ts_schema.as_arrow())
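That way the PyIceberg Schema stays the single source of truth and the Arrow schema is derived from it, so the types and nullability of the data you append stay aligned with the table. A rough sketch, reusing the `ts_schema` and `ts_list` names from above:

```python
import pyarrow as pa

# Build the PyArrow table from the Iceberg schema so that column types and
# nullability line up with what the Iceberg table expects on append.
arrow_schema = ts_schema.as_arrow()
ts_df = pa.Table.from_pydict(ts_list, schema=arrow_schema)

table = catalog.load_table("my_namespace.time_series")
table.append(ts_df)
```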
Anyway, I think we need to do something to avoid this problem in the future. Since we require the field ID, we should only accept the PyIceberg Schema. What do you think? @Fokko @sungwy @kevinjqliu @HonahX (Maybe it would be nice to create a committers team to tag all of you 🤔).
Hi @cfrancois7 - thank you very much for raising this issue! And thank you @ndrluis for jumping on to dig into the root cause as well. We've made some enhancements to PyIceberg to be able to support defining PartitionSpec on table creation (this wasn't even possible before), but there are still two problems here that you helped outline:
The root cause of the problem is that the IDs of the Iceberg Table schema are reassigned when a table is created. So the constraint the API has on trying to match the PartitionSpec by ID doesn't really work on table creation. Instead, the newly introduced practice is to do the following:

with catalog.create_table_transaction(
    identifier='my_namespace.time_series',
    schema=ts_schema,
) as txn:
    with txn.update_spec() as update_spec:
        update_spec.add_identity("campaign_id")

table = catalog.load_table('my_namespace.time_series')

This approach relies on just matching the partition field by its field name, similar to how Spark and Flink APIs handle partition updates. Please let me know if this works for you! I think it'll also be worthwhile for us to leave this issue open until we can clarify our API and our documentation to prevent other users from running into the same issues.
Agree with @sungwy that this is mostly a documentation issue, so let's extend the docs so ChatGPT can give better answers. Another solution would be:

ts_table = catalog.create_table_if_not_exists(
    'default.time_series',
    schema=ts_schema,
    location="local_s3",
)

with ts_table.update_spec() as update_spec:
    update_spec.add_identity("campaign_id")

This will first create the table, and then set the spec, but that's probably alright.
I don't think this is the most user-friendly option. In the end, we don't want to put the burden of field-IDs on the users. Keep in mind that they also get re-assigned:

ts_schema = Schema(
    NestedField(field_id=1925, name="timestamp", field_type=TimestampType(), required=True),
)

ts_table = catalog.create_table('default.time_series', schema=ts_schema)
assert ts_table.schema().fields[0].field_id == 1  # Field-IDs now start from 1 as they are re-assigned.

Another thing I noticed:
Arrow by default sets everything to nullable, even when there are no nulls in the data. We could check whether nullable is set correctly by checking for null records. This could become expensive when the table is big, so we probably only want to do it when we actually try to write an optional field into a required field in the table.
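A small illustration of that default, with a made-up column name:

```python
import pyarrow as pa

# Schema inference marks the column as nullable even though it contains no nulls.
inferred = pa.Table.from_pydict({"campaign_id": ["a", "b"]})
print(inferred.schema.field("campaign_id").nullable)  # True

# Passing an explicit Arrow schema is one way to mark the column as non-nullable.
explicit = pa.schema([pa.field("campaign_id", pa.string(), nullable=False)])
strict = pa.Table.from_pydict({"campaign_id": ["a", "b"]}, schema=explicit)
print(strict.schema.field("campaign_id").nullable)  # False
```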
In case it's not clear from @Fokko's example, here's how you add a (non-identity) partition. This is the only way I've found to define a table with the Arrow schema and include a partition.

from pyiceberg.transforms import DayTransform

date_column = "some_date_col"

iceberg_table = catalog.create_table(
    identifier="default.table_name",
    schema=schema,  # arrow schema
)

with iceberg_table.update_spec() as update_spec:
    update_spec.add_field(
        source_column_name=date_column,
        transform=DayTransform(),
        partition_field_name=f"{date_column}_day",
    )
Hi, guys! This thread is very interesting! Has the documentation been updated yet? |
I see something similar in https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md#partition-evolution
This is not ideal, as it always adds a partition evolution to the tables... I would like to use StarRocks materialized views, which do not support those.
Question
While trying to partition my table, I got this error:
ValueError: Could not find in old schema: 2: {field}: identity(2)
I've dug through the documentation, Stack Overflow and Medium to find an answer.
I even tried ChatGPT, but without success :D
I've used a local SQLite catalog and a MinIO server to develop a "proof of concept".
Next, the code to reproduce the issue:
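Roughly, the setup looks like this (simplified: the catalog configuration, field ids and column names here are not the exact ones I use):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.transforms import IdentityTransform

catalog = load_catalog("local")  # SQLite catalog backed by a MinIO warehouse

ts_schema = pa.schema([
    pa.field("timestamp", pa.timestamp("us")),
    pa.field("campaign_id", pa.string()),
    pa.field("value", pa.float64()),
])

# Partition by campaign_id, referencing the field by id.
partition_spec = PartitionSpec(
    PartitionField(source_id=2, field_id=1000, transform=IdentityTransform(), name="campaign_id"),
)

table = catalog.create_table(
    identifier="default.time_series",
    schema=ts_schema,  # Arrow schema, so the field ids are not preserved
    partition_spec=partition_spec,
)
# -> ValueError: Could not find in old schema: 2: {field}: identity(2)
```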
My purpose is to partition the table by campaign_id.
Is it possible? If yes, how?
How should I interpret the API documentation?
I also tried with the timestamp field and the DayTransform, such as:
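(Simplified again; the ids here mirror the error message below.)

```python
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.transforms import DayTransform

partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=100, transform=DayTransform(), name="timestamp_day"),
)
```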
It raised the same error.
ValueError: Could not find in old schema: 100: timestamp_day: Day(1)