Improve JSON processing performance #396
base: master
Conversation
Increase the speed of the json_handler by migrating from a list to a set, moving membership lookups from O(n) to O(1).
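A minimal sketch of why this matters (the variable names below are illustrative, not the handler's actual code):

```python
# Membership tests: a list scans elements one by one, a set hashes the key.
match_ids_list = [str(i) for i in range(1_000_000)]
match_ids_set = set(match_ids_list)

"999999" in match_ids_list  # O(n): walks the list until it finds a match
"999999" in match_ids_set   # O(1) on average: a single hash lookup
```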
```python
else:
    for i in range(0, len(column["Columns"])):
        if is_column_type_decimal(schema, column["Columns"][i]):
            for composite_match in column["MatchIds"]:
                composite_match[i] = Decimal(composite_match[i])
columns_copy = set()
```
I'm wondering if there is a way to do this without creating another copy of the column.
Do you think the columns here could be particularly large?
I ask mainly because we have seen some large simple match lists, though I haven't personally seen composite matches used.
The possibility is real, but composite matches are definitely uncommon, and this branch only runs when a column identifier is also of type Decimal, which is also very uncommon, so the chance of hitting both cases at once is very remote.
Your comment is valid, though. Before, we were operating on arrays, so re-iterating and casting the Decimals in this particular case wasn't memory intensive: we mutated the existing data structure in place. Here we are effectively doing the opposite, working out the hashmaps up front and copying to an array only at this point.
That said, if more than one column identifier is a Decimal, this could end up doing the copy multiple times, and perhaps I can optimise that to happen only once. I'll give it a go.
I re-wrote some of the logic to optimise the multi-column Decimal scenario. I think this is the best I can come up with, since Python doesn't allow dynamically adding/removing values from a set while iterating over it; the best option seems to be to build a copy and replace the original at the end. Thoughts @ctd @cmclel7?
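(For reference, CPython does raise a RuntimeError if a set changes size while being iterated; a minimal illustration:)

```python
s = {1, 2, 3}
try:
    for x in s:
        s.add(x * 10)  # grows the set mid-iteration
except RuntimeError as e:
    print(e)  # Set changed size during iteration
```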
Yeah, I think you need to build a new collection.
You could use generator expressions here as well, for instance:
```python
from decimal import Decimal


def cast_column_values(column, schema):
    """
    Method to cast stringified MatchIds to their actual types
    """
    if column["Type"] == "Simple":
        if is_column_type_decimal(schema, column["Column"]):
            column["MatchIds"] = set(Decimal(m) for m in column["MatchIds"])
    else:
        decimal_columns = set(
            i
            for i, col in enumerate(column["Columns"])
            if is_column_type_decimal(schema, col)
        )
        if decimal_columns:
            decimal_casted = set(
                tuple(
                    Decimal(m) if i in decimal_columns else m
                    for i, m in enumerate(composite_match_tuple)
                )
                for composite_match_tuple in column["MatchIds"]
            )
            column["MatchIds"] = decimal_casted
    return column
```
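For illustration, here's a hypothetical invocation of the sketch above (the schema shape and the is_column_type_decimal stub are assumptions for the example, not the project's real implementations):

```python
# Hypothetical stand-in for the real helper, which inspects the data
# mapper schema; here the schema is faked as a name -> type mapping.
def is_column_type_decimal(schema, column_name):
    return schema.get(column_name) == "Decimal"

schema = {"price": "Decimal", "user_id": "String"}

simple = {"Type": "Simple", "Column": "price", "MatchIds": {"1.99", "2.50"}}
composite = {
    "Type": "Composite",
    "Columns": ["user_id", "price"],
    "MatchIds": {("u1", "1.99"), ("u2", "2.50")},
}

print(cast_column_values(simple, schema)["MatchIds"])
# {Decimal('1.99'), Decimal('2.50')}
print(cast_column_values(composite, schema)["MatchIds"])
# {('u1', Decimal('1.99')), ('u2', Decimal('2.50'))}
```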
tbf after writing this I'm not sure it will actually perform better. It saves building an intermediate list, but rather than looping only len(decimal_columns) times it now does a membership check for each of the len(composite_match_tuple) elements. Swings and roundabouts, perhaps.
Ah, I actually like your changes. Using a set for decimal_columns will have only a slight impact, given I'd assume the array is usually around 2 or 3 entries, but if we consider doing that lookup up to millions of times, it could help. I also like the idea of not using an intermediate array for match_array, so I incorporated this into my change. Thanks!
Co-authored-by: Chris Deigan <[email protected]>
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

```
@@           Coverage Diff            @@
##             master     #396   +/-  ##
=========================================
  Coverage          ?   99.71%
=========================================
  Files             ?       31
  Lines             ?     1744
  Branches          ?        0
=========================================
  Hits              ?     1739
  Misses            ?        5
  Partials          ?        0
```

☔ View full report in Codecov by Sentry.
Description of changes:
Follow-up PR from #395: move the optimisation logic outside of the JSON handler to simplify the logic and use less memory (the hashmap is created once at the start of the job, rather than copied on each line iteration); see the sketch below.
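A schematic sketch of that shape (the function, field names, and data are hypothetical, for illustration only):

```python
def scrub_matching_lines(lines, match_ids):
    """Keep only the lines whose user_id is not in the precomputed set."""
    # match_ids is built once, before the per-line loop, so each line
    # costs a single O(1) membership test instead of a fresh copy/cast.
    return [line for line in lines if line["user_id"] not in match_ids]

match_ids = {"u1", "u2"}                       # built once at job start
lines = [{"user_id": "u1"}, {"user_id": "u3"}]
print(scrub_matching_lines(lines, match_ids))  # [{'user_id': 'u3'}]
```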
I also included the File Size in the log, as it could be useful during troubleshooting.
PR Checklist:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.