
Improve JSON processing performance #396

Open

wants to merge 11 commits into master
Conversation

@matteofigus (Member) commented Jan 30, 2024

Description of changes:

Follow-up PR from #395 - move the optimisation logic outside of the JSON handler to simplify it and use less memory (the hashmap is created once at the start of the job, rather than copied on each line iteration).

I also included File Size in the log as it could be useful during troubleshooting.
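As a rough illustration of the optimisation described above (hypothetical helper names and row shape, not the actual handler code), the difference between copying the lookup per row and building it once at the start of the job looks like this:

```python
# Hypothetical sketch of the described optimisation, not the real handler.

def find_matches_per_row(rows, match_ids):
    """Naive variant: the lookup set is rebuilt for every row."""
    hits = []
    for row in rows:
        lookup = set(match_ids)  # copied on each iteration: wasted work/memory
        if row["id"] in lookup:
            hits.append(row)
    return hits

def find_matches_up_front(rows, match_ids):
    """Optimised variant: the lookup set is built once, up front."""
    lookup = set(match_ids)  # created once at the start of the job
    return [row for row in rows if row["id"] in lookup]

rows = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
match_ids = ["b", "z"]
hits = find_matches_up_front(rows, match_ids)
```

Both variants return the same matches; only the memory/CPU profile differs.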

PR Checklist:

  • Changelog updated
  • Unit tests (and integration tests if applicable) provided
  • All tests pass
  • Pre-commit checks pass
  • Debugging code removed
  • If releasing a new version, have you bumped the version in the main CFN template?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@matteofigus matteofigus marked this pull request as ready for review January 30, 2024 19:00
else:
    for i in range(0, len(column["Columns"])):
        if is_column_type_decimal(schema, column["Columns"][i]):
            for composite_match in column["MatchIds"]:
                composite_match[i] = Decimal(composite_match[i])
    columns_copy = set()
Contributor

I'm wondering if there is a way to do this without creating another copy of the column.

Do you think that there is a possibility for there to be particularly large columns here?

This is primarily driven by the fact that we have seen some large simple match lists, but I haven't personally seen any use of composite matches.

Member Author

The possibility is real, but composite matches are definitely uncommon, and this only runs if a column identifier is also of type decimal, which is very uncommon too, so the chance of hitting both cases at once is very remote.

But your comment is valid. Before, we were operating on arrays, so re-iterating and casting the Decimals in this particular use-case wasn't memory intensive: we were mutating the existing data structure in place. Here we are effectively doing the opposite - working out the hashmaps up front, and copying to an array only at this point.

If you have more than one decimal column identifier, this could do the copy multiple times; perhaps I can optimise that to happen only once. I'll give it a go.

Member Author

I did re-write some logic to optimise the multi-column decimal scenario. I think this is the best I can come up with, because Python doesn't allow adding/removing values from a set while iterating over it, so the best option is to build a copy and swap it in at the end. Thoughts @ctd @cmclel7 ?
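To confirm the constraint mentioned above, here is a minimal sketch (with hypothetical values) showing that a set can't be grown mid-iteration, so a replacement set has to be built and swapped in:

```python
from decimal import Decimal

match_ids = {"1.5", "2.25", "3.75"}  # hypothetical stringified MatchIds

# Mutating a set while iterating over it raises RuntimeError:
mutation_failed = False
try:
    for m in match_ids:
        match_ids.add(Decimal(m))  # changes the set's size mid-iteration
except RuntimeError:
    mutation_failed = True  # "Set changed size during iteration"

# The workaround: build a replacement set in one pass and assign it back.
casted = {Decimal(m) for m in match_ids}
```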

Contributor

Yeah, I think you need to build a new collection.

You could use generator expressions here as well, for instance:

def cast_column_values(column, schema):
    """
    Method to cast stringified MatchIds to their actual types
    """
    if column["Type"] == "Simple":
        if is_column_type_decimal(schema, column["Column"]):
            column["MatchIds"] = set(Decimal(m) for m in column["MatchIds"])
    else:
        decimal_columns = set(
            i
            for i, col in enumerate(column["Columns"])
            if is_column_type_decimal(schema, col)
        )
        if decimal_columns:
            decimal_casted = set(
                tuple(
                    Decimal(m) if i in decimal_columns else m
                    for i, m in enumerate(composite_match_tuple)
                )
                for composite_match_tuple in column["MatchIds"]
            )
            column["MatchIds"] = decimal_casted
    return column

tbf, after writing this I'm not sure it will actually perform better. It saves building an intermediate list, but rather than iterating only over decimal_columns (and looping just that many times), it iterates over the full composite_match_tuple and does a set membership check per element. Swings and roundabouts perhaps.
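For reference, the generator-based version can be exercised end to end with a stubbed schema lookup (the is_column_type_decimal below is a hypothetical stand-in for the real helper, with the schema modelled as a plain dict of column name to type name):

```python
from decimal import Decimal

# Hypothetical stand-in for the real schema helper.
def is_column_type_decimal(schema, column_name):
    return schema.get(column_name) == "decimal"

def cast_column_values(column, schema):
    """Cast stringified MatchIds to their actual types."""
    if column["Type"] == "Simple":
        if is_column_type_decimal(schema, column["Column"]):
            column["MatchIds"] = set(Decimal(m) for m in column["MatchIds"])
    else:
        # indices of composite columns whose type is decimal
        decimal_columns = set(
            i
            for i, col in enumerate(column["Columns"])
            if is_column_type_decimal(schema, col)
        )
        if decimal_columns:
            # rebuild the set of tuples, casting only the decimal positions
            column["MatchIds"] = set(
                tuple(
                    Decimal(m) if i in decimal_columns else m
                    for i, m in enumerate(composite_match_tuple)
                )
                for composite_match_tuple in column["MatchIds"]
            )
    return column

schema = {"user_id": "string", "price": "decimal"}
composite = {
    "Type": "Composite",
    "Columns": ["user_id", "price"],
    "MatchIds": {("u1", "9.99"), ("u2", "10.50")},
}
result = cast_column_values(composite, schema)
```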

Member Author

Ah, I actually like your changes. Using a set for decimal_columns will have only a slight impact given the array is usually around 2 or 3 elements, but if we consider that the lookup can run up to millions of times, it could help.

I also like the idea of not using an intermediate array for match_array, so I incorporated this in my change. Thanks!
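The trade-off discussed above can be sketched as follows (hypothetical sizes): list membership is an O(n) scan while set membership is an O(1) hash lookup, which only starts to matter because the check runs once per tuple element across potentially millions of composite matches:

```python
import timeit

# Hypothetical data: decimal_columns is tiny (2-3 indices), but the
# membership test runs once per element of every composite match tuple.
decimal_columns_list = [0, 2]
decimal_columns_set = set(decimal_columns_list)

# list membership: O(n) linear scan; set membership: O(1) hash lookup
list_time = timeit.timeit(lambda: 2 in decimal_columns_list, number=100_000)
set_time = timeit.timeit(lambda: 2 in decimal_columns_set, number=100_000)
```

For a 2-element list the difference per lookup is tiny, which matches the comment above: the set only pays off at very high lookup counts.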

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (master@3efdceb).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master     #396   +/-   ##
=========================================
  Coverage          ?   99.71%           
=========================================
  Files             ?       31           
  Lines             ?     1744           
  Branches          ?        0           
=========================================
  Hits              ?     1739           
  Misses            ?        5           
  Partials          ?        0           

