AnonymizedFaker should have a fallback in the case that it's unable to generate enough unique values #981

Open
npatki opened this issue Apr 10, 2025 · 0 comments · May be fixed by #986
npatki commented Apr 10, 2025

Problem Description

I'm filing this issue based on a discussion in SDV #2427: Faker functions generally have a maximum number of unique values that they can generate. If you request more unique values than that, the transformer crashes.

import pandas as pd
from rdt import HyperTransformer
from rdt.transformers.pii import AnonymizedFaker

ht = HyperTransformer()
ht.set_config({
    'sdtypes': {
        'name': 'pii'
    },
    'transformers': {
        'name': AnonymizedFaker(provider_name='person', function_name='first_name',
                                cardinality_rule='unique')
    }
})

test_data = pd.DataFrame(data={
    'name': ['Alice', 'Bob', 'Carol']
})

ht.fit(test_data)
ht.create_anonymized_columns(num_rows=1000, column_names=['name'])
TransformerProcessingError: The Faker function you specified is not able to generate 1000 unique values. Please use a different Faker function for column ('name').

Expected behavior

The RDT RegexGenerator runs into a very similar problem: uniqueness cannot be enforced if the regex format string doesn't allow for enough distinct values. In that case, we do not crash; we continue to generate new strings while warning the user that they will not follow the format.

We can take a similar approach for the AnonymizedFaker:

ht.create_anonymized_columns(num_rows=1000, column_names=['name'])
UserWarning: Unable to generate enough unique values for column 'name' in a human-readable format. Additional values may be created randomly.

Approach: After exhausting all possibilities in the original function, switch over to the bothify function set to '??????' (with uniqueness on). Each ? is replaced by an ASCII letter, so this allows for over 19 billion more unique possibilities (more than enough).
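To make the proposed flow concrete, here is a rough sketch in plain Python. It is not RDT's implementation: `generate_unique`, its parameters, and the `max_attempts` cutoff are hypothetical names chosen for illustration, the `generate` callable stands in for the configured Faker function, and the random six-letter strings are a standard-library stand-in for bothify('??????').

```python
import random
import string
import warnings

# Size of the fallback space: six positions, each an upper- or lowercase ASCII letter.
FALLBACK_SPACE = len(string.ascii_letters) ** 6  # 52 ** 6


def generate_unique(generate, num_rows, column_name, max_attempts=1000, seed=None):
    """Return num_rows unique strings, falling back to random letters.

    `generate` is any zero-argument callable returning a string. It is a
    hypothetical stand-in for the Faker function set on the transformer.
    """
    rng = random.Random(seed)
    seen = set()
    values = []

    # Phase 1: draw from the original function until it yields max_attempts
    # duplicates in a row, which we treat as "exhausted".
    misses = 0
    while len(values) < num_rows and misses < max_attempts:
        candidate = generate()
        if candidate in seen:
            misses += 1
            continue
        misses = 0
        seen.add(candidate)
        values.append(candidate)

    if len(values) == num_rows:
        return values

    # Phase 2: warn instead of raising, then fill the rest with random letters.
    warnings.warn(
        f"Unable to generate enough unique values for column '{column_name}' "
        'in a human-readable format. Additional values may be created randomly.',
        UserWarning,
    )
    while len(values) < num_rows:
        candidate = ''.join(rng.choices(string.ascii_letters, k=6))
        if candidate not in seen:
            seen.add(candidate)
            values.append(candidate)

    return values
```

Calling this with a generator that only knows 'Alice', 'Bob', and 'Carol' and num_rows=10 would return those three names followed by seven random six-letter strings, and emit the warning once. The readable values come first, so short requests are unaffected by the fallback.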

Alternative considered: Similar to RegexGenerator, we can repeat the values of the original function, but with a numerical value after them to indicate that they are copies. For example, Alice(1), Bob(1), etc. This has the benefit that we can create an unlimited number of unique strings by just incrementing the value. The con is that it may lead people to assume that Alice is actually related to Alice(1), which is not true.
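For comparison, the suffix-based alternative could look roughly like the sketch below. The helper name `with_copy_suffixes` is hypothetical, and it assumes the base values passed in are already unique.

```python
from itertools import count, islice


def with_copy_suffixes(base_values):
    """Yield the base values, then value(1), value(2), ... without end.

    Assumes `base_values` is a list of already-unique strings.
    """
    yield from base_values
    for n in count(1):
        for value in base_values:
            yield f'{value}({n})'


# Take the first seven values from the endless stream.
first_seven = list(islice(with_copy_suffixes(['Alice', 'Bob', 'Carol']), 7))
# ['Alice', 'Bob', 'Carol', 'Alice(1)', 'Bob(1)', 'Carol(1)', 'Alice(2)']
```

The output shows the downside described above: 'Alice(1)' reads as if it were linked to 'Alice', even though the two rows are unrelated.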

@npatki npatki added the feature request Request for a new feature label Apr 10, 2025
@pvk-developer pvk-developer self-assigned this Apr 18, 2025
@pvk-developer pvk-developer added this to the 1.17.0 milestone Apr 23, 2025