Problem Description
I'm filing this issue based on a discussion in SDV #2427: Faker functions generally have a maximum number of unique values that they are able to generate. If you request more unique values than that, the transformer crashes with the following error:
TransformerProcessingError: The Faker function you specified is not able to generate 1000 unique values. Please use a different Faker function for column ('name').
Expected behavior
The RegexGenerator in RDT runs into a very similar problem: uniqueness cannot be enforced if the regex format string doesn't allow for a large enough number of values. In that case, we do not crash; we continue to generate new strings while warning the user that they will not follow the format.
We can take a similar approach for the AnonymizedFaker:
UserWarning: Unable to generate enough unique values for column 'name' in a human-readable format. Additional values may be created randomly.
Approach: After exhausting all possibilities in the original function, switch to Faker's bothify function with the pattern '??????' (with uniqueness on). Each '?' is replaced by one of 52 ASCII letters, so this allows for about 19.8 billion (52^6) more unique values, which is more than enough.
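To make the proposed fallback concrete, here is a minimal sketch of the control flow, not the RDT implementation. It uses only the standard library: `base_values` is a hypothetical stand-in for everything the original Faker function can produce, and the random six-letter strings imitate what `bothify('??????')` would return. The function name `generate_unique` and its parameters are assumptions made for illustration.

```python
import random
import string
import warnings


def generate_unique(base_values, num_rows, column_name, seed=None):
    """Return num_rows unique strings, using base_values first.

    base_values is a hypothetical stand-in for the finite set of values
    the original Faker function can generate.
    """
    rng = random.Random(seed)
    pool = list(dict.fromkeys(base_values))  # drop duplicates, keep order
    rng.shuffle(pool)
    result = pool[:num_rows]

    if len(result) < num_rows:
        # The readable values are exhausted: warn instead of crashing.
        warnings.warn(
            f"Unable to generate enough unique values for column "
            f"'{column_name}' in a human-readable format. "
            "Additional values may be created randomly.",
            UserWarning,
        )
        seen = set(result)
        while len(result) < num_rows:
            # Stand-in for bothify('??????'): six random ASCII letters.
            candidate = "".join(
                rng.choice(string.ascii_letters) for _ in range(6)
            )
            if candidate not in seen:
                seen.add(candidate)
                result.append(candidate)
    return result
```

Calling `generate_unique(["Alice", "Bob", "Carol"], 10, "name")` would return the three names followed by seven random six-letter strings and emit the warning once. The size of the fallback space is `len(string.ascii_letters) ** 6`, which is 19,770,609,664.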
Alternative considered: Similar to RegexGenerator, we can repeat the values of the original function, but with a numerical value after them to indicate that they are copies. For example, Alice(1), Bob(1), etc. The benefit is that we can create an unlimited number of unique strings just by incrementing the value. The con is that it may lead people to assume that Alice is actually related to Alice(1), which is not true.
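For comparison, the alternative could be sketched as a generator that replays the base values with an incrementing suffix. This is an illustrative sketch only; `with_copy_suffix` is a hypothetical name, and `base_values` again stands in for the output of the original Faker function.

```python
from itertools import count, islice


def with_copy_suffix(base_values):
    """Yield the base values, then endless numbered copies of them."""
    base = list(dict.fromkeys(base_values))  # drop duplicates, keep order
    if not base:
        return  # nothing to repeat
    yield from base
    for n in count(1):
        for value in base:
            yield f"{value}({n})"


print(list(islice(with_copy_suffix(["Alice", "Bob"]), 5)))
# ['Alice', 'Bob', 'Alice(1)', 'Bob(1)', 'Alice(2)']
```

The output shows the downside described above: Alice(1) looks like a variant of Alice even though the two rows are unrelated.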