Define Synthetic Data Features #34

Open
amc-corey-cox opened this issue Feb 21, 2025 · 0 comments

We should define the features we need in both our near-term and our ideal synthetic data sets. Much of this is already captured in this issue's parent issue.

These are copied from there:

I think the main features of a good synthetic data set are:

  • Contains no real data, and therefore cannot contain any sensitive or potentially sensitive information
  • Substantially represents one or more of the real-world data sets we are targeting
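
As a rough illustration of the first point, a minimal sketch of an exact-match check between a real data set and a synthetic one might look like the following. The function name `find_leaked_rows` and the list-of-dicts row format are assumptions made for this example, not part of the pipeline. Passing this check is necessary but not sufficient: it only catches verbatim copies, not near-duplicates or re-identifiable combinations of values.

```python
def find_leaked_rows(real_rows, synthetic_rows):
    """Return synthetic rows that exactly match some real row.

    Rows are dicts mapping field name -> hashable value (hypothetical format).
    Field order within a row does not matter.
    """
    # Normalize each real row into an order-independent, hashable key.
    real_keys = {tuple(sorted(row.items())) for row in real_rows}
    return [
        row for row in synthetic_rows
        if tuple(sorted(row.items())) in real_keys
    ]


real = [{"id": 1, "age": 34}]
synthetic = [{"age": 34, "id": 1}, {"id": 2, "age": 40}]
leaked = find_leaked_rows(real, synthetic)
print(leaked)  # any row listed here was copied verbatim from the real data
```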

I'll be using this issue to track our effort in creating gold-standard synthetic data sets for the Data Model-Based Ingestion Pipeline.

Some key aspects of representing a data set are:

  • similar data field names (exactly the same if reasonable)
  • similar data field ranges
  • similar data distribution
  • similar data field discrepancies
  • representation of missing values within appropriate fields
  • representation of data errors within appropriate fields
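
Some of the aspects above (field names, field ranges, and missing values) could be checked automatically. The sketch below is only one possible approach, using plain Python and rows represented as dicts; the helper names `profile` and `compare_profiles`, the 5% tolerance, and the rule that synthetic values should stay inside the real range are all assumptions for illustration. Distribution similarity, discrepancies, and data errors are not covered here.

```python
def profile(rows):
    """Summarize each field: fraction of missing values and numeric min/max."""
    fields = sorted({name for row in rows for name in row})
    summary = {}
    for name in fields:
        values = [row.get(name) for row in rows]
        # Treat None and empty strings as missing (an assumption).
        present = [v for v in values if v is not None and v != ""]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        summary[name] = {
            "missing_rate": 1 - len(present) / len(values),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
        }
    return summary


def compare_profiles(real, synthetic, rate_tolerance=0.05):
    """List the ways a synthetic profile differs from the real profile."""
    problems = []
    for name in sorted(set(real) | set(synthetic)):
        if name not in real or name not in synthetic:
            problems.append(f"{name}: field present in only one data set")
            continue
        r, s = real[name], synthetic[name]
        if abs(r["missing_rate"] - s["missing_rate"]) > rate_tolerance:
            problems.append(f"{name}: missing-value rates differ")
        if r["min"] is not None and s["min"] is not None:
            if s["min"] < r["min"] or s["max"] > r["max"]:
                problems.append(f"{name}: synthetic values fall outside the real range")
    return problems


real_rows = [{"age": 30, "sex": "F"}, {"age": 50, "sex": None}]
synthetic_rows = [{"age": 35, "sex": "M"}, {"age": 70, "sex": "F"}]
for problem in compare_profiles(profile(real_rows), profile(synthetic_rows)):
    print(problem)
```

An empty result from `compare_profiles` only means these three checks passed; it says nothing about whether the distributions or error patterns match.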