Create Synthetic Data Sets #16

Open
amc-corey-cox opened this issue Jan 28, 2025 · 1 comment

@amc-corey-cox
Collaborator

We need at least one synthetic data set, preferably two, for extensive testing of the pipeline and of each tool in the pipeline suite.

I think the main features of a good synthetic data set are:

  • Contains no real data, so it cannot expose any sensitive or potentially sensitive information
  • Substantially represents one or more of the real-world data sets we are targeting

I'll use this issue to track our effort to create gold-standard synthetic data sets for the Data Model-Based Ingestion Pipeline.

Some key aspects of representing a data set are:

  • similar data field names (exactly the same if reasonable)
  • similar data field ranges
  • similar data distribution
  • similar data field discrepancies
  • representation of missing values within appropriate fields
  • representation of data errors within appropriate fields

The synthetic data set can be significantly smaller than the original data set, as long as it reproduces most of the original's key features, especially its key challenges.
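As a rough illustration of the aspects listed above (field ranges, distribution, missing values, and data errors), here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for illustration: the field names `age_years` and `bmi`, their ranges, means, and rates, and the helper names `synth_value` and `make_rows` are hypothetical and do not come from any real target data set. Real specs would need to come from the target data set's documentation.

```python
import random

# Hypothetical field specs (assumed for illustration only). Real names,
# ranges, and rates would come from the data set we are targeting.
FIELD_SPECS = {
    "age_years": {"mean": 55.0, "sd": 15.0, "low": 18.0, "high": 95.0,
                  "missing_rate": 0.05, "error_rate": 0.02},
    "bmi": {"mean": 27.0, "sd": 5.0, "low": 12.0, "high": 60.0,
            "missing_rate": 0.10, "error_rate": 0.03},
}


def synth_value(spec, rng):
    """Return None (missing), an out-of-range value (injected error),
    or a draw from a normal distribution clipped to [low, high]."""
    roll = rng.random()
    if roll < spec["missing_rate"]:
        return None
    if roll < spec["missing_rate"] + spec["error_rate"]:
        # Deliberately implausible value to mimic a data-entry error.
        return spec["high"] * 10
    value = rng.gauss(spec["mean"], spec["sd"])
    return round(min(max(value, spec["low"]), spec["high"]), 1)


def make_rows(n_rows, seed=0):
    """Build n_rows synthetic records; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return [
        {name: synth_value(spec, rng) for name, spec in FIELD_SPECS.items()}
        for _ in range(n_rows)
    ]


rows = make_rows(200, seed=42)
```

Because the values are drawn from a seeded random generator rather than copied or perturbed from real records, the output contains no real data. Clipping to the range slightly distorts the tails of the distribution, so a closer match to a real field would need a better-fitting distribution.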

@amc-corey-cox
Collaborator Author

Here is the initial synthetic data set that we were given. It has several shortcomings that limit its usefulness for broad testing.

  • Many fields (columns) contain only the field name and no representative data.
  • The columns that do have representative data do not appear to cover especially useful data cases.
  • When the fields without representative data are ignored, most of the interesting prioritized variables are lost.
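To make the first point checkable, a small audit can list which columns have a header but no values in any row. This is a minimal sketch that assumes the data set is available as CSV text; the function name `empty_columns` and the sample columns are hypothetical, not taken from the data set we were given.

```python
import csv
import io


def empty_columns(csv_text):
    """Return the names of columns that have no non-blank value in any row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames or []
    has_data = {name: False for name in fields}
    for row in reader:
        for name in fields:
            # Short rows yield None for trailing fields; treat as blank.
            if (row.get(name) or "").strip():
                has_data[name] = True
    return [name for name in fields if not has_data[name]]


# Hypothetical sample: "smoker" has a header but no data in any row.
sample = "id,age,smoker\n1,,\n2,40,\n"
print(empty_columns(sample))  # prints ['smoker']
```

Comparing the returned list against the list of prioritized variables would show how many of them are lost when empty fields are ignored.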

Unfortunately, our discussion with Ozzy made clear that a better synthetic data set will not be forthcoming, so we will likely be better off creating the data ourselves. An additional roadblock is that we do not currently have access to the data we will be transforming, so we have no example data set to work from.

@amc-corey-cox changed the title from "Synthetic Data Sets" to "Create Synthetic Data Sets" on Feb 21, 2025
@amc-corey-cox amc-corey-cox self-assigned this Feb 28, 2025