Create Synthetic Data Sets #16

amc-corey-cox · 2025-01-28T15:05:11Z

We need at least one synthetic data set, preferably two, for extensive testing of the pipeline and of each tool in the pipeline suite.

I think the main features of a good synthetic data set are:

Contains no real data, and therefore cannot expose any sensitive or potentially sensitive information
Substantially represents one or more of the real-world data sets we are targeting

I'll use this issue to track our effort to create gold-standard synthetic data sets for the Data Model-Based Ingestion Pipeline.

Some key aspects of representing a data set are:

similar data field names (exactly the same if reasonable)
similar data field ranges
similar data distribution
similar data field discrepancies
representation of missing values within appropriate fields
representation of data errors within appropriate fields

The synthetic data set can be significantly smaller than the original data set, as long as it reproduces most of the original's key features, especially its key challenges.
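As a rough illustration of the aspects listed above, here is a minimal sketch of a generator that draws values from a per-field range and distribution and injects missing values and errors at a configurable rate. Everything in it is an assumption for illustration: the field names (`age_years`, `bmi`), the ranges, the normal distribution, the rates, and the `"ERR"` marker are placeholders, not taken from any real data set.

```python
import random

# Hypothetical field specs: names, ranges, and rates are placeholders only.
FIELD_SPECS = {
    "age_years": {"mean": 55.0, "sd": 12.0, "min": 18.0, "max": 90.0,
                  "missing_rate": 0.05, "error_rate": 0.02},
    "bmi": {"mean": 27.0, "sd": 5.0, "min": 12.0, "max": 60.0,
            "missing_rate": 0.10, "error_rate": 0.01},
}


def synth_value(spec, rng):
    """Return one synthetic cell as a string: a value, a blank, or an error marker."""
    roll = rng.random()
    if roll < spec["missing_rate"]:
        return ""  # represents a missing value
    if roll < spec["missing_rate"] + spec["error_rate"]:
        return "ERR"  # represents a data error (placeholder marker)
    # Draw from an assumed normal distribution, then clamp to the field range.
    value = rng.gauss(spec["mean"], spec["sd"])
    value = min(max(value, spec["min"]), spec["max"])
    return f"{value:.1f}"


def make_rows(n_rows, seed=0):
    """Build n_rows synthetic records; a fixed seed keeps the output reproducible."""
    rng = random.Random(seed)
    return [
        {name: synth_value(spec, rng) for name, spec in FIELD_SPECS.items()}
        for _ in range(n_rows)
    ]
```

Calling `make_rows(100, seed=1)` returns 100 dictionaries keyed by the placeholder field names. Once we know the real field names and ranges, only `FIELD_SPECS` would need to change.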

amc-corey-cox · 2025-02-03T14:54:51Z

Here is the initial synthetic data set that we were given. It has some shortcomings as a synthetic data set for broad testing.

There are a lot of fields (columns) with just the field name and no representative data.
The columns that have representative data do not appear to reflect especially useful data cases.
When the fields without representative data are ignored, most of the interesting prioritized variables are lost.
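For reference, the first shortcoming can be checked mechanically. The sketch below lists columns that have a header but no non-blank values. It assumes the data set is available as CSV text with a header row; the function name `empty_columns` is made up for this example.

```python
import csv
import io


def empty_columns(csv_text):
    """Return the names of columns that have a header but no non-blank values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames or []
    has_data = {name: False for name in fields}
    for row in reader:
        for name in fields:
            # Cells missing from short rows come back as None; treat them as blank.
            if (row.get(name) or "").strip():
                has_data[name] = True
    return [name for name in fields if not has_data[name]]
```

Running this against the file and comparing the result with the list of prioritized variables would show how many of them are lost.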

Unfortunately, our discussion with Ozzy has clarified that a better synthetic data set will not be forthcoming, so we will likely be better off creating the data ourselves. An additional roadblock is that we do not currently have access to the data we will be transforming, so we have no example data set to work from.

amc-corey-cox changed the title ~~Synthetic Data Sets~~ Create Synthetic Data Sets Feb 21, 2025

amc-corey-cox added the Datasets label Feb 25, 2025

amc-corey-cox self-assigned this Feb 28, 2025
