I'll be using this issue to track our effort in creating gold-standard synthetic data sets for the Data Model-Based Ingestion Pipeline. We need at least one synthetic data set, preferably two, for expansive testing of the pipeline and of each tool within the pipeline suite.

I think the main features of a good synthetic data set are:

- It contains no real data and is therefore immune from holding any sensitive or potentially sensitive information.
- It substantially represents one or more of the real-world data sets we are targeting.
Some key aspects of representing a data set are:

- similar data field names (exactly the same if reasonable)
- similar data field ranges
- similar data distribution
- similar data field discrepancies
- representation of missing values within appropriate fields
- representation of data errors within appropriate fields
The synthetic data set can be significantly smaller than the original data set as long as it preserves most of the key features of the data, especially the key challenges.
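As a starting point for discussion, here is a minimal sketch of how we might generate a data set with these properties using Python with numpy and pandas. The field names (`site_id`, `age`, `visit_date`, `lab_value`), ranges, distributions, and error rates are placeholders chosen only for illustration; the real values would need to come from the data model and whatever we learn about the target data sets.

```python
# Minimal sketch of a synthetic data generator (Python with numpy/pandas).
# All field names, ranges, distributions, and error rates below are illustrative
# assumptions, not taken from any real target data set.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n_rows = 500  # much smaller than the real data set; coverage of cases matters more than size

df = pd.DataFrame({
    # similar field names and value ranges (placeholder choices)
    "site_id": rng.choice(["SITE_A", "SITE_B", "SITE_C"], size=n_rows),
    "age": rng.normal(loc=45, scale=15, size=n_rows).clip(0, 100).round().astype(int),
    "visit_date": pd.Timestamp("2020-01-01")
    + pd.to_timedelta(rng.integers(0, 365, size=n_rows), unit="D"),
    "lab_value": rng.lognormal(mean=1.0, sigma=0.5, size=n_rows).round(2),
})

# representation of missing values within appropriate fields (~5% missing lab values)
missing_mask = rng.random(n_rows) < 0.05
df.loc[missing_mask, "lab_value"] = np.nan

# representation of data errors and field discrepancies within appropriate fields
error_idx = rng.choice(n_rows, size=5, replace=False)
df.loc[error_idx, "age"] = -1                 # impossible ages
df.loc[error_idx[:2], "site_id"] = "site_a "  # casing/whitespace discrepancies

df.to_csv("synthetic_sample.csv", index=False)
```

If we do this for real, the per-field parameters would probably be better kept in a small config keyed off the data model rather than hard-coded as above.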
Here is the initial synthetic data set that we were given. It has several shortfalls as a synthetic data set for broad testing:

- Many fields (columns) contain only the field name and no representative data.
- The columns that do have representative data do not appear to reflect especially useful data cases.
- When the fields without representative data are ignored, most of the interesting prioritized variables are lost.
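A quick check along these lines can quantify the first point; `provided_synthetic.csv` is a placeholder for whatever the provided file is actually called.

```python
# Count columns that carry no representative data at all.
# "provided_synthetic.csv" is a placeholder name for the file we were given.
import pandas as pd

df = pd.read_csv("provided_synthetic.csv")
empty_cols = [col for col in df.columns if df[col].isna().all()]
print(f"{len(empty_cols)} of {len(df.columns)} columns contain no representative data:")
print(empty_cols)
```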
Unfortunately, our discussion with Ozzy has made clear that a better synthetic data set will not be forthcoming, and we will likely be better off attempting to create the data ourselves. An additional roadblock is that we do not currently have access to the data we will be transforming, so we have no example data set to work from.