GitHub - jdamodhar/sdgp: This project synthetic data generator is a python script that generates mock data based on given configurations. It can also edit and scale existing data to create high volume data. It is useful for testing and prototyping purposes.

Utilization Synthetic data generator plus Project

For questions on this package contact the package Developer Damodhar Jangam or Vyasa Koundinya, Lanka.

Overview

This project Synthetic data generator plus is a python script that generates mock data based on given configurations. It can also edit and scale existing data to create high volume data. It is useful for testing and prototyping purposes.

Features

Generate mock data for different types of configuration items
Edit the mock data and generate mock data for different types of configuration items
Configuration rules include generating unique indices, fixed or random dates/times, categorical values, float values within a range, integer values within a range, or constant values.
Generate high volume data
Save a DataFrame in CSV and Parquet file formats

Package Installation

Install on a Local Machine (optional)

Go through the following sequence:

Clone repo
Create a virtual environment and install the package

PS > python -m venv .venv
PS > .\.venv\Scripts\activate
PS > pip install -r requirements.txt
PS > deactivate # when you need exit

Install on a edge node (optional)

Go through the following sequence:

Clone repo
Create a virtual environment and install the package

$ python3 -m venv .venv
$ source env/bin/activate
$ pip install -r requirements.txt
$ deactivate # when you need exit

At that point you're good to go and the package Synthetic data generater plus and its module will be available for use in your virtual environment.

Usage

To run the script, you need to provide some arguments:

-c or --choice: The type of function to select. m for mock data, e for edit mock data, g for generate high volume data.
volume: The size. An integer value that specifies how many rows to generate mock data. Recommended minimum value is more than volume size or more than 1000.
format: The type of format to save the mock data. csv for CSV format, parquet for Parquet format.
csv_file: The CSV file name. A string value that specifies the name of the CSV file to read if there or to write output.
conf_csv_file: The configuration CSV file name. A string value that specifies the name of the configuration CSV file to read. This argument is required if mode is e or g. Example configuration file:

name	type	values
id1	uniqueIndex	800000000
date1	date	2022-10-26 \|%Y-%m-%d
time1	time	00:00:00\|23:59:59
dateRange1	dateRange	2021-10-10 \| 2022-10-26 \|%Y-%m-%d
incometime2	dateRange	2021-10-10 \| 2022-10-26 \|%Y-%m-%d %H:%M:%S
outcometime3	dependentDateRange	incometime2\|1D\|3W\|%Y-%m-%d %H:%M:%S
model1	category	Customers\|Lending\|Web_Lending
probability1	floatRange	0.001\|1\|3
float1	floatRange	0.001\|0.3\|5
number1	intRange	10\|25
test1	constant	Done
name1	regexPattern	([a-z]{3,10}), ([a-z]{3,10})
phone_number	regexPattern	(+[4-9]{2,3})-([4-9]{5})-([4-9]{5})
zip_code	regexPattern	([4-9]{5})
email_address	regexPattern	([a-zA-Z0-9]{1,10})@[a-z]{1,5}.(com\|net\|org\|in)
compositeKey	composite	dateRange1 \| model1 \|number1 \|phone_number\|zip_code

name,type,values
id1,uniqueIndex,800000000
date1,date,2022-10-26|%Y-%m-%d
time1,time,00:00:00|23:59:59|%H:%M:%S
dateRange1,dateRange,2021-10-10 | 2022-10-26|%Y-%m-%d
incometime2,dateRange,2021-10-10 | 2022-10-26|%Y-%m-%d %H:%M:%S
outcometime3,dependentDateRange,incometime2|1D|3W|%Y-%m-%d %H:%M:%S
model1,category,Customers|Lending|Web_Lending||
probability1,floatRange,0.001|1|3
float1,floatRange,0.001|0.3|5
number1,intRange,10|25
test1,constant,Done
name1,regexPattern,"([a-z]{3,10})\, ([a-z]{3,10})"
phone_number,regexPattern,"(\+[4-9]{2,3})\-([4-9]{5})\-([4-9]{5})"
zip_code,regexPattern,([4-9]{5})
email_address,regexPattern,"([a-zA-Z0-9]{1,10})\@[a-z]{1,5}\.(com|net|org|in)"
compositeKey1,composite,dateRange1|model1|number1|phone_number|zip_code

Explanation of above file and possible data types with this tool:

uniqueIndex: This indicates that the id1 column should contain unique and sequential values, starting from 800000000.
date: This indicates that the date1 column should contain a fixed date value (2022-10-26) for all rows. %Y-%m-%d format is used.
time: This indicates that the time1 column should contain random time values between 00:00:00 and 23:59:59.
dateRange: This indicates that the dateRange1 and incometime2 columns should contain random date values within the range from 2021-10-10 to 2022-10-26. The format of the dates in incometime2 also includes%Y-%m-%d %H:%M:%S. format reference given below.
dependentDateRange: This indicates that the outcometime3 column should contain random duration values within the range from 1D to 3W in addition to the incometime2.Here 1D means 1 day and 3W means 3 weeks. Other compatable inputs are 10S means 10 seconds, 5m means 5 minutes, 2h means 2 hours, 3d means 3 days, 4W means 4 weeks. The format of the dates in outcometime3 also includes%Y-%m-%d %H:%M:%S. format reference given below.
category: This indicates that the model column should contain random categorical values chosen from the options "Customers", "Lending", and "Web_Lending".
floatRange: This indicates that the probability1 and float columns should contain random float values within a given range. The range for probability1 is from 0.001 to 1, with a precision of 3 decimal places. The range for float is from 0.001 to 0.3, with a precision of 5 decimal places.
intRange: This indicates that the number1 column should contain random integer values within the range from 10 to 25.
constant: This indicates that the test1 column should contain a constant value (Done) for all rows.
regexPattern: This indicates that the name1 column should contain a fixed pattren range value (([a-z]{3,10})\, ([a-z]{3,10})) for all records. Thephone_number column should contain a fixed length phone number value ((\+[4-9]{2,3})\-([1-9]{5})\-([1-9]{5})) for all records. The zip_code column should contain a fixed length zip code value (([4-9]{5})) for all records. email_address column should contain a fixed length email address value (([a-zA-Z0-9]{1,10})\@[a-z]{1,5}\.(com|net|org|in)) for all records. For more regex pattren check here and play around with it.
composite: This indicates that the compositeKey1 column should contain sha256 hashed value from these combinations: dateRange1|model1|number1|phone_number|zip_code Each row in this CSV file defines a rule for generating or handling data in a specific column of another dataset. The rules include generating unique indices, fixed or random dates/times, categorical values, float values within a range, integer values within a range, or constant values. datetime formats you can use in the script:
%a: Weekday as locale’s abbreviated name. Example: Mon
%A: Weekday as locale’s full name. Example: Monday
%w: Weekday as a decimal number, where 0 is Sunday and 6 is Saturday. Example: 1
%d: Day of the month as a zero-padded decimal number. Example: 30
%b: Month as locale’s abbreviated name. Example: Sep
%B: Month as locale’s full name. Example: September
%m: Month as a zero-padded decimal number. Example: 09
%y: Year without century as a zero-padded decimal number. Example: 99
%Y: Year with century as a decimal number. Example: 2013
%H: Hour (24-hour clock) as a zero-padded decimal number. Example: 07
%I: Hour (12-hour clock) as a zero-padded decimal number. Example: 07
%p: Locale’s equivalent of either AM or PM. Example: AM
%M: Minute as a zero-padded decimal number. Example: 06
%S: Second as a zero-padded decimal number. Example: 05 To run the script, use the following command:

# python main.py -c <choice> <volume> <format> <csv_file> <conf_csv_file>
#
# positional arguments:
#   volume                The size. An integer value that specifies how many rows to generate mock data. Recommended
#                         minimum value is more than volume size or more than 1000.
#   {csv,parquet}         The type of format to save the mock data. csv for CSV format, parquet for Parquet format.
#   csv_file              The CSV file name. A string value that specifies the name of the CSV file to read or write.
#   conf_csv_file         The configuration CSV file name. A string value that specifies the name of the configuration
#                         CSV file to read. This argument is required if mode is e or g.
# options:
#   -h, --help            show this help message and exit
#   -c {m,e,g}, --choice {m,e,g}
#                         The type of function to select. m for mock data, e for edit mock data, g for generate high
#                         volume data.

For example:

python main.py -c m 50000 csv mock_table conf.csv # Generate 50000 rows of mock data and save as mock_table_50000.csv
python main.py -c e 100000 parquet edit_table.csv conf.csv # Along with given data can edit with conf.csv, generate 100000 recrds and save as edit_table_100000.parquet\n
python main.py -c g 1000000 csv scale.csv # Generate 1000000 rows of mock data by scaling existing data and save as scale_1000000.csv

Sample output for python .\main.py -c m 1000000 csv test .\test_conf.csv :

id1,date1,model1,probability1,float1,number1,test1,time1,dateRange1,incometime2,outcometime3,name1,phone_number,zip_code,email_address,compositeKey1
800000004,2022-10-26,,0.792,0.14948,12,Done,11:34:20,2022-04-07,2022-06-28 21:33:32,2022-07-03 09:41:10,"gkxtawx, pfuf",+65-67845-69497,65957,[email protected],c05b0a767331f3176ec3cdf3dee852759a858e30
800000001,2022-10-26,Lending,0.442,0.11305,24,Done,06:01:02,2022-06-18,2022-07-04 01:51:18,2022-07-20 04:31:45,"ttjwjy, zesc",+48-89997-49658,78754,[email protected],ac8759aac34e718dad0ef46c62edb5bff07cb003
800000009,2022-10-26,Lending,0.267,0.17349,17,Done,08:43:08,2022-01-31,2021-12-11 02:33:20,2021-12-19 22:29:15,"vlflyewer, ilj",+564-44495-77467,98785,[email protected],c068d7d1a8d5e1c6527f84246d8b9dc911b52884
800000003,2022-10-26,,0.565,0.20937,11,Done,02:52:08,2022-04-27,2022-10-25 22:21:19,2022-11-15 16:22:14,"orkilkzh, xozrfwwrtq",+88-95566-65789,68677,[email protected],d51540c711301c6badc2aad051bb048fd175201b

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

If you have any questions, feedback, or suggestions, please feel free to contact me at [email protected], [email protected]. You can also open an issue or submit a pull request on GitHub if you want to contribute to this project. I hope you find this project useful and interesting. Thank you for reading! 😊

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.github		.github
conf		conf
docs		docs
sdgp		sdgp
tests		tests
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
.travis.yml		.travis.yml
AUTHORS.rst		AUTHORS.rst
CONTRIBUTING.rst		CONTRIBUTING.rst
HISTORY.rst		HISTORY.rst
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
README.rst		README.rst
requirements_dev.txt		requirements_dev.txt
setup.cfg		setup.cfg
setup.py		setup.py
tox.ini		tox.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Utilization Synthetic data generator plus Project

Overview

Features

Package Installation

Install on a Local Machine (optional)

Install on a edge node (optional)

Usage

License

Acknowledgments

About

Releases

Packages

Languages

License

jdamodhar/sdgp

Folders and files

Latest commit

History

Repository files navigation

Utilization Synthetic data generator plus Project

Overview

Features

Package Installation

Install on a Local Machine (optional)

Install on a edge node (optional)

Usage

License

Acknowledgments

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages