Architecture of the project

My_project/
├── main.py
├── config/                  # Configuration, parameters
│   ├── __init__.py
│   ├── configuration.py
│   └── table.py
├── modules/                 # Functions
│   ├── __init__.py
│   ├── generateQuery.py
│   ├── standarize.py
│   └── generateDictionary.py
├── data/                    # Data (.csv/.json)
└── README.md

Explanation of pipeline

Step1: Data cleaning

1.1 Format-level Cleaning

Collapse multiple whitespace characters into a single space: REGEXP_REPLACE(_, r'\s+', ' ')
Convert to upper case: UPPER()
Trim leading and trailing spaces: TRIM()
Remove diacritics (accents): REGEXP_REPLACE(NORMALIZE(_, NFD), r'\pM', '')
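
For quick local checks, the four format-level rules above can be mirrored in plain Python. This is only a sketch: the function name `clean_format` is hypothetical and is not part of the modules listed in the architecture, and the pipeline itself applies these rules through the SQL expressions shown above.

```python
import re
import unicodedata


def clean_format(value):
    """Apply the four format-level rules to a single string (hypothetical helper)."""
    # Collapse any run of whitespace into a single space.
    value = re.sub(r"\s+", " ", value)
    # Convert to upper case.
    value = value.upper()
    # Trim leading and trailing spaces.
    value = value.strip()
    # Decompose accented characters (NFD), then drop the combining marks,
    # which mirrors removing \pM after NORMALIZE(_, NFD).
    decomposed = unicodedata.normalize("NFD", value)
    return "".join(ch for ch in decomposed if not unicodedata.category(ch).startswith("M"))


if __name__ == "__main__":
    print(clean_format("  L'Oréal   Paris "))
```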

1.2 Semantic-level Cleaning

Handle missing values (convert variants like "N/A", "null", "" to NULL)
Unify brand names:

To begin with, we remove all non-alphanumeric characters from brand names to normalize similar variants (e.g., "L'OREAL", "L OREAL" → "LOREAL").

Next, we group the original brand names based on their normalized form and count the frequency of each original name within each group.

We then select the most frequent original name in each group as the standardized brand name and store this mapping in a dictionary.

As a result, all variants such as "L'OREAL" and "L OREAL" are unified under whichever original form occurs most often.
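
The two semantic-level steps can be sketched in Python as follows. This is a minimal illustration, not the code in `modules/`: the function names and the `MISSING_TOKENS` set are assumptions, and the input is assumed to have already gone through the format-level cleaning (upper case, no accents).

```python
from collections import Counter, defaultdict

# Assumed spellings of a missing value; the README names "N/A", "null" and "".
MISSING_TOKENS = {"N/A", "NULL", ""}


def normalize_missing(value):
    """Return None when the value is one of the missing-value spellings."""
    if value is None or value.strip().upper() in MISSING_TOKENS:
        return None
    return value


def brand_key(name):
    """Normalized form used for grouping: keep only letters and digits."""
    return "".join(ch for ch in name if ch.isalnum())


def build_brand_dictionary(names):
    """Map each original brand name to the most frequent variant in its group."""
    groups = defaultdict(Counter)
    for name in names:
        if name is not None:
            groups[brand_key(name)][name] += 1

    mapping = {}
    for counter in groups.values():
        # most_common(1) returns the variant with the highest count in the group.
        standard = counter.most_common(1)[0][0]
        for variant in counter:
            mapping[variant] = standard
    return mapping


if __name__ == "__main__":
    brands = ["L'OREAL", "L OREAL", "L'OREAL", "NIVEA", None]
    print(build_brand_dictionary(brands))
```

When two variants are equally frequent, this sketch keeps the one seen first; the README does not say how the project breaks such ties.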

Step2: Data validation

Check for duplicates across the entire row
Check for duplicates based on the primary key
[Optional] Check barcode length
[Optional] Check consistency of the barcode with the hierarchy
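
The first three checks can be sketched in plain Python over a list of row dictionaries. Everything named here is an assumption made for illustration: the function names, the `barcode` column name, and the set of accepted lengths (taken to be the standard GTIN lengths 8, 12, 13 and 14). The hierarchy consistency check is left out because the README does not define the rule.

```python
from collections import Counter

# Assumption: valid barcodes have one of the standard GTIN lengths.
ALLOWED_BARCODE_LENGTHS = {8, 12, 13, 14}


def duplicate_rows(rows):
    """Return each row that appears more than once, comparing every column."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(items) for items, n in counts.items() if n > 1]


def duplicate_keys(rows, key_columns):
    """Return each primary-key value (as a tuple) shared by more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def invalid_barcodes(rows, column="barcode"):
    """Return barcodes whose length is not in ALLOWED_BARCODE_LENGTHS."""
    return [
        row[column]
        for row in rows
        if row[column] is not None and len(row[column]) not in ALLOWED_BARCODE_LENGTHS
    ]


if __name__ == "__main__":
    sample = [
        {"id": "1", "barcode": "3600523351234", "brand": "L'OREAL"},
        {"id": "1", "barcode": "3600523351234", "brand": "L'OREAL"},
        {"id": "2", "barcode": "12345", "brand": "NIVEA"},
    ]
    print(duplicate_rows(sample))
    print(duplicate_keys(sample, ["id"]))
    print(invalid_barcodes(sample))
```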
