My_project/ <br/>
├── main.py <br/>
├── config/ # Configuration, parameters
│ └── __init__.py
│ └── configuration.py
│ └── table.py
├── modules/ # Functions
│ ├── __init__.py
│ ├── generateQuery.py
│ ├── standarize.py
│ ├── generateDictionary.py
├── data/ # Data(.csv/.json)
└── README.md
- Convert multiple space to single space
REGEXP_REPLACE(_, r'\\s+', ' ') - Convert to upper case
UPPER() - Delete spaces before and after
TRIM() - Replace diacritics (accents)
REGEXP_REPLACE(NORMALIZE(_, NFD), r'\pM', '')
- Handle missing values (convert variants like "N/A", "null", "" to NULL)
- Unify the brands' names:
To begin with, we remove all non-alphanumeric characters from brand names to normalize similar variants (e.g., "L'OREAL", "L OREAL" → "LOREAL").
Next, we group the original brand names based on their normalized form and count the frequency of each original name within each group.
We then select the most frequent original name in each group as the standardized brand name and store this mapping in a dictionary.
As a result, all variants like "L'OREAL" and "L OREAL" will be unified under the most common form depending on the frequency.
- Check duplicates across the entire line
- Check duplicates based on primary key
- [Option] Check barcode length
- [Option] Check consistency of barcode with hierarchy