Experiments on PySpark UUID5 generation implementation

YevIgn/pyspark-uuid5

A simple project that grew out of the idea of comparing the performance and drawbacks of several ways to calculate UUID5 in PySpark, since there is no apparent built-in implementation.

Several versions are implemented here. Note that they are not all equivalent or equally stable. This applies especially to the PyArrow version, which was written in under a couple of hours with significant ChatGPT assistance, mainly as a guide to which APIs to study.

Performance can also vary substantially between datasets. As a very rough estimate, the Pandas approach (or the pure PySpark one, if it is actually on par) is still the most advantageous, because it is the easiest to read and maintain.

Most of the codebase was created by Brend Braeckmans and Danny Meijer.

The PyArrow version (built by wrangling with ChatGPT and the docs) and the pure PySpark imitation of UUID5 (built by reimplementing Python's UUID5 function, with ChatGPT assistance) were added mostly as ideas.
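To make the "reimplementing Python's UUID5 function" idea concrete, here is a minimal plain-Python sketch of the steps any such imitation has to reproduce: hash the namespace bytes plus the name with SHA-1, keep the first 16 bytes, then overwrite the version and variant bits. The function name `uuid5_manual` is hypothetical and is not taken from this repository; it only illustrates the algorithm that the standard library's `uuid.uuid5` follows.

```python
import hashlib
import uuid


def uuid5_manual(namespace: uuid.UUID, name: str) -> str:
    """Compute a version-5 UUID by hand (illustrative helper, not repo code)."""
    # SHA-1 over the 16 namespace bytes followed by the UTF-8 encoded name;
    # only the first 16 bytes of the 20-byte digest are used.
    raw = hashlib.sha1(namespace.bytes + name.encode("utf-8")).digest()[:16]
    digest = bytearray(raw)
    # High nibble of byte 6 holds the version: force it to 0101 (5).
    digest[6] = (digest[6] & 0x0F) | 0x50
    # Top two bits of byte 8 hold the variant: force them to 10.
    digest[8] = (digest[8] & 0x3F) | 0x80
    h = digest.hex()
    # Format as the usual 8-4-4-4-12 hexadecimal groups.
    return f"{h[:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:32]}"


if __name__ == "__main__":
    # Compare the hand-rolled result with the standard library.
    sample = "python.org"
    print(uuid5_manual(uuid.NAMESPACE_DNS, sample))
    print(uuid.uuid5(uuid.NAMESPACE_DNS, sample))
```

A column-based version would presumably need to express the same hashing, bit masking and string slicing with Spark's built-in functions. How closely the code in this repository follows these exact steps is an assumption here.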

data_generator.py was written to generate the example dataset (10k lorem-ipsum-like rows).
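For readers who want a comparable dataset without the script, the following is a rough sketch of what such a generator could look like. It is not the contents of data_generator.py. The word list, the column names `id` and `text`, the CSV output format and the helper names `generate_rows` and `write_csv` are all assumptions made for illustration.

```python
import csv
import random

# Small lorem-ipsum-like vocabulary (assumed; the real script may differ).
WORDS = (
    "lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
    "eiusmod tempor incididunt ut labore et dolore magna aliqua"
).split()


def generate_rows(n_rows=10_000, min_words=3, max_words=12, seed=42):
    """Yield (id, text) pairs of pseudo-random lorem-ipsum-like text."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    for i in range(n_rows):
        n_words = rng.randint(min_words, max_words)
        yield i, " ".join(rng.choice(WORDS) for _ in range(n_words))


def write_csv(path, n_rows=10_000):
    """Write the generated rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "text"])
        writer.writerows(generate_rows(n_rows))


if __name__ == "__main__":
    write_csv("sample_data.csv")
```

The fixed seed is a deliberate choice so that repeated benchmark runs see identical input.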
