Much improved codebase
- Could refactor to classes and pass settings in as a class variable, but I am happy explicitly passing settings around; I prefer a functional approach like this.
- Overall I am pretty happy with the codebase. Further improvements I would consider:
- Separate the embedding model functionality into a local and a remote model and use dependency injection to switch between them (see the sketch after this list)
- Enable passing in arguments to override default settings
- Better documentation
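
As an illustration of the dependency-injection idea, a minimal sketch might look like the following. All names here (`EmbeddingModel`, `LocalWord2VecModel`, `RemoteEmbeddingModel`, the endpoint payload shape) are hypothetical, not the current code:

```python
from typing import Protocol

import numpy as np


class EmbeddingModel(Protocol):
    """Anything that can turn a list of phrases into vectors."""

    def embed(self, phrases: list[str]) -> np.ndarray: ...


class LocalWord2VecModel:
    """Embeds phrases with a word2vec model loaded from disk via gensim."""

    def __init__(self, model_path: str) -> None:
        from gensim.models import KeyedVectors

        self._vectors = KeyedVectors.load_word2vec_format(model_path, binary=True)

    def embed(self, phrases: list[str]) -> np.ndarray:
        # Average the vectors of the known words in each phrase; a real
        # implementation would need to handle phrases with no known words.
        return np.array(
            [
                np.mean(
                    [self._vectors[w] for w in p.split() if w in self._vectors],
                    axis=0,
                )
                for p in phrases
            ]
        )


class RemoteEmbeddingModel:
    """Embeds phrases by calling an HTTP endpoint (payload shape is made up)."""

    def __init__(self, endpoint: str) -> None:
        self._endpoint = endpoint

    def embed(self, phrases: list[str]) -> np.ndarray:
        import requests

        response = requests.post(self._endpoint, json={"inputs": phrases}, timeout=30)
        response.raise_for_status()
        return np.array(response.json()["embeddings"])


def cross_similarities(phrases: list[str], model: EmbeddingModel) -> np.ndarray:
    # The caller injects whichever implementation it wants; this function
    # depends only on the EmbeddingModel protocol.
    vectors = model.embed(phrases)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T  # cosine similarity matrix
```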
Process the data in chunks and in parallel, either by spinning up multiple processes or by using a distributed processing framework (including frameworks limited to a single node, such as Polars)
- Completed; Polars provides out-of-memory streaming functionality under the covers (see the sketch below)
- Done in a couple of places, which shows what I would do elsewhere
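
For reference, the Polars pattern in question looks roughly like this (a minimal sketch; the file and column names are made up, and older Polars versions spell the last line `lazy.collect(streaming=True)`):

```python
import polars as pl

# scan_csv builds a lazy query plan; nothing is read into memory yet.
lazy = (
    pl.scan_csv("input_data/phrases.csv")
    .with_columns(pl.col("phrase").str.to_lowercase())
    .filter(pl.col("phrase").str.len_chars() > 0)
)

# The streaming engine executes the plan in batches, so datasets larger
# than memory can still be processed on a single node.
df = lazy.collect(engine="streaming")
```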
Structure the code as pipeline step execution. Separate pipeline orchestration from data manipulation. Separate data manipulation from IO. Drive IO via configuration.
- Completed to a large extent (see the sketch below)
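
In simplified form, the separation looks like this (a sketch with made-up step and setting names; the real pipeline has more steps):

```python
import polars as pl


def load_phrases(settings: dict) -> pl.DataFrame:
    # IO lives at the edges and is driven purely by configuration.
    return pl.read_csv(settings["input_path"])


def normalise_phrases(df: pl.DataFrame) -> pl.DataFrame:
    # Pure data manipulation: DataFrame in, DataFrame out, no IO.
    return df.with_columns(pl.col("phrase").str.to_lowercase())


def save_phrases(df: pl.DataFrame, settings: dict) -> None:
    df.write_parquet(settings["working_path"])


def run_pipeline(settings: dict) -> None:
    # Orchestration only: wire the steps together, passing settings
    # around explicitly rather than holding them on a class.
    save_phrases(normalise_phrases(load_phrases(settings)), settings)
```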
Write a custom error-handling decorator that raises a custom exception if the method's code raised an error. Include the previous errors in args. Annotate methods with it instead of having explicit try/except blocks in the methods.
- Happy to do this, but will leave it for now; a sketch follows below
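
If I did implement it, the decorator would look something like this minimal sketch (`PipelineStepError` and `handle_step_errors` are made-up names):

```python
import functools


class PipelineStepError(Exception):
    """Raised when a decorated method fails; wraps the original error."""


def handle_step_errors(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Carry the previous errors' args on the new exception, and
            # chain the original via `from` for a full traceback.
            raise PipelineStepError(f"{func.__name__} failed", *exc.args) from exc

    return wrapper


@handle_step_errors
def load_model(path: str):
    raise FileNotFoundError(path)  # stands in for a real failure
```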
In Step 1, clean duplicates, outliers and stopwords from the phrases. Where an exact match is not found in the word2vec vocabulary, use the Levenshtein distance to find the closest word and use its vector instead
- Happy to do this, but will leave it for now. I would use the Levenshtein library to calculate the distances (see the sketch below)
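
The fallback would be a few lines on top of gensim's `KeyedVectors`; a sketch, assuming the `Levenshtein` package (the helper name is made up):

```python
import Levenshtein
import numpy as np
from gensim.models import KeyedVectors


def vector_for_word(word: str, vectors: KeyedVectors) -> np.ndarray:
    """Return the word's vector, falling back to the closest vocabulary
    entry by Levenshtein distance when there is no exact match."""
    if word in vectors:
        return vectors[word]
    # A linear scan over the whole vocabulary is fine as a sketch, but a
    # real implementation would want to cache results or prune the search.
    closest = min(vectors.index_to_key, key=lambda w: Levenshtein.distance(word, w))
    return vectors[closest]
```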
Prepare a Dockerfile building an image that hosts the Python app with all dependencies. If you are using a DB for data storage, have a separate image for the database and provide a docker compose file to spawn the app.
- Not done, but I would like to; a sketch follows below
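
A starting point might look like this (an untested sketch, assuming a `pyproject.toml` and `uv.lock` at the repo root; a compose file would then add a database service alongside this image):

```dockerfile
FROM python:3.12-slim

# Bring in the uv binary from the official distroless image.
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

WORKDIR /app

# Install locked dependencies first so this layer is cached across code changes.
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-install-project

# Copy the application itself and install it into the environment.
COPY . .
RUN uv sync --frozen

ENTRYPOINT ["uv", "run", "cli"]
```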
Installation requires uv, which can be installed from https://github.com/astral-sh/uv
To install all the required libraries, run `uv sync` in the project folder. The CLI should then be available.
- Input data goes in the `input_data` folder.
- The processes use the `working_data` folder to store intermediate data.
- The output data is stored in the `output_data` folder.
Get cross similarities for any given model:

```
uv run cli get-cross-similarities-for-phrase <phrase_filename> <model_filename>
```

e.g.

```
uv run cli get-cross-similarities-for-phrase phrases.csv GoogleNews-vectors-negative300.bin.gz
```
Get most similar phrases for any given phrase:

```
uv run cli get-most-similar-phrases-for-phrase <phrase_filename> <model_filename> "<phrase>"
```

e.g.

```
uv run cli get-most-similar-phrases-for-phrase phrases.csv GoogleNews-vectors-negative300.bin.gz "I am a business"
```