Much improved codebase
- Could refactor to classes and pass settings in as a class variable, but I am happy explicitly passing settings around; I prefer a functional approach like this.
- Overall I am pretty happy with the codebase. Further improvements I would consider:
- Separate the embedding model functionality into a local and a remote model and use dependency injection to switch between them (see the sketch after this list)
- Enable passing in arguments to override default settings
- Better documentation
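
As an illustration of the dependency-injection idea, a minimal sketch might look like the following. All names here (`EmbeddingModel`, `LocalWord2VecModel`, `RemoteEmbeddingModel`, the endpoint payload shape) are hypothetical, not the current code:

```python
from typing import Protocol

import numpy as np


class EmbeddingModel(Protocol):
    """Anything that can turn a list of phrases into vectors."""

    def embed(self, phrases: list[str]) -> np.ndarray: ...


class LocalWord2VecModel:
    """Embeds phrases with a word2vec model loaded from disk via gensim."""

    def __init__(self, model_path: str) -> None:
        from gensim.models import KeyedVectors

        self._vectors = KeyedVectors.load_word2vec_format(model_path, binary=True)

    def embed(self, phrases: list[str]) -> np.ndarray:
        # Average the vectors of the known words in each phrase; a real
        # implementation would need to handle phrases with no known words.
        return np.array(
            [
                np.mean(
                    [self._vectors[w] for w in p.split() if w in self._vectors],
                    axis=0,
                )
                for p in phrases
            ]
        )


class RemoteEmbeddingModel:
    """Embeds phrases by calling an HTTP endpoint (payload shape is made up)."""

    def __init__(self, endpoint: str) -> None:
        self._endpoint = endpoint

    def embed(self, phrases: list[str]) -> np.ndarray:
        import requests

        response = requests.post(self._endpoint, json={"inputs": phrases}, timeout=30)
        response.raise_for_status()
        return np.array(response.json()["embeddings"])


def cross_similarities(phrases: list[str], model: EmbeddingModel) -> np.ndarray:
    # The caller injects whichever implementation it wants; this function
    # depends only on the EmbeddingModel protocol.
    vectors = model.embed(phrases)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T  # cosine similarity matrix
```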
Process the data in chunks and in parallel, either by spinning up multiple processes or by using a distributed processing framework (including frameworks limited to a single node, such as Polars)
- Completed; Polars provides out-of-memory streaming functionality under the covers (see the sketch below)
- Done in a couple of places, which shows what I would do elsewhere
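
For reference, the Polars pattern in question looks roughly like this (a minimal sketch; the file and column names are made up, and older Polars versions spell the last line `lazy.collect(streaming=True)`):

```python
import polars as pl

# scan_csv builds a lazy query plan; nothing is read into memory yet.
lazy = (
    pl.scan_csv("input_data/phrases.csv")
    .with_columns(pl.col("phrase").str.to_lowercase())
    .filter(pl.col("phrase").str.len_chars() > 0)
)

# The streaming engine executes the plan in batches, so datasets larger
# than memory can still be processed on a single node.
df = lazy.collect(engine="streaming")
```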
Structure the code as pipeline step execution. Separate pipeline orchestration from data manipulation. Separate data manipulation from IO. Drive IO via configuration.
- Completed to a large extent (see the sketch below)
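
In simplified form, the separation looks like this (a sketch with made-up step and setting names; the real pipeline has more steps):

```python
import polars as pl


def load_phrases(settings: dict) -> pl.DataFrame:
    # IO lives at the edges and is driven purely by configuration.
    return pl.read_csv(settings["input_path"])


def normalise_phrases(df: pl.DataFrame) -> pl.DataFrame:
    # Pure data manipulation: DataFrame in, DataFrame out, no IO.
    return df.with_columns(pl.col("phrase").str.to_lowercase())


def save_phrases(df: pl.DataFrame, settings: dict) -> None:
    df.write_parquet(settings["working_path"])


def run_pipeline(settings: dict) -> None:
    # Orchestration only: wire the steps together, passing settings
    # around explicitly rather than holding them on a class.
    save_phrases(normalise_phrases(load_phrases(settings)), settings)
```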
Write a custom error-handling decorator that raises a custom exception if the method's code raised an error. Include the previous errors in args. Annotate methods with it instead of having explicit try/except blocks in the methods.
- Happy to do this, but will leave it for now; a sketch follows below
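
If I did implement it, the decorator would look something like this minimal sketch (`PipelineStepError` and `handle_step_errors` are made-up names):

```python
import functools


class PipelineStepError(Exception):
    """Raised when a decorated method fails; wraps the original error."""


def handle_step_errors(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Carry the previous errors' args on the new exception, and
            # chain the original via `from` for a full traceback.
            raise PipelineStepError(f"{func.__name__} failed", *exc.args) from exc

    return wrapper


@handle_step_errors
def load_model(path: str):
    raise FileNotFoundError(path)  # stands in for a real failure
```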
In Step 1, clean duplicates, outliers and stopwords from the phrases. Where an exact match is not found in the word2vec vocabulary, use the Levenshtein distance to find the closest word and use its vector instead
- Happy to do this, but will leave it for now. I would use the Levenshtein library to calculate the distances (see the sketch below)
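
The fallback would be a few lines on top of gensim's `KeyedVectors`; a sketch, assuming the `Levenshtein` package (the helper name is made up):

```python
import Levenshtein
import numpy as np
from gensim.models import KeyedVectors


def vector_for_word(word: str, vectors: KeyedVectors) -> np.ndarray:
    """Return the word's vector, falling back to the closest vocabulary
    entry by Levenshtein distance when there is no exact match."""
    if word in vectors:
        return vectors[word]
    # A linear scan over the whole vocabulary is fine as a sketch, but a
    # real implementation would want to cache results or prune the search.
    closest = min(vectors.index_to_key, key=lambda w: Levenshtein.distance(word, w))
    return vectors[closest]
```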
Prepare a Dockerfile building an image that hosts the Python app with all dependencies. If you are using a DB for data storage, have a separate image for the database and provide a docker compose file to spawn the app.
- Not done, but I would like to; a sketch follows below
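
A starting point might look like this (an untested sketch, assuming a `pyproject.toml` and `uv.lock` at the repo root; a compose file would then add a database service alongside this image):

```dockerfile
FROM python:3.12-slim

# Bring in the uv binary from the official distroless image.
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

WORKDIR /app

# Install locked dependencies first so this layer is cached across code changes.
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-install-project

# Copy the application itself and install it into the environment.
COPY . .
RUN uv sync --frozen

ENTRYPOINT ["uv", "run", "cli"]
```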
Installation requires uv, which can be installed from https://github.com/astral-sh/uv
To install all the required libraries, run `uv sync` in the project folder. The CLI should then be available.
- Input data goes in the `input_data` folder.
- The processes use the `working_data` folder to store intermediate data.
- The output data is stored in the `output_data` folder.
Get cross similarities for any given model:

```
uv run cli get-cross-similarities-for-phrase <phrase_filename> <model_filename>
```

e.g.

```
uv run cli get-cross-similarities-for-phrase phrases.csv GoogleNews-vectors-negative300.bin.gz
```
Get most similar phrases for any given phrase:

```
uv run cli get-most-similar-phrases-for-phrase <phrase_filename> <model_filename> "<phrase>"
```

e.g.

```
uv run cli get-most-similar-phrases-for-phrase phrases.csv GoogleNews-vectors-negative300.bin.gz "I am a business"
```