docs: Refactor docs
iusztinpaul committed Aug 1, 2024
1 parent e0f268d commit 6621975
Showing 4 changed files with 33 additions and 92 deletions.
22 changes: 22 additions & 0 deletions GENERATE_INSTRUCT_DATASET.md
@@ -0,0 +1,22 @@
# Generate Data for the LLM Fine-Tuning Task Component

## Component Structure

### File Handling
- `file_handler.py`: Manages file I/O operations, enabling reading and writing of JSON formatted data.

### LLM Communication
- `llm_communication.py`: Handles communication with OpenAI's LLMs, sending prompts and processing responses.

### Data Generation
- `generate_data.py`: Orchestrates the generation of training data by integrating file handling, LLM communication, and data formatting.
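
A minimal sketch of how these three pieces could fit together. The function and field names below (`read_json`, `write_json`, the `content`/`instruction` keys) are illustrative assumptions, not the project's actual API; the LLM call is injected as a plain callable so the flow is visible without an API key.

```python
import json
from pathlib import Path
from typing import Callable

def read_json(path: Path) -> list[dict]:
    """file_handler.py role: load raw documents from a JSON file."""
    return json.loads(path.read_text())

def write_json(path: Path, rows: list[dict]) -> None:
    """file_handler.py role: persist the generated instruct dataset."""
    path.write_text(json.dumps(rows, indent=2))

def generate_instruct_dataset(
    in_path: Path, out_path: Path, llm: Callable[[str], str]
) -> None:
    """generate_data.py role: orchestrate read -> prompt the LLM -> write.

    `llm` stands in for llm_communication.py, which in the real project
    sends prompts to OpenAI's models and parses the responses.
    """
    rows = [
        {"instruction": llm(doc["content"]), "content": doc["content"]}
        for doc in read_json(in_path)
    ]
    write_json(out_path, rows)
```

Injecting the LLM as a callable keeps file handling and data formatting testable offline; the real module would pass a wrapper around the OpenAI client instead.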


### Usage

The project includes a `Makefile` for easy management of common tasks. Here are the main commands you can use:

- `make help`: Displays help for each make command.
- `make local-start`: Build and start MongoDB, the message queue (MQ), and Qdrant.
- `make local-test-github`: Insert data into MongoDB.
- `make generate-dataset`: Generate the dataset for fine-tuning and version it in Comet ML.
10 changes: 5 additions & 5 deletions INSTALL_AND_USAGE.md
@@ -50,7 +50,7 @@ Behind the scenes it will build and run all the Docker images defined in the [do
> 127.0.0.1 mongo3
> ```
>
- > From what we know, on `Windows`, it `works out-of-the-box`.
+ > From what we know, on `Windows`, it `works out-of-the-box`. For more details, check out this article: https://medium.com/workleap/the-only-local-mongodb-replica-set-with-docker-compose-guide-youll-ever-need-2f0b74dd8384
> [!WARNING]
> For `arm` users (e.g., `M1/M2/M3 macOS devices`), go to your Docker desktop application and enable `Use Rosetta for x86_64/amd64 emulation on Apple Silicon` from the Settings. There is a checkbox you have to check.
@@ -112,7 +112,7 @@ make local-test-retriever

The last step before fine-tuning is to generate an instruct dataset and track it as an artifact in Comet ML. To do so, run:
```shell
-make local-generate-dataset
+make generate-dataset
```

> Now open [Comet ML](https://www.comet.com/signup/?utm_source=decoding_ml&utm_medium=partner&utm_content=github), go to your workspace, and open the `Artifacts` tab. There, you should find three artifacts as follows:
@@ -123,19 +123,19 @@ make local-generate-dataset

### Step 5: Fine-tuning

- For details on setting up the training pipeline on [Qwak](https://www.qwak.com/lp/end-to-end-mlops/?utm_source=github&utm_medium=referral&utm_campaign=decodingml) and running it, please referr to the [TRAINING]() document.
+ For details on setting up the training pipeline on [Qwak](https://www.qwak.com/lp/end-to-end-mlops/?utm_source=github&utm_medium=referral&utm_campaign=decodingml) and running it, please refer to the [TRAINING](https://github.com/decodingml/llm-twin-course/blob/main/TRAINING.md) document.

### Step 6: Inference

- After you finetuned your model, the first step is to deploy the inference pipeline to Qwak as a REST API service:
+ After you have finetuned your model, the first step is to deploy the inference pipeline to Qwak as a REST API service:
```shell
make deploy-inference-pipeline
```

> [!NOTE]
> You can check out the progress of the deployment on [Qwak](https://www.qwak.com/lp/end-to-end-mlops/?utm_source=github&utm_medium=referral&utm_campaign=decodingml).
- After the deployment is finished (it will take a while) you can call it by calling:
+ After the deployment is finished (it will take a while), you can call it by running:
```shell
make call-inference-pipeline
```
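
Under the hood, calling the deployed REST API service boils down to an HTTP POST against the inference endpoint. A hedged sketch of what such a call could look like, using only the standard library; the URL and the `instruction` payload field are illustrative assumptions, not Qwak's documented schema:

```python
import json
import urllib.request

def build_inference_request(url: str, query: str) -> urllib.request.Request:
    """Build a POST request carrying the user query as JSON.

    The payload shape is a placeholder; the real deployment defines
    its own request schema.
    """
    payload = json.dumps({"instruction": query}).encode()
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it would then be:
#   with urllib.request.urlopen(build_inference_request(url, "What is an LLM twin?")) as resp:
#       answer = json.load(resp)
```

Separating request construction from sending keeps the payload logic unit-testable without a live endpoint.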
2 changes: 1 addition & 1 deletion Makefile
@@ -60,7 +60,7 @@ cloud-test-github: # Send command to the cloud lambda with a Github repository
local-feature-pipeline: # Run the RAG feature pipeline
RUST_BACKTRACE=full poetry run python -m bytewax.run 3-feature-pipeline/main.py

-local-generate-dataset: # Generate dataset for finetuning and version it in Comet ML
+generate-dataset: # Generate dataset for finetuning and version it in Comet ML
docker exec -it llm-twin-bytewax python -m finetuning.generate_data

# ------ RAG ------
91 changes: 5 additions & 86 deletions course/module-3/README.md → RAG.md
@@ -1,8 +1,3 @@
# Introduction
This module is composed from 2 components:
- RAG component
- Finetuning dataset preparation component

# RAG component
A production RAG system is split into 3 main components:

@@ -48,8 +43,6 @@ To prepare your environment for these components, follow these steps:
- `poetry init`
- `poetry install`



## Docker Settings
### Host Configuration
To ensure that your Docker containers can communicate with each other, you need to update your `/etc/hosts` file.
@@ -62,7 +55,7 @@ Add the following entries to map the hostnames to your local machine:
127.0.0.1 mongo3
```

- For the Windows users check this article: https://medium.com/workleap/the-only-local-mongodb-replica-set-with-docker-compose-guide-youll-ever-need-2f0b74dd8384
+ For Windows users check this article: https://medium.com/workleap/the-only-local-mongodb-replica-set-with-docker-compose-guide-youll-ever-need-2f0b74dd8384

# CometML Integration

@@ -98,8 +91,6 @@ To access and set up the necessary CometML variables for your project, follow th
4. **Set Environment Variables**:
- Add the obtained `COMET_API_KEY` to your environment variables, along with the `COMET_PROJECT` and `COMET_WORKSPACE` names you have set up.
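
A small sketch of loading these settings at startup and failing fast when one is missing. The variable names follow the document; the helper itself is illustrative, not part of the project:

```python
import os

# The three Comet ML settings the steps above describe.
REQUIRED_VARS = ("COMET_API_KEY", "COMET_PROJECT", "COMET_WORKSPACE")

def load_comet_settings() -> dict:
    """Read the Comet ML settings from the environment.

    Raising early on missing variables gives a clearer error than a
    failed API call deep inside the pipeline.
    """
    missing = [v for v in REQUIRED_VARS if not os.environ.get(v)]
    if missing:
        raise RuntimeError(f"Missing Comet ML settings: {missing}")
    return {v: os.environ[v] for v in REQUIRED_VARS}
```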



# Qdrant Integration

## Overview
@@ -188,8 +179,8 @@ The `insert_data_mongo.py` script is designed to manage the automated downloadin

# RAG Component

-## RAG Module Structure
-
+# RAG Module Structure
### Query Expansion
- `query_expansion.py`: Handles the expansion of a given query into multiple variations using language model-based templates. It integrates the `ChatOpenAI` class from `langchain_openai` and a custom `QueryExpansionTemplate` to generate expanded queries suitable for further processing.
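
An illustrative sketch of the query-expansion flow. The real module uses `ChatOpenAI` from `langchain_openai` with a custom `QueryExpansionTemplate`; here the template wording is an assumption and a stub LLM stands in for the model so the parsing logic is visible without an API key:

```python
from typing import Callable

# Assumed prompt wording; the project's QueryExpansionTemplate differs.
EXPANSION_TEMPLATE = (
    "Generate {n} alternative phrasings of the following question, "
    "one per line, to improve vector-search recall:\n{question}"
)

def expand_query(question: str, llm: Callable[[str], str], n: int = 3) -> list[str]:
    """Ask the LLM for `n` variations and parse one query per line."""
    prompt = EXPANSION_TEMPLATE.format(n=n, question=question)
    variations = [
        line.strip() for line in llm(prompt).splitlines() if line.strip()
    ]
    # Always retrieve with the original question as well.
    return [question] + variations

def stub_llm(prompt: str) -> str:
    # Stand-in for a ChatOpenAI call; returns canned variations.
    return "What is an LLM twin?\nHow does an LLM twin work?"
```

Each expanded query is then embedded and searched independently, which increases the chance that at least one phrasing lands near the relevant chunks in the vector store.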

@@ -216,78 +207,6 @@ The workflow is straightforward:
The project includes a `Makefile` for easy management of common tasks. Here are the main commands you can use:

- `make help`: Displays help for each make command.
- `make local-start-infra`: Build and start mongodb, mq and qdrant.
- `make local-start-cdc`: Start cdc system
- `make insert-data-mongo`: Insert data to mongodb
- `make local-bytewax`: Run bytewax pipeline and send data to Qdrant
- `make local-test-retriever`: Test RAG retrieval


# Generate Data for LLM finetuning task component

# Component Structure


### File Handling
- `file_handler.py`: Manages file I/O operations, enabling reading and writing of JSON formatted data.

### LLM Communication
- `llm_communication.py`: Handles communication with OpenAI's LLMs, sending prompts and processing responses.

### Data Generation
- `generate_data.py`: Orchestrates the generation of training data by integrating file handling, LLM communication, and data formatting.


### Usage

The project includes a `Makefile` for easy management of common tasks. Here are the main commands you can use:

- `make help`: Displays help for each make command.
- `make local-start-infra`: Build and start mongodb, mq and qdrant.
- `make local-start-cdc`: Start cdc system
- `make insert-data-mongo`: Insert data to mongodb
- `make local-bytewax`: Run bytewax pipeline and send data to Qdrant
- `make generate-dataset`: Generate dataset for finetuning and version it in CometML



# Meet your teachers!
The course is created under the [Decoding ML](https://decodingml.substack.com/) umbrella by:

<table>
<tr>
<td><a href="https://github.com/iusztinpaul" target="_blank"><img src="https://github.com/iusztinpaul.png" width="100" style="border-radius:50%;"/></a></td>
<td>
<strong>Paul Iusztin</strong><br />
<i>Senior ML & MLOps Engineer</i>
</td>
</tr>
<tr>
<td><a href="https://github.com/alexandruvesa" target="_blank"><img src="https://github.com/alexandruvesa.png" width="100" style="border-radius:50%;"/></a></td>
<td>
<strong>Alexandru Vesa</strong><br />
<i>Senior AI Engineer</i>
</td>
</tr>
<tr>
<td><a href="https://github.com/Joywalker" target="_blank"><img src="https://github.com/Joywalker.png" width="100" style="border-radius:50%;"/></a></td>
<td>
<strong>Răzvanț Alexandru</strong><br />
<i>Senior ML Engineer</i>
</td>
</tr>
</table>

# License

This course is an open-source project released under the MIT license. Thus, as long you distribute our LICENSE and acknowledge our work, you can safely clone or fork this project and use it as a source of inspiration for whatever you want (e.g., university projects, college degree projects, personal projects, etc.).

# 🏆 Contribution

A big "Thank you 🙏" to all our contributors! This course is possible only because of their efforts.

<p align="center">
<a href="https://github.com/decodingml/llm-twin-course/graphs/contributors">
<img src="https://contrib.rocks/image?repo=decodingml/llm-twin-course" />
</a>
</p>
- `make local-start`: Build and start MongoDB, the message queue (MQ), and Qdrant.
- `make local-test-github`: Insert data into MongoDB.
- `make local-test-retriever`: Test RAG retrieval.
