You can run the code in many ways, but it depends on AWS for DynamoDB access and for Secrets Manager (which holds the OpenAI API key).
I did the infrastructure part "somewhat manually", so it will take some manual work. Apologies :')
First, install Poetry. I used Python 3.10, installed through pyenv (it's great). Then run:
poetry install
poetry shell
python scripts/dynamodb_setup.py
For this part I didn't create a docker environment.
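For reference, a table-creation script with boto3 looks roughly like the sketch below. This is a hedged illustration, not the actual contents of `scripts/dynamodb_setup.py`; the table name and key are made up.

```python
# Hypothetical DynamoDB setup sketch (table and key names are assumptions,
# not necessarily what scripts/dynamodb_setup.py creates).
import boto3

def create_table(table_name: str = "llm_api_requests") -> None:
    dynamodb = boto3.client("dynamodb")
    dynamodb.create_table(
        TableName=table_name,
        KeySchema=[{"AttributeName": "request_id", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "request_id", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",  # on-demand billing, no capacity planning
    )
    dynamodb.get_waiter("table_exists").wait(TableName=table_name)

if __name__ == "__main__":
    create_table()
```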
I decided to go with OpenAI. I will discuss this choice later and present alternatives; there is much to discuss :) The key you need to set in Secrets Manager is the following:
openai_api_key
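To give an idea of how this key can be read at runtime, here is a minimal boto3 sketch. Only the secret name comes from above; the rest is illustrative, and depending on how you store the secret, `SecretString` may be a plain string or a JSON blob.

```python
# Minimal sketch: read the OpenAI key from AWS Secrets Manager.
import boto3

def get_openai_api_key(secret_name: str = "openai_api_key") -> str:
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    # If the secret is stored as key/value pairs, parse the JSON here instead.
    return response["SecretString"]
```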
- Build the image.
- Run the image. Since it depends on AWS, you will need to provide AWS credentials with access to Secrets Manager and DynamoDB. You can set this up with IAM if needed.
In this repo there is a YAML file that can be used to deploy to Kubernetes (`llm_api.yaml`). There is no real reason for using Kubernetes here, though; I did it more for myself (practice).
I used EKS for the cluster. You need to make sure that the nodegroup you select has an IAM role with access to DynamoDB and Secrets Manager; I did this manually through IAM in the AWS Console. Starting an EKS cluster takes around 25-35 minutes; the rest is more or less instant.
To create an EKS cluster:
- Create the cluster. This will also create the other necessary infrastructure (VPC, IAM, ...): `eksctl create cluster --name "INSERT_CLUSTER_NAME" --region "INSERT_REGION"`
- Update your kubeconfig: `aws eks update-kubeconfig --region "INSERT_REGION" --name "INSERT_CLUSTER_NAME"`
- Make sure you have the AWS Load Balancer Controller (ALB controller) running on your cluster. If it is not properly installed, the ingress will not work. Read the prerequisites: https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress.html
You now have the cluster set up. Next, you need to add IAM policies to the nodegroup.
- Find the IAM role for your nodegroup. Add a read policy for DynamoDB and Secrets Manager access (a boto3 sketch of this follows the list).
- You are good to go!
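If you would rather script the policy attachment than click through the console, a boto3 sketch could look like the following. The role name is a placeholder, and the managed policies are broader than strictly necessary.

```python
# Sketch: attach AWS managed policies to the nodegroup's IAM role.
# ROLE_NAME is a placeholder; find the real one under your EKS nodegroup.
import boto3

ROLE_NAME = "INSERT_NODEGROUP_ROLE_NAME"
POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonDynamoDBReadOnlyAccess",
    "arn:aws:iam::aws:policy/SecretsManagerReadWrite",
]

iam = boto3.client("iam")
for arn in POLICY_ARNS:
    iam.attach_role_policy(RoleName=ROLE_NAME, PolicyArn=arn)
```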
Deploy the manifest with `kubectl apply -f llm-api.yaml`.
- Run `kubectl get ingress -n llm-api` to get the address.
- Go to your browser, enter the address, and add `/docs` to the end :)
Troubleshooting:
- If the address field is empty for the ingress (this happened to me, by the way), it is likely that the ALB controller is not running properly. The command `kubectl get pods -n kube-system` should show the ALB controller pods.
- Installing the ALB controller also meant setting up an IAM OIDC provider on the EKS cluster. This will come up during the installation, and the command to run will be shown.
To build the image and push it to ECR:
aws ecr get-login-password --region REGION | docker login --username AWS --password-stdin LINK_TO_YOUR_ECR
docker build -t lw_task .
docker tag lw_task:latest LINK_TO_YOUR_ECR/lw_task:latest
docker push LINK_TO_YOUR_ECR/lw_task:latest
Part of this problem is figuring out a good use case. The use case I came up with for the MVP is a company that wants to do advertising, where the context provided to the LLM is the brand image together with the products and their specifications. For synthetic data generation, I used GPT-4o and guided it until I was happy with the result.
This data can be found in `shoby_brand_info.txt`.
The components I used:
- Promptfoo for model evaluation.
- FastAPI for the REST API (a hypothetical endpoint sketch is shown after this list).
- To build fast, I went with API access to the LLM via OpenAI (I suspect this breaks the data-locality requirement).
- For the database, I decided to go with DynamoDB.
- Deployment: the API is dockerized, and you can serve it with Kubernetes if you want :)
- Poetry for package management.
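To make the component list concrete, here is a stripped-down, hypothetical sketch of how the pieces fit together. The endpoint path, table name, request fields, and model choice are not the repo's actual ones.

```python
# Hypothetical wiring sketch: FastAPI + Secrets Manager + OpenAI + DynamoDB.
# Endpoint, table name, and request fields are illustrative, not the repo's actual code.
import uuid

import boto3
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("llm_api_requests")  # assumed table name
secrets = boto3.client("secretsmanager")
client = OpenAI(api_key=secrets.get_secret_value(SecretId="openai_api_key")["SecretString"])

class AdRequest(BaseModel):
    brand_context: str
    instruction: str

@app.post("/generate")
def generate(req: AdRequest) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": req.brand_context},
            {"role": "user", "content": req.instruction},
        ],
    )
    text = completion.choices[0].message.content
    request_id = str(uuid.uuid4())
    table.put_item(Item={"request_id": request_id, "output": text})  # simple audit trail
    return {"request_id": request_id, "text": text}
```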
For the evaluation, I used Promptfoo, a JavaScript library. It supports many kinds of evaluation and also includes a dashboard to view the results :) Being able to view and compare outputs from different models or prompts side by side makes it easier to choose a model or improve the prompts.
Information on how to install Promptfoo can be found here: https://github.com/promptfoo/promptfoo. To run Promptfoo you need to set the following environment variables:
OPENAI_API_KEY=YOUR_KEY
PROMPTFOO_PROMPT_SEPARATOR=ANYTHING_RANDOM_AS_LKJHIUGHUIG
The `PROMPTFOO_PROMPT_SEPARATOR` needs to be modified because of the evaluation prompt I am using.
After setting the environment variables we can run:
promptfoo eval --no-cache
promptfoo view # Will open the dashboard
I kept the evaluation simple, and you will only find OpenAI models (please refer to the FAQ at the end). I like the idea of having powerful LLMs judge the output, and I believe it would be possible to take this much further than I did. It would have been interesting to compare with models such as Llama 3.1 at various sizes (I had some issues with AWS Bedrock, but could have used either Modal, see later, or Ollama).
I ran a manual evaluation using Llama 3 8B and Llama 3 70B. 70B was good, while 8B performed similarly to GPT-3.5 (not good enough for the use case I created).
I also believe that incorporating a human-generated dataset could be interesting:
- It could be used to improve the LLM judge (I suspect).
- It enables similarity-based metrics: compare embeddings of the ground truth to embeddings of the generated text (a small sketch follows this list).
- BLEU scores and more: https://shubham-shinde.github.io/blogs/llms-metrics/#automatic-vs-human-evaluation
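As an illustration of the similarity-based idea above, a minimal sketch using OpenAI embeddings (any embedding model would do; the model name is an arbitrary choice):

```python
# Sketch: compare a generated text against a human-written reference via embeddings.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_score(reference: str, generated: str) -> float:
    response = client.embeddings.create(
        model="text-embedding-3-small",  # illustrative model choice
        input=[reference, generated],
    )
    ref_emb, gen_emb = (np.array(d.embedding) for d in response.data)
    return cosine_similarity(ref_emb, gen_emb)
```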
Given a human-generated dataset, we could also finetune the models. Finetuning can allow us to use a smaller, cheaper model with the performance of a bigger one. Few-shot examples are often included in prompts, which increases the number of tokens; finetuning could remove that need :)
A strong evaluation is important, and the automatic part (the LLM judge) can be reused directly. We may not understand the target languages ourselves, so we could translate using the most powerful model and have a human translator check that the quality is good. For Western languages, multilingual capabilities tend to be strong, but beware of Asian languages or other languages whose tokenization has little overlap with Western languages; this matters especially if you finetune, since the model can easily overfit to the language in these cases.
Can you provide a strategic roadmap with ideas/techniques to uplift the performance of the overall system?
From a model perspective:
- More prompt engineering, and a much more thorough evaluation. I would also use human-generated examples. Improved models are released on a weekly basis, and it should be easy to evaluate new models. Make more examples for different domains or verticals.
- Guardrails: make sure the model is not used in the "wrong" ways; check for prompt injections and for content we do not want the models to generate.
- Finetuning of the LLM, or providing examples.
- RAG-based solution if we have a lot of information on a customer.
ML Ops:
- Mainly cost-related. I go through this part thoroughly later.
There is no training here, but let's say a few words about it anyway.
Finetuning LLMs is mainly done with LoRA or QLoRA. The LoRA part approximates the weight updates of large matrices with two smaller low-rank matrices, and the Q part quantizes the base model to lower precision, reducing memory. These are old concepts applied to LLMs. Memory-reduction techniques and speedups do not always go hand in hand; of the two, LoRA is the faster option, and from what I'm reading it is slightly faster than full finetuning.
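For context, configuring LoRA with the Hugging Face `peft` library looks roughly like this; the hyperparameters and target modules below are typical illustrative values, not something tuned for this project (for QLoRA you would additionally load the base model in 4-bit via bitsandbytes):

```python
# Sketch: wrap a causal LM with LoRA adapters using peft.
# Hyperparameters and target modules are illustrative defaults, not tuned values.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")  # gated, needs HF access

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which linear layers get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small A/B matrices are trainable
```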
Another obvious candidate is simply increasing the FLOPs: get a more powerful GPU, or multiple GPUs.
I will divide speed into latency and throughput.
What impacts latency:
- Size of model.
- Some optimizations on the model (graph optimization, maybe more).
- Quantization, although the speedup depends on whether the serving framework supports it well. E.g. vLLM does not provide a speedup for AWQ, while TGI does.
- Batch size, more on this later.
- GPU
This kind of blew up, but more information is usually good :)
Serving LLMs is not cheap, because the GPUs required are expensive.
The following blog post is very informative, and the vLLM paper is a recommended read; especially the graph showing batch size, RAM, and throughput tells the story: https://blog.vllm.ai/2023/06/20/vllm.html
Batch size does increase throughput massively, but with it, latency also increases. I ran some experiments myself on a 3B model with vLLM (A100 40GB): 1 query took 3 seconds, while 60 queries (at 2k tokens/sec generation) took 8 seconds.
A competitor to vLLM is TGI. In my experience, TGI is easier to use and supports more features such as AWQ, but it is not as fast as vLLM.
Other frameworks include llama.cpp and Ollama. However, these do not support continuous batching (a significant downside). Continuous batching is similar to normal ML batching techniques but in a real-time setting: as requests come in, they are computed immediately, even while the LLM is already processing other requests. In more "traditional" ML serving, the new request would be blocked until the batch being computed had finished. Continuous batching increases the maximum throughput considerably.
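A minimal vLLM sketch of the kind of batched offline generation behind the numbers above; the model name and sampling parameters are illustrative:

```python
# Sketch: offline batched generation with vLLM. The engine batches the prompts
# internally (continuous batching), which is where the throughput comes from.
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-3-mini-4k-instruct")  # placeholder small model
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [f"Write ad copy for product {i}." for i in range(60)]
outputs = llm.generate(prompts, params)  # all 60 prompts are processed together
for out in outputs:
    print(out.outputs[0].text[:80])
```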
This would be my recommendation:
- I'd pick vLLM as my inference engine (if you have at least an A100 40GB available). If you don't have this GPU available, you might need to pick TGI and use a quantized model (AWQ), then choose an A10G 24GB. AWS will not easily provide you with good GPUs :') If this is you, then look at the Modal section and be set free from the claws of AWS.
- Select the number of input tokens, the number of output tokens, and the max latency (10 seconds in your example).
- Time to tune: select a GPU and see how many concurrent requests you can handle while staying below the latency requirement (a small load-test sketch is shown below). Obviously, cost will be a consideration.
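The tuning step above can be as simple as a small async load test. A sketch with `httpx`; the URL and payload are placeholders for whatever your deployment exposes:

```python
# Sketch: measure end-to-end latency at a given concurrency level.
import asyncio
import time

import httpx

URL = "http://YOUR_ENDPOINT/generate"  # placeholder
PAYLOAD = {"brand_context": "...", "instruction": "Write a tagline."}  # placeholder

async def run(concurrency: int) -> None:
    async with httpx.AsyncClient(timeout=60) as client:
        start = time.perf_counter()
        await asyncio.gather(*(client.post(URL, json=PAYLOAD) for _ in range(concurrency)))
        elapsed = time.perf_counter() - start
        print(f"{concurrency} concurrent requests finished in {elapsed:.1f}s")

asyncio.run(run(concurrency=60))
```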
Multiple variants exist. Obviously, the region will be a limiting factor. Some thoughts & providers:
- Together.ai provides many open-source models.
- With Hugging Face Inference Endpoints you can deploy directly to your account, choosing a model from the Hub and also the GPU (they even have A100 GPUs on AWS).
- Amazon Bedrock has many models, but only supports a limited set of LLMs. You can pay per token, or pay for provisioned throughput units if you have higher throughput.
- API Access is the easiest way to ensure a low latency. This is a great website for benchmarking: https://unify.ai/benchmarks
It's a question of throughput, and maybe also of security concerns. For offline work, where throughput is high and latency is less of a concern, self-hosting will be hard to beat on cost. For real-time, low-latency use, you have to make a good case for self-hosting: there is a cost to dealing with the MLOps (or LLMOps), and you need a good amount of throughput to deliver generation at a lower price than API access. Your GPUs need to work for their cost.
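As a back-of-the-envelope illustration of the throughput argument: every number below is a made-up placeholder, so plug in your own GPU price, measured throughput, and API rate.

```python
# Illustrative break-even calculation for self-hosting vs API access.
# Every number here is a placeholder assumption, not a quoted price.
GPU_COST_PER_HOUR = 4.0        # assumed hourly cost of a GPU instance
TOKENS_PER_SECOND = 2000.0     # assumed sustained generation throughput
API_COST_PER_1M_TOKENS = 10.0  # assumed API price per million output tokens

self_hosted_per_1m = GPU_COST_PER_HOUR / (TOKENS_PER_SECOND * 3600 / 1_000_000)
print(f"Self-hosted: ${self_hosted_per_1m:.2f} per 1M tokens (at full utilization)")
print(f"API access:  ${API_COST_PER_1M_TOKENS:.2f} per 1M tokens")
# The catch: the self-hosted number only holds if the GPU is actually kept busy.
```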
API Access will always scale, but if you opt for self-hosting then you need to consider a few things:
- LLMs are big. You can't download them every time you start a pod, or it could take minutes to start one. Two other options I see: bake the LLM into the Docker image (Modal style), or use a shared disk so the pods can load the model.
- Loading parameters from disk takes time, and you also need to start the inference server (engine). In my experience this takes 20 s to 1 min 30 s depending on the size of the model (2B-70B). If you can't start your servers fast enough, maybe you should have API access as a backup to deal with the bursts (a fallback sketch follows this list). If you send a larger and larger number of requests to your LLM deployment without scaling fast enough, first your latency will start to increase, and then the servers could run out of memory and crash (worst case). Hopefully you will let requests wait, or time out, instead of crashing :)
- Worth mentioning: Ollama uses GGUF. You can start these servers in a few seconds, but I would not recommend this setup.
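A rough sketch of the "API access as a backup" idea from the second bullet; the URL, model name, and timeout are placeholder assumptions.

```python
# Sketch: try the self-hosted endpoint first, fall back to OpenAI on timeout or error.
import httpx
from openai import OpenAI

SELF_HOSTED_URL = "http://YOUR_LLM_SERVICE/generate"  # placeholder
openai_client = OpenAI()  # assumes OPENAI_API_KEY is set

def generate_with_fallback(prompt: str) -> str:
    try:
        response = httpx.post(SELF_HOSTED_URL, json={"prompt": prompt}, timeout=10.0)
        response.raise_for_status()
        return response.json()["text"]
    except httpx.HTTPError:
        # Self-hosted deployment is overloaded or down: burst to the API instead.
        completion = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message.content
```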
Yes, it is true: OpenAI does not allow you to choose a region. However, I do provide multiple examples of self-hosting vs API access, which would solve this.
Great question! They provide decently priced GPUs, lightning-fast starts, and a good dashboard. The file can be used to deploy Llama 3.1 8B on their platform. I considered using this to show self-hosting. I am a fan. Read more on their website: https://modal.com/
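For flavour, a stripped-down sketch of what deploying an open model on Modal can look like. This is not the file from the repo; the app name, GPU choice, and model are placeholders, and gated models such as Llama 3.1 would additionally need a Hugging Face token.

```python
# Illustrative Modal sketch (not the repo's deployment file). Assumes `pip install modal`
# and `modal setup` have been run; the model and GPU choice are placeholders.
import modal

image = modal.Image.debian_slim().pip_install("vllm")
app = modal.App("llm-serving-sketch", image=image)

@app.function(gpu="A100", timeout=600)
def generate(prompt: str) -> str:
    from vllm import LLM, SamplingParams  # imported here so it runs inside the remote image

    llm = LLM(model="microsoft/Phi-3-mini-4k-instruct")  # placeholder open model
    params = SamplingParams(max_tokens=128)
    return llm.generate([prompt], params)[0].outputs[0].text

@app.local_entrypoint()
def main():
    print(generate.remote("Write a short tagline for a coffee brand."))
```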