
FastAPI - Llama Service Implementation

Using FastAPI to create a Llama service that can be used anywhere to talk with the model.

(Docker usage will be documented here soon.)

Installation

pip install -r requirements.txt

Usage

cd app
python main.py

When you run it for the first time, the init_modal function downloads the Llama model from Hugging Face, so the first start will take some time.
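For orientation, a cached first-run download with the huggingface_hub library might look like the sketch below. This is an illustration only, not the repo's actual init_modal code; the repo id and filename are hypothetical placeholders.

from huggingface_hub import hf_hub_download

def init_modal() -> str:
    # Sketch only: hf_hub_download caches the file locally, so only the
    # first call hits the network; later runs reuse the cached copy.
    return hf_hub_download(
        repo_id="TheBloke/Llama-2-7B-Chat-GGUF",  # hypothetical repo id
        filename="llama-2-7b-chat.Q4_K_M.gguf",   # hypothetical filename
    )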

Example of asking a simple question:

curl -X POST \
  'http://localhost:8000/question' \
  -H 'Content-Type: application/json' \
  -d '{"q": "What is the capital of France?"}'

Response:

{ "answer": " The capital of France is Paris." }

Using Chat Completion (with message history)

There are two roles for messages: user and system. The user role is for the messages that the user sends, and the system role is for the messages that the Llama model sends.

If you send them in order, the model will understand the context and give you better answers.
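The request schema this implies might be modeled roughly as in the sketch below; the names are assumptions, not necessarily those used in app/main.py.

from typing import List, Literal
from pydantic import BaseModel

class Message(BaseModel):
    role: Literal["user", "system"]  # the two roles described above
    content: str

class ChatRequest(BaseModel):
    messages: List[Message]  # ordered history; the order carries the context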

Example Usage:

curl -X POST \
  'http://localhost:8000/chat' \
  -H 'Content-Type: application/json' \
  -d '{
	"messages": [
      {
        "role": "user",
        "content": "What is capital of Turkey ?"
      },
      {
        "role": "system",
        "content": "The capital of Turkey is Ankara"
      },
      {
        "role": "user",
        "content": "How about Spain ?"
      }
    ]
  }'

Response:

{
  "answer": "The capital of Spain is Madrid."
}
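The same request can be made from Python with the requests library. This is just the curl example above translated, assuming the server is running locally on port 8000:

import requests

history = [
    {"role": "user", "content": "What is capital of Turkey ?"},
    {"role": "system", "content": "The capital of Turkey is Ankara"},
    {"role": "user", "content": "How about Spain ?"},
]
resp = requests.post("http://localhost:8000/chat", json={"messages": history})
print(resp.json()["answer"])  # e.g. "The capital of Spain is Madrid."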

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
