
Handle response body parsing for both streaming and non-streaming cases #178

Open · liu-cong opened this issue Jan 9, 2025 · 2 comments

liu-cong (Contributor) commented Jan 9, 2025

Usage stats collection

vLLM supports streaming mode, enabled by setting "stream": True in the request. The way it works today is that vLLM sends one output token per stream chunk, as JSON like so: {"id":"cmpl-02acf58969a747e3ae312f53f38069e6","created":1734721204,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":"\n","logprobs":null,"finish_reason":null,"stop_reason":null}]}.
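
For illustration, a chunk like the one above can be parsed by stripping the SSE "data: " prefix and reading choices[0].text. A minimal Python sketch (not the EPP implementation, just the chunk format):

import json

# Minimal sketch: parse one streamed SSE line into the single output token it
# carries. Assumes the line has the form `data: {...}` (or `data: [DONE]`).
def parse_stream_chunk(line: str):
    payload = line.removeprefix("data: ").strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel, no token to extract
    chunk = json.loads(payload)
    choices = chunk.get("choices", [])
    return choices[0]["text"] if choices else ""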

To enable usage stats, we should pass the "stream_options": {"include_usage": True} parameter. The usage stats are then populated only in the final JSON chunk and are null in all earlier chunks.

To report request latency and per-output-token latency metrics, we need to know the end timestamp of a streaming response and the completion token count. In vLLM, when streaming is enabled, the last data chunk is the special string [DONE], while the second-to-last chunk carries the non-nil usage stats. We can use this to determine the end of the stream.
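
A minimal sketch of this end-of-stream and usage detection (not the actual EPP code, which lives in Go; collect_stream_metrics and its inputs are illustrative names only):

import json
import time

# Assumes each `line` is an already-decoded SSE line of the form `data: {...}`.
def collect_stream_metrics(lines, start_time):
    usage = None
    end_time = None
    for line in lines:
        payload = line.removeprefix("data: ").strip()
        if payload == "[DONE]":
            # The [DONE] sentinel marks the end of the stream.
            end_time = time.time()
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            # Only the final JSON chunk (with include_usage set) carries usage.
            usage = chunk["usage"]
    if end_time is None:
        end_time = time.time()  # stream closed without an explicit [DONE]
    completion_tokens = usage["completion_tokens"] if usage else 0
    total_latency = end_time - start_time
    per_token_latency = total_latency / completion_tokens if completion_tokens else None
    return total_latency, per_token_latency, completion_tokens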

Open question

vLLM only returns usage stats in stream mode if "stream_options": {"include_usage": True} is set in the request. Should we inject this option if metric collection is enabled?
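
If we do inject it, the mutation itself is small. A hypothetical sketch of the idea (the helper name and byte-level request handling are illustrative, not existing EPP code):

import json

# Hypothetical helper: if the client requested streaming but did not ask for
# usage stats, add stream_options so the final JSON chunk carries token counts.
def maybe_inject_include_usage(request_body: bytes) -> bytes:
    req = json.loads(request_body)
    if req.get("stream") and "stream_options" not in req:
        req["stream_options"] = {"include_usage": True}
    return json.dumps(req).encode("utf-8")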

Error handling

Errors in streaming need to be handled carefully; specifically, the EPP should correctly capture the following error types, especially so that metrics are reported correctly (a rough sketch of catching these follows the list below):

  • Network Errors: Connection issues, timeouts, and other network problems can disrupt the stream.
  • Model Server Errors: The server might encounter issues processing the request or generating the stream. This can be handled by looking at the normal HTTP error codes.
  • Client Errors: Problems on the client-side, such as decoding errors or timeouts.
  • Content Errors: Issues with the streamed content itself, like corruption or unexpected formats.
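
As a rough illustration of where these error classes surface when a stream is consumed with requests (the structure here is a sketch under those assumptions, not EPP code):

import json
import requests

# Sketch: map the four error classes above onto the exceptions raised while
# consuming a streamed response with `requests`.
def consume_stream(response: requests.Response):
    try:
        response.raise_for_status()  # model server errors (HTTP 4xx/5xx)
        for line in response.iter_lines(decode_unicode=True):
            if not line:
                continue
            payload = line.removeprefix("data: ").strip()
            if payload == "[DONE]":
                return
            try:
                json.loads(payload)  # content errors: corrupt or unexpected chunk
            except json.JSONDecodeError as e:
                print(f"content error: {e}")
    except requests.HTTPError as e:
        print(f"model server error: {e}")
    except requests.Timeout as e:
        print(f"client-side timeout: {e}")
    except requests.ConnectionError as e:
        print(f"network error: {e}")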

Appendix

I used the following code snippet to stream the response and print the chunks:

import requests
import time

def non_stream():
    # Non-streaming request: the full completion and the usage stats arrive
    # in a single JSON response body. Uses the module-level prompt and
    # api_url defined below.
    payload = {
        "model": "meta-llama/Llama-2-7b-hf",
        "max_tokens": 100,
        "prompt": prompt,
        "temperature": 0,
        "stream": False,
        "stream_options": {"include_usage": True},
    }
    response = requests.post(api_url, json=payload, stream=False)
    response.raise_for_status()
    print(response.text)

def stream_vllm_response(prompt, api_url="http://localhost:8000/generate"):
    """Streams the response from a vLLM server.

    Args:
      prompt: The prompt to send to the server.
      api_url: The URL of the vLLM server.

    Yields:
      Chunks of the generated text.
    """
    payload = {
        "model": "meta-llama/Llama-2-7b-hf",
        "max_tokens": 5,
        "prompt": prompt,
        "temperature": 0,
        "stream": True,
    }
    response = requests.post(api_url, json=payload, stream=True)
    response.raise_for_status()

    print("Initial HTTP Headers:")
    for header, value in response.headers.items():
        print(f"{header}: {value}")

    # Each streamed chunk arrives as a separate SSE line ("data: {...}").
    for chunk in response.iter_lines():
        if chunk:
            decoded_chunk = chunk.decode("utf-8")
            yield decoded_chunk

# Example usage:
# api_url = "http://localhost:8000/v1/completions"  # Replace with your vLLM server URL
api_url = "http://35.239.44.127:8081/v1/completions"
prompt = "Tell me the history of the US"
start = time.time()

print(non_stream())

print("===Streaming")
for chunk in stream_vllm_response(prompt, api_url):
    print(chunk, end="", flush=True)
    print(f"Elapsed {time.time() - start} seconds \n")

Example output of the code snippet:

python3 stream.py   
/Users/conliu/projects/.venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
{"id":"cmpl-356791d989ac476797a39076f866da1a","object":"text_completion","created":1734721203,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":".\nThe United States of America is a country in North America. It is the third largest country in the world. It is the fourth most populous country in the world. It is the most powerful country in the world. It is the most prosperous country in the world. It is the most technologically advanced country in the world. It is the most influential country in the world. It is the most democratic country in the world. It is the most generous country","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":108,"completion_tokens":100,"active_lora_adapters":{},"registered_lora_adapters":{},"pending_queue_size":0}}
None
===Streaming
Initial HTTP Headers:
date: Fri, 03 Jan 2025 18:10:23 GMT
server: uvicorn
content-type: text/event-stream; charset=utf-8
x-went-into-resp-headers: true
transfer-encoding: chunked
data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":".","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}Elapsed 0.24161005020141602 seconds 

data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":"\n","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}Elapsed 0.2615821361541748 seconds 

data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":"The","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}Elapsed 0.27447509765625 seconds 

data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":" United","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}Elapsed 0.30302906036376953 seconds 

data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":" States","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":null}Elapsed 0.3072359561920166 seconds 

data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[],"usage":{"prompt_tokens":8,"total_tokens":13,"completion_tokens":5}}Elapsed 0.3072531223297119 seconds 

data: [DONE]Elapsed 0.30726003646850586 seconds 
liu-cong (Contributor, Author) commented:

/assign @courageJ

k8s-ci-robot (Contributor) commented:

@liu-cong: GitHub didn't allow me to assign the following users: courageJ.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @courageJ

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
