
Handle response body parsing for both streaming and non-streaming cases #178

Open · liu-cong opened this issue Jan 9, 2025 · 2 comments

liu-cong (Contributor) commented Jan 9, 2025

Usage stats collection

vLLM supports streaming mode, enabled by setting "stream": True in the request. The way it works today is that vLLM sends one output token per stream chunk, as JSON like so: {"id":"cmpl-02acf58969a747e3ae312f53f38069e6","created":1734721204,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":"\n","logprobs":null,"finish_reason":null,"stop_reason":null}]}.
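
For illustration, a chunk like the one above can be parsed by stripping the SSE "data: " prefix and reading choices[0].text. A minimal Python sketch (not the EPP implementation, just the chunk format):

import json

# Minimal sketch: parse one streamed SSE line into the single output token it
# carries. Assumes the line has the form `data: {...}` (or `data: [DONE]`).
def parse_stream_chunk(line: str):
    payload = line.removeprefix("data: ").strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel, no token to extract
    chunk = json.loads(payload)
    choices = chunk.get("choices", [])
    return choices[0]["text"] if choices else ""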

To enable usage stats, we should pass the "stream_options": {"include_usage": True} parameter. The usage stats are then populated only in the final JSON chunk and are null in all earlier chunks.

To report request latency and per-output-token latency metrics, we need to know the end timestamp of a streaming response and the completion token count. In vLLM, when streaming is enabled, the last data chunk is the special string [DONE], while the second-to-last chunk carries the non-nil usage stats. We can use this to determine the end of the stream.
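
A minimal sketch of this end-of-stream and usage detection (not the actual EPP code, which lives in Go; collect_stream_metrics and its inputs are illustrative names only):

import json
import time

# Assumes each `line` is an already-decoded SSE line of the form `data: {...}`.
def collect_stream_metrics(lines, start_time):
    usage = None
    end_time = None
    for line in lines:
        payload = line.removeprefix("data: ").strip()
        if payload == "[DONE]":
            # The [DONE] sentinel marks the end of the stream.
            end_time = time.time()
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            # Only the final JSON chunk (with include_usage set) carries usage.
            usage = chunk["usage"]
    if end_time is None:
        end_time = time.time()  # stream closed without an explicit [DONE]
    completion_tokens = usage["completion_tokens"] if usage else 0
    total_latency = end_time - start_time
    per_token_latency = total_latency / completion_tokens if completion_tokens else None
    return total_latency, per_token_latency, completion_tokens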

Open question

vLLM only returns usage stats in stream mode if "stream_options": {"include_usage": True} is set in the request. Should we inject this option if metric collection is enabled?
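
If we do inject it, the mutation itself is small. A hypothetical sketch of the idea (the helper name and byte-level request handling are illustrative, not existing EPP code):

import json

# Hypothetical helper: if the client requested streaming but did not ask for
# usage stats, add stream_options so the final JSON chunk carries token counts.
def maybe_inject_include_usage(request_body: bytes) -> bytes:
    req = json.loads(request_body)
    if req.get("stream") and "stream_options" not in req:
        req["stream_options"] = {"include_usage": True}
    return json.dumps(req).encode("utf-8")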

Error handling

Errors in streaming need to be handled carefully; specifically, the EPP should correctly capture the following error types, especially so that metrics are reported correctly (a rough sketch of catching these follows the list below):

  • Network Errors: Connection issues, timeouts, and other network problems can disrupt the stream.
  • Model Server Errors: The server might encounter issues processing the request or generating the stream. This can be handled by looking at the normal HTTP error codes.
  • Client Errors: Problems on the client-side, such as decoding errors or timeouts.
  • Content Errors: Issues with the streamed content itself, like corruption or unexpected formats.
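
As a rough illustration of where these error classes surface when a stream is consumed with requests (the structure here is a sketch under those assumptions, not EPP code):

import json
import requests

# Sketch: map the four error classes above onto the exceptions raised while
# consuming a streamed response with `requests`.
def consume_stream(response: requests.Response):
    try:
        response.raise_for_status()  # model server errors (HTTP 4xx/5xx)
        for line in response.iter_lines(decode_unicode=True):
            if not line:
                continue
            payload = line.removeprefix("data: ").strip()
            if payload == "[DONE]":
                return
            try:
                json.loads(payload)  # content errors: corrupt or unexpected chunk
            except json.JSONDecodeError as e:
                print(f"content error: {e}")
    except requests.HTTPError as e:
        print(f"model server error: {e}")
    except requests.Timeout as e:
        print(f"client-side timeout: {e}")
    except requests.ConnectionError as e:
        print(f"network error: {e}")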

Appendix

I used the following code snippet to stream the response and print the chunks:

import requests
import time

def non_stream():
    # Non-streaming request: the full completion and the usage stats arrive
    # in a single JSON response body. Uses the module-level prompt and
    # api_url defined below.
    payload = {
        "model": "meta-llama/Llama-2-7b-hf",
        "max_tokens": 100,
        "prompt": prompt,
        "temperature": 0,
        "stream": False,
        "stream_options": {"include_usage": True},
    }
    response = requests.post(api_url, json=payload, stream=False)
    response.raise_for_status()
    print(response.text)

def stream_vllm_response(prompt, api_url="http://localhost:8000/generate"):
    """Streams the response from a vLLM server.

    Args:
      prompt: The prompt to send to the server.
      api_url: The URL of the vLLM server.

    Yields:
      Chunks of the generated text.
    """
    payload = {
        "model": "meta-llama/Llama-2-7b-hf",
        "max_tokens": 5,
        "prompt": prompt,
        "temperature": 0,
        "stream": True,
    }
    response = requests.post(api_url, json=payload, stream=True)
    response.raise_for_status()

    print("Initial HTTP Headers:")
    for header, value in response.headers.items():
        print(f"{header}: {value}")

    # Each streamed chunk arrives as a separate SSE line ("data: {...}").
    for chunk in response.iter_lines():
        if chunk:
            decoded_chunk = chunk.decode("utf-8")
            yield decoded_chunk

# Example usage:
# api_url = "http://localhost:8000/v1/completions"  # Replace with your vLLM server URL
api_url = "http://35.239.44.127:8081/v1/completions"
prompt = "Tell me the history of the US"
start = time.time()

print(non_stream())

print("===Streaming")
for chunk in stream_vllm_response(prompt, api_url):
    print(chunk, end="", flush=True)
    print(f"Elapsed {time.time() - start} seconds \n")

Example output of the code snippet:

python3 stream.py   
/Users/conliu/projects/.venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
{"id":"cmpl-356791d989ac476797a39076f866da1a","object":"text_completion","created":1734721203,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":".\nThe United States of America is a country in North America. It is the third largest country in the world. It is the fourth most populous country in the world. It is the most powerful country in the world. It is the most prosperous country in the world. It is the most technologically advanced country in the world. It is the most influential country in the world. It is the most democratic country in the world. It is the most generous country","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":108,"completion_tokens":100,"active_lora_adapters":{},"registered_lora_adapters":{},"pending_queue_size":0}}
None
===Streaming
Initial HTTP Headers:
date: Fri, 03 Jan 2025 18:10:23 GMT
server: uvicorn
content-type: text/event-stream; charset=utf-8
x-went-into-resp-headers: true
transfer-encoding: chunked
data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":".","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}Elapsed 0.24161005020141602 seconds 

data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":"\n","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}Elapsed 0.2615821361541748 seconds 

data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":"The","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}Elapsed 0.27447509765625 seconds 

data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":" United","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}Elapsed 0.30302906036376953 seconds 

data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":" States","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":null}Elapsed 0.3072359561920166 seconds 

data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[],"usage":{"prompt_tokens":8,"total_tokens":13,"completion_tokens":5}}Elapsed 0.3072531223297119 seconds 

data: [DONE]Elapsed 0.30726003646850586 seconds 
liu-cong (Contributor, Author) commented:

/assign @courageJ

k8s-ci-robot (Contributor) commented:

@liu-cong: GitHub didn't allow me to assign the following users: courageJ.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @courageJ

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
