Usage stats collection
vLLM supports streaming mode, enabled by setting "stream": True in the request. Today it sends one output token per stream chunk, as JSON like the following:
{"id":"cmpl-02acf58969a747e3ae312f53f38069e6","created":1734721204,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":"\n","logprobs":null,"finish_reason":null,"stop_reason":null}]}
To enable usage stats, the request must also pass the "stream_options": {"include_usage": True} parameter. The usage stats are then populated only in the last JSON chunk and are null in all others.
To report request latency and per-output-token latency metrics, we need to know the end timestamp of the streaming response and the completion token count. In vLLM, when streaming is enabled, the last data chunk is the special string [DONE], and the second-to-last chunk carries the non-null usage stats (when include_usage is set). We can use this to detect the end of the stream.
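The end-of-stream detection described above could look roughly like the sketch below. It is an illustration only, not the EPP implementation: StreamSummary, summarize_stream and the injectable clock parameter are hypothetical names, and the input is assumed to be already-decoded event-stream lines in the "data: ..." form shown in the appendix output.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamSummary:
    """What a metrics collector needs from one streamed response (hypothetical type)."""
    done: bool = False                 # True once "data: [DONE]" has been seen
    end_time: Optional[float] = None   # clock reading taken when [DONE] arrived
    usage: Optional[dict] = None       # usage object from the last JSON chunk, if any


def summarize_stream(lines: Iterable[str],
                     clock: Callable[[], float] = time.monotonic) -> StreamSummary:
    """Scan decoded event-stream lines for the usage chunk and the [DONE] marker."""
    summary = StreamSummary()
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            summary.done = True
            summary.end_time = clock()
            break
        chunk = json.loads(payload)
        if chunk.get("usage") is not None:
            summary.usage = chunk["usage"]
    return summary
```

If done is still False after the lines are exhausted, the stream ended without [DONE], which a caller would probably want to count as an error rather than a completed request.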
Open question
vLLM only returns usage stats in stream mode if "stream_options": {"include_usage": True} is set in the request. Should we inject this option when metric collection is enabled?
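If the answer is yes, the injection itself could be as small as the sketch below. This is only one possible shape: with_usage_stats and metrics_enabled are hypothetical names, and the sketch assumes the request body has already been parsed into a dict. It forces include_usage on even if the client set it to False, which is itself a design decision that would need agreement.

```python
import copy


def with_usage_stats(body: dict, metrics_enabled: bool) -> dict:
    """Return a request body that asks vLLM for usage stats on streamed responses.

    The input is returned unchanged when metrics are off or the request is not
    streaming; otherwise a patched deep copy is returned.
    """
    if not metrics_enabled or not body.get("stream"):
        return body
    patched = copy.deepcopy(body)
    options = patched.get("stream_options") or {}
    options["include_usage"] = True  # assumption: metrics take priority over the client's setting
    patched["stream_options"] = options
    return patched
```

One consequence to weigh: with the option injected, the client receives an extra usage chunk it did not ask for, unless that chunk is stripped again on the way back.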
Error handling
Errors in streaming need careful handling. Specifically, the EPP should correctly capture the following error types so that metrics are reported correctly:
Network Errors: Connection issues, timeouts, and other network problems can disrupt the stream.
Model Server Errors: The server might encounter issues processing the request or generating the stream. This can be handled by looking at the normal HTTP error codes.
Client Errors: Problems on the client-side, such as decoding errors or timeouts.
Content Errors: Issues with the streamed content itself, like corruption or unexpected formats.
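For metric labels, the four categories above might be mapped from what the proxy observes, as in the rough sketch below. The mapping is an assumption made for illustration, not something this issue specifies: StreamError and classify_failure are hypothetical names, 5xx statuses are treated as model server errors and 4xx as client errors, and chunk decoding failures are treated as content errors.

```python
import json
from enum import Enum
from typing import Optional


class StreamError(Enum):
    NETWORK = "network"
    MODEL_SERVER = "model_server"
    CLIENT = "client"
    CONTENT = "content"


def classify_failure(status_code: Optional[int],
                     exc: Optional[BaseException]) -> Optional[StreamError]:
    """Map an HTTP status and/or a raised exception to one error category."""
    # A chunk that cannot be decoded or parsed is a problem with the content itself.
    if isinstance(exc, (json.JSONDecodeError, UnicodeDecodeError)):
        return StreamError.CONTENT
    # Built-in connection and timeout exceptions stand in for transport failures.
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return StreamError.NETWORK
    if status_code is not None and status_code >= 500:
        return StreamError.MODEL_SERVER
    if status_code is not None and status_code >= 400:
        return StreamError.CLIENT
    return None  # no failure detected
```

An HTTP client library typically raises its own exception types, so a real implementation would need to translate those into the categories first.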
Appendix
I used the following code snippet to stream the response and print the chunks:
import requests
import time


def non_stream(prompt, api_url):
    """Sends a non-streaming completion request and prints the raw response body."""
    # Non-streaming responses always carry usage, so stream_options is not needed here.
    payload = {
        "model": "meta-llama/Llama-2-7b-hf",
        "max_tokens": 100,
        "prompt": prompt,
        "temperature": 0,
        "stream": False,
    }
    response = requests.post(api_url, json=payload, stream=False)
    response.raise_for_status()
    print(response.text)


def stream_vllm_response(prompt, api_url="http://localhost:8000/v1/completions"):
    """Streams the response from a vLLM server.

    Args:
        prompt: The prompt to send to the server.
        api_url: The URL of the vLLM completions endpoint.

    Yields:
        Decoded lines of the event stream.
    """
    payload = {
        "model": "meta-llama/Llama-2-7b-hf",
        "max_tokens": 5,
        "prompt": prompt,
        "temperature": 0,
        "stream": True,
        # Needed for the final usage chunk that appears in the example output.
        "stream_options": {"include_usage": True},
    }
    response = requests.post(api_url, json=payload, stream=True)
    response.raise_for_status()
    print("Initial HTTP Headers:")
    for header, value in response.headers.items():
        print(f"{header}: {value}")
    for chunk in response.iter_lines():
        if chunk:  # skip the blank lines that separate events
            decoded_chunk = chunk.decode("utf-8")
            yield decoded_chunk


# Example usage:
# api_url = "http://localhost:8000/v1/completions"  # Replace with your vLLM server URL
api_url = "http://35.239.44.127:8081/v1/completions"
prompt = "Tell me the history of the US"
start = time.time()
# non_stream() prints the body itself and returns None, so this line also prints "None".
print(non_stream(prompt, api_url))
print("===Streaming")
for chunk in stream_vllm_response(prompt, api_url):
    print(chunk, end="", flush=True)
    # Printed after every chunk, which is why "Elapsed" follows each line of the output.
    print(f"Elapsed {time.time() - start} seconds \n")
Example output of the code snippet:
python3 stream.py
/Users/conliu/projects/.venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
warnings.warn(
{"id":"cmpl-356791d989ac476797a39076f866da1a","object":"text_completion","created":1734721203,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":".\nThe United States of America is a country in North America. It is the third largest country in the world. It is the fourth most populous country in the world. It is the most powerful country in the world. It is the most prosperous country in the world. It is the most technologically advanced country in the world. It is the most influential country in the world. It is the most democratic country in the world. It is the most generous country","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":108,"completion_tokens":100,"active_lora_adapters":{},"registered_lora_adapters":{},"pending_queue_size":0}}
None
===Streaming
Initial HTTP Headers:
date: Fri, 03 Jan 2025 18:10:23 GMT
server: uvicorn
content-type: text/event-stream; charset=utf-8
x-went-into-resp-headers: true
transfer-encoding: chunked
data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":".","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}Elapsed 0.24161005020141602 seconds
data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":"\n","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}Elapsed 0.2615821361541748 seconds
data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":"The","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}Elapsed 0.27447509765625 seconds
data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":" United","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}Elapsed 0.30302906036376953 seconds
data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[{"index":0,"text":" States","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":null}Elapsed 0.3072359561920166 seconds
data: {"id":"cmpl-36294ed39f844e5e951bb0bccad780a1","created":1735927823,"model":"meta-llama/Llama-2-7b-hf","choices":[],"usage":{"prompt_tokens":8,"total_tokens":13,"completion_tokens":5}}Elapsed 0.3072531223297119 seconds
data: [DONE]Elapsed 0.30726003646850586 seconds
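As a rough worked example, the per-output-token latency can be estimated from the elapsed values printed above. The sketch below takes the five token chunks (matching completion_tokens: 5) and averages the gaps between consecutive chunks. The function name is hypothetical, and the timestamps are client-side readings that include network and buffering effects, so the result is only indicative.

```python
# Elapsed times (seconds) copied from the five token chunks in the output above.
token_times = [
    0.24161005020141602,
    0.2615821361541748,
    0.27447509765625,
    0.30302906036376953,
    0.3072359561920166,
]


def mean_inter_token_latency(times):
    """Average gap between consecutive token arrival times, in seconds."""
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


print(f"{mean_inter_token_latency(token_times) * 1000:.1f} ms per output token")
# prints "16.4 ms per output token"
```

Because only differences between chunk times are used, the estimate does not depend on when the start timestamp was taken.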