Describe the bug
The requests produced by the framework and sent to a vLLM server are too long for it to process, and vLLM returns an exception. The request handled in this line: truncate_prompt_tokens = getattr(request, "truncate_prompt_tokens", None) does not contain a truncate_prompt_tokens attribute when it should. vLLM fails with:
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker_async.VllmAsyncGenerationWorker/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_engine.py", line 1011, in _tokenize_prompt_inputs_async
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] yield await self._normalize_prompt_text_to_input(
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker_async.VllmAsyncGenerationWorker/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_engine.py", line 881, in _normalize_prompt_text_to_input
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] return self._validate_input(request, input_ids, input_text)
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker_async.VllmAsyncGenerationWorker/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_engine.py", line 962, in _validate_input
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] raise ValueError(
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] ValueError: This model's maximum context length is 8192 tokens. However, your request has 8939 input tokens. Please reduce the length of the input messages.
(VllmAsyncGenerationWorker pid=4180772) INFO: 10.65.27.217:55482 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
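For context, the getattr line referenced above falls back to None when the incoming request carries no truncate_prompt_tokens field, so no truncation is applied and the over-long prompt reaches the context-length check. Below is a simplified, illustrative sketch of that control flow; it is not the actual vLLM implementation, and the function name and the 8192 default are placeholders taken from the log above:

```python
# Illustrative sketch only, not vLLM's real code.
def normalize_prompt(request, input_ids, max_model_len=8192):
    # Falls back to None when the client did not send the field.
    truncate_prompt_tokens = getattr(request, "truncate_prompt_tokens", None)

    if truncate_prompt_tokens is not None:
        # Keep only the most recent tokens so the prompt fits.
        input_ids = input_ids[-truncate_prompt_tokens:]

    if len(input_ids) > max_model_len:
        # This is the branch that produces the 400 Bad Request in the log.
        raise ValueError(
            f"This model's maximum context length is {max_model_len} tokens. "
            f"However, your request has {len(input_ids)} input tokens."
        )
    return input_ids
```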
Steps/Code to reproduce bug
Follow the steps from the setup guide and the single-node guide.
Expected behavior
The client should be aware of the limitations of the vLLM server and send the appropriate truncation instruction in its requests, as sketched below.
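One possible client-side workaround, assuming the requests go through vLLM's OpenAI-compatible /v1/chat/completions endpoint: pass truncate_prompt_tokens as an extra body parameter so vLLM truncates the prompt instead of rejecting it. The base URL, model name, and the 8192 limit below are placeholders, not values from the framework's config:

```python
from openai import OpenAI

# Placeholder endpoint and credentials.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "very long prompt ..."}],
    # vLLM-specific extra parameter: keep at most this many prompt tokens.
    extra_body={"truncate_prompt_tokens": 8192},
)
print(response.choices[0].message.content)
```

A fix in the framework itself would presumably set this field on the requests it generates, based on the serving engine's max_model_len, rather than relying on each caller to do so.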
Additional context