[Gym] The context truncation is not handled gracefully in the example from the tutorial #1935

Description

@jbaczek

Describe the bug
The requests the framework sends to the vLLM server are too long for it to process, and vLLM returns an exception. The request handled in this line: truncate_prompt_tokens = getattr(request, "truncate_prompt_tokens", None) does not carry a truncate_prompt_tokens attribute when it should. vLLM fails with:

(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257]   File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker_async.VllmAsyncGenerationWorker/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_engine.py", line 1011, in _tokenize_prompt_inputs_async
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257]     yield await self._normalize_prompt_text_to_input(                                              
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                              
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257]   File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker_async.VllmAsyncGenerationWorker/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_engine.py", line 881, in _normalize_prompt_text_to_input
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257]     return self._validate_input(request, input_ids, input_text)
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257]   File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker_async.VllmAsyncGenerationWorker/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_engine.py", line 962, in _validate_input
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257]     raise ValueError(                                                                                                                                           
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] ValueError: This model's maximum context length is 8192 tokens. However, your request has 8939 input tokens. Please reduce the length of the input messages.
(VllmAsyncGenerationWorker pid=4180772) INFO:     10.65.27.217:55482 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request 
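
For context, here is a minimal sketch of why the missing attribute matters. The FakeRequest class is a hypothetical stand-in, not vLLM's actual request type: when truncate_prompt_tokens is absent, the getattr lookup quoted above falls back to None, no truncation is applied, and the 8939-token prompt hits the 8192-token validation unchanged.

```python
# Minimal sketch with a hypothetical FakeRequest class (not vLLM's actual request
# type) showing how the getattr fallback behaves when truncate_prompt_tokens is
# missing: no truncation is applied and the overlong prompt hits the length check.
from dataclasses import dataclass

MAX_MODEL_LEN = 8192  # the model's maximum context length from the error above


@dataclass
class FakeRequest:
    prompt_token_ids: list[int]
    # No truncate_prompt_tokens field, mirroring the requests described in this report.


def validate(request: FakeRequest) -> list[int]:
    # The lookup quoted in the description: returns None when the attribute is absent.
    truncate = getattr(request, "truncate_prompt_tokens", None)
    ids = request.prompt_token_ids
    if truncate is not None:
        ids = ids[-truncate:]  # keep the most recent tokens (one plausible truncation policy)
    if len(ids) > MAX_MODEL_LEN:
        raise ValueError(
            f"This model's maximum context length is {MAX_MODEL_LEN} tokens. "
            f"However, your request has {len(ids)} input tokens."
        )
    return ids


validate(FakeRequest(prompt_token_ids=list(range(8939))))  # raises, as in the log above
```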

Steps/Code to reproduce bug

Follow the steps from the setup guide and the single-node guide.

Expected behavior

The client should be aware of the vLLM server's context-length limit and send the appropriate truncation instruction in its requests.
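
One possible shape for that instruction, sketched with the OpenAI Python client against a vLLM OpenAI-compatible endpoint. This assumes the server accepts truncate_prompt_tokens as an extra request field; the base URL, model name, and the 8192 limit are placeholders.

```python
# Hedged sketch: ask the server to truncate the prompt instead of rejecting the
# request with a 400. Assumes the vLLM OpenAI-compatible server accepts
# truncate_prompt_tokens as an extra request field; the base URL, model name,
# and the 8192 limit are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "a very long prompt ..."}],
    extra_body={"truncate_prompt_tokens": 8192},
)
print(response.choices[0].message.content)
```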

Additional context
