Describe the bug
The requests produced by the framework and sent to a vLLM server are too long for it to process, and vLLM returns an exception. The request handled in this line: truncate_prompt_tokens = getattr(request, "truncate_prompt_tokens", None) does not contain a truncate_prompt_tokens attribute when it should. vLLM fails with:
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker_async.VllmAsyncGenerationWorker/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_engine.py", line 1011, in _tokenize_prompt_inputs_async
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] yield await self._normalize_prompt_text_to_input(
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker_async.VllmAsyncGenerationWorker/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_engine.py", line 881, in _normalize_prompt_text_to_input
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] return self._validate_input(request, input_ids, input_text)
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] File "/opt/ray_venvs/nemo_rl.models.generation.vllm.vllm_worker_async.VllmAsyncGenerationWorker/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_engine.py", line 962, in _validate_input
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] raise ValueError(
(VllmAsyncGenerationWorker pid=4180772) ERROR 02-12 08:14:29 [serving_chat.py:257] ValueError: This model's maximum context length is 8192 tokens. However, your request has 8939 input tokens. Please reduce the length of the input messages.
(VllmAsyncGenerationWorker pid=4180772) INFO: 10.65.27.217:55482 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
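For context, the getattr line referenced above falls back to None when the incoming request carries no truncate_prompt_tokens field, so no truncation is applied and the over-long prompt reaches the context-length check. Below is a simplified, illustrative sketch of that control flow; it is not the actual vLLM implementation, and the function name and the 8192 default are placeholders taken from the log above:

```python
# Illustrative sketch only, not vLLM's real code.
def normalize_prompt(request, input_ids, max_model_len=8192):
    # Falls back to None when the client did not send the field.
    truncate_prompt_tokens = getattr(request, "truncate_prompt_tokens", None)

    if truncate_prompt_tokens is not None:
        # Keep only the most recent tokens so the prompt fits.
        input_ids = input_ids[-truncate_prompt_tokens:]

    if len(input_ids) > max_model_len:
        # This is the branch that produces the 400 Bad Request in the log.
        raise ValueError(
            f"This model's maximum context length is {max_model_len} tokens. "
            f"However, your request has {len(input_ids)} input tokens."
        )
    return input_ids
```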
Steps/Code to reproduce bug
Follow the steps from the setup guide and the single-node guide.
Expected behavior
The client should be aware of the limitations of the vLLM server and send the appropriate truncation instruction in its requests, as sketched below.
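One possible client-side workaround, assuming the requests go through vLLM's OpenAI-compatible /v1/chat/completions endpoint: pass truncate_prompt_tokens as an extra body parameter so vLLM truncates the prompt instead of rejecting it. The base URL, model name, and the 8192 limit below are placeholders, not values from the framework's config:

```python
from openai import OpenAI

# Placeholder endpoint and credentials.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "very long prompt ..."}],
    # vLLM-specific extra parameter: keep at most this many prompt tokens.
    extra_body={"truncate_prompt_tokens": 8192},
)
print(response.choices[0].message.content)
```

A fix in the framework itself would presumably set this field on the requests it generates, based on the serving engine's max_model_len, rather than relying on each caller to do so.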
Additional context