Description
System Info
Runtime environment:
- Kubernetes Cluster deployment
- 4 A100 GPUs with 80 GB VRAM each
- 12 CPUs with 32 GB RAM each
- TGI Version: 3.1.0 (have also tried 3.0.0 and 3.0.1 with the same outcome; see the /info output below).
TGI ENV config:
All default values except the following:
extraInferenceEnvs:
  MAX_BATCH_PREFILL_TOKENS: "4096"
  PREFILL_CHUNKING: "1"
  DTYPE: "bfloat16"
(have tried both float16 and bfloat16, with the same outcome)
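For context, these values surface in the launcher's Args dump below as max_batch_prefill_tokens: Some(4096) and dtype: Some(BFloat16). A roughly equivalent standalone launch would look like the following (a sketch, assuming the standard text-generation-launcher flags; --num-shard 4 makes explicit the 4-way sharding that the logs show being auto-detected, and PREFILL_CHUNKING stays an environment variable):

```
PREFILL_CHUNKING=1 text-generation-launcher \
  --model-id /model_data/llama3-3-70b \
  --num-shard 4 \
  --max-batch-prefill-tokens 4096 \
  --dtype bfloat16
```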
NVIDIA-SMI Output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:17:00.0 Off | 0 |
| N/A 45C P0 91W / 300W | 78489MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe Off | 00000000:31:00.0 Off | 0 |
| N/A 46C P0 90W / 300W | 78489MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100 80GB PCIe Off | 00000000:B1:00.0 Off | 0 |
| N/A 46C P0 92W / 300W | 78489MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100 80GB PCIe Off | 00000000:CA:00.0 Off | 0 |
| N/A 44C P0 85W / 300W | 78489MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 18577 C /opt/conda/bin/python 78480MiB |
| 1 N/A N/A 18576 C /opt/conda/bin/python 78480MiB |
| 2 N/A N/A 18578 C /opt/conda/bin/python 78480MiB |
| 3 N/A N/A 18579 C /opt/conda/bin/python 78480MiB |
+-----------------------------------------------------------------------------------------+
Text-Generation-Launcher --env output:
{"timestamp":"2025-02-19T10:11:43.478948Z","level":"INFO","fields":{"message":"Args {\n model_id: \"/model_data/llama3-3-70b\",\n revision: None,\n validation_workers: 4,\n sharded: None,\n num_shard: None,\n quantize: None,\n speculate: None,\n dtype: Some(\n BFloat16,\n ),\n kv_cache_dtype: None,\n trust_remote_code: false,\n max_concurrent_requests: 128,\n max_best_of: 2,\n max_stop_sequences: 4,\n max_top_n_tokens: 5,\n max_input_tokens: None,\n max_input_length: None,\n max_total_tokens: None,\n waiting_served_ratio: 0.3,\n max_batch_prefill_tokens: Some(\n 4096,\n ),\n max_batch_total_tokens: None,\n max_waiting_tokens: 20,\n max_batch_size: None,\n cuda_graphs: None,\n hostname: \"llama3-3-70b-deploy-inference-6c985b6745-s4fns\",\n port: 80,\n shard_uds_path: \"/tmp/text-generation-server\",\n master_addr: \"localhost\",\n master_port: 29500,\n huggingface_hub_cache: None,\n weights_cache_override: None,\n disable_custom_kernels: false,\n cuda_memory_fraction: 1.0,\n rope_scaling: None,\n rope_factor: None,\n json_output: true,\n otlp_endpoint: None,\n otlp_service_name: \"text-generation-inference.router\",\n cors_allow_origin: [],\n api_key: None,\n watermark_gamma: None,\n watermark_delta: None,\n ngrok: false,\n ngrok_authtoken: None,\n ngrok_edge: None,\n tokenizer_config_path: None,\n disable_grammar_support: false,\n env: true,\n max_client_batch_size: 4,\n lora_adapters: None,\n usage_stats: On,\n payload_limit: 2000000,\n enable_prefill_logprobs: false,\n}"},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:44.898929Z","level":"INFO","fields":{"message":"Using attention flashinfer - Prefix caching true"},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:44.898961Z","level":"INFO","fields":{"message":"Sharding model on 4 processes"},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:44.898966Z","level":"INFO","fields":{"message":"Using default cuda graphs [1, 2, 4, 8, 16, 32]"},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:44.899151Z","level":"INFO","fields":{"message":"Starting check and download process for /model_data/llama3-3-70b"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2025-02-19T10:11:50.740956Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download."},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:51.641141Z","level":"INFO","fields":{"message":"Successfully downloaded weights for /model_data/llama3-3-70b"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2025-02-19T10:11:51.641428Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2025-02-19T10:11:51.641431Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2025-02-19T10:11:51.641484Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"shard-manager"}]}
{"timestamp":"2025-02-19T10:11:51.641493Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank":3,"name":"shard-manager"}]}
Model /info output:
{
"model_id": "/model_data/llama3-3-70b",
"model_sha": null,
"model_pipeline_tag": null,
"max_concurrent_requests": 128,
"max_best_of": 2,
"max_stop_sequences": 4,
"max_input_tokens": 131071,
"max_total_tokens": 131072,
"validation_workers": 4,
"max_client_batch_size": 4,
"router": "text-generation-router",
"version": "3.1.0",
"sha": "463228ebfc444f60fa351da34a2ba158af0fe9d8",
"docker_label": "sha-463228e"
}
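(The above is the response of the TGI router's /info endpoint, retrieved with e.g. `curl -s http://<service-host>/info`; the host is a placeholder for our internal service address.)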
Model: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
We are running ETL jobs that send requests to the TGI model's /generate endpoint with text for the model to summarize (redacted for privacy purposes). These requests are often large in terms of input-token count, and we send them from concurrent threads. We are observing that for some requests the model returns repeated, nonsensical output, and in some cases the model's subsequent responses become garbled as well, even for simple queries. Restarting the TGI container fixes this temporarily, until the ETL runs again, which seems to "corrupt" the server over time. A minimal sketch of our request pattern is included after the details below.
Some detailed info on the requests we are running:
- Request params: temperature=0.1, max_tokens=500 (rest not set, i.e. defaults)
- 27 requests, sent from 15 concurrent threads
- Input token counts: [37051, 591, 593, 522, 490, 840, 458, 3700, 4227, 380, 3144, 4404, 1949, 3812, 2606, 1878, 2132, 1374, 1241, 397, 1364, 864, 1323, 782, 956, 722, 686]
- Output token counts in case of success: [475, 329, 322, 366, 319, 346, 395, 416, 506, 290, 537, 483, 531, 533, 511, 456, 499, 481, 398, 298, 350, 367, 313, 365, 345, 379, 345]
- Example output in case of failure:
was451 company of company and2 and and not was is and and of the was is isiHHHHH andH andi and andH\\'t and and and and and and and and isi andH and H2 HiH of the the\\'t not it HH H andH the the is isH H is not the is isH H andHHHH andH andH and and and \" H and H2 the not onlyH is and and\\'t\\'t not is isH and and and and and is and and and and and and and and and andH is was\\' it is a H and and and and and and and and not not is not is was is is is the is not is and and and and and and and and and and and and and and and and and and and and and is and and and and is is was is and and and and and had had and and and and and and\\'t\\' not is was is and is not not\\' is is was not is is is was is is is and is and and and and and and and not\\'t is is and and and not not need have have and and and and and need need and is is is is not is is is is not was is is and and and and is not is is and and and and and is is and and and and and and and need is is is and and and and it is not not not not is not is is is is is is is is a and and and and is is was is and and and and and is is and and and and and and and is is and and and is is is is is is is is is is and and and and and and and is a and and and and\\'t\\'t\\'t is is and and and and is is is is and is and is is not and and and and and is a is is is is is the is is and and and and and and is is is a and and and is is is a is a and and and is a is and is is a is a is not is a is not is is and and and and and and is and and and and and is not is not is and and and is for is a and is and and and and is is is and is a is is and and and and and is is and and and and and and and and is and and and is the it\\'t it is is is is of is is is is is is is and is and is and is is is is is a and and and and is is the is and is and is is is is is is is is for is is is a is a and is is is is a is is is a is a is and and and and is a and and and and and and and is not is a H and and is not is a is a is a is a is a and and and and and and and is is is is a is is is is and is is is is and H and and and is not is is a and is and is a is is is and is/ is and and and and is is is is is is and and and and is is is is and and and\\'t is is is is is is is is is is is is is a is a is and is and is and and is is and is and and is\\'t is is is is is is is a is creating is is is and is and is not is a is
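A minimal sketch of the request pattern described above (the endpoint URL and document contents are placeholders, since our actual ETL code and data are redacted; the payload field names follow TGI's documented /generate API, where max_new_tokens corresponds to the max_tokens=500 setting above):

```python
# Minimal sketch of the load pattern that triggers the issue.
import requests
from concurrent.futures import ThreadPoolExecutor

TGI_URL = "http://localhost:8080/generate"  # placeholder service address

def summarize(text: str) -> str:
    resp = requests.post(
        TGI_URL,
        json={
            "inputs": text,
            "parameters": {"temperature": 0.1, "max_new_tokens": 500},
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

# 27 redacted documents (one per input-token count listed above),
# sent from 15 concurrent threads.
documents = ["<redacted document text>"] * 27
with ThreadPoolExecutor(max_workers=15) as pool:
    for output in pool.map(summarize, documents):
        print(output[:80])
```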
Expected behavior
Model outputs remain correct (non-gibberish) in all load scenarios, including concurrent high-input-token requests, and do not degrade over time.
Other issues that may be related
I believe this issue could be related, and a similar scenario may apply: #2871