
Llama 3.3 70B: Weird, gibberish outputs in production setup #3043

Closed

Description

@andresC98

System Info

Runtime environment:

  • Kubernetes Cluster deployment
  • 4 A100 GPUs with 80 GB VRAM each
  • 12 CPUs with 32 GB RAM each
  • TGI version: 3.1.0 (also tried 3.0.0 and 3.0.1, with the same outcome).

TGI ENV config:

All default values except the following:

extraInferenceEnvs:
  MAX_BATCH_PREFILL_TOKENS: "4096"
  PREFILL_CHUNKING: "1"
  DTYPE: "bfloat16"

(tried with both float16 and bfloat16, with the same outcome)
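
For completeness, a minimal Python sketch of how one could verify from inside the running container that these overrides actually took effect. The variable names and values come from the config above; everything else is an assumption:

import os

# Sanity check (sketch): confirm the Helm env overrides reached the TGI container.
expected = {
    "MAX_BATCH_PREFILL_TOKENS": "4096",
    "PREFILL_CHUNKING": "1",
    "DTYPE": "bfloat16",
}
for name, want in expected.items():
    got = os.environ.get(name)
    print(f"{name}: got={got!r}, expected={want!r}, match={got == want}")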

NVIDIA-SMI Output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:17:00.0 Off |                    0 |
| N/A   45C    P0             91W /  300W |   78489MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:31:00.0 Off |                    0 |
| N/A   46C    P0             90W /  300W |   78489MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          Off |   00000000:B1:00.0 Off |                    0 |
| N/A   46C    P0             92W /  300W |   78489MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          Off |   00000000:CA:00.0 Off |                    0 |
| N/A   44C    P0             85W /  300W |   78489MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     18577      C   /opt/conda/bin/python                       78480MiB |
|    1   N/A  N/A     18576      C   /opt/conda/bin/python                       78480MiB |
|    2   N/A  N/A     18578      C   /opt/conda/bin/python                       78480MiB |
|    3   N/A  N/A     18579      C   /opt/conda/bin/python                       78480MiB |
+-----------------------------------------------------------------------------------------+

Text-Generation-Launcher --env output:

{"timestamp":"2025-02-19T10:11:43.478948Z","level":"INFO","fields":{"message":"Args {\n    model_id: \"/model_data/llama3-3-70b\",\n    revision: None,\n    validation_workers: 4,\n    sharded: None,\n    num_shard: None,\n    quantize: None,\n    speculate: None,\n    dtype: Some(\n        BFloat16,\n    ),\n    kv_cache_dtype: None,\n    trust_remote_code: false,\n    max_concurrent_requests: 128,\n    max_best_of: 2,\n    max_stop_sequences: 4,\n    max_top_n_tokens: 5,\n    max_input_tokens: None,\n    max_input_length: None,\n    max_total_tokens: None,\n    waiting_served_ratio: 0.3,\n    max_batch_prefill_tokens: Some(\n        4096,\n    ),\n    max_batch_total_tokens: None,\n    max_waiting_tokens: 20,\n    max_batch_size: None,\n    cuda_graphs: None,\n    hostname: \"llama3-3-70b-deploy-inference-6c985b6745-s4fns\",\n    port: 80,\n    shard_uds_path: \"/tmp/text-generation-server\",\n    master_addr: \"localhost\",\n    master_port: 29500,\n    huggingface_hub_cache: None,\n    weights_cache_override: None,\n    disable_custom_kernels: false,\n    cuda_memory_fraction: 1.0,\n    rope_scaling: None,\n    rope_factor: None,\n    json_output: true,\n    otlp_endpoint: None,\n    otlp_service_name: \"text-generation-inference.router\",\n    cors_allow_origin: [],\n    api_key: None,\n    watermark_gamma: None,\n    watermark_delta: None,\n    ngrok: false,\n    ngrok_authtoken: None,\n    ngrok_edge: None,\n    tokenizer_config_path: None,\n    disable_grammar_support: false,\n    env: true,\n    max_client_batch_size: 4,\n    lora_adapters: None,\n    usage_stats: On,\n    payload_limit: 2000000,\n    enable_prefill_logprobs: false,\n}"},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:44.898929Z","level":"INFO","fields":{"message":"Using attention flashinfer - Prefix caching true"},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:44.898961Z","level":"INFO","fields":{"message":"Sharding model on 4 processes"},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:44.898966Z","level":"INFO","fields":{"message":"Using default cuda graphs [1, 2, 4, 8, 16, 32]"},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:44.899151Z","level":"INFO","fields":{"message":"Starting check and download process for /model_data/llama3-3-70b"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2025-02-19T10:11:50.740956Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download."},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:51.641141Z","level":"INFO","fields":{"message":"Successfully downloaded weights for /model_data/llama3-3-70b"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2025-02-19T10:11:51.641428Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2025-02-19T10:11:51.641431Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2025-02-19T10:11:51.641484Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"shard-manager"}]}
{"timestamp":"2025-02-19T10:11:51.641493Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank":3,"name":"shard-manager"}]}

Model /info output:

{
  "model_id": "/model_data/llama3-3-70b",
  "model_sha": null,
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_tokens": 131071,
  "max_total_tokens": 131072,
  "validation_workers": 4,
  "max_client_batch_size": 4,
  "router": "text-generation-router",
  "version": "3.1.0",
  "sha": "463228ebfc444f60fa351da34a2ba158af0fe9d8",
  "docker_label": "sha-463228e"
}
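
For reference, a minimal sketch of how the /info payload can be fetched to confirm the effective limits. The base URL is a placeholder for the in-cluster service address, not the actual endpoint:

import json
import urllib.request

BASE_URL = "http://llama3-3-70b-deploy-inference"  # hypothetical service name

# Fetch the router's /info endpoint and pretty-print the reported limits.
with urllib.request.urlopen(f"{BASE_URL}/info") as resp:
    info = json.load(resp)
print(json.dumps(info, indent=2))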

Model: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

We are running ETL jobs that send requests to the TGI /generate endpoint with text for the model to summarize (content redacted for privacy). These requests are often large in terms of input tokens, and we issue them from concurrent threads. For some requests, the model returns repetitive, gibberish output, and in some cases subsequent responses become garbled as well, even for simple queries. Restarting the TGI container fixes the problem temporarily, until the ETL runs again, which seems to "corrupt" the server over time.

Some detailed info on the requests we are running (a minimal reproduction sketch follows this list):

  • Request params: temperature=0.1, max_tokens=500 (rest not set, i.e. default)
  • 27 requests
  • 15 concurrent threads
  • Input token counts: [37051,591,593,522,490,840,458,3700,4227,380,3144,4404,1949,3812,2606,1878,2132,1374,1241,397,1364,864,1323,782,956,722,686]
  • Output token counts (successful cases): [475,329,322,366,319,346,395,416,506,290,537,483,531,533,511,456,499,481,398,298,350,367,313,365,345,379,345]
  • Example output in case of failure: was451 company of company and2 and and not was is and and of the was is isiHHHHH andH andi and andH\\'t and and and and and and and and isi andH and H2 HiH of the the\\'t not it HH H andH the the is isH H is not the is isH H andHHHH andH andH and and and \" H and H2 the not onlyH is and and\\'t\\'t not is isH and and and and and is and and and and and and and and and andH is was\\' it is a H and and and and and and and and not not is not is was is is is the is not is and and and and and and and and and and and and and and and and and and and and and is and and and and is is was is and and and and and had had and and and and and and\\'t\\' not is was is and is not not\\' is is was not is is is was is is is and is and and and and and and and not\\'t is is and and and not not need have have and and and and and need need and is is is is not is is is is not was is is and and and and is not is is and and and and and is is and and and and and and and need is is is and and and and it is not not not not is not is is is is is is is is a and and and and is is was is and and and and and is is and and and and and and and is is and and and is is is is is is is is is is and and and and and and and is a and and and and\\'t\\'t\\'t is is and and and and is is is is and is and is is not and and and and and is a is is is is is the is is and and and and and and is is is a and and and is is is a is a and and and is a is and is is a is a is not is a is not is is and and and and and and is and and and and and is not is not is and and and is for is a and is and and and and is is is and is a is is and and and and and is is and and and and and and and and is and and and is the it\\'t it is is is is of is is is is is is is and is and is and is is is is is a and and and and is is the is and is and is is is is is is is is for is is is a is a and is is is is a is is is a is a is and and and and is a and and and and and and and is not is a H and and is not is a is a is a is a is a and and and and and and and is is is is a is is is is and is is is is and H and and and is not is is a and is and is a is is is and is/ is and and and and is is is is is is and and and and is is is is and and and\\'t is is is is is is is is is is is is is a is a is and is and is and and is is and is and and is\\'t is is is is is is is a is creating is is is and is and is not is a is
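
To make the load pattern concrete, here is a minimal reproduction sketch. It assumes TGI's /generate endpoint (so the max_tokens value above is mapped to TGI's max_new_tokens parameter), a placeholder service URL, and placeholder documents standing in for the redacted inputs:

import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://llama3-3-70b-deploy-inference"  # hypothetical service name
documents = [f"<redacted document {i}>" for i in range(27)]  # real inputs are private

def summarize(text: str) -> str:
    # One summarization request, mirroring the parameters listed above.
    payload = {
        "inputs": f"Summarize the following text:\n\n{text}",
        "parameters": {"temperature": 0.1, "max_new_tokens": 500},
    }
    req = urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["generated_text"]

# 27 requests fanned out over 15 threads, matching the load described above.
with ThreadPoolExecutor(max_workers=15) as pool:
    for i, out in enumerate(pool.map(summarize, documents)):
        print(i, out[:120].replace("\n", " "))

With the real documents (tens of thousands of input tokens for some of them), a subset of these calls comes back as the gibberish shown above, and the corruption can persist into later requests.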

Expected behavior

Model outputs should remain correct (non-gibberish) in all load scenarios.

Other issues that may be related

I believe this issue could be related, and a similar scenario may apply: #2871

@Narsil @OlivierDehaene @drbh
