Llama 3.3 70B Weird, gibberish outputs in production setup #3043

Open
andresC98 opened this issue Feb 20, 2025 · 7 comments

andresC98 commented Feb 20, 2025

System Info

Runtime environment:

  • Kubernetes Cluster deployment
  • 4 A100 GPUs with 80 GB VRAM each
  • 12 CPUs with 32 GB RAM each
  • TGI Version: 3.0.1 (have also tried 3.0.0 and 3.1.0, with the same outcome)

TGI ENV config:

All default values except the following:

extraInferenceEnvs:
  MAX_BATCH_PREFILL_TOKENS: "4096"
  PREFILL_CHUNKING: "1"
  DTYPE: "bfloat16"

(have tried both float16 and bfloat16, with the same outcome)

NVIDIA-SMI Output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:17:00.0 Off |                    0 |
| N/A   45C    P0             91W /  300W |   78489MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:31:00.0 Off |                    0 |
| N/A   46C    P0             90W /  300W |   78489MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          Off |   00000000:B1:00.0 Off |                    0 |
| N/A   46C    P0             92W /  300W |   78489MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          Off |   00000000:CA:00.0 Off |                    0 |
| N/A   44C    P0             85W /  300W |   78489MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     18577      C   /opt/conda/bin/python                       78480MiB |
|    1   N/A  N/A     18576      C   /opt/conda/bin/python                       78480MiB |
|    2   N/A  N/A     18578      C   /opt/conda/bin/python                       78480MiB |
|    3   N/A  N/A     18579      C   /opt/conda/bin/python                       78480MiB |
+-----------------------------------------------------------------------------------------+

text-generation-launcher --env output:

{"timestamp":"2025-02-19T10:11:43.478948Z","level":"INFO","fields":{"message":"Args {\n    model_id: \"/model_data/llama3-3-70b\",\n    revision: None,\n    validation_workers: 4,\n    sharded: None,\n    num_shard: None,\n    quantize: None,\n    speculate: None,\n    dtype: Some(\n        BFloat16,\n    ),\n    kv_cache_dtype: None,\n    trust_remote_code: false,\n    max_concurrent_requests: 128,\n    max_best_of: 2,\n    max_stop_sequences: 4,\n    max_top_n_tokens: 5,\n    max_input_tokens: None,\n    max_input_length: None,\n    max_total_tokens: None,\n    waiting_served_ratio: 0.3,\n    max_batch_prefill_tokens: Some(\n        4096,\n    ),\n    max_batch_total_tokens: None,\n    max_waiting_tokens: 20,\n    max_batch_size: None,\n    cuda_graphs: None,\n    hostname: \"llama3-3-70b-deploy-inference-6c985b6745-s4fns\",\n    port: 80,\n    shard_uds_path: \"/tmp/text-generation-server\",\n    master_addr: \"localhost\",\n    master_port: 29500,\n    huggingface_hub_cache: None,\n    weights_cache_override: None,\n    disable_custom_kernels: false,\n    cuda_memory_fraction: 1.0,\n    rope_scaling: None,\n    rope_factor: None,\n    json_output: true,\n    otlp_endpoint: None,\n    otlp_service_name: \"text-generation-inference.router\",\n    cors_allow_origin: [],\n    api_key: None,\n    watermark_gamma: None,\n    watermark_delta: None,\n    ngrok: false,\n    ngrok_authtoken: None,\n    ngrok_edge: None,\n    tokenizer_config_path: None,\n    disable_grammar_support: false,\n    env: true,\n    max_client_batch_size: 4,\n    lora_adapters: None,\n    usage_stats: On,\n    payload_limit: 2000000,\n    enable_prefill_logprobs: false,\n}"},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:44.898929Z","level":"INFO","fields":{"message":"Using attention flashinfer - Prefix caching true"},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:44.898961Z","level":"INFO","fields":{"message":"Sharding model on 4 processes"},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:44.898966Z","level":"INFO","fields":{"message":"Using default cuda graphs [1, 2, 4, 8, 16, 32]"},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:44.899151Z","level":"INFO","fields":{"message":"Starting check and download process for /model_data/llama3-3-70b"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2025-02-19T10:11:50.740956Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download."},"target":"text_generation_launcher"}
{"timestamp":"2025-02-19T10:11:51.641141Z","level":"INFO","fields":{"message":"Successfully downloaded weights for /model_data/llama3-3-70b"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2025-02-19T10:11:51.641428Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2025-02-19T10:11:51.641431Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2025-02-19T10:11:51.641484Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"shard-manager"}]}
{"timestamp":"2025-02-19T10:11:51.641493Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":3,"name":"shard-manager"},"spans":[{"rank":3,"name":"shard-manager"}]}

Model /info output:

{
  "model_id": "/model_data/llama3-3-70b",
  "model_sha": null,
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_tokens": 131071,
  "max_total_tokens": 131072,
  "validation_workers": 4,
  "max_client_batch_size": 4,
  "router": "text-generation-router",
  "version": "3.1.0",
  "sha": "463228ebfc444f60fa351da34a2ba158af0fe9d8",
  "docker_label": "sha-463228e"
}

Model: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

We are running ETL jobs that send requests to the TGI /generate endpoint with text for the model to summarize (redacted for privacy purposes). These requests are often large in terms of input token count, and we send them from concurrent threads. We are observing that in some cases the model returns weird, repetitive outputs, and sometimes subsequent responses become garbled as well, even for simple queries. Restarting the TGI container fixes this temporarily, until the ETL runs again, which seems to "corrupt" the server over time. A minimal sketch of the request pattern is included after the list below.

Some detailed info on the requests we are running:

  • Request params: temperature=0.1, max_tokens=500 (rest not set, i.e. default)
  • 27 requests
  • 15 concurrent threads
  • Input tokens count: [37051,591,593,522,490,840,458,3700,4227,380,3144,4404,1949,3812,2606,1878,2132,1374,1241,397,1364,864,1323,782,956,722,686]
  • Output tokens count in case of success:[475,329,322,366,319,346,395,416,506,290,537,483,531,533,511,456,499,481,398,298,350,367,313,365,345,379,345]
  • Example output in case of failure: was451 company of company and2 and and not was is and and of the was is isiHHHHH andH andi and andH\\'t and and and and and and and and isi andH and H2 HiH of the the\\'t not it HH H andH the the is isH H is not the is isH H andHHHH andH andH and and and \" H and H2 the not onlyH is and and\\'t\\'t not is isH and and and and and is and and and and and and and and and andH is was\\' it is a H and and and and and and and and not not is not is was is is is the is not is and and and and and and and and and and and and and and and and and and and and and is and and and and is is was is and and and and and had had and and and and and and\\'t\\' not is was is and is not not\\' is is was not is is is was is is is and is and and and and and and and not\\'t is is and and and not not need have have and and and and and need need and is is is is not is is is is not was is is and and and and is not is is and and and and and is is and and and and and and and need is is is and and and and it is not not not not is not is is is is is is is is a and and and and is is was is and and and and and is is and and and and and and and is is and and and is is is is is is is is is is and and and and and and and is a and and and and\\'t\\'t\\'t is is and and and and is is is is and is and is is not and and and and and is a is is is is is the is is and and and and and and is is is a and and and is is is a is a and and and is a is and is is a is a is not is a is not is is and and and and and and is and and and and and is not is not is and and and is for is a and is and and and and is is is and is a is is and and and and and is is and and and and and and and and is and and and is the it\\'t it is is is is of is is is is is is is and is and is and is is is is is a and and and and is is the is and is and is is is is is is is is for is is is a is a and is is is is a is is is a is a is and and and and is a and and and and and and and is not is a H and and is not is a is a is a is a is a and and and and and and and is is is is a is is is is and is is is is and H and and and is not is is a and is and is a is is is and is/ is and and and and is is is is is is and and and and is is is is and and and\\'t is is is is is is is is is is is is is a is a is and is and is and and is is and is and and is\\'t is is is is is is is a is creating is is is and is and is not is a is
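
For reference, the request pattern looks roughly like the following sketch. This is a minimal illustration, not our actual client code: the endpoint URL and prompts are placeholders (the real documents are redacted), and the parameter names follow TGI's /generate request schema, where the 500-token output limit maps to max_new_tokens.

import requests
from concurrent.futures import ThreadPoolExecutor

# Placeholder URL; the real service name is internal to our cluster.
TGI_GENERATE_URL = "http://tgi-service/generate"

def summarize(text: str) -> str:
    # TGI's /generate endpoint takes an "inputs" string plus a "parameters"
    # object; temperature and max_new_tokens mirror the request params above.
    payload = {
        "inputs": f"Summarize the following text:\n\n{text}",
        "parameters": {"temperature": 0.1, "max_new_tokens": 500},
    }
    response = requests.post(TGI_GENERATE_URL, json=payload, timeout=600)
    response.raise_for_status()
    return response.json()["generated_text"]

# 27 documents in the real runs; placeholders here.
documents = ["<redacted document 1>", "<redacted document 2>"]

# 15 concurrent worker threads, as in the failing runs.
with ThreadPoolExecutor(max_workers=15) as pool:
    summaries = list(pool.map(summarize, documents))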

Expected behavior

Model outputs are the same (correct, non-gibberish) in all load scenarios.

Other issues that may be related

I believe this issue could be related, and a similar scenario may apply: #2871

@Narsil @OlivierDehaene @drbh

martinigoyanes (Contributor) commented:

We are experiencing the same issue!

danieldk (Member) commented Mar 5, 2025

Any chance you could test TGI 3.1.1? We fixed two prefix caching edge cases that can lead to long-term corruption.

andresC98 (Author) commented:

@danieldk We are deploying it today; will update here on this issue thread in a couple of weeks, after we validate that the issue has disappeared. Thank you very much for your work!! 😊

andresC98 (Author) commented Mar 5, 2025

@danieldk turns out 3.1.1 has introduced some important changes in the Docker image that break the container in our setup... Are big changes like these expected to be introduced in minor versions (3.1.0 -> 3.1.1)?

/tgi-entrypoint.sh line 5: ./.venv/bin/activate: No such file or directory   
text-generation-launcher: error while loading shared libraries: libpython3.11.so.1.0: cannot open shared object file: No such file or directory.

Our custom image simply contains the following:

FROM huggingface/text-generation-inference:3.1.1
# Adds a non-root llm user to its own group for isolation
ENV UID=1000
ENV USER=llm
RUN groupadd -g "${UID}" "${USER}" && useradd -m -u "${UID}" -g "${USER}" "${USER}"
# Switch to non-root user and use their home as workdir
USER ${USER}
WORKDIR /home/${USER}
RUN mkdir -p -m 0744 /home/${USER}/cache
ENV HF_HOME=/home/${USER}/cache

Narsil (Collaborator) commented Mar 6, 2025

We should probably also stop running as root by default actually...

Narsil (Collaborator) commented Mar 7, 2025

Okay this is the fix:

FROM tgi
# Adds a non-root llm user to its own group for isolation
ENV UID=1000
ENV USER=llm
RUN groupadd -g "${UID}" "${USER}" && useradd -m -u "${UID}" -g "${USER}" "${USER}"
# Give users access to `/root/.local`, where the Python interpreter is located (it was installed as root by uv).
RUN chmod 555 /root
USER ${USER}
WORKDIR /home/${USER}
RUN mkdir -p -m 0744 /home/${USER}/cache
ENV HF_HOME=/home/${USER}/cache

Basically, the user you create needs to be able to traverse the /root/.local folder, where uv placed the Python interpreter.
This wasn't needed before because we used conda, which already put the interpreter in a globally readable folder.

We considered making the Docker image non-root by default, but it's likely to break other things (especially around mounted folders), so we're not going ahead with that yet.

andresC98 (Author) commented:

@Narsil thanks for the fix/workaround; the container is now able to start. However, this warning/error still appears at the beginning of startup. Should I be worried?
/tgi-entrypoint.sh: line 5: ./.venv/bin/activate: No such file or directory
