Losing redis connection #127

Open
renan-souza opened this issue Apr 10, 2024 · 1 comment
Labels
bug Something isn't working priority:medium

Comments

@renan-souza
Collaborator

Occasionally (fortunately only rarely) in the LLM experiment, we get the error below. We need to debug it to decide what to do. One possibility is simply to retry the connection and the failed request until it succeeds. Today, if this error happens, we are likely losing data.

[flowcept][ERROR][frontier06306.frontier.olcf.ornl.gov][pid=61095][thread=140733193385728][function=_start][Connection closed by server.]
Traceback (most recent call last):
  File "/lustre/orion/stf219/scratch/souzar/flowcept/flowcept/flowceptor/consumers/document_inserter.py", line 199, in _start
    for message in pubsub.listen():
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1653, in listen
    response = self.handle_message(self.parse_response(block=True))
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1531, in parse_response
    response = self._execute(conn, try_read)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1507, in _execute
    return conn.retry.call_with_retry(
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/retry.py", line 49, in call_with_retry
    fail(error)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1509, in <lambda>
    lambda error: self._disconnect_raise_connect(conn, error),
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1496, in _disconnect_raise_connect
    raise error
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/retry.py", line 46, in call_with_retry
    return do()
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1508, in <lambda>
    lambda: command(*args, **kwargs),
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/client.py", line 1529, in try_read
    return conn.read_response()
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 848, in read_response
    response = self._parser.read_response(disable_decoding=disable_decoding)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 335, in read_response
    result = self._read_response(disable_decoding=disable_decoding)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 383, in _read_response
    response = [
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 384, in <listcomp>
    self._read_response(disable_decoding=disable_decoding)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 377, in _read_response
    response = self._buffer.read(length)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 230, in read
    self._read_from_socket(length - self.length)
  File "/lustre/orion/stf219/scratch/souzar/miniconda/envs/llm3/lib/python3.8/site-packages/redis/connection.py", line 195, in _read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.

@renan-souza renan-souza added bug Something isn't working priority:medium labels Apr 10, 2024
@renan-souza
Collaborator Author

I found that this is an intermittent error that happens on Frontier, likely due to network issues. Even so, we should consider handling this failure better than just losing the data.
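
The retry idea from the first comment could look roughly like the sketch below. It is not Flowcept's code: `listen_with_reconnect`, its parameters, and the backoff values are all hypothetical names and numbers chosen for illustration. The helper is kept library-agnostic so it runs without a Redis server. The caller passes a `subscribe` callable that opens a fresh subscription and returns its message iterator, plus the exception types that should trigger a reconnect.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type


def listen_with_reconnect(
    subscribe: Callable[[], Iterable[dict]],
    retry_on: Tuple[Type[BaseException], ...],
    max_retries: int = 5,       # assumed value, tune per deployment
    base_delay: float = 0.5,    # assumed value, seconds
    max_delay: float = 30.0,    # assumed value, seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield messages from subscribe(), resubscribing after connection errors."""
    failures = 0
    while True:
        try:
            # Open a fresh subscription on every attempt.
            for message in subscribe():
                failures = 0  # a successful read resets the retry budget
                yield message
            return  # the stream ended normally
        except retry_on:
            failures += 1
            if failures > max_retries:
                raise  # give up and surface the original error
            # Exponential backoff, capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (failures - 1)))


# Hypothetical wiring with redis-py (not executed here; the host and
# channel name are placeholders):
#
#   import redis
#
#   def subscribe():
#       pubsub = redis.Redis(host="localhost").pubsub()
#       pubsub.subscribe("my-channel")
#       return pubsub.listen()
#
#   for message in listen_with_reconnect(
#       subscribe, retry_on=(redis.exceptions.ConnectionError,)
#   ):
#       ...
```

Reconnecting keeps the consumer alive, but Redis pub/sub does not replay messages published while the subscriber was disconnected. A loop like this would therefore narrow the window of data loss rather than close it.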
