
Make Orchestrator metrics singleton #1301

Open

eero-t wants to merge 1 commit into main from metric-singleton

Conversation

eero-t (Contributor) commented Feb 18, 2025

Description

Make the Orchestrator metrics a singleton, so that even if an application instantiates multiple Orchestrators, the "megaservice_*" metrics collect data from all of them.

This is intended as the proper fix for the #1280 workaround. Changing the metric prefix was fine while only CI tests created multiple Orchestrator instances, but now that applications (e.g. DocSum) do so too, that is no longer the case.

(Another option would be to add arguments for passing Orchestrator instance names as metric prefixes, to name and differentiate the metrics of each Orchestrator instance. However, that would have required changes in 4 OPEA projects instead of just this one, and dashboards & benchmarks would then need to hard-code those per-application prefixes.)
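
For illustration, here is a minimal sketch of the singleton idea; the class name, metric names, and prometheus_client usage are assumptions made for this example, not the actual diff:

```python
# Minimal sketch of the singleton idea -- class and metric names are
# assumed for illustration, not taken from the actual PR diff.
from prometheus_client import Histogram

class OrchestratorMetrics:
    _instance = None

    def __new__(cls):
        # Hand every Orchestrator the same metrics object, so the shared
        # "megaservice_*" metrics aggregate data from all instances and
        # instance creation order no longer affects metric names.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.request_latency = Histogram(
                "megaservice_request_latency",
                "End-to-end request latency (seconds)",
            )
        return cls._instance
```

A side benefit of this pattern is that a second Orchestrator never tries to register duplicate Prometheus collectors, since `OrchestratorMetrics() is OrchestratorMetrics()` always holds in the sketch above.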

Issues

  • Adding Orchestrator instances to applications changes their metric names depending on the order in which the instances are created, and it distorts the application-wide metric values. This breaks dashboards and benchmarks that rely on those metric names.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Dependencies

n/a.

Tests

@eero-t eero-t marked this pull request as draft February 18, 2025 14:24
codecov bot commented Feb 18, 2025

Codecov Report

Attention: Patch coverage is 84.61538% with 2 lines in your changes missing coverage. Please review.

Files with missing lines          Patch %  Lines
comps/cores/mega/orchestrator.py  84.61%   2 Missing ⚠️

Files with missing lines          Coverage Δ
comps/cores/mega/orchestrator.py  90.98% <84.61%> (-0.23%) ⬇️

@eero-t eero-t marked this pull request as ready for review February 21, 2025 20:15
eero-t (Contributor Author) commented Feb 21, 2025

Rebased on main, tested, and removed draft status.

Metrics work. There's something funky with the DocSum LLM-uservice HuggingFace API endpoint handling under heavier load (it seems to connect to the HF site for every query and eventually fails), but I don't see how that could be related to these changes.

xiguiw (Collaborator) commented Feb 25, 2025

@eero-t
Is it possible to add a CI test case for this?

eero-t (Contributor Author) commented Feb 25, 2025

@eero-t Is it possible to add a CI test case for this?

@xiguiw I'll try to modify the (streaming) ServiceOrchestrator test to also check metrics.

eero-t force-pushed the metric-singleton branch 3 times, most recently from 3904431 to 8575e74 on February 26, 2025 16:51
eero-t (Contributor Author) commented Feb 26, 2025

CI fails on something completely unrelated to the changes in this PR:

chatqna-gaudi-nginx-server Error manifest for opea/nginx:comps not found: manifest unknown: manifest unknown
Error response from daemon: manifest for opea/nginx:comps not found: manifest unknown: manifest unknown

eero-t force-pushed the metric-singleton branch 9 times, most recently from 3dcd91a to f1f9744 on February 26, 2025 19:07
eero-t (Contributor Author) commented Feb 26, 2025

@eero-t Is it possible to add a CI test case for this?

@xiguiw OK, I've expanded the (streaming) ServiceOrchestrator test to also check its streaming metrics.

That showed odd metric values which did not occur with real apps, only with the test code. It turned out to be a ServiceOrchestrator bug in propagating the first-token status, which is now fixed.
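
For a rough idea of what such a check could look like (the metrics endpoint URL and the metric names below are assumptions, not the actual test code):

```python
# Rough sketch of a streaming-metrics check; the metrics URL and the
# metric names are assumptions, not the actual ServiceOrchestrator test.
import re
import requests

def read_megaservice_metrics(url="http://localhost:8888/metrics"):
    # Scrape the Prometheus exposition output and collect all
    # "megaservice_*" samples as {name: value} pairs.
    text = requests.get(url, timeout=5).text
    return dict(re.findall(r"^(megaservice_\S+) (\S+)$", text, re.MULTILINE))

def check_first_token_metric():
    samples = read_megaservice_metrics()
    # After one streamed request, the (assumed) first-token histogram
    # should have recorded exactly one observation.
    count = float(samples.get("megaservice_first_token_latency_count", 0))
    assert count == 1.0, samples
```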


CI "example-test" continues to fail, but it's completely unrelated to this PR, and should not block merging:

[ retrieval ] HTTP status is not 200. Received status was 000
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/llama_index/core/utils.py", line 63, in __init__
    nltk.data.find("corpora/stopwords")
  File "/usr/local/lib/python3.11/site-packages/nltk/data.py", line 579, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

eero-t (Contributor Author) commented Feb 27, 2025

https://github.com/opea-project/GenAIComps/pull/1340/files and https://github.com/opea-project/GenAIComps/pull/1343/files should fix the CI issue:

Resource stopwords not found.
Please use the NLTK Downloader to obtain the resource

But I think it may be cleaner if I submit the metrics test, and the fixes for the issues it revealed, as a separate PR.

eero-t (Contributor Author) commented Feb 28, 2025

I split the test code addition and fixes out into #1348; it should be merged first.

Make Orchestrator metrics singleton

So that even if applications instantiate multiple Orchestrators,
"megaservice_*" metrics collect data from all of them.

Another option would be to add arguments for passing Orchestrator
instance names as metric prefixes, to name and differentiate metrics
for each Orchestrator instance.

However, that would have needed changes in 3 OPEA projects instead of
just this one, and dashboards would then need to hard-code those
per-application prefixes.

Signed-off-by: Eero Tamminen <[email protected]>
eero-t force-pushed the metric-singleton branch 2 times, most recently from 04f2407 to 4bfd4ae on March 3, 2025 18:52