Xiaoya new feature grafana by xiaoyachong · Pull Request #71 · als-computing/splash_flows

xiaoyachong · 2025-04-07T19:17:54Z

This PR adds functionality to collect and visualize data transfer metrics via Prometheus and Grafana. Key changes include:

Metrics Collection and Prometheus Integration:

Added a new prometheus_utils.py module for pushing metrics to Prometheus Pushgateway
Implemented metrics for transfer counts, file sizes, transfer speeds, and durations
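
As a rough illustration of what a push to the Pushgateway involves, the sketch below uses only the Python standard library to render gauges in the Prometheus text exposition format and to build the HTTP request. It is not the code in prometheus_utils.py, which according to this PR relies on the Prometheus client library. The metric names, the machine label, the job name, and the gateway URL are all assumptions made for the example.

```python
import urllib.parse
import urllib.request
from typing import Mapping


def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauges(samples: Mapping[str, float], labels: Mapping[str, str]) -> str:
    """Render each sample as a gauge: one TYPE line plus one value line."""
    label_str = ",".join(
        f'{key}="{escape_label(str(val))}"' for key, val in sorted(labels.items())
    )
    suffix = f"{{{label_str}}}" if label_str else ""
    lines = []
    for name, value in samples.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name}{suffix} {float(value)}")
    # The exposition format requires a trailing newline.
    return "\n".join(lines) + "\n"


def build_push_request(gateway: str, job: str, body: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the Pushgateway's /metrics/job/<job> path."""
    url = f"{gateway.rstrip('/')}/metrics/job/{urllib.parse.quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical metric names and values for a single transfer.
payload = render_gauges(
    {"transfer_bytes": 1048576, "transfer_duration_seconds": 2.5},
    {"machine": "NERSC"},
)
request = build_push_request("http://localhost:9091", "splash_flows_transfer", payload)
# Sending it would be: urllib.request.urlopen(request)
```

With POST, the Pushgateway replaces only the metrics with matching names in the job's group; PUT would replace the whole group.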

Enhanced Transfer Controller:

Expanded the GlobusTransferController with file size detection and metrics collection
Modified the copy method to support metrics collection with optional flags

New Flow for Grafana Testing:

Added test_transfers_832_grafana flow to test the metrics collection
Created transfer_data_to_ners_by_controller task that uses the enhanced controller
Added a deployment script create_deployments_832_grafana.sh

Configuration Updates:

Modified NERSC endpoint path in config.yml to use a test directory
Added Prometheus client dependency to requirements.txt

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1209588090194771

taxe10

Very nice work, @xiaoyachong! I left a couple of comments for your revision and a couple of questions as well. Something we should think about is that we may need to track metrics per endpoint (or maybe a set of endpoints) in the future instead of per machine (e.g. NERSC) - which can be addressed in a future PR of course, but good to keep in mind now to avoid a major refactor later.
I'd be happy to take another look once changes have been updated.

config.yml

orchestration/flows/bl832/move.py

orchestration/prometheus_utils.py

orchestration/transfer_controller.py

requirements.txt

orchestration/transfer_controller.py

create_deployments_832_grafana.sh

dylanmcreynolds

This is great. A few comments.

orchestration/transfer_controller.py

dylanmcreynolds · 2025-04-11T15:59:53Z

orchestration/transfer_controller.py

        source: GlobusEndpoint = None,
        destination: GlobusEndpoint = None,
+        collect_metrics: bool = True,
+        machine_name: str = "NERSC"


I think the information you want in machine_name is already available in the destination?

xiaoyachong · 2025-04-14

Yes, I removed machine_name from the copy() arguments, since we can obtain it directly from destination. I also removed collect_metrics.

dylanmcreynolds · 2025-04-11T16:03:59Z

orchestration/transfer_controller.py

+        # Create metrics instance once per controller instance
+        self.prometheus_metrics = PrometheusMetrics()
+
+    def get_file_size(


This is doing a lot of work with a fair number of messages to Globus. Is there really nothing in the Globus API to do this? When you start a transfer, Globus definitely calculates the size of the payload. Perhaps it's available at the end of the transfer?

xiaoyachong · 2025-04-14

That’s a great point. I’ve added a new function called get_transfer_file_info() that retrieves task information using transfer_client.get_task(task_id).
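
A minimal sketch of that idea, assuming the client behaves like globus_sdk.TransferClient and that the task document carries the status, request_time, completion_time, bytes_transferred, and files fields with ISO 8601 timestamps. The function name summarize_task is hypothetical and is not the PR's get_transfer_file_info().

```python
from datetime import datetime
from typing import Any, Optional


def summarize_task(transfer_client: Any, task_id: str) -> Optional[dict]:
    """Return size, duration, and average speed for a finished transfer task.

    transfer_client is assumed to expose get_task(task_id) returning a
    dict-like task document, as globus_sdk.TransferClient does.
    """
    task = transfer_client.get_task(task_id)
    if task["status"] != "SUCCEEDED":
        return None  # still running or failed: no final numbers to report

    started = datetime.fromisoformat(task["request_time"])
    finished = datetime.fromisoformat(task["completion_time"])
    duration = (finished - started).total_seconds()
    size = task["bytes_transferred"]
    return {
        "bytes": size,
        "files": task["files"],
        "duration_seconds": duration,
        # Guard against a zero-length interval on very small transfers.
        "bytes_per_second": size / duration if duration > 0 else 0.0,
    }
```

One lookup per task replaces the per-file size queries, which is the reduction in Globus traffic the review asked for.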

dylanmcreynolds

I would like to have pyunit tests for this PR before merging. Thanks!

xiaoyachong · 2025-04-14T16:41:14Z

> I would like to have pyunit tests for this PR before merging. Thanks!

I’ve added the test_globus_transfer_controller_with_metrics() test in orchestration/_tests/test_transfer_controller.py. Let me know if additional tests are needed.
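
For readers unfamiliar with this style of test, here is a hedged sketch of how a metrics-aware transfer can be tested with mocks instead of a real Globus client or Pushgateway. The function copy_and_record is a hypothetical stand-in for the controller's copy(); it is not the code under test in this PR, and the metric fields are made up.

```python
from unittest.mock import MagicMock


def copy_and_record(start_transfer, metrics, machine_name: str) -> bool:
    """Hypothetical stand-in: run a transfer, then push metrics on success."""
    success, info = start_transfer()
    if success:
        metrics.push(machine_name=machine_name, **info)
    return success


def test_metrics_pushed_on_success():
    metrics = MagicMock()
    start = MagicMock(return_value=(True, {"bytes": 1000, "duration_seconds": 10.0}))
    assert copy_and_record(start, metrics, "NERSC") is True
    # The push should happen exactly once, with the transfer's numbers.
    metrics.push.assert_called_once_with(
        machine_name="NERSC", bytes=1000, duration_seconds=10.0
    )


def test_no_metrics_on_failure():
    metrics = MagicMock()
    start = MagicMock(return_value=(False, {}))
    assert copy_and_record(start, metrics, "NERSC") is False
    metrics.push.assert_not_called()
```

Both functions follow pytest naming, so they would be collected automatically if placed in a test module.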

orchestration/globus/transfer.py

dylanmcreynolds · 2025-04-22T17:05:20Z

orchestration/transfer_controller.py

+        except Exception as e:
+            logger.error(f"Error collecting or pushing metrics: {e}")
+
+    def _get_hpc_name_from_endpoint(self, endpoint: GlobusEndpoint) -> str:


Why not just use whatever is in the GlobusEndpoint? Not all transfers have an HPC as a destination.

xiaoyachong · 2025-04-22

I changed this to machine_name = destination.name and deleted _get_hpc_name_from_endpoint.
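
The resulting pattern can be sketched as follows: take the label straight from the destination endpoint and keep metrics failures from propagating, as the except block in the diff above does. Endpoint is a hypothetical stand-in for GlobusEndpoint, and record_metrics and its push callable are assumptions made for illustration.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger(__name__)


@dataclass
class Endpoint:
    """Hypothetical stand-in for GlobusEndpoint; only `name` matters here."""
    name: str
    root_path: str = "/"


def record_metrics(destination: Endpoint, info: dict,
                   push: Callable[[dict, dict], None]) -> bool:
    """Label metrics with the destination's name; log push errors, never raise."""
    try:
        push({"machine": destination.name}, info)
        return True
    except Exception as e:
        # A metrics outage should not mark a successful transfer as failed.
        logger.error(f"Error collecting or pushing metrics: {e}")
        return False
```

Because the label comes from the endpoint itself, destinations that are not HPC systems need no special handling.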

xiaoyachong added 8 commits April 5, 2025 16:02

add local test

249fefe

push metrics to prometheus pushgateway

7b7dd5f

push metrics to prometheus pushgateway

cc21179

push metrics to prometheus pushgateway

bd9f38e

push metrics to prometheus pushgateway

7b5eaee

push metrics to prometheus pushgateway

04c272e

push metrics to prometheus pushgateway

9ef61a4

push metrics to prometheus pushgateway

0ed2428

xiaoyachong requested review from davramov, dylanmcreynolds and taxe10 April 7, 2025 19:21

push metrics to prometheus pushgateway

d17f560

taxe10 requested changes Apr 9, 2025

update prometheus-related code based on PR review

3a5cf90

dylanmcreynolds requested changes Apr 11, 2025

xiaoyachong added 6 commits April 12, 2025 13:58

update prometheus-related code based on PR review

d89080c

update prometheus-related code based on PR review

b6fd6a7

update prometheus-related code based on PR review

24fd002

update prometheus-related code based on PR review

37a821d

update prometheus-related code based on PR review

64ed9cb

update prometheus-related code based on PR review

c4b6aa6

dylanmcreynolds reviewed Apr 22, 2025

orchestration/globus/transfer.py

dylanmcreynolds reviewed Apr 22, 2025

xiaoyachong added 3 commits April 22, 2025 13:23

use destination name as machine_name

45da2ec

add --no-cache-dir

b23118e

change return value of start_transfer()

abb76a9

taxe10 approved these changes Apr 22, 2025

dylanmcreynolds merged commit eba7ca8 into als-computing:main Apr 22, 2025
1 check passed


4 participants
