Skip to content

Comments

Xiaoya new feature grafana#71

Merged
dylanmcreynolds merged 19 commits intoals-computing:mainfrom
xiaoyachong:xiaoya-new-feature-grafana
Apr 22, 2025
Merged

Xiaoya new feature grafana#71
dylanmcreynolds merged 19 commits intoals-computing:mainfrom
xiaoyachong:xiaoya-new-feature-grafana

Conversation

@xiaoyachong
Copy link
Contributor

@xiaoyachong xiaoyachong commented Apr 7, 2025

This PR adds functionality to collect and visualize data transfer metrics via Prometheus and Grafana. Key changes include:

  1. Metrics Collection and Prometheus Integration:
  • Added a new prometheus_utils.py module for pushing metrics to Prometheus Pushgateway
  • Implemented metrics for transfer counts, file sizes, transfer speeds, and durations
  1. Enhanced Transfer Controller:
  • Expanded the GlobusTransferController with file size detection and metrics collection
  • Modified the copy method to support metrics collection with optional flags
  1. New Flow for Grafana Testing:
  • Added test_transfers_832_grafana flow to test the metrics collection
  • Created transfer_data_to_ners_by_controller task that uses the enhanced controller
  • Added a deployment script create_deployments_832_grafana.sh
  1. Configuration Updates:
  • Modified NERSC endpoint path in config.yml to use a test directory
  • Added Prometheus client dependency to requirements.txt

Copy link
Collaborator

@taxe10 taxe10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work, @xiaoyachong! I left a couple of comments for your revision and a couple of questions as well. Something we should think about is that we may need to track metrics per endpoint (or maybe a set of endpoints) in the future instead of per machine (e.g. NERSC) - which can be addressed in a future PR of course, but good to keep in mind now to avoid a major refactor later.
I'd be happy to take another look once changes have been updated.

Copy link
Contributor

@dylanmcreynolds dylanmcreynolds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. A few comments.

source: GlobusEndpoint = None,
destination: GlobusEndpoint = None,
collect_metrics: bool = True,
machine_name: str = "NERSC"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the information you want in machine_name is already available in the destination?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I removed machine_name from the copy() arguments, since we can obtain it directly from destination. Meanwhile, collect_metrics was also removed.

# Create metrics instance once per controller instance
self.prometheus_metrics = PrometheusMetrics()

def get_file_size(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is doing a lot of work with a fair number of messages to globus. Is there really nothing in the globus API do do this? When you start a transfer, globus definitely calculates the size of the payload. Perhaps it's available at the end of the transfer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s a great point. I’ve added a new function called get_transfer_file_info() that retrieves task information using transfer_client.get_task(task_id).

Copy link
Contributor

@dylanmcreynolds dylanmcreynolds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to have pyunit tests for this PR before merging. Thanks!

@xiaoyachong
Copy link
Contributor Author

I would like to have pyunit tests for this PR before merging. Thanks!

I’ve added the test_globus_transfer_controller_with_metrics() test in orchestration/_tests/test_transfer_controller.py. Let me know if additional tests are needed.

except Exception as e:
logger.error(f"Error collecting or pushing metrics: {e}")

def _get_hpc_name_from_endpoint(self, endpoint: GlobusEndpoint) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use whatever is in the GlobusEndpoint? Not all transfers have an HPC as a destination.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I change to machine_name = destination.name and delete _get_hpc_name_from_endpoint

@dylanmcreynolds dylanmcreynolds merged commit eba7ca8 into als-computing:main Apr 22, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants