Xiaoya new feature grafana#71
taxe10
left a comment
Very nice work, @xiaoyachong! I left a couple of comments for your revision and a couple of questions as well. Something we should think about is that we may need to track metrics per endpoint (or maybe a set of endpoints) in the future instead of per machine (e.g. NERSC) - which can be addressed in a future PR of course, but good to keep in mind now to avoid a major refactor later.
I'd be happy to take another look once changes have been updated.
dylanmcreynolds
left a comment
This is great. A few comments.
orchestration/transfer_controller.py
Outdated
    source: GlobusEndpoint = None,
    destination: GlobusEndpoint = None,
    collect_metrics: bool = True,
    machine_name: str = "NERSC"
I think the information you want in machine_name is already available in the destination?
Yes, I removed machine_name from the copy() arguments, since we can obtain it directly from destination. Meanwhile, collect_metrics was also removed.
orchestration/transfer_controller.py
Outdated
        # Create metrics instance once per controller instance
        self.prometheus_metrics = PrometheusMetrics()

    def get_file_size(
This is doing a lot of work with a fair number of messages to Globus. Is there really nothing in the Globus API to do this? When you start a transfer, Globus definitely calculates the size of the payload. Perhaps it's available at the end of the transfer?
That’s a great point. I’ve added a new function called get_transfer_file_info() that retrieves task information using transfer_client.get_task(task_id).
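As a rough illustration of the approach described in this reply, the sketch below reads size information from the task document returned by transfer_client.get_task(task_id) instead of walking the source directory. The field names (bytes_transferred, files_transferred, status) are fields of the Globus Transfer task document; the TransferFileInfo type and the function body are assumptions for illustration, not the code in this PR.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TransferFileInfo:
    """Hypothetical container for the values the metrics code needs."""
    bytes_transferred: int
    files_transferred: int
    status: str


def get_transfer_file_info(transfer_client: Any, task_id: str) -> Optional[TransferFileInfo]:
    """Return size info for a task, or None if the lookup fails.

    `transfer_client` is assumed to behave like globus_sdk.TransferClient:
    get_task(task_id) returns a mapping-like task document.
    """
    try:
        task = transfer_client.get_task(task_id)
        return TransferFileInfo(
            bytes_transferred=int(task["bytes_transferred"]),
            files_transferred=int(task["files_transferred"]),
            status=str(task["status"]),
        )
    except Exception:
        # Metrics are best-effort here: a failed lookup should not fail the transfer.
        return None
```

This costs a single API call per transfer, made after the task has finished, rather than one listing call per directory.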
dylanmcreynolds
left a comment
I would like to have pyunit tests for this PR before merging. Thanks!
I’ve added the |
orchestration/transfer_controller.py
Outdated
        except Exception as e:
            logger.error(f"Error collecting or pushing metrics: {e}")

    def _get_hpc_name_from_endpoint(self, endpoint: GlobusEndpoint) -> str:
Why not just use whatever is in the GlobusEndpoint? Not all transfers have an HPC as a destination.
I changed it to machine_name = destination.name and deleted _get_hpc_name_from_endpoint.
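A minimal sketch of the change described in this reply, assuming GlobusEndpoint exposes a name attribute (as the reply implies). The Endpoint dataclass is a hypothetical stand-in so the snippet runs on its own, and metric_labels is an illustrative helper that does not appear in the PR.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Endpoint:
    """Hypothetical stand-in for GlobusEndpoint; only `name` is used here."""
    name: str


def metric_labels(source: Endpoint, destination: Endpoint) -> Dict[str, str]:
    """Build metric labels from the endpoints, with no HPC-specific lookup."""
    return {
        # Replaces the removed machine_name argument.
        "machine": destination.name,
        # Assumption: an extra label that would allow per-endpoint tracking later.
        "source": source.name,
    }
```

Because the labels come straight from the endpoint objects, the same code works for destinations that are not HPC systems.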
This PR adds functionality to collect and visualize data transfer metrics via Prometheus and Grafana. Key changes include:
- prometheus_utils.py module for pushing metrics to Prometheus Pushgateway
- GlobusTransferController with file size detection and metrics collection
- test_transfers_832_grafana flow to test the metrics collection
- transfer_data_to_ners_by_controller task that uses the enhanced controller
- create_deployments_832_grafana.sh
- config.yml to use a test directory
- requirements.txt