Issue 59: Transfer Flows between CFS and HPSS #62
Conversation
Progress: HPSS
New file:
New
New file:
New file:
Next steps: HPSS
PruneController
SciCat
Notes when talking to Dylan:
Moving to tape strategies:
Summary of Issue 59 Changes
Key New Features
I met with @rajasriramoju and came up with a few final steps to test and verify the feature changes.
End-to-end tests
Define test for tape:
dylanmcreynolds left a comment
A couple preliminary requests.
As discussed, let's move a lot of this stuff into the scicat_beamline project.
2 questions:
- I expect the BL scientist or Prefect dispatcher to start off the HPSS transfer, because they would have access. So when is a user, as mentioned in the docs, involved in this?
- When talking about transferring specific user data from HPSS to CFS, where is the operation taking place from, and who is performing it?
I'm updating the documentation to define what we mean by NERSC users, ALS users, and service accounts.
NERSC users can move data to HPSS. In this case, "user" is assumed to be our alsdev service account.
In production, ALS users would not be moving the data themselves, but they would still have access to their own data and would know where it is located on HPSS via SciCat. As ALS staff, we could retrieve their data on request using our flows, or the user could follow the documentation and log in to HPSS from Jupyter; they should have file permissions for their projects.
Also thinking maybe we should set the NERSC account in the Slurm job as an env variable so it's easier for other people to use it themselves. Maybe a fallback can be set in config.yml and overridden by the env variable.
Change it from this:

#SBATCH -A als  # Specify the account.

to something like this:

nersc_account = os.environ.get("NERSC_ACCOUNT")
job_script = rf"""#!/bin/bash
...
#SBATCH -A {nersc_account}  # Specify the account.
"""

- The flow is set up to use the `alsdev` service account, which handles the transfer using the `xfer` QOS (`#SBATCH -q xfer`) to launch the Slurm job with the specific `hsi` and `htar` commands.
- It would be launched via a Prefect Agent running on the beamline's respective VM (i.e. `flow-prd`). We have options if we want that to be a scheduled transfer, or a manual transfer of one project directory.
- ALS users who are NERSC users could look up their data in SciCat and get their data via `hsi`/`htar` commands.
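A possible shape for the config.yml fallback mentioned above; this is just a sketch, and the `nersc_account` key name in config.yml is an assumption:

import os
import yaml

with open("config.yml") as f:
    config = yaml.safe_load(f)

# The env variable wins; otherwise fall back to config.yml, then to the current default "als".
nersc_account = os.environ.get("NERSC_ACCOUNT", config.get("nersc_account", "als"))

job_script = rf"""#!/bin/bash
#SBATCH -A {nersc_account}  # Specify the account.
"""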
Notes from Raja
Notes from today's testing hackathon with Raja; just a few items left to clean up (TODO):
Hi @GBirkel, adding you as a reviewer to this PR, but don't worry about it until I rebase with other recent changes.
1. Define the source (CFS) and destination (HPSS) paths.
2. Create the destination directory on HPSS if it doesn't exist.
   - Recursively check whether each part of the incoming file path exists
   - If the folder does not exist, use `hsi mkdir`
Is there an equivalent to `mkdir -p` for recursive folder creation? E.g. `hsi mkdir -p`?
Good catch. I found this in the documentation: https://hpss-collaboration.org/wp-content/uploads/2023/09/hpss_hsi_10.2_reference_manual.pdf
19.63. MKDIR command
Synopsis: mkd[ir] [-A annotation] [-m mode] [-p] path ...
Description: This command creates HPSS subdirectories.
HSI Version: All
Aliases: add, md
Options:
...
-p  Creates missing intermediate path name directories. If the -p flag is not specified, the parent directory of each newly-created directory must already exist.
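With `-p` available, the recursive existence check could collapse into a single command in the Slurm job script. A rough sketch (the destination path variable is hypothetical):

destination_dir = "/home/a/alsdev/example_project"  # hypothetical HPSS destination
job_script = rf"""#!/bin/bash
#SBATCH -q xfer
# -p creates any missing intermediate directories, so no per-level check is needed
hsi "mkdir -p {destination_dir}"
"""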
- If a directory, group files by beam cycle and archive them.
  * Cycle 1: Jan 1 - Jul 15
  * Cycle 2: Jul 16 - Dec 31
  * If a group exceeds 2 TB, it is partitioned into multiple tar archives.
Should there be code that warns if the resulting archive would be less than 100GB, and skips it, so the content is consolidated into the next archiving run instead?
Interesting thought. The lower bound (<100 GB) seems more like a suggestion so that small files are not scattered across many different tapes. I think if all of the files for a cycle/proposal are bundled together, then users could still retrieve their data easily.
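For reference, a rough sketch of the grouping being discussed; the cycle boundaries and 2 TB limit come from the diff above, everything else (names, data shapes) is hypothetical:

import datetime
from collections import defaultdict

TWO_TB = 2 * 1024**4  # archive size cap in bytes

def beam_cycle(date: datetime.date) -> str:
    # Cycle 1: Jan 1 - Jul 15, Cycle 2: Jul 16 - Dec 31
    return "cycle1" if date <= datetime.date(date.year, 7, 15) else "cycle2"

def group_for_archiving(files):
    """Group (path, size_bytes, date) entries by year/cycle, then split groups above 2 TB."""
    groups = defaultdict(list)
    for path, size, date in files:
        groups[f"{date.year}-{beam_cycle(date)}"].append((path, size))

    archives = {}
    for key, entries in groups.items():
        chunks, current, current_size = [], [], 0
        for path, size in entries:
            if current and current_size + size > TWO_TB:
                chunks.append(current)
                current, current_size = [], 0
            current.append(path)
            current_size += size
        if current:
            chunks.append(current)
        archives[key] = chunks  # each chunk would become one htar archive
    return archives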
def add_new_dataset_location(
    self,
    dataset_id: str = None,
    source_folder: str = None,
Why are we giving add_new_dataset_location a path to a folder, then having it construct a file path by appending the name of the data set to the folder? It seems like a dangerous assumption to make that the file would always have that name, especially since it appears we are quietly falling back to a generic name of "dataset".
If a user wants to put two data sets in the same place and renames one for this purpose, there is no way to supply the changed name to add_new_dataset_location.
Perhaps we should leave it to the caller to compose the full filename?
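Something like this sketch of the alternative signature (parameter names here are hypothetical):

def add_new_dataset_location(
    self,
    dataset_id: str,
    file_path: str,  # full path composed by the caller, including the actual file name
    source_folder_host: str = None,
):
    # The method records exactly the path it is given instead of guessing a file name.
    ...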
# Add location information to the path if host is provided
if source_folder_host:
    datafile.path = f"{source_folder_host}:{file_path}"
Perhaps prepare this string before creating the DataFile object above, so we're not modifying the object after creation?
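For example, roughly (assuming the `DataFile` model comes from pyscicat; the size/time values are placeholders):

from pyscicat.model import DataFile  # assumed import; adjust to the project's actual model

# Compose the final path first, then construct the DataFile once with its final value.
path = f"{source_folder_host}:{file_path}" if source_folder_host else file_path
datafile = DataFile(path=path, size=file_size, time=file_time)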
def _find_dataset(
    self,
    proposal_id: Optional[str] = None,  # The ALS proposal ID, not the SciCat ID
    file_name: Optional[str] = None
I'm not sure if SciCat's advanced query API supports this, but, if we're actually searching by file_name, wouldn't it make more sense to look in all the dataFileList entries for all datasets?
This comes to mind because at 733, scientists organize their data mostly by creating dated folders, and there's no guarantee that the files in those folders have unique names relative to the other folders. So if they were searching for a particular data file, they would need to use a path fragment, e.g. '20241216_153047/new_run.h5'
If this function could search dataFileLists by path fragment, it would be some future-proofing for those users...
I am adding this use case in a comment as TODO. Right now, the _find_dataset method isn't used anywhere, so we can build on this as needed.
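If the SciCat query API can't express this directly, the idea could also be done client-side. A hedged sketch, assuming datasets carry an origdatablock-style `dataFileList` with `path` entries:

def find_datasets_by_path_fragment(datasets: list, fragment: str) -> list:
    """Return datasets whose dataFileList contains a file whose path matches the fragment."""
    return [
        ds for ds in datasets
        if any(fragment in df.get("path", "") for df in ds.get("dataFileList", []))
    ]

# e.g. find_datasets_by_path_fragment(all_datasets, "20241216_153047/new_run.h5")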
    sample_name: str
) -> str:
    """extract search terms from sample name to provide something pleasing to search on"""
    terms = re.split("[^a-zA-Z0-9]", sample_name)
Should this include dashes or underscores?
Good question... I will leave this as is for now, since I'm not sure.
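For what it's worth, the difference between the two choices on a made-up sample name:

import re

sample_name = "sample_A-01 run2"  # hypothetical example
# Current pattern: dashes and underscores act as separators and are dropped.
re.split("[^a-zA-Z0-9]", sample_name)     # ['sample', 'A', '01', 'run2']
# Including them in the character class keeps them inside the terms.
re.split("[^a-zA-Z0-9_-]", sample_name)   # ['sample_A-01', 'run2']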
orchestration/flows/scicat/utils.py (outdated)
| """Create a thumbnail from an image array.""" | ||
| image_array = image_array - np.min(image_array) + 1.001 | ||
| image_array = np.log(image_array) | ||
| image_array = 205 * image_array / (np.max(image_array)) |
Can an image_array ever contain all 0s? (Don't want a "divide by zero" error)
I'm adding a try-except statement, where if an error occurs while building the thumbnail (e.g., dividing by 0), it will return a blank image. This way if it fails, it won't block the rest of the ingestion flow from completing.
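Roughly along these lines (a sketch of the described fallback; the function name and blank-image size are assumptions):

import io
import numpy as np
from PIL import Image

def build_thumbnail(image_array: np.ndarray) -> bytes:
    try:
        image_array = image_array - np.min(image_array) + 1.001
        image_array = np.log(image_array)
        image_array = 205 * image_array / np.max(image_array)
        image = Image.fromarray(image_array.astype("uint8"))
    except Exception:
        # If anything goes wrong while building the thumbnail, return a blank image
        # so the rest of the ingestion flow can still complete.
        image = Image.new("L", (64, 64))
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()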
| f"in {days_from_now.total_seconds()/86400:.1f} days") | ||
|
|
||
| try: | ||
| schedule_prefect_flow( |
What happens if this is called twice, and two pruning jobs for the same location are triggered at the same time? Will we create a race between the jobs and end up with file errors?
From what I've seen, if the file is already pruned, then the flow would fail with a "file not found" error.
orchestration/prune_controller.py (outdated)
    file_path: str = None,
    source_endpoint: FileSystemEndpoint = None,
    check_endpoint: Optional[FileSystemEndpoint] = None,
    days_from_now: datetime.timedelta = 0
This is asking for days_from_now but expects a timedelta. Might be more clear to either:
- Change the argument name to `time_delta`
- Expect a floating point value, and create the `timedelta` internally from it
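Sketches of the two options (function names and surrounding signatures are hypothetical):

import datetime
from typing import Optional

# Option 1: rename the argument so the name matches the expected type.
def schedule_pruning(file_path: Optional[str] = None,
                     time_delta: datetime.timedelta = datetime.timedelta(0)) -> None:
    ...

# Option 2: accept a plain number of days and build the timedelta internally.
def schedule_pruning_in_days(file_path: Optional[str] = None,
                             days_from_now: float = 0.0) -> None:
    delay = datetime.timedelta(days=days_from_now)
    ...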
…rces requested than allowed for logical queue xfer (0.0390625 requested core-equivalents > 0.025)
…d.localhost header when logging in if true.
…than the flow version
I am opening this PR to address issue #59.
Key components in this initial commit:
- `CFSToHPSSTransferController()` with logic for handling single files vs directories using `SFAPI`, Slurm, `hsi`, and `htar`, based on HPSS best practices.
- Moved `create_sfapi_client()` to a new file called `orchestration/sfapi.py`. Now it can be easily accessed by multiple components, including the NERSC tomography workflow.
- Updated `orchestration/_tests/test_sfapi_flow.py` to reflect the new location of `create_sfapi_client()`.

This is still a WIP and requires thorough testing, and there are a few outstanding tasks:
- `HPSSToCFSTransferController()` class (currently just a placeholder)
- `SFAPI` id/key pair: our application must make it easy to update these values since they expire fairly quickly.
- Refactor `TransferControllers` to accept a generic `config=Config` rather than `config=Config832`, which will help generalize our codebase as we extend support to additional beamlines.
- `orchestration/pruning_controller.py` that centralizes and implements pruning logic for: