
Upgraded metadata lookup for backwards compatibility, added option to input metadata manually #179

Merged
pbeaucage merged 43 commits into main from Issue178_FixLoadRun
Apr 1, 2025

Conversation

@PriyankaKetkarBNL (Collaborator) commented Feb 23, 2025

Addresses issue #178.

The mdLookup dictionary was updated so that beamline metadata key names can be entered as a list instead of a single value (or two values, in the case of the secondary_lookup table). This sets up infrastructure for backwards compatibility with the key names used historically at the beamline. The secondary_lookup dictionary was removed, as all historical key names are now consolidated into mdLookup. These changes were propagated to the code where Tiled is searched for metadata keys, as well as to the construction of the reverse lookup table (reverse_lut) in loadRun. The changes were tested with scan IDs 92849, 93065, and 91175, which include scans from before and after the beamline codebase upgrades in January 2025.
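The list-based lookup described above can be sketched as a first-match search over candidate beamline key names. This is an illustrative sketch, not the actual mdLookup contents; the key names below are made up:

```python
# Illustrative sketch of a list-based metadata lookup table: each
# PyHyper name maps to a list of candidate beamline key names, newest
# first, so scans taken before a beamline rename still resolve.
# (Key names here are hypothetical, not the real mdLookup entries.)
md_lookup = {
    "energy": ["en_energy_setpoint", "en_monoen_setpoint"],
    "polarization": ["en_polarization_setpoint", "epu_polarization"],
}

def resolve_key(pyhyper_name, scan_md, lookup=md_lookup):
    """Return the value for the first candidate key present in scan_md."""
    for candidate in lookup.get(pyhyper_name, []):
        if candidate in scan_md:
            return scan_md[candidate]
    raise KeyError(f"no known key for {pyhyper_name!r} in scan metadata")

# A pre-upgrade scan that only carries the older key name still resolves:
old_scan_md = {"en_monoen_setpoint": 285.0}
resolve_key("energy", old_scan_md)  # -> 285.0
```

When both old and new names are present, the newer name wins because it appears first in the candidate list.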

loader = phs.load.SST1RSoXSDB(corr_mode="none")

scanID = 92849  # Count scan with 50 repeats at constant energy and polarization
# scanID = 93065  # RSoXS energy scan with 1 repeat, post-January 2025
# scanID = 91175  # RSoXS energy scan with 1 repeat, December 2024

scan = loader.loadRun(scanID, dims=["time", "energy", "polarization"]).unstack('system')
scan

Also tested the following:

scanID_spiral = 92770

scan = loader.loadRun(scanID_spiral, dims=['sam_x', 'sam_y']).unstack('system')
scan

Additionally, an mdManual input was added to the loadRun function so that metadata values can be entered manually in case they were not originally written out with the scan. My understanding is that the current coords input cannot take more qualitative single-value entries (e.g., {"sample_notes": "details"}) that are meant to be stored in attrs. If it makes sense, I could think about how to consolidate these two inputs.
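The manual-entry behavior can be sketched as a merge step applied after the automatic lookup, with manual values taking precedence. This is a hypothetical illustration of the idea, not the actual mdManual handling in loadRun:

```python
# Hypothetical sketch: manually supplied metadata fills in values the
# scan never recorded, and overrides automatically looked-up ones.
# (Function and variable names are illustrative, not from the PR.)
def merge_metadata(looked_up_md, md_manual=None):
    """Combine auto-discovered metadata with manual entries; manual wins."""
    merged = dict(looked_up_md)
    if md_manual:
        merged.update(md_manual)
    return merged

auto_md = {"energy": 285.0}
merge_metadata(auto_md, {"sample_notes": "details", "energy": 290.0})
# -> {'energy': 290.0, 'sample_notes': 'details'}
```

A dict merge like this handles both numeric coordinates and qualitative attrs-style entries, which is why consolidating the two inputs could be feasible.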

@pbeaucage (Collaborator) left a comment

This is actually a pretty minor change to the data structure that maps metadata fields to PyHyper words, made far more complicated by changing a bunch of variable names and re-inventing basic function calls. Before merging, we should

  • revert the changes to variable names from lowercase/underscore to camel case, making them match the rest of the class. This made this quite difficult to review and I'd like to look again after this is fixed to ensure we didn't lose anything in the diff.
  • add a test that covers the new metadata names
  • ideally, add a documentation page on this variable name mechanism. I'd accept making an issue for this.

@pbeaucage (Collaborator) left a comment

I believe this is now good to go, assuming you are OK with the workaround for manual metadata documented in the issue.

Also, can you please add some tests for new data before we merge?

@PriyankaKetkarBNL (Collaborator, Author)

> I believe this is now good to go, assuming you are OK with the workaround for manual metadata documented in the issue.
>
> Also, can you please add some tests for new data before we merge?

Yes, your changes look good to me. I just added some tests.

I noticed before adding my tests that some of the tests had failed. Possibly from the recent commits, but I hadn't kept a close eye.

[screenshot: failing test runs]

@PriyankaKetkarBNL (Collaborator, Author)

> This is actually a pretty minor change to the data structure that maps metadata fields to PyHyper words, made far more complicated by changing a bunch of variable names and re-inventing basic function calls. Before merging, we should
>
> • revert the changes to variable names from lowercase/underscore to camel case, making them match the rest of the class. This made this quite difficult to review and I'd like to look again after this is fixed to ensure we didn't lose anything in the diff.
> • add a test that covers the new metadata names
> • ideally, add a documentation page on this variable name mechanism. I'd accept making an issue for this.

What do you mean regarding the third bullet? Is that for explaining PyHyperScattering variable names or beamline variable names?

@pbeaucage (Collaborator)

pbeaucage commented Apr 1, 2025

Tests failing: weird, transient Tiled errors. I was looking at that; hopefully they don't recur. You can click through to get the run log.

Variable names: a list of the standard PyHyper words and what they mean; energy, polarization, sam_x, sam_y, sam_th, etc. I can copy that into an issue.

Just curious: do those test load calls work without hinting the dimensions? I'm especially curious for the count scan, if it works by default. We can fix that if not.

@PriyankaKetkarBNL (Collaborator, Author)

> A few style changes here. In particular, it is bad form to put anything of consequence on the same line as a flow control block; really, that should be used only for minimal logic, and it is far more readable to add a newline. I tried to add some comments explaining the logic flow as well, for the benefit of future maintainers. If these all look good, the only thing left is to decide on override_md or similar; I will comment in the issue thread.

What do you mean by the first sentence? Is the "consequence" referring to some operation and the "flow control block" things like for, if, try, etc.?

@pbeaucage (Collaborator)

Blegh, tests are working, but the Tiled server is just dropping connections. I've messaged on NSLS2 slack.

@PriyankaKetkarBNL (Collaborator, Author)

> Tests failing: Weird, transient Tiled errors. I was looking at that. Hopefully they don't recur. You can click thru to get the run log.
>
> Variable names: a list of the standard PyHyper words and what they mean; energy, polarization, sam_x, sam_y, sam_th, etc. I can copy that into an issue.
>
> Just curious: do those test load calls work without hinting the dimensions? I'm especially curious for the count scan, if it works by default. We can fix that if not.

The energy scans (scan IDs 93065 and 91175) and the spiral (scan ID 92770) worked. Time scan (scan ID 92849) did not work.

loader.loadRun(run=92849).unstack('system')

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[187], line 1
----> 1 scan = loader.loadRun(run=92849).unstack('system')
      2 scan

File ~\.conda\envs\20250308_1\Lib\site-packages\xarray\core\dataarray.py:3019, in DataArray.unstack(self, dim, fill_value, sparse)
   2958 def unstack(
   2959     self,
   2960     dim: Dims = None,
   (...)
   2963     sparse: bool = False,
   2964 ) -> Self:
   2965     """
   2966     Unstack existing dimensions corresponding to MultiIndexes into
   2967     multiple new dimensions.
   (...)
   3017     DataArray.stack
   3018     """
-> 3019     ds = self._to_temp_dataset().unstack(dim, fill_value=fill_value, sparse=sparse)
   3020     return self._from_temp_dataset(ds)

File ~\.conda\envs\20250308_1\Lib\site-packages\xarray\core\dataset.py:5833, in Dataset.unstack(self, dim, fill_value, sparse)
   5829         result = result._unstack_full_reindex(
   5830             d, stacked_indexes[d], fill_value, sparse
   5831         )
   5832     else:
-> 5833         result = result._unstack_once(d, stacked_indexes[d], fill_value, sparse)
   5834 return result

File ~\.conda\envs\20250308_1\Lib\site-packages\xarray\core\dataset.py:5653, in Dataset._unstack_once(self, dim, index_and_vars, fill_value, sparse)
   5650 variables: dict[Hashable, Variable] = {}
   5651 indexes = {k: v for k, v in self._indexes.items() if k != dim}
-> 5653 new_indexes, clean_index = index.unstack()
   5654 indexes.update(new_indexes)
   5656 for idx in new_indexes.values():

File ~\.conda\envs\20250308_1\Lib\site-packages\xarray\core\indexes.py:1057, in PandasMultiIndex.unstack(self)
   1054 clean_index = remove_unused_levels_categories(self.index)
   1056 if not clean_index.is_unique:
-> 1057     raise ValueError(
   1058         "Cannot unstack MultiIndex containing duplicates. Make sure entries "
   1059         f"are unique, e.g., by  calling ``.drop_duplicates('{self.dim}')``, "
   1060         "before unstacking."
   1061     )
   1063 new_indexes: dict[Hashable, Index] = {}
   1064 for name, lev in zip(clean_index.names, clean_index.levels, strict=True):

ValueError: Cannot unstack MultiIndex containing duplicates. Make sure entries are unique, e.g., by  calling ``.drop_duplicates('system')``, before unstacking.
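For context, this failure mode can be reproduced with a bare pandas MultiIndex: a count scan held at constant energy and polarization yields repeated (energy, polarization) tuples along the stacked 'system' dimension, which unstack refuses. A minimal sketch with illustrative values, not the actual loadRun internals:

```python
import pandas as pd

# A count scan with repeats at constant energy/polarization produces
# duplicate entries in the stacked index (values are illustrative).
index = pd.MultiIndex.from_arrays(
    [[285.0, 285.0, 285.0], [0, 0, 0]], names=["energy", "polarization"]
)
index.is_unique  # False -- this is what makes unstack('system') raise

# Dropping duplicates restores uniqueness, but collapses the repeats to
# a single entry; including a genuinely unique dimension (as the
# dims=["time", "energy", "polarization"] hint does) keeps them all.
deduped = index.drop_duplicates()
```

This is consistent with the hinted call working: adding 'time' makes each (time, energy, polarization) tuple unique even though energy and polarization never change.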

@pbeaucage (Collaborator)

pbeaucage commented Apr 1, 2025

> What do you mean by the first sentence? Is the "consequence" referring to some operation and the "flow control block" things like for, if, try, etc.?

So yes, while it is technically valid Python syntax to write, for instance:

try: from wherever import thing
except ImportError: from elsewhere import thing

(or, yes, any other flow control statement like for or if), that is really only appropriate for things that are short and minimal, not things that require thought. You would usually put a newline and then indent:

try:
    from wherever import thing
except ImportError:
    from elsewhere import thing

One way to think about this is that Python is a language where whitespace and layout are meaningful, compared with, say, C, so following the expected patterns of whitespace and flow is helpful.

@pbeaucage (Collaborator)

Created #190, #191 for the last few items.

@PriyankaKetkarBNL (Collaborator, Author)

> The energy scans (scan IDs 93065 and 91175) and the spiral (scan ID 92770) worked. Time scan (scan ID 92849) did not work.

I'm testing more recent scans now and ran into a situation where scan ID 93983 loads fine with dims=["energy"] hinted but throws an error without hinting.

scan = loader.loadRun(run=93983).unstack('system')

Error:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[10], line 1
----> 1 scan = loader.loadRun(run=93983).unstack('system')
      2 scan

File ~\Anaconda3\envs\20250331_1\Lib\site-packages\PyHyperScattering\SST1RSoXSDB.py:740, in SST1RSoXSDB.loadRun(self, run, dims, coords, return_dataset, useMonitorShutterThinning)
    737     dims_to_join.append(val)
    738     dim_names_to_join.append(key)
--> 740 index = pd.MultiIndex.from_arrays(dims_to_join, names=dim_names_to_join)
    741 # handle the edge case of a partly-finished scan
    742 if len(index) != len(data["time"]):

File ~\Anaconda3\envs\20250331_1\Lib\site-packages\pandas\core\indexes\multi.py:533, in MultiIndex.from_arrays(cls, arrays, sortorder, names)
    530     if len(arrays[i]) != len(arrays[i - 1]):
    531         raise ValueError("all arrays must be same length")
--> 533 codes, levels = factorize_from_iterables(arrays)
    534 if names is lib.no_default:
    535     names = [getattr(arr, "name", None) for arr in arrays]

File ~\Anaconda3\envs\20250331_1\Lib\site-packages\pandas\core\arrays\categorical.py:3069, in factorize_from_iterables(iterables)
   3065 if len(iterables) == 0:
   3066     # For consistency, it should return two empty lists.
   3067     return [], []
-> 3069 codes, categories = zip(*(factorize_from_iterable(it) for it in iterables))
   3070 return list(codes), list(categories)

File ~\Anaconda3\envs\20250331_1\Lib\site-packages\pandas\core\arrays\categorical.py:3069, in <genexpr>(.0)
   3065 if len(iterables) == 0:
   3066     # For consistency, it should return two empty lists.
   3067     return [], []
-> 3069 codes, categories = zip(*(factorize_from_iterable(it) for it in iterables))
   3070 return list(codes), list(categories)

File ~\Anaconda3\envs\20250331_1\Lib\site-packages\pandas\core\arrays\categorical.py:3042, in factorize_from_iterable(values)
   3037     codes = values.codes
   3038 else:
   3039     # The value of ordered is irrelevant since we don't use cat as such,
   3040     # but only the resulting categories, the order of which is independent
   3041     # from ordered. Set ordered to False as default. See GH #15457
-> 3042     cat = Categorical(values, ordered=False)
   3043     categories = cat.categories
   3044     codes = cat.codes

File ~\Anaconda3\envs\20250331_1\Lib\site-packages\pandas\core\arrays\categorical.py:425, in Categorical.__init__(self, values, categories, ordered, dtype, fastpath, copy)
    422 elif isinstance(values, np.ndarray):
    423     if values.ndim > 1:
    424         # preempt sanitize_array from raising ValueError
--> 425         raise NotImplementedError(
    426             "> 1 ndim Categorical are not supported at this time"
    427         )
    428     values = sanitize_array(values, None)
    429 else:
    430     # i.e. must be a list

NotImplementedError: > 1 ndim Categorical are not supported at this time
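For reference, the root cause in the traceback can be reproduced directly: if one of the arrays handed to pd.MultiIndex.from_arrays is 2-D, pandas' Categorical factorization rejects it with exactly this NotImplementedError. A minimal sketch with made-up values, not the actual data loadRun produced for scan 93983:

```python
import numpy as np
import pandas as pd

# 1-D arrays, one per stacked dimension, work as expected:
ok = pd.MultiIndex.from_arrays([np.array([270.0, 285.0])], names=["energy"])

# But if a dimension array comes back 2-D (e.g. one value per detector
# frame rather than per scan point), factorization fails with the
# "> 1 ndim Categorical" error seen above.
try:
    pd.MultiIndex.from_arrays([np.zeros((2, 2))], names=["energy"])
    raised = False
except NotImplementedError:
    raised = True
```

So a likely line of investigation for the new issue is why the un-hinted dimension inference returns a 2-D array for one of the candidate dimensions.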

@pbeaucage (Collaborator)

OK, that sounds like a separate issue. Not sure what would be causing it offhand. Feel free to make a new thread to look into it.

@pbeaucage (Collaborator)

Tests pass! Merging.

@pbeaucage pbeaucage merged commit cc880b4 into main Apr 1, 2025
16 checks passed
@pbeaucage pbeaucage deleted the Issue178_FixLoadRun branch April 1, 2025 20:57
@PriyankaKetkarBNL (Collaborator, Author)

> OK, that sounds like a separate issue. Not sure what would be causing it offhand. Feel free to make new thread to look into.

Sounds good. Created issue #193.


Development

Successfully merging this pull request may close these issues.

Bug: loadRun is not working for 2025-1 cycle scans due to beamline codebase refactoring.
