Conversation

@Diya910 Diya910 commented Apr 13, 2025

Before submitting a pull request (PR), please read the contributing guide.

Please fill out as much of this template as you can, but if you have any problems or questions, just leave a comment and we will help out :)

Description

What is this PR

  • [ ] Bug fix
  • [x] Addition of a new feature
  • [ ] Other

Why is this PR needed?
This PR introduces a new @DATETO@ wildcard that enables users to search for folders based on a date range embedded in their names. This feature is especially useful when users want to transfer data recorded within a specific date range, without needing to create folders for every date in that range.

What does this PR do?
Implements @DATETO@ pattern recognition inside search_for_wildcards.

Uses get_values_from_bids_formatted_name to extract date-YYYYMMDD from folder names.

Filters the folders based on whether the date falls within the provided range.
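
To illustrate the behaviour described above, here is a minimal, self-contained sketch (the `date-YYYYMMDD` convention and the `@DATETO@` tag are from the PR; the helper name and internals are hypothetical, not the actual datashuttle code):

```python
import re
from datetime import datetime

def filter_folders_by_date_range(folder_names, search_name):
    """Keep folders whose date-YYYYMMDD value falls inside the range
    written as 'YYYYMMDD@DATETO@YYYYMMDD' in search_name."""
    match = re.search(r"(\d{8})@DATETO@(\d{8})", search_name)
    if not match:
        return folder_names  # no range tag: nothing to filter
    start = datetime.strptime(match.group(1), "%Y%m%d")
    end = datetime.strptime(match.group(2), "%Y%m%d")
    kept = []
    for name in folder_names:
        date_match = re.search(r"date-(\d{8})", name)
        if date_match:
            value = datetime.strptime(date_match.group(1), "%Y%m%d")
            if start <= value <= end:  # boundaries inclusive
                kept.append(name)
    return kept

folders = [
    "ses-001_date-20240301",
    "ses-002_date-20240315",
    "ses-003_date-20240501",
]
print(filter_folders_by_date_range(folders, "ses-001_20240301@DATETO@20240401"))
# ['ses-001_date-20240301', 'ses-002_date-20240315']
```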

References

#508

How has this PR been tested?

Created automated tests (test_date_search_range) using a simulated folder structure with date-YYYYMMDD format.

Verified that only folders within the specified date range are returned.

Confirmed that existing wildcard functionality remains unaffected.

Is this a breaking change?

No, this feature is additive and does not alter existing behavior.

Does this PR require an update to the documentation?

Yes. The documentation should be updated to mention the new @Dateto@ wildcard and its usage.

If any features have changed or have been added, please explain how the
documentation has been updated.

Checklist:

  • [x] The code has been tested locally
  • [x] Tests have been added to cover all new functionality
  • [ ] The documentation has been updated to reflect any changes
  • [ ] The code has been formatted with pre-commit

There are two minor mypy errors I couldn't fully resolve:
A type conflict involving the dummy Configs class used in tests — guidance from maintainers would help finalize this.
A type mismatch originating from an existing code path — this appears unrelated to the new functionality added.


Diya910 commented Apr 17, 2025

@adamltyson @JoeZiminski
Is there any update on the pull request? Your feedback would be really helpful.

@sumana-2705
Contributor

Hello @Diya910,
The changes are looking great. The review process might be a little delayed since the team is currently a bit busy. In the meantime, it might be a good idea to take a look at the documentation as well, in the transfer_data.md file :)

@JoeZiminski
Member

Hi @Diya910 so sorry for the delay in response! thanks a lot for this PR and the extensive tests. I'm still not back full time but will definitely have time to review this within the next two weeks. Thanks for your patience

Member

@JoeZiminski JoeZiminski left a comment

Hey @Diya910 thanks a lot for this, it's a really nice implementation and is exactly what we need to do in this case. I have left a few comments on refactoring; this is because the introduced functionality can be aligned with some existing code to reduce duplication across the codebase. This requires some massaging of existing datashuttle code to make it a little more general so it can be called here. The suggestions also extend the implementation to handle the TIMETO and DATETIMETO case. For now I have not reviewed the tests as they might need changing after the refactor, but in general they look good and the attention to detail on testing is much appreciated.

Let me know if anything is not clear and if you have any questions or alternative ways to tackle this. Refactorings like those suggested can be a little fiddly. The linting / type checking will be useful when performing such refactorings. Of course, I'm happy to help wherever it would be useful. Thanks again for this contribution!

Just a reminder to myself, we will also need to add documentation for this new functionality.

name = name.replace(canonical_tags.tags("*"), "*")

matching_names: List[str]
if canonical_tags.tags("*") in name or "@DATETO@" in name:
Member

I think we can split this case. At present, if the name contains both a wildcard and a date range, the search_str generated with search_str = name.replace(canonical_tags.tags("*"), "*") will be overwritten by search_str = re.sub(r"\d{8}@DATETO@\d{8}", "date-*", name):

search_str = name
if canonical_tags.tags("*") in name:
    search_str = search_str.replace(canonical_tags.tags("*"), "*")

if "@DATETO@" in search_str:
    ... date replacement code

This is not very nice as it is constantly mutating search_str, which can be difficult to debug. However, I think the problem at hand calls for this and it is the neatest way to indicate the intention. We can leave a comment to explain.
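
A runnable sketch of this two-step approach, with the wildcard tag written literally as "@*@" in place of canonical_tags.tags("*") (the helper name is illustrative):

```python
import re

def build_search_str(name):
    # Step 1: replace the wildcard tag (written literally as "@*@" here,
    # standing in for canonical_tags.tags("*")) with a glob "*".
    search_str = name
    if "@*@" in search_str:
        search_str = search_str.replace("@*@", "*")
    # Step 2: only now collapse the date range into "date-*", so the
    # wildcard replacement from step 1 is not overwritten.
    if "@DATETO@" in search_str:
        search_str = re.sub(r"\d{8}@DATETO@\d{8}", "date-*", search_str)
    return search_str

print(build_search_str("ses-@*@_20240301@DATETO@20240401"))
# ses-*_date-*
```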

Author

Reorganized the code to handle wildcards and datetime tags separately

if canonical_tags.tags("*") in name or "@DATETO@" in name:
search_str = name.replace(canonical_tags.tags("*"), "*")
# If a date-range tag is present, extract dates and update the search string.
if "@DATETO@" in name:
Member

We have a canonical_tags.tags() function that contains all the tags (just in case we change them or some other problem that requires their editing arises). So @DATETO@, @TIMETO@ and @DATETIMETO@ could be added to that function, and @DATETO@ here replaced with canonical_tags.tags("DATETO")

Author

Added DATETO, TIMETO, and DATETIMETO to canonical_tags.py and using them through tags()

search_str = name.replace(canonical_tags.tags("*"), "*")
# If a date-range tag is present, extract dates and update the search string.
if "@DATETO@" in name:
m = re.search(r"(\d{8})@DATETO@(\d{8})", name)
Member

This is nice. We have some validation code for ISO formats here; however, that function works on a list of names and returns in a slightly strange format (which makes sense in the context of the validation functions).

I think to centralise this we can do a few refactorings:

Move this code block:

    formats = {
        "datetime": "%Y%m%dT%H%M%S",
        "time": "%H%M%S",
        "date": "%Y%m%d",
    }

to configs/canonical_tags and wrap it in a function like get_datetime_format based on the format.

Then this code in datetimes_in_iso_format can be factored into a separate function validate_datetime

        strfmt = formats[key]

        try:
            datetime.strptime(format_to_check, strfmt)
            error_message = []
        except ValueError:
            error_message = [get_datetime_error(key, name, strfmt, path_)]

except it can just return True / False and we can leave the error message stuff to datetimes_in_iso_format.

Now, we can call this new function from here. I think it is also worth having a quick function to get the expected number of values (8 for date) for use above, instead of hard-coding. This could be like:

def get_expected_num_datetime_values(format):
    format_str = get_datetime_format(format)
    today = datetime.now()
    return len(today.strftime(format_str))

We can then pass this "date", "time" or "datetime" to get the values.

This section would then look something like:

# somewhere we need to check that @DATETO@, @TIMETO@ and @DATETIMETO@ are used exclusively
format = tag = None
if tags.tags("DATETO") in search_str:
    format = "date"
    tag = tags.tags("DATETO")
elif tags.tags("TIMETO") in search_str:
    format = "time"
    tag = tags.tags("TIMETO")
elif tags.tags("DATETIMETO") in search_str:
    format = "datetime"
    tag = tags.tags("DATETIMETO")

num_values = get_expected_num_datetime_values(format)
full_tag_regex = fr"(\d{{{num_values}}}){re.escape(tag)}(\d{{{num_values}}})"
match = re.search(full_tag_regex, search_str)

if not match:
    ... raise (use utils.raise_error and raise a NeuroBlueprint error)

start_str, end_str = match.groups()

start_timepoint = datetime.strptime(start_str, get_datetime_format(format))
end_timepoint = datetime.strptime(end_str, get_datetime_format(format))

if not validate_datetime(start_str, format):
    ... raise error

<same for end_str>

search_str = re.sub(full_tag_regex, f"{format}-*", search_str)

I think all of this could be isolated to a new function such that in this function we just have something like:

format = tag = None
if tags.tags("DATETO") in search_str:
    format = "date"
    tag = tags.tags("DATETO")
elif tags.tags("TIMETO") in search_str:
    format = "time"
    tag = tags.tags("TIMETO")
elif tags.tags("DATETIMETO") in search_str:
    format = "datetime"
    tag = tags.tags("DATETIMETO")

search_str = format_and_validate_datetime_search_str(search_str, format, tag)
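
To make the sketch above concrete, here is a runnable, self-contained version for the date case (function names follow the review suggestions; the utils.raise_error / NeuroBlueprint error handling is replaced with a plain ValueError purely for illustration):

```python
import re
from datetime import datetime

def get_datetime_format(format_type):
    # Canonical strptime formats, as proposed for configs/canonical_tags.
    return {
        "datetime": "%Y%m%dT%H%M%S",
        "time": "%H%M%S",
        "date": "%Y%m%d",
    }[format_type]

def get_expected_num_datetime_values(format_type):
    # Derive the expected width (8 for "date") from today's date
    # instead of hard-coding it.
    return len(datetime.now().strftime(get_datetime_format(format_type)))

def format_and_validate_datetime_search_str(search_str, format_type, tag):
    # Collapse "<start><tag><end>" into a "<format>-*" glob, validating
    # both endpoints with strptime on the way.
    num_values = get_expected_num_datetime_values(format_type)
    full_tag_regex = rf"(\d{{{num_values}}}){re.escape(tag)}(\d{{{num_values}}})"
    match = re.search(full_tag_regex, search_str)
    if not match:
        raise ValueError(f"No valid {format_type} range found in: {search_str}")
    for value in match.groups():
        datetime.strptime(value, get_datetime_format(format_type))  # raises on bad value
    return re.sub(full_tag_regex, f"{format_type}-*", search_str)

print(format_and_validate_datetime_search_str(
    "ses-*_20240301@DATETO@20240401", "date", "@DATETO@"
))
# ses-*_date-*
```

Note the digit-only pattern covers the date and time cases; as discussed further down the thread, the datetime case needs a pattern that also matches the literal T.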

Author

Moved datetime formats to canonical_tags.py and created get_datetime_format function for centralized access

)[0]

# If a date-range tag was provided, further filter the results.
if "@DATETO@" in name:
Member

This is nice, and at this point we know we have validated dates. Ideally the validation should happen immediately before the point of use but in this case, there is no point wasting time searching if the dates are not valid, so it makes sense to do it before. But it is worth leaving a comment to indicate we know the dates are valid at this stage.

# If a date-range tag was provided, further filter the results.
if "@DATETO@" in name:
filtered_names: List[str] = []
for candidate in matching_names:
Member

This is nice, to generalise it a bit more we can have:

if format is not None:
    assert tag is not None, "format and tag should be set together"
    
get_values_from_bids_formatted_name can use `format`, and in the strptime call use the new `get_datetime_format` function

values_list = get_values_from_bids_formatted_name(
[candidate_basename], "date"
)
if not values_list:
Member

I think we can assume this list is not empty because date-* was used to search the names already (?)

Author

Removed unnecessary empty list check since the search pattern ensures valid datetime values

except ValueError:
continue
if start_date <= candidate_date <= end_date:
filtered_names.append(candidate)
Member

I think this entire block could be isolated in a new function that filters a list of names by datetime, just for readability (and it might also be useful in future)
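
A hypothetical sketch of such a function (the name matches what was later added in the PR; the format table and regex are illustrative, not the final datashuttle implementation):

```python
import re
from datetime import datetime

def filter_names_by_datetime_range(names, format_type, start, end):
    """Keep names whose '<format_type>-<value>' key parses to a
    datetime within [start, end] (boundaries inclusive)."""
    formats = {"datetime": "%Y%m%dT%H%M%S", "time": "%H%M%S", "date": "%Y%m%d"}
    strfmt = formats[format_type]
    kept = []
    for name in names:
        match = re.search(rf"{format_type}-([\dT]+)", name)
        if not match:
            continue
        try:
            value = datetime.strptime(match.group(1), strfmt)
        except ValueError:
            continue  # malformed value: skip rather than fail
        if start <= value <= end:
            kept.append(name)
    return kept

names = [
    "ses-001_date-20240301",
    "ses-002_date-20240315",
    "ses-003_date-20240501",
]
print(filter_names_by_datetime_range(
    names, "date", datetime(2024, 3, 1), datetime(2024, 4, 1)
))
# ['ses-001_date-20240301', 'ses-002_date-20240315']
```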

Author

Created filter_names_by_datetime_range() function


JoeZiminski commented Jun 6, 2025

Hey @Diya910 do you think you would be interested in continuing to work on this PR? This is a great addition and it would be nice to release it in a version soon. I'm happy to finalise the PR as most of the work now is just refactoring into the existing codebase.


Diya910 commented Jun 6, 2025

Yes, I am interested. I was busy with my exams and other things. Just allow me a day or two and I'll make the changes you suggested.

@JoeZiminski
Member

Hey @Diya910 great! No rush BTW I was just checking in, please prioritise exams / other stuff / taking some time to recuperate after exams. I was thinking it might be nice to merge over the next few weeks (rather than next few days), thanks!


Diya910 commented Jun 6, 2025

Thanks, I'll try to work on it as soon as possible.

…ion of code by making functions in validation.py and using them in the search_with_tags feature in the folders file

Diya910 commented Jun 15, 2025

Hey @JoeZiminski, I have made all the changes you suggested and also centralised the code. I have also updated the test file with additional test functions; everything is working fine from my side. If any other changes are required, please let me know and I will do them at the earliest.

@JoeZiminski JoeZiminski added this to the v2.8.0 milestone Jun 17, 2025
@JoeZiminski
Member

Hi @Diya910 thanks a lot for this! Will review tomorrow

Member

@JoeZiminski JoeZiminski left a comment

Hey @Diya910 thanks for this, this is really great stuff. The code is very clean; this is going to make a great feature. I have left a few comments on the code; they just suggest some minor refactorings to reduce code duplication where possible. For critical code, it makes sense to define the key parts in only one place, just in case they are changed later and the editor forgets to check all the places they are defined.

The tests are great for ensuring the feature works well. I have suggested a refactoring here to use our existing testing machinery, which I think should reduce some boilerplate; let me know if you have any questions about this. The tests should probably cover all three cases (dateto, timeto and datetimeto), happy to help with this.

I just pushed some fixes to the pre-commit on the CI which was failing, just some minor typing issues (see here for some detail on the pre-commit hooks). This should move on to the full test suite now.

Thanks again Diya this is nearly done! I just remembered we will also need to document this change, the contributing guide for this is here. It would make sense to add the new tags to this section. Happy to do this because the documentation can be a bit fiddly, but if you are interested in this please feel free to go ahead, let me know if you have any questions!

}
return tags[tag_name]


_DATETIME_FORMATS = {
Member

Great, thanks for this. Can this be refactored to be returned from a function rather than a dict with global scope, e.g.:

def get_datetime_formats():
    return {...}

The reason is that _DATETIME_FORMATS becomes a dictionary with global scope across the application, meaning if it is accidentally changed in one part of the code this will propagate everywhere. Wrapping it in a function means the scope is no longer global.

Member

Apologies I see what you did here, that's great. In this case, you can move _DATETIME_FORMATS directly into get_datetime_format

Author

Great suggestion, I will move it into the function


key = next((key for key in formats if key in name), None)
key = next(
(key for key in ["datetime", "time", "date"] if key in name), None
Member

here you can do datetime_keys = list(get_datetime_format().keys()) then

key for key in datetime_keys (just to avoid the re-definition of these keys)

Author

@Diya910 Diya910 Jun 26, 2025

datetime_keys = list(canonical_tags.get_datetime_format())
key = next((key for key in datetime_keys if key in name), None)

Like this??

Member

Oh I see apologies for the confusion, the return value of get_datetime_format is slightly different to that I thought.

What do you think of changing the function to take no arguments and return the entire dictionary, so that it can be indexed (instead of called)? For example format = get_datetime_format(format_type) becomes format = get_datetime_formats()[format_type].

Then above, the signature can be:

datetime_keys = list(canonical_tags.get_datetime_formats().keys())
key = next((key for key in datetime_keys if key in name), None)

This signature is now different to the tags() function, but it is probably a better design because it is more flexible. Now the datetime names are canonically defined in a central place, and we can grab them all or index them as we like. Later on tags() can be changed to follow the same design. Sorry this would be a bit of a pain to change all the calls in your code though. What do you think?
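
A minimal sketch of the proposed signature change (illustrative only):

```python
def get_datetime_formats():
    # Returning a fresh dict keeps the canonical formats out of
    # module-global scope; callers index the result directly.
    return {
        "datetime": "%Y%m%dT%H%M%S",
        "time": "%H%M%S",
        "date": "%Y%m%d",
    }

# Old style:  get_datetime_format("date")
# New style: index the returned dictionary instead.
fmt = get_datetime_formats()["date"]
datetime_keys = list(get_datetime_formats().keys())
print(fmt, datetime_keys)
# %Y%m%d ['datetime', 'time', 'date']
```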

Author

Thanks, that makes a lot of sense! I agree that returning the full dictionary from get_datetime_formats() would make the design cleaner and more flexible — especially for cases like this where we need all the keys. Happy to refactor the calls where needed. Also agree that aligning the design of tags() later on would help keep things consistent across the codebase. Will go ahead with this change!

Member

great!

try:
datetime.strptime(format_to_check, strfmt)
error_message = []
if not validate_datetime(format_to_check, key):
Member

I may be incorrect, but I don't think this needs to be in a try/except block anymore? Would the below work? (The order of the conditional is also reversed to make the positive case first, which is usually slightly more readable):

if validate_datetime(format_to_check, key):
    error_message = []
else:
    error_message = [
        get_datetime_error(
            key,
            name,
            canonical_tags.get_datetime_format(key),
            path_,
        )
    ]    

return error_message


return error_message


def validate_datetime(datetime_str: str, format_type: str) -> bool:
Member

This is a good name but because there are a few similar functions it could be slightly more explicit e.g. datetime_value_str_is_iso_format()

return False


def get_expected_num_datetime_values(format_type: str) -> int:
Member

Because this function get_expected_num_datetime_values and format_and_validate_datetime_search_str are only used in folders.py I think it makes sense to move them there (they could be under a Datetime Tag section or something)

match
): # We know this is true because format_and_validate_datetime_search_str succeeded
start_str, end_str = match.groups()
start_timepoint = datetime.strptime(
Member

These lines can also use the new datetime_object_from_string function

@@ -0,0 +1,217 @@
import glob
Member

This is a well thought out test script that puts a lot of emphasis on realistic tests which is excellent. I think on balance, here it would be easier to use some of the test functionality to actually make a project and some folders, then check these are found as expected. For example here.

The test code might look something like the below. The project fixture is inherited from the BaseTest class and automatically comes with set up and tear down. Thinking about it, you might as well test directly with project.transfer_custom and see that the correct folders are transferred. This will then check every cog in the machine:

class TestDateSearchRange(BaseTest):

    def test_date_search_range(self, project):

        sub_names = ["a list of example subs to test"]
        ses_names = ["a list of example ses to test"]  # it might actually be easier to test the ses and sub case separately

        test_utils.make_and_check_local_project_folders(
            project, "rawdata", sub_names, ses_names, ["behav", "ephys"]
        )

        project.upload_custom(...)  # some search strings

        transferred_subjects = (project.get_central_path() / "rawdata").glob("*")

        # now check that the correct files have been transferred

Member

also see here

Author

When I open the first example, it causes some disruption in the UI and I can't see where you are pointing. Can you please check it and share that again?

Member

Hi @Diya910, I think I accidentally copied from a commit, does this work?

it is the function test_wildcard_transfer in /tests/tests_integration/test_filesystem_transfer.py

assert found_dates == expected_dates


def test_simple_wildcard(temp_project_dir: Path):
Member

I think we can remove this function given this, but if it checks an extra case we can keep it, of course

search_with_tags(cfg, base_folder, local_or_central, [pattern])
assert "Invalid" in str(exc_info.value)


Member

Brilliant idea, this can be adjusted as suggested above but the core test is great

@JoeZiminski JoeZiminski linked an issue Jun 21, 2025 that may be closed by this pull request

Diya910 commented Jul 2, 2025

Hey @JoeZiminski, I have made the changes you required. There were a lot of changes, so I wasn't able to reply to each of them individually, but I made sure to address all of your suggestions. I have tested the changes against a draft test file and they are working fine. I haven't fully reworked the test file yet; it was a lot to do in one go. Once you confirm these changes I'll move ahead with refactoring the test file. I hope that's fine with you. If I missed any suggestion above, please point it out and I'll make those changes.

Member

@JoeZiminski JoeZiminski left a comment

Hey @Diya910 this is great, definitely good to go bar some very minor suggestions. Most of these are minor github code suggestions so you can directly commit them.

Apologies, one of my suggestions was actually worse than what was already there 😅 around the walrus operator. Sorry for the inconvenience of having to revert this.

After these changes are integrated I will message @Akseli-Ilmanen to test this manually while the other tests are being written. Let me know if you have any questions as you refactor the tests. Thanks again!


Diya910 commented Jul 4, 2025

@JoeZiminski I have made all the changes. Please have a look. I am not sure if I have removed the declarations the right way. Please let me know if you want me to change the docstrings in any specific way. Thank you

@JoeZiminski
Member

Hey @Diya910 sure I have merged main into this branch, there were a lot of fiddly conflicts that would be basically impossible to solve if you had not worked on the main branch changes. That's fixed now, I also made some changes related to typing to ensure the linter passes. Let me know if you have any questions!


Diya910 commented Aug 3, 2025

Okay thanks. I'll make the test cases now.


Diya910 commented Aug 14, 2025

@JoeZiminski I have made the test file covering all the test cases, and also used BaseTest this time. Can you please review it so that I can make further changes accordingly. As of now all the test cases are working properly. Thank you

@JoeZiminski
Member

Brilliant! @Diya910 I am teaching this week but will be able to review this next week, thanks!


Diya910 commented Aug 14, 2025

Yeah sure, no worries. I will be happy to work.

Member

@JoeZiminski JoeZiminski left a comment

Hey @Diya910 thanks again for these tests, they are excellent and cover not only the core cases but many useful additional cases. Please see the comments for a couple of suggestions: there is one edge case in which the bounds can be adjusted, and the regexp for the datetime needs to be slightly adjusted too, along with the corresponding tests. Once these changes are made I think all changes to the code will be done! Then just the documentation, I think we only really need a few lines there, and this will be good to go. Thanks again for this great contribution, this is a really cool feature.

strip_start_end_date_from_datetime_tag(search_str, format_type, tag)

# Replace datetime range with wildcard pattern
expected_len = get_expected_datetime_len(format_type)
Member

Only just noticed that for datetime the regex must be [A-Za-z0-9]{15}, whereas \d works for date and time only. I think we can change get_expected_datetime_len to get_datetime_regexp_format and return either "\d{6}" or "[A-Za-z0-9]{15}", then use these directly in the regexp. At the moment, the datetime case fails as the T is not recognised.

Author

There will actually be 3 types, because date will have 8 digits, so "\d{8}". Am I right? Should I change it accordingly?

Author

And one more bug I feel is here: the regex [A-Za-z0-9]{15} only loosely checks the length. If there is a number in place of the T it will still pass, because it is only checking the length.
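
One way to address both points, sketched with the hypothetical get_datetime_regexp_format name from the review, is to return an exact per-format pattern rather than a length:

```python
import re

def get_datetime_regexp_format(format_type):
    # Exact per-format patterns. For datetime this is stricter than
    # [A-Za-z0-9]{15}: a digit in place of the literal "T" is rejected.
    return {
        "date": r"\d{8}",
        "time": r"\d{6}",
        "datetime": r"\d{8}T\d{6}",
    }[format_type]

pattern = get_datetime_regexp_format("datetime")
print(bool(re.fullmatch(pattern, "20240315T120000")))  # True
print(bool(re.fullmatch(pattern, "202403151120000")))  # False: digit where "T" belongs
```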

class TestDateSearchRange(BaseTest):
"""Test date/time range search functionality with real datashuttle projects."""

def test_simple_wildcard_first(self, project):
Member

This is a neat and well-written test, but it can be removed from this module as this functionality is tested elsewhere. Here we can keep only tests related to the date search range.

"rawdata",
sub_names=subs,
ses_names=[
f"ses-{canonical_tags.tags('*')}_datetime-20240315{canonical_tags.tags('*')}",
Member

In this case, the @DATETO@ tag can be used e.g.:

    def test_datetime_range_transfer(self, project):
        """Test that wildcard matching works with datetime-tagged sessions."""
        subs = ["sub-001"]
        sessions = [
            "ses-001_datetime-20240301T080000",
            "ses-002_datetime-20240315T120000",
            "ses-003_datetime-20240401T160000",
            "ses-004_datetime-20240401T160001",
            "ses-005_datetime-20240415T200000",
        ]

        datatypes_used = test_utils.get_all_broad_folders_used(value=False)
        datatypes_used.update({"behav": True})
        test_utils.make_and_check_local_project_folders(
            project, "rawdata", subs, sessions, ["behav"], datatypes_used
        )

        project.upload_custom(
            "rawdata",
            sub_names=subs,
            ses_names=[
                f"ses-{canonical_tags.tags('*')}_20240315T120000{canonical_tags.tags('DATETIMETO')}20240401T160002",
                f"ses-{canonical_tags.tags('*')}_20240415T200000{canonical_tags.tags('DATETIMETO')}20240415T200000",
            ],
            datatype=["all"],
        )

        central_path = project.get_central_path() / "rawdata" / "sub-001"
        transferred_sessions = [ses.name for ses in central_path.glob("ses-*")]

        expected_sessions = [
            "ses-002_datetime-20240315T120000",
            "ses-003_datetime-20240401T160000",
            "ses-004_datetime-20240401T160001",
            "ses-005_datetime-20240415T200000",
        ]

        assert sorted(transferred_sessions) == sorted(expected_sessions)

assert sorted(transferred_subs) == sorted(expected_subs)

@pytest.mark.parametrize("project", ["full"], indirect=True)
def test_download_with_date_range(self, project):
Member

We can remove this test following the suggested changes to the main three tests above

],
datatype=["behav"],
)
assert "before start" in str(exc_info.value)
Member

Maybe the string example could include a few more words, just so as a reader it is clear what the general semantics behind the error message are. Same for Invalid .... below

project, "rawdata", subs, sessions, ["behav"], datatypes_used
)

with pytest.raises(Exception) as exc_info:
Member

Just for consistency with other tests in the codebase, the exc_info can just be called e


def test_subject_level_date_range(self, project):
"""Test date ranges work at the subject level too."""
subs = [
Member

Here it is enough, but for fun and to cover all cases we could make 6 folders:

  1. 2x with date, 2x with time, 2x with datetime

Then in the upload function we can have a list of three names (one date, one datetime, one time) that picks one of the sub names of each type. Then we can assert these three names are found. Also, this function could use the upload_or_download parameterization

]
assert sorted(downloaded_sessions) == sorted(expected_sessions)

def test_edge_case_exact_boundary_dates(self, project):
Member

This is a cool test. When testing, I ran into the problem that if there is just one folder e.g. ses-001_date-2024030 then ses-001_2024030@DATETO@2024030 does not transfer anything because of the <= used above (I left a suggested change there). In this case, maybe we can just test with one folder, and the search string can include only that folder. This then tests that tricky case and also serves as a test for exact boundary dates. We could use 3x folder names (one with date, one with time, one with datetime) as suggested above to cover all cases (then we would just test that the 3x folders are transferred)

Member

For good measure, we might as well also use the upload_or_download parameterization here

"rawdata",
sub_names=subs,
ses_names=[
f"ses-{canonical_tags.tags('*')}_20240315{canonical_tags.tags('DATETO')}20240401"
Member

For good measure, we could use the second tag as "ses-004_date-20240415" and check all three are found (then we have a test case at both boundaries + the middle)

@JoeZiminski
Member

Hey @Diya910, thanks again for this contribution. If it is easier, I am happy to take care of those suggestions, as there is nothing too substantial there. I think after this there are no more changes to the code required, only docs!

@Diya910
Author

Diya910 commented Aug 26, 2025

Hey @JoeZiminski, yes you may do the further changes. I was planning to do them this weekend, but if you want to take charge you may proceed. Thank you

@Diya910
Author

Diya910 commented Sep 2, 2025

Hey @JoeZiminski, I was trying to work on the PR yesterday but there are merge conflicts with the main branch. Can you please resolve them so that I can work on it further?

@JoeZiminski
Member

Hey @Diya910, apologies for the delay in replying. Please do feel free to work on the changes, it's much appreciated! I will sort out the merge now

@JoeZiminski
Member

JoeZiminski commented Sep 4, 2025

Hey @Akseli-Ilmanen, thanks again for this suggestion. This PR has only tests and docs to finalise, so the implementation should be working. It would be great if you could test it out and let us know if it's working well. Cheers!

EDIT:

To use the new feature, you can add a @DATETO@, @TIMETO@ or @DATETIMETO@ surrounded by the time range you want to restrict to e.g. sub-@*@_01012025@DATETO02022025

@Akseli-Ilmanen

Heya.

@Diya910 thanks so much for setting this up.

@JoeZiminski I am a bit new to programming, so I am not sure what's the best way to clone/install this branch and then test the functionality?

Also, I am a big fan of the datashuttle GUI. I was wondering whether the date selection functionality could be integrated there. See below for a little sketch. (I also added something extra about how one could select the subject via a selection tool; this could be really helpful if researchers have many/long subject names that they don't remember by heart.)

[image: sketch of the proposed GUI date-selection and subject-selection tools]

Hope I am not creating too much work for you guys. :D Really appreciate it!

Best
Akseli

@JoeZiminski
Member

Hey @Akseli-Ilmanen thanks for the suggestion! Any new ideas on how to improve the software are much appreciated, I'm glad you are finding the GUI useful. Would you be able to make a new issue with the suggestion above? This is a very cool idea and could be explored in a new PR, and there is a non-official textual date-picker widget that could be used for this.

To test, as you say the best way would be to clone the forked repository, and then check out the feature branch. You can make a new environment so you don't overwrite your existing datashuttle installation. The commands to download the code would be:

git clone https://github.com/Diya910/datashuttle.git
cd datashuttle
pip install -e .
git checkout date_feature

The -e flag tells pip not to install the package to the packages directory (e.g. where packages are located when you do pip show ...) but to use the package as-is in the location you install it from. That way, when you do things like change branch, the installed version of the package is updated. In general this is the best way to install packages for development (even if you are not planning on changing the repository code yourself).

@Akseli-Ilmanen

@JoeZiminski, thanks for the explanation on how to manage installation, very helpful!

Would your example work? I thought that the date has to be specified in YYYYMMDD format?

To use the new feature, you can add a @DATETO@, @TIMETO@ or @DATETIMETO@ surrounded by the time range you want to restrict to e.g. sub-@*@_01012025@DATETO02022025

@Akseli-Ilmanen

@JoeZiminski
Also, I got a bit confused about how to use wildcards. I tried a number of different things now, but none are working. Could you point out what's incorrect about:

from datashuttle import DataShuttle


project = DataShuttle("AI_data")
project.upload_custom(
    top_level_folder="derivatives",
    sub_names="all_sub",
    ses_names="ses-@*@_20250309@DATETO20250310",
    datatype="behav")

Also, when reading the documentation, I found it quite confusing in which ways @*@ can or cannot be used. Maybe it would be helpful to have a tutorial page with lots of examples of how one can and cannot use it. I think more examples would definitely help to understand the functionality.

@JoeZiminski
Member

Hi @Akseli-Ilmanen, I think it should work, except it's missing the final @ at the end of @DATETO@, e.g.:

from datashuttle import DataShuttle


project = DataShuttle("AI_data")
project.upload_custom(
    top_level_folder="derivatives",
    sub_names="all_sub",
    ses_names="ses-@*@_20250309@DATETO@20250310",
    datatype="behav"
)

Let me know how this goes! I also just pushed a fix (unrelated to this, but it would have revealed itself if you changed ses-@*@_... to ses-001_...). You can git pull in the datashuttle repo, and if you installed with pip install -e . the changes will be incorporated next time you run datashuttle. You can check with git log; you should see the most recent commit is Cover DATETO etc. cases in check_and_format_names...

Thanks for the suggestion on more examples that's a nice idea 🚀 , I'll raise a PR over the next couple of days, will keep you updated.

@Diya910 I pushed a fix in e2940b2 that adds the new tags to a list of exceptions to ensure they are not validated (they would otherwise fail validation because they include a non-alphanumeric value), and added some new tests to cover these cases.
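The shape of that fix can be sketched like this (tag list and function name are hypothetical; see commit e2940b2 for the actual change):

```python
# Transfer-only tags contain "@", which the NeuroBlueprint name
# validation would otherwise reject as a special character, so names
# containing them are exempted from that check.
TRANSFER_ONLY_TAGS = ("@*@", "@DATETO@", "@TIMETO@", "@DATETIMETO@")

def is_exempt_from_validation(name):
    # True if the name contains any transfer-only tag.
    return any(tag in name for tag in TRANSFER_ONLY_TAGS)
```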

@Akseli-Ilmanen

Ah oopsie. I fixed the @, but still get these errors:

When I run:

from datashuttle import DataShuttle

project = DataShuttle("AI_data")
project.upload_custom(
    top_level_folder="derivatives",
    sub_names="all_sub",
    ses_names="ses-@*@_20250309@DATETO@20250310",
    datatype="behav"
)

I get:

The sub names to transfer are: ['sub-01_id-Ivy', 'sub-02_id-Poppy', 'sub-03_id-Freddy']
The ses names to transfer are: []
The ses names to transfer are: []
The ses names to transfer are: []
No files included. None transferred.

And if I run with the ses-000 modification:

from datashuttle import DataShuttle


project = DataShuttle("AI_data")
project.upload_custom(
    top_level_folder="derivatives",
    sub_names="all_sub",
    ses_names="ses-000_20250309@DATETO@20250310",
    datatype="behav"
)

I get this error message:

The sub names to transfer are: ['sub-01_id-Ivy', 'sub-02_id-Poppy', 'sub-03_id-Freddy']
---------------------------------------------------------------------------
NeuroBlueprintError                       Traceback (most recent call last)
Cell In[11], line 5
      1 from datashuttle import DataShuttle
      4 project = DataShuttle("AI_data")
----> 5 project.upload_custom(
      6     top_level_folder="derivatives",
      7     sub_names="all_sub",
      8     ses_names="ses-000_20250309@DATETO@20250310",
      9     datatype="behav"
     10 )

File ~\Documents\Akseli\Code\datashuttle\datashuttle\utils\decorators.py:66, in check_configs_set.<locals>.wrapper(*args, **kwargs)
     60     log_and_raise_error(
     61         "Must set configs with make_config_file() "
     62         "before using this function.",
     63         ConfigError,
     64     )
     65 else:
---> 66     return func(*args, **kwargs)

File ~\Documents\Akseli\Code\datashuttle\datashuttle\utils\decorators.py:88, in check_is_not_local_project.<locals>.wrapper(*args, **kwargs)
     81     log_and_raise_error(
     82         "This function cannot be used for a local-project. "
     83         "Set connection configurations using `update_config_file` "
...
     80 """
     81 ds_logger.close_log_filehandler()
---> 82 raise exception(message)

NeuroBlueprintError: SPECIAL_CHAR: The name: ses-000_20250309@DATETO@20250310, contains characters which are not alphanumeric, dash or underscore.

@JoeZiminski
Member

Hey @Akseli-Ilmanen, thanks a lot for looking into this. I wonder if the code is definitely picking up the most recent version (in theory, those commands should work).

The code is run from ~\Documents\Akseli\Code\datashuttle. Could you please do git log in this directory and confirm that the most recent commit appears as:

[image: screenshot of the expected git log output]

Similarly could you also do the below before from datashuttle import DataShuttle:

import datashuttle
print(datashuttle.__file__)

to triple check that the location of the used package is correct (I was struggling with this problem myself earlier today).

Also, just to take a look at the session folders to transfer, could you please do in the rawdata folder:

find . -type d -name "ses-*" -exec ls -l {} \;

and copy the contents here (assuming you are happy to share the folders and file names in this path).

Thanks for taking the time to test this!

@Akseli-Ilmanen

Hi @JoeZiminski,

I think I initially pulled incorrectly; now I do get the Tue 9 Sept update in my git log.
[images: screenshots of git log and the printed package location]

find . -type d -name "ses-*" -exec ls -l {} \; — this I was not able to make work; is it a command I would write in my terminal in the subject folder? Here are some file explorer screenshots, maybe they suffice too.

[images: file explorer screenshots of the session folders]

Note that we decided to name all our session ids ses-000. We did this because, with our existing recording set-up and backup system, we ran into some inconsistencies when different recording machines (ephys vs video) had different numbers of session folders, and we wanted to use our own backup method to move them to a centralized storage location. E.g. from the ephys machine the session would be ses-003_dateXYZ and from the video machine it would be ses-004_dateXYZ. Although they belong to the same recording session _dateXYZ, it's hard to coordinate them getting the same session number across machines that are not linked. If you have any thoughts on that, I would be curious to hear.

@JoeZiminski
Member

Hey @Akseli-Ilmanen, thanks for sharing this! No worries about that command (yes, that's correct, it was to run in the system terminal from within the rawdata folder; apologies I was not clear), but the images work well. I guess since updating (maybe do pip install -e .[dev] from within datashuttle again, just to be sure) the command is still not working?

I think it might be because of the trailing _01 and _02 on the folder names, technically this is not NeuroBlueprint because it is not part of a key-value pair and so the date detection might be getting confused. Would it be possible to add a key to this value e.g. condition-01? (out of interest what does it represent?)

That's a good point about syncing across different machines, it's a difficult issue to work around (#373). One approach is to manually set the session number in a way that makes sense to you (e.g. if an animal has a behaviour session, then a behaviour + ephys session, the first session is ses-001 (in the behav folder) and the second session is ses-002 (in the behav and ephys folders)). In this case the session id represents the animal's overall session, rather than the ephys or behav session specifically.

Alternatively, we recently introduced a PR #575 that is more flexible on the sub- and ses- keys, allowing any alphanumeric character (e.g. ses-20250101T121212 would be allowed). Currently, date, time and datetime are only checked if they are under the date/time/datetime key. However, other users have expressed interest in having the datetime as the ses- value e.g. ses-20250101. We could in theory extend the date functionality to the sub- or ses- key. In this case, the date/time/datetime key would be preferred, but we can also check the sub- and ses- key, and if it is a valid date/time/datetime, use it to filter sessions in this PR. You could do something like ses-20250309@DATETO@20250310.

Do you think either of these approaches would help with this issue?
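If that extension were implemented, the detection step might look something like this sketch (purely illustrative; not code from this PR, and the function name is made up):

```python
from datetime import datetime

# Formats a bare ses-/sub- value could be checked against, in order of
# specificity: datetime first, then date, then time.
CANDIDATE_FORMATS = ("%Y%m%dT%H%M%S", "%Y%m%d", "%H%M%S")

def parse_bare_ses_value(value):
    """Return (format, parsed datetime) if the bare value is a valid
    datetime, date or time, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return fmt, datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None
```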

@Akseli-Ilmanen

Akseli-Ilmanen commented Sep 12, 2025

Hi, @JoeZiminski Yes, I think the trailing _01 are the problem. When I find some time, I will have to rename some folders. We added them ad-hoc, when we realized that sometimes we have more than 1 session per day.

Yeah, cross-machine synchronization seems tricky. Something like a "sync metadata" (#373) solution across machines would be great.

The problem with this approach is that users have to keep track of which session number have been created on all PCs. It's easy to keep track of whether one is doing session 1 or 2 during one day, but more tricky for a user to keep track that we are on session 32 on the behav PC vs session 16 on the ephys machine or so.

the first session is ses-001 (in behav folder) and the second session is ses-002 (in behav and ephys folder). In this case the session id represents the animals overall session, rather than the ephys or behav session specifically.

Yes, ses-20250101T121212 with @DATETO@ functionality would be great. I think moving forward for our lab, I would like to do the following: ses-000_date-20250407_01 -> ses-20250407_n-01. Would this adhere to NeuroBlueprint? When a user starts a session, they just have to specify n-01 or n-02, and the date is assumed to be today's date. So, I am looking forward to the ses-20250309@DATETO@20250310 functionality

I think doing ses-000_date-20250407_01 -> ses-20250407_n-01 is something I can very easily do programmatically, and I don't need to manually go renaming folders. But in case you recommend a different solution, I am very open to suggestions!
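The rename described above could be done programmatically with a sketch like this (the regex pattern and helper name are assumptions based on the example folder names):

```python
import re

def planned_name(name):
    # Map e.g. "ses-000_date-20250407_01" -> "ses-20250407_n-01";
    # names that don't match the pattern are returned unchanged.
    match = re.fullmatch(r"ses-\d+_date-(\d{8})_(\d+)", name)
    if match is None:
        return name
    return f"ses-{match.group(1)}_n-{match.group(2)}"
```

This could then be applied over the session folders with pathlib's Path.rename, ideally after printing the planned names first as a dry run.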

Development

Successfully merging this pull request may close these issues:

Search within date range

4 participants