Create ssh_repo_elm.md #37
Open
lunebellec wants to merge 2 commits into main from datalad_ssh_repo
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,108 @@ | ||
| ## 🧠 Datalad for Students: Minimal Reproducible Workflow | ||
|
|
||
| ### 📦 1. Create a Datalad Dataset for Data (on `elm`) | ||
|
|
||
| **On your local machine:** | ||
|
|
||
| ```bash | ||
| datalad create -c text2git REPONAME | ||
| cd REPONAME | ||
| ``` | ||
|
|
||
| **Annex the big files (e.g. CSVs):** | ||
| It is important to configure `.gitattributes` properly so that the right files get annexed. The `text2git` configuration typically stores text files in `git` instead of annexing them, which means text formats such as CSV are not annexed by default. More info in the [datalad documentation](https://handbook.datalad.org/en/latest/basics/101-124-procedures.html). You may therefore want to set rules manually to make sure the content you want annexed really is. For example, if you plan to store your data in `csv` files: | ||
|
|
||
| ```bash | ||
| echo "*.csv annex.largefiles=anything" >> .gitattributes | ||
| datalad save -m "Set annex rules for CSVs" | ||
| ``` | ||
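Before saving real data, it can help to confirm which files a rule will actually catch. The sketch below is not part of the original workflow: it uses plain `git check-attr` in a throwaway repository (the `demo` directory and the file names `data.csv` and `notes.txt` are made up for illustration) to show that the `*.csv` rule matches CSV files and nothing else.

```bash
# Sketch: check which paths a .gitattributes rule sends to the annex.
# Uses a throwaway repository so it cannot touch real data.
demo=$(mktemp -d)
git init -q "$demo"

# Same rule as above, written into the throwaway repository.
echo "*.csv annex.largefiles=anything" >> "$demo/.gitattributes"

# git check-attr reports the attribute value for a path,
# even if the file does not exist yet.
csv_rule=$(git -C "$demo" check-attr annex.largefiles -- data.csv)
txt_rule=$(git -C "$demo" check-attr annex.largefiles -- notes.txt)

echo "$csv_rule"   # data.csv: annex.largefiles: anything
echo "$txt_rule"   # notes.txt: annex.largefiles: unspecified
```

In a real dataset created with `text2git`, the second command would report the rule set by that configuration rather than `unspecified`; run `git check-attr` from inside the dataset to see the effective value.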
|
|
||
| **Add data to the repository:** | ||
| Add files to the repository, then save its current state with the following command: | ||
| ``` | ||
| datalad save -m "Adding some data" | ||
| ``` | ||
| **Create a new sibling of the repository on `elm`:** | ||
| (update the path to a location under your own USERNAME): | ||
| ```bash | ||
| datalad create-sibling \ | ||
| --name elm \ | ||
| ssh://elm/data/simexp/USERNAME/REPONAME \ | ||
| --existing=skip | ||
| ``` | ||
| **Push data to `elm`:** | ||
| You can now easily maintain a versioned backup of your data on `elm`. | ||
| ``` | ||
| datalad push --to elm | ||
| ``` | ||
|
|
||
| **Create a GitHub record of metadata:** | ||
| First, create a repo called REPONAME on GitHub, under some organization ORGNAME (for example `courtois-neuromod`). Keep it blank, with no README or LICENSE. Then add this repo as a sibling of the dataset: | ||
| ```bash | ||
| datalad siblings add -s origin --url git@github.com:ORGNAME/REPONAME.git | ||
| ``` | ||
|
|
||
| **Push Git-only metadata to GitHub (optional):** | ||
| It is now easy to push metadata to GitHub: | ||
| ``` | ||
| datalad push --to origin | ||
| ``` | ||
| Note that if you misconfigured datalad, you may push sensitive data to GitHub. First, check with `ls -alsh` that the sensitive files appear as symbolic links pointing into git-annex rather than as actual files. Second, keep the GitHub repo private until you're sure no sensitive data was pushed by mistake. If you did push sensitive data, just delete the repository and start fresh if you can. Otherwise you'll need to edit the git and git-annex history of the repository, good luck :/ | ||
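The symlink check can also be scripted rather than read off `ls` output. The following is a minimal sketch, not part of the original instructions: it builds a throwaway directory (the file names `leaked.csv` and `annexed.csv` and the link target are invented) and uses `find -type f` to list CSV files that are regular files rather than symbolic links. It assumes the default git-annex layout where annexed files are symlinks; in unlocked or adjusted-branch mode annexed files are regular files, so this check would not apply.

```bash
# Sketch: list CSV files that are regular files instead of annex symlinks.
demo=$(mktemp -d)

# A regular file: its content would be committed to git directly.
printf 'id,value\n1,2\n' > "$demo/leaked.csv"

# A dangling symlink standing in for an annexed file (target is made up).
ln -s .git/annex/objects/EXAMPLE "$demo/annexed.csv"

# -type f matches regular files only; symlinks are skipped.
not_annexed=$(find "$demo" -name '*.csv' -type f -exec basename {} \;)

echo "$not_annexed"   # leaked.csv
```

In a real dataset, replace `"$demo"` with the dataset path and add `-not -path '*/.git/*'` to skip the annex object store; any file the command prints is worth a second look before pushing.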
|
|
||
| --- | ||
|
|
||
| ### 👩‍💻 2. For Students: Install and Use | ||
|
|
||
| **Clone the dataset from GitHub or `elm`:** | ||
|
|
||
| ```bash | ||
| # Option A: from GitHub (metadata only) | ||
| datalad install git@github.com:courtois-neuromod/image10k-zooniverse.git | ||
|
|
||
| # Option B: from elm (with the actual data) | ||
| datalad install ssh://elm/data/simexp/pbellec/image10k-zooniverse.git | ||
| ``` | ||
|
|
||
| **Navigate and get data:** | ||
|
|
||
| ```bash | ||
| cd image10k-zooniverse | ||
| datalad get EXAMPLEFILE.csv | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ### 🖼 3. Managing Outputs (Optional) | ||
|
|
||
| **Create a separate dataset for outputs:** | ||
|
|
||
| ```bash | ||
| datalad create image10k-zooniverse.plots | ||
| cd image10k-zooniverse.plots | ||
|
|
||
| echo "*.png annex.largefiles=anything" >> .gitattributes | ||
| datalad save -m "Track plots in annex" | ||
| ``` | ||
|
|
||
| **Link it back into the analysis repo:** | ||
|
|
||
| ```bash | ||
| cd ../image10k-zooniverse | ||
| datalad install -d . -s ../image10k-zooniverse.plots plots | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ### ⚠️ Tips & Troubleshooting | ||
|
|
||
| * If `datalad get` fails with `annex-ignore`, you likely cloned from GitHub only. Clone once from `elm` to propagate sibling config. | ||
| * To inspect siblings: | ||
|
|
||
| ```bash | ||
| datalad siblings | ||
| ``` | ||
|
|
||
| * To pull updates (add `-r` to also recurse into subdatasets): | ||
|
|
||
| ```bash | ||
| datalad update --merge | ||
| ``` | ||
Using `--as-common-datasrc NAME` (see above) would fix that. Or set the created sibling as autoenabled afterward: `git-annex configremote elm autoenable=true`.
So this has been a point I'm still struggling with!! I could not get it to work such that installing from GitHub would download from elm. So if I add `--as-common-datasrc` when I create the `elm` sibling, it should fix it? Or is that configuration staying local?
OK, I experimented a bit and could not get it to work. I tried removing the elm sibling, then adding it back with:
`datalad siblings add --name elm --url ssh://elm/data/simexp/pbellec/image10k-zooniverse --as-common-datasrc origin`
Got this error:
`add-sibling(impossible): . (sibling) [cannot configure as a common data source, URL protocol is not http or https] .: elm(+) [ssh://elm/data/simexp/pbellec/image10k-zooniverse (git)]`
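The error message above says the common-data-source option only accepts `http` or `https` URLs. As a rough illustration of that condition only (this is not datalad's own check, and `is_http_url` is a hypothetical helper name), a script could test the URL scheme before passing `--as-common-datasrc`:

```bash
# Hypothetical helper: succeed only if the URL starts with http:// or https://.
is_http_url() {
  case "$1" in
    http://*|https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# URL taken from the comment above.
url="ssh://elm/data/simexp/pbellec/image10k-zooniverse"

if is_http_url "$url"; then
  verdict="http(s) URL: eligible as a common data source"
else
  verdict="not an http(s) URL: expect the 'cannot configure as a common data source' error"
fi

echo "$verdict"
```

With the `ssh://elm/...` URL from the thread, the second branch is taken, which matches the error reported.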