Commit 8b5387f

Re-organize and add more helpful tools (#19)
* Reformat based on tool use-case
  Add more tools we use, re-orient, ref #2
  Co-authored-by: Matt Thompson <[email protected]>
* Add short marimo description
* Add short description of Markdownlint
  remove some redundancy
* Add creator preface for all GH links

---------

Co-authored-by: Matt Thompson <[email protected]>
1 parent 224e5c8 commit 8b5387f

Lines changed: 77 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -1,30 +1,42 @@
11
# Helpful Tools for your Workflow
22

3-
This page is dedicated to tools that can facilitate or improve project workflows. If there's something you use regularly that you think should be on this list, please [suggest it](https://github.com/Imageomics/Collaborative-distributed-science-guide/issues)!
3+
This page is dedicated to tools that can facilitate or improve project workflows. There are many available options to solve most of these challenges; these suggestions are based on tools used regularly within our community. If there's something you use regularly that you think should be on this list, please [suggest it](https://github.com/Imageomics/Collaborative-distributed-science-guide/issues)!
44

5-
## Jupytext
5+
## Better Version Control for Notebooks
66

7-
If you use Jupyter Notebooks in your project (as many of us do), you may want to consider adding [Jupytext](https://jupytext.readthedocs.io/en/latest/) to your repertoire. [Jupytext](https://github.com/mwouts/jupytext) allows you to [pair](https://github.com/mwouts/jupytext#paired-notebooks) a Jupyter Notebook to a `.py` (or `.md`) file so that `git` renders clearer and more informative diffs, showing only the code and markdown cells that have been updated between commits.
7+
When working with Notebooks that may be re-run without changes to code, it's particularly hard in a collaborative setting to keep track of these changes&mdash;or lack thereof. [Jupytext](#jupytext) and [marimo](#marimo) are a couple of useful tools to address this challenge and improve the Notebook experience.
8+
9+
### Jupytext
10+
11+
If you use Jupyter Notebooks in your project (as many of us do), you may want to consider adding [Jupytext](https://jupytext.readthedocs.io/en/latest/) to your repertoire. [mwouts/jupytext](https://github.com/mwouts/jupytext) allows you to [pair](https://github.com/mwouts/jupytext#paired-notebooks) a Jupyter Notebook to a `.py` (or `.md`) file so that `git` renders clearer and more informative diffs, showing only the code and markdown cells that have been updated between commits.
812
This makes it easier to see the differences between versions as you work through your project. For instance, if you re-ran your notebook with just a new random seed, the diff in the commit would show that without reproducing the whole thing, and you could go look at the output in the notebook.
913

10-
### How it Works
14+
#### How it Works
1115

1216
Notebooks can be [paired](https://github.com/mwouts/jupytext#paired-notebooks) individually, or you can set a [global config](https://jupytext.readthedocs.io/en/latest/config.html) in your notebooks folder to generate a pairing automatically. Unfortunately, this automated pairing only works if you use Jupyter Lab (i.e., run notebooks through the terminal), not if you work in VS Code or other IDEs. [Manual pairing](https://github.com/mwouts/jupytext/blob/main/docs/faq.md#can-i-use-jupytext-with-jupyterhub-binder-nteract-colab-saturn-or-azure) code is given below.
1317

14-
#### Jupytext commands in terminal for VS Code
18+
##### Jupytext commands in terminal for VS Code
1519

1620
```bash
jupytext --set-formats ipynb,py:percent <notebook-name>.ipynb # Pair a notebook to a py script
jupytext --sync <notebook-name>.ipynb # Sync the two representations
```
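
For orientation, a script paired in the `py:percent` format is plain Python in which each cell starts with a `# %%` comment, and markdown cells are stored as comments under `# %% [markdown]`. The sketch below is a hypothetical notebook body (the seed, variable names, and computation are illustrative assumptions, not taken from any project); it shows why re-running with only a new seed would produce a one-line diff in the paired file.

```python
# %% [markdown]
# # Sampling demo (hypothetical notebook content)

# %%
import random

SEED = 42  # changing only this value yields a one-line diff in the paired .py file
rng = random.Random(SEED)

# %%
# Draw a reproducible sample; cell outputs live in the .ipynb, not in this script.
sample = [rng.randint(0, 100) for _ in range(5)]
print(sample)
```

Because the outputs are kept out of the script, the diff stays limited to the code and markdown that actually changed.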
2024

21-
#### But wait! ...There's another way to automate it!
25+
##### But wait! ...There's another way to automate it!
2226

2327
There is a [jupytext pre-commit hook](https://jupytext.readthedocs.io/en/latest/using-pre-commit.html) that can be used to sync your paired files automatically when updating your GitHub repo. To learn more about pre-commit hooks in general, see the [git docs on pre-commit hooks](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks).
2428

25-
## Ruff
29+
### Marimo
30+
31+
[marimo](https://marimo.io/) functions similarly to a Jupyter Notebook, but has many built-in reproducibility and error-avoidance features, including the fact that it saves as a Python program (similar to the paired file created by [Jupytext](#jupytext)). See the summary in their [README](https://github.com/marimo-team/marimo?tab=readme-ov-file) or explore the [docs](https://docs.marimo.io/) to get started.
32+
33+
## Formatting and Linting
2634

27-
[Ruff](https://github.com/astral-sh/ruff) is a fast python formatter and linter. You can install it with `pip install ruff` or `conda install ruff` in your virtual/conda environment. They also have extensions for [VS Code](https://github.com/astral-sh/ruff-vscode) and [other editors supporting LSP](https://github.com/astral-sh/ruff-lsp).
35+
Have you found yourself saying, "I just need to clean up my code first"? Make this easier, and do it as you go, with linters! Additionally, inconsistent formatting can hurt code consistency and readability, alter how Markdown is displayed, and add noise to version control diffs. [Ruff](#ruff) and [markdownlint](#markdownlint) are two tools designed to address this challenge for Python and Markdown, respectively.
36+
37+
### Ruff
38+
39+
Fast _Python_ formatter and linter. You can install [astral-sh/ruff](https://github.com/astral-sh/ruff) with `pip install ruff` or `conda install ruff` in your virtual/conda environment. They also have extensions for [VS Code](https://github.com/astral-sh/ruff-vscode) and [other editors supporting LSP](https://github.com/astral-sh/ruff-lsp).
2840

2941
To format a file, run:
3042

@@ -39,3 +51,60 @@ ruff check <path/to/file>
3951
```
4052

4153
Ruff can also be set up as part of a pre-commit hook or GitHub Workflow. See their [Usage section](https://github.com/astral-sh/ruff?tab=readme-ov-file#usage) for more information.
54+
55+
### Markdownlint
56+
57+
Fast _Markdown_ formatter and linter. We use the [DavidAnson/markdownlint](https://github.com/DavidAnson/markdownlint) package for this site; see the instructions and example in the [linting section](https://github.com/Imageomics/Collaborative-distributed-science-guide/blob/main/CONTRIBUTING.md#linting) of our contributing guidelines. It is flexible in configuration and can either simply check for, or automatically fix, straightforward formatting issues.
58+
59+
## FAIR Data Access and Validation
60+
61+
Don't add to the reproducibility crisis! Are you using existing data accessed through URLs and need to ensure consistency for re-use? Do you have a folder of images with all their metadata documented through their filenames? [Cautious Robot](#cautious-robot) and [Sum Buddy](#sum-buddy) are here to help.
62+
63+
### Cautious Robot
64+
65+
Simple image-from-CSV downloader. The [Imageomics/cautious-robot](https://github.com/Imageomics/cautious-robot) package provides a FAIR and Reproducible method for **downloading a collection of images from URLs**.
66+
67+
- Configurable wait time and max attempts for retry.
68+
- Names images by given column with unique values.
69+
- Logs all successful responses and errors for review after download.
70+
- Uses [sum-buddy](#sum-buddy) to record checksums of all downloaded images.
71+
- Performs minimal check that the number of expected images matches the number sum-buddy counts.
72+
73+
**Optional features:**
74+
75+
- Organize images into subfolders based on any column in CSV.
76+
- Create square images for modeling:
77+
    - Organizes images in a second directory (same format) with copies of images in specified size.
78+
- **Buddy-check:** verifies all expected images downloaded intact (compares given checksums with sum-buddy output).
79+
80+
#### Sample Command
81+
82+
Given a CSV (`example.csv`) with a list of image URLs in a `file_url` column with `filename` providing unique IDs for each image, the following snippet will download these into an `example_images/` directory and validate the contents with provided MD5 hashes from the `md5` column of the CSV.
83+
84+
```console
cautious-robot --input-file example.csv --output-dir example_images -v "md5"
```
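
To make the idea of checksum validation concrete, here is a minimal standard-library sketch of comparing downloaded files against expected MD5 values from a CSV. It is an illustration of the concept only, not cautious-robot's implementation; the helper names `md5_of` and `find_mismatches` are hypothetical, and it assumes each file on disk is named exactly as in the `filename` column.

```python
import csv
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(csv_path: Path, image_dir: Path,
                    name_col: str = "filename", hash_col: str = "md5") -> list[str]:
    """List filenames that are missing or whose MD5 differs from the CSV value.

    Assumption: files in image_dir are named exactly as in name_col.
    """
    mismatches = []
    with csv_path.open(newline="") as handle:
        for row in csv.DictReader(handle):
            image_path = image_dir / row[name_col]
            if not image_path.is_file() or md5_of(image_path) != row[hash_col].lower():
                mismatches.append(row[name_col])
    return mismatches
```

Under these assumptions, `find_mismatches(Path("example.csv"), Path("example_images"))` would return the names of any images that are missing or altered, and an empty list when everything matches.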
87+
88+
To download larger (10-100M image scale), more distributed datasets to HPC systems, please see [Imageomics/distributed-downloader](https://github.com/Imageomics/distributed-downloader).
89+
90+
### Sum Buddy
91+
92+
Simple and flexible checksum calculator, from a single file to an entire directory. The [Imageomics/sum-buddy](https://github.com/Imageomics/sum-buddy) package provides a FAIR and Reproducible method for **duplicate file identification**, efficient **metadata generation**, and general **file integrity and validation** support.
93+
94+
- Input: Folder with things to checksum.
95+
- Output: CSV or printout of filepaths, filenames, and checksums.
96+
- Options:
97+
    - Ignore subfolders and patterns,
98+
    - Hash algorithm to use,
99+
    - Avoid hidden files and directories.
100+
- Usage: Run as a CLI or with exposed Python methods.
101+
102+
#### Sample Use Case
103+
104+
Given a collection of images (e.g., in an `images/` directory) with no accompanying metadata, quickly generate a metadata file listing the filepaths, filenames, and checksums of all images contained in the folder. Note the option to include an "ignore file". This operates similarly to a `.gitignore`, allowing one to exclude particular files or file types. In this case, let's assume there may be some `.doc` or similar files included with the images. Hidden files and directories (e.g., `.DS_Store`) are ignored by default.
105+
106+
```console
sum-buddy --output-file metadata.csv --ignore-file .sbignore images/
```
109+
110+
The added benefit to this method of metadata CSV generation is the ability to quickly and easily check for duplicate images within a collection. See our [data training repo](https://github.com/Imageomics/data-workshop-AH-2024) to learn more about this subject.
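
As a rough illustration of that duplicate check, the sketch below groups the rows of a checksum CSV by hash and reports any hash shared by more than one file. The function name `find_duplicates` is hypothetical, and the column names `filepath` and `md5` are assumptions; check the header of your own metadata file and adjust them as needed.

```python
import csv
from collections import defaultdict
from pathlib import Path


def find_duplicates(csv_path: Path, path_col: str = "filepath",
                    hash_col: str = "md5") -> dict[str, list[str]]:
    """Map each checksum shared by two or more files to those files' paths.

    Assumption: path_col and hash_col match the CSV header; both are configurable.
    """
    by_hash = defaultdict(list)
    with csv_path.open(newline="") as handle:
        for row in csv.DictReader(handle):
            by_hash[row[hash_col]].append(row[path_col])
    # Keep only checksums that occur more than once, i.e. likely duplicate files.
    return {digest: paths for digest, paths in by_hash.items() if len(paths) > 1}
```

Calling `find_duplicates(Path("metadata.csv"))` would then return an empty dictionary when every file is unique.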
