How to better handle Jupyter Notebooks #1152
adamjstewart
started this conversation in
General
Replies: 2 comments 2 replies
-
RE diffs/review, I see a lot of people using https://www.reviewnb.com/ which seems like an option for ignoring whitespace changes when reviewing.
-
PyTorch uses sphinx-gallery. This is definitely an alternative to jupytext if we can't get it working, although jupytext is more popular.
-
We use Jupyter Notebooks in TorchGeo for our tutorials. Before we start to greatly expand these for formal conference tutorials, there are a few chronic issues we should deal with first.
Issues
Some issues I've always had with notebooks.
Formatting
We run tools like black/flake8/isort/mypy/pydocstyle/pyupgrade on our .py files to enforce PEP 8, perform static type analysis, and check docstring conventions. With the exception of black, none of these tools directly support .ipynb files. This is relatively minor compared to the other issues, but it is nevertheless a long-standing issue I've had with notebooks.
Diff
The biggest issue I have with notebooks is that it is impossible to review changes to notebooks on GitHub. If you save a notebook locally, it uses 1-space indentation. But if you save it on Colab, it uses 2-space indentation. So every single line of the notebook is "changed". Also, Colab adds thousands of lines of useless metadata that change every time you run the notebook. And the output images that get saved add thousands more lines.
For an example of the severity of this, see #1124, where I actually simplified our notebooks by removing several sections in the indices tutorial, and accidentally added 10K lines in the process. The diffs can't even be viewed for any of the files where I generated new output. And even if I regenerate on Colab, there are still thousands of lines that change every single time.
Testing
Our notebook tests haven't passed in almost a year. #1124 fixes that, but there are still occasional bugs that pop up. For example, if I save the output of the trainers tutorial, the tests no longer pass, but if I strip the output, they do. The nbmake developer is looking into this, as well as our report that failing tests do not produce useful error messages: treebeardtech/nbmake#80.
Possible Solutions
Some tools we could potentially use to alleviate these issues.
nbqa
nbqa is a tool that lets you run black/flake8/isort/mypy/pydocstyle/pyupgrade on your notebooks. This seems like a no-brainer to add, even if it means yet another dependency.
jq
Jupyter Notebooks are simply JSON files. jq is a popular command-line JSON processor that could be used to autoformat notebooks and fix their indentation.
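To illustrate the idea of normalizing indentation, here is a minimal sketch that uses Python's standard json module instead of jq. The function name normalize_notebook is hypothetical, and the choice of 1-space indentation with sorted keys is an assumption made to get a deterministic result, not a statement about what any particular tool emits.

```python
import json


def normalize_notebook(text: str, indent: int = 1) -> str:
    """Re-serialize notebook JSON with fixed indentation and key order.

    Hypothetical helper: indent=1 and sort_keys=True are assumptions
    chosen so that the same notebook always serializes the same way.
    """
    notebook = json.loads(text)
    # ensure_ascii=False keeps non-ASCII characters readable in the file
    return json.dumps(notebook, indent=indent, sort_keys=True, ensure_ascii=False) + "\n"


# The same tiny notebook serialized with two different indentations
two_space = json.dumps({"nbformat": 4, "cells": []}, indent=2)
one_space = json.dumps({"nbformat": 4, "cells": []}, indent=1)

# After normalization, the two files are byte-for-byte identical
print(normalize_notebook(two_space) == normalize_notebook(one_space))
```

Running something like this on every notebook before committing would remove indentation-only changes from the diff, though it does nothing about metadata or outputs.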
nbstripout
In a way, notebooks are both source files (contain code and text) AND built files (contain execution output). You're not really supposed to store built files in git, you're only supposed to store source files. nbstripout is a tool that lets you remove all execution output from a notebook. This could be used to clean up notebooks and keep diffs minimal.
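As a rough picture of what stripping output means at the JSON level, the sketch below clears the outputs and execution_count fields of code cells. It is not nbstripout's implementation, only an illustration using the standard library; strip_outputs is a hypothetical name and the sample notebook is made up.

```python
import copy


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with execution results removed.

    Hypothetical helper for illustration; the real nbstripout tool
    handles more cases (metadata, widgets, and so on).
    """
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []  # drop printed text, images, tracebacks
            cell["execution_count"] = None  # drop the In[n] counter
    return stripped


# Made-up notebook with one executed code cell and one markdown cell
example = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": "hi\n"}],
            "source": "print('hi')",
        },
        {"cell_type": "markdown", "metadata": {}, "source": "# Title"},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

clean = strip_outputs(example)
```

Only the source of each cell survives, which is the part a reviewer actually needs to see in a diff.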
jupytext
You may have noticed that the PyTorch tutorials are actually stored as .py files. I haven't actually figured out how PyTorch does this, but there are tools out there for this.
One option is jupytext, where notebooks can be stored as text files and converted to notebooks only for the docs. This allows autoformatters to run on raw .py files and has better support for diffs and text editors.
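To make the text-file idea concrete, here is a toy converter from a .py script with "# %%" cell markers (the style jupytext calls the percent format) to a notebook-shaped dict. This is not jupytext's code or API, just a standard-library sketch of the concept; the function name and the sample script are hypothetical, and real scripts have edge cases this ignores.

```python
def percent_script_to_notebook(script: str) -> dict:
    """Split a '# %%' delimited script into notebook cells (toy version)."""
    raw_cells = []
    for line in script.splitlines():
        if line.startswith("# %%"):
            # A marker line starts a new cell; "[markdown]" selects the type
            kind = "markdown" if "[markdown]" in line else "code"
            raw_cells.append({"cell_type": kind, "lines": []})
        elif raw_cells:
            # Lines before the first marker are ignored in this sketch
            raw_cells[-1]["lines"].append(line)

    cells = []
    for raw in raw_cells:
        lines = raw["lines"]
        if raw["cell_type"] == "markdown":
            # Markdown is stored as comments; drop the leading "# "
            lines = [ln[2:] if ln.startswith("# ") else ln.lstrip("#") for ln in lines]
        cell = {
            "cell_type": raw["cell_type"],
            "metadata": {},
            "source": "\n".join(lines).strip("\n"),
        }
        if raw["cell_type"] == "code":
            cell["execution_count"] = None
            cell["outputs"] = []
        cells.append(cell)
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


script = "# %% [markdown]\n# # Title\n\n# %%\nprint('hi')\n"
notebook = percent_script_to_notebook(script)
```

Because the stored file is ordinary Python, the formatters and linters mentioned above can run on it directly, and diffs contain only code and text.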
nbsphinx
nbstripout and jupytext can be used to reduce the notebook to only the bare minimum source, but it can be nice to display outputs and plots in the documentation. We already use nbsphinx to convert our notebooks to HTML, but it can also be used to actually execute the notebook when generating the docs. There are many problems with this (our notebooks can take hours to run and some output requires a GPU), but if we can figure out a way to cache this and only run it when a notebook is modified, this could be a good solution.
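One possible way to decide whether a notebook was modified is to hash only its cell sources and compare against a digest saved from the last run. The sketch below assumes that approach; source_digest and needs_execution are hypothetical helpers, not part of nbsphinx, and where the cached digest is stored is left open.

```python
import hashlib
import json
from typing import Optional


def source_digest(notebook: dict) -> str:
    """Hash cell types and sources only, ignoring outputs and metadata."""
    sources = []
    for cell in notebook.get("cells", []):
        src = cell.get("source", "")
        if isinstance(src, list):
            # The notebook format allows source as a list of lines
            src = "".join(src)
        sources.append([cell.get("cell_type"), src])
    payload = json.dumps(sources).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_execution(notebook: dict, cached_digest: Optional[str]) -> bool:
    """True if there is no cached digest or the sources have changed."""
    return source_digest(notebook) != cached_digest


# Made-up notebooks: same source, different outputs, then an edited source
before = {"cells": [{"cell_type": "code", "source": "x = 1", "outputs": []}]}
rerun = {"cells": [{"cell_type": "code", "source": "x = 1", "outputs": [{"text": "1"}]}]}
edited = {"cells": [{"cell_type": "code", "source": "x = 2", "outputs": []}]}

cached = source_digest(before)
```

With this scheme, regenerating outputs alone would not trigger a rebuild, while any edit to code or text would.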
Please comment with any other issues you've noticed or tools you know of. And let me know if you have any preferences or opinions about these solutions.