Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow stubgen recursively from CMake #463

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

laggykiller
Copy link
Contributor

No description provided.

@laggykiller
Copy link
Contributor Author

laggykiller commented Mar 10, 2024

As always, pypy3.9 decides to randomly fail during test ¯\(ツ)

Looks like pypy has newer release (v7.3.15), perhaps updating it in ci.yml would solve this problem?

@wjakob
Copy link
Owner

wjakob commented Mar 10, 2024

Looks like pypy has newer release (v7.3.15), perhaps updating it in ci.yml would solve this problem?

Good idea, I will try 🤞

Regarding the PR: RECURSIVE does not necessarily require OUTPUT_DIR to be specified. In that case, stubgen will place the generated stub directly into the module directory, which I daresay is usually what you want. Perhaps you could mention this in the documentation RST file?

@laggykiller
Copy link
Contributor Author

laggykiller commented Mar 10, 2024

But what if user specify both --output-dir and --output-file? Looks like it prefers --output-file over --output-dir, should I also add a check in stubgen in this PR so it throws an error if both specified?

@wjakob
Copy link
Owner

wjakob commented Mar 10, 2024

That combination doesn't make sense. And output file with recursive also doesn't make sense. (Indeed, stubgen also has some checks for this inside). I was just referring to a part of the text in the docs you wrote that seemed to suggest that RECURSIVE requires OUTPUT_DIR.

@laggykiller
Copy link
Contributor Author

Please check!

@wjakob
Copy link
Owner

wjakob commented Mar 11, 2024

I'm still pondering the best API to expose stub generation in Python.

For example, I think that there is a way of getting RECURSIVE to work even when not using INSTALL_MODE. However, CMake would need to be told which output files are generated as part of the traversal. So in that case, there would be two main ways of building stubs:

nanobind_add_stub(
    ...
    MODULE my_ext
    OUTPUT_FILE my_ext/__init__.pyi
)

and

nanobind_add_stub(
    ...
    MODULE my_ext
    RECURSIVE
    OUTPUT_FILES my_ext/__init__.pyi my_ext/submodule/__init__.pyi
)

If INSTALL_TIME is specified, then OUTPUT_FILE or OUTPUT_FILES is optional.

Once things are tracked at that level of granularity, here is nothing against allowing more modules as input even in non-recursive mode, e.g.:

nanobind_add_stub(
    ...
    MODULES my_ext_1 my_ext_2
    OUTPUT_FILES my_ext_1/__init__.pyi my_ext_2/__init__.pyi
)

@laggykiller
Copy link
Contributor Author

You have mentioned that...

Some of my own nanobind-based projects are designed to be importable/usable from the build directory without requiring an extra install step

Do you have any examples of those projects?

I am not sure how those kind of project work, but maybe calling a python script from nanobind-config.cmake that recurse and find out what stub files should be generated would help?

@wjakob wjakob force-pushed the master branch 2 times, most recently from 56d7e93 to e80edb1 Compare March 11, 2024 17:04
@laggykiller
Copy link
Contributor Author

laggykiller commented Mar 12, 2024

In my opinion,

nanobind_add_stub(
   ...
   MODULE my_ext
   RECURSIVE
   OUTPUT_FILES my_ext/__init__.pyi my_ext/submodule/__init__.pyi
)

is less clean than

nanobind_add_stub(
   ...
   MODULES my_ext my_ext.submodule
)

If we cannot figure out how to find out what output files need to be generated for non-INSTALL_TIME, I think it is better for us to not implement RECURSIVE for that mode (As the user still need to specify things one-by-one, we are just shifting the headache from MODULE to OUTPUT_FILES, and arguably requiring more keystrokes). Instead, we require user to specify MODULE one-by-one, and find out the OUTPUT_FILES automatically based on MODULE suplied by user. This should be easy with something like:

set(OUTPUT_FILES "")
foreach(M ARG_MODULES)
   string(REPLACE "." "/" M_NEW1)
   string(CONCAT ${M_NEW1} "/__init__.pyi" M_NEW2)
   set(OUTPUT_FILES "${OUTPUT_FILES} ${M_NEW2}")
endforeach()

@wjakob
Copy link
Owner

wjakob commented Mar 12, 2024

There is a problem I see with the loop you posted:

set(OUTPUT_FILES "")
foreach(M ARG_MODULES)
   string(REPLACE "." "/" M_NEW1)
   string(CONCAT ${M_NEW1} "/__init__.pyi" M_NEW2)
   set(OUTPUT_FILES "${OUTPUT_FILES} ${M_NEW2}")
endforeach()

In a submodule my_ext.sub, how will you know whether to generate my_ext/sub/__init__.pyi or my_ext/sub.pyi? The difference is important especially in larger binding projects containing a mixture of Python and C++ code.

You were interested in knowing about a project that creates a build-time importable library. See the nanobind_v2 branch of the Dr.Jit project: https://github.com/mitsuba-renderer/drjit/tree/nanobind_v2. This is a project with very complicated bindings, it uses essentially all features of nanobind and is what caused me to start writing nanobind in the first place.

@laggykiller
Copy link
Contributor Author

laggykiller commented Mar 12, 2024

Two ideas.


Idea 1

I think we actually don't need to know the stubgen output files path in order to let CMake only regenerate stub files if source was changed.

From nanobind-config.cmake

    add_custom_command(
      OUTPUT ${NB_STUBGEN_OUTPUTS}
      COMMAND ${NB_STUBGEN_CMD}
      WORKING_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}"
      DEPENDS ${ARG_DEPENDS} "${NB_STUBGEN}" "${ARG_PATTERN_FILE}"
      ${NB_STUBGEN_EXTRA}
    )
    add_custom_target(${name} ALL DEPENDS ${NB_STUBGEN_OUTPUTS})

What actually determines if stubgen.py should be run is DEPENDS in add_custom_command(). As seen from drjit, ARG_DEPENDS contains the target drjit-python which is associated with the cpp source files. Hence if cpp source files associated with target drjit-python are modified, add_custom_command() that depends on target drjit-python would be determined to be run.

This means this should work:

    add_custom_command(
      OUTPUT "whatever_file.txt"
      COMMAND ${NB_STUBGEN_CMD}  # stubgen.py also creates whatever_file.txt
      WORKING_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}"
      DEPENDS ${ARG_DEPENDS} "${NB_STUBGEN}" "${ARG_PATTERN_FILE}"
      ${NB_STUBGEN_EXTRA}
    )
    add_custom_target(${name} ALL DEPENDS "whatever_file.txt")

Or even better (Less confident about this though):

    add_custom_command(
      OUTPUT whatever
      COMMAND ${NB_STUBGEN_CMD}
      WORKING_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}"
      DEPENDS ${ARG_DEPENDS} "${NB_STUBGEN}" "${ARG_PATTERN_FILE}"
      ${NB_STUBGEN_EXTRA}
    )
    set_source_files_properties(whatever PROPERTIES SYMBOLIC true)
    add_custom_target(${name} ALL DEPENDS whatever)

CMake should not panic if the command generates more file than expected, and probably won't delete those extra files?


Idea 2 (EDIT: This should NOT work)

If above does not work, this shall work:

I searched "cmake add_custom_command with unknown output name" in google, found:

What if we:

  1. Add flag A in stubgen.py that would dump stub files path to stdout, another flag B for only drying running, walking through the module to be stubgened and only figure out what are the submodules and what stub files would be generated
  2. Run stubgen.py with flag A and B in execute_process(). Store the stdout of stubgen.py that contain list of stub files path in OUTPUT_VARIABLE (NB_STUBGEN_OUTPUTS)
  3. Now we know NB_STUBGEN_OUTPUTS, run add_custom_command() add_custom_target()

Yes, this means we technically need to run stubgen.py twice, but one of the run should be fast (?) and maybe it is too greedy for us to both want CMake to be fast + be lazy and not specify what files to be generated?

@laggykiller
Copy link
Contributor Author

laggykiller commented Mar 12, 2024

The first idea worked! Now it should be possible to recursively generate stub even for non-install time, and only generated if the source was changed.

I have made a new commit. To test, clone my fork, change tests/CMakeLists.txt

nanobind_add_stub(
py_stub
MODULE py_stub_test
OUTPUT ${PYI_PREFIX}py_stub_test.pyi
PYTHON_PATH $<TARGET_FILE_DIR:test_stl_ext>
DEPENDS py_stub_test.py
)

To the following:

nanobind_add_stub(
  py_stub
  MODULE py_stub_test
  # OUTPUT ${PYI_PREFIX}py_stub_test.pyi
  OUTPUT_DIR ${PYI_PREFIX}
  RECURSIVE
  VERBOSE
  PYTHON_PATH $<TARGET_FILE_DIR:test_stl_ext>
  DEPENDS ${CMAKE_CURRENT_BINARY_DIR}/py_stub_test.py
)

Now cmake build nanobind with test. py_stub_test.pyi should be regenerated only if py_stub_test.py was changed.

Only catch for this method is if py_stub_test.pyi was deleted, it would not be regenerated unless cmake from clean, *_stub.tmp was deleted or changes done to py_stub_test.py

Note: workflow should be successful if change DEPENDS py_stub_test.py to DEPENDS ${CMAKE_CURRENT_BINARY_DIR}/py_stub_test.py in tests/CMakeLists.txt

@wjakob
Copy link
Owner

wjakob commented Mar 12, 2024

Interesting, I guess that makes sense and is reminiscent of dummy timestamp files used in Makefiles. One thing this won't do is clean up things correctly. (e.g. ninja --clean)

@laggykiller
Copy link
Contributor Author

laggykiller commented Mar 12, 2024

One way is to use set_property() to set ADDITIONAL_CLEAN_FILES for .pyi file, but I can't think of how. Glob pattern in set_property() seems not working, and there seems to be no way to store a list of stub file paths generated from add_custom_command() and use set_property() on those stub files. Thinking about this has used more time than I wanted to spend, and I am too busy to think about this problem further...

If there is no way we can mark the .pyi file for deletion during clean, would you still prefer this method? Or would you prefer the method of adding execute_process() before add_custom_command() and add_custom_target()?

@laggykiller
Copy link
Contributor Author

laggykiller commented Mar 12, 2024

This article provides a semi-working solution: https://discourse.cmake.org/t/how-to-cleanup-random-byproducts/3154

Please try the latest commit. Recursively generated files are cleaned reliably (Ninja) or from second time onwards (Unix Makefile)

However, note that file(GLOB) has some bad rep: https://discourse.cmake.org/t/is-glob-still-considered-harmful-with-configure-depends/808

Given that using file(GLOB) only gives us semi-working solution and potential performance issue, you might not want to add that to code. Please share your opinion on this.

@laggykiller
Copy link
Contributor Author

After some thought, seems like the second method of calling execute_process() before running add_custom_command() and add_custom_target() is not an option, as execute_process() is run during configuration time, which .py file and nanobind python modules are absent in the build directory, hence cannot be used for stubgen during configuration time.

Seems like the first idea is the best we can do...

@wjakob wjakob force-pushed the master branch 2 times, most recently from dce91b4 to 183f039 Compare March 18, 2024 12:22
@wjakob
Copy link
Owner

wjakob commented May 23, 2024

Hi @laggykiller,

sorry about leaving this dormant for a long time. This PR went through a few different phases and I was wondering:

  • What are the benefits and downsides of the current approach? As far as I can tell, you write a dummy file that helps compute the staleness of outputs. This makes it unnecessary to specify an output file during stub generation. Are these dummy files only used in certain circumstances or always?
  • You mentioned having added a solution for cleaning files but with some limitations ( Recursively generated files are cleaned reliably (Ninja) or from second time onwards (Unix Makefile)) -- is that still part of the PR? It's a downside if things aren't cleaned up but not a deal-breaker.
  • Is it still possible to specify explicit file outputs and retain the current functionality?
  • Why does the commit mention that recursive mode is only available at install time?

@laggykiller
Copy link
Contributor Author

laggykiller commented May 23, 2024

What are the benefits and downsides of the current approach?

Currently this PR allow RECURSIVE mode to be used whether INSTALL_TIME is set or not (Sort of), which is a benefit.

The downsides are:

Are these dummy files only created and used in certain circumstances or always?

Dummy files only used for non-INSTALL_TIME.

You mentioned having added a solution for cleaning files but with some limitations ( Recursively generated files are cleaned reliably (Ninja) or from second time onwards (Unix Makefile)) -- is that still part of the PR?

Yes it is part of the PR (5982f36)

Is it still possible to specify explicit file outputs and retain the current functionality?

Yes, if user uses OUTPUT instead of OUTPUT_DIR. Everything works as before if user does not use OUTPUT_DIR.

Why does the commit mention that recursive mode is only available at install time?

I forgot to remove that sentence in the PR, fixed.


I was very busy and many things happened after my last comment here, so I may not remember all the details about the PR.

@BrunoB81HK
Copy link

What is the status for this feature? This would be very useful for me and I hope you can find time to add it soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants