Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: problems new files: include/exclude functionality #5455

Open
h-vetinari opened this issue Aug 11, 2024 · 11 comments
Open

BUG: problems new files: include/exclude functionality #5455

h-vetinari opened this issue Aug 11, 2024 · 11 comments

Comments

@h-vetinari
Copy link
Contributor

h-vetinari commented Aug 11, 2024

The functionality introduced in #5216 by @carterbox promises to be very useful on several feedstocks that need to "slice and dice" the result of a build into different outputs, so I started testing this for llvmdev in conda-forge/llvmdev-feedstock#283. This runs into a couple of different problems though.

files: does not take into account files already existing in host:

In llvmdev, we want to version the binaries (and some libraries), and create symlinks for the unversioned ones (that point to the versioned ones). These should go into different outputs. As such, llvm-tools-18 should contain the versioned binaries (except llvm-config), and llvm-tools should contain the symlinks. The following configuration expresses that intent.

  - name: llvm-tools-{{ major_ver }}
    files:
      include:
        - bin/*-{{ major_ver }}     # [unix]
      exclude:
        # belongs into llvmdev
        - bin/llvm-config*          # [unix]
      [...]

  - name: llvm-tools
    files:
      include:
        # opt-viewer tool is in share
        - bin/*                     # [unix]
        - share/*                   # [unix]
        - Library/bin/*.exe         # [win]
        - Library/share/*           # [win]
      exclude:
        # belongs into llvmdev
        - bin/llvm-config*          # [unix]
        - Library/bin/llvm-config*  # [win]
    requirements:
      host:
        - {{ pin_subpackage("llvm-tools-" ~ major_ver, exact=True) }}   # [not win]

However, what ends up happening is that llvm-tools repackages the output of llvm-tools-18, despite llvm-tools-18 being a host-dependence of llvm-tools (which should therefore be part of the baseline snapshot before installation that's the basis for determining what got installed on top).

This is a common setup, i.e. it's often desirable to slice off lib/libfoo.so and lib/libbar.so into separate outputs, and everything else in lib should go into yet another output. Similarly here, for the llvmdev output, we want to pick up everything that hasn't already been installed into other outputs. That too doesn't seem to work (i.e. bin/llvm-config ends up missing), using the seemingly obvious

  - name: llvmdev
    files:
      include:
        # everything not already in other outputs
        - "*"

Exclusion rule causes glob error

One output has:

  - name: libllvm{{ major_ver }}
    files:
      include:
        - lib/lib*                  # [unix]
      exclude:
        # separate output, see below
        - lib/libLLVM-C.*           # [unix]
        - Library/**/libLLVM-C.*    # [win]

on osx-arm, the exclusion causes a glob error:

ERROR: Glob $PREFIX/lib/libLLVM-C.* did not match in root_dir $PREFIX

It's worth noting that this file doesn't currently get installed, but if it ever does in the future, it should be excluded (hence the rule). Excluding files that aren't present needs to not cause errors.

Also, somehow the osx-arm case (cross-compiled from x64) manages to trigger something for the wrong architecture:

  File "/Users/runner/miniforge3/lib/python3.10/site-packages/conda_build/post.py", line 1635, in post_process_shared_lib
    mk_relative_osx(path, host_prefix, m, files=files, rpaths=rpaths)
  File "/Users/runner/miniforge3/lib/python3.10/site-packages/conda_build/os_utils/macho.py", line 221, in otool
    lines = check_output([otool, "-l", path], stderr=STDOUT).decode("utf-8")
  File "/Users/runner/miniforge3/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/Users/runner/miniforge3/lib/python3.10/subprocess.py", line 503, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/Users/runner/miniforge3/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/Users/runner/miniforge3/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 86] Bad CPU type in executable: '$PREFIX/bin/llvm-otool'

That might be a corner case though, as we're packaging the binary that perhaps gets called by conda-build itself (if a symlink otool -> llvm-otool exists). However, this aspect didn't fail before switching to the files:-based approach.

@h-vetinari
Copy link
Contributor Author

I guess there are two different use-cases that are at odds with each other. In the case of the bootstrap compilers, we actually do want everything from host: to be picked up.

But since we now have a distinction between

files:
  - abc

and

files:
  include:
    - abc
  exclude:
    - def

I guess we could keep the former as "take everything" and the latter as "take into account snapshotting"?

@carterbox
Copy link
Contributor

files: does not take into account files already existing in host:

The implicit filtering/ignoring of files added to the prefix from other packages only applies to files installed from packages which were created in separate recipes. It doesn't work when the host dependcies come from another output of the same recipe. You will need to craft your include exclude globs so they do not intersect.

If you want a "everything not already in other outputs" then you would do something like this:

- name: A
  files:
    include:
     - A
- name: B
  files:
    include:
     - B
- name: C
  files:
    include:
     - "*"
    exclude:
     - A
     - B

The reason for this is mostly implementation motivated to minimize changes to the exisiting code, but also if we were to track which artifacts had already been added to other outputs, then suddenly the order in which the outputs are packaged matters! Currently, I think the order in which the outputs are packaged is somewhat inconsistent/arbitrary.

Exclusion rule causes glob error

I have noticed this message too. It comes from a conda-implemented glob function and not something that I added when implementing include/exclude. I don't think it actually causes the build to fail though? If I recall, it's just a print statement which contains the word "Error". I agree this should be fixed though.

I guess there are two different use-cases that are at odds with each other. In the case of the bootstrap compilers, we actually do want everything from host: to be picked up.

I'm not saying it's wrong per se, but I do think it's strange for a recipe to repackage artifacts from its host dependencies. Why repackage instead of depending? Maybe there's something to be fixed with the interaction between build/always_include_files and build/files/include,build/files/exclude.

@carterbox
Copy link
Contributor

  - name: llvm-tools-{{ major_ver }}
    files:
      include:
        - bin/*-{{ major_ver }}     # [unix]
      exclude:
        # belongs into llvmdev
        - bin/llvm-config*          # [unix]
      [...]

  - name: llvm-tools
    files:
      include:
        # opt-viewer tool is in share
        - bin/*                     # [unix]
        - share/*                   # [unix]
        - Library/bin/*.exe         # [win]
        - Library/share/*           # [win]
      exclude:
        # belongs to llvm-tools-major-ver
        - bin/*-{{ major_ver }}     # [unix]
        # belongs into llvmdev
        - bin/llvm-config*          # [unix]
        - Library/bin/llvm-config*  # [win]
    requirements:
      host:
        - {{ pin_subpackage("llvm-tools-" ~ major_ver, exact=True) }}   # [not win]

@h-vetinari
Copy link
Contributor Author

Yes, trivially I can duplicate the exclusion rules in various places. But that's unnecessarily cumbersome and brittle.

In all other scenarios the content is correctly determined as the diff to the host environment before installation. If I really want to overwrite something, it should be opt in with always_include_files (which I had forgotten above and answers the question of how to handle the bootstrap compilers).

@h-vetinari
Copy link
Contributor Author

if we were to track which artifacts had already been added to other outputs, then suddenly the order in which the outputs are packaged matters!

But conda already necessarily provides the right build order if a subpackage is a host dependency of another one? That logic must already exist in some form, because that's how it works when we manually layer the way these outputs get filled (examples are llvmdev, clangdev, arrow-cpp, google-cloud-cpp, and many more).

PS. Sorry, was on a phone and went through my notifications in reverse chronological order (my mistake).

@carterbox
Copy link
Contributor

But conda already necessarily provides the right build order if a subpackage is a host dependency of another one?

I am thinking of a case where subpackages do not have a dependency relationship but have overlapping globs. Then it is ambiguous. This could happen when subpackages are optional like static library packages or something with optional modules?

@carterbox
Copy link
Contributor

Anyways, my argument is that I think it's valuable to have the user explicitly define non-overlapping globs for the artifacts that they have created in their package and not rely on overlapping globs in which artifacts are separated implicitly based on dependency relationships.

@h-vetinari
Copy link
Contributor Author

Then it is ambiguous.

I don't see that it's ambiguous at all. That's exactly where the question of adding A to the host-deps of B specifies both the build order, as well as which files of A won't be duplicated in B, despite the overlapping glob. This is the core snapshotting mechanism how conda-build determines the before/after state of an environment, where the delta is what got "installed" and becomes the content of the output (by definition). Consequently, if we add something to the "before" state, it won't end up in the delta!

Anyways, my argument is that I think it's valuable to have the user explicitly define non-overlapping globs for the artifacts that they have created in their package and not rely on overlapping globs in which artifacts are separated implicitly based on dependency relationships.

I disagree wholeheartedly. We should not force users to juggle multiple globs and their exclusion in various places redundantly. The default behaviour should match what's happening elsewhere by default as well, which is "delta between before & after". And for cases where that needs to be overridden, there's more existing functionality (always_include_files) that provides an opt-in for that.

To be clear, I'm very grateful for the functionality you pioneered in #5216, it'll be a great addition to the toolkit. However, the behaviour as of 24.7 is inconsistent with other conda-build behaviour (I'd go as far as saying: conceptually broken), and while I'm not a maintainer here, I think we do have a small-but-closing window to fix this brand-new functionality before people start relying on this behaviour.

@carterbox
Copy link
Contributor

A summary of the discussion so far

We agree that printing an error about exclusion globs not matching anything is extraneous and should be fixed. We agree that always_include_files should work for files added to the PREFIX by external host dependencies. We are discussing how we can improve user experience by changing which files are implicitly added to the include/exclude sets. Currently, all files not installed in the top level build step are implicitly excluded (i.e. external host dependencies are excluded). No files are implicitly included. @h-vetinari has proposed that more files should be implicitly excluded for each of the outputs.

Continuing discussion of which files should be implcitly excluded

The default behaviour should match what's happening elsewhere by default as well, which is "delta between before & after".

Can you provide an example of this "delta between before & after" paradigm elsewhere in the conda-build process?

With the previous behavior (using outputs/files), I recall that any file in the PREFIX that matches a file listed will be added to that output. This is regardless of whether that file has already been added to another output or whether the file is from an external package. For example, you could have:

package:
  name: subpackage-test-split
  version: 0.0.0

build:
  number: 0
  script:
    - touch ${PREFIX}/a-file

outputs:

  - name: subpackage-name-first
    files:
      - a-file
    test:
      commands:
        - test -f ${PREFIX}/a-file

  - name: subpackage-name-second
    files:
      - a-file
    test:
      commands:
        - test -f ${PREFIX}/a-file

about:
  summary: A fake package to test conda-build behavior

and they would both contain a-file So, in this case, there is no "delta between before & after"; there is only "filenames which match files in the PREFIX". I would say this method is very conceptually in line with the new outputs/files/include method.

The other previous behavior using outputs/script is essentially a file list in script form because whatever files you have listed in the script to move to the PREFIX are added to the output. You can move the same file into the PREFIX for multiple outputs. The PREFIX and build directory are reset for each output, so the order in which the outputs are packaged does not effect their contents. I guess you could think of this method as "delta between before & after"? In that case, you would modify or extend the behavior of output/scripts not output/files in order to achieve the behavior that you are looking for?

xref: https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#specifying-files-to-include-in-output

carterbox added a commit to carterbox/conda-build that referenced this issue Aug 30, 2024
The expand_globs function from conda_build.utils logs an ERROR
when a glob expression returns no matches, this is overly alarming
because the user may now use negative glob expressions which they
don't care if it returns empty or the user may want to use the
same set of glob expressions for multiple platforms some of which
may return empty on some platforms.

conda#5216
conda#5455
@h-vetinari
Copy link
Contributor Author

h-vetinari commented Aug 31, 2024

Can you provide an example of this "delta between before & after" paradigm elsewhere in the conda-build process?

See the docs for the conda-build process:

Creates a conda package containing all the files in the build environment that are new from step 5, along with the necessary conda package metadata.

That's what I meant when I said it's the keys mechanism, but I'll freely admit I don't know the codebase here, and it would surely be beneficial to have some maintainers here chime in.

So, in this case, there is no "delta between before & after"; there is only "filenames which match files in the PREFIX".

My understanding on the history is about as limited as my familiarity with the code-base, but AFAIU, the outputs: functionality was added to conda-build later on, originally for slicing the result of one global build step (the before/after) into various tranches. However, outputs have slowly become full-fledged builds themselves, in that they can have their own build/host/run requirements, build scripts, etc.

So we have some ambiguity due to the way things grew historically. If you use outputs only to slice the result of a global build step, I can see why things were done the way they are. My argument is that the consistent thing to do would be to let outputs be first-class citizens that enjoy all the features that regular builds get as well.

Your example is missing a key point that makes my proposal (and the situation, IMO) unambiguous:

outputs:
  - name: subpackage-name-first
    files:
      - a-file
    test:
      commands:
        - test -f ${PREFIX}/a-file

  - name: subpackage-name-second
    files:
      - a-file
    requirements:
      host:
        - {{ pin_subpackage("subpackage-name-first", exact=True) }}     # <- !!!
      run:
        # whether we add it under run: as well determines if the test below passes/fails
        - {{ pin_subpackage("subpackage-name-first", exact=True) }}
    test:
      commands:
        - test -f ${PREFIX}/a-file

By specifying that subpackage-name-second starts off with subpackage-name-first in its host environment, the "before" of the installation of subpackage-name-second already contains a-file, and so subpackage-name-second shouldn't pick it up anymore (unless overridden with always_include_files).

@h-vetinari
Copy link
Contributor Author

@kenodegard @travishathaway @jezdez @beeankha
could someone from the conda side chime in here? I started the functionality from #5216 this in conda-forge/llvmdev-feedstock#283 and found it completely unworkable. Concretely, if there's an output B depending on A, it's IMO conceptually broken to have to manually exclude A's files from the files: of B. That should be taken care of by conda-build's usual snapshotting ("was this file in $PREFIX already before installation of this output?").

This problem gets much worse if several outputs depend on each other and would lead to duplicated files: specifications all over the place (but switching back and forth between include: and exclude, which multiplies the chances of getting it wrong). This makes the include:/exclude: feature unusable for several feedstocks where I'd be keen to use it otherwise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🆕 New
Development

No branches or pull requests

2 participants