Skip to content

Conversation

@lefou
Copy link
Member

@lefou lefou commented Oct 27, 2025

Another attempt to make Mill cache storage relocate-able.

The idea is to replace absolute paths with paths relative to pre-defined known root paths.

Fix parts of #3660

API

Path mappings are managed via the new MappedRoots object. Mappings use a thread local DynamicVariable and various convenience methods to easily adapt the mapping.

An explicit API to change the mapping dynamically is necessary, as the underlying JSON-serialization mechanism can be used for other purposes than Mill caching, like RPC protocols, where a path mapping is not desirable.

Default Mapping

These are the default location which are mapped by Mill:

  • $WORKSPACE - the project directory (where the top-level build.mill can be found. (BuildCtx.workspaceRoot)
  • $MILL_OUT - the Mill output directory used for caches and build-results
  • $HOME - the user home directory (os.home)

Mappings are automatically used in the upickle-based JSON serializers for mill.api.PathRef and os.Path.

More paths come to mind:

  • $CACHE_COURSIER - the coursier cache directory
  • $CACHE_IVY - the ivy cache directory
  • $CACHE_MAVEN - the Maven cache directory

Those are currently not directly accessible in the code base for all the places where we need to be able to read and write JSON cache files. Also, their typical storage locations are already relative to $HOME (e.g. $CACHE_COURSIER vs $HOME/.cache/coursier), hence I left them out for this PR.

We can later add them by adapting Mapping.withMillDefaults.

Design Choices

One important design choice I made, is to go with a dynamic set of known roots. This means, without any registered mapping, PathRef/os.Path JSON-serialization will behave exactly as before. If there are registered path root mappings at serialization time, these mappings will be used. While de-serialization of a mapped path, a mapping for the used key is required, otherwise the de-serialization will fail with an runtime exception.

While it is not ideal to have potential runtime failures, an alternative design with a static mapping, which should behave more deterministically, hasn't worked out well. This is mostly due to the fact that JSON serialization is a universal concept which is, even in Mill, used for various use cases, and not uniquely for Mill cache persistence. While in theory, it should be controllable via the given/implicit scopes, practically we need to define all virtual paths mapping anywhere in the code base, which isn't only impractical, but also sometimes impossible. Effectively it would require us to use dedicated data types for when we need path mappings (e.g. in Mill caches) and when not (e.g. when defining RPC protocols). See PR #6021 where I tried to have a fixed mapping, but had a lot of issues.

PathRef hashCode changes

Due to the way the PathRef.hashCode is used in Mill, esp. in the Evaluator when deciding whether a task needs re-computation, it was important to keep the Java hash code stable and consistent. PathRef.hashCode now uses a mapped path (as configured at construction time) instead of the unmapped, absolute path. Otherwise, it causes constant re-evaluations of already cached tasks. It should be possible to "construct" cases, in which this leads to more hash collisions, but those cases are very unlikely, unrealistic and should still be unproblematic, since a path-change without a sig-change still means, we refer to the same path structure and content.

Other changes

Some refactoring was necessary to streamline the configuration flow of the $MILL_OUT path. Before, the out path was constructed in many different locations, sometimes with slightly different logic. This is of course a minefield, since we have many entry points (Mill CLI, Mill Daemon, Mill BSP, RPC worker processes, Test Suites, Integration tests, ...) where those where set up. So this PR relies heavily on our programmatic test coverage.

Other effects

The show command can be used to show the result of a (cached) task. It now no longer shows local absolute paths for locations covered by a mapped root (which should be the most occurrences) but the paths with placeholders instead. This is IMHO less helpful, but not so easy to fix correctly.

@lefou
Copy link
Member Author

lefou commented Oct 28, 2025

I'm down to a single failing test case for which I don't have an explanation or fix for:

> mill integration.feature[output-directory].packaged.daemon.testForked
...
[7778-0] + mill.integration.OutputDirectoryTests.Output directory elsewhere in workspace 24540ms  
[7778-0] Copying integration test sources from /home/lefou/work/opensource/mill/integration/feature/output-directory/resources to /home/lefou/work/opensource/mill/out/integration/feature/output-directory/packaged/daemon/testForked.dest/worker-0/sandbox/run-3
[7778-0] X mill.integration.OutputDirectoryTests.Output directory outside workspace 6137ms 
[7778-0]   java.lang.AssertionError: assertion failed: ==> assertion failed: false != true
[7778-0]     scala.runtime.Scala3RunTime$.assertFailed(Scala3RunTime.scala:8)
[7778-0]     utest.asserts.Asserts$ArrowAssert.$eq$eq$greater(Asserts.scala:90)
[7778-0]     mill.integration.OutputDirectoryTests$.tests$$anonfun$1$$anonfun$3$$anonfun$1(OutputDirectoryTests.scala:35)
[7778-0]     scala.runtime.function.JProcedure1.apply(JProcedure1.java:15)
[7778-0]     scala.runtime.function.JProcedure1.apply(JProcedure1.java:10)
[7778-0]     mill.testkit.IntegrationTestSuite.integrationTest$$anonfun$1(IntegrationTestSuite.scala:60)
[7778-0]     mill.util.Retry.apply$$anonfun$1(Retry.scala:38)
[7778-0]     mill.util.Retry.apply$$anonfun$adapted$1(Retry.scala:38)
[7778-0]     mill.util.Retry.$anonfun$1$$anonfun$1(Retry.scala:49)
[7778-0]     scala.util.Try$.apply(Try.scala:217)
[7778-0]     mill.util.Retry.$anonfun$1(Retry.scala:49)
[7778-0]     java.lang.Thread.run(Thread.java:1583)
[7778-0] Tests: 3, Passed: 2, Failed: 1
...
1 tasks failed
[7778] integration.feature[output-directory].packaged.daemon.testForked 1 tests failed: 
  mill.integration.OutputDirectoryTests mill.integration.OutputDirectoryTests.Output directory outside workspace

@lefou lefou marked this pull request as ready for review October 28, 2025 07:39
@lefou lefou changed the title [WIP] Dynamically map known root paths in cache files Dynamically map known root paths in cache files Oct 28, 2025
@lefou
Copy link
Member Author

lefou commented Oct 28, 2025

Fixed last test, which was caused by an accidental code paste

@lefou lefou requested a review from lihaoyi October 28, 2025 12:31
@lefou
Copy link
Member Author

lefou commented Oct 29, 2025

This is good to merge. The next step is probably to add more external cache dirs like coursier cache. Usage of MappedRoots.withMillDefaults API is a good indicator, where we need to initialize these values. Maybe @alexarchambault does have an opinion about the best strategy to provide the value here? /cc @lihaoyi

@lihaoyi
Copy link
Member

lihaoyi commented Oct 29, 2025

The approach taken by #4642 is to instrument os.Path#{toString,hashCode,etc.} to customize the rendering, along with some strategic symlinks to make those relative paths valid when passed to subprocesses. That approach solve the issue where paths get passed around embedded in other data structures, e.g. in scalacOptions or similar. IMO that would be a bit more robust than what you're doing here which only works for paths directly returned as PathRefs from a task

@lefou
Copy link
Member Author

lefou commented Oct 29, 2025

This PR is explicitly just about paths where path is an explicit type. All other cases are paths in strings, which could be solved by some optimistically find-and-replace mechanism, which I personally only see as a last resort approach. Another way could be to make config task use more structured formats than just Seq[String], which could then be bases on a well-defined type-based mapping. But I considered it out of scope of this PR. (We should / I will elaborate those ideas in #3660.)

I must admit, I overlooked PR #4642 when I started this one. (Is it even linked?) But I don't like the sym-link approach. Also, one takeaway of this PR`s predecessor (#6021) is, that for communication with other processes and when launching tools we always want to use absolute paths (as we do currently). The whole caching abstraction of Mill happens in Mill, so the conversion between Mill-mapped-paths and absolute paths need to happen in the tasks (and workers). And in Mill, we have the power and the knowledge for translating between both systems.

@lihaoyi
Copy link
Member

lihaoyi commented Oct 29, 2025

The symlink approach is what is used by Bazel for the same purpose. It's not simple by any means, but the question is if there's anything we can do that's better. An optimistic find-and-replace definitely doesn't seem sufficient, e.g. you cannot find-and-replace paths inside the binary compile.dest/zinc file.

I think the litmus test for this is really what was stated in the original ticket: Can we compile an example project in two separate folders, diff the out/ folders, and find they are identical? This PR solves some of the inconsistencies, but I'd be reluctant to merge it up front since (a) this doesn't give us remote caching for typical projects, e.g. the 5-webapp-scalajs-shared mentioned in the original ticket, where things like scalacOptions and others zinc are required and (b) we don't know if this approach will allow us to eventually solve the other problems, or whether we'll hit a wall and find ourselves needing to rip it out later and re-implement some other approach.

@lefou lefou added the later The issue is still relevant, but has now high priority right now label Oct 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

later The issue is still relevant, but has now high priority right now

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants