Dynamically map known root paths in cache files #6031

lefou · 2025-10-27T10:59:37Z

Another attempt to make Mill cache storage relocatable.

The idea is to replace absolute paths with paths relative to pre-defined known root paths.

Fix parts of #3660

API

Path mappings are managed via the new MappedRoots object. Mappings use a thread-local DynamicVariable, plus various convenience methods to easily adapt the mapping.

An explicit API to change the mapping dynamically is necessary, as the underlying JSON-serialization mechanism can be used for purposes other than Mill caching, like RPC protocols, where a path mapping is not desirable.
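To illustrate the scoping idea, here is a minimal sketch built on `scala.util.DynamicVariable`. The object name `MappedRootsSketch` and its methods are hypothetical and only stand in for the real MappedRoots API, whose exact signatures are not shown in this description.

```scala
import scala.util.DynamicVariable

// Hypothetical stand-in for the MappedRoots object described above;
// names and signatures are illustrative, not Mill's actual API.
object MappedRootsSketch {
  // Ordered (placeholder key -> absolute root) pairs; empty means "no mapping".
  private val current = new DynamicVariable[Seq[(String, String)]](Seq.empty)

  // Run `body` with the given mapping active; the previous value is
  // restored when `body` returns or throws.
  def withMapping[T](roots: Seq[(String, String)])(body: => T): T =
    current.withValue(roots)(body)

  // The mapping currently visible to the calling thread.
  def get: Seq[(String, String)] = current.value
}
```

Code wrapped in `MappedRootsSketch.withMapping(...) { ... }` sees the mapping, while code outside of it (for example an RPC serializer) sees an empty mapping and is unaffected.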

Default Mapping

These are the default locations mapped by Mill:

$WORKSPACE - the project directory, where the top-level build.mill can be found (BuildCtx.workspaceRoot)
$MILL_OUT - the Mill output directory used for caches and build-results
$HOME - the user home directory (os.home)

Mappings are automatically used in the upickle-based JSON serializers for mill.api.PathRef and os.Path.
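As a rough sketch of what such a mapping does at serialization time, the following helper rewrites an absolute path into a placeholder form. It is not the code from this PR: `EncodeSketch` is a hypothetical name, it uses `java.nio.file.Path` instead of os.Path to stay dependency-free, it assumes POSIX-style paths, and it assumes roots are tried in the given order (so a nested root like the out directory has to be listed before the workspace).

```scala
import java.nio.file.{Path, Paths}

// Hypothetical helper, not Mill's implementation: rewrite an absolute path
// to "$KEY/rest" using the first root that contains it.
object EncodeSketch {
  def encode(path: Path, roots: Seq[(String, Path)]): String =
    roots.collectFirst {
      case (key, root) if path.startsWith(root) =>
        val rel = root.relativize(path).toString
        if (rel.isEmpty) "$" + key else "$" + key + "/" + rel
    }.getOrElse(path.toString) // no matching root: keep the absolute path
}
```

With roots `MILL_OUT -> /work/project/out`, `WORKSPACE -> /work/project` and `HOME -> /home/me` in that order, a path below `/work/project/out` is written with the `$MILL_OUT` prefix and other workspace paths with `$WORKSPACE`; a path outside all roots stays absolute.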

More paths come to mind:

$CACHE_COURSIER - the coursier cache directory
$CACHE_IVY - the ivy cache directory
$CACHE_MAVEN - the Maven cache directory

Those are currently not directly accessible in the code base in all the places where we need to be able to read and write JSON cache files. Also, their typical storage locations are already relative to $HOME (e.g. $CACHE_COURSIER vs $HOME/.cache/coursier), hence I left them out of this PR.

We can later add them by adapting MappedRoots.withMillDefaults.

Design Choices

One important design choice I made is to go with a dynamic set of known roots. This means that, without any registered mapping, PathRef/os.Path JSON-serialization behaves exactly as before. If there are registered path root mappings at serialization time, these mappings will be used. When de-serializing a mapped path, a mapping for the used key is required; otherwise the de-serialization fails with a runtime exception.
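The de-serialization side can be sketched in the same spirit. Again, `DecodeSketch` is a hypothetical helper and not the code from this PR; the use of `NoSuchElementException` is an assumption, the description above only says that a runtime exception is thrown.

```scala
import java.nio.file.{Path, Paths}

object DecodeSketch {
  // Hypothetical decoder: "$KEY/rest" needs a registered root for KEY;
  // anything not starting with '$' is treated as a plain absolute path.
  def decode(encoded: String, roots: Map[String, Path]): Path =
    if (!encoded.startsWith("$")) Paths.get(encoded)
    else {
      val (key, rest) = encoded.drop(1).span(_ != '/')
      val root = roots.getOrElse(
        key,
        // Assumed exception type; any RuntimeException would fit the description.
        throw new NoSuchElementException("No root mapping registered for $" + key)
      )
      // `rest` is "" or "/sub/path"; strip the leading slash before resolving.
      if (rest.isEmpty) root else root.resolve(rest.drop(1))
    }
}
```

Unmapped absolute paths pass through unchanged even with an empty mapping, which matches the "behaves exactly as before" case, while a placeholder such as `$HOME/...` without a registered `HOME` root fails at read time.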

While it is not ideal to have potential runtime failures, an alternative design with a static mapping, which should behave more deterministically, hasn't worked out well. This is mostly because JSON serialization is a universal concept which, even in Mill, is used for various use cases and not only for Mill cache persistence. While in theory it should be controllable via given/implicit scopes, in practice we would need to define all virtual path mappings everywhere in the code base, which is not only impractical but sometimes impossible. Effectively, it would require us to use dedicated data types for when we need path mappings (e.g. in Mill caches) and when we don't (e.g. when defining RPC protocols). See PR #6021, where I tried a fixed mapping but ran into a lot of issues.

PathRef `hashCode` changes

Due to the way PathRef.hashCode is used in Mill, esp. in the Evaluator when deciding whether a task needs re-computation, it was important to keep the Java hash code stable and consistent. PathRef.hashCode now uses the mapped path (as configured at construction time) instead of the unmapped, absolute path. Otherwise, already cached tasks would be constantly re-evaluated. It should be possible to "construct" cases in which this leads to more hash collisions, but those cases are very unlikely and unrealistic, and should still be unproblematic, since a path change without a sig change still means we refer to the same path structure and content.
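A simplified sketch of the idea, not the actual PathRef code: `PathRefSketch` is a hypothetical class, it only knows a single `$WORKSPACE` root, and `sig` is reduced to a plain `Int`. The point is only that the hash is derived from the mapped form captured at construction time.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical, simplified PathRef: `mapped` is the placeholder form of `path`
// computed at construction time; `sig` stands for the content signature.
final class PathRefSketch(val path: Path, val sig: Int, workspace: Path) {
  private val mapped: String =
    if (path.startsWith(workspace)) "$WORKSPACE/" + workspace.relativize(path)
    else path.toString

  // Hash the mapped form, so the value does not depend on where the
  // workspace is located. `equals` is intentionally left out of this sketch.
  override def hashCode: Int = 31 * mapped.hashCode + sig
}
```

Two instances pointing at `src` inside `/home/alice/proj` and inside `/ci/checkout`, with the same `sig`, get the same hash code here. This is also the "constructed" collision case mentioned above: different absolute paths that map to the same placeholder path and carry the same signature.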

Other changes

Some refactoring was necessary to streamline the configuration flow of the $MILL_OUT path. Before, the out path was constructed in many different locations, sometimes with slightly different logic. This is of course a minefield, since we have many entry points (Mill CLI, Mill Daemon, Mill BSP, RPC worker processes, Test Suites, Integration tests, ...) where those were set up. So this PR relies heavily on our programmatic test coverage.

Other effects

The show command can be used to show the result of a (cached) task. It no longer shows local absolute paths for locations covered by a mapped root (which should be most occurrences), but paths with placeholders instead. This is IMHO less helpful, but not so easy to fix correctly.

Signed-off-by: Tobias Roeser <[email protected]>

lefou · 2025-10-28T07:17:11Z

I'm down to a single failing test case for which I don't have an explanation or fix:

> mill integration.feature[output-directory].packaged.daemon.testForked
...
[7778-0] + mill.integration.OutputDirectoryTests.Output directory elsewhere in workspace 24540ms  
[7778-0] Copying integration test sources from /home/lefou/work/opensource/mill/integration/feature/output-directory/resources to /home/lefou/work/opensource/mill/out/integration/feature/output-directory/packaged/daemon/testForked.dest/worker-0/sandbox/run-3
[7778-0] X mill.integration.OutputDirectoryTests.Output directory outside workspace 6137ms 
[7778-0]   java.lang.AssertionError: assertion failed: ==> assertion failed: false != true
[7778-0]     scala.runtime.Scala3RunTime$.assertFailed(Scala3RunTime.scala:8)
[7778-0]     utest.asserts.Asserts$ArrowAssert.$eq$eq$greater(Asserts.scala:90)
[7778-0]     mill.integration.OutputDirectoryTests$.tests$$anonfun$1$$anonfun$3$$anonfun$1(OutputDirectoryTests.scala:35)
[7778-0]     scala.runtime.function.JProcedure1.apply(JProcedure1.java:15)
[7778-0]     scala.runtime.function.JProcedure1.apply(JProcedure1.java:10)
[7778-0]     mill.testkit.IntegrationTestSuite.integrationTest$$anonfun$1(IntegrationTestSuite.scala:60)
[7778-0]     mill.util.Retry.apply$$anonfun$1(Retry.scala:38)
[7778-0]     mill.util.Retry.apply$$anonfun$adapted$1(Retry.scala:38)
[7778-0]     mill.util.Retry.$anonfun$1$$anonfun$1(Retry.scala:49)
[7778-0]     scala.util.Try$.apply(Try.scala:217)
[7778-0]     mill.util.Retry.$anonfun$1(Retry.scala:49)
[7778-0]     java.lang.Thread.run(Thread.java:1583)
[7778-0] Tests: 3, Passed: 2, Failed: 1
...
1 tasks failed
[7778] integration.feature[output-directory].packaged.daemon.testForked 1 tests failed: 
  mill.integration.OutputDirectoryTests mill.integration.OutputDirectoryTests.Output directory outside workspace

lefou · 2025-10-28T09:36:14Z

Fixed last test, which was caused by an accidental code paste

lefou · 2025-10-29T07:59:07Z

This is good to merge. The next step is probably to add more external cache dirs like the coursier cache. Usages of the MappedRoots.withMillDefaults API are a good indicator of where we need to initialize these values. Maybe @alexarchambault has an opinion about the best strategy to provide the value here? /cc @lihaoyi

lihaoyi · 2025-10-29T08:15:37Z

The approach taken by #4642 is to instrument os.Path#{toString,hashCode,etc.} to customize the rendering, along with some strategic symlinks to make those relative paths valid when passed to subprocesses. That approach solves the issue where paths get passed around embedded in other data structures, e.g. in scalacOptions or similar. IMO that would be a bit more robust than what you're doing here, which only works for paths directly returned as PathRefs from a task.

lefou · 2025-10-29T09:20:49Z

This PR is explicitly just about paths where the path is an explicit type. All other cases are paths in strings, which could be solved by some optimistic find-and-replace mechanism, which I personally only see as a last-resort approach. Another way could be to make config tasks use more structured formats than just Seq[String], which could then be based on a well-defined type-based mapping. But I considered it out of scope for this PR. (We should / I will elaborate on those ideas in #3660.)

I must admit, I overlooked PR #4642 when I started this one. (Is it even linked?) But I don't like the symlink approach. Also, one takeaway of this PR's predecessor (#6021) is that for communication with other processes and when launching tools we always want to use absolute paths (as we do currently). The whole caching abstraction of Mill happens in Mill, so the conversion between Mill-mapped paths and absolute paths needs to happen in the tasks (and workers). And in Mill, we have the power and the knowledge for translating between both systems.

lihaoyi · 2025-10-29T09:43:52Z

The symlink approach is what is used by Bazel for the same purpose. It's not simple by any means, but the question is if there's anything we can do that's better. An optimistic find-and-replace definitely doesn't seem sufficient, e.g. you cannot find-and-replace paths inside the binary compile.dest/zinc file.

I think the litmus test for this is really what was stated in the original ticket: Can we compile an example project in two separate folders, diff the out/ folders, and find they are identical? This PR solves some of the inconsistencies, but I'd be reluctant to merge it up front since (a) this doesn't give us remote caching for typical projects, e.g. the 5-webapp-scalajs-shared mentioned in the original ticket, where things like scalacOptions, zinc, and others are required, and (b) we don't know if this approach will allow us to eventually solve the other problems, or whether we'll hit a wall and find ourselves needing to rip it out later and re-implement some other approach.

lefou and others added 30 commits October 23, 2025 21:20

Use path mapping when json-serializing PathRefs

7918f3d

Use path-roots mapping for path serialization

cb82912

Hack: explicitly set outPath before executing anything

00c0249

Set oupath when writing mill-runner-state.json

8e80cbc

[autofix.ci] apply automated fixes

2ac5708

Fix outpath propagation to test runner

d79a99b

fix expected show output

75377c2

[autofix.ci] apply automated fixes

6c6c405

Fix deprecation

ff26b24

Name tuple items

a420435

Don't use current evaluator to get the current output path

83e217a

Stabilize PathRef hashcode by using the encoded path

304cd99

[autofix.ci] apply automated fixes

c4e5221

Fix tests

8d8a2b0

Refactor outDir init and propagation

db55495

[autofix.ci] apply automated fixes

c735f56

Replace expected literals in itests

c3ab1d6

Set outPath in UnitTester

b338d3a

[autofix.ci] apply automated fixes

ddebbbd

Change design: root path mapping is now dynamic

0d4a916

Addd more checks

ba63983

Fix defaults

082ae2e

Cleanup

cb672de

Always use current mapping when serializing to json

965b5f2

Ensure, we don't map any root paths in RPC communication

191bdb0

Signed-off-by: Tobias Roeser <[email protected]>

[autofix.ci] apply automated fixes

2b08564

Don't use placeholders in test runner testargs files

64ee917

[autofix.ci] apply automated fixes

98feb64

Made test condition on Java version

75656d1

Readd pathref config in GroupExecution

adb6cff

lefou added 5 commits October 27, 2025 18:44

Fixed test expectation

b944cbe

Revert more eager expected test output changes

a56745e

Ajust reading of json files

356fa98

Fix tests

4d65636

Fix test

82b8676

lefou mentioned this pull request Oct 28, 2025

[WIP] Map known root paths in cache files #6021

Closed

lefou marked this pull request as ready for review October 28, 2025 07:39

lefou mentioned this pull request Oct 28, 2025

Make out/ folder contents (more) reproducible and filesystem layout agnostic #3660

Open

lefou changed the title ~~[WIP] Dynamically map known root paths in cache files~~ Dynamically map known root paths in cache files Oct 28, 2025

Revert accidental code paste

006615c

lefou and others added 9 commits October 28, 2025 10:37

cleanup

ebc0e19

cleanup

cc5d81d

[autofix.ci] apply automated fixes

3c81644

Moved new API to MappedRoots object

03f22c6

Merge branch 'main' into tr-path-mapping-optional

9cd6301

[autofix.ci] apply automated fixes

21ede11

Renamings

9af1b09

Ensure MappedRoots.withMillDefaults is used with named parameters

289595e

cleanup

3d2501f

lefou requested a review from lihaoyi October 28, 2025 12:31

lefou added 2 commits October 28, 2025 18:10

Merge branch 'main' into tr-path-mapping-optional

6914aca

Merge branch 'main' into tr-path-mapping-optional

075c058

lefou added the "later" label (The issue is still relevant, but has no high priority right now) Oct 30, 2025
