
Cache Pure Top Level Definitions on startup #5379

Merged
merged 45 commits into from
Oct 9, 2024

Conversation


@ChrisPenner ChrisPenner commented Sep 30, 2024

Overview

Before this change, all pure top-level definitions were re-evaluated at every call site, which is obviously unnecessary.

Now we can detect these using type info from the codebase, and inline the evaluated result into usage sites.
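As a rough illustration of the detection step: a definition qualifies when it takes no arguments and its type carries no abilities. The types and names below are hypothetical stand-ins, not the actual codebase API:

```haskell
-- Hypothetical sketch: a top-level definition is "cacheable" when it
-- denotes a plain value, i.e. its type is neither a function type nor
-- a computation requiring abilities.
data Ty
  = Arrow Ty Ty          -- a function type: re-applied at call sites, skip
  | Effects [String] Ty  -- a computation requiring abilities: not pure
  | Base String          -- a plain value type, e.g. "Nat" or "Map Nat Nat"

isCacheable :: Ty -> Bool
isCacheable (Arrow _ _)   = False
isCacheable (Effects _ _) = False
isCacheable (Base _)      = True
```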

We also discussed whether to bake the evaluated results into the compiled artifact.
The consensus was yes, serialize the results; I'd still like to do that, but after a day of fiddling it proved tougher than I expected.
The issues were:

  1. Closures can contain Foreigns, which are arbitrary Haskell values. Most of these won't appear in a pure value, but some do: things like stdin/stdout Handles, as well as patterns and text/bytes literals. These are all serializable in their own way, but it takes some care to ensure we do it correctly.
  2. Closures contain argument references which refer to closures via Seg 'BX, but this is fixed in the Stack implementation to include inlined Sections, which we can't serialize. We can most likely either parameterize this type or replace usage sites with a more generic version, but it affects a big chunk of things and will be a bit of a pain to figure out and propagate.

So serializing the pre-evaluated pure results is possible with more work, but it'd be nice to release the non-serializing version for now. The currently implemented version pre-evaluates cacheable values on executable startup.

When running something from a cold runtime (e.g. via ucm run or a fresh ucm session) there'll be a short delay while ucm evaluates everything, but it'll be snappy after that.

Implementation notes

  • I also did the optimization Dan suggested: avoiding the RComb indirection by including the CombIx directly in the Section/Instr. This will also help with serializing closures, but as mentioned above that part isn't done yet.
  • Add getTypeOfTerm to Runtime CodeLookup.
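A rough sketch of the indirection change in the first bullet, using placeholder types rather than the real ANF definitions:

```haskell
import Data.Word (Word64)

-- Placeholder sketch: Sections hold a stable CombIx directly instead of
-- a knot-tied RComb pointer, and the runtime resolves the index through
-- its combinator map on demand. A plain index is also easy to serialize.
-- (The real CombIx carries more information than a single Word64.)
newtype CombIx = CIx Word64
  deriving (Eq, Show)

data Section
  = Call CombIx  -- before: a resolved RComb reference; after: the index
  | Done
  deriving (Eq, Show)

-- Resolution happens at execution time via whatever lookup the runtime
-- provides, rather than being baked into the Section.
resolve :: (Word64 -> comb) -> Section -> Maybe comb
resolve lookupComb (Call (CIx n)) = Just (lookupComb n)
resolve _ Done = Nothing
```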

It's a bit annoying, because in order to evaluate the pure defns I need all the other combs to be in the right format for the runtime, so here's how it goes:

  1. Inline new combs
  2. Iterate over new cacheable definitions and evaluate them, replacing the comb in the map with the evaluated closure
  3. Re-inline new combs so they now inline the evaluated results.
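The middle step above can be sketched as a pass over the combinator map. Everything here (the Comb type, `eval`, `cacheable`) is a stand-in for the real runtime machinery, and the two inlining passes are elided:

```haskell
import qualified Data.Map as Map
import Data.Word (Word64)

-- Stand-ins for the real runtime's combinator types; `Cached` plays the
-- role of an evaluated-closure slot.
data Comb v = Code String | Cached v
  deriving (Eq, Show)

-- Step 2, sketched: walk the combinator map and replace each cacheable
-- definition's code with its evaluated result, leaving the rest alone.
precacheStep
  :: (String -> v)           -- assumed evaluator for pure combinators
  -> (Word64 -> Bool)        -- which combinators are cacheable
  -> Map.Map Word64 (Comb v)
  -> Map.Map Word64 (Comb v)
precacheStep eval cacheable = Map.mapWithKey step
  where
    step k (Code src) | cacheable k = Cached (eval src)
    step _ c = c
```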

Benchmarks

Made a test which references a 1k element map definition:

map1K = range 0 1000 |> map (i -> (i, i)) |> Map.fromList

pureTest : '{IO, Exception} ()
pureTest =
  do
    use List map replicate
    use Nat + - == range
    repeat n action =
      if n == 0 then ()
      else
        _ = action()
        repeat (n - 1) action

    printTime
      "Map.lookup (1k element map)" 1 let
        n -> repeat n do Map.get 12 map1K
======= Trunk
Map.lookup (1k element map)
3.361018ms

======= This Branch
Map.lookup (1k element map)
5.3µs

Loose ends

See overview about serializing closures.

@@ -4,14 +4,13 @@
{-# LANGUAGE PatternSynonyms #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE ViewPatterns #-}
-- TODO: Fix up all the uni-patterns
{-# OPTIONS_GHC -Wno-incomplete-uni-patterns #-}

@ChrisPenner ChrisPenner Sep 30, 2024


I replaced the uni-patterns with explicit errors so I could re-enable this.

combs :: EnumMap Word64 RCombs
combs =
( mapWithKey
srcCombs :: EnumMap Word64 Combs
Contributor Author

I now include all the original non-recursive Combs in the CCache because it makes them much easier to serialize.

@ChrisPenner ChrisPenner changed the title Cache Top Level Definitions on startup Cache Pure Top Level Definitions on startup Sep 30, 2024
@ChrisPenner ChrisPenner marked this pull request as ready for review September 30, 2024 23:21
@ChrisPenner ChrisPenner self-assigned this Sep 30, 2024

@pchiusano pchiusano left a comment


🎉


@dolio dolio left a comment


So, I noticed one thing that seems weird to me. Having a cached closure in a PAp seems wrong, so I think we need to tweak something. I'm not sure what yet, though. I'll think about it and look things over some more.

(useg, bseg) <- closeArgs C ustk bstk useg bseg args
ustk <- discardFrame =<< frameArgs ustk
bstk <- discardFrame =<< frameArgs bstk
apply !env !denv !activeThreads !ustk !bstk !k !ck !args = \case
Contributor

I think it doesn't really make much sense to have a partial application to a CachedClosure, similar to the Let situation. I haven't finished reading everything, but maybe there needs to be a type that is actually just combinator references, not both those and pre-evaluated closures.


dolio commented Oct 1, 2024

Okay, so, maybe it's as simple as this...

Separate the Lam constructor of GComb into its own type. This is the type that should occur in PAp.

Make GComb a sum referring to this new type. The new type should be unpacked into GComb, and pattern aliases could make using it look essentially like the current GComb.

Occurrences of the new, separate type don't need to be lazy, because they aren't from knot tying, they're the concrete entry info for an actual combinator.
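A minimal sketch of that shape, with placeholder fields rather than the real runtime definitions:

```haskell
{-# LANGUAGE PatternSynonyms #-}

-- The factored-out type: concrete entry info for an actual combinator.
-- The Int/String fields are illustrative stand-ins.
data CombInfo = LamI Int Int String
  deriving (Eq, Show)

-- GComb becomes a sum over combinators and cached closures. In the real
-- version CombInfo would be unpacked into the Comb constructor.
data GComb clos
  = Comb !CombInfo
  | CachedClosure !clos

-- A partial application now refers only to CombInfo, so it can never
-- contain a cached closure.
data Closure = PAp CombInfo [Closure]

-- A pattern alias keeps existing call sites reading like the old GComb.
pattern Lam :: Int -> Int -> String -> GComb clos
pattern Lam ua ba body = Comb (LamI ua ba body)
```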

dolio and others added 2 commits October 4, 2024 01:05
A PAp should only contain an actual combinator, not a cached value.
So, the combinator case has been factored out of and unpacked into
GComb. This way a PAp can refer to the factored-out part.

ceedubs commented Oct 4, 2024

@ChrisPenner I'm curious about when this kicks in and what "on startup" means. Could you help me understand? Some of my questions:

  • You've specified that this applies to ucm run. Does it apply to run.compiled and run.native as well?
  • For run, run.compiled, and run.native, does it only evaluate/cache the dependencies of the function being run?
  • It sounds like this also applies to interactive ucm sessions. Does it cache the result of every pure top-level term in the codebase at startup? Or when you switch to a project does it compute only the ones with names in the current branch of the project? Or something else entirely?
  • Is this relevant for both the interpreter and the JIT runtime?

Here's one case that I'm wondering about:

Historically it hasn't been too bad to have a (pure) test in a project that checks 100k samples and takes 10 seconds to evaluate, because the result gets cached. But are we going to pay that 10 second penalty just to cache [Ok "passed"] in more places now (like when I run my main method or when I switch into that project)?


ceedubs commented Oct 4, 2024

I should have mentioned in my last comment that I'm excited about this change, @ChrisPenner! I've seen repeated evaluation bite newcomers, and I've been curious to what extent it might be hitting us with certain Pattern values, decoders, etc. Thank you!

dolio and others added 4 commits October 4, 2024 16:32
This allows code sent between machines to pre-evaluate things if
applicable. Naturally, this requires serialization version changes.
Some tweaks have been made to avoid changing hashes as much as
possible between versions.

Old serialized values and code can still be loaded, but obviously
they will be treated as completely uncacheable.
@ChrisPenner

@ceedubs

For run, run.compiled, and run.native, does it only evaluate/cache the dependencies of the function being run?

It will run and cache each pure top-level definition as part of adding it to the runtime's code cache. We only load the dependencies of whatever we're trying to run, so it'll evaluate the dependencies of a main the first time it's added to the runtime's code cache (e.g. once per ucm session, or once at the start of a run.compiled when we load the code from a uc file).

You've specified that this applies to ucm run. Does it apply to run.compiled and run.native as well?

See above; for ucm run and run.compiled, the pre-evaluation triggers for each definition the first time it's added to the code cache.

run.native doesn't benefit from this change at all unfortunately.

It sounds like this also applies to interactive ucm sessions. Does it cache the result of every pure top-level term in the codebase at startup? Or when you switch to a project does it compute only the ones with names in the current branch of the project? Or something else entirely?

It only pre-evaluates the dependencies of things you try to run, and will cache evaluation results for the lifespan of that runtime, which mostly corresponds to the lifetime of the UCM process. E.g. if you run main, it'll evaluate all cacheable definitions which main depends on (transitively). If you run main again, all the code is already loaded, so it won't do it again. If you change main, it'll load any new hashes and evaluate anything cacheable in ONLY the new hashes; old ones are still cached.
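Sketched with stand-in types (a plain string `Hash` and an assumed `eval`), the hash-keyed caching behaves like:

```haskell
import qualified Data.Map as Map

-- Stand-in for a definition's content hash.
type Hash = String

-- Adding code to the cache evaluates only hashes the runtime hasn't
-- seen; previously cached results are reused untouched.
addToCache :: (Hash -> v) -> [Hash] -> Map.Map Hash v -> Map.Map Hash v
addToCache eval newHashes cache = foldr insertIfNew cache newHashes
  where
    insertIfNew h m
      | Map.member h m = m                  -- already evaluated: reuse
      | otherwise = Map.insert h (eval h) m
```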

Is this relevant for both the interpreter and the JIT runtime?

Only the interpreter for now.

Historically it hasn't been too bad to have a (pure) test in a project that checks 100k samples and takes 10 seconds to evaluate, because the result gets cached. But are we going to pay that 10 second penalty just to cache [Ok "passed"] in more places now (like when I run my main method or when I switch into that project)?

This is only an issue if you reference the test in your main.

We may want to look into optimizations for docs and termLinks where we don't necessarily depend on the actual result of a referenced definition 🤔


ChrisPenner commented Oct 4, 2024

One downside of this change is if you have code like this:

main = do
  if something 
    then exit 0
    else Map.lookup k hugeExpensiveMap

Previously, hugeExpensiveMap would only be evaluated when it was actually hit by the interpreter, so this program would exit immediately if something is true.
Now we'll pre-evaluate hugeExpensiveMap once before even starting to run the program, which could increase startup time in some cases.

So this change will be good for longer-term processes like cloud and webservers, but potentially worse for short-lived processes like cli apps.

If the problem proves noticeable enough, we can probably fix this delay by continuing the Closure serialization work I mention in the overview. I think we still want to do that work; it was just starting to hit diminishing returns for how tricky it was, when we have bigger possible gains from other work 😄.

@pchiusano

I noticed this broke the serialized form round trip golden tests. Which might be fine, since I'd expect this could change the serialized form for some definitions, but it's worth checking.

@SystemFw do we have Unison Cloud client tests to make sure that no hashes change for schema types and functions? The way these could work is to just crypto.hash various values and save as a hex string literal, and then the test does a comparison and fails if the hashes differ.


SystemFw commented Oct 7, 2024

We now have a test in the Cloud client to test for the relevant hash changes. It passes on this branch.

It was meant to be a test in a `match` expression, but was missing
a #:when
@dolio dolio requested a review from a team as a code owner October 8, 2024 02:18

aryairani commented Oct 8, 2024

serializing the pre-evaluated pure results is possible with more work

FWIW, contrary to the majority opinion in the meeting, I don't think this would be a good idea. AFAIK other languages don't do it, apart maybe from some trivial constant folding. If someone really wants to precompute a pure result, they can always use add.run.

@@ -14,8 +14,8 @@ on:
 env:
   ## Some version numbers that are used during CI
   ormolu_version: 0.7.2.0
-  jit_version: "@unison/internal/releases/0.0.20"
+  jit_version: "@unison/internal/releases/0.0.21"
   runtime_tests_version: "@unison/runtime-tests/releases/0.0.1"
Contributor

Just curious: why did the jit version change?

Contributor

I implemented the new serialization format for the jit. This means it now has access to the cacheability information as well.

I was going to implement the top-level value behavior in the jit too, but after some struggling I realized that my stronglyConnectedComponents implementation does not actually output a topological sorting of the SCCs, which I need in order to emit the racket definitions in an acceptable order. So I'll probably just merge this as is.
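For what it's worth, the containers library's Data.Graph.stronglyConnComp does produce SCCs in reverse topological order (dependencies first), which is the order needed when emitting definitions. A small illustration with a made-up dependency graph, not actual Unison definitions:

```haskell
import Data.Graph (flattenSCC, stronglyConnComp)

-- Each entry: (definition, its key, keys it references).
defs :: [(String, String, [String])]
defs =
  [ ("main", "main", ["helper", "shared"])
  , ("helper", "helper", ["shared"])
  , ("shared", "shared", [])
  ]

-- stronglyConnComp returns components reverse topologically sorted, so
-- every definition's dependencies appear in an earlier (or the same)
-- component.
ordered :: [[String]]
ordered = map flattenSCC (stronglyConnComp defs)
```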

Contributor

I guess I wouldn't have expected it to change in an interpreter-related PR, but is it just a compatibility thing? That the interpreter was updated to match the new serialization format?

@dolio dolio Oct 9, 2024


The interpreter, jit and ucm use a common format for code interchange, so when that changes (to include the cacheability information), the jit needs to be updated to understand it. Otherwise ucm could send it stuff that would just make it bomb. For instance, the format is used for run.native and compile.native to tell the executable what code to run/compile.

Pretty sure it wouldn't even get past CI if I didn't update it in this case, because the version was incremented.

@pchiusano

@SystemFw cool, I'm good with merging this then, whenever @dolio and/or @ChrisPenner are ready.

@dolio dolio merged commit 92e08be into trunk Oct 9, 2024
32 checks passed
@dolio dolio deleted the cp/cache-toplevel branch October 9, 2024 04:58