
Cache Pure Top Level Definitions on startup #5379

Merged
merged 45 commits into from
Oct 9, 2024

Conversation


@ChrisPenner ChrisPenner commented Sep 30, 2024

Overview

Before this change, all pure top-level definitions were re-evaluated at every call site, which is obviously unnecessary.

Now we can detect these using type info from the codebase, and inline the evaluated result into usage sites.
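As a rough illustration of the detection step: a definition qualifies when it takes no arguments and its type carries no abilities. The types and names below are hypothetical stand-ins, not the actual codebase API:

```haskell
-- Hypothetical sketch: a top-level definition is "cacheable" when it
-- denotes a plain value, i.e. its type is neither a function type nor
-- a computation requiring abilities.
data Ty
  = Arrow Ty Ty          -- a function type: re-applied at call sites, skip
  | Effects [String] Ty  -- a computation requiring abilities: not pure
  | Base String          -- a plain value type, e.g. "Nat" or "Map Nat Nat"

isCacheable :: Ty -> Bool
isCacheable (Arrow _ _)   = False
isCacheable (Effects _ _) = False
isCacheable (Base _)      = True
```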

We also discussed whether to bake the evaluated results into the compiled artifact.
The consensus was yes, serialize the results; I'd still like to do that, but after a day of fiddling it proved tougher than I expected.
The issues were:

  1. Closures can contain Foreigns, which are arbitrary Haskell values. Most of these won't appear in a pure value, but some do: things like stdin/stdout Handles, as well as patterns and text/bytes literals. These are all serializable in their own way, but it takes some care to ensure we do it correctly.
  2. Closures contain argument references which refer to closures via Seg 'BX, but this is fixed in the Stack implementation to include inlined Sections, which we can't serialize. We can most likely either parameterize this type or replace usage sites with a more generic version, but it affects a big chunk of things and will be a bit of a pain to figure out and propagate.

So serializing the pre-evaluated pure results is possible with more work, but it'd be nice to release the non-serializing version for now. The currently implemented version pre-evaluates cacheable values on executable startup.

When running something from a cold runtime (e.g. via ucm run or a fresh ucm session) there'll be a short delay while ucm evaluates everything, but it'll be snappy after that.

Implementation notes

  • I also did the optimization Dan suggested: avoiding the RComb indirection by including the CombIx directly in the Section/Instr. This will also help with serializing closures, but as mentioned above that part isn't done yet.
  • Add getTypeOfTerm to Runtime CodeLookup.
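A rough sketch of the indirection change in the first bullet, using placeholder types rather than the real ANF definitions:

```haskell
import Data.Word (Word64)

-- Placeholder sketch: Sections hold a stable CombIx directly instead of
-- a knot-tied RComb pointer, and the runtime resolves the index through
-- its combinator map on demand. A plain index is also easy to serialize.
-- (The real CombIx carries more information than a single Word64.)
newtype CombIx = CIx Word64
  deriving (Eq, Show)

data Section
  = Call CombIx  -- before: a resolved RComb reference; after: the index
  | Done
  deriving (Eq, Show)

-- Resolution happens at execution time via whatever lookup the runtime
-- provides, rather than being baked into the Section.
resolve :: (Word64 -> comb) -> Section -> Maybe comb
resolve lookupComb (Call (CIx n)) = Just (lookupComb n)
resolve _ Done = Nothing
```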

It's a bit annoying, because in order to evaluate the pure defns I need all the other combs to be in the right format for the runtime, so here's how it goes:

  1. Inline new combs
  2. Iterate over new cacheable definitions and evaluate them, replacing the comb in the map with the evaluated closure
  3. Re-inline new combs so they now inline the evaluated results.
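The middle step above can be sketched as a pass over the combinator map. Everything here (the Comb type, `eval`, `cacheable`) is a stand-in for the real runtime machinery, and the two inlining passes are elided:

```haskell
import qualified Data.Map as Map
import Data.Word (Word64)

-- Stand-ins for the real runtime's combinator types; `Cached` plays the
-- role of an evaluated-closure slot.
data Comb v = Code String | Cached v
  deriving (Eq, Show)

-- Step 2, sketched: walk the combinator map and replace each cacheable
-- definition's code with its evaluated result, leaving the rest alone.
precacheStep
  :: (String -> v)           -- assumed evaluator for pure combinators
  -> (Word64 -> Bool)        -- which combinators are cacheable
  -> Map.Map Word64 (Comb v)
  -> Map.Map Word64 (Comb v)
precacheStep eval cacheable = Map.mapWithKey step
  where
    step k (Code src) | cacheable k = Cached (eval src)
    step _ c = c
```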

Benchmarks

Made a test which references a 1k element map definition:

map1K = range 0 1000 |> map (i -> (i, i)) |> Map.fromList

pureTest : '{IO, Exception} ()
pureTest =
  do
    use List map replicate
    use Nat + - == range
    repeat n action =
      if n == 0 then ()
      else
        _ = action()
        repeat (n - 1) action

    printTime
      "Map.lookup (1k element map)" 1 let
        n -> repeat n do Map.get 12 map1K
======= Trunk
Map.lookup (1k element map)
3.361018ms

======= This Branch
Map.lookup (1k element map)
5.3µs

Loose ends

See overview about serializing closures.

@@ -4,14 +4,13 @@
{-# LANGUAGE PatternSynonyms #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE ViewPatterns #-}
-- TODO: Fix up all the uni-patterns
{-# OPTIONS_GHC -Wno-incomplete-uni-patterns #-}

@ChrisPenner ChrisPenner Sep 30, 2024


I replaced the uni-patterns with explicit errors so I could re-enable this.

combs :: EnumMap Word64 RCombs
combs =
( mapWithKey
srcCombs :: EnumMap Word64 Combs
Contributor Author

I now include all the original non-recursive Combs in the CCache because it makes them much easier to serialize.

@ChrisPenner ChrisPenner changed the title Cache Top Level Definitions on startup Cache Pure Top Level Definitions on startup Sep 30, 2024
@ChrisPenner ChrisPenner marked this pull request as ready for review September 30, 2024 23:21
@ChrisPenner ChrisPenner self-assigned this Sep 30, 2024

@pchiusano pchiusano left a comment


🎉


@dolio dolio left a comment


So, I noticed one thing that seems weird to me. Having a cached closure in a PAp seems wrong, so I think we need to tweak something. I'm not sure what yet, though. I'll think about it and look things over some more.

(useg, bseg) <- closeArgs C ustk bstk useg bseg args
ustk <- discardFrame =<< frameArgs ustk
bstk <- discardFrame =<< frameArgs bstk
apply !env !denv !activeThreads !ustk !bstk !k !ck !args = \case
Contributor

I think it doesn't really make much sense to have a partial application to a CachedClosure, similar to the Let situation. I haven't finished reading everything, but maybe there needs to be a type that is actually just combinator references, not both those and pre-evaluated closures.


dolio commented Oct 1, 2024

Okay, so, maybe it's as simple as this...

Separate the Lam constructor of GComb into its own type. This is the type that should occur in PAp.

Make GComb a sum referring to this new type. The new type should be unpacked into GComb, and pattern aliases could make using it look essentially like the current GComb.

Occurrences of the new, separate type don't need to be lazy, because they aren't from knot tying, they're the concrete entry info for an actual combinator.
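A minimal sketch of that shape, with placeholder fields rather than the real runtime definitions:

```haskell
{-# LANGUAGE PatternSynonyms #-}

-- The factored-out type: concrete entry info for an actual combinator.
-- The Int/String fields are illustrative stand-ins.
data CombInfo = LamI Int Int String
  deriving (Eq, Show)

-- GComb becomes a sum over combinators and cached closures. In the real
-- version CombInfo would be unpacked into the Comb constructor.
data GComb clos
  = Comb !CombInfo
  | CachedClosure !clos

-- A partial application now refers only to CombInfo, so it can never
-- contain a cached closure.
data Closure = PAp CombInfo [Closure]

-- A pattern alias keeps existing call sites reading like the old GComb.
pattern Lam :: Int -> Int -> String -> GComb clos
pattern Lam ua ba body = Comb (LamI ua ba body)
```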

dolio and others added 2 commits October 4, 2024 01:05
A PAp should only contain an actual combinator, not a cached value.
So, the combinator case has been factored out of and unpacked into
GComb. This way a PAp can refer to the factored-out part.

ceedubs commented Oct 4, 2024

@ChrisPenner I'm curious about when this kicks in and what "on startup" means. Could you help me understand? Some of my questions:

  • You've specified that this applies to ucm run. Does it apply to run.compiled and run.native as well?
  • For run, run.compiled, and run.native, does it only evaluate/cache the dependencies of the function being run?
  • It sounds like this also applies to interactive ucm sessions. Does it cache the result of every pure top-level term in the codebase at startup? Or when you switch to a project does it compute only the ones with names in the current branch of the project? Or something else entirely?
  • Is this relevant for both the interpreter and the JIT runtime?

Here's one case that I'm wondering about:

Historically it hasn't been too bad to have a (pure) test in a project that checks 100k samples and takes 10 seconds to evaluate, because the result gets cached. But are we going to pay that 10 second penalty just to cache [Ok "passed"] in more places now (like when I run my main method or when I switch into that project)?


ceedubs commented Oct 4, 2024

I should have mentioned in my last comment that I'm excited about this change, @ChrisPenner! I've seen repeated evaluation bite newcomers, and I've been curious to what extent it might be hitting us with certain Pattern values, decoders, etc. Thank you!

dolio and others added 4 commits October 4, 2024 16:32
This allows code sent between machines to pre-evaluate things if
applicable. Naturally, this requires serialization version changes.
Some tweaks have been made to avoid changing hashes as much as
possible between versions.

Old serialized values and code can still be loaded, but obviously
they will be treated as completely uncacheable.
@ChrisPenner

@ceedubs

For run, run.compiled, and run.native, does it only evaluate/cache the dependencies of the function being run?

It will run and cache each pure top-level definition as part of adding it to the runtime's code cache. We only load the dependencies of whatever we're trying to run, so it'll evaluate the dependencies of a main the first time it's added to the runtime's code cache (e.g. once per ucm session, or once at the start of a run.compiled when we load the code from a uc file).

You've specified that this applies to ucm run. Does it apply to run.compiled and run.native as well?

See above; for ucm run and run.compiled, the pre-evaluation triggers for each definition the first time it's added to the code cache.

run.native doesn't benefit from this change at all unfortunately.

It sounds like this also applies to interactive ucm sessions. Does it cache the result of every pure top-level term in the codebase at startup? Or when you switch to a project does it compute only the ones with names in the current branch of the project? Or something else entirely?

It only pre-evaluates the dependencies of things you try to run, and will cache evaluation results for the lifespan of that runtime, which mostly corresponds to the lifetime of the UCM process. E.g. if you run main, it'll evaluate all cacheable definitions which main depends on (transitively). If you run main again, all the code is already loaded, so it won't do it again. If you change main, it'll load any new hashes and evaluate anything cacheable in ONLY the new hashes; old ones are still cached.
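Sketched with stand-in types (a plain string `Hash` and an assumed `eval`), the hash-keyed caching behaves like:

```haskell
import qualified Data.Map as Map

-- Stand-in for a definition's content hash.
type Hash = String

-- Adding code to the cache evaluates only hashes the runtime hasn't
-- seen; previously cached results are reused untouched.
addToCache :: (Hash -> v) -> [Hash] -> Map.Map Hash v -> Map.Map Hash v
addToCache eval newHashes cache = foldr insertIfNew cache newHashes
  where
    insertIfNew h m
      | Map.member h m = m                  -- already evaluated: reuse
      | otherwise = Map.insert h (eval h) m
```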

Is this relevant for both the interpreter and the JIT runtime?

Only the interpreter for now.

Historically it hasn't been too bad to have a (pure) test in a project that checks 100k samples and takes 10 seconds to evaluate, because the result gets cached. But are we going to pay that 10 second penalty just to cache [Ok "passed"] in more places now (like when I run my main method or when I switch into that project)?

This is only an issue if you reference the test in your main.

We may want to look into optimizations for docs and termLinks where we don't necessarily depend on the actual result of a referenced definition 🤔


ChrisPenner commented Oct 4, 2024

One downside of this change is if you have code like this:

main = do
  if something 
    then exit 0
    else Map.lookup k hugeExpensiveMap

Previously, hugeExpensiveMap would only be evaluated when it was actually hit by the interpreter, so this program would exit immediately if something is true.
Now we'll pre-evaluate hugeExpensiveMap once before even starting to run the program, which could increase startup time in some cases.

So this change will be good for longer-term processes like cloud and webservers, but potentially worse for short-lived processes like cli apps.

If the problem proves noticeable enough, we can probably fix this delay by continuing the Closure serialization work I mention in the overview. I think we still want to do that work; it was just starting to hit diminishing returns for how tricky it was, when we have bigger possible gains from other work 😄.

@pchiusano

I noticed this broke the serialized form round trip golden tests. Which might be fine, since I'd expect this could change the serialized form for some definitions, but it's worth checking.

@SystemFw do we have Unison Cloud client tests to make sure that no hashes change for schema types and functions? The way these could work is to just crypto.hash various values and save as a hex string literal, and then the test does a comparison and fails if the hashes differ.


SystemFw commented Oct 7, 2024

We now have a test in the Cloud client to test for the relevant hash changes. It passes on this branch.

It was meant to be a test in a `match` expression, but was missing
a #:when
@dolio dolio requested a review from a team as a code owner October 8, 2024 02:18

aryairani commented Oct 8, 2024

serializing the pre-evaluated pure results is possible with more work

FWIW, contrary to the majority opinion in the meeting, I don't think this would be a good idea. AFAIK other languages don't do it, apart maybe from some trivial constant folding. If someone really wants to precompute a pure result, they can always use add.run.

@@ -14,8 +14,8 @@ on:
 env:
   ## Some version numbers that are used during CI
   ormolu_version: 0.7.2.0
-  jit_version: "@unison/internal/releases/0.0.20"
+  jit_version: "@unison/internal/releases/0.0.21"
   runtime_tests_version: "@unison/runtime-tests/releases/0.0.1"
Contributor

Just curious: why did the jit version change?

Contributor

I implemented the new serialization format for the jit. This means it now has access to the cacheability information as well.

I was going to implement the top-level value behavior in the jit too, but after some struggling I realized that my stronglyConnectedComponents implementation does not actually output a topological sorting of the SCCs, which I need in order to emit the racket definitions in an acceptable order. So I'll probably just merge this as is.
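For what it's worth, the containers library's Data.Graph.stronglyConnComp does produce SCCs in reverse topological order (dependencies first), which is the order needed when emitting definitions. A small illustration with a made-up dependency graph, not actual Unison definitions:

```haskell
import Data.Graph (flattenSCC, stronglyConnComp)

-- Each entry: (definition, its key, keys it references).
defs :: [(String, String, [String])]
defs =
  [ ("main", "main", ["helper", "shared"])
  , ("helper", "helper", ["shared"])
  , ("shared", "shared", [])
  ]

-- stronglyConnComp returns components reverse topologically sorted, so
-- every definition's dependencies appear in an earlier (or the same)
-- component.
ordered :: [[String]]
ordered = map flattenSCC (stronglyConnComp defs)
```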

Contributor

I guess I wouldn't have expected it to change in an interpreter-related PR, but is it just a compatibility thing? That the interpreter was updated to match the new serialization format?

@dolio dolio Oct 9, 2024


The interpreter, jit and ucm use a common format for code interchange, so when that changes (to include the cacheability information), the jit needs to be updated to understand it. Otherwise ucm could send it stuff that would just make it bomb. For instance, the format is used for run.native and compile.native to tell the executable what code to run/compile.

Pretty sure it wouldn't even get past CI if I didn't update it in this case, because the version was incremented.

@pchiusano

@SystemFw cool, I'm good with merging this then, whenever @dolio and/or @ChrisPenner are ready.

@dolio dolio merged commit 92e08be into trunk Oct 9, 2024
32 checks passed
@dolio dolio deleted the cp/cache-toplevel branch October 9, 2024 04:58