Add Wolfe line search to Laplace approximation #3250
base: develop
…lues for W, B, etc. are used
… mdivide_left_test, and mdivide_right_test
…to fix/wolfe-zoom1
I'll have a stab at reviewing the code but from the description this looks like an amazing PR! A few initial questions:
So currently line search is always on. If we have line search off, should we always just take one full Newton step each iteration? My thought process was that we should always have it on, since it only requires calculating the gradient with respect to theta. Sometimes we can actually get away with taking 2x a Newton step. And for some functions, like Aki's example, we need a very small step size at first but can then go back to 1x steps.
No before it was just being computed in
That was a change for the unit tests. Though they were breaking other tests so I'm going to revert those. I think that needs to be looked at inside of another PR.
No we do not. If I have time I'd like to add some logging and compare the previous and current implementations in terms of time spent. I have mostly just been looking at the unit tests and ballparking whether they seem to go slower or faster when I run them. But yes, actual performance tests would be nice.
…y values have small values that lose accuracy with finite diff
@charlesm93 another Q. If we fail to converge after N iterations, should we throw a hard error or return what we have so far with a warning? I'm thinking about how for Wolfe we just return with a warning instead of throwing away the results.
… Fix orders for constructor initializer list of WolfeStatus
Yes, with no linesearch you take a full Newton step. As to whether linesearch should always be on, it depends on how it affects performance in cases where no linesearch is needed. Most of the unit tests we have don't require it, and so in those cases we don't want to increase runtime. So I'm in favor of keeping it optional, unless the increase in runtime is marginal. I guess that speaks to my last point about having performance tests.
We should reject the current Metropolis proposal.
I made some comments. I'll need some help clarifying some of the code. We should find a time to go over the different strategies of the Wolfe line search and also the unit tests for the Wolfe line search.
(I didn't review the finite diff changes.)
}

/**
 * Computes theta gradient `f` wrt `theta` and `args...`
If the gradients are not only with respect to theta (the latent Gaussian variables), we might want to change the name theta gradients here to simply gradients of the log likelihood.
I'm going to add a note to clarify this. If a user passes in args that are reverse mode autodiff types then this would compute the gradients wrt both theta and args. But if the user passes in args that are not autodiff then we still compute the gradient wrt theta.
The main point of the function is to compute theta's gradients, which is why I make the name theta_grad. The args having their gradients calculated are more of a side effect.
}

/**
 * Computes likelihood argument gradient of `f`
so is this only differentiating wrt args but not theta?
Yes
}

/**
 * Computes negative block diagonal Hessian of `f` wrt`theta` and `args...`
Change the description slightly to distinguish from previous function and indicate that one handles the diagonal case and one doesn't.
/* iterations end when difference in objective function is less than tolerance
/**
 * iterations end when difference in objective function is less than tolerance
 * Default is sqrt(machine_epsilon)
That's very interesting... Did you find that reducing the tolerance (from 1e-6 to sqrt(machine_epsilon)) and increasing max step size improves performance on difficult problems (i.e. the roach problem)?
I moved this down because some of the tests were failing when doing the finite difference checks comparing the gradients. Also boost's integrators use sqrt machine epsilon as the default tolerance and it seemed like a reasonable number.
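For a sense of scale, `sqrt(machine_epsilon)` for `double` is roughly 1.5e-8, so the new default is tighter than the previous 1e-6. The sketch below is only an illustration of that default: the helper name `objective_converged` is hypothetical, and it assumes the stopping rule compares the absolute difference of successive objective values, which the diff excerpt above suggests but does not show.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper, not taken from the PR. Returns true when two successive
// objective values differ by less than `tolerance` in absolute value.
// The default tolerance is sqrt(machine epsilon), about 1.49e-8 for double.
inline bool objective_converged(
    double obj_prev, double obj_curr,
    double tolerance = std::sqrt(std::numeric_limits<double>::epsilon())) {
  return std::abs(obj_curr - obj_prev) < tolerance;
}
```

With this default, a change of 1e-9 in the objective counts as converged, while a change of 1e-7 (which the old 1e-6 tolerance would have accepted) does not.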
 * negative steps; descent is enforced by the line search.
 * @warning The vectors must have identical size. Non-finite inputs yield the
 * safe fallback.
 */
Can you provide a reference? I'd like to discuss this method---this is not something I'm familiar with and I want to make sure I understand the details.
Yes, happy to chat about it. I found it via Google and a bit of ChatGPT, so I can't say I'm an expert. It's a heuristic to find a decent starting step size for the next line search.
ok. We should have a reference that we can point other developers to, maybe even a short doc we write ourselves. This doesn't seem like an overly complicated procedure and writing it up should be straightforward and help us review the implementation of the procedure.
 *
 * The line search delegates *all* numerical work to a user-supplied callable
 * `update_fun` that evaluates the objective and its directional derivative at a
 * trial step and prepares the candidate state for acceptance.
I'm a bit confused here. So the Stan user needs to provide UpdateFun or is all this still happening internally?
This is all happening internally. We don't expose the wolfe line search to the user.
} // namespace internal

} // namespace stan::math
#endif
It seems like there is a lot happening in this function. It would be worth having a design doc that outlines the steps and different scenarios. It's also unclear to me what the user will need (or have the option) to supply.
Nothing here will touch user space. This is only used within laplace
solver,
tolerance,
max_num_steps,
laplace_line_search_options{max_steps_line_search},
If I read this correctly, the user interface has not changed? So the user has no control over the Wolfe line search and this is all handled internally?
The user has no control over the line search. I think adding all of those parameters would make the signature crazy looking, and imo users don't need to worry about it as long as we have decent default settings.
expect_near_rel(msg, ll_laplace_val, std::log(piece), rel_tol,
                "laplace_val", "integrated_val");
} catch (const std::domain_error& e) {
  // NOTE: Failures for integration our fine since we are testing laplace.
"our" --> "are"
}

// Tests stay focused on the strong-Wolfe variant because the API does not yet
// expose a weak-Wolfe configuration.
Ok... I lost track of the difference between the strong-Wolfe and weak-Wolfe configuration. Can you give me a reminder.
See page 7 from here
You have to flip the signs since we are doing ascent instead of descent. Both use the same Armijo condition (i), but the curvature check is different. The absolute values in the strong Wolfe condition stop you from overshooting, whereas weak Wolfe only checks that the slope along the search direction (for descent) is less steep than at the start. That means weak Wolfe allows you to overshoot.
If we used weak Wolfe we would sometimes overstep by a lot and have to backtrack in the opposite direction, causing jumping around. In my head it feels like that would be more work / iterations, though I did not test that.
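To make the difference concrete, here is a minimal sketch of the two checks written for ascent. Write phi(alpha) for the objective along the search direction, with phi'(0) > 0. The struct and function names, and the constants c1 = 1e-4 and c2 = 0.9, are common textbook choices assumed for illustration. They are not taken from the PR.

```cpp
#include <cmath>

// Outcome of the individual checks at one trial step (hypothetical type).
struct WolfeCheck {
  bool armijo;            // sufficient increase, shared by both variants
  bool weak_curvature;    // one-sided slope test
  bool strong_curvature;  // two-sided slope test, rejects large overshoots
};

// Ascent form of the conditions, i.e. the descent conditions with signs
// flipped. phi0 and dphi0 are the value and slope at alpha = 0; phi_a and
// dphi_a are the value and slope at the trial step alpha.
inline WolfeCheck check_wolfe_ascent(double phi0, double dphi0, double alpha,
                                     double phi_a, double dphi_a,
                                     double c1 = 1e-4, double c2 = 0.9) {
  WolfeCheck out;
  out.armijo = phi_a >= phi0 + c1 * alpha * dphi0;
  out.weak_curvature = dphi_a <= c2 * dphi0;
  out.strong_curvature = std::abs(dphi_a) <= c2 * std::abs(dphi0);
  return out;
}
```

For phi(alpha) = -(alpha - 1)^2, the trial step alpha = 1.95 is well past the maximum at alpha = 1. It passes Armijo and the weak curvature check, because the slope -1.9 is below 0.9 * 2, but it fails the strong check, because |-1.9| exceeds 1.8. That is the overshoot case described above.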
Yes I'll throw in a little JSON logger on a side branch so we can get some performance numbers over all the tests. Aki's example will fail if we turn off the line search. So I think we should leave it on and allow it to try a few iterations. But yeah we can hop on a call and discuss the line search strategy.
What is the intuition with that? My thought process is that if someone asks for 1e-12 and the optimizer gets down to 1e-10, should we really throw out the whole result? I feel like just telling users "Hey we were only able to get down to |
I think it's fine to have unit tests which fail with the default control parameters, as long as we get useful error messages, and we can get the unit tests to pass with non-default control parameters. The whole point of giving users control over tuning parameters is that sometimes the defaults don't cut it.
Sure thing!
You raise a valid point and I'm open to the idea of issuing a warning message. The argument for rejecting the proposal is that the user decides what an acceptable tolerance is for the solver. If the solver doesn't achieve that tolerance, then we might be concerned that the marginal likelihood is poorly approximated, say because the chain wandered into a pathological region of the parameter space, and then it is better to backtrack. Still, I like the idea of a warning message. It's then up to the user to check the quality of the inference directly, rather than relying on the quality of the numerical methods. (Note that issuing a warning message would be inconsistent with what we've done with other numerical methods, like the newton_solver.)
I wrote this branch that has a full json logger inside of it. There is a gist here with R code for making some graphs out of the json data (it's below the failure logs). NOTE: the test
Sorry, by fail I mean the gradients and value that we return are completely wrong for the roach data without the wolfe line search. For example here is the output below that shows this when line search is set to zero. I'm going to think for a little bit to see if we can't find a happy medium that does a little line search. This graph shows the number of evaluations inside of the wolfe line search for the roach data over multiple runs of the data. Each line is a separate run.
So we need a bunch of evaluations in the beginning, middle, and end. Maybe I can just revert to trying a full Newton step first: if one full step passes the Wolfe conditions we keep going, and otherwise we fall back to the Wolfe line search. I tried being too cutesy with this graph, but this is the initial step size and final step size for each of the laplace iterations over all of the runs of the roach data.
While there is too much going on here, you can see that we need to take a teeny tiny first step, but then after that we can get away with step sizes that are pretty big! A few other graphs and notes This shows the amount of time spent doing the wolfe line search for each of the tests in
This graph shows the runtime for each run of the tests given we either do a full newton step or use the line search.
So sometimes Wolfe is not that much worse. However, in the same graph below for the motorcycle data wolfe is way way slower!
But a full Newton step here also fails the AD test suite by about 3e-5. So I think there is something we can do where, if a full Newton step seems hairy, we fall back to Wolfe, but otherwise just accept the full Newton step and keep going. My intuition was that I thought the gradient for
This is what the arrow graph above is supposed to look like, and I think it is clearer what is going on relative to the one above. At iteration 0 we start with a step size of 1 and end up at a step size of 2. Then at iteration 1 we jump back down from a step size of 2 to a step size of 1.
I need to go into the AD testing framework and add logging for if a test passes or fails with newton or wolfe (and by how much). It requires diving into the code for the general AD test suite we use so I didn't get around to it yet.
Agree it would be nonstandard relative to our other solver. Though for LBFGS if we go through line search without hitting the tolerances we still report back values. If we can craft a nice and clear message for this then I think it would be nice to issue a warning.






Summary
This PR makes the following changes for the laplace approximation:
`theta` started the model in the tail of the distribution. The quick line search we did, which only tested half of a Newton step, was not robust enough for this model to reach convergence. This PR adds a full Wolfe line search to the Newton solver used in the laplace approximation to improve convergence in such cases.
The graphic below shows the difference in estimates of the log likelihood for `laplace` relative to `integrate_1d` on the roach test data, plotted along the mu and sigma estimates. There is still a bias relative to `integrate_1d` as mu becomes negative and sigma becomes larger, but it is much nicer than before.
`laplace_marginal_density_est` is expensive as it requires calculating either a diagonal Hessian or a block diagonal Hessian with 2nd order autodiff. The Wolfe line search only requires the gradients of the likelihood with respect to theta. So with that in mind the Wolfe line search tries pretty aggressively to get the best step size. If our initial step size is successful, we keep doubling until we hit a step size where the strong Wolfe conditions fail and then return the information for the step right before that failure. If our initial step size does not satisfy strong Wolfe then we do a bracketed zoom with cubic interpolation until we find a step size that satisfies the strong Wolfe conditions.
Tests for the Wolfe line search are added to `test/unit/math/laplace/wolfe_line_search.hpp`.
In the last iteration of the laplace approximation we were returning the negative block diagonal Hessian and derived matrices from the previous search. This is fine if the line search in that last step failed. But if the line search succeeds then we need to go back and recalculate the negative block diagonal Hessian and its derived quantities.
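The doubling strategy described above (keep doubling while the strong Wolfe conditions hold, then return the last step that passed) can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `ExpandResult`, `strong_wolfe_ascent`, and `expand_step` are hypothetical names, the constants are textbook defaults, and the bracketed zoom with cubic interpolation is left out. When the initial step fails, the sketch just reports that so a caller could switch to the zoom phase.

```cpp
#include <cmath>
#include <utility>

// Result of the expansion phase (hypothetical type).
struct ExpandResult {
  double alpha;   // last step that satisfied strong Wolfe, or the initial step
  bool accepted;  // false when even the initial step failed
};

// Strong Wolfe test for ascent along a fixed search direction.
inline bool strong_wolfe_ascent(double phi0, double dphi0, double alpha,
                                double phi_a, double dphi_a, double c1,
                                double c2) {
  const bool armijo = phi_a >= phi0 + c1 * alpha * dphi0;
  const bool curvature = std::abs(dphi_a) <= c2 * std::abs(dphi0);
  return armijo && curvature;
}

// `eval(alpha)` must return the pair {phi(alpha), phi'(alpha)}.
template <typename Eval>
ExpandResult expand_step(Eval&& eval, double alpha0, int max_doublings = 10,
                         double c1 = 1e-4, double c2 = 0.9) {
  const std::pair<double, double> start = eval(0.0);
  auto passes = [&](double alpha) {
    const std::pair<double, double> trial = eval(alpha);
    return strong_wolfe_ascent(start.first, start.second, alpha, trial.first,
                               trial.second, c1, c2);
  };
  if (!passes(alpha0)) {
    return {alpha0, false};  // a caller would move on to a zoom phase here
  }
  double best = alpha0;
  for (int i = 0; i < max_doublings; ++i) {
    const double trial = 2.0 * best;
    if (!passes(trial)) {
      break;  // keep the step right before the failure
    }
    best = trial;
  }
  return {best, true};
}
```

On phi(alpha) = -(alpha - 1)^2 with an initial step of 0.25, the steps 0.25, 0.5, and 1.0 all pass, the step 2.0 fails the Armijo check, and the sketch returns 1.0. Each trial only needs the value and directional derivative, which matches the motivation above that gradients are cheap relative to the Hessian.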
Previously we had one `block_hessian` function that calculated either the block Hessian or the diagonal Hessian, chosen at runtime. But this function is only used in places where we know at compile time whether we want a block or diagonal Hessian. So I split it into two functions to avoid unnecessary runtime branching.
For an initial step size estimate before each line search we use the Barzilai-Borwein method.
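As a reference point for the Barzilai-Borwein estimate mentioned above, here is a minimal sketch of the classical first BB formula, alpha = (s.s) / (s.y), with s the change in the iterate and y the change in the gradient. It is written in the usual minimization convention and is an assumption-laden illustration: the name `bb_step`, the use of `std::vector<double>`, and the fallback value of 1.0 are hypothetical, and how the PR adapts the formula to ascent and to Newton directions is not shown in this conversation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Classical Barzilai-Borwein (BB1) step estimate, minimization convention.
// Returns `fallback` when the sizes differ, when the estimate is not finite,
// or when it is not positive. All of these choices are assumptions.
inline double bb_step(const std::vector<double>& x_prev,
                      const std::vector<double>& x_curr,
                      const std::vector<double>& g_prev,
                      const std::vector<double>& g_curr,
                      double fallback = 1.0) {
  const std::size_t n = x_curr.size();
  if (x_prev.size() != n || g_prev.size() != n || g_curr.size() != n) {
    return fallback;
  }
  double ss = 0.0;  // s . s, with s = x_curr - x_prev
  double sy = 0.0;  // s . y, with y = g_curr - g_prev
  for (std::size_t i = 0; i < n; ++i) {
    const double s = x_curr[i] - x_prev[i];
    const double y = g_curr[i] - g_prev[i];
    ss += s * s;
    sy += s * y;
  }
  const double alpha = ss / sy;
  if (!std::isfinite(alpha) || alpha <= 0.0) {
    return fallback;
  }
  return alpha;
}
```

For the quadratic f(x) = x1^2 + x2^2, whose gradient is 2x, moving from (1, 1) to (0.5, 0.5) gives an estimate of 0.5, the inverse of the curvature.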
Previously we calculated them eagerly in each laplace iteration. But they are not needed within the inner loop, so we wait until we finish the inner search and then calculate their adjoints once afterwards.
We were calculating the covariance matrix from inside of `laplace_density_est`, but this required us to then return it from that function and imo looked weird. So I pulled it out and now `laplace_marginal_density_est` is passed the covariance matrix.
There were a few places where we could use `log_sum_exp` etc. so I made those changes.
The finite difference method in Stan was previously using a step size optimized for a 2nd order method. But the code is a 6th order method. I modified `finite_diff_stepsize` to use epsilon^(1/7) instead of cbrt(epsilon). With this change all of the laplace tests pass with a much tighter precision tolerance.
Tests
All the AD tests now have a tighter tolerance for the laplace approximation.
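The step size change described in the summary follows the usual balance between truncation error and round-off error. For a scheme with truncation error of order h^p and round-off error of order epsilon / h, the total is minimized near h = epsilon^(1/(p+1)). That gives cbrt(epsilon) for p = 2 and epsilon^(1/7) for p = 6. The sketch below only evaluates that scale factor. `fd_step_scale` is a hypothetical name and is not Stan's `finite_diff_stepsize`, which may apply additional scaling.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: step size scale for a finite difference scheme whose
// truncation error is of order h^order, from balancing h^order against
// epsilon / h. Returns epsilon^(1 / (order + 1)).
inline double fd_step_scale(int order) {
  const double eps = std::numeric_limits<double>::epsilon();
  return std::pow(eps, 1.0 / (order + 1));
}
```

For `double`, `fd_step_scale(2)` is about 6e-6 and `fd_step_scale(6)` is about 6e-3, so the 6th order scheme was previously run with a step roughly a thousand times smaller than this balance suggests.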
There are also tests for the Wolfe line search in `test/unit/math/laplace/wolfe_line_search.hpp`.
Release notes
Improve laplace approximation with wolfe line search and bug fixes.
Checklist
Copyright holder: Steve Bronder
The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
- `./runTests.py test/unit`
- `make test-headers`
- `make test-math-dependencies`
- `make doxygen`
- `make cpplint`
the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested