Principled forward model for uniformly converged cost to go #766

Open

dannyopts opened this issue Jul 25, 2024 · 1 comment

Hi,

I am interested in understanding the cost-to-go function across a broad range of different states, not just on-chain.

With the default forward pass we only explore the restricted region of the state space that is visited when the system is controlled close to optimally, so the model can perform poorly "off-chain".

For example, I have an energy system with a number of stores, and I want to understand the marginal value of additional storage in different states under the converged model, so uniformly converging the store-state components of the cost-to-go (CTG) function is important.

One option would be to sample the store state uniformly at random in the forward pass, but I'm curious whether there is a more principled way to select the next set of points to explore (the forward path) in order to maximise the expected learning.
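
For concreteness, the naive version would be something like the sketch below (the store names and bounds are made up for illustration):

# Sketch: draw the store components of a trial state uniformly at random from
# their box bounds, leaving any other state components untouched.
store_bounds = Dict(
    :battery_soc => (0.0, 10.0),      # MWh, hypothetical
    :hydrogen_store => (0.0, 50.0),   # MWh, hypothetical
)

function sample_store_state(state::Dict{Symbol,Float64})
    new_state = copy(state)
    for (name, (lo, hi)) in store_bounds
        new_state[name] = lo + rand() * (hi - lo)
    end
    return new_state
end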

At first I wondered whether Bayesian optimisation or something similar could be used to sample from the state space, as in hyperparameter optimisation. The big difference here is that the "black box" function we want to optimise is something like a combination of the expected optimal-policy improvement and the expected off-policy CTG error improvement, and it is continually changing.

Once we have visited an off-chain point, it seems reasonable to assume that the expected off-policy performance improvement from visiting that point again will be significantly reduced, since we have corrected the first-order errors. So a reasonable approximation of the expected off-policy performance improvement would be the time since we last evaluated near that point.

One way to do this would be to have a modified forward pass which maintains a tabu list and includes the L1 distance from each of these points as a form of regularisation.
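
Concretely, I'm imagining something like this rough sketch (not SDDP.jl-specific): draw a set of candidate points, score each by its L1 distance to the tabu list of recently visited points, and explore the most distant one. A recency weight could be added so that old visits count for less.

# Sketch: pick the next off-chain trial point as the candidate furthest
# (in L1 distance) from the points we have already visited.
l1(x, y) = sum(abs(x[k] - y[k]) for k in keys(x))

function next_trial_point(
    tabu::Vector{Dict{Symbol,Float64}},
    candidates::Vector{Dict{Symbol,Float64}},
)
    isempty(tabu) && return first(candidates)
    # Score each candidate by its distance to the nearest tabu point.
    scores = [minimum(l1(c, t) for t in tabu) for c in candidates]
    return candidates[argmax(scores)]
end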

Does this make sense? Is there a better way of approaching this problem?

odow (Owner) commented Jul 25, 2024

You are entirely correct that the forward pass does not need to "optimize" anything and that we can select the trial points somewhat arbitrarily.

I don't have any suggestions for what the right approach is, and I'm not aware of any literature on this, but it shouldn't be too hard to implement as a new forward pass:
https://github.com/odow/SDDP.jl/blob/master/src/plugins/forward_passes.jl
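
For example, here is a rough sketch of such a plugin. This is hedged: it assumes the internal interface in forward_passes.jl (SDDP.AbstractForwardPass, SDDP.DefaultForwardPass, and a forward_pass method returning a named tuple with a sampled_states field), and the type and field names below are made up, so check the signatures against the current source:

# Sketch only: run the default forward pass to get a scenario path, then
# overwrite the recorded trial states (at which the backward pass refines
# cuts) with uniform draws from a user-supplied box over the state variables.
struct UniformExplorationForwardPass <: SDDP.AbstractForwardPass
    lower::Dict{Symbol,Float64}
    upper::Dict{Symbol,Float64}
    default::SDDP.DefaultForwardPass
end

function UniformExplorationForwardPass(lower, upper)
    return UniformExplorationForwardPass(lower, upper, SDDP.DefaultForwardPass())
end

function SDDP.forward_pass(
    model::SDDP.PolicyGraph{T},
    options::SDDP.Options,
    pass::UniformExplorationForwardPass,
) where {T}
    ret = SDDP.forward_pass(model, options, pass.default)
    for state in ret.sampled_states
        for (name, lo) in pass.lower
            state[name] = lo + rand() * (pass.upper[name] - lo)
        end
    end
    return ret
end

Something like this could then be passed to SDDP.train via the forward_pass keyword argument, and the uniform draw could be swapped for the tabu-list scoring you describe.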

We do a form of upper confidence bound learning in the duality handlers:

function BanditDuality()
    return BanditDuality(ContinuousConicDuality(), StrengthenedConicDuality())
end

function _choose_best_arm(handler::BanditDuality)
    _, index = findmax(
        map(handler.arms) do arm
            return Statistics.mean(arm.rewards) + Statistics.std(arm.rewards)
        end,
    )
    handler.last_arm_index = index
    return handler.arms[index]
end

function _update_rewards(handler::BanditDuality, log::Vector{Log})
    # The bound is monotonic, so instead of worrying about whether we are
    # maximizing or minimizing, let's just compute:
    #              |bound_t - bound_{t-1}|
    #     reward = -----------------------
    #                 time_t - time_{t-1}
    t, t′ = log[end], log[end-1]
    reward = abs(t.bound - t′.bound) / (t.time - t′.time)
    # This check is needed because we should probably keep using the first
    # handler until we start to improve the bound. This can take quite a few
    # iterations in some models. (Until we start to improve, the reward will be
    # zero, so we'd never revisit it.)
    const_bound = isapprox(log[1].bound, log[end].bound; atol = 1e-6)
    # To start with, we should add the reward to all arms to construct a prior
    # distribution for the arms. The 10 is somewhat arbitrary.
    if length(log) < 10 || const_bound
        for arm in handler.arms
            push!(arm.rewards, reward)
        end
    else
        push!(handler.arms[handler.last_arm_index].rewards, reward)
    end
    return
end

We also do a form of the tabu-list/L2-distance idea for cyclical graphs, where we terminate after each cycle:

# ===== Begin: starting state for infinite horizon =====
starting_states = options.starting_states[node_index]
if length(starting_states) > 0
    # There is at least one other possible starting state. If our
    # incoming state is more than δ away from the other states, add it
    # as a possible starting state.
    if distance(starting_states, incoming_state_value) >
       options.cycle_discretization_delta
        push!(starting_states, incoming_state_value)
    end
    # TODO(odow):
    # - A better way of randomly sampling a starting state.
    # - Is it bad that we splice! here instead of just sampling? For
    #   convergence it is probably bad, since our list of possible
    #   starting states keeps changing, but from a computational
    #   perspective, we don't want to keep a list of discretized points
    #   in the state-space δ distance apart...
    incoming_state_value =
        splice!(starting_states, rand(1:length(starting_states)))
end

The closest thing is Regan's problem-child algorithm, but this selects only the realization of the random variable that leads to the greatest learning: https://optimization-online.org/wp-content/uploads/2018/02/6449.pdf
