# RL-for-Transportation

Paper list of Reinforcement Learning (RL) applied to transportation

## Ride-sourcing system

### Survey

  1. Reinforcement Learning for Ridesharing: A Survey. 2021

### Dataset

  1. DiDi GAIA Dataset

### Competition

  1. Learning to Dispatch and Reposition on a Mobility-on-Demand Platform

### Book

  1. Approximate Dynamic Programming: Solving the curses of dimensionality. Powell, W. B. (2007).

### Paper

#### Order dispatching

  1. predict the cancellation probability $p_{ij}$ of each vehicle-order pair
  2. maximize the total success rate, $\sum_{i,j} a_{ij}(1 - p_{ij})$:
    1. $a_{ij}$: binary matching decision
    2. NP-hard combinatorial optimization
      1. Hill-Climbing algorithm
  1. bipartite matching + evaluation (a minimal matching sketch follows this list)
    1. weight: advantage trick
      1. discounted reward + future value - current value
    2. state: time & space zone (no contextual information)
    3. value: policy evaluation (offline)
  1. bipartite matching + policy evaluation
    1. weight: advantage value - distance
    2. state:
      1. time & space zone + supply-demand info
      2. coarse coding with hierarchical hexagon grid
    3. value (offline)
      1. policy evaluation: with supply-demand info
      2. distillation: marginalize time-space value
  1. MARL (on policy)
    1. state: contextual information
    2. action: active order pool
      1. mean action: defined as the number of neighboring drivers
    3. reward:
      1. order fare
      2. pickup distance
      3. destination supply-demand gap
  1. MARL (on policy)
    1. state: contextual features
    2. TD loss + KL regularizer
      1. regularizer: KL divergence between the order distribution and the vehicle distribution
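
A minimal sketch of the bipartite-matching-with-value entries above, assuming a pre-computed time-space value table `V` from offline policy evaluation, toy scalar rewards/trip times, and SciPy's Hungarian solver standing in for the papers' large-scale matching solvers:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def advantage(order, driver, V, gamma=0.9):
    """Edge weight: discounted trip reward + future value of the destination - current value."""
    v_now = V.get((driver["t"], driver["zone"]), 0.0)
    v_dest = V.get((order["t_end"], order["dest_zone"]), 0.0)
    return order["reward"] + (gamma ** order["trip_time"]) * v_dest - v_now

def dispatch(drivers, orders, V):
    """Assign drivers to orders so that the total advantage is maximized."""
    weights = np.array([[advantage(o, d, V) for o in orders] for d in drivers])
    rows, cols = linear_sum_assignment(-weights)   # Hungarian algorithm, maximized
    return [(drivers[r], orders[c]) for r, c in zip(rows, cols) if weights[r, c] > 0]
```

The second matching entry above additionally subtracts a pickup-distance penalty from the edge weight; adding a `- k * distance(driver, order)` term would reproduce that variant.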

#### Order delaying

  1. MARL (on policy)
    1. state: contextual features
    2. action: {0,1}, match or hold
    3. reward: customer waiting time
      1. weighted global + individual reward
  1. RL (off policy) (a toy match-or-hold sketch follows this list)
    1. state: global grid-based state -> flattened
    2. action: {0,1}, match or hold
    3. reward:
      1. matching waiting time
      2. pickup waiting time
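
A toy sketch of the match-or-hold decision above, using tabular Q-learning over a hashable state rather than the papers' neural policies; `Q`, `state`, and the scalar `waiting_cost` are illustrative stand-ins:

```python
import random

def act(Q, state, eps=0.1):
    """Action 0 = hold the order for the next batch, 1 = match it now."""
    if random.random() < eps:                      # epsilon-greedy exploration
        return random.randint(0, 1)
    return max((0, 1), key=lambda a: Q.get((state, a), 0.0))

def td_update(Q, s, a, waiting_cost, s_next, alpha=0.1, gamma=0.99):
    """Off-policy one-step Q-learning update; the reward is negative waiting time."""
    target = -waiting_cost + gamma * max(Q.get((s_next, b), 0.0) for b in (0, 1))
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```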

#### Order pooling

  1. RL (on policy)
    1. state: time & space grid
    2. action: wait, TK1, TK2
      1. wait: stay at the current location
      2. TK1: pick up orders within the maximum pickup time
      3. TK2: TK1 + a larger pickup time for the second order
    3. reward: effective distance traveled
  1. RL (on policy)
    1. state: global supply-demand profile map
    2. action (sequentially): follow the shortest path
      1. find another customer
      2. next zone
    3. reward
      1. served customer
      2. detour time
  1. The same as DeepPool
    1. considers changes in the underlying MDP (with different models)
    2. online Dirichlet change point detection (ODCP) to detect changes
  1. ADP
    1. decision
      1. routes are determined using the shortest-path strategy
      2. one decision, one assignment
      3. linear approximation of the value function
      4. linear assignment problem
        1. dual update
  1. ADP (a pooling-ADP sketch follows this list)
    1. decision
      1. follow the shortest path
      2. generate feasible order combinations
      3. value approximation: individual value
        1. low-dimensional embedding of each vehicle location
      4. linear assignment
        1. TD update
  1. like NeuralADP
    1. average of neighbor values to approximate the individual value:
    2. each neighbor value is the conditional expectation given the action
    3. final approximation
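
A hedged sketch of the ADP-style pooling loop in the entries above: feasible vehicle/order-combination pairs are scored by their immediate reward plus a linear approximation of the vehicle's post-decision value, the assignment is solved as a linear assignment problem, and the value weights get a TD update. The array shapes and the weight vector `w` are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign(rewards, post_embeds, w, gamma=0.95):
    """rewards[i, j]: immediate reward of serving feasible combination j with vehicle i;
    post_embeds[i, j]: embedding of vehicle i's post-decision location for combination j."""
    values = rewards + gamma * post_embeds @ w      # shape (n_vehicles, n_combinations)
    rows, cols = linear_sum_assignment(-values)     # maximize total approximate value
    return list(zip(rows, cols))

def td_update(w, embed, reward, next_embed, alpha=0.01, gamma=0.95):
    """One-step TD update of the linear individual-value weights for a single vehicle."""
    td_error = reward + gamma * next_embed @ w - embed @ w
    return w + alpha * td_error * embed
```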

#### Order pricing

  1. RL (on-policy) (a pricing sketch follows this list)
    1. use TD learning to estimate the future value
    2. use a Contextual Bandit to set the price
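
A minimal sketch combining the two pieces above: a TD-learned value table `V` supplies the long-term value of the post-trip state, while a simple epsilon-greedy bandit over a discrete set of price multipliers stands in for the paper's contextual bandit; all names and the price grid are illustrative:

```python
import random

PRICE_MULTIPLIERS = [0.9, 1.0, 1.1, 1.2]           # illustrative discrete price grid

def choose_price(bandit_q, context, eps=0.05):
    """Pick a price multiplier for a (origin zone, time, demand level) context."""
    if random.random() < eps:
        return random.choice(PRICE_MULTIPLIERS)
    return max(PRICE_MULTIPLIERS, key=lambda m: bandit_q.get((context, m), 0.0))

def bandit_update(bandit_q, context, m, base_fare, accepted, V, dest_state,
                  alpha=0.1, gamma=0.9):
    """Bandit reward = realized fare plus the TD-estimated value of the destination
    state if the rider accepts, 0 otherwise."""
    reward = base_fare * m + gamma * V.get(dest_state, 0.0) if accepted else 0.0
    key = (context, m)
    bandit_q[key] = bandit_q.get(key, 0.0) + alpha * (reward - bandit_q.get(key, 0.0))
```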

#### Vehicle relocation

  1. recommend route for vacant vehicles
    1. segment profit
      1. earning:
      2. cost:
    2. expected route profits:
    3. algorithm
      1. Brute-Force based MNP Recommendation
      2. Recursive Recommendation Strategy
    4. multi-driver routes recommendation
      1. recommend the route with the lowest correlation with the second driver
  1. Defined as MDP:
    1. calibrate the pickup probability (discounted by the number of taxis)
    2. passenger destination probability
  2. solving
    1. Rolling Horizon:
    2. DP approach
      1. discounting the pickup probability:
  1. similar to the last one
    1. long horizon
    2. discounted probability of competing drivers
  1. RL (on-policy)
    1. state: heatmap + CNN
    2. making decisions sequentially for each vehicle
  1. MARL (on policy) (a difference-reward sketch follows this list)
    1. centralized critic + decentralized policies
    2. count-based global state
    3. difference-reward policy gradient
      1. Wonderful Life Utility (WLU)
        1. global objective with and without agent i
      2. Aristocratic Utility (AU)
        1. fix the actions of other agents, marginalize over agent i
        2. counterfactual
    4. approximating the central critic
      1. linear combination
        1. individual information
        2. f without the information of other agents
      2. first-order approximation
        1. the coefficients are evaluated at the overall state-action counts
  1. RL (on-policy)
    1. state + contextual features
    2. action: neighbor grids
      1. sequentially make decision
      2. avoid moving in conflict directions
        1. add collaborative context indicating directions of previous vehicles
      3. avoid moving to low-value grid
  1. Policy evaluation (off-line)
    1. time-space value function
      1. dual policy evaluation
        1. conditional value: V(s|b)
        2. marginal value: V(s)
      2. Value-based Policy Search (VPS)
        1. one-step:
        2. two-step:
      3. implementation
        1. step selection: a small step works well
        2. long search: choose global top points
        3. contextual value:
        4. supply-demand (SD) regularizer
          1. add destination supply-demand gap
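
A hedged sketch of the difference-reward idea above (Wonderful Life Utility): an agent's learning signal is the count-based global objective with its contribution included minus the same objective with the agent removed; `global_objective` and the zone-count state are illustrative stand-ins for the paper's formulation:

```python
def global_objective(zone_supply, zone_demand):
    """Toy global objective: negative total supply-demand mismatch over zones."""
    return -sum(abs(zone_supply[z] - zone_demand[z]) for z in zone_demand)

def wonderful_life_utility(agent_zone, zone_supply, zone_demand):
    """G(with agent i) - G(without agent i), using zone-level vehicle counts."""
    g_with = global_objective(zone_supply, zone_demand)
    without = dict(zone_supply)
    without[agent_zone] -= 1                      # remove agent i from its zone
    g_without = global_objective(without, zone_demand)
    return g_with - g_without
```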

#### Joint dispatching and relocation

  1. MARL (on-policy), sequential decision making
    1. hierarchical structure
      1. upper level
        1. generate an encoding of the environment using an RNN
      2. lower level
        1. using info from the upper level, generate probabilities over different grids
        2. dispatching and relocating
    2. reward
      1. gap between the manager’s entropy and the global average entropy
      2. KL divergence between supply and demand
    3. coordination
      1. using attention to aggregate info of neighbor grids
  1. policy evaluation (off-line)
  2. online updating
    1. using current transitions
  3. ensemble of offline and online values (a small ensemble sketch follows this list)
  4. dispatching: bipartite matching
  5. relocating:
  1. RL (on-policy)
    1. centralized programming model
      1. planning in both dispatching and relocating
    2. TD learning for updating value function
  1. ADP (macro level)
    1. decision
    2. path-based pricing (market clearing)
    3. routing after dispatching (constrained zone choice)
    4. order sharing
    5. relocation
  2. piece-wise linear approximation of value function
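
A small sketch of the offline/online value ensemble above: the offline policy-evaluation value is blended with an online table updated from the current horizon's transitions, and the blended value can then feed the bipartite-matching weights. The mixing weight `lam` and the table-based representation are assumptions:

```python
def ensemble_value(s, V_offline, V_online, lam=0.7):
    """Blend the offline policy-evaluation value with the online value for state s."""
    return lam * V_offline.get(s, 0.0) + (1.0 - lam) * V_online.get(s, 0.0)

def online_td_update(V_online, s, reward, s_next, alpha=0.05, gamma=0.9):
    """Refresh the online table from transitions observed in the current horizon."""
    target = reward + gamma * V_online.get(s_next, 0.0)
    V_online[s] = V_online.get(s, 0.0) + alpha * (target - V_online.get(s, 0.0))
```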

## Intersection control

### Survey

  1. A survey on traffic signal control methods. 2019
  2. Recent advances in reinforcement learning for traffic signal control: A survey of models and evaluation. 2021

### Dataset

  1. Cityflow: A multi-agent reinforcement learning environment for large scale city traffic scenario
  2. Reinforcement Learning for Traffic Signal Control

### Competition

  1. City Brain Challenge

### Paper

#### Single-agent

  1. state:
  2. action: {1,0}
    1. change to the next phase
    2. keep phase
  3. reward:
    1. weighted reward of (queue length, delay, waiting time, light switches, number of vehicles, and travel time)
  4. algorithm:
    1. DQN
  1. imitation learning
    1. actor: ![](pic/2021-10-25-19-38-48.png)
    2. critic: ![](pic/2021-10-25-19-39-39.png)
  1. state:
    1. current phase (one-hot)
    2. number of vehicles
  2. action:
    1. pre-defined phases
  3. reward: (a pressure-reward sketch appears at the end of this section)
    1. pressure for movement:
    2. total pressure:
  4. algorithm:
    1. DQN
  1. reward
    1. Pressure with Remaining Capacity of Outgoing Lane:
  1. state
    1. current phase (one-hot)
    2. number of vehicles
  2. action:
    1. pre-defined phases
  3. reward:
    1. queue length
  4. invariance:
    1. flip and rotation
  1. FRAP + pressure reward
  2. reward:
    1. pressure based on queuing vehicles
  1. state
    1. traffic characteristics in each lane
  2. action
    1. pre-defined phases
  3. reward
    1. pressure
  4. algorithm:
    1. PG + MC
  5. invariance
    1. topology
  1. gradient-based meta learning
    1. training agents in clustered environments
    2. meta-training
  1. FRAP + gradient-based meta-learning
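
A hedged sketch of the state encoding and pressure-style reward shared by several single-agent entries above: the state concatenates a one-hot of the current phase with per-lane vehicle counts, and the reward is the negative absolute intersection pressure (incoming minus outgoing queues over all movements). The data structures are illustrative and not tied to any particular simulator:

```python
import numpy as np

def encode_state(current_phase, n_phases, lane_vehicle_counts):
    """State = one-hot of the current phase concatenated with per-lane vehicle counts."""
    phase_onehot = np.eye(n_phases)[current_phase]
    return np.concatenate([phase_onehot, np.asarray(lane_vehicle_counts, dtype=float)])

def movement_pressure(in_count, out_count):
    """Pressure of one movement: vehicles on the incoming lane minus the outgoing lane."""
    return in_count - out_count

def intersection_reward(movements):
    """Reward = -|total pressure|; `movements` is a list of (in_count, out_count) pairs."""
    total = sum(movement_pressure(i, o) for i, o in movements)
    return -abs(total)
```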

#### Multi-agent

  1. information aggregation:
  2. GAT to aggregate neighboring intersections' information (an attention-aggregation sketch follows this list)
  1. differentiable communication
  1. intrinsic reward
    1. prediction error of neighbors' rewards and transitions
  2. latent variable policy
    1. RNN encoded environment
  1. Hierarchy
    1. select sub-policies with different reward functions
  2. weighted local and neighbor reward
    1. adaptive weighting mechanism
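
A minimal NumPy sketch of the attention-based neighbor aggregation above (GAT-style): each intersection projects its own and its neighbors' observations, scores each neighbor with a learned attention vector, and mixes the neighbor embeddings by the resulting weights. The weight matrix `W`, attention vector `a`, and all dimensions are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate(own_obs, neighbor_obs, W, a):
    """own_obs: (d,), neighbor_obs: (k, d), W: (h, d), a: (2*h,)."""
    h_self = W @ own_obs                                  # (h,) own embedding
    h_nbrs = neighbor_obs @ W.T                           # (k, h) neighbor embeddings
    scores = np.array([a @ np.concatenate([h_self, h_n]) for h_n in h_nbrs])
    alpha = softmax(scores)                               # attention over neighbors
    return h_self + alpha @ h_nbrs                        # aggregated representation
```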