Can you change the reward function after training in reinforcement learning?

Arpan Kusari
6 min read · Nov 3, 2020

You want to train your robot to find the box containing mana. But right next to the box is a bottomless pit that the robot has to avoid, or else it dies. For every grid cell it visits without reaching the box, it expends some amount of energy. What I am describing here is the classic gridworld example.

Gridworld environment

Now, of course, for this case the reward function is well-defined, with a positive reward of +1, a negative reward of -1 and a living reward (the cost of traversing each grid cell) of -0.02. You complete the training process and obtain the optimal value function under this reward function.
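For a grid this small, value iteration is enough to recover the optimal value function for any one reward setting. Here is a minimal sketch in Python; the grid layout is illustrative and the transitions are deterministic, so it is a simplification of the usual noisy gridworld rather than the exact setup behind the figures:

```python
import numpy as np

def value_iteration(rewards, terminals, gamma=0.99, theta=1e-6):
    """Plain value iteration on a deterministic 4-connected gridworld.

    rewards   : 2D array of per-cell living rewards, with the terminal
                cells holding their +1 / -1 payoffs
    terminals : boolean mask of terminal cells (goal and pit)
    """
    n_rows, n_cols = rewards.shape
    V = np.zeros((n_rows, n_cols))
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    while True:
        delta = 0.0
        for r in range(n_rows):
            for c in range(n_cols):
                if terminals[r, c]:
                    V[r, c] = rewards[r, c]
                    continue
                best = -np.inf
                for dr, dc in moves:
                    nr = min(max(r + dr, 0), n_rows - 1)  # moves off the grid stay in place
                    nc = min(max(c + dc, 0), n_cols - 1)
                    best = max(best, rewards[r, c] + gamma * V[nr, nc])
                delta = max(delta, abs(best - V[r, c]))
                V[r, c] = best
        if delta < theta:
            return V

# One full solve per reward setting, e.g. living reward = -0.02
rewards = np.full((3, 4), -0.02)
terminals = np.zeros((3, 4), dtype=bool)
rewards[0, 3], terminals[0, 3] = +1.0, True   # mana box
rewards[1, 3], terminals[1, 3] = -1.0, True   # bottomless pit
V_opt = value_iteration(rewards, terminals)
```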

Optimal policy and optimal value function using the default reward function

But you suddenly find out that the living reward needs to be changed to -0.2 (a tenfold increase in magnitude). So what do you do? You change the living reward and run the training process again.

Optimal policy and optimal value function using living reward = -0.2

Now, if the living reward changes again, you can quickly see that the repeated training becomes cumbersome. Imagine, further, that you have a more complicated reward function of the form

R(s, a) = w1·r1(s, a) + w2·r2(s, a) + … + wn·rn(s, a),

where you have to find the desired weights by trial-and-error; you can quickly understand how this becomes a time-consuming problem to solve.
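To make the cost concrete, here is a hedged sketch of what that trial-and-error search looks like when the reward is a weighted sum of a living reward, a pit penalty and a goal reward (the candidate weight grid is made up for illustration, and it reuses the value_iteration helper from the sketch above). Every candidate weight vector pays for a complete training run before you can even judge it:

```python
import numpy as np
from itertools import product

# Hypothetical candidate weights for (living reward, pit reward, goal reward)
living_weights = [-0.02, -0.1, -0.2, -0.3]
pit_weights = [-1.0, -2.0, -5.0]
goal_weights = [1.0, 2.0, 5.0]

solutions = {}
for w_live, w_pit, w_goal in product(living_weights, pit_weights, goal_weights):
    rewards = np.full((3, 4), w_live)
    terminals = np.zeros((3, 4), dtype=bool)
    rewards[0, 3], terminals[0, 3] = w_goal, True   # mana box
    rewards[1, 3], terminals[1, 3] = w_pit, True    # bottomless pit
    # every single weight combination requires a full solve / training run
    solutions[(w_live, w_pit, w_goal)] = value_iteration(rewards, terminals)
```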

This is the problem that stumped us (by "we" I mean my collaborator Prof. Jonathan How of MIT and myself). We asked ourselves a simple question:

Given that we have the optimal value functions at a few sample weight points, can we interpolate over the whole space of optimal value functions?

Spoiler alert: Yes, we can.

We published a paper titled “Predicting optimal value functions by interpolating reward functions in scalarized multi-objective reinforcement learning” at the International Conference on Robotics and Automation (ICRA) 2020 and created a video explaining the problem and its solution:

Now we delve into the paper. The problem, as it turns out, is one of supervised learning: the optimal value functions are the response variables for the given weights, and we have to predict the optimal value functions at unknown weights. The change in the value function with respect to the weights may be non-uniform, which makes the mapping highly nonlinear. Moreover, as the number of objectives increases, the weight space grows and the sampled data points become extremely sparse. Being able to predict accurate value functions across the weight space would therefore be very useful. The aim of this research is to interpolate through the space of optimal value functions induced by changing the weights of the reward function, using a Gaussian Process (GP).
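As a concrete sketch of that supervised-learning framing: treat the weight as the input, the flattened optimal value function as the output, and fit a GP over the sampled weights. This uses scikit-learn's GP and a hypothetical solve_gridworld helper standing in for the training step; it is an illustration of the idea, not the paper's exact implementation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Weights at which the RL problem was actually solved (living rewards 0 .. -0.5),
# and the corresponding optimal value functions flattened over all states.
# `solve_gridworld` is a hypothetical stand-in for the value-iteration step above.
W_train = np.arange(0.0, -0.51, -0.1).reshape(-1, 1)
V_train = np.stack([solve_gridworld(w) for w in W_train.ravel()])

kernel = ConstantKernel(1.0) * RBF(length_scale=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(W_train, V_train)

# Predict the optimal value function at a weight we never trained on,
# together with a predictive standard deviation.
V_pred, sigma = gp.predict(np.array([[-0.23]]), return_std=True)
```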

Next, we spend a bit of time looking at the mathematics that makes this possible. Feel free to skip it and look directly at the results.

Theorem: The gradient of the state-value function with respect to the weights exists if all the rewards at the current state and action are finite.
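To get an intuition for why (an informal sketch for the scalarized case, not the paper's proof): with a reward of the form R_w(s, a) = wᵀr(s, a), the weights can be pulled outside the expectation, so differentiating with respect to w leaves only the discounted sum of per-objective rewards:

```latex
% Scalarized reward: R_w(s, a) = w^\top r(s, a)
V^{\pi}_{w}(s)
  = \mathbb{E}_{\pi}\!\Bigl[\sum_{t=0}^{\infty} \gamma^{t}\, w^{\top} r(s_t, a_t) \Bigm| s_0 = s\Bigr]
  = w^{\top}\, \mathbb{E}_{\pi}\!\Bigl[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \Bigm| s_0 = s\Bigr]
\quad\Longrightarrow\quad
\nabla_{w} V^{\pi}_{w}(s)
  = \mathbb{E}_{\pi}\!\Bigl[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \Bigm| s_0 = s\Bigr]
```

This sum is finite whenever the per-objective rewards are bounded and γ < 1, so the gradient exists. The argument above is for a fixed policy π; handling the optimal value function, where the policy itself changes with the weights, is where the paper's additional conditions come in.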

A similar result also holds for the Q-function. A limitation of the theorem is that it holds only for strictly convex reward functions. Now we look at the results:

Gridworld

1. We change the living reward weight in steps of -0.1 from 0 to -0.5 and train the RL agent using a value iteration approach. We then randomly choose five different weights, four interpolated and one extrapolated, and report the mean squared error and median sigma to validate our theorem (a rough sketch of this evaluation loop follows the list below). The following figure shows the result of interpolation at a living reward weight of -0.23.

Fig. 1. (a) Optimal policy and optimal value function for living reward (0) and (b) optimal policy and value function for living reward (−0.5). (c) For the interpolation of living reward (−0.23), we show the optimal value functions for two neighboring points with living reward (−0.2) and living reward (−0.3). (d) Predicted and actual optimal value function values for living reward (−0.23).

2. We next change the terminal negative reward weight from -1 to -5 in steps of -0.5. Again, both interpolation (first four rows) and extrapolation (last row) evaluation cases were considered.

3. Finally, we alter the positive terminal reward weight from 1 to 5 in steps of 0.5 and, as in the previous cases, we present the interpolation at random weight points.
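A rough sketch of that evaluation loop, reusing the GP fitted in the earlier sketch (apart from −0.23, the held-out weights below are made up for illustration, and solve_gridworld again stands in for retraining from scratch):

```python
import numpy as np

# Reusing the `gp` fitted in the earlier sketch; `solve_gridworld` is again a
# hypothetical stand-in for re-running value iteration at a given weight.
eval_weights = [-0.23, -0.15, -0.37, -0.42, -0.60]   # illustrative held-out points
for w in eval_weights:
    V_true = solve_gridworld(w)                      # ground truth from retraining
    V_pred, sigma = gp.predict(np.array([[w]]), return_std=True)
    mse = np.mean((V_pred.ravel() - V_true.ravel()) ** 2)
    print(f"w = {w:+.2f}   MSE = {mse:.4f}   median sigma = {np.median(sigma):.4f}")
```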

The results clearly show that, in both interpolation and extrapolation, the GP is able to recover the value functions.

Objectworld

Objectworld is an extension of gridworld that features random objects placed in the grid (Figure 2(a)). Each object is assigned a random outer and inner color (out of C colors), and the state vector is composed of the Euclidean distance to the nearest object with each specific inner or outer color. The true reward is positive in states that are both within 3 cells of outer color 1 and within 2 cells of outer color 2, negative in states within 3 cells of outer color 1 only, and zero otherwise. Inner colors and all other outer colors are distractors. In the given example, we use two colors, blue and red. Fifteen objects are placed randomly within the 10 × 10 grid with randomly chosen inner and outer colors. The positive reward is varied from 0.5 to 1, with the value functions at 0.6, 0.7 and 0.8 being predicted.
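For concreteness, here is a sketch of how those distance features could be computed; this is my reading of the standard Objectworld construction, with hypothetical helper and variable names, not code from the paper:

```python
import numpy as np

def objectworld_features(cell, objects, n_colors):
    """Feature vector for one grid cell: the Euclidean distance to the
    nearest object of each inner color and of each outer color.

    cell    : (row, col) of the state
    objects : list of (row, col, inner_color, outer_color) tuples
    """
    feats = np.full(2 * n_colors, np.inf)   # stays inf if no object of that color exists
    for (r, c, inner, outer) in objects:
        d = np.hypot(cell[0] - r, cell[1] - c)
        feats[inner] = min(feats[inner], d)                       # nearest object with this inner color
        feats[n_colors + outer] = min(feats[n_colors + outer], d) # nearest object with this outer color
    return feats
```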

Fig. 2. (a) Objectworld with 15 randomly placed objects, with blue and red inner and outer colors chosen randomly; white represents positive reward, black negative reward and grey zero reward. (b) Actual value function for positive reward (0.8). (c) Predicted value function.

The interpolation is not as accurate as in gridworld, due to the nonlinearity of the reward with respect to the states, but the GP can still recover values close to the actual ones, especially in the positive reward region.

Pendulum

The pendulum environment is a well-known problem in the control literature: a pendulum starts from a random orientation, and the goal is to keep it upright while applying the minimum amount of force.

The state vector is composed of the cosine and sine of the pendulum angle and the derivative of the angle. The action is the joint effort, discretized into 5 actions linearly spaced within the [−2, 2] range. The reward function is

R(θ, θ̇, a) = −(w1·θ² + w2·θ̇² + w3·a²),

where w1, w2 and w3 are the reward weights for the angle θ, the angular velocity θ̇ and the action a, respectively.
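As a hedged sketch of how such a weighted reward could be wired up, here is a wrapper around Gym's classic Pendulum-v0 (old 4-tuple step API; newer gym/gymnasium versions differ) that discretizes the torque into 5 actions and recomputes the reward as a weighted quadratic cost. The wrapper and its names are mine, and the squared-cost form mirrors the standard Gym pendulum cost, which is an assumption rather than something stated in this post:

```python
import numpy as np
import gym

class WeightedDiscretePendulum(gym.Wrapper):
    """Pendulum-v0 with 5 discrete torques and the weighted cost
    -(w1 * theta^2 + w2 * theta_dot^2 + w3 * a^2)."""

    def __init__(self, w1=1.0, w2=0.1, w3=0.001, n_actions=5):
        super().__init__(gym.make("Pendulum-v0"))
        self.w = (w1, w2, w3)
        self.torques = np.linspace(-2.0, 2.0, n_actions)
        self.action_space = gym.spaces.Discrete(n_actions)

    def step(self, action_index):
        torque = self.torques[action_index]
        obs, _, done, info = self.env.step(np.array([torque]))
        cos_th, sin_th, th_dot = obs
        theta = np.arctan2(sin_th, cos_th)   # recover the angle from the state
        w1, w2, w3 = self.w
        reward = -(w1 * theta ** 2 + w2 * th_dot ** 2 + w3 * torque ** 2)
        return obs, reward, done, info
```

A discrete-action agent such as a DQN can then treat this as an ordinary environment, with one training run per weight setting, e.g. WeightedDiscretePendulum(w3=0.0001).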

The pendulum environment is solved using the DQN approach for various w3 = {0.1, 0.01, 0.001, 0.0001}, with the evaluation performed at w3 = 0.001. The following boxplot shows the difference in values for 5 evaluation episodes.

Utilizing a DQN provides no guarantees that the states seen during testing have been visited during training, which can lead to out-of-distribution states.

The boxplots show that the GP is able to recover values close to the actual values for the majority of the episodes in this continuous-state domain (zero means no difference, and a value greater than 1 means the predicted value fails to recover the actual value at all).

Example code is available at https://github.com/shunyo/predicting-optimal-value-functions
