Chapter 2 - Multi-armed Bandits
Figures 2.1 - 2.2 Epsilon Greedy Bandits
Recreate the following experiments using the CLI command below:
python run.py -m run.steps=1000 run.n_runs=2000 +bandit.epsilon=0,0.01,0.1 +bandit.random_argmax=true experiment.tag=fig2.2 experiment.upload=true
Figure 2.1 (Sutton & Barto): An example bandit problem from the 10-armed testbed. The true value q*(a) of each of the ten actions was selected according to a normal distribution with mean zero and unit variance, and then the actual rewards were selected according to a mean q*(a), unit-variance normal distribution, as suggested by these gray distributions.
Figure 2.1 (rlbook): The testbed used for this experiment uses normal distributions with unit variance and means recreated to match the Sutton & Barto example. Also provided are the actions and rewards across steps for a single run; notice how exploration increases with epsilon. Link to wandb artifact.
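For reference, a minimal sketch of how such a 10-armed testbed can be sampled with NumPy; the names q_star and reward are illustrative and not the rlbook API:

import numpy as np

rng = np.random.default_rng(0)

# True action values q*(a), drawn once from N(0, 1) for a 10-armed testbed.
q_star = rng.normal(loc=0.0, scale=1.0, size=10)

def reward(action: int) -> float:
    # Rewards are sampled from N(q*(a), 1) for the chosen action.
    return float(rng.normal(loc=q_star[action], scale=1.0))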
Figure 2.2 (Sutton & Barto): Average performance of epsilon-greedy action-value methods on the 10-armed testbed. These data are averages over 2000 runs with different bandit problems. All methods used sample averages as their action-value estimates.
Figure 2.2 (rlbook): The +bandit.random_argmax=true flag was used to switch to an argmax implementation that breaks ties randomly rather than taking the first occurrence, as the default NumPy argmax does, in order to better align with the original example.
Link to wandb artifact.
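The tie-breaking behavior and the epsilon-greedy selection it feeds can be sketched as follows, assuming NumPy; random_argmax and epsilon_greedy_action are illustrative names rather than the rlbook implementation:

import numpy as np

rng = np.random.default_rng(0)

def random_argmax(values: np.ndarray) -> int:
    # Break ties uniformly at random; np.argmax alone always returns the first maximal index.
    ties = np.flatnonzero(values == values.max())
    return int(rng.choice(ties))

def epsilon_greedy_action(Q: np.ndarray, epsilon: float) -> int:
    # With probability epsilon explore uniformly; otherwise exploit via random_argmax.
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return random_argmax(Q)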
Figure 2.3 Optimistic Initial Q Estimates
Recreate the following experiments using the CLI commands below:
python run.py -m run.steps=1000 run.n_runs=2000 +bandit.epsilon=0.1 +bandit.random_argmax=true bandit.alpha=0.1 bandit.Q_init=0 experiment.tag=fig2.3 experiment.upload=true
python run.py -m run.steps=1000 run.n_runs=2000 +bandit.epsilon=0 +bandit.random_argmax=true bandit.alpha=0.1 bandit.Q_init=5 experiment.tag=fig2.3 experiment.upload=true
Figure 2.3 (Sutton & Barto): The effect of optimistic initial action-value estimates on the 10-armed testbed. Both methods used a constant step-size parameter, alpha=0.1.
Figure 2.3 (rlbook): The +bandit.random_argmax=true flag was used to switch to an argmax implementation that breaks ties randomly rather than taking the first occurrence, as the default NumPy argmax does, in order to better align with the original example.
Link to wandb artifact
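As a rough sketch of the constant step-size update behind these runs (the update function and variable names are illustrative, not the rlbook API):

import numpy as np

k = 10
alpha = 0.1              # constant step size (bandit.alpha=0.1)
Q = np.full(k, 5.0)      # optimistic initial estimates (bandit.Q_init=5)

def update(Q: np.ndarray, action: int, reward: float, alpha: float = 0.1) -> None:
    # Constant step-size update: Q(a) <- Q(a) + alpha * (R - Q(a)).
    Q[action] += alpha * (reward - Q[action])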
Figure 2.4 UCB Bandits
Recreate the following experiments using the CLI commands below:
python run.py run.steps=1000 run.n_runs=2000 experiment.tag=fig2.4 experiment.upload=true bandit._target_=rlbook.bandits.algorithms.UCB +bandit.c=2
python run.py run.steps=1000 run.n_runs=2000 experiment.tag=fig2.4 experiment.upload=true bandit._target_=rlbook.bandits.algorithms.EpsilonGreedy +bandit.epsilon=0.1
Figure 2.4 (Sutton & Barto): Average performance of UCB action selection on the 10-armed testbed. As shown, UCB generally performs better than epsilon-greedy action selection, except in the first k steps, when it selects randomly among the as-yet-untried actions.
Figure 2.4 (rlbook): rlbook UCB implementation. Na, the array that tracks how many times each action has been chosen, was initialized to 1e-100 instead of 0 to prevent a divide-by-zero error.
Link to wandb artifact
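A minimal sketch of UCB action selection with that 1e-100 initialization, assuming NumPy; ucb_action is an illustrative name, not the rlbook implementation:

import numpy as np

rng = np.random.default_rng(0)

def ucb_action(Q: np.ndarray, Na: np.ndarray, t: int, c: float = 2.0) -> int:
    # Select argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ], breaking ties at random.
    # Seeding Na with 1e-100 instead of 0 gives untried actions an effectively
    # infinite bonus (for t >= 2) rather than raising a divide-by-zero error.
    ucb = Q + c * np.sqrt(np.log(t) / Na)
    ties = np.flatnonzero(ucb == ucb.max())
    return int(rng.choice(ties))

Q = np.zeros(10)
Na = np.full(10, 1e-100)
action = ucb_action(Q, Na, t=1)   # at t=1, ln(1)=0, so selection is uniform among ties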
Figure 2.5 Gradient Bandits
Recreate the following experiment using the CLI command below:
python run.py -m run.steps=1000 run.n_runs=2000 experiment.tag=fig2.5 experiment.upload=true bandit._target_=rlbook.bandits.algorithms.Gradient +bandit.lr=0.1,0.4 +bandit.disable_baseline=false,true
Note that the testbed was modified to increase the means by +4:
testbed:
  _target_: rlbook.bandits.testbeds.NormalTestbed
  expected_values:
    0:
      mean: 4.2
      std: 1
    1:
      mean: 3.2
      std: 1
    2:
      mean: 5.7
      std: 1
    3:
      mean: 4.5
      std: 1
    4:
      mean: 5.5
      std: 1
    5:
      mean: 2.5
      std: 1
    6:
      mean: 3.8
      std: 1
    7:
      mean: 3.0
      std: 1
    8:
      mean: 4.1
      std: 1
    9:
      mean: 3.2
      std: 1
Figure 2.5 (Sutton & Barto): Average performance of the gradient bandit algorithm with and without a reward baseline on the 10-armed testbed when the q*(a) are chosen to be near +4 rather than near zero.
Figure 2.5 (rlbook): Note that the baseline R̄_t did not include R_t, following the intent of equation 2.12, where "R̄_t is the average of the rewards up to but not including time t". This differs from the Sutton and Barto empirical results shown in Figure 2.5, which carry the footnote: "In the empirical results in this chapter, the baseline R̄_t also included R_t." Link to wandb artifact
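A compact sketch of a gradient-bandit update with a baseline that excludes the current reward, assuming NumPy; gradient_step and the surrounding variables are illustrative names, not the rlbook API:

import numpy as np

rng = np.random.default_rng(0)

k = 10
lr = 0.1                 # +bandit.lr=0.1
H = np.zeros(k)          # action preferences
reward_sum = 0.0         # sum of rewards observed before the current step
t = 0

def gradient_step(reward_fn) -> None:
    # One update of H following equation 2.12, with a baseline equal to the
    # average of rewards up to but not including time t (0.0 on the first step).
    global reward_sum, t
    pi = np.exp(H - H.max())
    pi /= pi.sum()                                # softmax over preferences
    a = int(rng.choice(k, p=pi))
    r = reward_fn(a)
    baseline = reward_sum / t if t > 0 else 0.0
    H[:] += lr * (r - baseline) * (np.eye(k)[a] - pi)
    reward_sum += r
    t += 1

# Example call with a stand-in reward function whose mean sits near +4,
# mirroring the shifted testbed above.
gradient_step(lambda a: float(rng.normal(loc=4.0, scale=1.0)))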