The value of an action is its expected reward
Goal: maximize the expected reward
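In symbols, a sketch of the definition above. The notation is an assumption, not taken from these notes: q_*(a) for the true value of action a, A_t and R_t for the action and reward at step t, and Q_t(a) for the sample-average estimate.

```latex
q_*(a) = \mathbb{E}\left[ R_t \mid A_t = a \right],
\qquad
Q_t(a) = \frac{\text{sum of rewards when } a \text{ was taken before } t}
              {\text{number of times } a \text{ was taken before } t}
```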
epsilon = 0 -> purely greedy (no exploration)
epsilon-greedy for balancing exploration and exploitation
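A minimal sketch of epsilon-greedy selection. The function name `epsilon_greedy`, its parameters, and random tie-breaking are illustrative assumptions. With epsilon = 0 it reduces to the purely greedy rule.

```python
import random


def epsilon_greedy(q_estimates, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_estimates))
    # Exploit: choose an action with the highest estimate, ties broken at random.
    best = max(q_estimates)
    return rng.choice([a for a, q in enumerate(q_estimates) if q == best])
```

Usage: `epsilon_greedy([0.1, 0.9, 0.3], 0.0)` always returns index 1, since no exploration step is ever taken.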
upper-confidence-bound (UCB) action selection
how UCB drives exploration
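A sketch of UCB selection, assuming the common rule: pick the action maximizing Q(a) + c * sqrt(ln t / N(a)). The name `ucb_select`, the default c = 2, and trying untried actions first are assumptions made for illustration.

```python
import math


def ucb_select(q_estimates, counts, t, c=2.0):
    """Return the action index maximizing Q(a) + c * sqrt(ln t / N(a))."""
    # An action that was never tried has an unbounded bonus, so try it first.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [
        q + c * math.sqrt(math.log(t) / n)
        for q, n in zip(q_estimates, counts)
    ]
    # Index of the largest score (first one on ties).
    return max(range(len(scores)), key=scores.__getitem__)
```

The square-root term is what drives exploration: it is large for actions with a small count N(a) and shrinks each time the action is taken, while ln t slowly raises the bonus of actions that have been ignored.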
normally distributed with mean zero and standard deviation one.
rewards are sampled from a unit-variance normal with mean
, compare UCB and epsilon-greedy
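A rough sketch of how such a comparison could be run. Everything here is an assumption for illustration: the helper names, 10 arms, true action values drawn from N(0, 1), each reward drawn from a unit-variance normal centered on the chosen arm's true value, sample-average estimates, epsilon = 0.1 and c = 2.

```python
import math
import random


def run_bandit(select, k=10, steps=1000, seed=0):
    """Run one bandit task and return the average reward per step."""
    rng = random.Random(seed)
    # Assumed setup: true action values ~ N(0, 1).
    true_values = [rng.gauss(0.0, 1.0) for _ in range(k)]
    q = [0.0] * k  # sample-average estimates
    n = [0] * k    # how often each action was taken
    total = 0.0
    for t in range(1, steps + 1):
        a = select(q, n, t, rng)
        # Assumed setup: reward ~ N(true value of the chosen arm, 1).
        reward = rng.gauss(true_values[a], 1.0)
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]  # incremental sample average
        total += reward
    return total / steps


def eps_greedy(epsilon):
    """Build an epsilon-greedy selection function."""
    def select(q, n, t, rng):
        if rng.random() < epsilon:
            return rng.randrange(len(q))
        return max(range(len(q)), key=q.__getitem__)
    return select


def ucb(c):
    """Build a UCB selection function with exploration constant c."""
    def select(q, n, t, rng):
        for a, count in enumerate(n):
            if count == 0:
                return a  # try every action once first
        return max(
            range(len(q)),
            key=lambda a: q[a] + c * math.sqrt(math.log(t) / n[a]),
        )
    return select


def compare(runs=200, steps=1000):
    """Average reward per step for each method, averaged over seeded runs."""
    methods = {"epsilon-greedy (0.1)": eps_greedy(0.1), "UCB (c=2)": ucb(2.0)}
    return {
        name: sum(run_bandit(sel, steps=steps, seed=s) for s in range(runs)) / runs
        for name, sel in methods.items()
    }


if __name__ == "__main__":
    for name, avg in compare().items():
        print(f"{name}: {avg:.3f}")
```

Both methods see the same tasks for a given seed, because the true values are the first draws from the seeded generator. The reward noise differs between methods, so average over many runs before reading anything into the gap.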