
Let's say you have the following buttons to press, labeled a - j, that would earn you the following amounts of money:

  • a: \$6
  • b: \$9
  • c: \$15
  • d: \$26
  • e: \$45
  • f: \$78
  • g: \$136
  • h: \$416
  • i: \$728
  • j: \$1274

You have two weeks to earn as much money as possible. You know from the start that the g button nets you \$136. You can press only one button per day, and you do not know how much a button pays until you spend a day pressing it.

What is the optimal strategy to earn the most money over the course of two weeks?

  • A rewrite of your Improving Judgement pill problem. Commented Nov 2, 2023 at 16:34
  • Welcome to Puzzling! A note: what you are asking for is an optimization problem that maximises the expected outcome. (Otherwise the best strategy would simply be to press the j button, even knowing nothing else.) So you want a strategy that gives the best average outcome. That can be approached with algorithms like epsilon-greedy or Thompson sampling, applied to the payouts of the arms; tuning their hyperparameters should give a (nearly) best average outcome. Commented Nov 2, 2023 at 17:48
  • So you know the button which will give you \$136, but all the other buttons are unlabeled, and you only know that they correspond to the other amounts in your list, not which button corresponds to which amount?
    – quarague
    Commented Nov 4, 2023 at 11:01
  • @quarague that should be correct Commented Nov 6, 2023 at 16:57
  • If @quarague's interpretation is correct, your question is currently unclear: "You do not know the amount of money you would earn from the other buttons until you take one day to press one of them" makes it sound like we don't know the information given in the list above. Can I suggest you rephrase the question slightly? For example by adding: "Unfortunately the labels above have been randomly mixed up. You know from the start that the g button would still net you \$136, but not which amounts the other buttons now correspond to."
    – Silverfish
    Commented Nov 11, 2023 at 14:40

3 Answers


This is a multi-armed bandit problem. Based on the nature of the rewards, I think choosing to explore as often as possible is optimal.
The optimal outcome is hitting 1274 every day; regret is the difference between the optimal outcome and the outcome of the chosen path.
After averaging the regret over 1000 runs of the experiment for epsilon values from 0.1 to 1 (epsilon being the fraction of days the algorithm explores rather than exploits its known rewards), my conclusion is that exploring until finding the 1274 button is optimal.

epsilon   avg. regret
0.1       14010
0.2       11949
0.3       10767
0.4       9257
0.5       8391
0.6       7577
0.7       6885
0.8       6079
0.9       5642
1.0       5057
import random
import statistics as sts

REWARDS = [6, 9, 15, 26, 45, 78, 416, 728, 1274]  # unknown rewards (136 is known)
DAYS = 14
OPTIMAL_TOTAL = DAYS * 1274  # always pressing the 1274 button = 17836

def run_experiment(epsilon):
    # One two-week run: with probability epsilon explore an untried button,
    # otherwise exploit the best button found so far.
    unknown = REWARDS.copy()
    random.shuffle(unknown)
    known = [136]
    total = 0
    for _ in range(DAYS):
        if unknown and random.random() < epsilon:
            reward = unknown.pop()   # press a new button and earn its reward
            known.append(reward)
            total += reward
        else:
            total += max(known)      # press the best button found so far
    return OPTIMAL_TOTAL - total     # regret for this run

for eps in [i / 10 for i in range(1, 11)]:
    avg_regret = sts.mean(run_experiment(eps) for _ in range(1000))
    print(eps, round(avg_regret))
  • I'm not convinced this is correct. Even if you find \$728 on Day 1, it will only take you on average 4 more days to find \$1274, leaving you on average 9 days to make an additional \$546/day. Even if you earn absolutely nothing on those 4 days, it's still worth it. Commented Nov 2, 2023 at 18:58
  • The original question asked about a week, a month, and a year. My napkin math told me that the optimal strategy will be different for each timespan, with "always" strategies winning for less than a week and more than two weeks, and "adaptive" strategies winning in between. Commented Nov 6, 2023 at 20:51

You can solve the problem via dynamic programming as follows. Let $B=\{0,\dots,9\}$ be the set of buttons (indexed in increasing order of reward, so button g has index $6$), and let $r_i$ be the reward for button $i\in B$. Let the value function $V(S,d)$ denote the maximum expected value when the subset $S \subseteq B$ is known and $d$ days remain. Then $V(S,d)$ satisfies the recurrence $$V(S,d) = \begin{cases} 0 & \text{if $d=0$} \\ \max\limits_{i \in S} r_i + V(S,d-1) & \text{if $d>0$ and $S=B$} \\ \max\left( \max\limits_{i \in S} r_i + V(S,d-1), \frac{1}{|B \setminus S|} \sum\limits_{i \in B \setminus S} (r_i + V(S\cup\{i\},d-1)) \right) & \text{otherwise} \end{cases}$$ The value of $V(\{6\},14)$ turns out to be $13401.5$, and an optimal strategy is to press button $9$ if $9\in S$, and otherwise to press the button with the largest known reward once $d$ is sufficiently small.
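For the curious, the recurrence can be checked with a short memoized recursion. This is a sketch (the indexing of buttons a-j as $0$-$9$ in increasing reward order, with g as index $6$, follows the notation above); `Fraction` keeps the arithmetic exact:

```python
from fractions import Fraction
from functools import lru_cache

REWARDS = (6, 9, 15, 26, 45, 78, 136, 416, 728, 1274)  # buttons a-j as indices 0-9
ALL = frozenset(range(10))

@lru_cache(maxsize=None)
def V(S, d):
    """Maximum expected total reward when the buttons in S are known and d days remain."""
    if d == 0:
        return Fraction(0)
    # Exploit: press the best known button today.
    exploit = max(REWARDS[i] for i in S) + V(S, d - 1)
    unknown = ALL - S
    if not unknown:  # S = B: nothing left to learn
        return exploit
    # Explore: press an unknown button, chosen uniformly at random.
    explore = sum(REWARDS[i] + V(S | {i}, d - 1) for i in unknown) / len(unknown)
    return max(exploit, explore)

print(float(V(frozenset({6}), 14)))  # 13401.5
```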


For comparison, I computed the expected value if you always explore unless $9\in S$. The DP recursion is simpler: $$V(S,d) = \begin{cases} 0 & \text{if $d=0$} \\ r_9 + V(S,d-1) & \text{if $d>0$ and $9\in S$} \\ \frac{1}{|B \setminus S|} \sum\limits_{i \in B \setminus S} (r_i + V(S\cup\{i\},d-1)) & \text{otherwise} \end{cases}$$ The value of $V(\{6\},14)$ again turns out to be $13401.5$, so that is an alternative optimal strategy.
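That simpler recursion can likewise be sketched in code (same assumed indexing of buttons a-j as $0$-$9$, so j is index $9$):

```python
from fractions import Fraction
from functools import lru_cache

REWARDS = (6, 9, 15, 26, 45, 78, 136, 416, 728, 1274)  # buttons a-j as indices 0-9
ALL = frozenset(range(10))

@lru_cache(maxsize=None)
def W(S, d):
    """Expected total reward when always exploring until button 9 (the 1274) is known."""
    if d == 0:
        return Fraction(0)
    if 9 in S:
        # 1274 has been found: press it every remaining day.
        return REWARDS[9] + W(S, d - 1)
    # Otherwise explore a uniformly random unknown button
    # (the unknown set is nonempty here, since it contains button 9).
    unknown = ALL - S
    return sum(REWARDS[i] + W(S | {i}, d - 1) for i in unknown) / len(unknown)

print(float(W(frozenset({6}), 14)))  # 13401.5
```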

  • can you dumb this down for me? Not sure if I understand it correctly. Commented Nov 8, 2023 at 20:24
  • The idea is that if you are currently in state $(S,d)$ (that is, you know the locations of the rewards in $S$ and there are $d$ days remaining), you want to choose the action that yields the largest expected total value. If $d=0$, you are done and there is no action to take. If $d>0$ and $S=B$, there is no uncertainty, so you choose the largest reward and decrement the number of days remaining. Otherwise, the optimal action is the one that yields the larger expected total reward: press the button with the largest known reward, or explore.
    – RobPratt
    Commented Nov 8, 2023 at 21:33
  • At the beginning, you know the location of reward $136$ (button $6$) and there are $14$ days remaining, so you want to compute $V(\{6\},14)$.
    – RobPratt
    Commented Nov 8, 2023 at 21:40
  • Does this mean that the selected answer is wrong? Commented Nov 9, 2023 at 11:20
  • @DmitryKamenetsky See my updated answer.
    – RobPratt
    Commented Nov 9, 2023 at 15:00

The value, under an optimal subsequent strategy, of testing a new lever is at least as good as that of any particular strategy that tests a new lever on the next step. We will consider the particular strategy of testing new levers until we find 1274. To simplify the calculation, we will count only the profits from the 1274 lever, as a lower bound on the value. With $n$ unknown levers and $5+n$ days remaining (if we have tested every day, the only case we will be concerned with), the expected number of pulls of the 1274 lever under this strategy is $((5+n) + (4+n) + (3+n) + \dots + 7 + 6)/n=(n+11)/2$, for an expected daily value of $1274(n+11)/(2n+10)$. This value is smaller when $n$ is larger, so in the worst case of $n=9$ we still expect to earn at least $910$ per day, more than the second-best lever.

Any strategy that uses a known lever before testing a new lever is equivalent to using the known lever at the end instead, since pulling a known lever provides no information and can therefore be delayed. (In fact, doing so is clearly non-optimal unless we have already found 1274, because at the end we should use the best lever known then, which might be better than the best we have found now.) So we only need to consider strategies that do all their testing before pulling any known levers. Among such strategies, the best one that doesn't test on the next step is simply pulling the best known lever on all remaining days, which can't do better than 728 per day unless we have found 1274. (If we have found 1274, pulling it is obviously best.)

Thus, we can't do better (in expectation) than testing until we find 1274.

If we care to calculate the exact expected value of this strategy, we take the expected value from the 1274 lever, which is $14 \cdot 910 = 12740$, and add half the value of every other unknown lever (since there is a 50% chance of finding each particular lever before 1274, and by linearity of expectation): $12740 + (6+9+15+26+45+78+416+728)/2 = 13401.5$.
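As a sanity check, a quick Monte Carlo simulation of the test-until-1274 strategy (a sketch; note that the known 136 lever is never pulled, which the argument above shows costs nothing) lands close to that value:

```python
import random

random.seed(1)
OTHERS = [6, 9, 15, 26, 45, 78, 416, 728]  # the unknown rewards besides 1274
DAYS = 14

def simulate():
    # Random order in which the 9 unknown levers would be discovered.
    order = OTHERS + [1274]
    random.shuffle(order)
    total = 0
    for day, reward in enumerate(order, start=1):
        total += reward
        if reward == 1274:
            # Found it: pull the 1274 lever on all remaining days.
            total += (DAYS - day) * 1274
            break
    return total

avg = sum(simulate() for _ in range(200_000)) / 200_000
print(round(avg, 1))  # close to the exact value 13401.5
```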
