
Let's say you have the following buttons to press, labeled a - j, that would earn you the following amounts of money:

  • a: \$6
  • b: \$9
  • c: \$15
  • d: \$26
  • e: \$45
  • f: \$78
  • g: \$136
  • h: \$416
  • i: \$728
  • j: \$1274

You have two weeks to earn as much money as possible. You know from the start that the g button nets you \$136. You can press only one button per day, and you do not know how much a button pays until you spend a day pressing it.

What is the optimal strategy to earn the most money over the course of two weeks?

  • A rewrite of your Improving Judgement pill problem. Commented Nov 2, 2023 at 16:34
  • Welcome to Puzzling! A note: what you are asking for is an optimization problem that maximises the expected outcome. (Otherwise the best strategy would simply be to press the j button, even knowing nothing else.) So you want a strategy that gives the best average outcome. That can be approached with algorithms like epsilon-greedy or Thompson sampling, applied to the payouts of the arms; tuning their hyperparameters should give a (nearly) best average outcome. Commented Nov 2, 2023 at 17:48
  • So you know the button which will give you \$136, but all the other buttons are unlabeled, and you only know that they correspond to the other amounts in your list, not which button corresponds to which amount?
    – quarague
    Commented Nov 4, 2023 at 11:01
  • @quarague that should be correct Commented Nov 6, 2023 at 16:57
  • If @quarague's interpretation is correct, your question is currently unclear: "You do not know the amount of money you would earn from the other buttons until you take one day to press one of them" makes it sound like we don't know the information given in the list above. Can I suggest you rephrase the question slightly? For example by adding: "Unfortunately the labels above have been randomly mixed up. You know from the start that the g button would still net you \$136, but not which amounts the other buttons now correspond to."
    – Silverfish
    Commented Nov 11, 2023 at 14:40

3 Answers


This is a multi-armed bandit problem. Based on the nature of the rewards, I think choosing to explore as often as possible is optimal.
The optimal outcome is hitting 1274 every day; regret is the difference between the optimal outcome and the outcome of the chosen path.
After averaging the regret over 1000 runs of the experiment for epsilon values from 0.1 to 1 (epsilon being the fraction of days the algorithm explores rather than exploits its known rewards), my conclusion is that exploring until finding the 1274 button is optimal.

epsilon   avg. regret
0.1       14010
0.2       11949
0.3       10767
0.4       9257
0.5       8391
0.6       7577
0.7       6885
0.8       6079
0.9       5642
1.0       5057
import random
import statistics as sts

REWARDS = [6, 9, 15, 26, 45, 78, 416, 728, 1274]  # unknown rewards (136 is known)
DAYS = 14
OPTIMAL_TOTAL = DAYS * 1274  # always pressing the 1274 button = 17836

def run_experiment(epsilon):
    # One two-week run: with probability epsilon explore an untried button,
    # otherwise exploit the best button found so far.
    unknown = REWARDS.copy()
    random.shuffle(unknown)
    known = [136]
    total = 0
    for _ in range(DAYS):
        if unknown and random.random() < epsilon:
            reward = unknown.pop()   # press a new button and earn its reward
            known.append(reward)
            total += reward
        else:
            total += max(known)      # press the best button found so far
    return OPTIMAL_TOTAL - total     # regret for this run

for eps in [i / 10 for i in range(1, 11)]:
    avg_regret = sts.mean(run_experiment(eps) for _ in range(1000))
    print(eps, round(avg_regret))
  • I'm not convinced this is correct. Even if you find \$728 on Day 1, it will only take you on average 4 more days to find \$1274, leaving you on average 9 days to make an additional \$546/day. Even if you earn absolutely nothing on those 4 days, it's still worth it. Commented Nov 2, 2023 at 18:58
  • The original question asked about a week, a month, and a year. My napkin math told me that the optimal strategy will be different for each timespan, with "always" strategies winning for less than a week and more than two weeks, and "adaptive" strategies winning in between. Commented Nov 6, 2023 at 20:51

You can solve the problem via dynamic programming as follows. Let $B=\{0,\dots,9\}$ be the set of buttons (indexed in increasing order of reward, so button g has index $6$), and let $r_i$ be the reward for button $i\in B$. Let the value function $V(S,d)$ denote the maximum expected value when the subset $S \subseteq B$ is known and $d$ days remain. Then $V(S,d)$ satisfies the recurrence $$V(S,d) = \begin{cases} 0 & \text{if $d=0$} \\ \max\limits_{i \in S} r_i + V(S,d-1) & \text{if $d>0$ and $S=B$} \\ \max\left( \max\limits_{i \in S} r_i + V(S,d-1), \frac{1}{|B \setminus S|} \sum\limits_{i \in B \setminus S} (r_i + V(S\cup\{i\},d-1)) \right) & \text{otherwise} \end{cases}$$ The value of $V(\{6\},14)$ turns out to be $13401.5$, and an optimal strategy is to press button $9$ if $9\in S$, and otherwise to press the button with the largest known reward once $d$ is sufficiently small.
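For the curious, the recurrence can be checked with a short memoized recursion. This is a sketch (the indexing of buttons a-j as $0$-$9$ in increasing reward order, with g as index $6$, follows the notation above); `Fraction` keeps the arithmetic exact:

```python
from fractions import Fraction
from functools import lru_cache

REWARDS = (6, 9, 15, 26, 45, 78, 136, 416, 728, 1274)  # buttons a-j as indices 0-9
ALL = frozenset(range(10))

@lru_cache(maxsize=None)
def V(S, d):
    """Maximum expected total reward when the buttons in S are known and d days remain."""
    if d == 0:
        return Fraction(0)
    # Exploit: press the best known button today.
    exploit = max(REWARDS[i] for i in S) + V(S, d - 1)
    unknown = ALL - S
    if not unknown:  # S = B: nothing left to learn
        return exploit
    # Explore: press an unknown button, chosen uniformly at random.
    explore = sum(REWARDS[i] + V(S | {i}, d - 1) for i in unknown) / len(unknown)
    return max(exploit, explore)

print(float(V(frozenset({6}), 14)))  # 13401.5
```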


For comparison, I computed the expected value if you always explore unless $9\in S$. The DP recursion is simpler: $$V(S,d) = \begin{cases} 0 & \text{if $d=0$} \\ r_9 + V(S,d-1) & \text{if $d>0$ and $9\in S$} \\ \frac{1}{|B \setminus S|} \sum\limits_{i \in B \setminus S} (r_i + V(S\cup\{i\},d-1)) & \text{otherwise} \end{cases}$$ The value of $V(\{6\},14)$ again turns out to be $13401.5$, so that is an alternative optimal strategy.
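That simpler recursion can likewise be sketched in code (same assumed indexing of buttons a-j as $0$-$9$, so j is index $9$):

```python
from fractions import Fraction
from functools import lru_cache

REWARDS = (6, 9, 15, 26, 45, 78, 136, 416, 728, 1274)  # buttons a-j as indices 0-9
ALL = frozenset(range(10))

@lru_cache(maxsize=None)
def W(S, d):
    """Expected total reward when always exploring until button 9 (the 1274) is known."""
    if d == 0:
        return Fraction(0)
    if 9 in S:
        # 1274 has been found: press it every remaining day.
        return REWARDS[9] + W(S, d - 1)
    # Otherwise explore a uniformly random unknown button
    # (the unknown set is nonempty here, since it contains button 9).
    unknown = ALL - S
    return sum(REWARDS[i] + W(S | {i}, d - 1) for i in unknown) / len(unknown)

print(float(W(frozenset({6}), 14)))  # 13401.5
```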

  • can you dumb this down for me? Not sure if I understand it correctly. Commented Nov 8, 2023 at 20:24
  • The idea is that if you are currently in state $(S,d)$ (that is, you know the locations of the rewards in $S$ and there are $d$ days remaining), you want to choose the action that yields the largest expected total value. If $d=0$, you are done and there is no action to take. If $d>0$ and $S=B$, there is no uncertainty, so you choose the largest reward and decrement the number of days remaining. Otherwise, the optimal action is the one that yields the larger expected total reward: press the button with the largest known reward, or explore.
    – RobPratt
    Commented Nov 8, 2023 at 21:33
  • At the beginning, you know the location of reward $136$ (button $6$) and there are $14$ days remaining, so you want to compute $V(\{6\},14)$.
    – RobPratt
    Commented Nov 8, 2023 at 21:40
  • Does this mean that the selected answer is wrong? Commented Nov 9, 2023 at 11:20
  • @DmitryKamenetsky See my updated answer.
    – RobPratt
    Commented Nov 9, 2023 at 15:00

The value, under an optimal subsequent strategy, of testing a new lever is at least as good as that of any particular strategy that tests a new lever on the next step. We will consider the particular strategy of testing new levers until we find 1274. To simplify the calculation, we will count only the profits from the 1274 lever, as a lower bound on the value. With $n$ unknown levers and $5+n$ days remaining (if we have tested every day, the only case we will be concerned with), the expected number of pulls of the 1274 lever under this strategy is $((5+n) + (4+n) + (3+n) + \dots + 7 + 6)/n=(n+11)/2$, for an expected daily value of $1274(n+11)/(2n+10)$. This value is smaller when $n$ is larger, so in the worst case of $n=9$ we still expect to earn at least $910$ per day, more than the second-best lever.

Any strategy that uses a known lever before testing a new lever is equivalent to using the known lever at the end instead, since pulling a known lever provides no information and can therefore be delayed. (In fact, doing so is clearly non-optimal unless we have already found 1274, because at the end we should use the best lever known then, which might be better than the best we have found now.) So we only need to consider strategies that do all their testing before pulling any known levers. Among such strategies, the best one that doesn't test on the next step is simply pulling the best known lever on all remaining days, which can't do better than 728 per day unless we have found 1274. (If we have found 1274, pulling it is obviously best.)

Thus, we can't do better (in expectation) than testing until we find 1274.

If we care to calculate the exact expected value of this strategy, we take the expected value from the 1274 lever, which is $14 \cdot 910 = 12740$, and add half the value of every other unknown lever (since there is a 50% chance of finding each particular lever before 1274, and by linearity of expectation): $12740 + (6+9+15+26+45+78+416+728)/2 = 13401.5$.
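As a sanity check, a quick Monte Carlo simulation of the test-until-1274 strategy (a sketch; note that the known 136 lever is never pulled, which the argument above shows costs nothing) lands close to that value:

```python
import random

random.seed(1)
OTHERS = [6, 9, 15, 26, 45, 78, 416, 728]  # the unknown rewards besides 1274
DAYS = 14

def simulate():
    # Random order in which the 9 unknown levers would be discovered.
    order = OTHERS + [1274]
    random.shuffle(order)
    total = 0
    for day, reward in enumerate(order, start=1):
        total += reward
        if reward == 1274:
            # Found it: pull the 1274 lever on all remaining days.
            total += (DAYS - day) * 1274
            break
    return total

avg = sum(simulate() for _ in range(200_000)) / 200_000
print(round(avg, 1))  # close to the exact value 13401.5
```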
