ML Quest

The warehouse robot works, but it always takes the same path, even when better routes might exist. The team realizes the agent needs a smarter action selection strategy. Enter the multi-armed bandit: imagine a row of 5 slot machines, each with a different hidden payout probability. You must figure out which arm is best by pulling them, but every pull on a bad arm is wasted money. Pull randomly and you explore but earn less. Always pull your current best and you might miss the true winner. The epsilon-greedy strategy balances this exploration-exploitation dilemma, which is fundamental to RL: with probability epsilon it pulls a random arm, and otherwise it pulls the arm with the highest estimated payout.
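The strategy described above can be sketched in a few lines of Python. This is a minimal illustration, not the exercise's reference solution: the function name `run_epsilon_greedy`, the payout probabilities in `PAYOUT_PROBS`, the fixed `epsilon=0.1`, and the 0/1 reward model are all assumptions chosen for the example.

```python
import random

def run_epsilon_greedy(payout_probs, steps=2000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit with a fixed-epsilon greedy policy."""
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    counts = [0] * n_arms        # number of pulls per arm
    estimates = [0.0] * n_arms   # running mean reward per arm
    rewards = []

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick any arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate
            # (ties resolve to the lowest index).
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Assumed reward model: 1 with the arm's hidden probability, else 0.
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0

        # Incremental mean update: new = old + (reward - old) / n.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        rewards.append(reward)

    return estimates, counts, rewards

# Hypothetical payout probabilities for the 5 machines (not from the exercise).
PAYOUT_PROBS = [0.10, 0.25, 0.40, 0.55, 0.80]

estimates, counts, rewards = run_epsilon_greedy(PAYOUT_PROBS)
print("estimated payouts:", [round(e, 2) for e in estimates])
print("pulls per arm:   ", counts)
```

The incremental mean update avoids storing every reward per arm. With a fixed epsilon the agent keeps spending a constant fraction of pulls on exploration, even after its estimates have settled.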

~20 min · scenario
Goals: 4 tests
should collect at least 1000 rewards
epsilon should decay from start to a lower end value
average reward should improve over time
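One way to work toward these goals is to decay epsilon over the run, so the agent explores heavily at first and exploits more later. The sketch below assumes a linear schedule from `eps_start=1.0` to `eps_end=0.05`, 1000 pulls, and made-up payout probabilities. The exercise may expect a different schedule or different names, so treat `decayed_epsilon` and `run_decaying_bandit` as hypothetical.

```python
import random

def decayed_epsilon(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Linearly interpolate epsilon from eps_start down to eps_end."""
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return eps_start + frac * (eps_end - eps_start)

def run_decaying_bandit(payout_probs, steps=1000, seed=0):
    """Epsilon-greedy on a Bernoulli bandit with a decaying epsilon."""
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    rewards, epsilons = [], []

    for step in range(steps):
        eps = decayed_epsilon(step, steps)
        if rng.random() < eps:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        rewards.append(reward)
        epsilons.append(eps)

    return rewards, epsilons

# Hypothetical payout probabilities (not from the exercise).
rewards, epsilons = run_decaying_bandit([0.10, 0.25, 0.40, 0.55, 0.80])

window = 200  # compare the first and last 200 pulls
early_avg = sum(rewards[:window]) / window
late_avg = sum(rewards[-window:]) / window
print("rewards collected:", len(rewards))
print("epsilon start/end:", epsilons[0], round(epsilons[-1], 3))
print("early vs late average reward:", early_avg, late_avg)
```

Comparing an early window against a late window is only one possible way to check that the average reward improves. Because rewards are random, the size of the gap depends on the seed and the payout probabilities.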