Universal Learning for Decision-Making | Moise Blanchard
We provide general-use decision-making algorithms under provably minimal assumptions on the data, using the universal learning framework. Classically, learning guarantees require two types of assumptions: (1) restrictions on the target policies to be learned and (2) assumptions on the data-generating process. Instead, we show that we can provide consistent algorithms, with vanishing regret compared to the best policy in hindsight, (1) irrespective of the optimal policy, a property known as universal consistency, and (2) well beyond standard i.i.d. or stationary assumptions on the data. We present our results for the classical online regression problem as well as for the contextual bandit problem, in which the learner's rewards depend on their selected actions and an observable context. This generalizes the standard multi-armed bandit to the case where side information is available, e.g., patients' records or customers' histories, which allows for personalized treatment. More precisely, we give necessary and sufficient conditions on the context-generating process for universal consistency to be possible. Surprisingly, for finite action spaces, the universally learnable processes are the same for contextual bandits as for the supervised learning setting, suggesting that going from full feedback (supervised learning) to partial feedback (contextual bandits) comes at no extra cost in terms of learnability. We then show that there always exists an algorithm guaranteeing universal consistency whenever this is achievable. In particular, such an algorithm is universally consistent under provably minimal assumptions: if it fails to be universally consistent for some context-generating process, then no other algorithm would succeed either. In the case of finite action spaces, this algorithm balances a fine trade-off between generalization (similar to structural risk minimization) and personalization (tailoring actions to specific contexts).
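To make the notion of universal consistency above concrete, here is an illustrative statement in the contextual bandit setting; the notation is chosen for exposition and may differ from the paper's. At each round t the learner observes a context X_t, selects an action a_t, and receives a reward r_t(a_t). Universal consistency then asks that, for every admissible context-generating process, the learner's average regret against any fixed comparator policy vanishes almost surely:
\[
  \limsup_{T \to \infty} \; \frac{1}{T} \sum_{t=1}^{T} \Bigl( r_t\bigl(\pi(X_t)\bigr) - r_t(a_t) \Bigr) \;\le\; 0
  \qquad \text{almost surely, for every measurable policy } \pi .
\]
In words, no fixed mapping from contexts to actions asymptotically outperforms the learner on average, without any restriction on which policy achieves the best rewards in hindsight.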