# Reinforcement learning

Reinforcement learning

Inspired by related psychological theory, in computer science, reinforcement learning is a sub-area of machine learning concerned with how an "agent" ought to take "actions" in an "environment" so as to maximize some notion of long-term "reward". Reinforcement learning algorithms attempt to find a "policy" that maps "states" of the world to the actions the agent ought to take in those states. In economics and game theory, reinforcement learning is considered as a boundedly rational interpretation of how equilibrium may arise.

The environment is typically formulated as a finite-state Markov decision process (MDP), and reinforcement learning algorithms for this context are highly related to dynamic programming techniques. State transition probabilities and reward probabilities in the MDP are typically stochastic but stationary over the course of the problem.

Reinforcement learning differs from the supervised learning problem in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been mostly studied through the multi-armed bandit problem.

Formally, the basic reinforcement learning model consists of:

# a set of environment states $S$;
# a set of actions $A$; and
# a set of scalar "rewards" in $Bbb\left\{R\right\}$.

At each time $t$, the agent perceives its state $s_t in S$ and the set of possible actions $A\left(s_t\right)$. It chooses an action $a in A\left(s_t\right)$ and receives from the environment the new state $s_\left\{t+1\right\}$ and a reward $r_\left\{t+1\right\}$. Based on these interactions, the reinforcement learning agent must develop a policy $pi:S ightarrow A$ which maximizes the quantity $R=r_0 + r_1 + cdots + r_n$ for MDPs which have a terminal state, or the quantity $R = sum_t gamma^t r_t$ for MDPs without terminal states (where $0leqgammaleq1$ is some "future reward" discounting factor).

Thus, reinforcement learning is particularly well suited to problems which include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon and chess (Sutton 1998, Chapter 11).

Algorithms

After we have defined an appropriate return function to be maximized, we need to specify the algorithm that will be used to find the policy with the maximum return.

The naive brute force approach entails the following two steps: a) For each possible policy, sample returns while following it. b) Choose the policy with the largest expected return. One problem with this is that the number of policies can be extremely large, or even infinite. Another is that returns might be stochastic, in which case a large number of samples will be required to accurately estimate the return of each policy. These problems can be ameliorated if we assume some structure and perhaps allow samples generated from one policy to influence the estimates made for another. The two main approaches for achieving this are value function estimation and direct policy optimization.

Value function approaches do this by only maintaining a set of estimates of expected returns for one policy $pi$ (usually either the current or the optimal one).In such approaches one attempts to estimate either the expected return starting from state $s$ and following $pi$ thereafter,

:$V\left(s\right) = E \left[R|s,pi\right]$,

or the expected return when taking action $a$ in state $s$ and following $pi$; thereafter,

:$Q\left(s,a\right) = E \left[R|s,pi,a\right]$

If someone gives us $Q$ for the optimal policy, we can always choose optimal actions by simply choosing the action with the highest value at each state. In order to do this using $V$, we must either have a model of the environment, in the form of probabilities $P\left(s\text{'}|s,a\right)$, which allow us to calculate $Q$ simply through

:$Q\left(s,a\right) = sum_\left\{s\text{'}\right\} V\left(s\text{'}\right)P\left(s\text{'}|s,a\right),$

or we can employ so-called Actor-Critic methods, in which the model is split into two parts: the critic, which maintains the state value estimate $V$, and the actor, which is responsible for choosing the appropriate actions at each state.

Given a fixed policy $pi$, Estimating $E \left[R|cdot\right]$ for $gamma=0$ is trivial, as one only has to average the immediate rewards. The most obvious way to do this for $gamma>0$ is to average the total return after each state. However this type of Monte Carlo sampling requires the MDP to terminate.

Thus carrying out this estimation for $gamma>0$ in the general does not seem obvious. In fact, it is quite simple once one realises that the expectation of $R$ forms a recursive Bellman equation:

$E \left[R|s_t\right] = r_t + gamma E \left[R|s_\left\{t+1\right\}\right]$

By replacing those expectations with our estimates, $V$, and performing gradient descent with a squared error cost function, we obtain the temporal difference learning algorithm TD(0). In the simplest case, the set of states and actions are both discrete and we maintain tabular estimates for each state. Similar state-action pair methods are Adaptive Heuristic Critic(AHC), SARSA and Q-Learning. All methods feature extensions whereby some approximating architecture is used, though in some cases convergence is not guaranteed. The estimates are usually updated with some form of gradient descent, though there have been recent developments with least squares methods for the linear approximation case.

The above methods not only all converge to the correct estimates for a fixed policy, but can also be used to find the optimal policy. This is usually done by following a policy π that is somehow derived from the current value estimates, i.e. by choosing the action with the highest evaluation most of the time, while still occasionally taking random actions in order to explore the space. Proofs for convergence to the optimal policy also exist for the algorithms mentioned above, under certain conditions. However, all those proofs only demonstrate asymptotic convergence and little is known theoretically about the behaviour of RL algorithms in the small-sample case, apart from within very restricted settings.

An alternative method to find the optimal policy is to search directly in policy space. Policy space methods define the policy as a parameterised function $pi\left(s, heta\right)$ with parameters $heta$. Commonly, a gradient method is employed to adjust the parameters. However, the application of gradient methods is not trivial, since no gradient information is assumed. Rather, the gradient itself must be estimated from noisy samples of the return. Since this greatly increases the computational cost, it can be advantageous to use a more powerful gradient method than steepest gradient descent. Policy space gradient methods have received a lot of attention in the last 5 years and have now reached a relatively mature stage, but they remain an active field. There are many other approaches, such as simulated annealing, that can be taken to explore the policy space. Other direct optimization techniques, such as evolutionary computation are used in evolutionary robotics.

Current research

Current research topics include: Alternative representations (such as the Predictive State Representation approach), gradient descent in policy space, small-sample convergence results, algorithms and convergence results for partially observable MDPs, modular and hierarchical reinforcement learning.Recently, reinforcement learning has been used in the domain of Psychology to explain human learning and performance. In particular, it has been used in cognitive models that simulate human performance during problem solving and/or skill acquisition (e.g., Sun, Merril, & Peterson, 2001; Sun, Slusarz, & Terry, 2005; Gray, Sims, Fu, & Schoelles, 2006; Fu & Anderson, 2006). It was also used to propose a model of the human error-processing system (Holroyd & Coles, 2002). Multiagent or Distributed Reinforcement Learning is also a topic of interest in current research in this field.

* Temporal difference learning
* Q learning
* SARSA
* Fictitious play
* Optimal control

References

* cite journal
last = Kaelbling | first = Leslie P. | authorlink = Leslie P. Kaelbling
coauthors = Michael L. Littman; Andrew W. Moore
title = Reinforcement Learning: A Survey
journal = Journal of Artificial Intelligence Research
volume = 4
pages = 237&ndash;285
publisher =
date = 1996
url = http://www.cs.washington.edu/research/jair/abstracts/kaelbling96a.html

* cite book
last = Sutton | first = Richard S. | authorlink = Richard S. Sutton
coauthors = Andrew G. Barto
title = Reinforcement Learning: An Introduction
publisher = MIT Press
date = 1998
isbn = 0-262-19398-1
pages =
url = http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html
id = refSutton1998

* cite book
last = Bertsekas | first = Dimitri P. | authorlink = Dimitri P. Bertsekas
coauthors = John Tsitsiklis
title = Neuro-Dynamic Programming
publisher = Athena Scientific
date = 1996
location = Nashua, NH
isbn = 1-886529-10-8
url = http://www.athenasc.com/ndpbook.html

* Ron Sun, E. Merrill, and T. Peterson, From implicit skills to explicit knowledge: A bottom-up model of skill learning. Cognitive Science, Vol.25, No.2, pp.203-244. 2001. http://www.cogsci.rpi.edu/~rsun/sun.cs01.pdf

* Ron Sun, P. Slusarz, and C. Terry, The interaction of the explicit and the implicit in skill learning: A dual-process approach . Psychological Review, Vol.112, No.1, pp.159-192. 2005. http://www.cogsci.rpi.edu/~rsun/sun-pr2005-f.pdf

* cite conference
last = Peters | first = Jan
coauthors = Sethu Vijayakumar; Stefan Schaal
title = Reinforcement Learning for Humanoid Robotics
booktitle = IEEE-RAS International Conference on Humanoid Robots
year = 2003
url = http://www-clmc.usc.edu/publications/p/peters-ICHR2003.pdf

* cite journal
last = Gray | first = Wayne D. | authorlink = Wayne D. Gray
coauthors = Chris R. Sims; Wai-Tat Fu; Michael J. Schoelles
title = The Soft Constraints Hypothesis: A Rational Analysis Approach to Resource Allocation for Interactive Behavior
journal = Psychological Review
volume = 113
issue = 3
pages = 461–482
publisher =
date = 2006
url = http://www.rpi.edu/~grayw/pubs/papers/GSFS06_PsycRvw/GSFS06_PsycRvw.htm
doi = 10.1037/0033-295X.113.3.461

* cite journal
last = Fu | first = Wai-Tat | authorlink = Wai-Tat Fu
coauthors = John R. Anderson
title = From Recurrent Choice to Skill Learning: A Reinforcement-Learning Model
journal = Journal of Experimental Psychology: General
volume = 135
issue = 2
pages = 184 –206
publisher =
date = 2006
url = http://www.humanfactors.uiuc.edu/Reports&PapersPDFs/JournalPubs/Fu&Anderson06-JEPG(published)(ReinforcementLearning).pdf
doi = 10.1037/0096-3445.135.2.184

* [http://www-anw.cs.umass.edu/rlr/ Reinforcement Learning Repository]
* [http://rlai.cs.ualberta.ca/ Reinforcement Learning and Artificial Intelligence]
* [http://glue.rl-community.org RL-Glue]
* [http://www.dia.fi.upm.es/~jamartin/download.htm Software Tools for Reinforcement Learning (Matlab and Python)]
* [http://rlai.cs.ualberta.ca/RLR/index.html The UofA Reinforcement Learning Library (texts)]
* [http://www.igi.tugraz.at/ril-toolbox The Reinforcement Learning Toolbox from the (Graz University of Technology) ]
* [http://www.cogsci.rpi.edu/~rsun/hybrid-rl.html Hybrid reinforcement learning]
* [http://sourceforge.net/projects/piqle/ Piqle: a Generic Java Platform for Reinforcement Learning]
* [http://www.user.cifnet.com/~lwebzem/ttt/ttt.html Reinforcement Learning applied to Tic-Tac-Toe Game]
* [http://www.scholarpedia.org/article/Reinforcement_Learning Scholarpedia Reinforcement Learning]

Wikimedia Foundation. 2010.

### Look at other dictionaries:

• Reinforcement Learning — Bestärkendes Lernen bzw. Verstärkendes Lernen (engl. Reinforcement Learning) ist eine Variante des Maschinellen Lernens, bei dem ein Agent (ein Computerprogramm) lediglich durch ein System von Belohnung und Bestrafung lernt, seinen Nutzen zu… …   Deutsch Wikipedia

• Reinforcement — Reinforce redirects here. For the Magical Girl Lyrical Nanoha character, see Reinforce (Nanoha). This article is about the term used in operant conditioning. For the construction materials reinforcement, see Rebar. For reinforcement learning in… …   Wikipedia

• Learning classifier system — A learning classifier system, or LCS, is a machine learning system with close links to reinforcement learning and genetic algorithms. First described by John Holland, his LCS consisted of a population of binary rules on which a genetic algorithm… …   Wikipedia

• learning — /lerr ning/, n. 1. knowledge acquired by systematic study in any field of scholarly application. 2. the act or process of acquiring knowledge or skill. 3. Psychol. the modification of behavior through practice, training, or experience. [bef. 900; …   Universalium

• Learning Automata — A branch of the theory of Adaptive control is devoted to learning automata surveyed by Narendra and Thathachar which were originally described explicitly as finite state automata. ReferencesPhilip Aranzulla and John Mellor.Narendra K., Thathachar …   Wikipedia

• learning theory — ▪ psychology Introduction       any of the proposals put forth to explain changes in behaviour produced by practice, as opposed to other factors, e.g., physiological development.       A common goal in defining any psychological (psychology)… …   Universalium

• Learning theory (education) — In psychology and education, a common definition of learning is a process that brings together cognitive, emotional, and enviromental influences and experiences for acquiring, enhancing, or making changes in one s knowledge, skills, values, and… …   Wikipedia

• Learning — Learn and Learned redirect here. For other uses, see Learn (disambiguation) and Learned (disambiguation). Neuropsychology Topics …   Wikipedia

• reinforcement — /riɪnˈfɔsmənt / (say reein fawsmuhnt) noun 1. the act of reinforcing. 2. the state of being reinforced. 3. something that reinforces or strengthens. 4. (often plural) an additional supply of troops, ships, etc., for a military or naval force. 5.… …   Australian-English dictionary

• Temporal difference learning — is a prediction method. It has been mostly used for solving the reinforcement learning problem. TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. [2] TD resembles a Monte Carlo method because it learns by… …   Wikipedia