# Reinforcement learning in Scala

You may wonder how robots, autonomous systems or a software game player learn. The answer lies in a field of AI known as reinforcement learning. For example, a robot navigating a maze plans his next move according to its current location and previous moves. Teaching a robot all possible move according to the different location in the maze is not realistic, making any supervised learning technique inadequate. This article describes a very common reinforcement learning methodology, Q-learning and its implementation in Scala.

**Overview**

There are many different reinforcement learning techniques. One of the most commonly used method is searching the value function space using temporal difference method. All known reinforcement learning methods share the same objective of solving the problem of finding the optimum sequential decision tasks. In a sequential decision task, an agent interacts with a dynamic system by selecting actions that affect the transition between states in order to optimize a given reward function. At any given step i, the agent select an action

**a(i)**on the current state

**s(i)**. The dynamic system responds by rewarding the agent for its optimal selection of the next state:

The learning agent infers the policy that maps the set of states

**{s}**to the set of available actions

**{a}**, using a value function

The policy is defined at

**Temporal Difference**

The most common approach of learning a **value function V**is to use the

**Temporal Difference**method (TD). The method uses observations of prediction differences from consecutive states, s(i) & s(i+1). If we note r the reward for selection an action from state s(i) to s(i+1) and n the learning rate, then the value V is updated as

Therefore the goal of the temporal difference method is to learn the value function for the optimal policy. The

**Q 'action-value'**function represents the expected value of action a on a state s and defined as

where r is the reward value for the state.

**On-policies vs. Off-policy**

The Temporal Difference method relies on the estimate of the final reward to be computed for each state. There are two methods of the Temporal Difference algorithm:On-Policy and Off-Policy:**On-Policy**method learns the value of the policy used to make the decision. The value function is derived from the execution of actions using the same policy but based on history**Off-Policy**method learns potentially different policies. Therefore the estimate is computed using actions that have not been executed yet.

The most common formula for temporal difference approach is the**Q-learning**formula. It introduces the concept of discount rate to reduce the impact of the first few states on the optimization of the policy. It does not need a model of its environment. The**exploitation**of action-value approach consists of selecting the next state is by computing the action with the maximum reward. Conversely the**exploration**approach focus on the total anticipated reward. The update equation for the Q-Learning is

One of the most commonly used On-Policy method is**Sarsa**which does not necessarily select the action that offer the most value.The update equation is defined as

Functional languages are particularly suitable for iterative computation. We use Scala for the implementation of the temporal difference algorithm. We allow the user to specify any variant of the learning formula, using local functions or closures.Firstly, we have to define a state class,**States and Actions****QLState**(line 1) that contains a list of actions of type**QLAction**(line 3) that can be executed from this state. The only purpose of this class is to connect a list of action to a source state. The parameterized class argument**property**(line 4) is used to "attach" some extra characteristics to this state.

As described in the introduction, an action of class`1 class QLState[T]( 2 val id: Int, 3 val actions: List[QLAction[T]] = List.empty, 4 property: T) { 5 6 @inline 7 def isGoal: Boolean = !actions.isEmpty 8 }`

**QLAction**has a source state from and a destination state**to**(state which is reached following the action). A state except the goal state, has multiple actions but an action has only one destination or resulting state.`class QLAction[T](from: Int, to: Int)`

The state and action can be loaded, generated and managed by a directed graph or search space of type **QLSpace**. The search space contains the list of all the possible states available to the agent. One or more of these states can be selected as goals. The algorithm does not restrict the agent to a single state. The process ends when one of the goal states is reached (OR logic). The algorithm does not support combined goals (AND logic).

Let's implement the basic components of the search space **QLSpace**. The class list all available **states** (line 2) and one or more final or goal states **goalIds** (line 3). Although you would expect that the search space contains a single final or goal state, it is not uncommon to have online training using more than one goal state.

```
1 class QLSpace[T](
2 states: Array[QLState[T]],
3 goalIds: Array[Int]) {
4
5 // Indexed map of states
6 val statesMap: immutable.Map[Int, QLState[T]] =
7 states.map(st => (st.id, st)).toMap
8 // List set of one or more goals
9 val goalStates = new immutable.HashSet[Int]() ++ goalIds
10
11 // Compute the maximum Q value for a given state and policy
12 def maxQ(st: QLState[T], policy: QLPolicy[T]): Double = {
13 val best = states.filter( _ != st)
14 .maxBy(_st => policy.EQ(st.id, _st.id))
15 policy.EQ(st.id, best.id)
16 }
17
18 // Retrieves the list of states destination of state, st
19 def nextStates(st: QLState[T]): List[QLState[T]] =
20 st.actions.map(ac => statesMap.get(ac.to).get)
21
22 def init(r: Random): QLState[T] =
23 states(r.nextInt(states.size-1))
24 }
```

A hash map **statesMap** maintains a dictionary of all the possible states with the state id as unique key (lines 6, 7). The class **QLSpace** has three important methods:

**init**initializes the search with a random state for each training epoch (lines 22, 23)**nextStates**returns the list of destination states associated to the state st (lines 19, 20)**maxQ**return the maximum Q-value for this state st given the current policy**policy**(lines 12-15). The method filters out itself from the search from the next best action. It then compute the maximum reward or Q(state, action) value according to the given policy**policy**

The next step is to defined a policy.

A policy is defined by three components**Learning Policy**- A
**reward**collected after transitioning from one state to another state (line 2). The reward is provided by the user - A Q(State, Action) value,
**value**associated to a transition state and an action (line 4) - A
**probability**(with default values of 1.0) that defines the obstacles or hindrance to migrate from one state to another (line 3)The**estimate**combine the Q-value (incentive to move to the best next step) and probability (hindrance to move to any particular state) (line 7).`1 class QLData { 2 var reward: Double = 1.0 3 var probability: Double = 1.0 4 var value: Double = 0.0) { 5 6 @inline 7 final def estimate: Double = value*probability 8 }`

The policy of type QLPolicy is a container for the state transition attributes, rewards, Q-values and probabilities.`1 class QLPolicy[T](numStates: Int, input: Array[QLInput]) { 2 3 val qlData = { 4 val data = Array.tabulate(numStates)( 5 _ => Array.fill(numStates)(new QLData) 6 ) 7 8 input.foreach(in => { 9 data(in.from)(in.to).reward = in.reward 10 data(in.from)(in.to).probability = in.prob 11 }) 12 data 13 } 14 15 def setQ(from: Int, to: Int, value: Double): Unit = 16 qlData(from)(to).value = value 17 18 def Q(from: Int, to: Int): Double = qlData(from)(to).value 19 }`

The constructor for**QLPolicy**takes two arguments: - Number of states numStates (line 1)
- Sequence of input of type QLInput to the policyThe constructor creates a numStates x numStates matrix of transition of type
**QLData**(lines 3 - 12), from the input.

The type**QLInput**wraps the input data (index of the input state from, index of the output state**to**,**reward**and**probability**associated to the state transition) into a single convenient class.`ase class QLInput( from: Int, to: Int, reward: Double = 1.0, prob: Double = 1.0)`

The first step is to define a model for the reinforcement learning. A model is created during training and is composed of**Model and Training** - Best policy to transition from any initial state to a goal state
- Coverage ratio as defined as the percentage of training cyles that reach the (or one of the) goal.
`class QLModel[T](val bestPolicy: QLPolicy[T], val coverage: Double)`

The**QLearning**class takes 3 arguments - A set of configuration parameters
**config** - The search/states space
**qlSpace** - The initial policy associated with the states (reward and probabilities)
**qlPolicy**`1 class QLearning[T]( 2 config: QLConfig, 3 qlSpace: QLSpace[T], 4 qlPolicy: QLPolicy[T]) 5 6 //model in Q-learning algorithm 7 val model: Option[QLModel[T]] = train.toOption 8 9 // Generate a model through multi-epoch training 10 def train: Try[Option[QLModel[T]]] {} 11 private def train(r: Random): Boolean {} 12 13 // Predict a state as a destination of this current 14 // state, given a model 15 def predict : PartialFunction[QLState[T], QLState[T]] {} 16 17 // Select next state and action index 18 def nextState(st: (QLState[T], Int)): (QLState[T], Int) {} 19 }`

The model of type**Option[QLModel]**(line 7) is created by the method train (line 10). Its value is None if training failed.

The training method**train**consists of executing**config.numEpisodes**cycle or episode of a sequence of state transition (line 5). The random generator r is used in the initialization of the search space.`1 def train: Option[QLModel[T]] = { 2 val r = new Random(System.currentTimeMillis) 3 4 Try { 5 val completions = Range(0, config.numEpisodes).filter(train(r) ) 6 7 val coverage = completions.toSize.toDouble/config.numEpisodes 8 if(coverage > config.minCoverage) 9 new QLModel[T](qlPolicy, coverage) 10 else 11 QLModel.empty[T] 12 }.toOption 13 }`

The training process exits with the model if the minimum**minCoverage**(number of episodes for which the goal state is reached) is met (line 8).

The method**train(r: scala.util.Random)**uses a tail recursion to transition from the initial random state to one of the goal state. The tail recursion is implemented by the search method (line 4). The method implements the recursive temporal difference formula (lines 14-18).

The**state**for which the action generates the highest reward**R**given a policy**qlPolicy**(line 10) is computed for each new state transition. The Q-value of the current policy is then updated**qlPolicy.setQ**before repeating the process for the next state, through recursion (line 21).`1 def train(r: Random): Boolean = { 2 3 @scala.annotation.tailrec 4 def search(st: (QLState[T], Int)): (QLState[T], Int) = { 5 val states = qlSpace.nextStates(st._1) 6 if( states.isEmpty || st._2 >= config.episodeLength ) 7 (st._1, -1) 8 9 else { 10 val state = states.maxBy(s => qlPolicy.R(st._1.id, s.id)) 11 if( qlSpace.isGoal(state) ) 12 (state, st._2) 13 else { 14 val r = qlPolicy.R(st._1.id, state.id) 15 val q = qlPolicy.Q(st._1.id, state.id) 16 // Q-Learning formula 17 val deltaQ = r + config.gamma*qlSpace.maxQ(state, qlPolicy) -q 18 val nq = q + config.alpha*deltaQ 19 20 qlPolicy.setQ(st._1.id, state.id, nq) 21 search((state, st._2+1)) 22 } 23 } 24 } 25 26 r.setSeed(System.currentTimeMillis*Random.nextInt) 27 28 val finalState = search((qlSpace.init(r), 0)) 29 if( finalState._2 != -1) 30 qlSpace.isGoal(finalState._1) 31 else 32 false 33 }`

**Note:**There is no guarantee that one of the goal state is reached from any initial state chosen randomly. It is expected that some of the training epoch fails. This is the reason why monitoring coverage is critical. Obviously, you may choose a deterministic approach to the initialization of each training epoch by picking up any state beside the goal state(s), as a starting state.

Once trained, the model is used to predict the next state with the highest value (or probability) given an existing state. The prediction is implemented as a partial function.**Prediction**`1 def predict : PartialFunction[QLState[T], QLState[T]] = { 2 case state: QLState[T] if(model != None) => 3 if( state.isGoal) state else nextState(state, 0)._1 4 }`

The method**nextState**does the heavy lifting. It retrieves the list of**states**associated with the current state**st**through its actions set (line 2). The next most rewarding state**qState**is computed using the reward matrix**R**of the best policy of the QLearning model (lines 6 - 8).`1 def nextState(st: (QLState[T], Int)): (QLState[T], Int) = { 2 val states = qlSpace.nextStates(st._1) 3 if( states.isEmpty || st._2 >= config.episodeLength) 4 st 5 else { 6 val qState = states.maxBy( 7 s => model.map(_.bestPolicy.R(st._1.id, s.id)) 8 .getOrElse(-1.0) 9 ) 10 11 nextState( (qState, st._2+1)) 12 } 13 }`

An article or blog spot can not realistically describe all the elements and strategies of reinforcement learning from K-armed bandits to deep learning. However, this chapter should provide you with a road map on how to implement a simple reinforcement learning algorithm in Scala.**Conclusion**

Originally published on www.scalaformachinelearning.com