**Understanding Reinforcement Learning**

We are living in the 21st century, the era of automation, and Machine Learning has become one of its driving forces. The automated machines that we build using Machine Learning techniques carry out repetitive tasks, reducing human effort and time.

However, real-world tasks are often too complex to script in advance, and programming every possible course of action for a machine is impractical. This creates the need for a technique that enables the machine to learn and improve on its own. This Machine Learning technique is called **reinforcement learning**.

Reinforcement learning in Machine Learning is a technique where a machine learns to determine the right step based on the results of the previous steps in similar circumstances.

**Mechanism of Reinforcement Learning**

- Reinforcement learning works on the principle of feedback and improvement.
- In reinforcement learning, we do not train the model on a pre-collected, labeled dataset.
- Instead, the machine takes certain steps on its own, analyzes the feedback, and then tries to improve its next step to get the best outcome.

**Reinforcement Learning Process**

Reinforcement learning is the craft of making optimal decisions for a machine based on experience. Broken down further, the reinforcement learning process includes the following steps:

- Investigating the circumstances
- Deciding on an action by applying some strategy
- Performing the action
- Obtaining a reward or punishment
- Discovering new areas with the help of past experiences and improving the approach
- Iteratively sticking to the strategy and performing the action until the machine learns properly
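The steps above can be sketched as a generic feedback loop. This is only a minimal illustration under stated assumptions: `TwoButtonEnv`, `run_loop`, and the reward values are hypothetical names and numbers invented for the example, and the environment has no state so that the loop stays short.

```python
import random

class TwoButtonEnv:
    """Hypothetical environment: button 1 is rewarded, button 0 is punished."""
    def step(self, action):
        return 1.0 if action == 1 else -1.0

def run_loop(env, n_steps=200, explore_prob=0.1, seed=0):
    rng = random.Random(seed)
    value = [0.0, 0.0]   # running estimate of each action's reward
    counts = [0, 0]
    for _ in range(n_steps):
        # Decide on an action: mostly use the best estimate, sometimes explore.
        if rng.random() < explore_prob:
            action = rng.randrange(2)
        else:
            action = max(range(2), key=lambda a: value[a])
        # Perform the action and obtain a reward or punishment.
        reward = env.step(action)
        # Improve the approach: update the running average for this action.
        counts[action] += 1
        value[action] += (reward - value[action]) / counts[action]
    return value

estimates = run_loop(TwoButtonEnv())
```

After a few steps the estimate for button 1 overtakes the estimate for button 0, so the loop settles on the rewarded action while still exploring occasionally.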

Let’s now understand the theory behind reinforcement learning with the help of a use case to make the picture clearer.

You have a chessboard in front of you, and you have no idea how to play chess. The game has started, and you have to make a move. You randomly pick up a *Bishop* (the **RL agent**) and move it in a straight line.

But that’s a wrong move! A *Bishop* can only move *diagonally*, staying on squares of a single color, backward or forward, provided the path is clear. The lesson from this feedback is that next time you will probably try a legal move. In the same way, you keep refining your knowledge of the moves from the feedback you receive until you learn the right ones.

This is nothing but reinforcement learning. With the help of this reinforcement learning example, we have understood the theory behind it. Now, we will look into the algorithm that is used to implement reinforcement learning.

**How do we implement Reinforcement Learning?**

So far, we have discussed the theoretical aspects of reinforcement learning. But the question that arises is, how do we implement reinforcement learning on a model? Is there any method or a reinforcement learning algorithm to do so?

Yes! There is an algorithm named **Q-learning** that helps the RL (reinforcement learning) agent decide the actions it needs to take in different circumstances.

**How does Q-learning work?**

The Q-learning technique acts as a crib sheet for the reinforcement learning agent. It enables the RL agent to use the feedback of the environment to learn the best actions it can take in different circumstances.

Q-learning makes use of **Q-values** to track and improve the performance of the RL agent. Initially, the Q-values are set to arbitrary values. When the RL agent performs different actions and receives feedback (a reward or a punishment) for them, the Q-values are updated.
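One common way to store Q-values is a table with one row per state and one entry per action. The sketch below assumes a hypothetical problem with four actions and made-up state labels; initializing every value to zero is just one arbitrary choice among many.

```python
from collections import defaultdict

N_ACTIONS = 4  # hypothetical action set: up, right, down, left

# Unseen states get a row of zeros the first time they are looked up.
Q = defaultdict(lambda: [0.0] * N_ACTIONS)

state = (3, 0)        # hypothetical grid position
Q[state][1] += 0.5    # feedback nudges the value of action 1 in this state

# The action currently believed to be best in this state.
best_action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
```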

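To update the Q-values, we use the following update rule, which is based on the Bellman equation:

$$Q(S, A) \leftarrow (1 - \alpha)\, Q(S, A) + \alpha \left[ R + \gamma \max_{A'} Q(S', A') \right]$$

The above equation can also be written as follows:

$$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{A'} Q(S', A') - Q(S, A) \right]$$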

Here,

- **S**: The present condition (**state**) of the RL agent
- **A**: The present **action** to be performed
- **S′**: The subsequent state in which the agent ends up
- **A′**: The best next action, chosen using the present Q-values
- **R**: The immediate **reward** received from the environment in response to the action performed
- **α**: The **learning rate**. Its value is greater than 0 and less than or equal to 1 (0 < α ≤ 1). It controls how strongly the Q-values are updated in each iteration
- **γ**: The **discount factor**. Its value lies between 0 and 1 (0 ≤ γ ≤ 1). It determines the significance of future rewards. A high value of γ (close to 1) makes the agent favor long-term rewards, and a value of 0 for γ means that the RL agent considers only immediate rewards

The Bellman equation states that the Q-value of being in state S and taking action A is the immediate reward R(S, A) plus the discounted highest Q-value obtainable from the next state S′.
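As a worked example of a single update, the sketch below applies the standard Q-learning rule to made-up numbers. The helper name `q_update` and all the values are assumptions chosen for illustration.

```python
def q_update(q_sa, reward, next_q_values, alpha, gamma):
    """Return the updated Q(S, A) after observing one transition."""
    # Target: immediate reward plus the discounted best value in the next state.
    td_target = reward + gamma * max(next_q_values)
    td_error = td_target - q_sa
    return q_sa + alpha * td_error

new_q = q_update(
    q_sa=2.0,                            # current estimate of Q(S, A)
    reward=-1.0,                         # immediate reward R
    next_q_values=[1.0, 3.0, 0.0, 0.0],  # Q(S', a) for every action a
    alpha=0.5,
    gamma=0.9,
)
```

Here the target is −1 + 0.9 × 3 = 1.7, which is lower than the current estimate of 2.0, so the update moves the estimate halfway toward it, to roughly 1.85.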

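Also, Q(S′, A′) in turn depends on Q(S″, A″), and so on, as shown in the equations below:

$$Q(S, A) = R(S, A) + \gamma \max_{A'} Q(S', A'), \qquad Q(S', A') = R(S', A') + \gamma \max_{A''} Q(S'', A''), \qquad \dots$$

Substituting each equation into the previous one shows that a reward received k steps in the future is weighted by γ^k.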

When we adjust the value of γ, we decrease or increase the contribution of the expected future rewards.
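To make the effect of γ concrete, the sketch below sums a made-up reward sequence with two different discount factors. The function name `discounted_return` and the reward values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward k steps ahead by gamma ** k."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

rewards = [0.0, 0.0, 0.0, 10.0]  # hypothetical: the payoff arrives 3 steps later

myopic = discounted_return(rewards, gamma=0.0)   # ignores everything after step 0
patient = discounted_return(rewards, gamma=0.9)  # still values the late payoff
```

With γ = 0 the late payoff contributes nothing, while with γ = 0.9 it still contributes about 7.29 of its original 10.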

Since the Bellman equation is recursive, we can start from arbitrary initial guesses for all the Q-values. With enough experience, the model will converge to the optimal strategy.

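Practically, it is implemented as follows:

$$Q_{t+1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{A} Q_t(S_{t+1}, A) - Q_t(S_t, A_t) \right]$$

where **t** denotes the iteration.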

We can also use an **ε-greedy** policy to choose the action, based on the current Q-values.

With probability **1 − ε**, the action with the **largest Q-value** is chosen; with probability **ε**, an action is chosen at random.
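A minimal sketch of ε-greedy action selection is shown below. The function name `epsilon_greedy_action`, the Q-values, and the seed are assumptions made for the example.

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

rng = random.Random(0)
q_row = [0.1, 0.7, 0.2, 0.0]  # hypothetical Q-values for one state

picks = [epsilon_greedy_action(q_row, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_share = picks.count(1) / len(picks)
```

Because the random branch can also land on the greedy action, action 1 is expected to be picked with probability (1 − ε) + ε / 4 = 0.925 in this setup, so `greedy_share` should come out close to that value.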

So far, we have looked at all the theoretical concepts. Now, in this blog on ‘What is Reinforcement Learning?’ we will implement Q-learning in Python.

**Implementing Q-learning for Reinforcement Learning in Python**

For implementing reinforcement learning algorithms such as Q-learning, we use the OpenAI Gym environment available in Python.

Now, let’s look at the **steps to implement Q-learning**:

**Step 1:** Importing Libraries

```python
import gym
import itertools
import matplotlib
import matplotlib.style
import numpy as np
import pandas as pd
import sys

from collections import defaultdict

# Local helper modules, assumed to be on the import path
# (they are not part of the gym package itself)
from windy_gridworld import WindyGridworldEnv
import plotting

matplotlib.style.use('ggplot')
```

**Step 2:** Creating the Gym Environment

`env = WindyGridworldEnv()`
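The training code in this article relies on only a small interface: `env.reset()` returns a starting state, `env.step(action)` returns `(next_state, reward, done, info)`, and `env.action_space.n` gives the number of actions. If the `windy_gridworld` module is not available, a tiny stand-in with the same shape can be used for experimentation. `CorridorEnv` below is a hypothetical example written for this purpose; it is not part of Gym and is much simpler than the windy gridworld.

```python
class _ActionSpace:
    """Minimal stand-in exposing only the `n` attribute."""
    def __init__(self, n):
        self.n = n

class CorridorEnv:
    """Hypothetical 1-D corridor: start in cell 0, goal in the last cell.

    Actions: 0 = left, 1 = right. Every step yields a reward of -1.
    """
    def __init__(self, length=5):
        self.length = length
        self.action_space = _ActionSpace(2)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        # Clamp the position so the agent cannot leave the corridor.
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        return self.state, -1.0, done, {}

env_demo = CorridorEnv()
s = env_demo.reset()
s, r, done, _ = env_demo.step(1)  # move one cell to the right
```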

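**Step 3:** Defining the ε-Greedy Policy

```python
def createEpsilonGreedyPolicy(Q, epsilon, n_action):
    """Return a function mapping a state to epsilon-greedy action probabilities."""
    def policyFunction(state):
        # Every action gets an equal share of the exploration probability
        Action_probabilities = np.ones(n_action, dtype=float) * epsilon / n_action
        # The greedy action receives the remaining probability mass
        best_step = np.argmax(Q[state])
        Action_probabilities[best_step] += (1.0 - epsilon)
        return Action_probabilities

    return policyFunction
```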

**Step 4:** Building the Q-learning Model

```python
def qLearning(env, num_episodes, discount_factor=1.0, alpha=0.6, epsilon=0.1):
    # Action-value table: maps each state to an array of action values,
    # created as zeros the first time a state is seen
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    # Tracking the important statistics
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))

    # Creating function for an epsilon greedy policy
    policy = createEpsilonGreedyPolicy(Q, epsilon, env.action_space.n)

    for ith_episode in range(num_episodes):
        # Assumes the classic Gym API: reset() returns the state
        # and step() returns four values
        state = env.reset()

        for t in itertools.count():
            # Sample an action from the epsilon-greedy distribution
            action_probabilities = policy(state)
            action = np.random.choice(
                np.arange(len(action_probabilities)), p=action_probabilities)

            next_state, reward, done, _ = env.step(action)

            # Update the statistics for the current episode
            stats.episode_rewards[ith_episode] += reward
            stats.episode_lengths[ith_episode] = t

            # Temporal-difference update of the Q-value
            best_next_step = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_step]
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta

            if done:
                break
            state = next_state

    return Q, stats
```

**Step 5:** Training the Model

`Q, stats = qLearning(env, 1000)`
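Once training has finished, the returned `Q` maps each visited state to its action values, and the learned behavior can be read off by taking the highest-valued action in each state. The sketch below uses a small made-up table instead of a trained one; `greedy_policy` and `Q_example` are hypothetical names.

```python
def greedy_policy(Q):
    """Map each visited state to the index of its highest-valued action."""
    return {
        state: max(range(len(values)), key=lambda a: values[a])
        for state, values in Q.items()
    }

# Hypothetical Q-table with two states and four actions each.
Q_example = {
    (3, 0): [-5.0, -3.2, -4.1, -6.0],
    (3, 1): [-4.0, -2.5, -2.4, -5.5],
}

policy = greedy_policy(Q_example)
```

The same function should also work on a table whose rows are NumPy arrays, since it only uses `len` and indexing.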

**Step 6:** Plotting the Visualization Graph

`plotting.plot_episode_stats(stats)`

From the above graph, we can infer that the reward per episode increases as training progresses. The rise toward the maximum reward per episode shows that the RL agent learns to take the right actions by maximizing its total reward.
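Per-episode rewards are usually noisy, so a trend like this is easier to see after smoothing. The sketch below computes a simple moving average over made-up reward values; the `moving_average` helper and the numbers are assumptions for illustration and are not taken from the `plotting` module.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest one.
        running += values[i] - values[i - window]
        out.append(running / window)
    return out

episode_rewards = [-90.0, -70.0, -80.0, -40.0, -50.0, -30.0]  # made-up numbers
smoothed = moving_average(episode_rewards, window=3)
```

In this made-up series the raw rewards go up and down, but the smoothed values rise steadily, which is the pattern to look for in a learning curve.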

This is all about reinforcement learning and its implementation.
