An Introduction to Q-Learning Part 1


Published May 18, 2022


Thomas Simonini

Unit 2, part 1 of the Deep Reinforcement Learning Class with Hugging Face 🤗

⚠️ A new updated version of this article is available here 👉 https://huggingface.co/deep-rl-course/unit1/introduction

This article is part of the Deep Reinforcement Learning Class. A free course from beginner to expert. Check the syllabus here.


In the first chapter of this class, we learned about Reinforcement Learning (RL), the RL process, and the different methods to solve an RL problem. We also trained our first lander agent to land correctly on the Moon 🌕 and uploaded it to the Hugging Face Hub.

So today, we're going to dive deeper into one of the Reinforcement Learning methods: value-based methods, and study our first RL algorithm: Q-Learning.

We'll also implement our first RL agent from scratch: a Q-Learning agent, and train it in two environments:

  1. Frozen-Lake-v1 (non-slippery version): where our agent will need to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H).
  2. An autonomous taxi will need to learn to navigate a city to transport its passengers from point A to point B.

This unit is divided into 2 parts:


In the first part, we'll learn about the value-based methods and the difference between Monte Carlo and Temporal Difference Learning.

And in the second part, we'll study our first RL algorithm: Q-Learning, and implement our first RL Agent.

This unit is fundamental if you want to be able to work on Deep Q-Learning (Unit 3): the first Deep RL algorithm that was able to play Atari games and beat the human level on some of them (Breakout, Space Invaders…).

So let's get started!

  • What is RL? A short recap
  • The two types of value-based methods
    • The State-Value function
    • The Action-Value function
  • The Bellman Equation: simplify our value estimation
  • Monte Carlo vs Temporal Difference Learning
    • Monte Carlo: learning at the end of the episode
    • Temporal Difference Learning: learning at each step

What is RL? A short recap

In RL, we build an agent that can make smart decisions. For instance, an agent that learns to play a video game. Or a trading agent that learns to maximize its benefits by making smart decisions on what stocks to buy and when to sell.


But, to make intelligent decisions, our agent will learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.

Its goal is to maximize its expected cumulative reward (because of the reward hypothesis).

The agent's decision-making process is called the policy π: given a state, a policy will output an action or a probability distribution over actions. That is, given an observation of the environment, a policy will provide an action (or a probability for each action) that the agent should take.
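To make this concrete, here is a minimal sketch (my own illustration, not course code) of the two flavors of policy as plain Python functions; the states, actions, and probabilities are made up for the example:

```python
import random

ACTIONS = ["left", "right", "up", "down"]

def deterministic_policy(state):
    """Deterministic policy: one state always maps to one action."""
    lookup = {"s0": "right", "s1": "down"}   # hypothetical mapping
    return lookup.get(state, "left")

def stochastic_policy(state):
    """Stochastic policy: returns a probability for each action."""
    return {"left": 0.1, "right": 0.7, "up": 0.1, "down": 0.1}  # hypothetical

# A deterministic policy directly gives the action to take:
print(deterministic_policy("s0"))            # right

# A stochastic policy is sampled to pick the action:
probs = stochastic_policy("s0")
action = random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
```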


Our goal is to find an optimal policy π*, i.e., a policy that leads to the best expected cumulative reward.

And to find this optimal policy (hence solving the RL problem), there are two main types of RL methods:

  • Policy-based methods: Train the policy directly to learn which action to take given a state.
  • Value-based methods: Train a value function to learn which state is more valuable and use this value function to take the action that leads to it.

And in this chapter, we'll dive deeper into the value-based methods.

The two types of value-based methods

In value-based methods, we learn a value function that maps a state to the expected value of being at that state.


The value of a state is the expected discounted return the agent can get if it starts at that state and then acts according to our policy.

If you forgot what discounting is, you can read this section.

But what does it mean to act according to our policy? After all, we don't have a policy in value-based methods, since we train a value function and not a policy.

Remember that the goal of an RL agent is to have an optimal policy π.

To find it, we learned that there are two different methods:

  • Policy-based methods: Directly train the policy to select what action to take given a state (or a probability distribution over actions at that state). In this case, we don't have a value function.

The policy takes a state as input and outputs what action to take at that state (deterministic policy).

And consequently, we don't define by hand the behavior of our policy; it's the training that will define it.

  • Value-based methods: Indirectly, by training a value function that outputs the value of a state or a state-action pair. Given this value function, our policy will take an action.

But, because we didn't train our policy, we need to specify its behavior. For instance, if we want a policy that, given the value function, will take actions that always lead to the biggest reward, we'll create a Greedy Policy.


Consequently, whatever method you use to solve your problem, you will have a policy. But in the case of value-based methods, you don't train it: your policy is just a simple function that you specify (for instance, a greedy policy), and this policy uses the values given by the value function to select its actions.

So the difference is:

  • In policy-based methods, the optimal policy is found by training the policy directly.
  • In value-based methods, finding an optimal value function leads to having an optimal policy.

In fact, most of the time, in value-based methods, you'll use an Epsilon-Greedy Policy that handles the exploration/exploitation trade-off; we'll talk about it when we cover Q-Learning in the second part of this unit.
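To make this concrete, here is a minimal sketch (not from the course materials; the Q-table shape and epsilon value are illustrative assumptions) of a greedy policy and an epsilon-greedy policy reading actions off a Q-table:

```python
import numpy as np

n_states, n_actions = 16, 4          # e.g., a small 4x4 grid world (assumption)
Q = np.zeros((n_states, n_actions))  # one value per (state, action) pair

def greedy_policy(Q, state):
    """Exploitation: always take the action with the highest Q-value."""
    return int(np.argmax(Q[state]))

def epsilon_greedy_policy(Q, state, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit."""
    if np.random.random() < epsilon:
        return int(np.random.randint(n_actions))   # exploration
    return greedy_policy(Q, state)                 # exploitation
```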

So, we have two types of value-based functions:

The State-Value function

We write the state value function under a policy π like this:

$$V_{\pi}(s) = \mathbb{E}_{\pi}\bigl[\,G_t \mid S_t = s\,\bigr]$$

For each state, the state-value function outputs the expected return if the agent starts at that state and then follows the policy forever after (for all future timesteps, if you prefer).


The Action-Value function

In the action-value function, for each state and action pair, the action-value function outputs the expected return if the agent starts in that state, takes that action, and then follows the policy forever after.

The value of taking action a in state s under a policy π is:

$$Q_{\pi}(s, a) = \mathbb{E}_{\pi}\bigl[\,G_t \mid S_t = s, A_t = a\,\bigr]$$

We see that the difference is:

  • In the state-value function, we calculate the value of a state $S_t$.
  • In the action-value function, we calculate the value of the state-action pair ($S_t, A_t$), hence the value of taking that action at that state.
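In tabular settings, this difference is simply the shape of what we store: the state-value function keeps one number per state, while the action-value function keeps one number per state-action pair. A small illustrative sketch (the sizes are assumptions):

```python
import numpy as np

n_states, n_actions = 16, 4

V = np.zeros(n_states)               # V(s): one expected return per state
Q = np.zeros((n_states, n_actions))  # Q(s, a): one expected return per (state, action) pair

state, action = 0, 2
print(V[state])          # value of being in state 0 and following the policy afterwards
print(Q[state, action])  # value of taking action 2 in state 0, then following the policy
```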

In either case, whatever value function we choose (state-value or action-value function), the value is the expected return.

However, this implies that to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state.

This can be a tedious process, and that's where the Bellman equation comes to help us.

The Bellman Equation: simplify our value estimation

The Bellman equation simplifies our state value or state-action value calculation.


With what we have learned so far, we know that if we calculate $V(S_t)$ (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after. (The policy we defined in the following example is a Greedy Policy, and for simplification, we don't discount the reward.)

So to calculate $V(S_t)$, we need to sum the expected rewards. Hence:

$$V(S_t) = R_{t+1} + R_{t+2} + R_{t+3} + \dots$$

Then, to calculate $V(S_{t+1})$, we need to calculate the return starting at state $S_{t+1}$.


So you see, that's a pretty tedious process if you need to do it for each state value or state-action value.

Instead of calculating the expected return for each state or each state-action pair, we can use the Bellman equation.

The Bellman equation is a recursive equation that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:

The immediate reward $R_{t+1}$ + the discounted value of the state that follows ($\gamma \cdot V(S_{t+1})$).

$$V(S_t) = R_{t+1} + \gamma \cdot V(S_{t+1})$$

If we go back to our example, the value of State 1 = the expected cumulative return if we start at that state.


To calculate the value of State 1: the sum of rewards if the agent started in that state and then followed the policy for all the time steps.

This is equivalent to $V(S_t)$ = immediate reward $R_{t+1}$ + discounted value of the next state $\gamma \cdot V(S_{t+1})$.


For simplification, here we don't discount, so $\gamma = 1$.

  • The value of $V(S_{t+1})$ = immediate reward $R_{t+2}$ + discounted value of the next state ($\gamma \cdot V(S_{t+2})$).
  • And so on.

To recap: the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, which is a long process, we calculate the value as the sum of the immediate reward plus the discounted value of the state that follows.
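To see the equivalence concretely, here is a tiny sketch (my own illustration, with a made-up 4-state deterministic chain and no discounting, as in the example above) that computes the same values both ways: by summing the rewards to the end of the chain, and with the Bellman relation $V(S_t) = R_{t+1} + \gamma \cdot V(S_{t+1})$:

```python
# Hypothetical deterministic chain: s0 -> s1 -> s2 -> s3 (terminal).
rewards = [1, 0, 2]   # reward received when stepping out of s0, s1, s2
gamma = 1.0           # no discounting, as in the example above

# Direct way: sum the (discounted) rewards from a state until the end.
def value_by_summing(start):
    return sum(gamma**k * r for k, r in enumerate(rewards[start:]))

# Bellman way: V(s) = R + gamma * V(next state), filled in backwards.
V = [0.0] * 4                       # V(s3) = 0 (terminal state)
for s in reversed(range(3)):
    V[s] = rewards[s] + gamma * V[s + 1]

print([value_by_summing(s) for s in range(3)])  # [3.0, 2.0, 2.0]
print(V[:3])                                    # [3.0, 2.0, 2.0]
```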

Monte Carlo vs Temporal Difference Learning

The last thing we need to talk about before diving into Q-Learning is the two ways of learning.

Remember that an RL agent learns by interacting with its environment. The idea is that, given the experience and the reward it gets, the agent will update its value function or its policy.

Monte Carlo and Temporal Difference Learning are two different strategies on how to train our value function or our policy function. Both of them use experience to solve the RL problem.

On one hand, Monte Carlo uses an entire episode of experience before learning. On the other hand, Temporal Difference uses only one step ($S_t, A_t, R_{t+1}, S_{t+1}$) to learn.

We'll explain both of them using a value-based method example.

Monte Carlo: learning at the end of the episode

Monte Carlo waits until the end of the episode, calculates $G_t$ (the return) and uses it as a target for updating $V(S_t)$.

So it requires a complete episode of interaction before updating our value function.


If we take an example:

  • We always start the episode at the same starting point.

  • The agent takes actions using the policy. For instance, using an Epsilon Greedy Strategy, a policy that alternates between exploration (random actions) and exploitation.

  • We get the reward and the next state.

  • We terminate the episode if the cat eats the mouse or if the mouse moves > 10 steps.

  • At the end of the episode, we have a list of states, actions, rewards, and next states.

  • The agent will sum the total rewards $G_t$ (to see how well it did).

  • It will then update $V(S_t)$ based on the formula:

$$V(S_t) \leftarrow V(S_t) + lr \cdot \bigl[G_t - V(S_t)\bigr]$$
  • Then start a new game with this new knowledge.

By running more and more episodes, the agent will learn to play better and better.
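Putting those steps together, a Monte Carlo update loop has roughly the following shape. This is only a sketch under assumptions: `env` is an episodic environment with a Gym-style `reset()`/`step()` API, `policy` could be the epsilon-greedy function sketched earlier, and `V` is a value table; it is not the exact code we will write in part 2.

```python
def run_episode(env, policy):
    """Play one full episode and collect (state, reward) pairs."""
    episode = []
    state, _ = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, reward))        # reward here is R_{t+1} for this state
        state = next_state
        done = terminated or truncated
    return episode

def monte_carlo_update(V, episode, lr=0.1, gamma=1.0):
    """After the episode ends, update V(s) toward the return G_t from each visited state."""
    G = 0.0
    for state, reward in reversed(episode):    # build the returns backwards
        G = reward + gamma * G
        V[state] = V[state] + lr * (G - V[state])
    return V
```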


For instance, if we train a state-value function using Monte Carlo:

  • We just started to train our value function, so it returns 0 for each state.
  • Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount)
  • Our mouse explores the environment and takes random actions.
  • The mouse made more than 10 steps, so the episode ends.
  • We have a list of states, actions, rewards, and next states; we need to calculate the return $G_t$.
  • $G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots$ (for simplicity, we don't discount the rewards).
  • $G_t = 1 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0$
  • $G_t = 3$
  • We can now update $V(S_0)$:
  • New $V(S_0) = V(S_0) + lr \cdot [G_t - V(S_0)]$
  • New $V(S_0) = 0 + 0.1 \cdot [3 - 0]$
  • New $V(S_0) = 0.3$ (see the quick check below).
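As a quick check, the same update in a couple of lines of Python (this just reproduces the arithmetic above):

```python
V_S0, lr, G_t = 0.0, 0.1, 3.0   # numbers from the example
V_S0 = V_S0 + lr * (G_t - V_S0)
print(V_S0)                      # ~0.3 (0.30000000000000004 due to float rounding)
```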

Temporal Difference Learning: learning at each step

Temporal Difference, on the other hand, waits for only one interaction (one step) $S_{t+1}$ to form a TD target and updates $V(S_t)$ using $R_{t+1}$ and $\gamma \cdot V(S_{t+1})$.

The idea with TD is to update $V(S_t)$ at each step.

But because we didn't play during an entire episode, we don't have $G_t$ (the expected return). Instead, we estimate $G_t$ by adding $R_{t+1}$ and the discounted value of the next state.

This is called bootstrapping, because TD bases its update in part on an existing estimate $V(S_{t+1})$ and not on a complete sample $G_t$.

$$V(S_t) \leftarrow V(S_t) + lr \cdot \bigl[R_{t+1} + \gamma \cdot V(S_{t+1}) - V(S_t)\bigr]$$

This method is called TD(0) or one-step TD (we update the value function after any individual step).

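Compared with the Monte Carlo loop sketched earlier, the update now happens inside the step loop. Again, a sketch under the same assumptions (a Gym-style `env`, a `policy` function, a value table `V`):

```python
def td0_episode(env, V, policy, lr=0.1, gamma=1.0):
    """Run one episode, updating V(s) after every single step (TD(0))."""
    state, _ = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # TD target: immediate reward + discounted value of the next state
        # (the bootstrap term is zeroed out when the episode is over).
        td_target = reward + gamma * V[next_state] * (not done)
        V[state] = V[state] + lr * (td_target - V[state])
        state = next_state
    return V
```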

If we take the same example,

  • We just started to train our Value function, so it returns 0 value for each state.
  • Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).
  • Our mouse explores the environment and takes a random action: going to the left.
  • It gets a reward $R_{t+1} = 1$ since it eats a piece of cheese.

We can now update $V(S_0)$:

New $V(S_0) = V(S_0) + lr \cdot [R_1 + \gamma \cdot V(S_1) - V(S_0)]$

New $V(S_0) = 0 + 0.1 \cdot [1 + 1 \cdot 0 - 0]$

New $V(S_0) = 0.1$

So we just updated our value function for State 0.
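And the same arithmetic as a quick check:

```python
V_S0, V_S1 = 0.0, 0.0            # both states start at 0
lr, gamma, R_1 = 0.1, 1.0, 1.0   # numbers from the example
V_S0 = V_S0 + lr * (R_1 + gamma * V_S1 - V_S0)
print(V_S0)                       # 0.1
```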

Now we continue to interact with this environment with our updated value function.


If we summarize:

  • With Monte Carlo, we update the value function from a complete episode, and so we use the actual, accurate discounted return of this episode.
  • With TD Learning, we update the value function from a step, so we replace $G_t$, which we don't have, with an estimated return called the TD target.
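Written as update rules (the standard textbook form, with lr as the learning rate), the two approaches differ only in the target used to move $V(S_t)$:

$$
\begin{aligned}
\text{Monte Carlo: } & V(S_t) \leftarrow V(S_t) + lr \cdot \bigl[\,G_t - V(S_t)\,\bigr] \\
\text{TD(0): } & V(S_t) \leftarrow V(S_t) + lr \cdot \bigl[\,R_{t+1} + \gamma \cdot V(S_{t+1}) - V(S_t)\,\bigr]
\end{aligned}
$$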

So now, before diving into Q-Learning, let's summarize what we just learned:

We have two types of value-based functions:

  • State-Value function: outputs the expected return if the agent starts at a given state and acts according to the policy forever after.
  • Action-Value function: outputs the expected return if the agent starts in a given state, takes a given action at that state, and then acts according to the policy forever after.
  • In value-based methods, we define the policy by hand because we don't train it; we train a value function. The idea is that if we have an optimal value function, we will have an optimal policy.

There are two types of methods to learn this value function:

  • With the Monte Carlo method, we update the value function from a complete episode, and so we use the actual, accurate discounted return of this episode.
  • With the TD Learning method, we update the value function from a step, so we replace $G_t$, which we don't have, with an estimated return called the TD target.

So that’s all for today. Congrats on finishing this first part of the chapter! There was a lot of information.

That’s normal if you still feel confused with all these elements. This was the same for me and for all people who studied RL.

Take time to really grasp the material before continuing.

And since the best way to learn and avoid the illusion of competence is to test yourself, we wrote a quiz to help you find where you need to reinforce your study. Check your knowledge here 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit2/quiz1.md

In the second part, we'll study our first RL algorithm: Q-Learning, and implement our first RL Agent in two environments:

  1. Frozen-Lake-v1 (non-slippery version): where our agent will need to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H).
  2. An autonomous taxi will need to learn to navigate a city to transport its passengers from point A to point B.
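If you want to peek ahead, both environments come from the Gym/Gymnasium library; creating them looks roughly like this (a sketch assuming the gymnasium package, the exact code will be in the part 2 hands-on):

```python
import gymnasium as gym

# FrozenLake-v1, non-slippery version: deterministic moves on a 4x4 frozen grid.
frozen_lake = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)

# Taxi-v3: pick up and drop off passengers in a small grid city.
taxi = gym.make("Taxi-v3")

obs, info = frozen_lake.reset()
print(frozen_lake.observation_space.n, frozen_lake.action_space.n)  # 16 states, 4 actions
```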

And don't forget to share with your friends who want to learn 🤗 !

Finally, we want to improve and update the course iteratively with your feedback. If you have some, please fill this form 👉 https://forms.gle/3HgA7bEHwAmmLfwh9

Keep learning, stay awesome,
