TML Scribe: L08

Ouvrir en tant que gabaritVoir le PDF
Auteur:
SIVA DURGA PRASAD M
Last Updated:
il y a 7 ans
License:
Creative Commons CC BY 4.0
Résumé:
Lecture#8 scribbling was done by Ritabrata and Prasad. Please review it
Étiquettes:
Lecture Notes
\begin{now}
Discover why over 25 million people worldwide trust Overleaf with their work.
\documentclass[11pt]{article}
\usepackage{macros}

\parindent0in
\pagestyle{plain}
\thispagestyle{plain}

\scribedby{Scribed By: Ritabrata Podder, M Siva Durga Prasad}
\assignment{Lecture \#8}
\lecturedate{August  23, 2019}

\begin{document}
\maketitle

\outline{In this lecture we will cover Monte Carlo Methods.Monte Carlo methods are learning methods, used for estimating value functions and discovering optimal policies.Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns. To ensure that well-defined returns are available, here we define Monte Carlo methods only for episodic tasks. That is, we assume experience is divided into episodes, and that all episodes eventually terminate no matter what actions are selected}\par

%%% Start writing and defining sections and subsections here
\tableofcontents
\clearpage
\section{Naive Approach:}
{For each s S, run from s for m times, where the ${i^{th}}$ run episode is ${E_i}$}\par 
Let $i^{th}$ episode 0 ${E_i}$ terminates at ${T_i}$.Thus,\par

\begin{equation}
    {E_i = S_0,A_0,R_1,S_1,A_1,R_2......S_T}
\end{equation} 
Let ${G_i}$ be the return of ${E_i}$.Thus 

Estimate the value of policy starting from s, by 
The variables ${G_i}$ are independent since the runs ${E_i}$ are independent

\section{MC Approaches}
\begin{itemize}
    \item First Visit MC-Average returns for first time s visited
    \item Every Visit MC-Averages the returns following all visits to s
\end{itemize}


\subsection{First Visit MC Policy Evaluation:}
\begin{itemize}
    \item Run ${\pi}$ from s for m times, where the ${i^{th}}$ run episode is ${E_i}$
    \item For each run ${E_i}$ and state s in it. Let ${G(s,E_i)}$ be the return of ${\pi}$ in run ${E_i}$ from first appearance of s in ${E_i}$ until the run ends (reaching T state).
    \item Let the value of state s under policy ${\pi}$
    \begin{equation}
        {\hat{V}_\pi(s) = \frac{1}{m} \displaystyle\sum_{i=1}^{m}G(s,{E_i})}
    \end{equation}
    \item The random variable ${G(s,{E_i})}$ for a given state s and different ${E_{i^s}}$ are independent since different runs (episodes) are independent, since different runs are independent.
\end{itemize}

\begin{figure}[h]
    \centering
    \includegraphics[width=0.5\linewidth]{assets/1st_visit_mc.PNG}
    \caption{Return in first visit MC}
    \label{figure:1st_visit_mc}
\end{figure}

\subsection{Every Visit MC Policy Evaluation:}

\begin{itemize}
    \item Run ${\pi}$ from s for m times, where the $i^{th}$ run (episode i is ${E_i}$).
    \item For each run ${E_i}$ and state s in it. Let ${G(s,{E_i})}$ be the return of ${\pi}$ in run ${E_i}$ from the $j^{th}$ appearance of s in ${E_i}$.
    \item Let ${N_i(s)}$ be the number of times state s has been visited in the run ${E_i}$
    \item Let the value of state s under policy ${\pi}$
     \begin{equation}
        {\hat{V}_\pi(s) = \frac{1}{\displaystyle\sum_{i=1}^{m} N_i(s)} \displaystyle\sum_{i=1}^{m} \displaystyle\sum_{i=1}^{N_i(s)} G(s, {E_i}, j)}
    \end{equation}
    The random variable ${G(s,E_i,j}$ for a given state s and ${E_{i^s}}$ are independent. Since different run are independent
\end{itemize}

\begin{figure}[h]
    \centering
    \includegraphics[width=0.5\linewidth]{assets/return_in_every_visit_mc.PNG}
    \caption{Return in every visit MC}
    \label{figure:return_in_every_visit_mc}
\end{figure}
\begin{itemize}
    \item  If S1 is visited at t = 2,5,10,18,....\par
    ${G(s1,E_i,1)}$ = R3 + R4 + ...\par
    ${G(s1,E_i,2)}$ = R6 + R7 + ...\par
\end{itemize}
 

\begin{figure}[h]
    \centering
    \includegraphics[width=0.5\linewidth]{assets/example.PNG}
    \caption{Example}
    \label{figure:example}
\end{figure}

\begin{align*}
Let:
    {S_0 = s_2},\quad 
    {S_1 = s_3},\quad 
    {S_2 = s_1}
\end{align*}

\begin{equation*}
    S_0A_0R_1S_1A_1R_2S_2A_2R_3S_3A_3R_4S_4A_4R_5\dots
\end{equation*}
\begin{equation*}
   G(s_1,E_i) = R_3 + \gamma R_4+\dots
\end{equation*}
\begin{equation*}
   G(s_2,E_i) = R_1 + \gamma R_2+\dots
\end{equation*}
\begin{equation*}
   G(s_3,E_i) = R_2 + \gamma R_3+\dots
\end{equation*}

\section{Dynamic Programming v/s Monte Carlo}
\subsection{Dynamic Programming (DP)}
\begin{itemize}
    \item We need full knowledge of environment.
    \begin{equation}
        {P_r} \{{R_{t+1}=r, S_{t+1}=s' | S_t=s, A_t=a}\}\quad \forall s\quad s'\in S, \forall a \quad \in A(s) \quad\forall r\}
    \end{equation}
    \item All of the expected rewards and transition probabilities must be computed before DP can be applied.
    \item Its complex and error prone.
\end{itemize}

\subsection{Monte Carlo (MC)}
\begin{itemize}
    \item Here generating samples is easy
    \item MC methods can be better even if complete knowledge of environment dynamics not known
    \item State value estimates for each state are independent
    \item The estimate of one state does not depend upon estimate of any other state
    \item MC methods do not bootstrap
    \item Computational expense of estimating the value of a single state is independent of number of states
    \item One can generate the episodes starting from state of interest,averaging the returns from only these states ignoring others
\end{itemize}

\section {MC estimation of Action Values}

MC is most useful when a model is not available, then it is particularly useful to estimate action values (the values of state–action pairs) rather than state values. With a model state values alone are sufficient to determine a policy; one simply looks ahead one step and chooses whichever action leads to the best combination of reward and next state. Without a model, however, state values alone are not sufficient. One must explicitly estimate the value of each action in order for the values to be useful in suggesting a policy. Thus, one of our primary goals for Monte Carlo methods is to estimate ${q_* }$ \par
It estimates ${q_\pi(s,a)}$ the expected return starting from state s, taking action a, then following policy ${\pi}$


\section{MC Control}

The value function is repeatedly altered to more closely approximate the value function for the current policy and the policy is repeatedly improved with respect to the current value function. In this method, we perform alternating complete steps of policy evaluation and policy improvement, beginning with an arbitrary policy ${\pi}$ and ending with the optimal policy and optimal action-value 


\begin{figure}[h]
    \centering
    \includegraphics[width=0.5\linewidth]{assets/gen_policy_iteration.PNG}
    \caption{Generalized policy iteration}
    \label{figure:gen_policy_iteration}
\end{figure}

function:
\begin{equation}
  {\pi_0 \xrightarrow{} \pi_1 \xrightarrow{}q_{\pi_2}\xrightarrow{}\dots \pi_* \xrightarrow{} q_*}
\end{equation}

Policy improvement is done by making the policy greedy with respect to the current value function. In this case we have an action-value function, and therefore no model is needed to construct the greedy policy. For any action- value function q, the corresponding greedy policy is the one that, ${\forall s \in S}$, deterministic-ally chooses an action with maximal action-value:

\begin{equation}
    \pi(s) = \arg \max_a q_{\pi_k}(s, a)
\end{equation}

\subsection {There are 2 possibilities here}
\begin{itemize}
    \item {On Policy}: Estimate and improve the same policy which is being used for the exploration/data generation.
    \item {Off Policy}: One policy is used for exploration called behaviour policy.The policy being learned is called target policy.
\end{itemize}










In the next lecture we will go with {On-Policy} and {Off-Policy} MC controls and important sampling


The recommended textbooks for the course are Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction \cite{sutton2018reinforcement} and Masashi  Sugiyama - Statistical reinforcement learning: modern machine learning approaches \cite{sugiyama2015statistical}. Cite as demonstrated in this document if you take any content verbatim, optionally using the \texttt{main.bib} file for new sources.


% \begin{figure}[h]
%    \centering
%    \includegraphics[width=0.5\linewidth]{assets/gridworld.png}
%    \caption{A sample figure depicting a gridworld}
%    \label{figure:gridworld}
% \end{figure}


%Keep figures (\texttt{.eps},\texttt{.jpg/png/etc},\texttt{.pdf}) and source codes inside the \texttt{assets/} folder. Please look at tthe source of Figure \ref{figure:gridworld} for reference.


\bibliographystyle{unsrt}
\bibliography{main}

\end{document}
TML Scribe: L08

Contactez-nous

Message received