Date

When did this algorithm get invented? the day of the of the pear 19:46, 7 May 2007 (UTC)

First published in 1994; I added the info. 220.253.135.178 16:50, 21 May 2007 (UTC)

Hey, thanks a lot for contributing to Wikipedia! XApple 23:05, 27 May 2007 (UTC)

Updates

For updates, SARSA uses the next action actually chosen, not the best next action, so that the update reflects the value of the last state/action pair under the current policy. Using the best next action instead gives Watkins's Q-learning, which SARSA was an attempt to provide an alternative to. Updating with the value of the best next action (Watkins's Q-learning) can over-estimate values, because the control method will not pick that action every time (it has to balance exploration and exploitation). A comparison between Q-learning and SARSA, perhaps the cliff-walking example (Cliff World) from Sutton and Barto's 'Reinforcement Learning: An Introduction' (1998), may be useful to clarify the differences and the resulting behaviour. --131.217.6.6 08:17, 29 May 2007 (UTC)
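To make the difference concrete, here is a minimal tabular sketch of the two update rules. The function names, the list-of-lists Q-table, and the default values of `alpha` and `gamma` are assumptions chosen for illustration; they are not taken from the article or from any particular library.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy update: bootstrap from the action actually chosen in s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]


def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy update: bootstrap from the best action available in s_next."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]


def make_q_table():
    """Hypothetical 2-state, 2-action table; state 1 has a clearly better action."""
    return [[0.0, 0.0], [1.0, 3.0]]


if __name__ == "__main__":
    # Same transition for both rules: from state 0, action 0, reward 1.0, to state 1.
    # The exploring policy happens to pick the worse action (index 0) in state 1.
    q_sarsa = make_q_table()
    q_watkins = make_q_table()
    print("SARSA:     ", sarsa_update(q_sarsa, 0, 0, 1.0, 1, a_next=0))
    print("Q-learning:", q_learning_update(q_watkins, 0, 0, 1.0, 1))
```

On this single transition the Q-learning estimate comes out higher than the SARSA one, because it assumes the best action will be taken in the next state even though the exploring policy picked a different one.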

This is the algorithm presented in Q-Learning:

$Q(s_{t},a_{t})\leftarrow Q(s_{t},a_{t})+\alpha _{t}(s_{t},a_{t})[r_{t}+\gamma \max _{a}Q(s_{t+1},a)-Q(s_{t},a_{t})]$

SARSA:

$Q(s_{t},a_{t})\leftarrow Q(s_{t},a_{t})+\alpha [r_{t+1}+\gamma Q(s_{t+1},a_{t+1})-Q(s_{t},a_{t})]$

Does this use "backpropagation"? That is, does it update the previous Q entry with the future reward? Dspattison (talk) 19:20, 19 March 2008 (UTC)

Correct Algorithm?

Is the algorithm given correct? Should it not be R(t) rather than R(t+1)? I've looked at ^[1] and that seems to support what Thrun & Norvig teach in their Stanford ai-class. wheeliebin (talk) 04:58, 12 November 2011 (UTC)

Note also that section 1, where the page states "Taking every letter in the quintuple", lists "R(t+1)". Shouldn't this be "R(t)" as well? – RDK

@wheeliebin, thanks for noting. I looked it up in Russell and Norvig's Artificial Intelligence: A Modern Approach, and there it also says R(s), where s is the older state. Changing it now. –Bomberzocker (talk) 11:55, 1 February 2018 (UTC)

I reverted the edit. There seems to be something wrong with the formula Norvig uses, or something is defined differently. This needs some more clarification. --Bomberzocker (talk) 19:36, 6 February 2018 (UTC)

There are different definitions in use. Sutton^[2] uses "R(t+1)" for the immediate reward received after choosing action "a(t)" in state "s(t)", while Norvig uses "R(t)". It makes no real difference, but mentioning the different conventions might be a good idea.
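As a sketch of why the indexing is only a matter of notation: assume both symbols denote the same quantity, namely the reward received after taking action $a_t$ in state $s_t$. The symbol $\tilde{r}_t$ below is introduced here purely for illustration and stands for that reward under the "R(t)" convention.

```latex
\begin{align*}
\text{``R(t+1)'' convention:}\quad
  Q(s_t,a_t) &\leftarrow Q(s_t,a_t)
    + \alpha\,\bigl[\,r_{t+1} + \gamma\,Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\,\bigr] \\
\text{``R(t)'' convention:}\quad
  Q(s_t,a_t) &\leftarrow Q(s_t,a_t)
    + \alpha\,\bigl[\,\tilde{r}_{t} + \gamma\,Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\,\bigr],
  \qquad \tilde{r}_t := r_{t+1}
\end{align*}
```

Under that assumption the two updates are term-by-term identical. If instead "R(t)" is read as a reward attached to the state $s_t$ itself rather than to the transition, the two formulas describe slightly different reward models, which may be the source of the confusion above.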

[1] http://scholar.google.co.uk/scholar_url?hl=en&q=http://citeseerx.ist.psu.edu/viewdoc/download%3Fdoi%3D10.1.1.17.2539%26rep%3Drep1%26type%3Dpdf&sa=X&scisig=AAGBfm2S_I1eUo9AsoieJcfOCrVk-kySEw&oi=scholarr
[2] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
