Evaluation of “Human-level control through deep reinforcement learning”

I. Introduction

Deep learning can now use reinforcement learning (RL) to estimate agent actions directly from high-dimensional sensory input. In “Human-level control through deep reinforcement learning,” Volodymyr Mnih and fellow researchers describe a novel combination of deep neural networks with end-to-end reinforcement learning that, unlike previous Deep RL systems, allows the two to interact at scale [1]. With minimal prior knowledge, only raw pixels and game scores, the resulting convolutional neural network tackles a range of tasks that are challenging for humans, tested on 49 Atari 2600 games. The deep Q-network (DQN) agent draws on advances in training deep neural networks to outperform preceding agents and to reach a level comparable to that of a professional human games tester, as measured by game performance. The authors examine how the DQN agent learns new policies from high-dimensional sensory inputs and agent-environment interactions. The agent outputs a value function that estimates future rewards. Moreover, the researchers discuss how the DQN agent overcomes faults of its predecessors and how it builds on current developments in deep neural network training to achieve its success. This Deep RL method is a step toward the development of a general artificial intelligence through its ability to adapt its behavior to a specified environment without human mediation.

I chose to evaluate this paper because apart from computer science and its implications in machine learning, I find the study of psychology and neuroscience—especially in a computational context—particularly fascinating. Mimicking biological neural networks involved in reinforcement learning, the artificial neural networks in the deep convolutional network architecture assess the current and previous environments to maximize future reward. I find this multidisciplinary approach incredibly intriguing as I set out with the intention to better discover how artificial intelligence and reinforcement learning interact, and more specifically, how RL supports a general-purpose framework for AI.

II. Background

While ‘Machine Learning’ and ‘Artificial Intelligence’ are often used interchangeably, AI is the creation of intelligent machines while ML is a particular approach to achieving AI. We currently live in a world of applied AI, whether that’s Facebook’s automated friend-tagging in photos or virtual personal assistants like Siri [2]. But even though most of today’s AI is applied, the goal of the future remains artificial general intelligence (AGI). AGI refers to the ability of machines to “consciously” think and perform any task. In other words, AGI aims to produce general-purpose systems with intelligence at or beyond that of humans [3]. A successful AGI is one that can carry out tasks that are challenging for humans. General artificial intelligence takes the innovations of applied AI one step forward and allows a system to learn and progress through every iteration [4]. AGI has the potential to influence society in a way that no other field has before, from automating time-consuming jobs in industries like finance to managing emerging self-driving cars to improving treatments for medical patients [5].

In 2013, Mnih et al. first presented the use of Deep RL to tackle high-dimensional sensory input in the context of Atari games [6]. Two years later, Mnih et al. extended that work by creating the DQN artificial agent, which takes high-dimensional sensory inputs and interacts with the environment to learn new policies. The algorithm in this paper draws upon deep neural network innovations to create a successful artificial agent, in other words, an end-to-end reinforcement learning agent capable of completing a myriad of tasks that are challenging for humans. Previous reinforcement learning agents had succeeded in domains with handcrafted features or in fully observed, low-dimensional state spaces. Examples include TD-Gammon, which learned to weigh risk against safety, and batch reinforcement learning methods in autonomous robots [7][8]. Nevertheless, these agents were of limited use beyond low-dimensional states and often required at least some level of human intervention. Because of this limitation, preceding research never bridged the divide toward a true AGI that combines high-dimensional sensory inputs and actions without human-programmed features.

The DQN agent takes a step forward in the creation of this general-purpose RL. Keeping the algorithm, network architecture, and hyper-parameters constant, the DQN agent outperforms preceding agents on most of the 49 Atari 2600 games and performs at a level comparable to that of a professional human games tester. Moreover, unlike the other agents, the DQN does not rely on any human intervention to adapt its behavior. This represents a major advancement in the development of a general artificial intelligence. With the novel architecture of convolutional neural networks, hand-engineered features are replaced by end-to-end reinforcement learning.

Certain challenges, apparent in previous RL agents as well as the DQN agent, still hamper the improvement of AI. While the DQN discovers relatively long-term strategies in some games, games that demand more temporally extended planning remain a major challenge. Mnih et al. mention Montezuma’s Revenge as one such game. Nevertheless, the DQN agent functions as a touchstone for future RL research. As for AGI, recent debates still question the feasibility of its creation. In 2015, the same year as Mnih et al.’s research publication, Canadian computer scientist Richard Sutton estimated a 10% chance that AGI would never be achieved and a 25% chance that it would arrive by 2030 [9]. Even with advancements past the DQN agent, then, the achievement of a true AGI remains an uncertain prospect.

III. Overview

There are three primary forms of machine learning: supervised, unsupervised, and reinforcement learning. Task-driven supervised ML provides immediate feedback (labels provided for every input) and uses regression or classification for analysis, whereas data-driven unsupervised ML provides no feedback (no labels provided) and is focused on cluster analysis. This paper focuses on the last of the three forms of ML, reinforcement learning, wherein an agent learns to act upon an environment to maximize its rewards using sparse and time-delayed labels. Essentially, an action determines the state of the environment, which determines the reward. In this case, the DQN agent is tested on its ability to improve performance and outperform previous agents and professional human players, based on an action-value function and deep Q-learning with experience replay.

Reinforcement learning provides the answer to an ambiguous but significant question: how can we learn to survive in our environment? RL provides scalar feedback termed rewards. An agent senses and reacts to its environment, and its actions determine its reward. Q-learning obviates the need for a model of the environment by simply learning from experience. Simplifying assumptions (such as that the next state of the world depends solely on the current state and action) are used to frame the problem as a Markov Decision Process (MDP). Further, although the real world is stochastic, deterministic dynamics and reward functions are assumed to hold.

A.   Reinforcement Learning and Markov Decision Process

Reinforcement learning, the focus of the specified research paper, mimics the real biological process of learning through interaction and is formalized by the Markov Decision Process (MDP) [10]. For example, in a game of chess a player attempts to make a correct move for each state in the game, with the reward at the end of play given as a means of feedback. Based on this sequence of plays, the player continually learns how to attain the most reward. RL thus provides a framework for how autonomous software agents map states to actions in order to maximize a cumulative reward. The DQN agent in this case takes actions to learn about an environment, with the end goal of finding a function (policy) that determines the action to take in a specified state so as to maximize return. Essentially, the system progressively improves its performance to maximize the total potential reward based on the resulting state and the cost/reward output of each action. The authors use the iterative t-SNE algorithm, which allows for visualizing high-dimensional data, to inspect the representations the DQN learns. Unlike its predecessors, which could only use low-dimensional sensory inputs to adjust their performance, the DQN follows the MDP framework directly from high-dimensional input.

[Figure 1: image not reproduced]

B.   Discounted Future Reward

The Markov Decision Process entails a sequence of state, action, and return as follows [11]:

$$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$

Using this sequence as a basis and accounting for a stochastic environment, the total discounted future reward from time-step t can be represented by the equation below, where T represents the game’s ending time-step and γ represents the factor by which rewards are discounted.

$$R_t = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'}$$
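As a minimal sketch of the discounted future reward described above, the snippet below sums a short list of rewards with a discount factor. The helper name `discounted_return` and the reward values are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of rewards from the current time-step to the end of the episode."""
    total = 0.0
    # Work backwards so that each step computes R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards for three remaining time-steps.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

With γ below 1, the reward of 2 that arrives two steps later contributes only 0.5, which is how discounting favors immediate reward.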

This equation, however, values immediate reward over future reward. A successful DQN agent should optimize its actions by maximizing the cumulative discounted future reward. An action-value function 𝑄(𝑠, 𝑎) represents the maximum expected discounted reward after the agent takes action a in state s. For a specified action and state, the Q function represents the “quality” of that action. It should not be confused with the Q-function of statistics, which is by definition the tail probability of the normal distribution [12].

C.  Deep Q-Learning & Q-Network

$$Q^*(s, a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]$$

The equation above represents the Bellman Equation, which RL algorithms use as an iterative update to approximate the action-value function. Essentially, the maximum future reward for a specified state and action is the immediate reward plus the discounted maximum future reward for the next state. The Q-function is updated iteratively by first initializing Q(s,a) arbitrarily for every state and action and then observing the initial state s. Until termination, the program selects an action a, observes the resulting reward r and new state s’, and substitutes these into the Q*(s,a) equation defined previously. s is then updated to the new state s’ before the next iteration. Because the values are estimated separately for each state-action pair, with no ability to generalize to unobserved states, value iteration is impractical. Hence this issue is tackled with a function approximator, 𝑄(𝑠, 𝑎; 𝜽) ≈ 𝑄*(𝑠, 𝑎), a function with parameters 𝜽 that estimates the Q-function.
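The iterative update described above can be sketched with a small table of Q-values. This is a minimal illustration under stated assumptions: the four-state chain environment, the function name `q_learning`, and all hyper-parameter values are hypothetical and chosen only to keep the example short. The environment is deterministic, in line with the simplifying assumption mentioned earlier.

```python
import random


def q_learning(n_states=4, episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain: action 1 moves right, action 0 moves left."""
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # Q(s, a) initialized to zero
    for _ in range(episodes):
        s = 0  # observe the initial state
        while s != goal:
            # Epsilon-greedy selection with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                best = max(q[s])
                a = rng.choice([i for i in (0, 1) if q[s][i] == best])
            # Deterministic toy dynamics: reward 1 only on reaching the goal.
            s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == goal else 0.0
            # Bellman update: immediate reward plus discounted best next value.
            future = 0.0 if s_next == goal else max(q[s_next])
            q[s][a] += alpha * (r + gamma * future - q[s][a])
            s = s_next  # move to the new state before the next iteration
    return q


q = q_learning()
for state, values in enumerate(q[:-1]):
    print(state, [round(v, 3) for v in values])
```

In this toy setting the value of moving right approaches 1, 0.9, and 0.81 as the start state gets farther from the goal, which mirrors the discounting in the Bellman equation. The table has one entry per state-action pair, which is exactly what becomes impractical for raw pixel inputs.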

Deep RL is prone to several stability issues, since Q-learning can diverge when combined with neural networks. Sequential data leads to correlation between samples, while the policy may fluctuate rapidly with minor changes to the Q-values, leading to large shifts in the data distribution. Another issue may arise from the unknown scale of rewards and Q-values, which makes Q-learning gradients unstable. A neural network used as a nonlinear function approximator for the Q-function is known as a Q-network, and the techniques below address these causes of instability. Figure 2 depicts a Q-network where there may be certain unknown functions that need approximation.

[Figure 2: a Q-network used as a function approximator (image not reproduced)]

1) Experience Replay:

A Deep Q-Network uses experience replay to break correlations in the data and to learn from all past policies. In essence, collecting data from the DQN agent’s own experiences and sampling from it at random allows for breaking correlations. An ε-greedy policy is used to select an action a. The agent’s experience is stored in replay memory D, from which random experience samples, (s, a, r, s’) ~ U(D), are used to optimize the mean squared error between the Q-network and the Q-learning targets. The loss function is as follows, where r + γ max Q(s’, a’; 𝜽i⁻) represents the target, the maximum is taken over the next actions a’, and 𝜽i⁻ denotes the parameters of a separate target network.

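$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right)^2\right]$$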
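The replay memory D described above can be sketched as a bounded buffer with uniform random sampling. This is a minimal illustration, not the authors' implementation: the class name `ReplayMemory`, the `done` flag, and the capacity are assumptions made for the example.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of transitions (s, a, r, s_next, done), sampled uniformly."""

    def __init__(self, capacity, seed=0):
        # A deque with maxlen silently discards the oldest transition when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal ordering
        # of consecutive transitions, which reduces correlation in a batch.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.push(step, 0, 0.0, step + 1, False)
print(len(memory))       # 3: only the three most recent transitions remain
print(memory.sample(2))  # two transitions drawn at random from the buffer
```

Because a sampled minibatch mixes transitions generated under many earlier versions of the policy, each update averages over past behavior rather than following the most recent trajectory.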

2) Target Q-Network:

Holding the target Q-network fixed with a separate set of parameters, 𝜽i⁻, helps limit oscillations and breaks correlations between the Q-network and its targets. As before, the mean squared error can now be optimized, while the target network parameters 𝜽i⁻ are updated with the Q-network parameters 𝜽i only intermittently.
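The intermittent update of the target parameters can be sketched as a periodic copy. In this minimal illustration a single number stands in for the network weights, and the function name `run_with_target_sync` and the sync interval are hypothetical.

```python
import copy


def run_with_target_sync(num_steps, sync_every):
    """Track online and target parameters, copying online -> target periodically."""
    online = {"w": 0.0}              # stand-in for the Q-network parameters
    target = copy.deepcopy(online)   # stand-in for the target network parameters
    history = []
    for step in range(1, num_steps + 1):
        online["w"] += 1.0  # placeholder for one gradient step on the Q-network
        if step % sync_every == 0:
            # The target network is refreshed only every `sync_every` steps,
            # so the Q-learning targets stay fixed in between.
            target = copy.deepcopy(online)
        history.append((step, online["w"], target["w"]))
    return history


for row in run_with_target_sync(num_steps=5, sync_every=2):
    print(row)
```

Between syncs the target value lags behind the online value, which is what keeps the regression target from shifting on every update.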

3) Adjusting the Reward Value Range:

Robust gradients are achieved by clipping rewards to a reasonable range between -1 and +1, so as to restrain the scale of the error derivatives and to allow the same learning rate to be used across multiple games. As a consequence, the agent cannot distinguish between rewards of different magnitudes, so performance may be affected by the reward clipping. In short, if both the unknown function and the neural network are given the same inputs, then the objective of the function approximator is to adjust the network parameters 𝜽i at the ith iteration so that the network’s output resembles that of the unknown function. Each Q-learning update therefore minimizes a loss function Li(𝜽i) that changes with each iteration. Differentiating the loss function with respect to the weights gives the following gradient, which is used to perform stochastic gradient descent:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$$
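Reward clipping and a single stochastic gradient step can be sketched as follows. To stay self-contained, the sketch assumes a linear Q-function with one weight vector per action rather than a deep network. The names `clip_reward`, `q_values`, and `sgd_step`, the feature vectors, and the hyper-parameters are all hypothetical. The update follows the gradient of the squared error up to a constant factor, with the target held fixed.

```python
def clip_reward(r):
    """Clip a raw game reward into the range [-1, 1]."""
    return max(-1.0, min(1.0, r))


def q_values(theta, features):
    """Linear Q-function: one dot product per action (a stand-in for a deep network)."""
    return [sum(w * x for w, x in zip(weights, features)) for weights in theta]


def sgd_step(theta, theta_target, s, a, r, s_next, done, gamma=0.99, lr=0.1):
    """One stochastic gradient step on the squared temporal-difference error."""
    r = clip_reward(r)
    # The target uses the frozen target parameters and is treated as a constant.
    target = r if done else r + gamma * max(q_values(theta_target, s_next))
    td_error = target - q_values(theta, s)[a]
    # For a linear Q, the gradient of Q(s, a) with respect to theta[a] is s itself.
    theta[a] = [w + lr * td_error * x for w, x in zip(theta[a], s)]
    return td_error


theta = [[0.0, 0.0], [0.0, 0.0]]         # two actions, two features
theta_target = [[0.0, 0.0], [0.0, 0.0]]
error = sgd_step(theta, theta_target, s=[1.0, 0.0], a=1, r=5.0,
                 s_next=[0.0, 1.0], done=False)
print(error)  # 1.0: the raw reward of 5 was clipped to 1 before the update
print(theta)  # [[0.0, 0.0], [0.1, 0.0]]
```

The raw reward of 5 and a raw reward of 500 would produce the same update here, which is the loss of magnitude information noted above.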

4) Deep Q-learning Algorithm:

Mnih et al. summarize these steps into the following algorithm which the DQN agent uses to select actions that will maximize cumulative future reward [1].

[Algorithm 1: deep Q-learning with experience replay (image not reproduced)]
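The overall loop (ε-greedy action selection, storage in replay memory, minibatch updates against a fixed target, and periodic target refresh) can be sketched in a few dozen lines. This is a hedged illustration rather than the authors' algorithm: it assumes a toy five-state chain environment and a lookup table in place of the convolutional network, and the function name `dqn_sketch` and every hyper-parameter value are hypothetical.

```python
import random
from collections import deque


def dqn_sketch(n_states=5, episodes=300, gamma=0.9, lr=0.1, epsilon=0.2,
               batch_size=8, capacity=500, sync_every=20, max_steps=100, seed=0):
    rng = random.Random(seed)
    goal = n_states - 1
    theta = [[0.0, 0.0] for _ in range(n_states)]  # online Q parameters (a table here)
    theta_target = [row[:] for row in theta]       # frozen copy used for targets
    memory = deque(maxlen=capacity)                # replay memory D
    steps = 0
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                best = max(theta[s])
                a = rng.choice([i for i in (0, 1) if theta[s][i] == best])
            # Toy environment: action 1 moves right, action 0 moves left.
            s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            done = s_next == goal
            r = 1.0 if done else 0.0
            memory.append((s, a, r, s_next, done))
            # Sample a random minibatch and step toward the fixed targets.
            if len(memory) >= batch_size:
                for bs, ba, br, bs2, bdone in rng.sample(list(memory), batch_size):
                    y = br if bdone else br + gamma * max(theta_target[bs2])
                    theta[bs][ba] += lr * (y - theta[bs][ba])
            steps += 1
            # Refresh the target parameters only every `sync_every` steps.
            if steps % sync_every == 0:
                theta_target = [row[:] for row in theta]
            s = s_next
            if done:
                break
    return theta


theta = dqn_sketch()
greedy = [max((0, 1), key=lambda a: theta[s][a]) for s in range(4)]
print(greedy)  # greedy action per non-goal state (1 means "move right")
```

In the paper the table is replaced by a convolutional network over stacked frames and the toy chain by the Atari emulator, but the control flow has the same shape.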

IV. Impact

Mnih et al.’s paper has over 1,500 citations, and many papers have successfully extended its research. This work estimates a Q-value, the expected discounted reward if an agent takes action a from state s and follows the policy distribution, through a neural network that generalizes the Q-values from a few select states to all states. The DQN algorithm, drawing upon experience replay, reduces correlation between samples by sampling previous experiences to optimize a loss function for gradient descent. Two researchers, Seungyul Han and Youngchul Sung, extend Mnih et al.’s work by minimizing large biases and increasing learning speed through multi-batch experience replay. This scheme improves “off-policy actor-critic-style policy” gradient methods, ultimately enhancing performance to a degree higher than possible through the DQN experience replay methods [13]. Mnih et al.’s DQN method is also the basis of the RL model in Salimans and Kingma’s research, which aims to normalize the weight vectors in a neural network to speed the convergence of stochastic gradient descent optimization [14]. In fact, experimental results show that weight normalization does lead to faster and better results with the nearly unchanged DQN algorithm.

Researchers like Heriberto Cuayahuitl specifically use the DQN method as a baseline in their work to further Deep Reinforcement Learning. Cuayahuitl explores the performance of Deep RL for conversational robots by creating an extension to the DQN that narrows an agent’s actions to only those that are ‘safe’, or in other words, actions that do not lead to strongly negative outcomes like losing a game. The proposed method outperforms the baseline DQN, thereby advancing Mnih et al.’s work toward skill-learning, interactive robots on a much larger scale. Cuayahuitl poses an interesting question at the conclusion of his research, remarking, “How much and when should a robot win or lose is still an open question to address in the future” [15]. If a robot constantly won, then there would be no demanding interaction. Problems in scalability, due to the number of inputs/outputs required, still prevent the full development of conversational and humanlike robots. However, since Mnih et al. broke through the barrier to allow for high-dimensional sensory inputs, and other researchers have already expanded Deep RL in terms of algorithmic speed, robot interaction quality, and overall performance, we may eventually reach the end goal of AGI.

I found “Dermatologist-level classification of skin cancer with deep neural networks” a particularly interesting extension of Mnih et al.’s work because of its direct impact on medicine and science. Similar to how the DQN allowed agents to rival humans in the Atari games, Esteva et al. created a convolutional neural network with performance commensurate with that of dermatologists in three diagnostic tasks. This paper, coupled with Mnih et al.’s, allowed me to understand the very real implications of machine learning and reinforcement learning in everyday life, beyond just robots playing games. The fact that machines may have the capability to diagnose skin cancer through image-based classification of skin lesions is incredibly fascinating and a testament to the importance of ML as an interdisciplinary form of data analysis [16]. Similarly, advances in computational neuroscience, specifically goal-driven hierarchical convolutional neural networks, have deepened the understanding and modeling of sensory cortical processing in previously obscure parts of the sensory systems [17]. Results from Mnih et al.’s work that demonstrate the potential of deep neural network approaches in RL, combined with these computational neuroscience results, attest to the possibilities for medical discovery and for the facilitation and improvement of processes currently run by humans.

The ideas presented by Mnih et al. in this paper have provided a foundation for computational pursuits as well as research in other fields, like neuroscience, as previously discussed. Using the DQN to learn new policies from high-dimensional sensory inputs, the researchers push forth advances in deep neural networks for learning challenging tasks via end-to-end reinforcement learning. RL ideally enables robots to maximize performance with little human intervention. Although the DQN agent limits this human aspect, higher-complexity tasks, with their intricate samples, continue to restrict Deep RL development. Still, the algorithm in this paper has made room for significant progress in the field. Google’s Deep RL research this year has shown that an agent can master a collection of 3D manipulation skills, such as the difficult task of opening doors, without human-designed features [18]. This example testifies to the importance of Mnih et al.’s research as the next step in machine learning and its continued function as a basis for further developments.

V. Critical Evaluation

Overall, Mnih et al. addressed several existing issues in deep RL and built a framework for future research, as discussed in the previous section. However, the results in this paper still have certain limitations. Outcomes collected from RL require carefully developed reward functions as well as significant time for agent interaction. In addition, although Q-learning itself does not need a model of the environment, training depends on an emulator that can supply a very large number of interactions, which on top of the other restrictions poses a challenge for engineers. Adhering to procedural tasks defined by game rules also leads to diminished performance in complex navigation tasks with scattered rewards.

A DQN’s finite Q-network outputs mean that the agent can only handle a finite set of actions. In other words, the DQN can only be used directly in tasks with discrete action spaces and has trouble with problems that require continuous action spaces. Policy parameterization and optimization with an objective function help with these sorts of control tasks. Still, approximating continuous functions using discrete quantities does not perform well due to the large amount of exploration required to do so.
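A short sketch makes the discrete-action limitation concrete: a Q-network has one output per action, so the greedy action is simply the index of the largest output, and discretizing a continuous control problem multiplies the number of outputs needed. The function names and the seven-dimension, ten-bin example are hypothetical illustrations, not figures from the paper.

```python
def greedy_action(q_outputs):
    """Pick the index of the largest Q-value; one network output per discrete action."""
    return max(range(len(q_outputs)), key=lambda a: q_outputs[a])


def discretized_action_count(bins_per_dimension, dimensions):
    """Number of joint actions when each continuous dimension is split into bins."""
    return bins_per_dimension ** dimensions


print(greedy_action([0.1, 0.7, 0.3]))   # 1
print(discretized_action_count(10, 7))  # 10000000
```

Even a coarse ten-bin grid over seven continuous dimensions would require ten million outputs, each of which must be explored before its value is known.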

In terms of methodology, the paper concludes that the DQN agent performs at a level comparable to that of a human games tester. However, the human player most likely did not play nearly as many games to improve personal performance as the DQN did, since the DQN faces a much weaker time constraint than a human does. In other words, the number of games the DQN plays is beyond human capacity. Calling the paper “Human-level control” may therefore be misleading, and the paper does not clarify this point. Also, most of this research falls more under the umbrella of engineering than science, although many of the concepts are taken from neuroscience and psychology. This is an issue because much of science, especially neuroscience, remains uncharted.

Additionally, it is difficult to compare artificial neural networks to biological ones since there is still much uncertainty in the field of cognitive science. For instance, biological networks factor in the actions of in vivo proteins whose exact functionality remains obscure. Cell-based assays are therefore preferred over biochemical assays because specific molecular targets are not always known and host factors could be critical. Furthermore, because deep neural networks trained on highly correlated samples with shifting objective functions are complex and unstable, applying RL to very large, high-dimensional state spaces, such as those found in raw images, remains difficult.

The researchers also do not fully explain how they arrived at their results in terms of how many varied models they tested across architectures and learning procedures. How different were the tested models, and how big was the sample size in the final approach? What metrics did the researchers use to select one model over another? These questions are not answered in the paper. Rather, the audience is given the final result with the final model/architecture. Perhaps inclusion of these variations would help future researchers avoid the same mistakes or use previous tests to improve the current model further. Likewise, the paper does not fully address the amount of time needed to analyze unknown engineering territory prior to capitalizing on new findings.

Moreover, the paper points out that the DQN is an improvement on neural fitted Q-iteration, with changes in procedure such as random sampling of experiences. In some sense, the DQN serves as an enhancement to previous stochastic Q-learning, with the addition of a deep network as a function approximator. In my opinion, then, the agent doesn’t necessarily advance science but rather engineering. Perhaps the two are not so separate, however, since engineering successes facilitate scientific learning. Furthermore, even if these ideas were not so novel when the paper was published, DeepMind still assembled existing information and developed an agent that improved performance like no other agent had before. In conclusion, the paper does serve the needs of advancing research in ML, but perhaps it should be presented less as a scientific breakthrough and more as a computational upgrade.

VI. Conclusions

Reinforcement learning serves as a general-purpose framework for A.I. Using end-to-end RL to mold representations in the convolutional neural network for action-value approximation, an agent like the deep Q-Network agent can tackle an array of tasks challenging for humans and for the first time learn policies from high-dimensional sensory inputs and agent-environment interactions. This paper examines the methods through which a single architecture successfully learns control policies in various environments with limited prior knowledge and at a performance level at or above that of a professional human gamer. The DQN agent takes in only pixels and game score as input, keeping the algorithm, architecture, and hyper-parameters constant in each environment. The algorithm relied on experience replay, and its combination of RL with deep network architectures, to combat instability in the deep value-based RL by breaking correlations in data and learning from past policies.

The experimental results show that the DQN agent exceeded the performance of previous RL methods on 43 out of 49 Atari 2600 games. This success was achieved without the additional prior knowledge that existing methods incorporated. The DQN method also performed at a level comparable to that of a human games tester and earned over 75% of the human score on 29 games. In certain games, like Breakout, the agent could find a long-term strategy, even though games like Montezuma’s Revenge still prove difficult for all agents, including DQN, because they demand a more temporally extended planning strategy.
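The comparison against the human score can be made concrete with the normalization used in the paper, where an agent's score is expressed as a percentage of the gap between random play (0%) and the human tester (100%). The helper name and the scores below are hypothetical values chosen only to illustrate the arithmetic.

```python
def human_normalized_score(agent_score, human_score, random_score):
    """Agent score as a percentage, with random play at 0 and the human tester at 100."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)


# Hypothetical scores for a single game.
print(human_normalized_score(agent_score=400.0, human_score=500.0, random_score=100.0))  # 75.0
```

Under this measure, a game counts toward the 75% threshold when the agent closes at least three quarters of the gap between random play and the human tester.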

Mnih et al.’s work in generalizing deep reinforcement learning through the development of the DQN has without a doubt impacted the future of machine learning and deep learning. Looking into the future, the research here can be extended further into computational biology by analyzing outcomes after biasing experience replay content towards notable events. Also, examining solutions to current limitations of the DQN, such as performance in higher-complexity tasks, could allow for RL in instances where the number of inputs and outputs does not restrain performance. As such advances in machine learning progress, we can take a step closer to achieving general artificial intelligence.


  1. Mnih et al. “Human-Level Control through Deep Reinforcement Learning.” Nature, Macmillan Publishers Limited, 26 Feb. 2015, doi:10.1038/nature14236.
  2. Gershgorn, Dave. “The Quartz Guide to Artificial Intelligence.” Quartz, Quartz Media, 10 Sept. 2017, qz.com/1046350/the-quartz-guide-to-artificial-intelligence-what-is-it-why-is-it-important-and-should-we-be-afraid/.
  3. Jee, Charlotte. “What Is Artificial General Intelligence? And Has Kimera Systems Made a Breakthrough?” Techworld, IDG UK, 26 Aug. 2016, http://www.techworld.com/data/what-is-artificial-general-intelligence-3645268/.
  4. Adams, R.L. “10 Powerful Examples Of Artificial Intelligence In Use Today.” Forbes, Forbes Magazine, 7 Feb. 2017, http://www.forbes.com/sites/robertadams/2017/01/10/10-powerful-examples-of-artificial-intelligence-in-use-today/1.
  5. Mesko, Bertalan. “Artificial Intelligence Will Redesign Healthcare.” The Medical Futurist, Webicina Kft, 17 July 2017, medicalfuturist.com/artificial-intelligence-will-redesign-healthcare/.
  6. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
  7. Tesauro, G. “Temporal difference learning and TD-Gammon.” Commun. ACM 38, 58–68 (1995)
  8. Riedmiller, M., Gabel, T., Hafner, R. & Lange, S. “Reinforcement learning for robot soccer.” Auton. Robots 27, 55–73 (2009)
  9. Khatchadourian, Raffi. “The Doomsday Invention.” The New Yorker, Condé Nast, 23 Nov. 2015, http://www.newyorker.com/magazine/2015/11/23/doomsday-invention-artificial-intelligence-nick-bostrom.
  10. Luk, Michael. “Patient Scheduling Using Markov Decision Process.” SFL Scientific, SFL Scientific, 20 Jan. 2016, sflscientific.com/data-science-blog/2016/5/10/multi-category-patient-scheduling-in-a-diagnostic-facility-using-markov-decision-process.
  11. Fragkiadaki, Katerina. “Markov Decision Processes.” Deep Reinforcement Learning and Control. Lecture 2, 5 Mar. 2017, Pittsburgh, Carnegie Mellon University.
  12. Weisstein, Eric W. “Normal Distribution Function.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/NormalDistributionFunction.html.
  13. Han, Seungyul, and Youngchul Sung. Multi-Batch Experience Replay for Fast Convergence of Continuous Action Control. Cornell University Library, 2017, arxiv.org/abs/1710.04423.
  14. Salimans, Tim, and Diederik P. Kingma. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. Advances in Neural Information Processing Systems 29, NIPS, 2016, papers.nips.cc/book/advances-in-neural-information-processing-systems-29-2016.
  15. Cuayahuitl, Heriberto. Deep reinforcement learning for conversational robots playing games. In: IEEE RAS International Conference on Humanoid Robots, 15 – 17 November 2017, REP Theatre, Birmingham, UK.
  16. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., & Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542 (7639) (2017), pp. 115-118.
  17. Yamins, D. L., & DiCarlo, J. J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nat Neurosci, 19(3), 356-365. doi: 10.1038/nn.4244
  18. S. Gu, E. Holly, T. Lillicrap, and S. Levine. “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates.” International Conference on Robotics and Automation (ICRA), 2017.


NATURE ARTICLE: Human-level control through deep reinforcement learning


