1. WO2018153807 - ACTION SELECTION FOR REINFORCEMENT LEARNING USING NEURAL NETWORKS

CLAIMS

1. A system for selecting actions to be performed by an agent that interacts with an environment by performing actions from a predetermined set of actions, the system comprising: a manager neural network subsystem that is configured to, at each of a plurality of time steps:

receive an intermediate representation of a current state of the environment at the time step,

map the intermediate representation to a latent representation of the current state in a latent state space,

process the latent representation using a goal recurrent neural network, wherein the goal recurrent neural network is configured to receive the latent representation and to process the latent representation in accordance with a current hidden state of the goal recurrent neural network to generate an initial goal vector in a goal space for the time step and to update the hidden state of the goal recurrent neural network, and

pool the initial goal vector for the time step and initial goal vectors for one or more preceding time steps to generate a final goal vector for the time step;

a worker neural network subsystem that is configured to, at each of the plurality of time steps:

receive the intermediate representation of the current state of the environment at the time step,

map the intermediate representation to a respective action embedding vector in an embedding space for each action in the predetermined set of actions,

project the final goal vector for the time step from the goal space to the embedding space to generate a goal embedding vector, and

modulate the respective action embedding vector for each action by the goal embedding vector to generate a respective action score for each action in the predetermined set of actions; and

an action selection subsystem, wherein the action selection subsystem is configured to, at each of the plurality of time steps:

receive an observation characterizing the current state of the environment at the time step,

generate the intermediate representation from the observation,

provide the intermediate representation as input to the manager neural network subsystem to generate the final goal vector for the time step,

provide the intermediate representation and the final goal vector as input to the worker neural network subsystem to generate the action scores, and

select an action from the predetermined set of actions to be performed by the agent in response to the observation using the action scores.
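
For illustration only (not part of the claims), the following is a minimal NumPy sketch of the data flow recited in claim 1: the manager subsystem maps the intermediate representation to a latent representation, runs a recurrence to produce an initial goal vector, and pools recent goals into a final goal vector; the worker subsystem produces per-action embedding vectors, projects the final goal into the embedding space, and modulates the embeddings to obtain action scores. All dimensions, parameter initialisations, the tanh recurrence standing in for the goal recurrent neural network, summation as the pooling operation, and the dot-product modulation are assumptions, not the patented implementation.

import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (illustrative only, not taken from the claims).
D_INTERMEDIATE, D_LATENT, D_GOAL, D_EMBED, NUM_ACTIONS, HORIZON = 64, 32, 256, 16, 5, 10

class ManagerSubsystem:
    """Maps the intermediate representation to a final goal vector (claim 1, manager)."""
    def __init__(self):
        self.W_latent = rng.normal(0, 0.1, (D_LATENT, D_INTERMEDIATE))
        # A single tanh recurrence stands in for the goal recurrent neural network.
        self.W_in = rng.normal(0, 0.1, (D_GOAL, D_LATENT))
        self.W_h = rng.normal(0, 0.1, (D_GOAL, D_GOAL))
        self.hidden = np.zeros(D_GOAL)
        self.recent_goals = []  # initial goal vectors from preceding time steps

    def step(self, intermediate):
        latent = np.tanh(self.W_latent @ intermediate)                 # latent representation
        self.hidden = np.tanh(self.W_in @ latent + self.W_h @ self.hidden)
        initial_goal = self.hidden / (np.linalg.norm(self.hidden) + 1e-8)
        self.recent_goals.append(initial_goal)
        # Pool the current and up to HORIZON preceding initial goal vectors (summation assumed).
        return np.sum(self.recent_goals[-HORIZON:], axis=0)

class WorkerSubsystem:
    """Scores each action by modulating its embedding with the goal embedding (claim 1, worker)."""
    def __init__(self):
        self.W_embed = rng.normal(0, 0.1, (NUM_ACTIONS, D_EMBED, D_INTERMEDIATE))
        self.W_project = rng.normal(0, 0.1, (D_EMBED, D_GOAL))         # goal space -> embedding space

    def step(self, intermediate, final_goal):
        action_embeddings = np.tanh(self.W_embed @ intermediate)       # (NUM_ACTIONS, D_EMBED)
        goal_embedding = self.W_project @ final_goal                   # (D_EMBED,)
        # Modulation realised here as a dot product of each action embedding with the goal embedding.
        return action_embeddings @ goal_embedding                      # per-action scores

def select_action(observation, manager, worker, encoder):
    """Action selection subsystem: encode, obtain goal, obtain scores, pick an action (claim 1)."""
    intermediate = encoder(observation)
    final_goal = manager.step(intermediate)
    scores = worker.step(intermediate, final_goal)
    return int(np.argmax(scores))  # highest score, as in claim 2

# Tiny usage example with a random linear "encoder" standing in for the convolutional network of claim 3.
W_enc = rng.normal(0, 0.1, (D_INTERMEDIATE, 84 * 84))
encoder = lambda obs: np.tanh(W_enc @ obs.ravel())
observation = rng.normal(size=(84, 84))
print(select_action(observation, ManagerSubsystem(), WorkerSubsystem(), encoder))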

2. The system of claim 1, wherein selecting the action comprises selecting the action having a highest action score.

3. The system of any one of claims 1 or 2, wherein generating the intermediate representation from the observation comprises processing the observation using a convolutional neural network.
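
Claim 3 recites a convolutional neural network for generating the intermediate representation. The sketch below shows a single convolution-plus-ReLU layer followed by flattening, implemented directly in NumPy; the kernel size, stride, channel count, and activation are illustrative assumptions rather than details from the claims.

import numpy as np

def conv2d_relu(x, kernels, stride=2):
    """Single valid 2-D convolution with ReLU. x: (H, W, C_in), kernels: (kH, kW, C_in, C_out)."""
    H, W, _ = x.shape
    kH, kW, _, C_out = kernels.shape
    out_h, out_w = (H - kH) // stride + 1, (W - kW) // stride + 1
    out = np.zeros((out_h, out_w, C_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kH, j * stride:j * stride + kW, :]
            out[i, j] = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2]))
    return np.maximum(out, 0.0)

# Illustrative use: an 84x84 single-channel observation -> flattened intermediate representation.
rng = np.random.default_rng(0)
obs = rng.normal(size=(84, 84, 1))
features = conv2d_relu(obs, rng.normal(0, 0.1, size=(8, 8, 1, 16)), stride=4)
intermediate = features.ravel()
print(intermediate.shape)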

4. The system of any one of claims 1-3, wherein mapping the intermediate representation to a respective action embedding vector in an embedding space for each action in the predetermined set of actions comprises:

processing the intermediate representation using an action score recurrent neural network, wherein the action score recurrent neural network is configured to receive the intermediate representation and to process the intermediate representation in accordance with a current hidden state of the action score recurrent neural network to generate the action embedding vectors and to update the hidden state of the action score recurrent neural network.
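
Claim 4 recites an action score recurrent neural network that produces the action embedding vectors. Below is a minimal sketch, assuming a plain tanh recurrence whose hidden state is reshaped into one embedding vector per action; the sizes and the reshaping are illustrative choices, not taken from the claims.

import numpy as np

class ActionScoreRNN:
    """Recurrent mapping from the intermediate representation to per-action embedding
    vectors (claim 4); a tanh recurrence stands in for the recurrent network."""
    def __init__(self, d_intermediate=64, num_actions=5, d_embed=16, seed=0):
        rng = np.random.default_rng(seed)
        self.num_actions, self.d_embed = num_actions, d_embed
        d_out = num_actions * d_embed
        self.W_in = rng.normal(0, 0.1, (d_out, d_intermediate))
        self.W_h = rng.normal(0, 0.1, (d_out, d_out))
        self.hidden = np.zeros(d_out)

    def step(self, intermediate):
        # Update the hidden state from the input and the previous hidden state,
        # then read the action embeddings directly off the new hidden state.
        self.hidden = np.tanh(self.W_in @ intermediate + self.W_h @ self.hidden)
        return self.hidden.reshape(self.num_actions, self.d_embed)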

5. The system of any one of claims 1-4, wherein mapping the intermediate representation to a latent representation of the current state comprises processing the intermediate representation using a feedforward neural network.

6. The system of any one of claims 1-5, wherein the goal space has a higher dimensionality than the embedding space.

7. The system of claim 6, wherein the dimensionality of the goal space is at least ten times higher than the dimensionality of the embedding space.

8. The system of any one of claims 1-7, wherein the worker neural network subsystem has been trained to generate action scores that maximize a time discounted combination of rewards, wherein each reward is a combination of an external reward received as a result of the agent performing the selected action and an intrinsic reward dependent upon the goal vectors generated by the manager neural network subsystem.
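
Claim 8 does not specify the form of the intrinsic reward beyond its dependence on the manager's goal vectors. The sketch below shows one natural instantiation, assuming the intrinsic reward is the cosine similarity between the change in the latent state and the goal vector (which presumes the goal and latent representations share a dimensionality), combined with the external reward under a time discount; alpha and gamma are illustrative values, not values from the claims.

import numpy as np

def intrinsic_reward(latent_prev, latent_now, goal):
    """One possible intrinsic reward (an assumption, not specified by claim 8): cosine
    similarity between the change in latent state and the manager's goal vector."""
    delta = latent_now - latent_prev
    denom = (np.linalg.norm(delta) * np.linalg.norm(goal)) + 1e-8
    return float(delta @ goal) / denom

def combined_return(external_rewards, intrinsic_rewards, alpha=0.5, gamma=0.99):
    """Time-discounted combination of external and intrinsic rewards (claim 8).
    alpha weights the intrinsic term; both alpha and gamma are illustrative."""
    total = 0.0
    for t, (r_ext, r_int) in enumerate(zip(external_rewards, intrinsic_rewards)):
        total += (gamma ** t) * (r_ext + alpha * r_int)
    return total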

9. The system of claim 8, wherein the manager neural network subsystem has been trained to generate initial goal vectors that result in action scores that encourage selection of actions that move the agent in advantageous directions in the latent state space.

10. The system of any one of claims 1-9, wherein the goal recurrent neural network is the dilated long short-term memory (LSTM) neural network of any one of claims 11-18.

11. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement:

a dilated LSTM neural network, wherein the dilated LSTM neural network is configured to maintain an internal state that is partitioned into r sub-states, wherein r is an integer greater than one, and wherein the dilated LSTM neural network is configured to, at each time step in a sequence of time steps:

receive a network input for the time step;

select a sub-state from the r sub-states; and

process current values of the selected sub-state and the network input for the time step using an LSTM neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of LSTM network parameters.
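
A minimal NumPy sketch of the dilated LSTM of claim 11, incorporating the sub-state selection of claim 16 and the state-setting of claim 18: the internal state is partitioned into r sub-states, the sub-state indexed by the time step modulo r is selected, the LSTM's internal state is set to that sub-state's current values for the step, and the updated values are written back. The single shared LSTM cell and all weight shapes are assumptions for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """A standard LSTM cell in NumPy; weight shapes and initialisation are illustrative."""
    def __init__(self, d_in, d_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.1, (4 * d_hidden, d_in + d_hidden))
        self.b = np.zeros(4 * d_hidden)

    def step(self, x, state):
        h, c = state
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, (h, c)

class DilatedLSTM:
    """Dilated LSTM as in claim 11: the internal state is partitioned into r sub-states,
    and at time step t only the sub-state selected by t modulo r is read, updated, and
    written back (claims 16 and 18)."""
    def __init__(self, d_in, d_hidden, r, seed=0):
        self.cell = LSTMCell(d_in, d_hidden, seed)
        self.r = r
        self.sub_states = [(np.zeros(d_hidden), np.zeros(d_hidden)) for _ in range(r)]

    def step(self, x, t):
        idx = t % self.r                                           # sub-state selection (claim 16)
        out, new_state = self.cell.step(x, self.sub_states[idx])   # state set to sub-state (claim 18)
        self.sub_states[idx] = new_state                           # write the updated values back
        return out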

12. The system of claim 11, wherein the dilated LSTM neural network is further configured to, for each of the time steps:

pool the network output for the time step and the network outputs for up to a predetermined number of preceding time steps to generate a final network output for the time step.

13. The system of any one of claims 11-12, wherein pooling the network outputs comprises summing the network outputs.

14. The system of any one of claims 11-13, wherein pooling the network outputs comprises averaging the network outputs.

15. The system of any one of claims 11-14, wherein pooling the network outputs comprises selecting a highest network output.
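
Claims 13-15 name three pooling operations over the network outputs of claim 12. A short sketch, assuming the outputs are equal-length vectors and reading "selecting a highest network output" in claim 15 as an element-wise maximum; other readings (for example, picking the single output vector with the largest value) are equally consistent with the claim.

import numpy as np

def pool_outputs(recent_outputs, mode="sum"):
    """Pool the network outputs for the current and up to a predetermined number of
    preceding time steps (claims 12-15). recent_outputs is a list of equal-length vectors."""
    stacked = np.stack(recent_outputs)
    if mode == "sum":            # claim 13: summing the network outputs
        return stacked.sum(axis=0)
    if mode == "mean":           # claim 14: averaging the network outputs
        return stacked.mean(axis=0)
    return stacked.max(axis=0)   # claim 15: element-wise maximum assumed here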

16. The system of any one of claims 11-15, wherein the time steps in the sequence of time steps are indexed starting from 1 for the first time step in the sequence to T for the last time step in the sequence, wherein each sub-state is assigned an index ranging from 1 to r, and wherein selecting a sub-state from the r sub-states comprises:

selecting the sub-state having an index that is equal to the index of the time step modulo r.
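
A small illustration of the indexing in claim 16. With time steps indexed from 1 and sub-states indexed from 1 to r, a time step whose index is an exact multiple of r gives a modulo of 0, for which no sub-state exists; mapping that case to sub-state r is one common convention and is an assumption here, not something the claim specifies.

# Illustration for r = 3 sub-states; the "or r" handles the (t mod r) == 0 case.
r = 3
for t in range(1, 8):
    sub_state_index = t % r or r
    print(f"time step {t} -> sub-state {sub_state_index}")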

17. The system of any one of claims 11-16, wherein the LSTM neural network comprises a plurality of LSTM layers.

18. The system of any one of claims 11-17, wherein processing current values of the selected sub-state and the network input for the time step using an LSTM neural network to update the current values of the selected sub-state and to generate a network output for the time step in accordance with current values of a set of LSTM network parameters comprises:

setting an internal state of the LSTM neural network to the current values of the selected sub-state for the processing of the network input at the time step.

19. One or more computer storage media storing instructions that when executed by one or more computers cause the one or more computers to implement the system of any one of claims 1-10.

20. A method comprising the respective operations performed by the action selection subsystem of any one of claims 1-10.

21. One or more computer storage media storing instructions that when executed by one or more computers cause the one or more computers to implement the dilated LSTM neural network of any one of claims 11-18.