
METHOD AND SYSTEM FOR GENERATING SYNTHETICALLY ACCESSIBLE MOLECULES WITH CHEMICAL REACTION TRAJECTORIES USING REINFORCEMENT LEARNING

FIELD

[0001] The present technology relates to machine learning algorithms in general and more specifically to methods and systems for generating synthetically accessible molecules with chemical reaction trajectories using reinforcement learning.

BACKGROUND

[0002] Designing new molecules with specific properties is an important problem in drug discovery and materials science. This problem remains challenging due to the sheer size of the chemical search space. For example, it has been estimated that the number of synthetically accessible drug-like compounds reaches 10⁶⁰. Navigating through a space of this magnitude solely by means of in vitro experimentation with the aim of finding desirable novel molecules optimized for various criteria (e.g. binding to a biological target, appropriate pharmacokinetic properties, etc.) can therefore be prohibitively expensive and time consuming. Consequently, computational drug discovery has emerged as a new area of research to assist in the discovery of molecules with therapeutic potential.

[0003] With recent advances in artificial intelligence, specifically in deep generative modeling and deep reinforcement learning, as well as an improvement in our understanding of chemical and biological systems, progress has been made in the field, shedding light on an exciting new direction for drug discovery. In particular, several approaches applying deep learning models for the automatic de novo generation of valid molecular structures optimized for specific chemical and biological property objectives have been proposed.

[0004] However, some challenges remain for generating molecules that are both optimized for desired properties and synthetically accessible.

SUMMARY

[0005] It is an object of the present technology to ameliorate at least some of the inconveniences present in the prior art. One or more embodiments of the present technology may provide and/or broaden the scope of approaches to and/or methods of achieving the aims and objects of the present technology.

[0006] One or more embodiments of the present technology have been developed based on developers’ appreciation that in drug discovery, candidate molecules are required to be optimized for several objectives simultaneously. The task of jointly optimizing generative models in the high-dimensional and discrete chemical space for objectives, which may be complex, non-differentiable and possibly conflicting, is challenging. Various approaches have emerged using reinforcement learning (RL) as a tool to guide molecular optimization.

[0007] Developers of the present technology have appreciated that in techniques using reinforcement learning approaches such as Markov decision processes (MDP) for molecule generation, there is an absence of chemical intuition when the action space is defined on atoms. The episode sequences give no indication of molecular synthesis routes, and the reinforcement learning agent needs to learn which atom additions and removals constitute valid actions, while other actions are penalized, which may add additional complexity to the task.

[0008] More specifically, the present technology arises from an observation by the developers that generation of molecules using Markov decision processes (MDP) could use temporal abstraction with the options framework to leverage hierarchy, that action representations could be learned to generalize across the action space, and that off-policy data from known compounds could be leveraged to obtain more accurate results.

[0009] One or more embodiments of the present technology enable embedding the knowledge of the dynamics of chemical transitions into a reinforcement learning system for guided exploration. A bias is induced over the optimization task which, given its close correspondence with natural molecular transitions, may increase learning efficiency and improve chemical space navigation, thus leading to better performance across a larger, pharmacologically relevant chemical subspace. The bias may be induced into the transition model of an MDP by defining possible transitions as true chemical reactions. This bias enables the additional benefit of built-in synthetic accessibility, in addition to immediate access to one possible synthesis route for generated compounds.

[0010] The present technology makes it possible to cater to real-world experimental constraints such as reactant availability, synthesis difficulty, and cost.

[0011] Thus, one or more embodiments of the present technology are directed to methods and systems for generating novel and optimized molecules based on pharmaceutical design criteria that also satisfy constraints of synthetic feasibility by using reinforcement learning.

[0012] In accordance with a broad aspect of the present technology, there is provided a method for generating a synthetically accessible molecule using a Markov decision process (MDP) in reinforcement learning, the method being executed by a processor, the processor being operatively connected to a database, the database including a plurality of transformations to apply on molecules. The method comprises: receiving an indication of a molecule; generating, based on the indication of the molecule, a current state; and selecting, by a control policy, from the database, an action to apply on the current state to obtain a product state, the action including a transformation of the plurality of transformations and the product state corresponding to an other molecule. The method comprises generating, based on the current state and the action, the product state. The method comprises determining if the product state corresponds to a terminal state and, in response to the product state corresponding to the terminal state: updating, based on a reward value, at least one parameter of the control policy to obtain at least one updated parameter, and outputting the product state corresponding to the other molecule as the synthetically accessible molecule.
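By way of a non-limiting illustration only, the above-described generation loop may be sketched in Python as follows; the helper names (featurize, apply_transformation, is_terminal, reward) are hypothetical placeholders supplied by the caller and are not identifiers of the present disclosure:

```python
# Illustrative sketch of the claimed generation loop; all helpers are hypothetical
# placeholders provided by the caller, not part of the disclosure.
def generate_molecule(molecule, policy, featurize, apply_transformation,
                      is_terminal, reward, max_steps=10):
    state = featurize(molecule)                      # current state built from the input molecule
    for _ in range(max_steps):
        action = policy.select(state)                # a transformation selected from the database
        molecule, state = apply_transformation(molecule, action)  # product state / other molecule
        if is_terminal(state, action):
            policy.update(reward(state))             # update control-policy parameters from the reward
            break
    return molecule                                  # the synthetically accessible molecule
```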

[0013] In one or more embodiments of the method, the method further comprises: in response to the product state not corresponding to the terminal state: selecting, by the control policy, a further action to apply on the product state to obtain a further product state, the further action corresponding to a further transformation and the further product state corresponding to a further molecule.

[0014] In one or more embodiments of the method, the method further comprises, prior to said determining if the product state corresponds to the terminal state: calculating the reward value of the product state using a reward function; and said determining if the product state corresponds to the terminal state is based on at least one of: the reward value, a number of steps, a set of properties of the product state, and a selected action.

[0015] In one or more embodiments of the method, the transformation comprises at least one of a chemical reaction and a computational transformation.

[0016] In one or more embodiments of the method, said selecting, from the database, the action to apply on the current state to obtain the product state comprises: selecting, from the database, at least one reactant to perform the transformation.

[0017] In one or more embodiments of the method, said selecting, from the database, the action including the transformation to apply on the current state to obtain the product state is based on the indication of the molecule.

[0018] In one or more embodiments of the method, the transformation comprises one of: an addition of a molecular fragment, a deletion of a molecular fragment, and a substitution of a molecular fragment.

[0019] In one or more embodiments of the method, the method further comprises, prior to said calculating the reward value using the reward function: receiving, from the database, based on the product state, a set of properties, said calculating the reward value using the reward function is based on the set of properties.

[0020] In one or more embodiments of the method, the set of properties comprise one or more of: an absorption, distribution, metabolism and excretion (ADME), an ADME-toxicity, a liberation, a bioavailability, a ligand efficiency, a lipophilic efficiency, a potency at a biological target, and a solubility.

[0021] In one or more embodiments of the method, said calculating the reward value using the reward function based on the set of properties comprises scalarizing a molecular vector including the set of properties to obtain the reward value.

[0022] In one or more embodiments of the method, the reward function comprises a deterministic reward function.

[0023] In one or more embodiments of the method, the method further comprises, prior to said generating, based on the indication of the molecule, the initial state: generating, using the indication of the molecule, a feature vector thereof, said generating the initial state is based on the feature vector of the molecule.

[0024] In one or more embodiments of the method, the feature vector is generated using a Morgan Fingerprint.

[0025] In one or more embodiments of the method, the method further comprises, prior to said receiving the indication of the molecule: learning the control policy using a value-based approach.

[0026] In one or more embodiments of the method, the method further comprises, prior to said receiving the indication of the molecule: learning the control policy using one of: a hierarchical approach and a non-hierarchical approach.

[0027] In one or more embodiments of the method, the hierarchical approach comprises an option-critic architecture and the non-hierarchical approach comprises an actor-critic architecture.

[0028] In one or more embodiments of the method, said selecting the action to apply on the current state to obtain the product state comprises: selecting, from the database, an option corresponding to the transformation, and selecting, from the database, a reactant of a plurality of reactants for applying the transformation.

[0029] In one or more embodiments of the method, the method further comprises, prior to said receiving the indication of the molecule: selecting, based on a set of chemical transformations, the molecule.

[0030] In accordance with a broad aspect of the present technology, there is provided a system for generating a synthetically accessible molecule using a Markov decision process (MDP) in reinforcement learning. The system comprises: a processor, and a non-transitory storage medium operatively connected to the processor, the non-transitory storage medium including: a plurality of transformations to apply on molecules, and computer-readable instructions. The processor, upon executing the computer-readable instructions, is configured for: receiving an indication of a molecule; generating, based on the indication of the molecule, a current state; selecting, by a control policy, from the non-transitory storage medium, an action to apply on the current state to obtain a product state, the action including a transformation of the plurality of transformations and the product state corresponding to an other molecule; and generating, based on the current state and the action, the product state. The processor is configured for determining if the product state corresponds to a terminal state and, in response to the product state corresponding to the terminal state: updating, based on a reward value, at least one parameter of the control policy to obtain at least one updated parameter, and outputting the product state corresponding to the other molecule as the synthetically accessible molecule.

[0031] In one or more embodiments of the system, the processor is further configured for: in response to the product state not corresponding to the terminal state: selecting, by the control policy, a further action to apply on the product state to obtain a further product state, the further action corresponding to a further transformation and the further product state corresponding to a further molecule.

[0032] In one or more embodiments of the system, the processor is further configured for, prior to said determining if the product state corresponds to the terminal state: calculating the reward value of the product state using a reward function; and said determining if the product state corresponds to the terminal state is based on at least one of: the reward value, a number of steps, a set of properties of the product state, and a selected action.

[0033] In one or more embodiments of the system, the transformation comprises at least one of a chemical reaction and a computational transformation.

[0034] In one or more embodiments of the system, said selecting, from the non-transitory storage medium, the action to apply on the current state to obtain the product state comprises: selecting, from the non-transitory storage medium, at least one reactant to perform the transformation.

[0035] In one or more embodiments of the system, said selecting, from the non-transitory storage medium, the action including the transformation to apply on the current state to obtain the product state is based on the indication of the molecule.

[0036] In one or more embodiments of the system, the transformation comprises one of: an addition of a molecular fragment, a deletion of a molecular fragment, and a substitution of a molecular fragment.

[0037] In one or more embodiments of the system, the processor is further configured for, prior to said calculating the reward value using the reward function: receiving, from the non-transitory storage medium, based on the product state, a set of properties, said calculating the reward value using the reward function is based on the set of properties.

[0038] In one or more embodiments of the system, the set of properties comprise one or more of: an absorption, distribution, metabolism and excretion (ADME), an ADME-toxicity, a liberation, a bioavailability, a ligand efficiency, a lipophilic efficiency, a potency at a biological target, and a solubility.

[0039] In one or more embodiments of the system, said calculating the reward value using the reward function based on the set of properties comprises scalarizing a molecular vector including the set of properties to obtain the reward value.

[0040] In one or more embodiments of the system, the reward function comprises a deterministic reward function.

[0041] In one or more embodiments of the system, the processor is further configured for, prior to said generating, based on the indication of the molecule, the initial state: generating, using the indication of the molecule, a feature vector thereof, said generating the initial state is based on the feature vector of the molecule.

[0042] In one or more embodiments of the system, the feature vector is generated using a Morgan Fingerprint.

[0043] In one or more embodiments of the system, the processor is further configured for, prior to said receiving the indication of the molecule: learning the control policy using one of a policy-based approach and a value-based approach.

[0044] In one or more embodiments of the system, the hierarchical approach comprises an option-critic architecture and the non-hierarchical architecture comprises an actor-critic architecture.

[0045] In one or more embodiments of the system, said selecting the action to apply on the current state to obtain the product state comprises: selecting, from the non-transitory storage medium, an option corresponding to the transformation, and selecting, from the non-transitory storage medium, a reactant of a plurality of reactants for applying the transformation.

[0046] In one or more embodiments of the system, the processor is further configured for, prior to said receiving the indication of the molecule: selecting, based on a set of chemical transformations, the molecule.

Definitions

[0047] Machine Learning Algorithms (MLA)

[0048] A machine learning algorithm is a process or set of procedures that helps a mathematical model adapt to data given an objective. An MLA normally specifies the way the data is transformed from input to output and how the model learns the appropriate mapping from input to output. The model specifies the mapping function and holds the parameters while the learning algorithm updates the parameters to help the model satisfy the objective.

[0049] MLAs may generally be divided into broad categories such as supervised learning, unsupervised learning and reinforcement learning. Supervised learning involves presenting a machine learning algorithm with training data consisting of inputs and outputs labelled by assessors, where the objective is to train the machine learning algorithm such that it learns a general rule for mapping inputs to outputs. Unsupervised learning involves presenting the machine learning algorithm with unlabeled data, where the objective is for the machine learning algorithm to find a structure or hidden patterns in the data. Reinforcement learning involves having a software agent taking actions in an environment so as to maximize some notion of cumulative reward, without requiring labelled data, and where sub-optimal actions do not need to be explicitly corrected.

[0050] Models used by supervised and unsupervised MLAs include neural networks (including deep learning), decision trees, support vector machines (SVMs), Bayesian networks, and genetic algorithms. Models used in reinforcement learning include Markov decision processes.

[0051] Neural Networks (NNs)

[0052] Neural networks (NNs), also known as artificial neural networks (ANNs) are a class of non-linear models mapping from inputs to outputs and comprised of layers that can potentially learn useful representations for predicting the outputs. Neural networks are typically organized in layers, which are made of a number of interconnected nodes that contain activation functions. Patterns may be presented to the network via an input layer connected to hidden layers, and processing may be done via the weighted connections of nodes. The answer is then output by an output layer connected to the hidden layers.

[0053] In the context of the present specification, a “server” is a computer program that is running on appropriate hardware and is capable of receiving requests (e.g., from electronic devices) over a network (e.g., a communication network), and carrying out those requests, or causing those requests to be carried out. The hardware may be one physical computer or one physical computer system, but neither is required to be the case with respect to the present technology. In the present context, the use of the expression a “server” is not intended to mean that every task (e.g., received instructions or requests) or any particular task will have been received, carried out, or caused to be carried out, by the same server (i.e., the same software and/or hardware); it is intended to mean that any number of software elements or hardware devices may be involved in receiving/sending, carrying out or causing to be carried out any task or request, or the consequences of any task or request; and all of this software and hardware may be one server or multiple servers, both of which are included within the expressions “at least one server” and “a server”.

[0054] In the context of the present specification, “electronic device” is any computing apparatus or computer hardware that is capable of running software appropriate to the relevant task at hand. Thus, some (non-limiting) examples of electronic devices include general purpose personal computers (desktops, laptops, netbooks, etc.), mobile computing devices, smartphones, and tablets, and network equipment such as routers, switches, and gateways. It should be noted that an electronic device in the present context is not precluded from acting as a server to other electronic devices. The use of the expression “an electronic device” does not preclude multiple electronic devices being used in receiving/ sending, carrying out or causing to be carried out any task or request, or the consequences of any task or request, or steps of any method described herein. In the context of the present specification, a “client device” refers to any of a range of end-user client electronic devices, associated with a user, such as personal computers, tablets, smartphones, and the like.

[0055] In the context of the present specification, the expression "computer readable storage medium" (also referred to as "storage medium” and “storage”) is intended to include non-transitory media of any nature and kind whatsoever, including without limitation RAM, ROM, disks (CD-ROMs, DVDs, floppy disks, hard drives, etc.), USB keys, solid-state drives, tape drives, etc. A plurality of components may be combined to form the computer information storage media, including two or more media components of a same type and/or two or more media components of different types.

[0056] In the context of the present specification, a "database" is any structured collection of data, irrespective of its particular structure, the database management software, or the computer hardware on which the data is stored, implemented or otherwise rendered available for use. A database may reside on the same hardware as the process that stores or makes use of the information stored in the database or it may reside on separate hardware, such as a dedicated server or plurality of servers.

[0057] In the context of the present specification, the expression “information” includes information of any nature or kind whatsoever capable of being stored in a database. Thus information includes, but is not limited to audiovisual works (images, movies, sound records, presentations etc.), data (location data, numerical data, etc.), text (opinions, comments, questions, messages, etc.), documents, spreadsheets, lists of words, etc.

[0058] In the context of the present specification, unless expressly provided otherwise, an “indication” of an information element may be the information element itself or a pointer, reference, link, or other indirect mechanism enabling the recipient of the indication to locate a network, memory, database, or other computer-readable medium location from which the information element may be retrieved. For example, an indication of a document could include the document itself (i.e. its contents), or it could be a unique document descriptor identifying a file with respect to a particular file system, or some other means of directing the recipient of the indication to a network location, memory address, database table, or other location where the file may be accessed. As one skilled in the art would recognize, the degree of precision required in such an indication depends on the extent of any prior understanding about the interpretation to be given to information being exchanged as between the sender and the recipient of the indication. For example, if it is understood prior to a communication between a sender and a recipient that an indication of an information element will take the form of a database key for an entry in a particular table of a predetermined database containing the information element, then the sending of the database key is all that is required to effectively convey the information element to the recipient, even though the information element itself was not transmitted as between the sender and the recipient of the indication.

[0059] In the context of the present specification, the expression “communication network” is intended to include a telecommunications network such as a computer network, the Internet, a telephone network, a Telex network, a TCP/IP data network (e.g., a WAN network, a LAN network, etc.), and the like. The term “communication network” includes a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared and other wireless media, as well as combinations of any of the above.

[0060] In the context of the present specification, the words “first”, “second”, “third”, etc. have been used as adjectives only for the purpose of allowing for distinction between the nouns that they modify from one another, and not for the purpose of describing any particular relationship between those nouns. Thus, for example, it should be understood that the use of the terms “server” and “third server” is not intended to imply any particular order, type, chronology, hierarchy or ranking (for example) of/between the servers, nor is their use (by itself) intended to imply that any “second server” must necessarily exist in any given situation. Further, as is discussed herein in other contexts, reference to a “first” element and a “second” element does not preclude the two elements from being the same actual real-world element. Thus, for example, in some instances, a “first” server and a “second” server may be the same software and/or hardware, and in other cases they may be different software and/or hardware.

[0061] Implementations of the present technology each have at least one of the above-mentioned object and/or aspects, but do not necessarily have all of them. It should be understood that some aspects of the present technology that have resulted from attempting to attain the above-mentioned object may not satisfy this object and/or may satisfy other objects not specifically recited herein.

[0062] Additional and/or alternative features, aspects and advantages of implementations of the present technology will become apparent from the following description, the accompanying drawings and the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0063] For a better understanding of the present technology, as well as other aspects and further features thereof, reference is made to the following description which is to be used in conjunction with the accompanying drawings, where:

[0064] Figure 1 depicts a schematic diagram of an electronic device in accordance with one or more non-limiting embodiments of the present technology.

[0065] Figure 2 depicts a schematic diagram of a system in accordance with one or more non-limiting embodiments of the present technology.

[0066] Figure 3 depicts a schematic diagram of a molecule generation procedure in accordance with one or more non-limiting embodiments of the present technology.

[0067] Figure 4A depicts a schematic diagram of an actor-critic architecture of the molecule generation procedure in accordance with non-limiting embodiments of the present technology.

[0068] Figure 4B depicts a schematic diagram of an option-critic architecture of the molecule generation procedure in accordance with one or more non-limiting embodiments of the present technology.

[0069] Figure 5A depicts a schematic diagram of a chemical transformation applied during the molecular generation procedure, in accordance with non-limiting embodiments of the present technology.

[0070] Figure 5B depicts trajectories taken by a learning agent of the molecule generation procedure during the optimization of molecular affinity towards the Dopamine receptor D2 in accordance with non-limiting embodiments of the present technology.

[0071] Figure 6 depicts a flow chart of a method of generating a molecule using a Markov decision process in accordance with non-limiting embodiments of the present technology.

[0072] Figure 7 depicts synthetic accessibility and drug-likeness score distributions of molecules optimized for an affinity to the D2 dopamine receptor (DRD2) in comparison to the starting molecular blocks in accordance with one or more non-limiting embodiments of the present technology.

[0073] Figure 8 depicts sample molecules produced under a cLogP objective, a QED objective and an affinity to the DRD2 respectively by RL algorithms in accordance with one or more non-limiting embodiments of the present technology.

[0074] Figure 9 depicts plots of reward progression as the number of optimization objectives increases in accordance with one or more non-limiting embodiments of the present technology.

[0075] Figure 10 depicts plots of trajectory initialization and episode termination steps of the agent of the present technology for each objective starting from the same building block, the plots being depicted in accordance with one or more non-limiting embodiments of the present technology.

[0076] Figure 11 depicts trajectories taken by the agent of the present technology from the same building block for a cLogP and a QED objective, the trajectories being depicted in accordance with one or more non-limiting embodiments of the present technology.

[0077] Figure 12 depicts plots of training time for each RL technique on each optimization task, the present approach converging faster compared to other approaches, the plots being depicted in accordance with one or more non-limiting embodiments of the present technology.

[0078] Figure 13 depicts plots comparing performance across experiments and methods for multi-objective optimization when using Chebyshev and Linear Scalarization approaches in accordance with one or more non-limiting embodiments of the present technology.

DETAILED DESCRIPTION

[0079] The examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the present technology and not to limit its scope to such specifically recited examples and conditions. It will be appreciated that those skilled in the art may devise various arrangements which, although not explicitly described or shown herein, nonetheless embody the principles of the present technology and are included within its spirit and scope.

[0080] Furthermore, as an aid to understanding, the following description may describe relatively simplified implementations of the present technology. As persons skilled in the art would understand, various implementations of the present technology may be of a greater complexity.

[0081] In some cases, what are believed to be helpful examples of modifications to the present technology may also be set forth. This is done merely as an aid to understanding, and, again, not to define the scope or set forth the bounds of the present technology. These modifications are not an exhaustive list, and a person skilled in the art may make other modifications while nonetheless remaining within the scope of the present technology. Further, where no examples of modifications have been set forth, it should not be interpreted that no modifications are possible and/or that what is described is the sole manner of implementing that element of the present technology.

[0082] Moreover, all statements herein reciting principles, aspects, and implementations of the present technology, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof, whether they are currently known or developed in the future. Thus, for example, it will be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the present technology. Similarly, it will be appreciated that any flowcharts, flow diagrams, state transition diagrams, pseudo-code, and the like represent various processes which may be substantially represented in computer-readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.

[0083] The functions of the various elements shown in the figures, including any functional block labeled as a "processor" or a “graphics processing unit”, may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. In some non-limiting embodiments of the present technology, the processor may be a general purpose processor, such as a central processing unit (CPU) or a processor dedicated to a specific purpose, such as a graphics processing unit (GPU). Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.

[0084] Software modules, or simply modules which are implied to be software, may be represented herein as any combination of flowchart elements or other elements indicating performance of process steps and/or textual description. Such modules may be executed by hardware that is expressly or implicitly shown.

[0085] With these fundamentals in place, we will now consider some non-limiting examples to illustrate various implementations of aspects of the present technology.

[0086] Electronic device

[0087] Referring to Figure 1, there is shown an electronic device 100 suitable for use with some implementations of the present technology, the electronic device 100 comprising various hardware components including one or more single or multi-core processors collectively represented by processor 110, a graphics processing unit (GPU) 111, a solid-state drive 120, a random access memory 130, a display interface 140, and an input/output interface 150.

[0088] Communication between the various components of the electronic device 100 may be enabled by one or more internal and/or external buses 160 (e.g. a PCI bus, universal serial bus, IEEE 1394 “Firewire” bus, SCSI bus, Serial-ATA bus, etc.), to which the various hardware components are electronically coupled.

[0089] The input/output interface 150 may be coupled to a touchscreen 190 and/or to the one or more internal and/or external buses 160. The touchscreen 190 may be part of the display. In some embodiments, the touchscreen 190 is the display. The touchscreen 190 may equally be referred to as a screen 190. In the embodiments illustrated in Figure 1, the touchscreen 190 comprises touch hardware 194 (e.g., pressure-sensitive cells embedded in a layer of a display allowing detection of a physical interaction between a user and the display) and a touch input/output controller 192 allowing communication with the display interface 140 and/or the one or more internal and/or external buses 160. In some embodiments, the input/output interface 150 may be connected to a keyboard (not shown), a mouse (not shown) or a trackpad (not shown) allowing the user to interact with the electronic device 100 in addition or in replacement of the touchscreen 190.

[0090] According to implementations of the present technology, the solid-state drive 120 stores program instructions suitable for being loaded into the random-access memory 130 and executed by the processor 110 and/or the GPU 111 for generating novel and optimized molecules based on pharmaceutical design criteria that also satisfy constraints of synthetic feasibility by using reinforcement learning techniques. For example, the program instructions may be part of a library or an application.

[0091] The electronic device 100 may be implemented as a server, a desktop computer, a laptop computer, a tablet, a smartphone, a personal digital assistant or any device that may be configured to implement the present technology, as it may be understood by a person skilled in the art.

[0092] System

[0093] Referring to Figure 2, there is shown a schematic diagram of a system 200, the system 200 being suitable for implementing non-limiting embodiments of the present technology. It is to be expressly understood that the system 200 as depicted is merely an illustrative implementation of the present technology. Thus, the description thereof that follows is intended to be only a description of illustrative examples of the present technology. This description is not intended to define the scope or set forth the bounds of the present technology. In some cases, what are believed to be helpful examples of modifications to the system 200 may also be set forth below. This is done merely as an aid to understanding, and, again, not to define the scope or set forth the bounds of the present technology. These modifications are not an exhaustive list, and, as a person skilled in the art would understand, other modifications are likely possible. Further, where this has not been done (i.e., where no examples of modifications have been set forth), it should not be interpreted that no modifications are possible and/or that what is described is the sole manner of implementing that element of the present technology. As a person skilled in the art would understand, this is likely not the case. In addition, it will be appreciated that the system 200 may provide in certain instances simple implementations of the present technology, and that where such is the case they have been presented in this manner as an aid to understanding. As persons skilled in the art would appreciate, various implementations of the present technology may be of a greater complexity.

[0094] The system 200 comprises inter alia a server 220 connected to a database 230 over a communications network 250.

[0095] Server

[0096] The server 220 is configured to inter alia: (i) receive an indication of a molecule; (ii) select a reaction, from a plurality of reactions, to apply on the molecule; (iii) select, based on the reaction, a set of reactants; (iv) apply the reaction using the reactants to obtain a second molecule; (v) estimate a set of properties of the molecule; and (vi) learn a function for generating molecules based on the reaction, the reactants and the set of properties.

[0097] How the server 220 is configured to do so will be explained in more detail herein below.

[0098] The server 220 can be implemented as a conventional computer server and may comprise some or all of the features of the electronic device 100 depicted in Figure 1. In an example of an embodiment of the present technology, the server 220 can be implemented as a Dell™ PowerEdge™ Server running the Microsoft™ Windows Server™ operating system. Needless to say, the server 220 can be implemented in any other suitable hardware and/or software and/or firmware or a combination thereof. In the depicted non-limiting embodiment of present technology, the server 220 is a single server. In alternative non-limiting embodiments of the present technology, the functionality of the server 220 may be distributed and may be implemented via multiple servers (not depicted).

[0099] The implementation of the server 220 is well known to the person skilled in the art of the present technology. However, briefly speaking, the server 220 comprises a communication interface (not depicted) structured and configured to communicate with various entities (such as the database 230, for example, and other devices potentially coupled to the network) via the network. The server 220 further comprises at least one computer processor (e.g., the processor 110 and/or the GPU 111 of the electronic device 100) operationally connected with a non-transitory storage medium (e.g., the solid-state drive 120 and/or the random-access memory 130) and with the communication interface and structured and configured to execute various processes to be described herein.

[0100] Database

[0101] A database 230 is communicatively coupled to the server 220 via the communications network 250 but, in alternative implementations, the database 230 may be communicatively coupled to the server 220 without departing from the teachings of the present technology. Although the database 230 is illustrated schematically herein as a single entity, it is contemplated that the database 230 may be configured in a distributed manner, for example, the database 230 could have different components, each component being configured for a particular kind of retrieval therefrom or storage therein.

[0102] The database 230 may be a structured collection of data, irrespective of its particular structure or the computer hardware on which data is stored, implemented or otherwise rendered available for use. The database 230 may reside on the same hardware as a process that stores or makes use of the information stored in the database 230 or it may reside on separate hardware, such as on the server 220. The database 230 may receive data from the server 220 for storage thereof and may provide stored data to the server 220 for use thereof.

[0103] The database 230 is configured to store inter alia indications of: (i) a plurality of molecules 240; (ii) a plurality of transformations 250; (iii) a plurality of reactants 260; (iv) a set of properties 270; and (v) a set of training objects 280.

[0104] The database 230 is configured to store indication of a plurality of molecules 240. The nature of the molecules 240 is not limited. In one or more embodiments, the molecules 240 are lead compounds, i.e. molecules having pharmacological or biological activity likely to be therapeutically useful but may nevertheless have suboptimal structure that requires modification to fit better to a desired target.

[0105] In one or more embodiments, the database 230 stores feature vectors associated with the plurality of molecules 240.

[0106] The database 230 is configured to store the plurality of transformations 250. A transformation is a change in the chemical state of a molecule, including but not limited to substitution, addition or elimination of part or the entirety of the chemical composition of the molecule. A transformation may comprise a chemical reaction, i.e. reacting with other substances. In one or more embodiments, a transformation may comprise a computational transformation.

[0107] It will be appreciated that one or more chemical transformations of the plurality of transformations 250 may be applied on any given molecule of the molecules 240 to obtain a new molecule, which may consist of a deletion, an addition, or a modification of fragments of the given molecule. Non-limiting examples of transformations and/or reactions may include: a combination or synthesis, a decomposition, a single displacement or substitution, a double displacement or metathesis, and a combustion.

[0108] The database 230 is configured to store the plurality of reactants 260. A reactant of the plurality of reactants 260 is added to a molecule to cause a reaction of the plurality of reactions 250. A set of reactants, i.e. a subset of the plurality of reactants 260, may be associated with a given reaction of the plurality of transformations 250, but this does not need to be so in every embodiment of the present technology.

[0109] The database 230 is configured to store a set of properties 270. The set of properties 270 may include properties which are indicative of pharmacological activity of molecules. As a non-limiting example, the set of properties 270 may include one or more of: ADMET or ADME-Tox, liberation, bioavailability, ligand efficiency and lipophilic efficiency, potency at the biological target, solubility and the like.

[0110] The database 230 is configured to store a set of training objects 280 which will be used to train a reinforcement learning agent (not depicted in Figure 2). How the reinforcement learning agent is trained on the set of training objects will be explained in more detail herein below.

[0111] Communication Network

[0112] In some embodiments of the present technology, the communications network 250 is the Internet. In alternative non-limiting embodiments, the communication network 250 can be implemented as any suitable local area network (LAN), wide area network (WAN), a private communication network or the like. It should be expressly understood that implementations for the communication network 250 are for illustration purposes only. How a communication link (not separately numbered) between the server 220, the database 230 and/or another electronic device (not depicted) and the communications network 250 is implemented will depend inter alia on how each electronic device is implemented.

[0113] Molecule Generation Procedure

[0114] Reference is now made to Figure 3 to Figure 5 which depict a molecule generation procedure 300 in accordance with non-limiting embodiments of the present technology.

[0115] The molecule generation procedure 300 is executed by the server 220.

[0116] The molecule generation procedure 300 comprises inter alia a reinforcement learning agent 320, an environment 340, and property estimators 360.

[0117] Markov Decision Process (MDP)

[0118] The molecule generation procedure 300 uses reinforcement learning to generate molecules. More specifically, the molecule generation procedure 300 implements a Markov decision process (MDP), which is expressed as a tuple (S, A, Pa, Ra), where S is a state space comprising a finite set of states within the environment 340, A is an action space comprising a finite set of actions that can be performed by the agent 320 within the environment 340, Pa is a transition function Pa : S × A × S → R defining the dynamics of the environment 340, and Ra is a reward function Ra : A × S → R defining the reward distribution of the environment 340. In the molecule generation procedure 300 implementing an MDP, the environment 340 comprises the chemical space, a state 345 corresponds to a molecule, and an action 325 comprises a chemical transformation associated with a respective set of reactants.
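As a non-limiting illustration, the tuple (S, A, Pa, Ra) may be represented programmatically as sketched below; the field names are assumptions of this sketch rather than terms of the disclosure:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MoleculeMDP:
    """Illustrative container for the (S, A, Pa, Ra) tuple described above."""
    states: Sequence                    # S: states, each corresponding to a molecule
    actions: Sequence                   # A: chemical transformations with their reactants
    transition: Callable[..., object]   # Pa: dynamics of the chemical environment
    reward: Callable[..., float]        # Ra: reward distribution over transitions
```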

[0119] The MDP assumes the properties expressed as equation (1):

P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_0, a_0, s_1, a_1, ..., s_t, a_t)     (1)

[0120] where t denotes discrete time steps.

[0121] It will be appreciated that all prior history of a decision trajectory can be encapsulated within the preceding state, allowing an agent operating within an MDP to make decisions based solely on the current state of the environment 340. This assumption provides the basis for efficient learning.

[0122] State Space

[0123] Any valid molecule comprises a state 345 in the present MDP. Practically, the state space is defined as S = { f(m) : m ∈ M }, with f a feature extraction function and M the space of molecules reachable given a set of chemical reactions, initialization molecules, and available reactants. In one or more embodiments, Morgan Fingerprints are used with bit-length 2048 and radius 2 to extract feature vectors from molecules. In one or more embodiments, these representations have been shown to provide robust and efficient featurizations.
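As a non-limiting example, such a featurization may be computed with RDKit as sketched below; the function name featurize and the example molecule are assumptions of this sketch:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def featurize(smiles: str) -> np.ndarray:
    """Morgan fingerprint with radius 2 and bit-length 2048, as described above."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    return np.array(fp, dtype=np.float32)

state = featurize("CC(=O)Oc1ccccc1C(=O)O")  # example: aspirin as a starting molecule
```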

[0124] Action Space

[0125] The action space A is defined hierarchically, enabling the potential application of novel approaches for hierarchical reinforcement learning. A set of higher-level actions A0 may be defined as a curated list of chemical reaction templates, taking the form expressed by equation (2):

r_1 + r_2 + ... + r_n → p_1 + p_2 + ... + p_m     (2)

[0126] where each r_i corresponds to a reactant, while each p_j is a product of the reaction.

[0127] A reaction template is a computational representation (as line notation) of a chemical reaction that indicates which category of substances might react together, the result of such a reaction, as well as the reacting parts. As a non-limiting example, in Figure 5A, the first row corresponds to the reaction template and the bottom row corresponds to examples of molecules that could react (reactants) and the resulting molecule (product), given that reaction template.

[0128] In one or more embodiments, the SMARTS syntax is used to represent the objects as regular expressions. At step t, the state 345 s_t corresponds to a single reactant in any given reaction. It is necessary to select which molecular blocks should fill in the remaining pieces for a given state and reaction selection. This gives rise to a set of primitive actions A1 corresponding to a list of reactants derived from the reaction templates, which are referred to as chemical building blocks.
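For illustration only, the sketch below applies one hypothetical two-reactant template (an amide coupling written in SMARTS) with RDKit; the specific template and molecules are assumptions of this sketch and not part of the curated reaction list of the disclosure:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Hypothetical amide-coupling template written in SMARTS reaction notation.
template = AllChem.ReactionFromSmarts("[C:1](=[O:2])[OH].[N;H2:3]>>[C:1](=[O:2])[N:3]")

acid = Chem.MolFromSmiles("CC(=O)O")       # current state: a single reactant of the template
amine = Chem.MolFromSmiles("NCc1ccccc1")   # chemical building block filling the missing slot

products = template.RunReactants((acid, amine))
product = products[0][0]                   # product state for the next step of the episode
Chem.SanitizeMol(product)
print(Chem.MolToSmiles(product))           # CC(=O)NCc1ccccc1
```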

[0129] In contrast with methods which establish a deterministic start state such as an empty molecule or carbon atom, the environment 340 is initialized with a randomly-sampled building block which matches at minimum one reaction template, which may be acquired from the database 230. This ensures that a trajectory can take place and encourages the learned policies to be generalizable across different regions in chemical space.

[0130] As a non-limiting example, two-reactant reaction templates may be used and missing reactants may be selected based on those which will most improve the next state’s reward. The chemical product may also be selected in this manner when more than one product is generated. It will be appreciated that doing so collapses the hierarchical formulation into a standard MDP formulation, with the reaction selection being the only decision point. Additionally, it is likely that for any step t, the set of possible reactions is smaller than the full action space. In order to increase both the scalability of the framework (by allowing for larger reaction lists) and the speed of training, a mask is used over unfeasible reactions. This avoids the need for the agent to learn the chemistry of reaction feasibility, and reduces the effective dimension of the action space at each step. The policy then takes the form π(a | s, M(s)), with M the environment’s masking function.
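A minimal sketch of such a mask is given below, assuming the per-template policy scores and the feasibility vector (e.g. obtained by matching each template against the current molecule) are supplied by the caller:

```python
import numpy as np

def masked_policy(logits: np.ndarray, feasible: np.ndarray) -> np.ndarray:
    """Softmax over reaction templates with infeasible templates masked out."""
    masked = np.where(feasible, logits, -np.inf)   # forbid templates that cannot fire
    probs = np.exp(masked - masked.max())          # numerically stable softmax
    return probs / probs.sum()

# Example: three templates, only the first and last match the current molecule.
print(masked_policy(np.array([1.0, 2.0, 0.5]), np.array([True, False, True])))
```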


[0131] Reward Distribution

[0132] In the framework of the present technology, the separation between the agent 320 and the environment 340 makes it possible to maintain property-focused rewards that guide optimization while ensuring chemical constraints are met via environment design. In one or more embodiments, a deterministic reward function may be used based on the property to be optimized. In order to avoid artificially biasing the agent 320 towards greedy policies, intermediate rewards are removed and evaluative feedback is provided only at termination of an episode. It is contemplated that using an intermediate reward discounted by a decreasing function of the step t may improve learning efficiency. When molecules exceed the maximum number of heavy atoms (38), the agent observes a reward of zero.
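A minimal sketch of such a sparse, terminal-only reward is given below; property_score stands for any hypothetical property estimator (e.g. cLogP, QED or a DRD2 affinity model) and is not part of the disclosure:

```python
from rdkit import Chem

MAX_HEAVY_ATOMS = 38  # limit referred to above

def terminal_reward(mol: Chem.Mol, done: bool, property_score) -> float:
    """Zero at intermediate steps and for oversized molecules; otherwise the property score."""
    if not done or mol.GetNumHeavyAtoms() > MAX_HEAVY_ATOMS:
        return 0.0
    return float(property_score(mol))
```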

[0133] The reward function may be expressed using equation (3):

R_a(s, s') = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' ]     (3)

[0134] where R_a is the expected reward received after the transition from state s to state s' due to action a.

[0135] Transition Model

[0136] In the template-based framework, state transitions are deterministic, with the next state fully determined by the choice of reaction and the subsequent reactant-product selection. When modifying the reactant-selection policy, either via a stochastic heuristic such as an epsilon-greedy reactant selection, or via learned hierarchical policies, state transitions over the higher-level actions A0 become stochastic according to the internal policy’s dynamics.

[0137] The transition function may be expressed using equation (4):

P_a(s, s') = Pr( s_{t+1} = s' | s_t = s, a_t = a )     (4)

[0138] where P_a(s, s') is the probability that action a in state s at time t will lead to state s' at time t + 1.

[0139] The current state 315 may be an initial state 315 corresponding to an initial molecule, and may be generated or received based on one or more indications of molecules 240 in the database 230.

[0140] During the MDP, the learning agent 320 interacts with the environment 340 at a discrete time scale t = 0, 1, 2, .... At each time step t, the agent 320 perceives the state of the environment 340 and, based on the state of the environment s_t, the agent 320 chooses a primitive action 325 a_t, which is a chemical transformation. In response to executing each action a_t, a new state 345, s_{t+1}, is produced and the corresponding reward 365, r_{t+1}, is computed by the property estimators 360.

[0141] Thus, at each time step, the agent 320 applies the action or chemical transformation to a molecule corresponding to the current state, where the chemical transformation comprises one of a deletion, an addition, and a modification of fragments of the molecule, which may introduce new functional groups to form a new molecule or state. It should be noted that a chemical transformation may include a chemical reaction that may be chosen randomly from a plurality of reactions in the database 230, or a control method, such as Q-learning, may be applied to choose the chemical transformation. In one or more embodiments, the chemical transformation may include a chemical reaction that may be chosen based on the initial molecule. In one or more embodiments, the number of steps may be limited by a threshold number of steps.
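The interaction described above may be sketched as a single environment step, assuming an RDKit reaction object for the chosen transformation and hypothetical callables for reactant selection, reward estimation and termination:

```python
def env_step(current_mol, reaction, pick_reactant, estimate_reward, is_terminal):
    """One transition of the chemical MDP: apply the chosen reaction template to the
    current molecule, complete it with a reactant, then score the resulting product."""
    reactant = pick_reactant(current_mol, reaction)          # e.g. random or Q-learning choice
    products = reaction.RunReactants((current_mol, reactant))
    next_mol = products[0][0]                                # product-selection heuristic
    done = is_terminal(next_mol)
    reward = estimate_reward(next_mol) if done else 0.0      # sparse, terminal-only feedback
    return next_mol, reward, done
```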

[0142] Policy Optimization

[0143] The control policy 324 of the agent 320 is a decision-making function mapping states to actions.

[0144] The property estimators 360 are configured to estimate properties of a state corresponding to a molecule based on a set of properties 270 and output a reward value. The property estimators 360 may execute one or more property estimation functions to estimate the properties of the states corresponding to molecules. In one or more embodiments, the property estimators 360 are configured to optimize for different properties of molecules simultaneously, such as one or more of: ADMET or ADME-Tox, liberation, bioavailability, ligand efficiency and lipophilic efficiency, potency at the biological target, solubility and the like. In one or more embodiments, the property estimators 360 may output a reward value for each of the properties, or may output a reward value calculated based on a scalarization of the molecular property vector. In one or more embodiments, the property estimators 360 may execute machine-learning algorithms such as neural networks to estimate properties of the molecules.
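As a non-limiting illustration, two common scalarization choices (linear and Chebyshev, compared in Figure 13) may be sketched as follows; the weight and utopia vectors are assumptions of this sketch:

```python
import numpy as np

def linear_scalarization(properties: np.ndarray, weights: np.ndarray) -> float:
    """Weighted sum of the molecular property vector."""
    return float(np.dot(weights, properties))

def chebyshev_scalarization(properties: np.ndarray, weights: np.ndarray,
                            utopia: np.ndarray) -> float:
    """Negative weighted Chebyshev distance to a reference (utopia) point."""
    return float(-np.max(weights * np.abs(utopia - properties)))
```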

[0145] The critic 328 receives the reward value for a given molecule and computes an estimate of the expected return. This estimate is used to stabilize updates to the control policy 324. The critic 328 is an estimator trained to estimate state-action values using error/corrections.

[0146] Thus, the objective of the agent 320 is to learn a control policy 324 in the form of a Markov policy π, which is a mapping from states to probabilities of taking each primitive action, that maximizes the expected discounted future reward from each state s. In discounted problems, the state-value function of a policy π is defined as the expected return following the policy, expressed using equation (6):

V^π(s) = E_π[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... | s_t = s ]     (6)
[0147] The action-value function, i.e. the expected return starting from state s, taking action a and thereafter following policy π, may be expressed using equation (7):

Q^π(s, a) = E_π[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... | s_t = s, a_t = a ]     (7)

[0148] where γ ∈ [0, 1] is a discount factor. The discount factor weights the future rewards in the value function.
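For illustration, the discounted return whose expectation appears in equations (6) and (7) may be computed over a finished episode as follows:

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * r_{t+2} + ... computed backwards over one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

print(discounted_returns([0.0, 0.0, 1.0]))  # sparse terminal reward propagated back
```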

[0149] It will be appreciated that several approaches exist for learning a policy that maximizes this quantity. In value-based approaches, Q-values of the form Q(s, a) are trained to estimate the scalar value of action-value pairs as estimates of the expected return. A policy is then derived from these values through strategies such as ε-greedy control. Alternatively, policy-based approaches attempt to parameterize the agent's behavior directly, for example through a neural network, to produce π_θ(a | s). The framework of the present technology is agnostic with regard to the specific algorithm used for learning. Developers of the present technology have validated the approach using the actor-critic architecture, which combines the benefits of learning a policy directly using a policy network π_θ(a | s) with a variance-reducing value network V_φ(s). The Advantage Actor-Critic (A2C) objective function at time t is given by equation (8):

J_t(θ) = log π_θ(a_t | s_t) ( R_t − V_φ(s_t) ) + β H( π_θ(· | s_t) )     (8)
[0150] Maximizing the first term of equation (8) involves adjusting the policy parameters so that actions with high expected return are assigned a high probability, while the second term serves as an entropy regularizer preventing the policy from converging too quickly to sub-optimal deterministic policies or from mode collapsing.
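A minimal per-step rendering of this objective (negated into a loss and combined with a critic regression term, as is conventional for actor-critic training) is sketched below using PyTorch; the coefficients and dummy values are assumptions of this sketch:

```python
import torch

def a2c_loss(log_prob, entropy, value, ret, entropy_coef=0.01, value_coef=0.5):
    """Policy-gradient term weighted by the advantage, entropy bonus, and critic fit."""
    advantage = (ret - value).detach()        # advantage from the variance-reducing critic
    policy_loss = -(log_prob * advantage)     # maximize log-probability of high-return actions
    entropy_loss = -entropy_coef * entropy    # discourage premature deterministic policies
    value_loss = value_coef * (ret - value).pow(2)
    return (policy_loss + entropy_loss + value_loss).mean()

# Example with dummy scalars for a single step.
print(a2c_loss(torch.log(torch.tensor(0.4)), torch.tensor(1.2),
               torch.tensor(0.2), torch.tensor(1.0)))
```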

[0151] In one or more embodiments, the molecule generation procedure 300 is configured to use deep reinforcement learning algorithms, such as, but not limited to: actor-critic architecture, option-critic architecture, and the like.

[0152] Figure 4A depicts a schematic diagram of an actor-critic architecture 400 of the molecule generation procedure 300 in accordance with non-limiting embodiments of the present technology.

[0153] Each episode is initialized with a molecular building block to obtain an initial state 315. At each step, the current state is converted to its fingerprint representation in the form of a feature vector, and the policy model selects a transformation 420 comprising a reaction to be performed. A reactant selection heuristic completes the reaction to generate the product state 345 or next state 345 in the episode, while a reward of 0 is returned. Instead, if the terminal action is selected, the current state is considered as the terminal state 345 comprising the final molecule and its reward is used to update the policy’s parameters.

[0154] Figure 4B depicts a schematic diagram of an option-critic architecture 450 during the molecule generation procedure 300 in accordance with one or more nonlimiting embodiments of the present technology.

[0155] The option-critic architecture 450 is a hierarchical reinforcement learning framework where options are higher-level policies over lower-level policies, which generalize primitive actions to include temporally extended courses of action, i.e. closed-loop policies for taking actions over a period of time. The option-critic architecture 450 enables the agent 320 to discover options autonomously while interacting with the environment 340, where options are not predetermined. A policy gradient method is used to find an optimal policy by performing stochastic gradient descent to optimize a performance objective over a given family of parametrized stochastic policies.

[0156] In one or more embodiments of the molecule generation procedure 300, the option-critic architecture 450 is used to select transformations 420 comprising chemical reactions, and associated reactants 430.

[0157] A Markovian option \omega is a triple (I_{\omega}, \pi_{\omega}, \beta_{\omega}), where I_{\omega} \subseteq S is an initiation set which contains all the states from which the option can start, \pi_{\omega} is an intra-option policy, which is a policy specific to the option, and \beta_{\omega} : S \rightarrow [0, 1] is a termination function which indicates whether the option terminates at a given state.

[0158] It is assumed that all options are available in every state, i.e. \forall s \in S, \forall \omega \in \Omega : s \in I_{\omega}.


[0159] An option \omega = (I_{\omega}, \pi_{\omega}, \beta_{\omega}) is available in state s_t if and only if s_t \in I_{\omega}. If the option is taken, then actions are selected according to its intra-option policy \pi_{\omega} until the option terminates stochastically according to \beta_{\omega}. An option \omega is selected according to a policy over options \pi_{\Omega}, where \Omega is the set of possible options.


[0160] In one or more embodiments, the option \omega corresponds to a transformation 420 comprising a reaction, and its intra-option policy \pi_{\omega} selects a respective set of reactants 430.
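For illustration only, the triple above could be represented as follows in Python; the class and field names are hypothetical and not part of the claimed system.

    from dataclasses import dataclass
    from typing import Any, Callable

    State = Any     # a molecule / fingerprint representation
    Action = Any    # a primitive action, e.g. selecting a reactant

    @dataclass
    class MarkovianOption:
        """An option omega = (I_omega, pi_omega, beta_omega) as defined in paragraph [0157]."""
        initiation_set: Callable[[State], bool]          # I_omega: can the option start in this state?
        intra_option_policy: Callable[[State], Action]   # pi_omega: action selection while the option runs
        termination_fn: Callable[[State], float]         # beta_omega: probability of terminating in a state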

[0161] During the execution of an option, an action a_t is selected according to the probability distribution \pi_{\omega}(\cdot \mid s_t). The environment 340 then makes a transition to state s_{t+1}, where the option either terminates, with probability \beta_{\omega}(s_{t+1}), or else continues, determining a_{t+1} according to \pi_{\omega}(\cdot \mid s_{t+1}), possibly terminating in s_{t+2} according to \beta_{\omega}(s_{t+2}), and so on to the next state 345. When the option terminates and the terminal state 465 is reached for the initial molecule 315, the agent 320 has the opportunity to select another option.

[0162] The initiation set and termination condition of an option restrict its range of application, which limits the range over which the option policy is defined.

[0163] In other words, an option \omega is initiated at some time t, which determines the actions selected for k steps, and the option terminates at s_{t+k}. At each intermediate time t \leq \tau < t + k, the decisions of a Markov option depend only on s_{\tau}, whereas the decisions of a semi-Markov option may depend on the entire preceding sequence or history from t to \tau. For each state s corresponding to a molecule, there is a set of available options, corresponding to reactions 420 and reactants 430, implicitly defined by the initiation sets. Each primitive action a corresponds to an option \omega that is available whenever a is feasible. The choice of the agent 320 at each time step is entirely among options, which can be temporally extended.

[0164] The termination condition of an option is based on the number of reactants that are required to run the corresponding reaction. In contrast, the termination condition of the complete optimization process may be based, as a non-limiting example, on a threshold number of steps and on threshold values in the set of properties for which the molecules are optimized.

[0165] The state-option value function is expressed using equation (9):

Q_{\Omega}(s, \omega) = \sum_{a} \pi_{\omega}(a \mid s)\, Q_{U}(s, \omega, a) \qquad (9)

[0166] where Q_{U}(s, \omega, a) is the state-option-action value function, i.e. the value of executing an action or chemical transformation in the context of a state-option pair, expressed using equation (10):

Q_{U}(s, \omega, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(\omega, s') \qquad (10)

[0167] The utility term, or option-value function upon arrival, which determines the value of executing option \omega upon entering state s', is expressed using equation (11):

U(\omega, s') = \left(1 - \beta_{\omega}(s')\right) Q_{\Omega}(s', \omega) + \beta_{\omega}(s')\, V_{\Omega}(s') \qquad (11)

[0168] The state value function (the maximum of Q over options), which, when a greedy policy is used in option selection, represents the total discounted return expected when starting in state s' and performing optimal options, is expressed using equation (12):

V_{\Omega}(s') = \max_{\omega'} Q_{\Omega}(s', \omega') \qquad (12)

[0169] The advantage function, which represents the degree to which the expected total discounted reinforcement is increased by performing option \omega relative to the option considered the best, is expressed using equation (13):

A_{\Omega}(s', \omega) = Q_{\Omega}(s', \omega) - V_{\Omega}(s') \qquad (13)

[0170] If option \omega has been initiated or is executing at time t in state s_t, the probability of transitioning to (s_{t+1}, \omega_{t+1}) in one step is expressed using equation (14):

P(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t) = \sum_{a} \pi_{\omega_t}(a \mid s_t)\, P(s_{t+1} \mid s_t, a) \left[ \left(1 - \beta_{\omega_t}(s_{t+1})\right) \mathbf{1}_{\omega_{t+1} = \omega_t} + \beta_{\omega_t}(s_{t+1})\, \pi_{\Omega}(\omega_{t+1} \mid s_{t+1}) \right] \qquad (14)
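A minimal numerical sketch of equations (9) and (11)-(13) is given below in Python; the tabular arrays and their shapes are hypothetical and serve only to make the recursions concrete.

    import numpy as np

    def option_values(pi_w, q_u, beta, q_omega):
        """Tabular illustration of equations (9) and (11)-(13).
        pi_w:    [n_options, n_states, n_actions] intra-option policies pi_omega(a|s)
        q_u:     [n_states, n_options, n_actions] state-option-action values Q_U(s, omega, a)
        beta:    [n_options, n_states]            termination probabilities beta_omega(s)
        q_omega: [n_states, n_options]            current state-option values Q_Omega(s, omega)"""
        # Equation (9): Q_Omega(s, omega) = sum_a pi_omega(a|s) Q_U(s, omega, a)
        q_omega_new = np.einsum('wsa,swa->sw', pi_w, q_u)
        # Equation (12): V_Omega(s) = max_omega Q_Omega(s, omega)
        v_omega = q_omega.max(axis=1)
        # Equation (11): U(omega, s) = (1 - beta_omega(s)) Q_Omega(s, omega) + beta_omega(s) V_Omega(s)
        u = (1.0 - beta.T) * q_omega + beta.T * v_omega[:, None]
        # Equation (13): A_Omega(s, omega) = Q_Omega(s, omega) - V_Omega(s)
        advantage = q_omega - v_omega[:, None]
        return q_omega_new, u, advantage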

[0171] Each intra-option policy is learned by policy gradients, which increase the probability of choosing good actions that lead to reward; this is done for every option independently. The gradient for the intra-option policy at a given step is expressed using equation (15):

g_t = \frac{\partial \log \pi_{\omega, \theta}(a_t \mid s_t)}{\partial \theta}\, Q_{U}(s_t, \omega_t, a_t) \qquad (15)

[0172] Given a set of Markov options with stochastic intra-option policies differentiable in their parameters \theta, the gradient of the expected discounted return with respect to \theta and initial condition (s_0, \omega_0) is expressed using equation (16):

\sum_{s, \omega} \mu_{\Omega}(s, \omega \mid s_0, \omega_0) \sum_{a} \frac{\partial \pi_{\omega, \theta}(a \mid s)}{\partial \theta}\, Q_{U}(s, \omega, a) \qquad (16)

[0173] where \mu_{\Omega}(s, \omega \mid s_0, \omega_0) is a discounted weighting of state-option pairs along trajectories starting from (s_0, \omega_0).

[0174] The gradient describes the effect of a local change at the primitive level on the global expected discounted return.

[0175] In the option-critic framework, each of the options, which contain \pi_{\omega}(a \mid s) and \beta_{\omega}, is treated as an actor, and the critic 328, which contains Q_{U}(s, \omega, a) and A_{\Omega}(s, \omega), sends gradient information to the actors.


[0176] Thus, during the molecule generation procedure 300 using the option-critic architecture 450, the agent 320 picks an option \omega according to its control policy 324 over options \pi_{\Omega} and follows the intra-option policy \pi_{\omega} until termination, as dictated by \beta_{\omega}, at which point the procedure is repeated. The intra-option policy \pi_{\omega}(a \mid s) of option \omega is parameterized by \theta, and the termination function \beta_{\omega} is parameterized by a separate set of parameters.

[0177] It is contemplated that the option and primitive action can be collapsed together into a single decision-making process (action selection), resulting in a non-hierarchical model.
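As a non-limiting illustration, a minimal sketch of this hierarchical selection loop in Python is shown below; the policy-over-options, option objects and environment helpers are hypothetical placeholders consistent with the MarkovianOption sketch given earlier.

    import random

    def run_option_critic_episode(initial_state, policy_over_options, options, env_step, is_terminal, max_steps=20):
        """Roll out one episode under the option-critic setting of Figure 4B:
        pick an option (a reaction) with the policy over options, follow its intra-option
        policy (reactant selection) until beta_omega terminates it, then repeat."""
        state = initial_state
        for _ in range(max_steps):
            option = policy_over_options(state, options)       # pi_Omega: select a transformation 420
            while True:
                action = option.intra_option_policy(state)     # pi_omega: select a reactant 430
                state = env_step(state, option, action)        # environment transition to the next state
                if random.random() < option.termination_fn(state):
                    break                                      # option terminates according to beta_omega
            if is_terminal(state):
                break                                          # termination condition of the whole episode
        return state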

Training Procedure

[0178] During the training procedure, the agent 320 is trained to learn an optimal control policy 324 based on a set of training objects acquired from the database 230. During training, the number of episodes required to learn the policy, as well as the parameters for the control policy network and, in one or more embodiments, the critic network, may be specified by an operator. It will be appreciated that one or more neural networks may be used to learn the policy.

[0179] Each training object comprises an initial molecule corresponding to an initial state, a final molecule corresponding to a final state, and a set of intermediate molecules corresponding to intermediate states, with reactants used at each of the states. Each final molecule is associated with a set of properties, which are indicative of pharmacological activity of a molecule. Thus, each training object details the synthesis path of an initial molecule to a final molecule which has known pharmacological activity.
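For illustration, a training object as described above might be represented as follows in Python; the class and field names are hypothetical.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class TrainingObject:
        """One training example: a known synthesis path from an initial molecule to a
        final molecule with known pharmacological activity (paragraph [0179])."""
        initial_molecule: str                                            # SMILES of the initial state
        intermediate_molecules: List[str] = field(default_factory=list)  # intermediate states
        reactants: List[str] = field(default_factory=list)               # reactant used at each step
        final_molecule: str = ""                                         # SMILES of the final state
        properties: Dict[str, float] = field(default_factory=dict)       # pharmacological properties of the final molecule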

[0180] As a non-limiting example, the agent 320 may be trained on a set of training objects comprising a thousand training objects. It will be appreciated that the number of training objects is not limited and may comprise more or less than one thousand training objects.

Multi-Objective Optimization

[0181] The control-policy 324 of agent 320 may be optimized over several properties indicative of pharmacological activity of a molecule. The properties may include: absorption, distribution, metabolism, excretion, and toxicity (ADMET or ADME-Tox), liberation, toxicity (median lethal dose LD50, therapeutic index), bioavailability, ligand efficiency and lipophilic efficiency, potency at the biological target, solubility, Lipinski rule of five, and Ghose filter.

[0182] Figure 6 depicts a flowchart of a method 600 of generating a molecule using an option-critic framework in a Markov decision process in accordance with nonlimiting embodiments of the present technology.

[0183] In one or more embodiments, the method 600 is executed within the system 200 of Figure 2 by the server 220.

[0184] In one or more embodiments, the server 220 comprises a processing device such as the processor 110 and/or the GPU 111 and a non-transitory computer readable storage medium such as the solid-state drive 120 and/or the random-access memory 130 storing computer-readable instructions. The processor 110, upon executing the computer-readable instructions, is configured to or operable to execute the method 600.

[0185] The server 220 may access the agent 320, the environment 340, and the property estimators 360 to execute the method 600.

[0186] The agent 320 has been trained during a training procedure such that the control-policy 324 of the agent 320 is optimized over several properties indicative of pharmacological activity of a molecule.

Method Description

[0187] The method 600 begins at step 602. In another alternative embodiment, the method 600 may begin at step 604.

[0188] STEP 602: receiving an indication of a molecule

[0189] At step 602, the server 220 receives, from the database 230, an indication of a molecule to which one or more chemical transformations will be applied to generate a final molecule having pharmacological properties. In one or more embodiments, the server 220 receives the indication from an electronic device (not illustrated) communicatively coupled to the server 220 via the communications network 250.

[0190] In one or more embodiments, the indication of the molecule is sampled randomly from a set of molecules matching at least one reaction template in the database 230.

[0191] The method 600 advances to step 604.

[0192] STEP 604: generating an initial state

[0193] At step 604, the server 220 generates, based on the indication of the molecule, the initial state 315. The initial state 315 corresponds to an initial molecule.

[0194] In one or more embodiments, the server 220 generates the initial state by generating and/or receiving a feature vector for the indication of the molecule. In one or more embodiments, the server 220 uses molecular fingerprints such as Morgan fingerprints to obtain the feature vector from which the initial state 315 is generated.
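A minimal sketch of obtaining such a feature vector with RDKit is shown below; the radius and bit-vector length are hypothetical choices, not values prescribed by the present technology.

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def initial_state_from_smiles(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
        """Convert a molecule (SMILES) into a Morgan-fingerprint feature vector used as the state."""
        mol = Chem.MolFromSmiles(smiles)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        features = np.zeros((n_bits,))
        DataStructs.ConvertToNumpyArray(fp, features)
        return features

    # Example usage:
    # state = initial_state_from_smiles("c1ccccc1O")  # phenol as a hypothetical building block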

[0195] The method 600 advances to step 606.

[0196] STEP 606: selecting an action to apply on the initial state, the action comprising a chemical transformation

[0197] At step 606, the server 220 accesses the database 230 to select an action, the action comprising a given one of a plurality of chemical transformations 250. In one or more embodiments, the one of the plurality of chemical transformations 250 comprises one or more chemical reactions. In one or more embodiments, the given one of the plurality of chemical transformations 250 is selected based on the initial state 315.

[0198] In one or more other embodiments, the server 220 accesses the intra-option policy of the option to select at least one reactant in the plurality of reactants 260 stored in the database 230.

[0199] In one or more embodiments, where an option architecture is adopted, based on at least one of the initial state 315 and the action, the server 220 selects an option corresponding to a transformation of the plurality of transformations 250. The server 220 accesses the intra-option policy of the option to select at least one reactant in the plurality of reactants 260 stored in the database 230.

[0200] In one or more embodiments, the server 220 selects the intra-option policy of the option to perform additional primitive actions required for the execution of the option. In one or more embodiments, the primitive action corresponds to the selection of at least one reactant in the plurality of reactants 260 stored in the database 230.

[0201] In one or more embodiments, the server 220 executes step 606 a plurality of times to select options corresponding to reactions and reactants and obtain a new state or molecule until a termination condition is reached.

[0202] The method 600 advances to step 608.

[0203] STEP 608: generating, based on the initial state and the action, the product state

[0204] At step 608, the server 220 generates, based on the initial state 315 and the action comprising the transformation, the product state 345, the product state 345 corresponding to a new molecule. The server 220 generates the product state 345 by applying the selected action 325 comprising the transformation on the initial state 315 to obtain the product state 345.
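A minimal sketch of generating a product state by applying a reaction template with RDKit is shown below; the SMARTS template in the example (an amide coupling) is purely illustrative and is not asserted to be one of the plurality of chemical transformations 250.

    from rdkit import Chem
    from rdkit.Chem import AllChem

    def apply_transformation(state_smiles: str, reaction_smarts: str, reactant_smiles: str) -> str:
        """Apply a reaction template to the current state and a selected reactant,
        returning the SMILES of the (first) product state."""
        rxn = AllChem.ReactionFromSmarts(reaction_smarts)
        state = Chem.MolFromSmiles(state_smiles)
        reactant = Chem.MolFromSmiles(reactant_smiles)
        products = rxn.RunReactants((state, reactant))  # tuple of candidate product sets
        product = products[0][0]                        # a selection heuristic could be used here instead
        Chem.SanitizeMol(product)
        return Chem.MolToSmiles(product)

    # Example usage (illustrative amide coupling of an acid state with an amine reactant):
    # smiles = apply_transformation(
    #     "CC(=O)O", "[C:1](=[O:2])[OH].[N:3]>>[C:1](=[O:2])[N:3]", "NCC")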

[0205] In one or more embodiments, the server 220 determines a reward value of the product state 345 using a reward function. The reward function may be a deterministic reward function.

[0206] In one or more embodiments, the server 220 determines the reward value based on the set of properties 270. The set of properties 270 may comprise one or more of: an absorption distribution metabolism and excretion (ADME), an ADME-toxicity, a liberation, a bioavailability, a ligand efficiency, a lipophilic efficiency, a potency at a biological target, and a solubility.

[0207] In one or more embodiments, the reward value is calculated by scalarizing a molecular property vector of the product state 345. In one or more embodiments, the molecular property vector comprises the set of properties 270. In one or more other embodiments, the molecular property vector may be obtained from the database 230.

[0208] In one or more embodiments, the property estimators 360 estimate properties of the product state 345 based on a set of properties 270 and transmit a reward value to the critic 328. The critic 328 estimates the expected return and transmits the expected return to the control policy 324.

[0209] The method 600 advances to step 610 or to step 612.

[0210] STEP 610: in response to the product state not corresponding to a terminal state:

selecting an other action to apply on the product state, the other action comprising another chemical transformation; and

generating, based on the other action and the product state, another product state

[0211] At step 610, the server 220 determines if the product state 345 corresponds to a terminal state, and in response to the product state 345 not corresponding to the terminal state, the server 220 accesses the database 230 to select an action, the action corresponding to another one of a plurality of transformations 250. In one or more embodiments, the other action is selected based on the product state 345.

[0212] In one or more embodiments, based on at least one of the product state 345 and the action, the server 220 selects another action comprising another chemical transformation. The server 220 accesses the intra-option policy of the option to select at least one reactant in the plurality of reactants 260 stored in the database 230. The server 220 applies the other action on the product state 345, the other action comprising a missing reactant, to obtain an other product state.

[0213] The server 220 determines a reward value of the other product state as described above.

[0214] In one or more embodiments, the server 220 executes step 610 a plurality of times to generate further product states from the current product state and determines a reward value of the further product states until reaching a terminal state.

[0215] In one or more embodiments, the server 220 executes step 610 a plurality of times to select options corresponding to reactions and reactants and obtain a new product state or molecule until a termination condition is reached.

[0216] In one or more embodiments, the method 600 advances to step 612.

[0217] STEP 612: in response to the product state corresponding to the terminal state:

outputting a synthetically accessible molecule

[0218] At step 612, the server 220 determines if the product state 345 corresponds to a terminal state, i.e. a final molecule that is synthetically accessible and which may be potentially pharmacologically active. In one or more embodiments, the server 220 determines that the product state corresponds to a terminal state based on the properties of the product state 345.

[0219] The server 220 determines that the product state 345 corresponds to a terminal state after a termination condition is reached. In one or more embodiments, the termination condition may be reached after a predetermined number of steps. In one or more embodiments, the termination condition may be reached if the set of properties of the terminal state 345 estimated by the property estimators 360 are above a predetermined threshold.
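For illustration, a minimal sketch of such a termination check is shown below in Python; the step limit, property names and threshold values are hypothetical.

    from typing import Dict

    def is_terminal(step: int, properties: Dict[str, float], max_steps: int = 10,
                    thresholds: Dict[str, float] = None) -> bool:
        """Terminal if a predetermined number of steps is reached, or if every estimated
        property of the current state meets its predetermined threshold."""
        thresholds = thresholds or {"predicted_activity": 0.5, "qed": 0.6}  # hypothetical values
        if step >= max_steps:
            return True
        return all(properties.get(name, 0.0) >= value for name, value in thresholds.items())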

[0220] The server 220 outputs a synthetically accessible molecule.

[0221] The method 600 then ends.

[0222] Thus, by executing method 600, the server 220 may obtain the molecular synthesis routes in terms of reactions, reactants and intermediate molecules for generating a potentially pharmacologically active molecule from an initial molecule based on the set of properties for which the policies of the agent 320 have been optimized.

[0223] It should be expressly understood that not all technical effects mentioned herein need to be enjoyed in each and every embodiment of the present technology. For example, embodiments of the present technology may be implemented without the user enjoying some of these technical effects, while other non-limiting embodiments may be implemented with the user enjoying other technical effects or none at all.

[0224] Some of these steps and signal sending-receiving are well known in the art and, as such, have been omitted in certain portions of this description for the sake of simplicity. The signals can be sent-received using optical means (such as a fiber-optic connection), electronic means (such as using wired or wireless connection), and mechanical means (such as pressure-based, temperature based or any other suitable physical parameter based).

[0225] Modifications and improvements to the above-described implementations of the present technology may become apparent to those skilled in the art. The foregoing description is intended to be exemplary rather than limiting.

[0226] Experimental Results

[0227] To validate the framework of the present technology labeled REACTion-driven Objective Reinforcement (REACTOR) herein below, its performance has been benchmarked on goal-directed design tasks, focusing primarily on predicted activity for the D2 Dopamine Receptor. To maintain consistency with experiments done in prior work, additional experiments have been performed on penalized cLogP and QED, with the results presented in Table 3 and Figures 12 and 13. In order to better understand the exploration behaviour of the present approach, the nature of the trajectories generated by the REACTOR policies have been investigated showing that policies retain drug-likeness across all optimization objectives, while also exploring distinct regions of chemical space.

[0228] Experimental Setup Baselines

[0229] The present approach has been compared to two recent methods in deep generative molecular modelling, i.e. the JT-VAE method and the ORGAN method [14, 15]. Each of these approaches was first pre-trained for up to 48h on the same compute facility, a single machine with 1 NVIDIA Tesla K80 GPU and 16 CPU cores. Property optimization was then performed using the same procedures as described in the original papers. The present approach has also been compared with two state-of-the-art reinforcement learning approaches, i.e. the Graph-Convolutional Policy Networks approach and the MolDQN approach. Each algorithm was run using the open-sourced code from the authors, and the same reward function implementation was enforced across methods to ensure consistency. GCPN was run using 32 CPU cores for approximately 24 hours (against 8 hours in the original paper), and MolDQN for 20000 episodes (against 5000 episodes in the original paper). In addition, a steepest-ascent hill-climbing baseline using the REACTOR environment has been added to demonstrate that for simple, mostly greedy objectives such as cLogP and QED, simple search policies may provide reasonable performance. In contrast, learned traversals of chemical space become necessary for complex tasks such as DRD2.

[0230] Evaluation

[0231] Given the inherent differences between generative and reinforcement learning models, evaluation was adapted to remain consistent within each class of algorithms. JT-VAE and ORGAN were evaluated based on decoded samples from their latent space, using the best results across training checkpoints, with statistics for JT-VAE computed over 3 random seeds. Given the prohibitive cost of training ORGAN, results are given over a single seed. Other baselines were compared based on three sets of 100 building blocks used as starting states. Statistics are reported over sets, while the statistics of the initial states are shown by BLOCKS. Evaluation of each method was prioritized based on mean reward, given that this corresponds to the underlying optimization objectives for reinforcement learning methods. This is denoted by Activity in Table 1, which corresponds to the percentage of generated molecules which are predicted active for the DRD2 receptor. In both Table 1 and Table 3, mean reward was computed based on the set of unique molecules generated by each algorithm, in order to avoid artificially favouring methods which often generate the same molecule. Diversity corresponds to the average pairwise Tanimoto distance among generated molecules, while "Scaff Similarity" corresponds to the average pairwise similarity between the scaffolds of the compounds, as known in the art. Finally, the number of atoms was limited to 38 for all single-objective tasks, and to 50 for the multi-objective tasks.

[0232] Goal-Directed De Novo Design

[0233] Results on the unconstrained design task show that REACTOR achieves the highest mean reward for the DRD2 objective. We also observe that REACTOR maintains high diversity and uniqueness in addition to robust performance. This characteristic implies that the agent is able to optimize the space surrounding each starting molecule, without reverting to the same molecule to optimize the scalar reward signal. In Table 3, REACTOR also achieves higher reward on QED, while remaining competitive on penalized cLogP despite the simplistic nature of this objective favouring atom-by-atom transitions. Training efficiency is an important practical consideration when deploying methods for de novo design. Generative models first require learning a mapping of molecules to the latent space before training for property optimization. During the experiments, this resulted in more than 48h of training time. Reinforcement learning methods trained faster, but generally failed to converge within 24 hours. In contrast, as shown in Figure 12, the present approach converges within approximately two hours of training.

[0234] Synthetic Tractability and Desirability of Optimized Compounds

[0235] In one embodiment, given the narrow perspective offered by at least some quantitative benchmarks for molecular design models, it is equally important to qualitatively assess the behaviour of these models by examining generated compounds. Figure 8 provides samples generated by each RL method across all objectives. Since the computational estimation of cLogP relies on the Wildman-Crippen method, which assigns a high atomic contribution to halogens and phosphorus, the atom-based action space of MolDQN produces samples that are heavily biased towards these atoms, resulting in molecules that are well optimized for the task but neither synthetically accessible nor drug-like. This generation bias was not observable in previously reported benchmarks, where atom types were limited to carbon, oxygen, nitrogen, sulfur and halogens. Furthermore, MolDQN samples for the DRD2 task lack a ring system, and while molecules from GCPN have one, none adequately optimizes for the objective.

[0236] In contrast, REACTOR appears to produce more pharmacologically desirable compounds, without explicitly considering this as an optimization objective. This is supported by Figure 7, which shows that REACTOR is the only approach able to simultaneously solve the DRD2 task while maintaining favourable distributions for synthetic-accessibility and drug-likeness.

[0237] Further, as shown in Figures 5A-5B and Figure 11, optimized compounds are provided along with a possible route of synthesis. While such trajectories may not be optimal, given that they are limited by the reward design, the set of reaction templates used, their specificity, as well as the availability and cost of reactants, they provide a crucial indication of synthesizability. In Wenhao Gao and Connor W. Coley, "The synthesizability of molecules proposed by generative models", arXiv:2002.07007 [cs, q-bio, stat], Feb. 2020, URL http://arxiv.org/abs/2002.07007, the authors detail the lack of consideration for synthetic tractability in current molecular optimization approaches, highlighting that this is a necessary requirement for application of these methods in drug discovery. While alternate ideas aiming to embed synthesizability constraints into generative models have recently been explored, REACTOR is the first approach which explicitly addresses synthetic feasibility by optimizing directly in the space of synthesizable compounds using reinforcement learning.

[0238] Multi-Objective Optimization

[0239] Practical methods for computational drug design must be robust to the optimization of multiple properties. Indeed, beyond the agonistic or antagonistic effects of a small molecule, properties such as the selectivity, solubility, drug-likeness and permeability of a drug candidate must be considered. To validate the REACTOR framework under this setting, the task of optimizing for selective DRD2 ligands has been considered. Dopamine receptors are grouped into two classes: D1-like receptors (DRD1, DRD5) and D2-like receptors (DRD2, DRD3 and DRD4). Although these receptors are well studied, the design of drugs selective across subtypes remains a considerable challenge. In particular, as DRD2 and DRD3 share 78% structural similarity in their transmembrane region, it is very challenging to identify small molecules that can selectively bind to and modulate their activity. The performance was assessed both on selectivity across classes (using DRD1 as off-target) and within classes (using DRD3 as off-target), as well as on the robustness of the framework as the number of design objectives increases. For these experiments, the comparison has been made with MolDQN, as it strongly outperforms the other methods on the single-objective tasks. Its training was increased to 25,000 episodes and reward scalarization was used to combine multiple objectives. Formally, a vector of reward signals is aggregated via a mapping f : \mathbb{R}^{m} \rightarrow \mathbb{R}, thus collapsing the multi-objective MDP into a standard MDP formulation. While the simplest and most common approach to scalarization is to use a weighted sum of the individual reward signals, a Chebyshev scalarization scheme was used, whereby reward signals are aggregated via the weighted Chebyshev metric:

f(\mathbf{r}) = -\max_{i} w_i \left| r_i - z_i^{*} \right|

[0240] where \mathbf{z}^{*} is a utopian vector, \mathbf{w} assigns the relative preferences for each objective, and i indexes the objectives. For the experiments, binary rewards were considered, such that the utopian point is always \mathbf{z}^{*} = \mathbf{1}, rendering the dynamics of each reward signal more similar, and uniform preferences were assigned to the objectives. While Chebyshev scalarization was introduced for tabular reinforcement learning, it may be interpreted in the function approximation setting as defining an adaptive curriculum, whereby the optimization focus shifts dynamically according to the objective most distant from \mathbf{z}^{*}. In Figure 13, it can be appreciated that this approach for combining reward signals is significantly more robust as the number of objectives increases.
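A minimal sketch of this Chebyshev scalarization in Python is shown below; the uniform weights and the all-ones utopian point mirror the experimental setting described above, while the function name is merely illustrative.

    import numpy as np

    def chebyshev_scalarize(rewards, weights=None, utopia=None) -> float:
        """Aggregate a vector of reward signals with the weighted Chebyshev metric:
        the scalar reward is the negated largest weighted distance to the utopian point."""
        rewards = np.asarray(rewards, dtype=float)
        weights = np.full_like(rewards, 1.0 / rewards.size) if weights is None else np.asarray(weights, dtype=float)
        utopia = np.ones_like(rewards) if utopia is None else np.asarray(utopia, dtype=float)
        return float(-np.max(weights * np.abs(rewards - utopia)))

    # Example usage with binary rewards for (DRD2 activity, off-target DRD1, off-target DRD3):
    # chebyshev_scalarize([1.0, 1.0, 0.0])  # the unmet third objective dominates the scalar reward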

[0241] DRD2 Selectivity

[0242] Rewards in Table 2 and Figure 13 are computed as the proportion of evaluation episodes for which the algorithms optimize all desired objectives. In Table 2, we find that REACTOR maintains strong performance on the selectivity tasks, optimizing for DRD2 while avoiding off-target activity on the D1 and D3 receptors. Further, it is able to outperform MolDQN while maintaining very low scaffold similarity among generated molecules.


[0243] Robustness to Multiple Objectives

[0244] In addition to off-target selectivity, the robustness of each method’s performance was assessed as the number of pharmacologically-relevant property objectives to optimize were increased. Specifically, the following combinations of rewards have been compared:

• DRD2 with range-targeted cLogP (2 objectives) according to a Ghose filter;

• DRD2, range cLogP, and a molecular weight ranging from 180 to 600 (3 objectives); and

• DRD2, range cLogP, target molecular weight, and drug absorption, as indicated by a model trained on data for the Caco-2 permeability assay (4 objectives).
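For illustration, a minimal sketch of binary rewards for the property objectives listed above is given below; the cLogP range shown (-0.4 to 5.6, per the Ghose filter) and the use of RDKit descriptors are assumptions, the absorption model is a hypothetical placeholder, and the DRD2 activity reward, which requires a trained activity model, is omitted.

    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors

    def binary_objective_rewards(smiles: str, absorption_model=None):
        """Return binary reward signals for range-targeted cLogP, molecular weight in
        [180, 600] and, if a model is supplied, predicted Caco-2 permeability (absorption)."""
        mol = Chem.MolFromSmiles(smiles)
        clogp = Crippen.MolLogP(mol)                       # Wildman-Crippen cLogP
        mw = Descriptors.MolWt(mol)
        rewards = [1.0 if -0.4 <= clogp <= 5.6 else 0.0,   # assumed Ghose cLogP range
                   1.0 if 180.0 <= mw <= 600.0 else 0.0]
        if absorption_model is not None:
            rewards.append(1.0 if absorption_model(mol) else 0.0)  # hypothetical permeability classifier
        return rewards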

[0245] Figure 9 shows that REACTOR demonstrates greater robustness to an increasing number of design objectives. Specifically, it maintains a global success rate of approximately 88% when optimizing for the 4 objective task. Success against DRD2 activity drops more sharply for MolDQN, while the uniqueness of generated molecules drops to below 20% for 3 and 4 objectives.

[0246] Goal-Directed Exploration

[0247] In order to gain further insight into the nature of the trajectories generated by the REACTOR agent, two alternative views of optimization routes generated for the same building block across each single-property objective were plotted. In Figure 6, a Principal Components Analysis (PCA) on the space of building blocks has been fitted to identify the location of the initial state, and the next states generated by the RL agent were subsequently transformed onto this space. It was found that the initial molecule is clearly shifted to distinct regions in space, while the magnitude of the transitions suggests efficient traversal of the space. This provides further evidence that exploration through space is a function of reward design, and is mostly unbiased by the data distribution of initialization states. Figure 11 shows the same trajectories with their corresponding reactions and intermediate molecular states. It was found that optimized molecules generally contain the starting structure, which is believed to be a desirable property given that real-life design cycles are often focused on a fixed scaffold or set of core structures. It should be noted that the policy learned by the framework is able to generalize over different starting blocks, suggesting that it achieves generation of structurally diverse and novel compounds.