Currently, chemical compound databases contain hundreds of millions of substances, and only a small fraction of them are used in medicinal drugs. Pharmacological methods of making drugs are generally incremental: new drugs tend to descend from existing ones. For example, pharmacologists might continue to research aspirin, which has already been in use for many years, perhaps adding something to the compound to reduce side effects or increase efficacy, yet the substance remains essentially the same.
The generative Adversarial Autoencoder (AAE) architecture, an extension of Generative Adversarial Networks, was taken as the basis, and compounds with known medicinal properties and effective concentrations were used to train the system. Information on these compounds was fed into the network, which was then tuned to reproduce the same data at its output. The network itself was made up of three structural elements: an encoder, a decoder and a discriminator, each of which had its own specific role in "cooperating" with the other two. The encoder worked with the decoder to compress and then restore information on the parent compound, while the discriminator helped make the compressed representation more suitable for subsequent recovery. Once the network had learned a wide swath of known molecules, the encoder and discriminator were "switched off", and the network generated descriptions of molecules on its own using the decoder.
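The division of labor among the three components can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, no gradient update is performed, and the vector sizes, layer shapes and function names are all hypothetical. It also assumes the standard adversarial autoencoder formulation, in which the discriminator learns to tell encoder outputs apart from samples drawn from a chosen prior distribution.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N_BITS = 8     # length of the toy input vector (assumed, for illustration)
N_LATENT = 2   # size of the compressed representation (assumed)


def random_matrix(rows, cols):
    """Untrained weights drawn from a small Gaussian."""
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


W_ENC = random_matrix(N_LATENT, N_BITS)
W_DEC = random_matrix(N_BITS, N_LATENT)
W_DISC = random_matrix(1, N_LATENT)


def encoder(bits):
    """Compress an input vector into a latent code."""
    return [math.tanh(v) for v in matvec(W_ENC, bits)]


def decoder(code):
    """Restore per-bit probabilities from a latent code."""
    return [sigmoid(v) for v in matvec(W_DEC, code)]


def discriminator(code):
    """Score how much a latent code looks like a sample from the prior."""
    return sigmoid(matvec(W_DISC, code)[0])


def reconstruction_loss(bits, probs):
    """Binary cross-entropy between the input and its reconstruction."""
    eps = 1e-9
    total = sum(
        b * math.log(p + eps) + (1 - b) * math.log(1 - p + eps)
        for b, p in zip(bits, probs)
    )
    return -total / len(bits)


# Training phase: all three parts are involved (losses only, no update step).
molecule = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical input vector
code = encoder(molecule)
recon = reconstruction_loss(molecule, decoder(code))
prior_sample = [random.gauss(0.0, 1.0) for _ in range(N_LATENT)]
disc_loss = -math.log(discriminator(prior_sample)) - math.log(1 - discriminator(code))

# Generation phase: encoder and discriminator are "switched off";
# a code is sampled from the prior and passed through the decoder alone.
z = [random.gauss(0.0, 1.0) for _ in range(N_LATENT)]
generated = [1 if p > 0.5 else 0 for p in decoder(z)]
print(generated)
```

In a real system each component would be a multi-layer network trained by gradient descent on both losses; the sketch only shows which component is used at which stage.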
Developing Generative Adversarial Networks that produce high-quality images based on textual inputs requires substantial expertise and lengthy training time on high-performance computing equipment. But with images and videos, humans can quickly perform quality control of the output. In biology, quality control cannot be performed by the human eye, and a considerable number of validation experiments will be required to confirm the quality of the generated molecules.
All the molecules are represented as "SMILES" strings, a text notation for chemical substances that allows their structure to be restored. The standard notation taught in schools is not suitable for network processing, but SMILES strings do not do the job very well either, as their length varies from one symbol to 200. Neural network training requires input vectors of equal length, which is why the network works with fixed-length molecular "fingerprints" instead.
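The contrast between variable-length SMILES strings and a fixed-length vector can be illustrated with a short Python sketch. The hashing scheme below is a toy stand-in invented for this example: it is not the fingerprint used in the study and carries no real chemical meaning (real fingerprints are computed from the molecular structure by cheminformatics software). The 64-bit length and the function name are likewise assumptions.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed fixed length, chosen for illustration


def toy_fingerprint(smiles, n_bits=FINGERPRINT_BITS, ngram=3):
    """Hash overlapping character n-grams of a SMILES string into n_bits bits.

    Toy stand-in only: it shows how strings of any length can map to a
    vector of constant length, not how chemical fingerprints are built.
    """
    bits = [0] * n_bits
    # Pad very short strings so that at least one n-gram exists.
    padded = smiles if len(smiles) >= ngram else smiles.ljust(ngram, "_")
    for i in range(len(padded) - ngram + 1):
        digest = hashlib.sha256(padded[i:i + ngram].encode("ascii")).digest()
        bits[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return bits


molecules = {
    "methane": "C",
    "ethanol": "CCO",
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
}

smiles_lengths = {name: len(s) for name, s in molecules.items()}
fingerprints = {name: toy_fingerprint(s) for name, s in molecules.items()}

for name in molecules:
    # The SMILES length differs per molecule; the fingerprint length does not.
    print(name, smiles_lengths[name], len(fingerprints[name]))
```

Whatever the length of the input string, every molecule ends up as a vector of the same size, which is what a network with a fixed input layer needs.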
Andrei Kazennov, one of the authors of the study and an MIPT postgraduate who works at Insilico Medicine, comments, "We've created a generative neural network, i.e. one capable of producing objects similar to what it was trained on. We ultimately taught this network model to create new fingerprints based on specified properties."
An anticancer drug database was used to check the network. First the network was trained on one half of the medicinal compounds, and then tested on the other half. The purpose was to predict compounds that were already known but not included in the training set. A total of 69 predicted compounds have been identified, and hundreds of molecules developed using a more powerful extension of the method are on the way.
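The hold-out check described above can be sketched as follows. The compound identifiers, the stand-in for the network's output and the function names are all hypothetical; the sketch only shows the bookkeeping of splitting the known compounds in half and counting how many held-out compounds reappear among the generated candidates.

```python
import random


def split_in_half(items, seed=0):
    """Shuffle a copy of the items and split it into two halves."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


def rediscovered(generated, held_out):
    """Return generated candidates that match known but unseen compounds."""
    return sorted(set(generated) & set(held_out))


known_compounds = [f"compound_{i:02d}" for i in range(10)]  # hypothetical IDs
train_half, test_half = split_in_half(known_compounds)

# Stand-in for the network's output after training on train_half only:
# two candidates happen to match held-out compounds, one is entirely new.
generated = [test_half[0], test_half[1], "novel_candidate"]

hits = rediscovered(generated, test_half)
print(len(hits), "held-out compounds rediscovered")
```

A candidate that matches a held-out compound is evidence that the generator can propose molecules with the desired properties that it never saw during training.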
According to one of the authors of the research, Alex Zhavoronkov, the founder of Insilico Medicine and international adjunct professor at MIPT, "Unlike many other popular methods in deep learning, Generative Adversarial Networks (GANs) were proposed only recently, in 2014, by Ian Goodfellow and Yoshua Bengio's group, and scientists are still exploring their power in generating meaningful images, videos, works of art and even music."
"GANs are very much the frontline of neuroscience. It is quite clear that they can be used for a much broader variety of tasks than the simple generation of images and music. We tried out this approach with bioinformatics and obtained great results," concludes Artur Kadurin, lead programmer on the search optimization team at Mail.Ru Group and independent science advisor to Insilico Medicine.
Contacts and sources:
Moscow Institute Of Physics And Technology