Machine learning method creates learnable chemical grammar to build synthesizable monomers and polymers



Chemical engineers and materials scientists are constantly searching for the next breakthrough material, chemical, or drug. The rise of machine learning approaches is accelerating a discovery process that could otherwise take years. “Ideally, the goal is to train a machine learning model on a few existing chemical samples, then allow it to produce as many manufacturable molecules of the same class as possible, with predictable physical properties,” says Wojciech Matusik, professor of electrical engineering and computer science at MIT. “If you have all of these components, you can create new molecules with optimal properties, and you also know how to synthesize them. That’s the big picture that people in this space want to achieve.”

However, current techniques, primarily deep learning, require large datasets to train models, and many class-specific chemical datasets contain only a handful of example compounds. This limits the models’ ability to generalize and to generate physical molecules that could be created in the real world.

Now, a new paper from researchers at MIT and IBM addresses this problem by using a generative graph model to construct new synthesizable molecules in the same chemical class as their training data. To do this, they treat the arrangement of atoms and chemical bonds as a graph and develop a graph grammar (a linguistic analogue of the systems and structures that govern word order), which contains a sequence of rules for constructing molecules such as monomers and polymers. Using the grammar and production rules inferred from the training set, the model can not only reverse engineer its examples, but can also create new compounds in a systematic and data-efficient way. “We basically built a language to create molecules,” says Matusik. “This grammar is essentially the generative model.”

Matusik’s co-authors include MIT graduate students Minghao Guo, who serves as lead author, and Beichen Li, as well as IBM Research staff members Veronika Thost, Payal Das, and Jie Chen. Matusik, Thost, and Chen are affiliated with the MIT-IBM Watson AI Lab. Their method, which they call Data-Efficient Graph Grammar (DEG), will be presented at the International Conference on Learning Representations.

“We want to use this grammatical representation for the generation of monomers and polymers, because this grammar is explainable and expressive,” Guo explains. “With just a few production rules, we can generate many types of structures.”

A molecular structure can be thought of as a symbolic representation in a graph: a set of atoms (nodes) connected by chemical bonds (edges). In this method, the researchers allow the model to take the chemical structure and collapse a substructure of the molecule into a single node; the substructure can be two atoms joined by a bond, a short sequence of bonded atoms, or a ring of atoms. This is done repeatedly, creating the production rules along the way, until only one node remains. The rules and grammar can then be applied in reverse order to recreate the training set from scratch, or combined in different ways to produce new molecules of the same chemical class.
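The contraction step described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the `MolGraph` class, the `contract` and `reduce_molecule` helpers, and the `X0`, `X1` labels are hypothetical names, and the order in which substructures are collapsed is supplied by hand here, whereas DEG learns it from data.

```python
from dataclasses import dataclass


@dataclass
class MolGraph:
    nodes: dict  # node id -> label (element symbol or a placeholder such as "X0")
    edges: set   # bonds, each stored as frozenset({u, v})


def contract(graph, group, new_label):
    """Collapse the nodes in `group` into one node; return (new_graph, rule)."""
    group = set(group)
    new_id = min(group)
    # The substructure being replaced becomes the right-hand side of the rule.
    rhs_nodes = {n: graph.nodes[n] for n in group}
    rhs_edges = {e for e in graph.edges if e <= group}
    nodes = {n: label for n, label in graph.nodes.items() if n not in group}
    nodes[new_id] = new_label
    edges = set()
    for e in graph.edges:
        if e <= group:
            continue  # bond lies inside the collapsed substructure
        u, v = tuple(e)
        u = new_id if u in group else u
        v = new_id if v in group else v
        edges.add(frozenset((u, v)))
    return MolGraph(nodes, edges), (new_label, rhs_nodes, rhs_edges)


def reduce_molecule(graph, groups):
    """Apply contractions in the given (hand-chosen) order, recording one rule each."""
    rules = []
    for i, group in enumerate(groups):
        graph, rule = contract(graph, group, f"X{i}")
        rules.append(rule)
    return graph, rules


# Heavy atoms of ethanol: C-C-O
ethanol = MolGraph(
    nodes={0: "C", 1: "C", 2: "O"},
    edges={frozenset((0, 1)), frozenset((1, 2))},
)
final, rules = reduce_molecule(ethanol, [{1, 2}, {0, 1}])
print(len(final.nodes), "node left;", len(rules), "rules recorded")
```

Reading the recorded rules from last to first rebuilds the original graph, which is the "reverse order" use mentioned above.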

“Existing graph generation methods would produce one node or edge at a time, but we are looking at higher-level structures and, in particular, exploiting knowledge of chemistry, so we are not dealing with individual atoms and bonds as the unit. This simplifies the generation process and also makes learning more data-efficient,” says Chen.

Additionally, the researchers optimized the technique so that the bottom-up grammar is relatively simple and straightforward, and so that it produces molecules that could actually be made.

“If we reverse the order of applying these production rules, we will get another molecule; moreover, we can enumerate all the possibilities and generate tons of them,” says Chen. “Some of these molecules are valid and some are not, so learning the grammar itself is really about determining a minimal set of production rules, so that the percentage of molecules that can actually be synthesized is maximized.” While the researchers focused on three training sets of fewer than 33 samples each (acrylates, chain extenders, and isocyanates), they note that the process could be applied to any chemical class.
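To make the idea of choosing a minimal rule set concrete, here is a toy sketch. It makes strong simplifying assumptions: molecules are reduced to linear strings of `C` and `O`, the candidate rules are invented for illustration, `is_valid` is a hypothetical stand-in for a real synthesizability check such as retrosynthesis, and the brute-force search over rule subsets replaces whatever optimization DEG actually uses.

```python
from itertools import combinations

# Hypothetical candidate rules; each rewrites the non-terminal "S".
CANDIDATES = {
    "chain_C": ("C", "S"),
    "chain_O": ("O", "S"),
    "chain_OC": ("O", "C", "S"),
    "end_C": ("C",),
    "end_O": ("O",),
    "end_OC": ("O", "C"),
}


def generate(rules, max_len):
    """Enumerate every string of at most max_len atoms derivable from "S"."""
    done, seen, stack = set(), set(), [("S",)]
    while stack:
        form = stack.pop()
        if "S" not in form:
            done.add("".join(form))
            continue
        i = form.index("S")
        for rhs in rules:
            new = form[:i] + rhs + form[i + 1:]
            if new not in seen and sum(s != "S" for s in new) <= max_len:
                seen.add(new)
                stack.append(new)
    return done


def is_valid(molecule):
    """Toy stand-in for a synthesizability check: reject O-O links."""
    return "OO" not in molecule


def select_rules(candidates, training, max_len):
    """Among rule subsets that reproduce the training set, prefer the highest
    valid fraction, then the fewest rules."""
    best_key, best_names = None, None
    names = sorted(candidates)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            generated = generate([candidates[n] for n in subset], max_len)
            if not set(training) <= generated:
                continue  # this grammar cannot rebuild the training examples
            valid = sum(is_valid(m) for m in generated) / len(generated)
            key = (valid, -k)
            if best_key is None or key > best_key:
                best_key, best_names = key, subset
    if best_key is None:
        return None, 0.0
    return best_names, best_key[0]


best, valid_fraction = select_rules(CANDIDATES, ["CC", "COC"], max_len=4)
print(best, valid_fraction)
```

In this toy setting, any subset containing `chain_O` can emit strings with adjacent oxygens, so the search settles on a smaller grammar that still rebuilds both training strings.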

To see how their method worked, the researchers tested DEG against other state-of-the-art models and techniques, looking at the percentages of chemically valid and unique molecules, the diversity of those created, the success rate of retrosynthesis, and the percentage of molecules belonging to the monomer class of the training data.
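Some of these metrics are simple ratios over the generated set. The sketch below is an assumption-laden illustration rather than the paper's evaluation code: it treats molecules as SMILES-like strings, uses character bigrams with Jaccard distance as a crude substitute for chemical fingerprints, and tests class membership by plain substring matching, which is not a real substructure search. The sample strings and the acrylate motif are made up for the example.

```python
from itertools import combinations


def uniqueness(samples):
    """Fraction of generated samples that are distinct."""
    return len(set(samples)) / len(samples)


def bigrams(s):
    """Set of adjacent character pairs; a crude stand-in for a fingerprint."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard_distance(a, b):
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def diversity(samples):
    """Mean pairwise distance between distinct samples."""
    pairs = list(combinations(sorted(set(samples)), 2))
    if not pairs:
        return 0.0
    total = sum(jaccard_distance(bigrams(x), bigrams(y)) for x, y in pairs)
    return total / len(pairs)


def membership(samples, motif):
    """Fraction of samples containing the class motif (substring match only)."""
    return sum(motif in s for s in samples) / len(samples)


generated = ["C=CC(=O)OC", "C=CC(=O)OC", "C=CC(=O)OCC", "CCO"]
print("uniqueness:", uniqueness(generated))
print("diversity:", diversity(generated))
print("membership:", membership(generated, "C=CC(=O)O"))
```

A real evaluation would parse the molecules with a cheminformatics toolkit and also check chemical validity and retrosynthesis success, which this sketch does not attempt.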

“We clearly show that, for synthesizability and membership, our algorithm outperforms all existing methods by a very large margin, while it is comparable for some other widely used metrics,” Guo says. Moreover, “what is amazing with our algorithm is that we only need about 0.15% of the original dataset to get very similar results compared to state-of-the-art approaches which train on tens of thousands of samples. Our algorithm can specifically handle the data sparsity problem.”

In the immediate term, the team plans to scale up this grammar learning process so that it can generate large graphs, as well as produce and identify chemicals with desired properties.

Down the road, the researchers see many applications for the DEG method, which they point out is adaptable beyond generating new chemical structures. A graph is a very flexible representation, and many entities can be symbolized in this form, for example robots, vehicles, buildings, and electronic circuits. “Essentially, our goal is to develop our grammar, so that our graphical representation can be widely used in many different fields,” says Guo, as “DEG can automate the design of new entities and structures,” says Chen.


More information:
Minghao Guo et al, Data-Efficient Graph Grammar Learning for Molecular Generation.

Provided by Massachusetts Institute of Technology

This story is republished courtesy of MIT News, a popular site that covers news about MIT research, innovation, and education.

Citation: Machine learning method creates learnable chemical grammar to build synthesizable monomers and polymers (April 4, 2022), retrieved April 4, 2022 from -learnable-chemical-grammar.html


