Proteins are the workhorses of biology, performing functions that range from fighting viruses in the human body to degrading plastics. They are fundamentally chains of amino acid molecules that have evolved naturally over billions of years through mutation.
Researchers have successfully applied a computer-based natural language processing model to protein research. Artificial intelligence (AI) has opened up new avenues for designing custom proteins to address problems ranging from medical to environmental.
A research team at the University of Bayreuth led by Prof. Dr. Birte Höcker has now successfully applied a computer-based natural language processing model to protein research. Completely independently, the ProtGPT2 model designs new proteins that are capable of stable folding and could take over defined functions in larger molecular contexts. The model and its potential are detailed scientifically in Nature Communications.
Natural languages and proteins are structurally similar. Amino acids combine in a variety of ways to form structures that serve specific functions in living organisms, similar to how words combine to form sentences that express specific facts. As a result, numerous approaches have been developed in recent years to use principles and processes that control computer-assisted natural language processing in protein research.
“Natural language processing has made extraordinary progress thanks to new AI technologies. Today, models of language processing enable machines not only to understand meaningful sentences but also to generate them themselves. Such a model was the starting point of our research. With detailed information concerning about 50 million sequences of natural proteins, my colleague Noelia Ferruz trained the model and enabled it to generate protein sequences on its own. It now understands the language of proteins and can use it creatively. We have found that these creative designs follow the basic principles of natural proteins,” says Prof. Dr. Birte Höcker, Head of the Protein Design Group at the University of Bayreuth.
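The idea of learning from known sequences and then generating new ones can be illustrated with a deliberately tiny sketch. It is not ProtGPT2 itself: the neural language model is replaced here by a simple count of which amino acid follows which, and the training sequences are made-up placeholders in the one-letter amino acid code. All names in the code are hypothetical and chosen for illustration only.

```python
import random
from collections import defaultdict

# Made-up example sequences in the one-letter amino acid code.
# Placeholders for illustration only, not real proteins.
TRAINING_SEQUENCES = ["MKTAYIAKQR", "MKVLAAGIVG", "MSTAYLAKQG", "MKTLLAGIAR"]

START, END = "^", "$"  # markers for the beginning and end of a sequence


def train_bigram_model(sequences):
    """Count how often each residue follows another one."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        tokens = [START] + list(seq) + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, rng, max_len=30):
    """Sample one residue at a time, each conditioned on the previous one."""
    seq, prev = [], START
    while len(seq) < max_len:
        options = counts[prev]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:
            break
        seq.append(nxt)
        prev = nxt
    return "".join(seq)


model = train_bigram_model(TRAINING_SEQUENCES)
rng = random.Random(0)  # fixed seed so repeated runs give the same samples
samples = [generate(model, rng) for _ in range(5)]
print(samples)
```

Each generated sequence reuses only residue-to-residue transitions seen in the training data, yet the sequence as a whole can be one that never appeared there. A real protein language model follows the same generate-one-token-at-a-time principle but conditions on far more context than the single preceding residue.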
“ProtGPT2” is the name of the language processing model that has now been applied to protein evolution. It can be used to create proteins that fold into stable structures and remain functional in this state. Furthermore, the Bayreuth biochemists discovered through extensive research that the model can generate proteins that do not exist in nature and may never have existed in the history of evolution. These findings shed light on the infinite world of possible proteins and open the door to designing them in novel and previously unknown ways.
Another advantage concerns structure. Most proteins that have been designed from scratch so far have idealized structures. Before such structures can be used, they must usually go through an elaborate functionalization process, such as the insertion of extensions and cavities, so that they can interact with their surroundings and perform precisely defined functions in larger system contexts. ProtGPT2, on the other hand, naturally produces proteins that already have such differentiated structures and can therefore function in their respective environments right away.
“Our new model demonstrates the systemic affinity of protein design and natural language processing yet again. Artificial intelligence opens up a world of intriguing and promising possibilities for using language processing methods to create customized proteins. We at the University of Bayreuth hope to contribute to the development of innovative solutions to biomedical, pharmaceutical, and environmental problems in this way,” says Prof. Dr. Birte Höcker.