A deep learning approach to protein design • Baker Lab

Together with scientists at the Ovchinnikov lab at Harvard, we have applied deep learning to the challenge of protein design, yielding a new way to quickly create protein sequences that fold up as desired. This breakthrough has broad implications for the development of protein-based medicines and vaccines.

Computational protein design has primarily focused on finding amino acid sequences for which the target structure has very low energy. However, what matters most during folding is not the absolute energy of the folded state but the energy difference between the folded state and the lowest-lying alternative states. In a new publication in PNAS, a team led by IPD postdoctoral scholars Christoffer Norn and Basile Wicky describes a deep learning approach that captures the entire folding landscape. The team also shows that it can enhance current protein design methods.
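To see why the energy gap matters more than the absolute energy, consider a minimal sketch based on textbook Boltzmann statistics. It is not taken from the paper: the function name, the two hypothetical sequences, and all energy values are illustrative assumptions, and each sequence is given only one alternative state for simplicity.

```python
import math

# Approximate thermal energy k_B * T at 298 K, in kcal/mol.
KT_KCAL_PER_MOL = 0.592


def folded_occupancy(folded_energy, alternative_energies, kt=KT_KCAL_PER_MOL):
    """Boltzmann population of the folded state among the listed states.

    Energies are in kcal/mol. Only the states passed in are considered,
    which is a strong simplification of a real folding landscape.
    """
    energies = [folded_energy] + list(alternative_energies)
    lowest = min(energies)  # shift energies so the exponentials stay well-scaled
    weights = [math.exp(-(e - lowest) / kt) for e in energies]
    return weights[0] / sum(weights)


# Hypothetical sequence A: very low folded energy, but an alternative state
# lies only 0.5 kcal/mol above it.
occupancy_a = folded_occupancy(-100.0, [-99.5])

# Hypothetical sequence B: higher folded energy, but the nearest alternative
# state lies 5 kcal/mol above it.
occupancy_b = folded_occupancy(-80.0, [-75.0])

print(f"A: {occupancy_a:.3f}  B: {occupancy_b:.3f}")
```

In this toy setting, sequence B populates its folded state far more than sequence A, even though A has the lower absolute folded-state energy.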

Read the paper: Norn C, Wicky BIM, et al. Protein sequence design by explicit energy landscape optimization. PNAS. March 2021.

This work builds on a recently described convolutional neural network called trRosetta that predicts residue-residue distances and orientations from input sets of aligned sequences. Combining these predictions with Rosetta energy minimization yielded excellent predictions of structures in benchmark cases. While co-evolution between pairs of positions in the input multiple sequence alignments is critical for accurate prediction of structures of naturally occurring proteins, it is not necessary for more “ideal” designed proteins: accurate predictions of the latter can be obtained from single sequences.

Because distance and orientation predictions are probabilistic, the team hypothesized that they might inherently contain information about alternative conformations, and thus provide more information about design success than classical energy calculations. Moreover, because these predictions can be obtained rapidly for an input sequence on a single GPU, they reasoned that it should be possible to use the network to directly design sequences that fold into a desired structure by maximizing the probability of the observed residue-residue distances and orientations versus all others.
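The scoring idea in this paragraph can be sketched as follows. This is an illustrative toy, not the trDesign implementation: it assumes the network output is available as per-pair logits over distance bins, covers distances only (orientations would be handled the same way), and every function name and number is hypothetical. The softmax normalizes each pair's prediction over all bins, so a high score for the target bin necessarily means low probability for the alternatives.

```python
import math


def softmax(logits):
    """Convert raw scores over bins into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def target_log_likelihood(pair_logits, target_bins):
    """Mean log-probability assigned to the target bin of each residue pair.

    pair_logits: dict mapping (i, j) -> list of logits over distance bins
                 (assumed format of a predictor's output)
    target_bins: dict mapping (i, j) -> index of the bin that the desired
                 structure occupies for that pair
    """
    total = 0.0
    for pair, bin_index in target_bins.items():
        probs = softmax(pair_logits[pair])
        total += math.log(probs[bin_index])
    return total / len(target_bins)


# Desired structure: one residue pair whose distance falls in bin 1 of 3.
target = {(0, 1): 1}

# Hypothetical prediction for a sequence that strongly favors the target bin.
confident = {(0, 1): [0.0, 5.0, 0.0]}

# Hypothetical prediction for a sequence split between the target bin and
# an alternative bin, hinting at a competing conformation.
ambiguous = {(0, 1): [0.0, 2.0, 2.0]}

score_confident = target_log_likelihood(confident, target)
score_ambiguous = target_log_likelihood(ambiguous, target)
print(score_confident > score_ambiguous)
```

A design procedure could then propose sequence changes and keep those that raise this score; the choice of optimizer is left open here.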

In stark contrast to the energy-based sequence design approaches that have characterized the field to date, sequence design using trRosetta can capture properties of the entire energy landscape and account for alternative states that reduce the occupancy of the desired target structure. Such implicit consideration of the full landscape is almost impossible to achieve with atomistic models without extremely CPU-intensive calculations, including large-scale Rosetta ab initio structure prediction and molecular dynamics simulations on very long timescales. On the other hand, because the trRosetta method has lower resolution, it is less accurate in the immediate vicinity of the folded structure. Integrating trRosetta design into Rosetta all-atom calculations combines the strengths of both approaches. More generally, this work demonstrates how deep learning methods can complement detailed physically based models by capturing higher-level properties normally accessible only through large-scale simulations.

This project was supported by the National Institutes of Health, the Howard Hughes Medical Institute, the Audacious Project, and Eric and Wendy Schmidt by recommendation of the Schmidt Futures program. Christoffer Norn is supported by a grant from the Novo Nordisk Foundation. Basile Wicky is an EMBO long-term fellow. The source code for the study is available at https://github.com/gjoni/trDesign