Data-driven protein design • Baker Lab

This summer saw a major advance in protein science: data-driven design.

Gabe Rocklin, a California native, is a Merck Postdoctoral Fellow of the Life Sciences Research Foundation. In addition to fathering thousands of new proteins, Gabe is the proud new father of Natalie, born five days before his paper reviews arrived.

For decades, researchers have been trying to decode the rules of protein folding by studying how the complex, highly specialized proteins in nature hold their shapes. It’s a bit like trying to figure out how an airplane works – at its most fundamental level – by watching the Blue Angels. Not an easy task.

Here at the IPD, postdoctoral fellow Gabe Rocklin flipped this decades-old problem on its head. Rather than observing thousands of complex natural proteins to try to deduce their folding rules, Gabe and his team built their own set of over 15,000 new, simpler proteins – all designed using Rosetta – and studied how those molecules behaved.

At first, only a tiny fraction of the team’s designer proteins actually folded. But by analyzing which designs succeeded and which failed, the team was able to learn new rules of protein folding and incorporate the features of successful designs into their design pipeline. By the fourth round of design, close to half of all their computer-generated proteins were folding. Their new set of 2,788 folded de novo proteins includes topologies that have never been observed in nature. These “miniproteins” are already being used by researchers here at the IPD as platforms for future medicines.

Illustration by Gabe Rocklin

Three key technologies were brought together to make this ambitious project possible: the ever-improving Rosetta software suite, which was used to design from scratch the thousands of new proteins needed for this type of analysis; new advances in parallel DNA synthesis, which let the team express a huge number of designer proteins; and massively parallel protease-based screens coupled with deep sequencing, which let the team measure the stability of every sequence in their library.
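For readers curious how a protease-based screen can be turned into a stability number, here is a minimal toy sketch in Python. It is not the analysis used in the Science paper, which this post does not describe. It assumes that, for each design, we already know the fraction of molecules surviving at several protease concentrations, and it estimates the concentration at which half survive (an EC50) by interpolating on a log scale. The function name `estimate_ec50`, the design names, and all numbers are invented for illustration.

```python
import math
from typing import Optional, Sequence


def estimate_ec50(concentrations: Sequence[float],
                  surviving: Sequence[float]) -> Optional[float]:
    """Estimate the protease concentration at which half of a design survives.

    Hypothetical helper: interpolates linearly in log10(concentration)
    between the two measurements that bracket 50% survival.
    Concentrations must be positive. Returns None if survival never
    drops below 50% in the tested range.
    """
    points = sorted(zip(concentrations, surviving))
    for (c_lo, s_lo), (c_hi, s_hi) in zip(points, points[1:]):
        if s_lo >= 0.5 > s_hi:
            # Fraction of the way from c_lo to c_hi where survival hits 0.5.
            t = (s_lo - 0.5) / (s_lo - s_hi)
            log_ec50 = math.log10(c_lo) + t * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ec50
    return None


# Made-up survival curves for two imaginary designs (not real data).
protease_conc = [1.0, 10.0, 100.0]
toy_library = {
    "design_A": [1.0, 0.75, 0.25],  # degraded between 10 and 100
    "design_B": [0.9, 0.2, 0.05],   # degraded between 1 and 10
}

ec50_by_design = {name: estimate_ec50(protease_conc, curve)
                  for name, curve in toy_library.items()}
```

In this toy picture, a design that needs more protease before it is degraded (a higher EC50) is treated as more stable, so `design_A` would rank above `design_B`.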

As their new research article in Science reports, the rules governing protein folding are complex. Even small proteins are held together by thousands of weak interactions that collectively overcome the entropic cost of folding. But by studying both successful and failed designs, the team discovered subtle sequence-level patterns and used them to improve the protein design process.

This work achieves the long-standing goal of a tight feedback cycle between computation and experiment and has the potential to transform computational protein design into a data-driven science.

Global analysis of protein folding using massively parallel design, synthesis, and testing | PDF

See coverage in Science, UW Newsbeat, Science Daily, Chemistry World, Phys.org, GEN, and C&E News.