Guest post from Dr. Gregory Bowman, UC Berkeley
Two general objectives of the Folding@home project are (1) to explain the molecular origins of existing experimental data and (2) to provide new insights that will inspire the next generation of cutting-edge experiments. We have made tremendous progress in both areas, but particularly in the first. Obtaining new insight is more of an art and, therefore, harder to automate.
To help facilitate new insights, I recently developed a Bayesian algorithm for coarse-graining our models. To explain, when we are studying some process—like the folding of a particular protein—we typically start by drawing on the computing resources you share with us to run extensive simulations of the process. Next, we build a Markov model from this data. As I’ve explained previously, these models are something like maps of the conformational space a protein explores. Specifically, they enumerate conformations the protein can adopt, how likely the protein is to form each of these structures, and how long it takes to morph from one structure to another. Typically, our initial models have tens of thousands of parameters and are capable of capturing fine details of the process at hand. Such models are superb for making a connection with experiments because we can capture all the little details that contribute to particular experimental observations. However, they are extremely hard to understand. Therefore, it is to our advantage to coarse-grain them. That is, we attempt to build a model with very few parameters that is as close as possible to the original, complicated model. If done properly, the new model can capture the essence of the phenomenon in a way that is easier for us to wrap our minds around. Based on the understanding this new model provides, we can start to generate new hypotheses and then test them with our more complicated models and, ultimately, via experiment.
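To make this concrete, here is a minimal sketch in Python (with NumPy) of what such a model looks like and how lumping it into a coarser model works. The three-state system, all of its numbers, and the choice of which states to merge are invented for illustration; they stand in for the tens of thousands of states in a real model.

```python
import numpy as np

# A toy Markov model: a row-stochastic transition probability matrix.
# T[i, j] is the probability that a protein in conformation i has moved
# to conformation j after one lag time. (All numbers invented.)
T = np.array([
    [0.90, 0.09, 0.01],   # state 0: exchanges quickly with state 1
    [0.10, 0.85, 0.05],   # state 1
    [0.01, 0.04, 0.95],   # state 2: reached only via a slow transition
])

# How likely the protein is to form each structure: the stationary
# distribution, i.e. the eigenvector of T transposed with eigenvalue 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# How long it takes to morph between structures: implied timescales
# from the remaining eigenvalues, t_k = -lag / ln(lambda_k).
lag = 1.0  # one lag time, in whatever unit the simulation uses
lams = np.sort(np.real(evals))[::-1]
timescales = -lag / np.log(lams[1:])

# Coarse-grain by lumping the two fast-exchanging states (0 and 1)
# into one macrostate, weighting each microstate by its population.
lumps = [[0, 1], [2]]
T_coarse = np.zeros((len(lumps), len(lumps)))
for a, A in enumerate(lumps):
    w = pi[A] / pi[A].sum()                 # weights within lump A
    for b, B in enumerate(lumps):
        T_coarse[a, b] = w @ T[np.ix_(A, B)].sum(axis=1)

print("populations:", pi.round(3))
print("implied timescales:", timescales.round(1))
print("coarse-grained model:\n", T_coarse.round(3))
```

The two fast-exchanging states get merged into one macrostate, weighted by their equilibrium populations, so the coarse model keeps the slow transition while hiding the fast detail we no longer need to think about.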
Statistical uncertainty is a major hurdle in this sort of coarse-graining. For example, if we observe 100 transitions between a pair of conformations and each of these transitions is slow, then we can be fairly confident the transition really is slow. However, if we observe some other transition only once and it happens to occur slowly, who knows? It could be a genuinely slow transition. On the other hand, we may simply have gotten unlucky.
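One way to put numbers on that intuition is a quick Bayesian sketch (Python with SciPy; the counts are made up). Under a uniform prior, the posterior over a transition probability is a Beta distribution whose width shrinks as the counts grow:

```python
from scipy import stats

# Posterior over a transition probability p, under a uniform Beta(1, 1)
# prior, after observing `k` transitions out of `n` opportunities.
def credible_interval(k, n, mass=0.95):
    posterior = stats.beta(1 + k, 1 + (n - k))
    lo, hi = posterior.interval(mass)
    return posterior.mean(), (lo, hi)

# Case 1: a transition seen many times -- the posterior is sharp.
mean, (lo, hi) = credible_interval(k=100, n=1000)
print(f"100/1000 observed: mean={mean:.3f}, 95% interval=({lo:.3f}, {hi:.3f})")

# Case 2: the same empirical rate, seen only a handful of times --
# the posterior is broad, so the estimate deserves little faith.
mean, (lo, hi) = credible_interval(k=1, n=10)
print(f"  1/10   observed: mean={mean:.3f}, 95% interval=({lo:.3f}, {hi:.3f})")
```

With 100 observations out of 1000 the credible interval is narrow; with 1 out of 10 it spans most of the plausible range, which is exactly the "who knows?" situation above.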
Existing methods for coarse-graining our Markov models assume we have enough data to accurately describe every transition. As a result, they often single out these poorly characterized transitions as important (for protein folding, we typically care most about the slow steps, so slow and important are synonymous). The new method I've developed (described here) explicitly accounts for how many times each transition was observed. It can therefore place appropriate emphasis on the transitions we observed often enough to trust while discounting the ones we don't. To accomplish this, I draw on Bayesian statistics. I can't do the subject justice here, but if you're ever trying to make sense of data you have varying degrees of faith in, I highly recommend looking into Bayesian statistics.
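The paper linked above lays out the full algorithm; what follows is not that algorithm but a sketch of the standard Bayesian treatment of count data it builds on (Python again; all numbers invented). Each row of the observed transition count matrix gets a Dirichlet prior, and the spread of the resulting posterior immediately reveals which rows, meaning which states' outgoing transitions, we can trust:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed transition counts between three states (invented numbers).
# Row i records how many times we saw the system jump out of state i.
counts = np.array([
    [900,  90,  10],   # state 0: well sampled
    [ 80, 850,  20],   # state 1: well sampled
    [  0,   1,   1],   # state 2: barely observed at all
])

# Under a Dirichlet(alpha) prior on each row of the transition matrix,
# the posterior is Dirichlet(alpha + counts). Sampling from it shows
# how trustworthy each row's estimated probabilities are.
alpha = 1.0 / counts.shape[0]  # a weak, uninformative prior (one common choice)
for i, row in enumerate(counts):
    samples = rng.dirichlet(alpha + row, size=10_000)
    print(f"state {i}: mean={samples.mean(axis=0).round(3)}, "
          f"posterior std={samples.std(axis=0).round(3)}")
```

The well-sampled rows come back with tiny posterior spreads, while the barely observed state's probabilities are nearly unconstrained; a coarse-graining criterion built on these posteriors can therefore discount the untrustworthy transitions automatically.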