May Manifesto addendum

Just added another statement to my manifesto. Here is the full text:

Interpret or predict. Pick one. There is an inescapable tradeoff between models which are easy to interpret and those which make the best predictions. The larger the data set, the higher the dimensions, the more interpretability needs to be sacrificed to optimize prediction quality. This puts modern science at a crossroads, having now exploited all the low hanging fruit of simple models of the natural world. In order to move forward, we will have to put ever more confidence in complex, uninterpretable “black box” algorithms, based solely on their power to reliably predict new observations.

Since you can’t comment on WordPress pages, you can post any comments about my latest addition here. First, though, here is an example that might help explain the difference between interpreting and predicting. Suppose you wanted to say something about smoking and its effect on health. If your focus is on interpretability, you might create a simple model (perhaps using a hazard ratio) that leads you to make the following statement: “Smoking increases your risk of developing lung cancer by 100%”.

There may be some broad truth to your statement, but to more effectively predict whether a particular individual will develop cancer, you’ll need to include dozens of additional factors in your model. A simple proportional hazards model might be outperformed by an exotic form of regression, which might be outperformed by a neural network, which would probably be outperformed by an ensemble of various methods. At that point, you can no longer claim that smoking makes people twice as likely to get cancer. Instead, you could say that if Mrs. Jones (a real estate agent and mother of two, in her early 30s, with no family history of cancer) begins smoking two packs a day of filtered cigarettes, your model predicts that she will be 70% more likely to be diagnosed with lung cancer in the next 10 years.
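To make the contrast concrete, here is a minimal sketch (with made-up, purely illustrative counts) of the kind of interpretable calculation behind a statement like “smoking doubles your risk”: a single relative-risk number computed from a two-by-two table. A black-box ensemble would instead consume dozens of features per person and emit an individualized probability, with no comparably quotable one-line summary.

```python
# Relative risk from a 2x2 table: the single-number, interpretable
# summary a simple model yields. All counts are invented for illustration.

def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

# Hypothetical cohort: 1,000 smokers, 1,000 non-smokers.
rr = relative_risk(exposed_cases=100, exposed_total=1000,
                   unexposed_cases=50, unexposed_total=1000)
print(f"Relative risk: {rr:.1f}")  # 2.0 -> "doubles the risk" (a 100% increase)
```

The interpretable claim is exactly this ratio; once the model is an ensemble over dozens of interacting features, no such single factor-level number falls out of it.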

The shift taking place right now in how we do science is huge, so big that we’ve barely noticed. Instead of seeing the world as a set of discrete, causal linkages, this new approach sees rich webs of interconnections, correlations and feedback loops. In order to gain real traction in simulating (and making predictions about) complex systems in biology, economics and ecology, we’ll need to give up on the ideal of understanding them.



  1. What about generalizability? If you figure out that smoking causes cancer, and the mechanism behind it, you can more easily understand other carcinogens. Is the same true if you build a statistical model that makes good predictions but does not represent causality?

  2. Hi Morgan,

    You raise a really good point. Generalizability is another potential “victim” of greater model complexity and higher precision in the predictions we make. My suspicion is that to make the next great leaps forward, we’ll have to let our algorithms themselves (and not our understanding of the mechanisms we’ve tried to embody in those algorithms) do the generalizing. In other words, the best way to predict the weather in two days isn’t to understand that certain clouds presage rain, or to know why winds must travel from areas of high pressure to low pressure. In fact, to build the best weather predictor you may actually have to “unlearn” previous beliefs about what causes what, and begin building your prediction engine from scratch with as little built-in domain knowledge (bias?) as possible. Factors like cloud cover and pressure will no doubt be in this model, but the best predictions might come from using these factors in highly convoluted, indirect ways. In other words: yes, you just lost generalizability and any claim to causality.

  3. Hi, thx for ever-interesting writings. What I see in your point is sort of a problem with reductionism, where understanding a few smaller parts does not necessarily lead to equivalent or better understanding of the macroscopic picture. In order to make quality predictions, one needs to include as many relevant variables as possible, whereas for a causal explanation (interpretation), one needs to cut them down to a few key determining factors. This is where the dilemma kicks in, right?

    And I have a question:

    - I did not quite understand what you meant by the “indirect” ways mentioned above. Can you explain how the prediction-making process would work in that “convoluted, indirect” way, with an example from weather prediction? Is it really possible in practice to predict other than by mechanically aggregating the causal contributions of each variable pertinent to a particular outcome?

  4. Hi, thx for an interesting writing. I have a question. What do you mean by “highly convoluted, indirect ways”? Is it at all possible to make predictions “in reality” other than by semi-mechanically aggregating the causal contributions (how much a particular variable contributes to making a certain outcome happen) of the variables pertinent to the issue (hopefully as many as possible)? This aggregation approach intrinsically involves the risk of reductionism, but do we have better means?

  5. Matt,

    I very much like the tension you point out in trade-offs between prediction and interpretation. It reminds me a bit of what Richard Levins writes about when he describes formal models as necessarily having to sacrifice one of precision, generality (not generalizability!), or realism.

    But you write something with which I disagree: “having now exploited all the low hanging fruit of simple models of the natural world.” Yeah, like, not even close. You point out, rightly so, that we have new tools which, if we abandon interpretation, get us scads of ways farther on the prediction front. But your assertion about the low-hanging fruit (presumably, that the non-data-mining tools have run their course on the easy questions) is utterly blind to the agendas driving which science gets done, agendas that have left very basic questions unexamined. For example, in epidemiology (the study of health and disease, and the determinants of same, in populations; also my discipline), the fact that we have not yet bothered to put certain populations in the denominators of our statistics on risk and prevalence (e.g. transgender populations) means that we haven’t even begun to ask or answer basic questions.

    On a related note, the ontological framing of our questions (i.e. the categories and measures we even think to use) and the way they are linked to theories are constantly evolving. For example, we’ve lots of theories on individual behavior in the health sciences… but what becomes apparent in light of our current networked age is that concepts of attention (and its limits), and of decisions as a finite, time-bound resource, are lacking. When the theories of health behavior are inevitably updated to include such concepts, there will be new research. And that new research may get along just fine using either old-fashioned interpretive models or new-fangled predictive models.

    “Low hanging fruit” is a misdirect. There’s plenty of it lying around.

  6. Hi Alexis,

    Thanks for your thoughts. Perhaps there is much more low-hanging fruit out there than I thought. I should clarify, though, that by the “natural world” I meant physics, chemistry, and biology; I wasn’t thinking so much about the social sciences. I like your point (which I understand as saying) that if we re-frame our questions, this opens up new territory to explore, territory that we might be able to map initially with simple models.

  7. Just to add: I did a first pass of a paper about Levins (by Odenbaugh). I like that Levins presents a “pick 2 out of 3” tradeoff model. In particular, I think the greatest advancements will come from the generalizability+precision combination. This is, not coincidentally, what’s needed for good AI and for a useful (but incomprehensibly complex) brain.
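Picking up the weather example from the replies above: here is a minimal sketch of what prediction without causal understanding can look like. A nearest-neighbor predictor forecasts rain purely by analogy to past days with similar readings; it consumes cloud cover and pressure, but never encodes a rule like “low pressure brings rain.” All numbers are invented for illustration.

```python
# A k-nearest-neighbor "weather predictor" (toy example, invented data).
# It encodes no causal rule; it just answers "what happened on the past
# days that most resembled today?"
import math

# (cloud_cover %, pressure hPa) -> did it rain the next day?
history = [
    ((90, 995), True), ((85, 1000), True), ((80, 990), True),
    ((20, 1020), False), ((10, 1025), False), ((30, 1015), False),
    ((70, 1005), True), ((40, 1018), False),
]

def predict_rain(cloud_cover, pressure, k=3):
    """Majority vote among the k most similar past days."""
    def dist(day):
        (c, p), _ = day
        # crude scaling so both features contribute comparably
        return math.hypot((c - cloud_cover) / 100, (p - pressure) / 40)
    nearest = sorted(history, key=dist)[:k]
    votes = sum(1 for _, rained in nearest if rained)
    return votes > k // 2

print(predict_rain(88, 998))   # resembles past rainy days -> True
print(predict_rain(15, 1022))  # resembles past dry days -> False
```

The predictor can be arbitrarily accurate as the history grows, yet it yields no statement of the form “X causes rain” — which is exactly the tradeoff the post describes.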