May, 2012

May 12

Early May Link roundup

Naomi Robbins looks at using pie charts to represent women and men in publishing. Her piece is here. The charts in question are here. Warning: Don’t click if you hate pie charts. You just might have a meltdown, or wonder why all the info couldn’t be put into a single bar chart.

Discussion at Cross Validated about “How best to communicate uncertainty”, with way too few responses. My feeling is that as news filters from scientific realms to pop culture outlets, it becomes more and more “certain” in terms of how it’s presented. Can that be fixed?

Master of data visualization Hans Rosling made Time magazine’s list of the 100 most influential people in the world.

May 12

May Manifesto addendum

Just added another statement to my manifesto. Here is the full text:

Interpret or predict. Pick one. There is an inescapable tradeoff between models which are easy to interpret and those which make the best predictions. The larger the data set, the higher the dimensions, the more interpretability needs to be sacrificed to optimize prediction quality. This puts modern science at a crossroads, having now exploited all the low hanging fruit of simple models of the natural world. In order to move forward, we will have to put ever more confidence in complex, uninterpretable “black box” algorithms, based solely on their power to reliably predict new observations.

Since you can’t comment on WordPress pages, you can post any comments about my latest addition here. First, though, here is an example that might help explain the difference between interpreting and predicting. Suppose you wanted to say something about smoking and its effect on health. If your focus is on interpretability, you might create a simple model (perhaps using a hazard ratio) that leads you to make the following statement: “Smoking increases your risk of developing lung cancer by 100%”.

There may be some broad truth to your statement, but to more effectively predict whether a particular individual will develop cancer, you’ll need to include dozens of additional factors in your model. A simple proportional hazards model might be outperformed by an exotic form of regression, which might be outperformed by a neural network, which would probably be outperformed by an ensemble of various methods. At that point, you can no longer claim that smoking makes people twice as likely to get cancer. Instead, you could say that if Mrs. Jones (a real estate agent and mother of two, in her early 30s, with no family history of cancer) begins smoking two packs a day of filtered cigarettes, your model predicts that she will be 70% more likely to be diagnosed with lung cancer in the next 10 years.
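For anyone who wants the arithmetic behind those two statements spelled out, here is a minimal Python sketch. It only shows how "X% more likely" maps to a risk ratio (100% more is a ratio of 2, 70% more is a ratio of 1.7). The function names and the baseline risk are hypothetical, made up purely for illustration; nothing here comes from a real model or real data.

```python
def percent_increase_to_ratio(percent_increase):
    """Convert a statement like 'X% more likely' into a relative risk ratio."""
    return 1.0 + percent_increase / 100.0


def absolute_risk(baseline_risk, ratio):
    """Scale a baseline probability by a risk ratio, capped at 1.0."""
    return min(1.0, baseline_risk * ratio)


# The interpretable, population-level claim: "increases your risk by 100%".
population_ratio = percent_increase_to_ratio(100)

# The individual prediction for Mrs. Jones: "70% more likely".
individual_ratio = percent_increase_to_ratio(70)

# ASSUMPTION: a made-up 10-year baseline risk, chosen only to show the scaling.
HYPOTHETICAL_BASELINE_RISK = 0.01

print("Population-level ratio:", population_ratio)
print("Ratio for Mrs. Jones:", individual_ratio)
print("Her 10-year risk under the hypothetical baseline:",
      absolute_risk(HYPOTHETICAL_BASELINE_RISK, individual_ratio))
```

The single ratio from the simple model applies to everyone alike, which is what makes it easy to state. The individual ratio depends on everything else the complex model knows about the person, so it cannot be boiled down to one sentence about smoking.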

The shift taking place right now in how we do science is huge, so big that we’ve barely noticed. Instead of seeing the world as a set of discrete, causal linkages, this new approach sees rich webs of interconnections, correlations and feedback loops. In order to gain real traction in simulating (and making predictions about) complex systems in biology, economics and ecology, we’ll need to give up on the ideal of understanding them.