Manifesto
Having no useful door to nail these to, I present them here for general public digestion:
- Probability is math. Statistics is (applied) epistemology. The biggest questions in statistics revolve around the limits of our knowledge. What conclusions are justified based on which processes? How should we interpret the results of an experiment? When is data bad or good? These are philosophical questions. Math can help you answer them, but only in the sense that knowledge of mechanical engineering helps you drive a taxi.
- In Monte Carlo we trust. Studying equations which converge as “n” goes to infinity can provide great theoretical insight. If you want to know how things work, or might work, in the real world of complicated models and limited trials, you need to run an experiment. Monte Carlo simulations, being the purest of all possible experiments, are often the best map we have between theory and reality (a small simulation is sketched after this list).
- If your system is chaotic or reflexive, your model must be recursive or iterative. The distribution itself, or at a minimum its parameters, should evolve along with the data, leading to sensitivity to initial conditions and the possibility of emergent patterns and unexpected outcomes. All models predicting human behavior, especially in groups, should be reflexive or iterative. Non-independence of outcomes must be assumed (a toy feedback process is sketched after this list).
- Check your assumptions. Repeat.
- There are no “outliers”, only extreme results. If you remove data from your analysis for reasons other than known error in data collection or transcription, you are no longer doing science. You are ignoring evidence, very often strong evidence. Whenever you hear someone (usually from the world of finance) speak of a “12 sigma event” or some other occurrence that should happen once in a stega-godzillion years, that’s not an outlier; it’s a sign they are using a dangerously inappropriate model (see the tail-probability sketch after this list).
- “N” is always finite. Probability theory gives us many powerful limit theorems. These tell us what to expect, mathematically, as the number of trials goes to infinity. That’s fine, but we don’t live our lives at the limit. In the real world, small-sample behavior matters far more, as do non-parametric inference and tools like the bootstrap that maximize the information extracted from limited sample sizes (a minimal bootstrap is sketched after this list). Even in those real-life situations when N could increase without limit, the growth of the sample size itself can change the problem. Toss a die enough times and the edges will start to wear off, changing the frequency with which each side comes up.
- “All models are wrong, but some are useful.” Attributed to statistician George Box. Models are maps, imperfect simplifications of a more complicated underlying reality. In addition to being useful, they can also be powerful, insightful, and lucrative. But if you start saying things like “the data prove that my model is right”, then you’ve failed to understand statistics (see the first item above).
- Data is information. Useful models reduce entropy. Data isn’t just numbers or categories. It’s not a series of zeros and ones. A stream of data is a stream of information. It has an information rate and a level of entropy. Better data lowers the effective entropy. Better models or procedures lower the effective entropy (see the entropy sketch after this list).
- Look closely enough, and everything has a distribution. View any data point with a strong enough microscope and it starts to look fuzzy. In math, 2 and 2 sum up to exactly 4. In the real world, there is always a margin of error. The true sample space can never be fully known or bounded or perfectly modeled as a sigma algebra. Real coins land on their edges every once in a blue moon. Sometimes nothing happens. Sometimes, nothing happens. All categorization schemes break down at the margins, all empirical statements are built with words, defined by other words. Look closely enough at anything, and it starts to look fuzzy.
- It’s all about the evidence. Studies and experiments neither prove nor disprove assertions. Instead, they provide evidence for or against them. Sometimes the evidence is strong, other times it’s weak or mixed. How we evaluate new evidence — judging its absolute strength and integrating it with prior evidence and belief — is therefore the very foundation of all statistical work. Unfortunately, radical attempts to establish the correct way to integrate new evidence have fallen out of favor or languished in obscurity (see Bruno de Finetti, Richard Royall, or E.T. Jaynes). We are left only with the shadow of a debate, in the form of locked antlers between frequentists and Bayesians (a likelihood-ratio sketch follows this list).
- Morality needs probability. Doing ethics without probability is like performing surgery with a wooden spoon — it’s a blunt instrument capable of only the most basic operations, and more likely to kill the patient than heal them. Implicitly, we understand this need for probability in making ethical judgements, yet most people recoil when the calculus of probabilities is made explicit, because it seems cold, because the math frightens and confuses them, or because letting odds remain unestimated and unacknowledged allows people to confuse positive outcomes with moral behavior, sweeping hidden risks under the rug when things go well, or claiming ignorance when they don’t. It’s time to acknowledge — directly, explicitly, mathematically — that morality needs probability. For ethics to move forward it must be integrated with our knowledge of randomness and partial entailment.
- Interpret or predict. Pick one. There is an inescapable tradeoff between models which are easy to interpret and those which make the best predictions. The larger the data set, the higher the dimensions, the more interpretability needs to be sacrificed to optimize prediction quality. This puts modern science at a crossroads, having now exploited all the low hanging fruit of simple models of the natural world. In order to move forward, we will have to put ever more confidence in complex, uninterpretable “black box” algorithms, based solely on their power to reliably predict new observations.
- Correlation proves compatibility. Negative correlation implies incompatibility.
- Revolution is in the air. Or, at the least, it should be. The statistical analyses and processes in current use by scientists were created for mathematical elegance and ease of computation. They are sub-optimal tools which encourage problematic claims (X has no relationship with Y because the data in our model failed to meet an arbitrary p-value cutoff) and questionable assumptions (normality, independence, that outliers can be removed). Meanwhile, ever larger datasets combined with massive increases in computational power require new ways to understand and model data. We can scan through gigabytes of data and test millions of model and parameter combinations in a single afternoon (see the multiple-testing sketch after this list). We have dozens of exotic new data mining concepts (the lasso, Bayesian classifiers, simulated neural networks). We have tools of unfathomable power, complexity, and diversity, yet the foundations of our discipline have scarcely evolved since they were laid down many decades ago by one single biologist. Statistics has yet to find its Cantor, its Gödel, or its Turing. Our world is quantum, our mindset still classical.
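A few code sketches to go with the items above. They are minimal Python illustrations, not prescriptions; the distributions, sample sizes, and numbers in them are my own arbitrary choices, picked for convenience rather than taken from any real analysis.

First, the Monte Carlo item: a quick simulation to check how a textbook asymptotic result (the nominal 95% t-interval for a mean) actually behaves with a small sample from a skewed distribution. The Exponential(1) population and n = 10 are assumptions made purely for the demonstration.

```python
import numpy as np

# Monte Carlo check of a textbook result: how often does a nominal 95%
# t-interval for the mean actually cover the truth when the data are
# skewed (Exponential(1)) and the sample is small (n = 10)?
rng = np.random.default_rng(42)
n, trials = 10, 100_000
true_mean = 1.0      # mean of an Exponential(1) distribution
t_crit = 2.262       # two-sided 95% t critical value, 9 degrees of freedom

covered = 0
for _ in range(trials):
    x = rng.exponential(scale=1.0, size=n)
    half_width = t_crit * x.std(ddof=1) / np.sqrt(n)
    if abs(x.mean() - true_mean) <= half_width:
        covered += 1

print(f"Empirical coverage: {covered / trials:.3f} (nominal: 0.950)")
```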
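For the item on chaotic and reflexive systems, a toy feedback process (a Pólya urn, my choice of example): each outcome changes the distribution that generates the next outcome, so independent runs of the identical process settle at very different long-run frequencies.

```python
import numpy as np

# A Polya urn: draw a ball, put it back along with another of the same
# color. The probability of "red" evolves with the data itself.
rng = np.random.default_rng()

def polya_frequency(steps=1000):
    red, black = 1, 1
    for _ in range(steps):
        if rng.random() < red / (red + black):
            red += 1
        else:
            black += 1
    return red / (red + black)

# Five runs of the same process, five very different limiting frequencies.
print([round(polya_frequency(), 2) for _ in range(5)])
```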
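For the outlier item: the tail probability of a “12 sigma” observation under a Normal model versus a heavy-tailed alternative (a Student-t with 3 degrees of freedom, chosen here only for illustration). The event that embarrasses one model is merely unusual under the other.

```python
from scipy.stats import norm, t

# P(observation more than 12 standard deviations out) under two models.
p_normal = norm.sf(12)       # standard Normal tail
p_heavy = t.sf(12, df=3)     # Student-t with 3 degrees of freedom

print(f"Normal model:       {p_normal:.2e}")
print(f"Heavy-tailed model: {p_heavy:.2e}")
```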
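For the finite-N item, a bare-bones percentile bootstrap for the median of a small sample. The data values are invented.

```python
import numpy as np

# Percentile bootstrap for the median of a small (made-up) sample.
rng = np.random.default_rng(0)
sample = np.array([2.1, 3.4, 1.9, 5.6, 2.8, 4.2, 3.1, 2.5])

boot_medians = np.array([
    np.median(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"Median: {np.median(sample):.2f}, 95% bootstrap CI: ({lo:.2f}, {hi:.2f})")
```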
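For the entropy item, a small made-up example: a binary outcome that is 50/50 on its own carries one bit of uncertainty; conditioned on a hypothetical predictor that shifts the odds to 90/10 within each half of the data, the expected entropy drops.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy, in bits, of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

h_without_model = entropy_bits([0.5, 0.5])
# The predictor splits the data 50/50, and within each half the outcome
# is 90/10 one way or the other.
h_with_model = 0.5 * entropy_bits([0.9, 0.1]) + 0.5 * entropy_bits([0.1, 0.9])

print(f"Entropy without the model: {h_without_model:.3f} bits")
print(f"Entropy with the model:    {h_with_model:.3f} bits")
```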
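For the evidence item, the simplest possible version of evidence-as-likelihood-ratio (in the spirit of Royall), folded into prior odds the Bayesian way. All the numbers are invented.

```python
# Strength of evidence as a likelihood ratio, combined with prior odds.
p_data_given_h1 = 0.30   # probability of the observed data under hypothesis 1
p_data_given_h2 = 0.05   # probability of the observed data under hypothesis 2

likelihood_ratio = p_data_given_h1 / p_data_given_h2
prior_odds = 1 / 4       # prior odds of hypothesis 1 versus hypothesis 2
posterior_odds = prior_odds * likelihood_ratio

print(f"Likelihood ratio (evidence strength): {likelihood_ratio:.1f}")
print(f"Posterior odds for hypothesis 1: {posterior_odds:.2f}")
```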
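Finally, for the revolution item: scan thousands of purely random “predictors” against a random outcome and count how many clear the conventional p &lt; 0.05 bar. Nothing is related to anything here, yet a fixed cutoff reliably manufactures discoveries at scale.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
n_obs, n_predictors = 50, 10_000
y = rng.normal(size=n_obs)

false_positives = 0
for _ in range(n_predictors):
    x = rng.normal(size=n_obs)      # pure noise, unrelated to y
    _, p_value = pearsonr(x, y)
    if p_value < 0.05:
        false_positives += 1

print(f"'Significant' relationships found: {false_positives} of {n_predictors}")
```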
Last updated June 19, 2012.