25
Nov 11

Interactive map of US road fatalities


Fantastic map showing the location of every death on US roads from 2001 to 2009. Go take a look, then come back and tell me….

What location did you zoom in on first, and why?


23
Nov 11

Monty Hall revisited

Chances are you’ve already heard about the Monty Hall problem. I wouldn’t be mentioning it at all, except that I keep reading descriptions of the problem that miss the absolutely critical point. For those who are new to the problem, here’s a summary:

Suppose you’re a contestant on a game show. The host, Monty Hall, shows you three numbered doors. Two of these doors hide goats, which you don’t want, and one of them hides a shiny new convertible, which you do. Pick the right door and you go home with the convertible, pick the wrong door and you get the goat (which I suspect they don’t even really give you). You make your best guess and choose a door. But before showing what’s behind it, Mr. Hall opens one of the other two doors to reveal a goat. “Now”, he asks, “do you want to stick with your original choice, or do you want to switch doors to the other one that hasn’t been opened yet?”

While you try desperately to remember the rules for conditional probability, the studio audience yells out suggestions and an attractive model smiles at you, making you wonder if you should ask if she comes with the car, but then you realize she probably gets that question all the time. Time is running out! Should you switch doors?

The correct decision, at least in terms of maximizing your chances of winning the car (but, alas, not the model), is to switch. IQ Test Grand Champion and writer Marilyn Vos Savant famously answered the question in one of her columns. Her answer, that you should switch, was hugely controversial. The math behind the solution is surprisingly simple, though it rarely seems to be presented in a simple way. Your first guess has a one-in-three chance of being right. That means your first guess has a two-in-three chance of being wrong. If your first guess was wrong, the car must be behind one of the other two doors. Since Monty just showed you the goat, the car must be behind the other door. Switch and you will get the car for sure. If you don’t switch, your chance of winning remains one-in-three. If you do switch, it jumps up to two-in-three. So ignore the studio audience and don’t get distracted by the model. Just call out the number of that other door!
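If you’d rather see the numbers for yourself, here’s a quick Monte Carlo check in R (my own sketch, not anything from Vos Savant’s column). Note what it quietly assumes: the car is placed uniformly at random, and Monty always opens an unpicked door hiding a goat.

trials = 100000
switchWins = 0

for(i in 1:trials) {
	car = sample(1:3, 1)    # door hiding the car
	pick = sample(1:3, 1)   # contestant's first guess

	# Monty opens a door that is neither the pick nor the car
	goatDoors = setdiff(1:3, c(pick, car))
	# Careful: sample(x, 1) on a length-one vector x draws from 1:x
	montyOpens = if(length(goatDoors) == 1) goatDoors else sample(goatDoors, 1)

	# Switching means taking the one remaining closed door
	finalPick = setdiff(1:3, c(pick, montyOpens))
	if(finalPick == car) switchWins = switchWins + 1
}

switchWins / trials  # should come out near 2/3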

But wait! Did you catch the missing assumptions needed to make this solution work? The big one, for me, is that Monty Hall will always follow the same procedure of opening up a door with a goat, regardless of what’s behind the door you picked. If you distrust Monty, you might suspect that he will only show you a goat when you’ve picked the car, in order to entice you to switch and lose the car. In that case you should stick with the door you have. Or perhaps Monty shows the goat more frequently when the car is picked first (but not all the time), in which case switching may or may not be the best strategy.

The part where I yell
The problem here is that the Monty Hall problem MAKES NO SENSE WITHOUT AN EXPLICIT PRIOR on Monty Hall’s behavior. Sorry for the yelling, but the point is too important to miss. In this case, the prior is your belief about the procedure Monty is using, and how strongly you hold that belief to be true. The notion of a “prior” might be difficult to explain to a general audience, but assuming a particular one without stating it directly is poisonous. The Monty Hall problem, like many others, can’t be turned into math without first assuming some kind of probability distribution for the inputs.

Usually, when the distribution of an input isn’t specified, we tend to assume that every possible option has an equal chance of occurring; in other words, that we have a uniform probability distribution. This makes sense for another hidden assumption in the problem — that either the game show contestant has made his first pick randomly, or that the prizes were placed behind the doors randomly. Though even here I tend to agree with mathematical historian Byron Wall, who argues that our default assumption of a set of equally likely events is problematic. But in the case of the Monty Hall problem, there’s no uniform to even assume. The set of possible ways that Mr. Hall could decide to act is infinite and unknowable.

How does Hall pick between the goats?
Another hidden assumption is that Monty randomizes which door to reveal when the unpicked doors are both hiding goats. If he didn’t, and you knew for sure that Monty would always pick the door with the lower number when he had a choice, then the math works out differently. Now, if you pick door number 1 and Monty shows you a goat behind door number 3, you know for sure the car must be behind door number 2. Switching guarantees you a win! If you pick door number 1 and Monty opens door number 2, that could mean either a car or a goat is behind door number 3. To calculate your odds of winning by switching, you can use Bayes’ theorem to find the probability that a car is behind door number 3, given that Monty revealed a goat behind door 2.
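Here’s a sketch of that computation (assuming, additionally, that the car was placed uniformly at random). Say you picked door number 1, and write [latex]C_k[/latex] for “the car is behind door [latex]k[/latex]”. Under the lower-numbered-door rule, [latex]P(\text{opens 2} \mid C_1) = 1[/latex], [latex]P(\text{opens 2} \mid C_2) = 0[/latex], and [latex]P(\text{opens 2} \mid C_3) = 1[/latex], so Bayes’ theorem gives

[latex]P(C_3 \mid \text{opens 2}) = \frac{1 \cdot \frac{1}{3}}{1 \cdot \frac{1}{3} + 0 \cdot \frac{1}{3} + 1 \cdot \frac{1}{3}} = \frac{1}{2}[/latex]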

As that calculation shows, if Monty opens door number 2 and he’s using the rule stated above, then switching doors gives you a one-half probability of winning, as does staying with the door you have. It doesn’t matter. No matter which door Monty reveals, switching your pick is never worse than not switching, and sometimes it’s strictly better. That makes switching what game theorists call a dominant strategy, one you would always want to employ. Even so, since Monty’s door-revealing rule can change your odds of winning, this is another hidden assumption that should have been made explicit.

Back when goats were golden
When the Monty Hall problem was originally described to me, I assumed that Monty had chosen a door to reveal at random, and that this door just happened to contain a goat. Perhaps not the most reasonable assumption to make, but at the time I was still young enough to think that winning a goat might be cooler than winning some K-car convertible (hum… maybe I still believe that). At any rate, I didn’t have the skills to work out a solution under my assumptions back then, but doing it now takes just a little bit of work.

The probability that you will win after switching, given that Monty “accidentally” reveals a goat, is actually the sum of two other probabilities. The first probability, that you will win by switching if both of the other doors contain goats, is zero. The second is the probability that only one of the two other doors was hiding a goat, in which case you will win for sure, since we already assumed that Monty revealed a goat. Because we know that Monty picked a goat by accident, we gain no additional information about the door we picked or the alternative we might switch to. Each one is equally likely to have the car, so switch or not, our probability of winning is one-half.

If you find this explanation confusing, you might want to try Jeffrey Rosenthal’s explanation, which shows how to re-normalize probabilities of events within your target condition.
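If you trust simulations more than derivations, here’s a sketch of the brute-force check in R. Monty opens one of the other doors blindly, and we keep only the trials where he happens to reveal a goat, which is exactly the re-normalization described above.

trials = 100000
goatShown = 0
switchWins = 0

for(i in 1:trials) {
	car = sample(1:3, 1)
	pick = sample(1:3, 1)

	# Monty picks blindly between the two doors you didn't choose
	# (setdiff always returns two doors here, so sample() is safe)
	montyOpens = sample(setdiff(1:3, pick), 1)

	if(montyOpens != car) {
		# He got lucky and revealed a goat: this is our target condition
		goatShown = goatShown + 1
		finalPick = setdiff(1:3, c(pick, montyOpens))
		if(finalPick == car) switchWins = switchWins + 1
	}
}

switchWins / goatShown  # should come out near 1/2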

The Man Who Loved Only Bayes
After publishing her solution, Vos Savant was flooded with letters telling her she got it wrong. I suspect that many of those readers were ignorant of her assumptions, though Vos Savant says that most people fully understood the problem, and simply didn’t accept her solution. One of the few accounts to mention the importance of Monty Hall’s procedural rules, even though that part only comes after 8 pages of discussion, is in Paul Hoffman’s “The Man Who Loved Only Numbers”. To explain why so many people, many of them with advanced degrees, got it wrong, Hoffman quotes mathematician Andrew Vázsonyi:

“Physical scientists tend to believe in the idea that probability is attached to things. Take a coin. You know the probability of a head is one-half. Physical scientists seem to have the idea that the probability of one-half is fused with the coin. It’s a property. It’s a physical thing. But say I take that coin and toss it a hundred times and each time it comes up tails. You will say something is wrong. The coin is false. But the coin hasn’t changed. It’s the same coin that it was when I started to toss it. So why did I change my mind? Because my mind has been upgraded with information. This is the Bayesian view of probability. It took me much effort to understand that probability is a state of mind.”

I might view probability more in terms of degrees of (rational) belief, but the Vázsonyi quote highlights a key component missing in much of science: the direct recognition that you have a prior, and that this prior is a form of bias, very often baked right into the model you have chosen. There is no escape from this bias! The frequentist approach to probability is really just a special case within the world of Bayesian inference, where you have picked an uninformative (or minimally informative?) prior. But even here you have to model the prior. You have to know: how are we assuming that Monty Hall makes his decision about showing the contestant a goat? Is it based on some fixed probability regardless of which door the contestant picks? Does Monty consult the entrails of a chicken? As mentioned before, the world of possibilities is infinite, and no progress can be made in terms of our understanding until we delineate a space in which our prior beliefs will live. Only once we’ve done that, implicitly or (preferably!) explicitly, can we test out our beliefs, and update them based on Monty Hall’s actions.


09
Nov 11

Manifesto update

I just got done tweaking some of the points in my Manifesto and added a new one about evidence. As before, the Manifesto is a work in progress; your feedback is welcome here since you can’t post comments to pages in WordPress.


20
Oct 11

Queueing up in R, continued

Shown above is a queueing simulation. Each diamond represents a person. The vertical line of diamonds is the queue; at the bottom are 5 slots where the people are attended to. The size of each diamond is proportional to the log of the time it will take them to be attended. Color is used to tell one person from another and doesn’t have any other meaning. Code for this simulation, written in R, is here. This is my second post about queueing simulation; you can find the first one, including an earlier version of the code, here. Thanks as always to commenters for their suggestions.

A few notes about the simulation:

  • Creating an animation to go along with your simulation can take a while to program (unless, perhaps, you are coding in Flash), and it may seem like an extra, unnecessary step. But you can often learn a lot just by “watching”, and animations can help you spot bugs in the code. I noticed that sometimes smaller diamonds hung around for much longer than I expected, which led me to track down a tricky little error in the code.
  • As usual, I’ve put all of the configuration options at the beginning of the code. Try experimenting with different numbers of intervals and tellers/slots, or change the mean service time.
  • If you want to run the code, you’ll need to have ImageMagick installed. If you are on a PC, make sure to include the full path to “convert”, since Windows has a built-in convert tool that might take precedence. Also, note how the files that represent the individual animation cells are named; that’s so they are added to the animation in the right order. Naming them sequentially without leading zeros failed (see the sketch after this list).
  • I used Photoshop to interlace the animated GIF and resave it. This reduced the file size by over 90%.
  • The code is still a work in progress; it needs cleanup, and I still have some questions I want to “ask” of the simulation.
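As promised, here’s a sketch of the frame-naming and conversion pattern from the notes above. It’s illustrative only (the real simulation code is linked in the post); the %05d format gives the leading zeros that keep the frames in order.

# R's png device numbers the frames for us, with leading zeros
png(file = "cell%05d.png")
for(i in 1:100) {
	plot(runif(10), runif(10), pch = 18, main = paste("Interval", i))
}
dev.off()

# On Windows, give the full path to ImageMagick's convert here
system("convert -delay 10 cell*.png animation.gif")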

13
Oct 11

Waiting in line, waiting on R

I should state right away that I know almost nothing about queuing theory. That’s one of the reasons I wanted to do some queuing simulations. Another reason: when I’m waiting in line at the bank, I tend to do mental calculations for how long it should take me to get served. I look at the number of tellers attending, pick an average teller session length (say one or two minutes), then come up with an average wait per person in line. For example, if there are 4 tellers and the average person takes 2 minutes to do her transaction, then new tellers should become available every 30 seconds. If I’m the 6th person in line, I should expect to wait 3 minutes before I’m attended.
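That mental calculation is just arithmetic, of course. A sketch, with the numbers from my example:

tellers = 4
minutesPerCustomer = 2
placeInLine = 6

# On average a teller frees up every minutesPerCustomer/tellers minutes
expectedWait = placeInLine * minutesPerCustomer / tellers
expectedWait  # 3 minutes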

However, based on my experience (the much maligned anecdotal), it often takes much longer than expected. My suspicion is that over time the tellers get “clogged up” with the slowest people, so that even though an average person might take only 2 minutes, the people you actually see being attended right now are much more likely to be those who take a long time.

To explore this possibility, I set up a simulation in R (as usual, full source code provided at end of post). The first graph, at the beginning of this post, shows the length of queues over time for 4 runs of the simulator, all with the same configuration parameters. Often this graph was completely flat. Note though that when things get out of hand in terms of queue length, they can get way out of hand. To get a feel for how long you would have to wait, given that there is a line in front of you, I tracked how long the first person in line had to wait to be served. Given my configuration settings, this wait would be expected to last 5 intervals. It does seem to take longer than 5 intervals, though I want to tweak the model some and do more testing before I’m ready to quantify that wait.

There are, as with any models, things that make this one unrealistic. The biggest may be that people get in line with the same probability no matter how long the line is. What I need is some kind of tendency to abandon the line if it’s too long. That shouldn’t shorten the wait times for those already in line. It could make those worse. If you assume that slower people are prepared to wait longer in line, then the line is more likely to have slow people. Grandpa Jones is willing to spend an hour in line so he can chat for a while with the pretty young teller; but if the line is too long, that 50-year-old business guy will come back later to deposit his check. I would imagine that, from the bank’s perspective, this presents a tricky dilemma. The people whose time is worth the least are probably the most likely to be clogging up your tellers, upsetting the customers you care the most about (I know, I know, Bank of America cares about all of us equally, right?).
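One simple way to model that abandonment, as a sketch (the exponential decay and the patience parameter are pure invention on my part):

baseP = 0.1       # arrival probability when the line is empty
patience = 20     # hypothetical: bigger means people tolerate longer lines
queueLength = 15  # say 15 people are already waiting

pJoin = baseP * exp(-queueLength / patience)
pJoin  # about 0.047, versus 0.1 for an empty line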

Code so far; note that run times can be very long for a large number of intervals if the queue gets long:

#### Code by Matt Asher. Published at StatisticsBlog.com ####
#### CONFIG ####
# Number of slots to fill
numbSlots = 5

# Total time to track
intervals = 1000

# Probability that a new person will show up during an interval
# Note, a maximum of one new person can show up during an interval
p = 0.1

# Average time each person takes at the teller, discretized exponential distribution assumed
# Times will be augmented by one, so that everyone takes at least 1 interval to serve
meanServiceTime = 25

#### INITIALIZATION ####
queueLengths = rep(0, intervals)
slots = rep(0, numbSlots)
waitTimes = c()
leavingTimes = c()
queue = list()
arrivalTimes = c()
frontOfLineWaits = c()


#### Libraries ####
# Use the proto library to treat people like objects in traditional oop
library("proto")

#### Functions ####
# R is missing a nice way to do ++, so we use this
inc <- function(x) {
  eval.parent(substitute(x <- x + 1))
}

# Main object, really a "proto" function
person <- proto(
	intervalArrived = 0,
	intervalAttended = NULL,
	
	# How much teller time will this person demand?
	intervalsNeeded = floor(rexp(1, 1/meanServiceTime)) + 1,
	intervalsWaited = 0,
	intervalsWaitedAtHeadOfQueue = 0
)

#### Main loop ####
for(i in 1:intervals) {
	# Check if anyone is leaving the slots
	for(j in 1:numbSlots) {
		if(slots[j] == i) {
			# This person is done being attended; free up the slot
			slots[j] = 0
			leavingTimes = c(leavingTimes, i)
		}
	}
	
	# See if a new person is to be added to the queue
	if(runif(1) < p) {
		newPerson = as.proto(person$as.list())
		newPerson$intervalArrived = i
		# The service time in the person prototype is drawn just once, when
		# the prototype is defined, so give each new arrival a fresh draw
		newPerson$intervalsNeeded = floor(rexp(1, 1/meanServiceTime)) + 1
		queue = c(queue, newPerson)
		arrivalTimes = c(arrivalTimes, i)
	}
	
	# Can we place someone into a slot?
	for(j in 1:numbSlots) {
		# Is this slot free
		if(!slots[j]) {
			if(length(queue) > 0) {
				placedPerson = queue[[1]]
				slots[j] = i + placedPerson$intervalsNeeded
				waitTimes = c(waitTimes, placedPerson$intervalsWaited)
				# Only interested in these if person waited 1 or more intervals at front of line
				if(placedPerson$intervalsWaitedAtHeadOfQueue) {
					frontOfLineWaits = c(frontOfLineWaits, placedPerson$intervalsWaitedAtHeadOfQueue)
				}
				
				# Remove placed person from queue
				queue[[1]] = NULL
			}
		}
	}
	
	# Everyone left in the queue has now waited one more interval to be served
	if(length(queue)) {
		for(j in 1:length(queue)) {
			inc(queue[[j]]$intervalsWaited) # = queue[[j]]$intervalsWaited + 1
		}
		
		# The (possibly new) person at the front of the queue has had to wait there one more interval.
		inc(queue[[1]]$intervalsWaitedAtHeadOfQueue) # = queue[[1]]$intervalsWaitedAtHeadOfQueue + 1
	}
	
	# End of the interval, what is the state of things
	queueLengths[i] = length(queue);
}

#### Output ####
plot(queueLengths, type="o", col="blue", pch=20, main="Queue lengths over time", xlab="Interval", ylab="Queue length")
# plot(waitTimes, type="o", col="blue", pch=20, main="Wait times", xlab="Person", ylab="Wait time")

10
Oct 11

Drop in confidence

I’ve been paying a lot of attention lately to how statistics are released to the public. In particular, when are confidence intervals used, and when are they dropped? When are numbers presented as fact, and when are they acknowledged to be fuzzy?

The only time you consistently see confidence intervals reported, in the general press, is for poll results. As in: 66% of respondents believed this poll to be self-reflexive, with a margin of error of plus or minus 5%.

Weather reports often have percentages involved, but it would be a stretch to call these confidence intervals. For example, when the attractive meteorologist on Channel 7 tells you that there is an 80% chance of rain tomorrow, they are presenting that as a fact. Behind that number is a computer simulation that may or may not be able to estimate a confidence interval around that 80% number.

Government statistics and estimates, no matter how bad or biased, almost never come with confidence intervals attached. Gross Domestic Product of the U.S. in 2010? Estimated at $14.64 trillion. Confidence interval for that estimate? Probably so bad you don’t even want to know. Or maybe you do?

Are there times when you’ve been surprised to see a confidence interval reported or missing?


15
Jun 11

The dismal science

I’ve begun reading some of the recent works by what are called “behavioral economists”. A staple of their work seems to be research into how humans fail to be perfectly rational economic actors, the most famous book of this sort being Dan Ariely’s Predictably Irrational. No doubt there is a lot of value in understanding how humans tend to deviate from behavior we (or, perhaps, academics) might expect. At the same time, I see many of these experiments as deeply flawed, in a way directly related to these economists’ failures to understand probability. In particular, they seem incapable of understanding the (often very rational) role that uncertainty plays in the minds of the participants. You can think of this uncertainty as subjective probability, related to our degrees of belief. As human beings, we all have invisible, imperfect Bayesian calculators in our heads which crunch the data from our world and make implicit judgments about the information we take in. Right now, as you read this, how much credibility does what I’m saying have in your mind? How would your “uncertainty” about my arguments change if I made a clear mistakeee?

To see where the economists fail, consider the following experiment: A stranger approaches you and offers to give you $100 in cash right now, or to pay you $1000 in exactly one year. I can say right away that I would pocket the $100. To an economist, this would mean that I have an (implied) internal rate of interest of 900% per annum, since $100 right now is equal in my mind to $1000 in a year. From there, the economist could easily ask me a few other questions to show that my internal rate of return isn’t really 900%; in fact, it’s all over the place. My preferences are fully inconsistent and therefore irrational.

But is my behavior really all that irrational? In taking the $100 now, what I’ve really done is an implicit probability calculation. What is the chance that I will actually get paid that $1000 in a year? $100 in my hand right now is simple. A payment I have to wait a year for is complicated. How will I receive it? Who will pay out? How many mental resources will I spend over the course of the year thinking (or worrying) about this $1000 payment from an unknown person? Complexity always adds uncertainty; the two cannot be disentangled. The economist has failed to understand people’s (rational) uncertainties, and has ignored the psychological cost of living with that uncertainty, especially over long periods of time.
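A sketch of that implicit calculation, with numbers that are entirely my own invention:

payment = 1000
pNeverPaid = 0.9  # hypothetical: my distrust of a stranger's year-off promise
worryCost = 50    # hypothetical: the cost of tracking the debt for a year

expectedValue = payment * (1 - pNeverPaid) - worryCost
expectedValue  # 50, and suddenly the $100 in hand looks quite rational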

Here’s another experiment. Imagine you asked your neighbor to look after your dog for you while you were gone for the weekend. How much compensation might she expect? If she’s particularly sociable, she might be willing to look after your dog for free, or be happy with a $50 payment. But now imagine you offered her $2000 right away. Would she accept that? If not, then she has what economists call a downward-sloping supply curve: giving her more money leads to less of the same service. Downward-sloping supply curves, especially on the personal (micro) level, get economists all hot and bothered. They seem incoherent and rife with opportunities for exploitation.

Again though, is this really a case of irrational behavior? All things being equal (a deadly assumption that’s often made by economists), there’s no doubt that your neighbor would prefer $2000 to $50 for providing the same service. But of course in this example all things aren’t equal. The amount you offer her is a signal. It tells her something about your assessment of the underlying value of the service you are requesting. $50 tells her that you appreciate the minor inconvenience of caring for your poodle. $2000 tells her that something screwy is going on. Is your dog a terror? Will it chew up her furniture? Pee everywhere? Is there some kind of legal issue that she has no idea about? The key here is the importance of conditional probability. Specifically, what is the difference in the probability that taking care of this dog is equivalent to stepping into a mine field, given that $50 is being offered, versus given that $2000 is being offered? Human beings in general have incredibly sophisticated minds, capable of spotting hidden uncertainties and performing fuzzy, but essentially correct Bayesian updates of our prior beliefs given new information. Unfortunately, such skills seem to be lacking from many modern economists.
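To make the signaling argument concrete, here’s a toy Bayes’ rule calculation. Every number in it is hypothetical:

pTrouble = 0.05            # prior: the dog is secretly a nightmare
pBigOfferIfTrouble = 0.60  # desperate owners overpay
pBigOfferIfFine = 0.01     # normal owners almost never offer $2000

pBigOffer = pBigOfferIfTrouble * pTrouble + pBigOfferIfFine * (1 - pTrouble)
pBigOfferIfTrouble * pTrouble / pBigOffer  # about 0.76: the offer itself is alarming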


21
May 11

Problematic quote of the day

“Ellsberg offered several groups of people a chance to bet on drawing either a red ball or a black ball from two different urns, each holding 100 balls. Urn 1 held 50 balls of each color; the breakdown in Urn 2 was unknown. Probability theory would suggest that Urn 2 was split 50-50, for there was no basis for any other distribution.”

From page 280 of Against the Gods: The Remarkable Story of Risk.


08
Feb 11

Power, p-values, publication bias and statistical evidence

I created this video to show as part of a presentation I’ll be giving next week. Your comments are welcome, either here or on YouTube.


12
Jan 11

R: Attack of the hair-trigger bees?

In their book “Complex Adaptive Systems”, authors Miller and Page create a theoretical model for bee attacks, based on the real, flying, honey-making, photogenic stingers. Suppose the hive is threatened by some external creature. Some initial group of guard bees sense the danger and fly off to attack. As they go, they lay down a scent. Other bees join in the attack if their scent sensitivity threshold (SST) is reached. When they join the attack, they send out their own warning scent, perhaps triggering an even bigger attack. The authors make the point that if the colony of bees were homogeneous, and every single one had the same attack threshold, then if that threshold was above the initial attack number, no one else would join in. If it were below, everyone would go all at once. Fortunately for the bees, they are a motley lot, which is to say a lot more diverse than you would imagine just from looking at the things. As a result, they exhibit much more complicated behavior. The authors describe a model with 100 bees and call the size of the initial attacking group “R”. By assuming a heterogeneous population of 100 with equally spaced thresholds, they note:

“In the heterogeneous case, a full-scale attack ensues for [latex]R \geq 1[/latex]. This latter result is easy to see, because once at least one bee attacks, then the bee with threshold equal to one will join the fray, and this will trigger the bee with the next highest threshold to join in, and so on…. It is relatively difficult to get the homogeneous hive to react, while the heterogeneous one is on a hair trigger. Without explicitly incorporating the diversity of thresholds, it is difficult to make any kind of accurate prediction of how a given hive will behave.”

I think this last sentence is their way of noting that the exact distribution of sensitivities makes a huge difference in how the bees behave, which indeed it does. I decided to put the bees to the test, so I coded a simulation in the language R (code at the end of this post). I gave 100 virtual apis mellifera a random sensitivity level, chosen from a Uniform(1,100) distribution, then assumed 10 guards decided to attack. How would the others respond? Would a hair-trigger chain reaction occur? The chart at the top shows the results from 1000 trials. Looks pretty chaotic, so here’s a histogram of the results:

As you can see, most of the time the chain reaction dies out quickly, with no more than 20 new bees joining in the attack. However, occasionally the bees do go nuts, launching the full-on attack. This happened about 1 percent of the time. Perhaps most interestingly, all of the other attack levels were clearly possible as well, in the flat zone from about 30 to 95. You might want to try playing with the distribution of the sensitivities and see if you get any other interesting distributions. Meanwhile, if you decide to raid a hive, make sure to dip yourself in mud first; that way the bees will think you are nothing but an innocent black rain cloud.

Code in R:

trials = 1000

go = rep(0,trials)
initial = 10

for(i in 1:trials) {
  bees = sort(runif(100,1,100))
  
  # Everyone whose threshold is less than the initial amount goes.
  going = length(bees[bees < initial])

  if(going > initial) {
    more = length(bees[bees < going])

    while(more > going) {
      # Keep doing this until it stops
      going = more
      more = length(bees[bees < going])
    }
  }

  go[i] = going
}

# The two charts shown above: attack sizes per trial, then as a histogram
plot(go, col="blue", pch=20, main="New bees joining the attack", xlab="Trial", ylab="Number of bees")
hist(go, breaks=50, col="blue", main="Histogram of attack sizes", xlab="Number of bees joining")