Fake text generation the wrong way, and a contest

As part of a bigger project, I needed to simulate a text string based on a source document, but at the character level. Just in case people find the code useful, I’ve uploaded it to MCMCtext.r.

In my simulated text, each character is drawn according to the probability, estimated from the source text, that it follows the previous character. The result is (nearly complete) gibberish without much interest to anyone, except perhaps those looking for a replacement for the standard Lorem Ipsum dummy text. More interesting fake text could be generated by conditioning on the previous two (or more) characters, or by working at the level of words.
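
For anyone who wants to see the idea without opening the R file, here is a minimal Python sketch of character-level generation from transition counts. It is not a port of MCMCtext.r; the function names `fit_transitions` and `generate`, and the choice to restart from a random character at a dead end, are my own assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_transitions(text):
    """Count how often each character is followed by each other character."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, length, seed=None):
    """Sample `length` characters from the first-order transition counts."""
    if length <= 0:
        return ""
    if not counts:
        raise ValueError("need a source text of at least two characters")
    rng = random.Random(seed)
    starts = sorted(counts)  # characters that have at least one follower
    current = rng.choice(starts)
    out = [current]
    while len(out) < length:
        followers = counts.get(current)
        if not followers:
            # Assumption: a character seen only at the very end of the
            # source has no follower, so restart from a random character.
            current = rng.choice(starts)
        else:
            chars = list(followers)
            weights = [followers[c] for c in chars]
            current = rng.choices(chars, weights=weights, k=1)[0]
        out.append(current)
    return "".join(out)


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog and the cat"
    print(generate(fit_transitions(source), 60, seed=1))
```

Conditioning on the previous two characters would only change the dictionary key from a single character to a pair.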

Before moving on, I thought it might be interesting to see if anyone can “reverse engineer” my fake text output to figure out which original text was used as a source to generate it. Got that? The source text comes from Project Gutenberg. Hint: some features of the (fake) text could help you narrow the field of candidates.

First person to post a correct guess in the comments gets a copy of my comic and an unlimited supply of Hotpockets*. Limit one guess per person please.

* Hotpockets offer only valid if you are currently saving the planet from destruction.


  1. Lol and what about Xena tapes? Do I get those too?

  2. Webster’s Unabridged Dictionary.

    • Nice guess! That would have been a good one to use, but it wasn’t the one I did use.

      I’m not sure if this is an easy contest to solve or if it will send people down the (beautiful?) rabbit hole of support vector machines and similar. Good luck to all either way.

  3. My guess is The Life and Adventures of Robinson Crusoe by Daniel Defoe. I fitted a first-order Markov model to the generated text, scored the top 100 books at Project Gutenberg against the fitted transition matrix, and selected the one with the highest log probability.

  4. Is it _Tacitus and Bracciolini_, by John Wilson Ross?


  5. Alice in Wonderland.

  6. Nobody else seems to have noticed Matt’s “rabbit hole” clue!

  7. I think I may have picked something too obscure.
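
The scoring approach described in comment 3 can be sketched roughly as follows, in Python. This is a guess at the general shape, not the commenter's actual code: the names `fit`, `mean_log_prob` and `rank_candidates` are hypothetical, and both the add-one smoothing and the averaging per transition (so that books of different lengths are comparable) are assumptions the comment does not mention.

```python
import math
from collections import Counter, defaultdict


def fit(text):
    """First-order character transition counts for `text`."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts


def mean_log_prob(counts, text, alphabet_size, alpha=1.0):
    """Average log probability per transition of `text` under `counts`.

    Assumption: add-alpha smoothing, so transitions never seen in the
    fitted text get a small non-zero probability instead of log(0).
    """
    row_totals = {a: sum(row.values()) for a, row in counts.items()}
    pairs = list(zip(text, text[1:]))
    if not pairs:
        raise ValueError("text must contain at least two characters")
    total = 0.0
    for a, b in pairs:
        seen = counts[a][b] if a in counts else 0
        denom = row_totals.get(a, 0) + alpha * alphabet_size
        total += math.log((seen + alpha) / denom)
    return total / len(pairs)


def rank_candidates(fake_text, candidates):
    """Return (name, score) pairs, best-scoring candidate first.

    `candidates` maps a title to the full text of that book.
    """
    alphabet = set(fake_text)
    for text in candidates.values():
        alphabet |= set(text)
    counts = fit(fake_text)
    scores = {
        name: mean_log_prob(counts, text, len(alphabet))
        for name, text in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    fake = "abababab"
    books = {"ab_book": "abababababab", "aa_book": "aaaabbbbaaaa"}
    for name, score in rank_candidates(fake, books):
        print(name, round(score, 3))
```

In practice the candidate texts would be downloaded book files rather than toy strings, and fitting the model to each candidate and scoring the fake text against it would be an equally reasonable variant.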