Some Motivations for Bayesian Statistics

If you’ve been following my Twitter stream, you have probably seen that I’m doing some reading and study on Bayesian statistics lately. For a variety of reasons, I find the Bayesian model of statistics quite compelling and am hoping to be able to use it in some of my research.

Traditional statistics, which encompasses well-known methods such as t-tests and ANOVA, comes from the frequentist school of statistical thought. The basic idea of frequentist statistics is that the world is described by parameters that are fixed and unknown. These parameters can be all manner of things — the rotation rate of the earth, the average life span of a naked mole rat, or the average number of kittens in a litter of cats. It is rare that we have access to the entire population of interest (e.g. all mature female cats) and can measure a parameter directly, so we estimate parameters by taking random samples from the population, computing some statistic over each sample, and using that as our estimate of the population parameter. Since these parameters are unknown, we do not know their exact values. Since they are fixed, however, we cannot discuss them in probabilistic terms. Probabilistic reasoning applies only to random variables, and parameters are not random — we just don’t know what their values are. Probabilities, expected values, and the like are only meaningful in the context of repeated random experiments drawn from the population.

The Bayesian says, “Who cares?” Bayesian statistics applies probabilistic methods and reasoning directly to the parameters themselves. This doesn’t necessarily mean that the Bayesian thinks the world is really random, though. It turns out that we can use probabilities not only to express the chance that something will occur, but also to express the extent to which we believe something, and the math all still works. So we can use the algebra of probabilities to quantify and describe how strongly we believe various propositions, such as “the average number of kittens per litter is 47”.

One of the fundamental differences, therefore, is that the frequentist can only apply probabilities to the outcomes of repeated experiments, while the Bayesian can apply probabilities directly to their knowledge of the world. There are other important differences as well — frequentist statistics is primarily concerned with testing and falsifying hypotheses, while Bayesian statistics focuses more on determining which of several competing models or hypotheses is most likely to be true — but those differences play less of a role in what I find compelling about Bayesian statistics, and it is also possible to apply falsificationist principles in a Bayesian framework.
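As a toy illustration of applying probability directly to a parameter, here is a conjugate Bayesian update in OCaml. Everything in it is invented for illustration: a Gamma prior expresses our belief about the average litter size, and observing Poisson-distributed litter counts updates that belief in closed form.

```ocaml
(* Toy Bayesian update: a Gamma(a, b) prior over the mean litter
   size, updated after observing Poisson counts.  The prior values
   and the data are made up. *)

let prior_a = 2.0   (* prior shape *)
let prior_b = 0.5   (* prior rate *)

(* With a Gamma-Poisson conjugate pair, the posterior is available
   in closed form: the shape gains the total observed count, and the
   rate gains the number of observations. *)
let posterior counts =
  let n = float_of_int (List.length counts) in
  let total = float_of_int (List.fold_left (+) 0 counts) in
  (prior_a +. total, prior_b +. n)

let () =
  let a, b = posterior [4; 5; 3; 6; 4] in
  (* The posterior mean a/b is our updated belief about the average. *)
  Printf.printf "posterior mean litter size: %.2f\n" (a /. b)
```

The posterior is a full distribution over the parameter, not just a point estimate, which is exactly the kind of probabilistic statement about a fixed-but-unknown quantity that the frequentist framework disallows.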

Sales Tax and E-Commerce — Not a Simple Problem

In the age of e-commerce, sales taxes are a difficult problem. Currently, online retailers such as Amazon.com operate under the same rule that has traditionally applied to mail-order sellers: they only have to collect sales tax from customers in states where they have a physical presence. Recently, this has gained greater attention as several states have passed measures counting affiliate program members (who receive kickbacks for links on blogs, etc.) as a physical presence, so that Amazon.com would be required to collect sales tax from customers in any state in which one of its affiliates resides. These affiliates are often private individuals and do no direct sales for Amazon — only referrals — but, in an effort to regain their tax base, states want to treat them as a physical presence.

I do not question that e-commerce presents significant problems for local economies. Money spent online does not stay in the local economy, and moving sales out of state does shrink the base for state sales taxes. While most states require residents to pay the equivalent tax themselves on out-of-state purchases, it is likely that few actually do. Particularly in the present time of tight state budgets, this certainly isn’t helping matters.

It is tempting to just say, as many states are attempting to, that Amazon.com should collect sales tax on sales to their residents. Citizens for Tax Justice recently published an article accusing Amazon.com of fostering tax evasion and calling the Supreme Court ruling that instituted the current system “misguided”.

I think this is an overly simplistic analysis of the situation. There is a complex tangle of issues surrounding sales tax in the U.S., and placing local requirements on Internet-based sellers creates further problems that I think are likely to be worse than the current woes.

Getting Things Typed: External Trusted Systems for Programming

One of the major tenets of David Allen’s Getting Things Done methodology is the concept of an external trusted system — a system for storing information outside your brain so that it can be retrieved as needed and/or brought to your attention when appropriate. Our brains are fickle, and we are apt to forget things. Further, trying to remember things consumes mental energy, so even when we do remember, our productivity is decreased by the stress of trying not to forget. Getting notes, appointments, tasks, and pretty much anything else we need to remember out of our heads and into a reliable external storage and retrieval system frees up our minds to focus on what we really want to accomplish.

I’ve been realizing lately that robust static type and module systems fill a similar role when programming. I have better things to do with my brain cycles than remember the details of functions, what they require, and where they are used.

A module and interface system like OCaml’s makes it easy to refer to a function’s header — its summary — when I need to recall its usage. Documentation extractors provide some of this benefit, and languages like Java provide similar benefits through their amenability to static analysis and the auto-completion and other IDE lookup features it enables. In a statically typed language, however, the type system explicitly delineates the permissible inputs and possible outputs of a function without requiring the programmer to list them manually. The documentation then only needs to describe behavior and any special requirements beyond those expressible in the type system (and the more expressive the type system, the fewer such requirements there are likely to be). The information necessary to call a function is thus retrievable exactly when it is needed.
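A minimal sketch of the idea (the module and its contents are invented for illustration): the signature is the retrievable summary, and the doc comment only has to add what the types cannot say.

```ocaml
(* An explicit interface: the signature constrains Stats so that only
   [mean] is visible, and its type records what callers need to know
   without the implementation having to be consulted or remembered. *)
module Stats : sig
  (** Mean of a list of ratings.  The type already says it takes a
      [float list] and yields a [float]; the comment only needs to add
      behavioral caveats (here: the list must be non-empty). *)
  val mean : float list -> float
end = struct
  let mean xs =
    List.fold_left (+.) 0.0 xs /. float_of_int (List.length xs)
end

let () = Printf.printf "%.1f\n" (Stats.mean [1.0; 2.0; 3.0])
```

In a separate-compilation setting the `sig … end` part would live in a `.mli` file, which is precisely the kind of compact, always-current summary an external trusted system is supposed to provide.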

The type system also enables the remind-when-appropriate aspect of an external trusted system. If I get something wrong when calling a function, there’s a decent chance the compiler will remind me when I compile the code. If I change a function, I don’t have to worry about remembering everywhere it was used; the type system will catch a large class of errors on the next compile.

Fixing the Dash Lights on a Dodge Caravan

We had a problem this last week with our ’03 Dodge Grand Caravan — the dash lights went out. Completely. Instrument panel, radio, heater controls — all unlit. My first thought, naturally, was a fuse.

However, when I looked at the fuse box, I couldn’t find any fuse that looked like it controlled the instrument panel backlighting. Web searching turned up a few things, including a fixya entry and a DodgeTalk.com forum post which document the same problem and an odd fix: disconnect the battery or otherwise cut power to the computer.

So, I went out and pulled the IOD fuse (Ignition Off Draw, controls the power drawn when the vehicle is off) for a couple hours. Disconnecting the negative cable on the battery would accomplish the same thing for this purpose. After putting the fuse back in, the dash lights worked.

Piecing things together, particularly with the insights from the DodgeTalk post, it seems that the issue is a computer problem — sometimes, for some reason, the computer will stop turning on the dash lights. Disconnecting power to it for a while resets the computer, allowing the dash lights to start working again. Weirdest van repair ever, but it works, and here it is documented so others can hopefully find that the solution does, indeed, work.

Tuning the OCaml memory allocator for large data processing jobs

TL;DR: setting `OCAMLRUNPARAM=s=4M,i=32M,o=150` can make your OCaml programs run faster. Read on for details and how to see if the garbage collector is thrashing and thereby slowing down your program.

In my research work with GroupLens, I do most of my coding for data processing, algorithm implementation, etc. in OCaml. Sometimes I have to suffer a bit for this when some nice library doesn’t have OCaml bindings, but in general it works out fairly well. And every time I go to do some refactoring, I am reminded why I’m not coding in Python.

One thing I have found, however, is that the default OCaml garbage collector parameters are not well-suited for much of my work — frequently long-running data processing tasks building and manipulating large, often persistent data structures. The program will run somewhat slowly (although there usually isn’t anything to compare it against), but more importantly, profiling with `gprof` will reveal that the program is spending a substantial amount of its time (~30% or more) in the OCaml garbage collector (if memory serves, frequently in the function `caml_gc_major_slice`).
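For reference, the same tuning can also be applied from inside the program via the standard library’s `Gc` module; the values below mirror the `OCAMLRUNPARAM` settings from the TL;DR (the heap sizes are in words, and `space_overhead` is a percentage).

```ocaml
(* Equivalent of OCAMLRUNPARAM=s=4M,i=32M,o=150, set at startup. *)
let () =
  Gc.set { (Gc.get ()) with
           Gc.minor_heap_size = 4 * 1024 * 1024;       (* s=4M  *)
           major_heap_increment = 32 * 1024 * 1024;    (* i=32M *)
           space_overhead = 150 }                      (* o=150 *)

(* Gc.print_stat dumps collector counters (minor/major collections,
   promoted words, heap size), which is a quick way to check whether
   the collector is thrashing without reaching for gprof. *)
let () = Gc.print_stat stdout
```

Setting it in the environment has the advantage of requiring no recompile; setting it in code makes the tuning travel with the program.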