Friday, February 6, 2009

More on the Census Bureau

The problem I am working on for the Census Bureau is known as "unbiased controlled rounding." The problem is the following:

GIVEN:

- A two-dimensional array of values, each in the range [0,1).
- The marginal counts - i.e. the sum of each row, and the sum of each column (and the grand total.) We know that the row sums and column sums are all integers, and they of course are equal to what the values in each column add up to.

FIND:

- A randomized method of producing an array of zeros and ones the same size as the original array such that:

-- The expected value in each cell is the same as the corresponding value in the original array.

-- The row sums and column sums are the same as those in the original array (all the time, not just on the average.)

---

The main application of this problem is to reduce the risk of disclosure when releasing "microdata" - data with very small sample sizes (for example, all the businesses within a particular area that's ten blocks wide) Basically you want to round off the answers that the respondents gave so that an adversary cannot identify an individual respondent by their answers, but still preserve all the aggregate statistical properties (like mean, variance etc.) and avoid introducing bias.

As it turns out, the two-dimensional case has a known solution. The three-dimensional case (where in addition to row and column sums, you have sums going in the third direction, and you also have sums through planar "slices" of the cubical "array") does not, and in fact it is not always possible to solve.However, the goal is to get as close as possible to a solution (say, make sure that the marginal counts are off by at most 2) and do this in a reasonably computationally efficient way.

---------------

BONUS QUESTION: Solve the problem given above for the one-dimensional case (i.e., when the array just has one row in it, and the only marginal sum there is is the grand total). The solution to this problem gives an idea of some of the techniques used in higher dimensional solutions.

4 comments:

Dan Mont said...

You pose all these problems, but you never pose solutions, so those of us who are too lazy to think about this for more than five minutes, never get any satisfaction.

Is this some sort of planned punishment on your part, or are you honestly thinking your readers can come up with the solution in a reasonable amount of time?

Alexander Mont said...

I thought I did post solutions to most of the other problems.

But anyway, here is the solution to this one:

You have, essentially, a vector of values in the range [0,1]. Choose any two of these values that are not equal to 0 or 1. Call the first one A and the second one B.

Let X = min (A, 1-B)
Let Y = min (B, 1-A)

With probability X/(X+Y), increase the value of A by Y and decrease the value of B by Y. If this happens, then if B <= 1-A, B becomes equal to zero, while if B >= 1-A, A becomes equal to one.

With probability Y/(X+Y), increase the value of B by X and decrease the value of A by X. If this happens, then if A <= 1-B, A becomes equal to zero, while if A >= 1-B, B becomes equal to one.

Thus at least one of the numbers will become equal to either zero or one. Note that this transformation does not change the expected value (because there is a Y/(X+Y) probability of changing it by X, and an X/(X+Y) probability of changing it by Y the other direction) and it does not change the sum, because you increase one number by the same amount you decrease the other.
Since each of thse transformations gets rid of one fractional value in the vector, if you keep doing it you will eventually get rid of all of them.

Nanette Goodman said...

Sounds like interesting work at the census bureau. Are you going to be working there over the summer also?

Alexander Mont said...

I don't know yet. My current appointment has a "not to exceed" date of May 20, so I will have to leave by then. But I'll have to see if they're willing to re-hire me over the summer.