The problem I am working on for the Census Bureau is known as "unbiased controlled rounding." It can be stated as follows:
- A two-dimensional array of values, each in the range [0,1).
- The marginal counts - i.e., the sum of each row, the sum of each column, and the grand total. The row sums and column sums are all integers, and of course they match what the values in the array actually add up to.
- The goal: a randomized method of producing an array of zeros and ones, the same size as the original array, such that:
-- The expected value in each cell is the same as the corresponding value in the original array.
-- The row sums and column sums equal those of the original array (every time, not just on average).
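To make the constraints concrete, here is a tiny hand-picked example (the numbers are mine, not from the original problem): an input array with integer marginals, and one 0/1 array that matches them. A full solution must randomize over all such roundings so that each cell is also unbiased.

```python
# Hypothetical example: a 2x3 array of values in [0, 1)
# whose row sums and column sums are integers.
A = [[0.50, 0.25, 0.25],
     [0.50, 0.75, 0.75]]

row_sums = [sum(row) for row in A]        # [1.0, 2.0]
col_sums = [sum(col) for col in zip(*A)]  # [1.0, 1.0, 1.0]

# One valid rounding: a 0/1 array with the same marginals.
R = [[1, 0, 0],
     [0, 1, 1]]

assert [sum(row) for row in R] == row_sums
assert [sum(col) for col in zip(*R)] == col_sums
```

Note that several different 0/1 arrays satisfy these marginals; the unbiasedness requirement pins down the probabilities with which each must be chosen.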
The main application of this problem is to reduce the risk of disclosure when releasing "microdata" - data with very small sample sizes (for example, all the businesses within a particular area ten blocks wide). Basically, you want to round off the answers the respondents gave so that an adversary cannot identify an individual respondent by their answers, while still preserving the aggregate statistical properties (mean, variance, etc.) and avoiding any bias.
As it turns out, the two-dimensional case has a known solution. The three-dimensional case (where, in addition to row and column sums, you have sums going in the third direction, as well as sums over planar "slices" of the cubical array) does not - in fact, an exact solution does not always exist. However, the goal is to get as close as possible (say, ensure the marginal counts are off by at most 2) and to do so in a reasonably computationally efficient way.
BONUS QUESTION: Solve the problem given above for the one-dimensional case (i.e., when the array has just one row, and the only marginal sum is the grand total). The solution to this problem gives an idea of some of the techniques used in higher-dimensional solutions.
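One well-known construction that satisfies the one-dimensional requirements is systematic sampling; this is a sketch of that technique (my choice of method, not necessarily the solution the author has in mind). A single uniform draw u places the grid of points u, u+1, u+2, ... over the cumulative sums; cell i rounds to 1 exactly when a grid point lands in its interval. Because each interval has length less than 1 and u is uniform, each cell is 1 with probability equal to its value, and exactly N = grand total ones are produced (up to floating-point accumulation error).

```python
import math
import random

def controlled_round_1d(p):
    """Unbiased controlled rounding of a 1-D list of values in [0, 1)
    whose sum is an integer N.

    A grid point u + k (k a nonnegative integer) falls in the half-open
    interval (S_{i-1}, S_i] iff floor(S_i - u) > floor(S_{i-1} - u),
    and since each interval is shorter than 1, at most one point can.
    """
    u = random.random()  # one shared uniform draw in [0, 1)
    out = []
    s = 0.0
    for v in p:
        k_before = math.floor(s - u)
        s += v
        k_after = math.floor(s - u)
        out.append(k_after - k_before)  # 0 or 1, since v < 1
    return out

# Example usage: values summing to 3 always yield exactly three ones.
rounded = controlled_round_1d([0.5, 0.25, 0.25, 0.5, 0.75, 0.75])
assert sum(rounded) == 3
```

The telescoping trick is what makes the count exact: summing k_after - k_before over all cells collapses to floor(N - u) - floor(-u) = N for any u in (0, 1).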