Monday, 19 May 2014

Bigger data isn't necessarily better

Sometimes it's hard being a statistician. Sometimes a long-established statistical concept jars with your audience and no matter how hard you try to explain it in plain terms, you can see in the audience's eyes that they don't really believe you. Those suspicious eyes staring back at you are fairly sure you're pulling some shenanigans to get out of working harder, or to wring an answer from the data that isn't really there. What you're saying just feels wrong.

Explaining sampling can be like that, particularly when you're dealing with online data that comes in huge volumes and fighting against a tidal wave of 'Big Data' PR.

The audience's thinking goes...

More data is just better, because more of a good thing is always better.

More data must be more accurate, more robust.

More impressive.

Then a statistician says, "We only need 10% of your file to get you all the answers that you need".

And rather than sounding like an efficient, cost-effective analysis, it feels disappointing.


"You only need a spoonful of soup to know what the whole bowl tastes like"


A common question from non-statisticians is to ask, "Overall, I have five million advert views [or search advert clicks, or people living in the North East of England, or whatever], so how big does my sample size need to be?"

Which sounds like a sensible question, but it's wrong.

Statisticians call that overall views number the "Universe" or "Population". It's the group from which you're going to draw your sample.

Once your population is bigger than about twenty thousand, it makes almost no difference to the size of the sample that you need. If you say that you've got one hundred million online advert views, and ask how big your sample needs to be, then the answer is, for all practical purposes, exactly the same as if you had fifty million views. Or two hundred million.
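To put some numbers on that, here's a minimal sketch using the textbook sample-size formula for estimating a proportion, with a finite population correction. The 3 percentage point margin of error, the 95% confidence level (z = 1.96) and the function name `required_sample_size` are my assumptions for illustration; pick your own margin and the numbers will shift, but the pattern won't.

```python
import math

def required_sample_size(population, margin=0.03, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion to within +/- margin.

    Assumes simple random sampling. p=0.5 is the worst case (widest spread).
    """
    # Size needed if the population were infinitely large
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction: shrinks n0 only when the population is small
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

for population in (20_000, 5_000_000, 50_000_000, 100_000_000, 200_000_000):
    print(f"{population:>11,} in population -> sample of {required_sample_size(population):,}")
```

Under those assumptions the answer is a little over a thousand, and it comes out the same whether the population is fifty, one hundred or two hundred million. Only down at twenty thousand does it dip slightly.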

Which probably sounds like statistical shenanigans again.

Think about it like this. I've got lots of ping-pong balls in a really big box and I tell you that some are red and some are white and they've all been thoroughly mixed. You can draw balls from the box one at a time until you're happy to tell me what proportion of each colour you think is in the box. How many ping-pong balls do you want to draw?

Seriously, pause and have a think: how many do you want to draw? It's a really big box and you'll be counting ping-pong balls for a week if you check them all.

Let's start with ten. You draw ten balls and get four red and six white.

Is the overall proportion in the box 60/40 in favour of white? It might be, but you're not really sure. Ten isn't very many to check.

You pull another ten and this time you get five more of each colour. Now you've got eleven white and nine red. Happy to tell me what's in the box yet? No?

Let's keep drawing all the way up to 100 ping-pong balls.

Now you've got 47 whites and 53 reds. The proportion seems like it's close to 50/50, but is it exactly 50/50 in the rest of the box?
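If you'd like to see how unsure you should be at each of those three stages, here's a rough sketch that puts a 95% interval around the share of red balls after 10, 20 and 100 draws. I've used the Wilson score interval because it behaves sensibly with small samples; that choice, and the function name `wilson_interval`, are mine rather than anything special to this example.

```python
import math

def wilson_interval(successes, draws, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / draws
    denom = 1 + z ** 2 / draws
    centre = (p + z ** 2 / (2 * draws)) / denom
    half_width = z * math.sqrt(p * (1 - p) / draws + z ** 2 / (4 * draws ** 2)) / denom
    return centre - half_width, centre + half_width

# Red balls seen so far at each stage of the story: 4 of 10, 9 of 20, 53 of 100
for reds, draws in ((4, 10), (9, 20), (53, 100)):
    low, high = wilson_interval(reds, draws)
    print(f"{reds:>2} red out of {draws:>3}: somewhere between {low:.0%} and {high:.0%} red")
```

The interval narrows with every stage, but even at 100 draws it still comfortably includes 50/50 and quite a lot either side of it.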

Every time you draw more ping-pong balls, you get a bit more sure of your result. But have you noticed that we haven't mentioned once how many balls are in the box in total; only that it was a big box? It's because it doesn't matter.

As long as the population is "big" and we draw balls at random, it doesn't matter how big it is.
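You don't have to take my word for it. Here's a small simulation sketch that draws 1,000 balls at random from a box of one hundred thousand and from a box of ten million, over and over, and measures how much the estimated share of reds bounces around. The box sizes, the 50/50 split, the number of repeats and the function name `spread_of_estimates` are all assumptions picked for illustration.

```python
import random
import statistics

def spread_of_estimates(box_size, draws=1000, trials=200, red_share=0.5, seed=1):
    """Standard deviation of the estimated red share across repeated samples."""
    rng = random.Random(seed)
    red_count = int(box_size * red_share)  # balls numbered below this are red
    estimates = []
    for _ in range(trials):
        # Draw without replacement from a thoroughly mixed box
        sample = rng.sample(range(box_size), draws)
        reds = sum(1 for ball in sample if ball < red_count)
        estimates.append(reds / draws)
    return statistics.pstdev(estimates)

for box_size in (100_000, 10_000_000):
    print(f"{box_size:>12,} balls in the box: spread {spread_of_estimates(box_size):.4f}")
```

The two spreads should come out close to each other, at roughly one and a half percentage points, even though one box holds a hundred times more balls than the other.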

Here's how your confidence in the result changes as you draw more ping-pong balls from the box:


The bigger your sample, the better your accuracy, but beyond a certain size - say 5,000 - your result is highly accurate and having an even bigger sample doesn't make very much difference.
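For anyone who wants the numbers behind that, here's a short sketch using the standard margin-of-error formula for a proportion from a simple random sample. The 95% confidence level and the worst-case 50/50 split are my assumptions, as is the function name `margin_of_error`.

```python
import math

def margin_of_error(sample_size, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a random sample."""
    return z * math.sqrt(p * (1 - p) / sample_size)

for n in (10, 100, 1_000, 5_000, 10_000, 100_000):
    print(f"{n:>7,} draws: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

Going from 10 draws to 5,000 takes you from about 31 points either way to about 1.4. Going from 5,000 to 100,000 only buys you roughly one more point, for twenty times the work.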

"But!", say the objectors, "Online, data is basically free and we can use the whole dataset, so we should!"

And that's true, up to a point. Data storage is so cheap it's close to free, but data processing isn't. A large part of the cost is in your own time - you can wait ten minutes for a results dashboard to refresh, or you can sample the data, wait thirty seconds and get the same answer. It's your choice, but personally I like faster.

Outside the digital world, storage is still cheap, but data collection can get really expensive.

The TV industry in the UK is constantly beaten with a stick based on the fact that TV audience figures are estimated using a sample of 'only' 5,100 homes. It costs a lot to put tracking boxes into homes and this number has been arrived at very carefully, by very well trained statisticians. It's just enough to measure TV audiences with high accuracy, without wasting money.

In fairness, the BARB TV audience panel is challenged by a proliferation of tiny satellite TV channels - because sometimes nobody at all out of those 5,100 homes is watching them - and by Sky AdSmart, which delivers different adverts to individual homes. It may need to adapt using new technology and grow to cope, but nobody is seriously suggesting tracking what everybody in the UK watches on TV, at all times, on all devices. That would be ridiculous.

I'll be blunt. Any online data specialist who uses the 5,100 home sample to beat 'old fashioned' TV viewing figures doesn't know what they're talking about.

Sampling is an incredibly useful tool and sometimes more isn't better, it's just more. More time to wait, more computer processing power, more cost and more difficulty getting to the same answer.
