Thursday 22 May 2014

The insular world of marketing

[Image: The Sun / YouGov election poll]
It's election day! And it's an election day that I'm personally fascinated by, in terms of whether the pre-election polls are anywhere near accurate.

Take a look at the image above. The Sun and YouGov are predicting a narrow UKIP win.

Do you know anybody who's said they're voting UKIP? I don't. Maybe you've got a batty aunt, or a slightly racist grandparent who makes you cringe now and again in public, but do over a quarter of people you know intend to vote UKIP?

Probably not.

This effect caused me to lose a tenner, betting on the London Mayoral election that saw Boris Johnson beat Ken Livingstone. The bookies had Boris as the nailed-on favourite, but I only knew one person who planned to vote for him. Nobody I knew could name many people who planned to vote for Boris either.

Of course you often surround yourself with like-minded friends, but work colleagues and acquaintances were vehemently anti-Boris, and surely your work colleagues are a decent, random(ish) sample of different opinions?

It turned out not, and I lost my tenner.

If you're here, reading this, then you're likely a thoughtful, analytically minded person with either a marketing or football analysis interest. Probably, you're not planning to vote UKIP and you don't know many - or even any - people who are.

Does this matter? In marketing, I think it does. We're trying to sell products to the population of the UK in general and to do that, we need to understand what motivates people in general, not just people like ourselves.

Walk into any big marketing agency in London and the people you'll meet will predominantly be:

  • Under 35. Many will be under 25.
  • University educated.
  • White.
  • Renting their home.
  • Unmarried.
  • No kids.
  • Travelling daily on public transport. Mainly on the tube, which obviously only exists in London.
That's a very narrow selection. Even the simple fact that all of these people live in London makes their day-to-day life quite unlike that of 85% of the UK population.

I work for MediaCom North - based in Leeds - and so some of the regional biases are removed in our office, but I bet I still couldn't find a UKIP voter here. I'd be staggered if over a quarter of the voters in the office supported UKIP.

As marketing people, we need to be acutely aware of our own inherent biases so that we can avoid them. Look at the adverts running on TV on any night of the week and ask yourself how many are designed to appeal to an under-thirty audience. Then ask yourself, honestly, if most of the people buying that product are likely to be under thirty. Cars? Nope. Supermarket shoppers? Nope. Holidays? Nope.

For me, agencies need to be doing much more immersion in the lives of people who don't think like themselves (and I mean real immersion; I love stats as much as the next guy, but they're a starting point, not the whole solution). A once-a-year factory visit or focus group just doesn't cut it.

We should also be hiring and retaining a more diverse mix of people, particularly people over thirty-five. If the problem is that those people leave London when they hit their mid-thirties, then maybe we need some more innovative solutions to tap into their opinions and experience.

Finally, as a client, I'd be looking seriously at non-London agencies to get some wider perspective. A global car manufacturer would naturally look to the scale of the big London agencies - and maybe they should - but they need to be aware that the people working on their account almost certainly don't own a car, don't have the money to buy one, and wouldn't have anywhere to park one if they did. That's why virtually all car ads are either full of young people or a very crude caricature of older people.

Could your agency advertise UKIP and really understand what motivates all of those people who plan to vote for them? Or would you end up with a stereotyped portrait, produced by a youthful, liberal-leaning, well educated planner?

Of course, the question of whether you should take that brief is a whole other issue.

Monday 19 May 2014

Bigger data isn't necessarily better

Sometimes it's hard being a statistician. Sometimes a long-established statistical concept jars with your audience, and no matter how hard you try to explain it in plain terms, you can see in the audience's eyes that they don't really believe you. Those suspicious eyes staring back at you are fairly sure you're pulling some shenanigans to get out of working harder, or to wring an answer from the data that isn't really there. What you're saying just feels wrong.

Explaining sampling can be like that, particularly when you're dealing with online data that comes in huge volumes and fighting against a tidal wave of 'Big Data' PR.

The audience's thinking goes...

More data is just better, because more of a good thing is always better.

More data must be more accurate, more robust.

More impressive.

Then a statistician says, "We only need 10% of your file to get you all the answers that you need".

And rather than sounding like an efficient, cost effective analysis, it feels disappointing.


"You only need a spoonful of soup to know what the whole bowl tastes like"


A common question from non-statisticians is to ask, "Overall, I have five million advert views [or search advert clicks, or people living in the North East of England, or whatever], so how big does my sample size need to be?"

Which sounds like a sensible question, but it's wrong.

Statisticians call that overall views number the "Universe" or "Population". It's the group from which you're going to draw your sample.

Once your population is bigger than about twenty thousand, it makes no difference at all to the size of the sample that you need. If you say that you've got one hundred million online advert views, and ask how big your sample needs to be, then the answer is exactly the same as if you had fifty million views. Or two hundred million.

Which probably sounds like statistical shenanigans again.
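It isn't, though. The textbook margin-of-error formula for a proportion mentions the population size only through the finite population correction, a factor that drifts towards 1 as the population grows. A minimal sketch (my own illustrative numbers: a sample of 1,000 and a 50/50 split):

```python
import math

def margin_of_error(n, N, p=0.5, z=1.96):
    """95% margin of error for a proportion, with the finite
    population correction -- the only place N appears at all."""
    fpc = math.sqrt((N - n) / (N - 1))  # tends to 1 as N grows
    return z * math.sqrt(p * (1 - p) / n) * fpc

for N in (50_000_000, 100_000_000, 200_000_000):
    print(f"population {N:>12,}: +/- {margin_of_error(1_000, N):.2%}")
```

Fifty million views or two hundred million: the answer is roughly ±3.1% either way.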

Think about it like this. I've got lots of ping-pong balls in a really big box, and I tell you that some are red and some are white and they've all been thoroughly mixed. You can draw balls from the box one at a time until you're happy to tell me what proportion of each colour you think is in the box. How many ping-pong balls do you want to draw?

Seriously, pause and have a think: how many do you want to draw? It's a really big box, and you'll be counting ping-pong balls for a week if you check them all.

Let's start with ten. You draw ten balls and get four red and six white.

Is the overall proportion in the box 60/40 in favour of white? It might be, but you're not really sure. Ten isn't very many to check.

You pull another ten and this time you get five more of each colour. Now you've got eleven white and nine red. Happy to tell me what's in the box yet? No?

Let's keep drawing all the way up to 100 ping-pong balls.

Now you've got 47 whites and 53 reds. The proportion seems like it's close to 50/50, but is it exactly 50/50 in the rest of the box?

Every time you draw more ping-pong balls, you get a bit more sure of your result. But have you noticed that we haven't mentioned once how many balls are in the box in total, only that it's a big box? That's because it doesn't matter.

As long as the population is "big" and we draw balls at random, it doesn't matter how big it is.
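The ball-drawing experiment is easy to simulate. The sketch below uses my own illustrative numbers (a true red share of 30%, boxes of three very different sizes) and draws the same 1,000-ball sample from each:

```python
import random

def estimate_red_share(box_size, sample_size, true_share=0.3, seed=1):
    """Fill a well-mixed box of red and white balls, draw a random
    sample, and estimate the proportion of reds from the sample."""
    rng = random.Random(seed)
    n_red = int(box_size * true_share)
    box = ["red"] * n_red + ["white"] * (box_size - n_red)
    drawn = rng.sample(box, sample_size)
    return drawn.count("red") / sample_size

for box_size in (20_000, 200_000, 2_000_000):
    share = estimate_red_share(box_size, 1_000)
    print(f"box of {box_size:>9,} balls: estimated {share:.1%} red")
```

Each estimate lands within a couple of points of the true 30%, and the size of the box never enters into the accuracy - only the size of the sample does.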

Here's how your confidence in the result changes as you draw more ping-pong balls from the box:
[Chart: accuracy improving as sample size grows]
The bigger your sample, the better your accuracy, but beyond a certain size - say 5,000 - your result is highly accurate and having an even bigger sample doesn't make very much difference.
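The shape of that curve is easy to reproduce. For a result near 50/50, the 95% margin of error is roughly 1.96 × √(0.25/n); a quick table (sample sizes of my own choosing) shows how fast the returns diminish:

```python
import math

# 95% margin of error for a proportion near 50/50 -- note how
# little the accuracy improves past a few thousand draws.
for n in (10, 100, 1_000, 5_000, 20_000, 100_000):
    moe = 1.96 * math.sqrt(0.25 / n)
    print(f"sample of {n:>7,}: +/- {moe:.1%}")
```

At 5,000 the margin is already about ±1.4%; twenty times the sample only gets you to around ±0.3%.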

"But!", say the objectors, "Online, data is basically free and we can use the whole dataset, so we should!"

And that's true, up to a point. Data storage is so cheap it's close to free, but data processing isn't. A large part of the cost is in your own time - you can wait ten minutes for a results dashboard to refresh, or you can sample the data, wait thirty seconds and get the same answer. It's your choice, but personally I like faster.
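A toy illustration of that trade-off, using invented numbers (a million simulated page-load times standing in for a big log file):

```python
import random
import statistics

rng = random.Random(42)
# One million fake page-load times, averaging around 300 ms.
page_loads = [rng.expovariate(1 / 300) for _ in range(1_000_000)]

full_mean = statistics.fmean(page_loads)                        # scans every row
sample_mean = statistics.fmean(rng.sample(page_loads, 10_000))  # scans 1%

print(f"full data: {full_mean:.0f} ms, 1% sample: {sample_mean:.0f} ms")
```

The sampled answer differs by a fraction of a percent, but touches a hundredth of the data - that's the ten-minute dashboard refresh versus the thirty-second one.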

Outside the digital world, storage is still cheap, but data collection can get really expensive.

The TV industry in the UK is constantly beaten with a stick because its audience figures are estimated from a sample of 'only' 5,100 homes. It costs a lot to put tracking boxes into homes, and this number has been arrived at very carefully, by very well trained statisticians. It's just enough to measure TV audiences with high accuracy, without wasting money.

In fairness, the BARB TV audience panel is challenged by a proliferation of tiny satellite TV channels - because sometimes nobody at all out of those 5,100 homes is watching them - and by Sky AdSmart, which delivers different adverts to individual homes. It may need to adapt using new technology and grow to cope, but nobody is seriously suggesting tracking what everybody in the UK watches on TV, at all times, on all devices. That would be ridiculous.

I'll be blunt: any online data specialist who uses the 5,100-home sample to beat 'old fashioned' TV viewing figures doesn't know what they're talking about.

Sampling is an incredibly useful tool and sometimes more isn't better, it's just more. More time to wait, more computer processing power, more cost and more difficulty getting to the same answer.
