Thursday, 28 February 2013

Just stop it. There's no such thing as "Data Science".

I haven't had a good rant for a while. Rants are what blogs are for. Here comes a rant.

The term "Data Scientist" is getting well out of hand. I'm seeing articles all over the place with titles like "What is a Data Scientist?" and "Do you need one?".


Data Scientists are what happens when marketing people and journalists spot a trend that's been going on for ages, but decide to act like they've just discovered fire. I should probably decide to call myself one and then double what I used to charge for work, before I was a Data Scientist.

I'm not doing that though, I'm ranting on a blog.

"Data Scientist" is a tautology.

You know what they call people who use data to do science?

Scientists.



Monday, 25 February 2013

Curse Blogger post scheduler. Predictions for erm... last Saturday

This set of predictions was meant to go live on Saturday morning, but a glitch in Blogger's post scheduler meant it didn't happen, sorry about that. I had a very kind trail from @OptaPro on Twitter this week too, so it was an even bigger disappointment. On the plus side, I couldn't post manually because I'd gone paragliding and I love paragliding, it's even better than statistics and football matches.

Back to football: the model's new and improved, so I thought it would be worth reposting these predictions and explaining what I've been up to.

If you'd like a bit of history, try my past posts. I'm using an agent-based model of football matches to try to predict results and as usual, predicted starting line-ups for the teams are from Fantasy Football Scout. At some point, I'll build an engine to scrape the actual announced line-ups half an hour before kick off and re-run the model automatically, but one step at a time...

The big improvement I've been working on, which has turned out to make a small overall improvement in prediction accuracy, is to allow players to have a good or bad game. Previously, each player always performed at their average level - so for example, if their passing accuracy averages 80% they'll always pass at 80% accuracy - but now I use the standard deviation of each player's passing accuracy and sample from a normal distribution to decide how a player will perform. What this means is (to pick a couple of random examples) that a very consistent player like Paul Scholes will always pass well in the model, while a player like Darren Bent will have passing accuracy that's all over the place, with some very good games and some very bad ones.
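A minimal sketch of that sampling step in Python. The player names are just for flavour and the means and standard deviations are invented, not the model's real figures:

```python
import random

def sample_pass_accuracy(mean, std_dev):
    """Draw a per-game passing accuracy from a normal distribution,
    clipped to the valid 0-1 range."""
    return min(1.0, max(0.0, random.gauss(mean, std_dev)))

# A consistent passer barely moves from his average...
scholes = [sample_pass_accuracy(0.90, 0.02) for _ in range(5)]

# ...while an inconsistent one swings between very good and very bad games.
bent = [sample_pass_accuracy(0.70, 0.12) for _ in range(5)]
```

The clipping matters: a normal distribution has tails that run past 0 and 1, and a passing accuracy of 104% would do strange things to a simulation.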

This "form" feature is random for the minute, although I have spotted some interesting relationships in the data and I think, to an extent, it's predictable when players will have a bad game. Check out this tweet for an example. I've promised EPL Index (where all the data comes from) an article on this though, so it will have to wait for a minute.

Onto the predictions! They were predictions, honest.


And how did we do this week?

I actually had a small bet on these and am up for the weekend already, with the Spurs game still to play, so it didn't go too badly. From here, I picked:

Fulham to win (won)
Newcastle to win (won)
Wigan to win (won)
Sunderland to win (lost)
Norwich and Everton to draw (lost)
Spurs to win (playing tonight)

Of the remaining games, I don't trust Arsenal in the model at the moment. They pass well and it's largely a passing-based sim, so it seems to overestimate their chances, although it called this result correctly (just about). Who trusts Arsenal to reliably get a result anyway? The model called Man City and Man United's results correctly, but the odds were rubbish so I left those two.

If Norwich hadn't been allowed to take that last corner, I'd have had an even better weekend! This model's not doing so badly, if I do say so myself. Definitely worth persevering with.

I promise faithfully, on my honour, to have predictions up before the next set of matches this weekend.

Saturday, 2 February 2013

Football Sim: Predictions for 2/3 Feb 13

This is probably going to be the last set of predictions before I put some proper time into improving the model. We know that on current performance, it's going to slowly lose money if you bet on it and that's not tremendously exciting. Improvements from here are much harder than building the simulator in the first place, but I've got a few promising ideas to follow up.

Populating the fixtures with expected starting line-ups is also a complete pain in the neck and takes far too long. I'm going to have to sort that out, because sometimes my Friday evenings are based around beer rather than football match modelling.

Having said that, putting this set of forecasts together has thrown up a few interesting effects and led to me tweaking the algorithm a little already.

Here's what we've got. Starting line-ups from Fantasy Football Scout.



A few of those percentages stick out as disagreeing with the bookies' odds this morning. Taking those ones in order...

Everton vs. Aston Villa

Everton are predicted to win, sure, but the bookies give Villa almost no chance and my model thinks they could win it. Why does it think that?

The big reason (that we'll see again for the Man City game) is that the model doesn't really understand defending yet. It will penalise teams that have only average ball retention but which are above average at defending. Conversely for Villa, it doesn't know that their back line has shipped 46 goals so far this season. The model also currently sees a player like Fellaini as a striker with decent shooting accuracy and below average passing - it doesn't understand the physicality of his game.

It's far from perfect! I did say I was doing my development work in public. Anyway, on to...

Manchester City vs. Liverpool

I'm sure this is the defending factor again. Could happen though and maybe this prediction will make some Liverpool fans happy.

Reading vs. Sunderland

I like this one, it's interesting! I've got Reading at 10% (decimal odds 10.0). The bookies' odds say they're going to win the game. What's that all about then?

Well first of all, the model's using player stats over the season so far, not just the past few games. Up until Christmas, Reading really weren't good, which drags their performance down.

The big question in this game though is what's going to happen with Adam Le Fondre? The sim doesn't do substitutes yet and he's not in Fantasy Football Scout's predicted starting line-up. We can't do super subs.

Without Le Fondre starting in the sim, Reading will struggle badly to score.

We've played the game 1000 times without him. Let's stick Le Fondre in for Guthrie, play it another 1000 times and see what happens. We'll be giving Le Fondre his super-sub stats over the whole game.


That's quite a difference! Sunderland still win it, mind.

Now let's hope the favourites don't let us down this time and we can do a little better than last Tuesday evening.

Tuesday, 29 January 2013

Football Sim: Predictions for 29-Jan-13

OK, here they are! The first time I've let the football simulator loose in public, on games that haven't actually been played yet. Sounds like a recipe for disaster to me, but let's do it anyway.

If you've got no idea what I'm talking about, read this first.



A few things to be aware of... (in other words, here are my get out clauses, terms and conditions, caveats, call them what you like...)

I got the teams' predicted starting line-ups here.

Your guess is as good as mine (or Fantasy Football Scout's) as to what Newcastle's starting line-up is going to be.

I've made the new NUFC Frenchmen into completely average players for the purposes of the sim. They might be better than that and they might be worse!

As per my big "season so far" post yesterday, the model calls winners correctly around 50% of the time on average. Follow my tips at your own risk...

It gets exact scores right around 10% of the time.

Both of those mean that for betting, it seems to just about break even (but not quite). I'm working on it.

That Sunderland prediction looks a little too heavily skewed to a home win for my liking.

But I'm off to put a bet on it anyway.

Monday, 28 January 2013

Football prediction: Simulating the season so far

I've been working for a while on an agent-based simulation of football matches, to see how close it can get to predicting the real results of Premier League games.

Last week, I explained roughly what the model is and how it works.

This week: is the model any good? I'll be honest, I may have cherry-picked that Chelsea result a little for the last post, but this will be a warts-and-all picture of how well the model predicts. Or rather, how it would have predicted the season so far (with one important caveat that we'll come to in a while).

One last thing before we get stuck in... The most obvious use of a model like this (if it works) is to gamble based on its predictions, but I'm building it more for the technical challenge of seeing if I can do it. I'm really interested in using a model like this as a scenario planning tool - what would happen to your season if you signed player x? Or if player y got injured? If the model can be made to work, you could run 'what ifs' and work out the value of players in terms of expected points added to the team's total across a season.

Back to betting, I might have a punt, but I'm not really a gambler. Bet on it if you like (sensibly!) as I start to predict games on Wallpapering Fog and don't forget to add a comment to let us know how you got on. I'll talk about odds a fair bit below, because they're an obvious source of another prediction to compare with the model. Having said that, if you'd bet on every game so far this season using the model's predictions - up to 20th January 2013 - you'd have just about broken even. Improvements to the model from here would make it profitable. Got your attention? Here we go.

What I've been doing this weekend is building some code to run a whole series of games in succession, not just one at a time. Then I fed in the fixtures, starting line-ups and player statistics for each game this season, up to the 20th January, using data from EPL Index. We simulate each individual game 100 times and get an overall predicted likelihood of home win, away win or draw, plus the most likely scoreline.
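In outline, that batch step looks something like the sketch below. `simulate_match` here is a dummy stand-in that just invents a scoreline - the real agent-based game goes in its place - but the aggregation into win/draw probabilities and a most likely score is the actual idea:

```python
import random
from collections import Counter

def simulate_match(home_lineup, away_lineup):
    """Stand-in for the full agent-based simulation: returns one
    simulated scoreline as (home_goals, away_goals)."""
    return random.randint(0, 3), random.randint(0, 3)

def predict(home_lineup, away_lineup, runs=100):
    """Run one fixture many times and aggregate the results."""
    outcomes = Counter()
    scores = Counter()
    for _ in range(runs):
        h, a = simulate_match(home_lineup, away_lineup)
        outcomes["home" if h > a else "away" if a > h else "draw"] += 1
        scores[(h, a)] += 1
    probs = {k: v / runs for k, v in outcomes.items()}
    most_likely_score = scores.most_common(1)[0][0]
    return probs, most_likely_score

probs, best_score = predict("Home XI", "Away XI")
```

Running the whole season is then just a loop over the fixture list, calling `predict` once per game.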

Remember that caveat I mentioned? Here it is. I'm simulating each player, using their average performance across the whole season so far, which isn't strictly fair. When Fulham played Norwich on the first day of the season, I wouldn't actually have had any 2012/2013 data to feed into the model at the time - only the previous season's numbers. It's something else on the long list of development tasks that need dealing with...

Here are the predictions anyway. Correct calls in green.
Google's determined to open the image below in its G+ gallery, which isn't readable. Here for bigger.



Overall, the model calls 50% of results correctly, on the criterion that the team it gave the most chance of winning ran out as winners in real life.

I was initially a bit disappointed with that. Only 50%? I was hoping for more.

Then I had a look to see how often the bookies get it right. No doubt this will be incredibly obvious to some readers, but as I said I'm not a gambler. How often did the bookies' favourite win those games? 51%. (odds from football-data.co.uk)

Suddenly 50% doesn't seem all that bad!

A big part of the error comes from draws, both in my model and in the bookies' odds. A draw is almost never the most likely result of a single game, but overall, around 30% of games will end in a draw. My model only called one game as having a draw as the overall most likely outcome - Aston Villa vs. Stoke.

When you simulate game-by-game, you'll predict almost zero draws, which means you'll be wrong 30% of the time before you even start. Predicting a season where you only simulate each game once would give a 'normal' number of draws, but each individual game's prediction would be much less accurate. It's swings and roundabouts, depending on whether we're trying to predict final league placings or the result of a single game.

If you'd bet on the model's prediction for every game, using the same stake, at Bet365's odds, you'd have lost 3% of your money so far. If you'd taken the best odds available in the market each time, not just Bet365's, you'd actually be up 1%. That's not a disaster for a first effort! At least we're not ruined.
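The level-stakes bookkeeping behind those numbers is simple enough to sketch. The three bets below are made up for illustration, not real results from the season:

```python
def flat_stake_return(bets, stake=1.0):
    """Return on a series of level-stake bets at decimal odds,
    as a fraction of the total amount staked.
    Each bet is (decimal_odds, won)."""
    staked = stake * len(bets)
    returned = sum(stake * odds for odds, won in bets if won)
    return (returned - staked) / staked

# Three illustrative bets: two winners, one loser.
bets = [(1.83, True), (2.10, True), (3.50, False)]
roi = flat_stake_return(bets)  # (1.83 + 2.10 - 3.00) / 3.00 = 0.31, i.e. up 31%
```

A decimal-odds winner returns stake times odds (the stake comes back inside that figure), which is why the staked amount is subtracted once at the end.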

Let's see what that accuracy looks like. I tweeted this one over the weekend; it shows ten game rolling average prediction accuracy for the model and also for the bookies.


When the shaded area between the lines is red, the bookies are predicting more results correctly than the model, over the previous ten games. When it's green, the model is out-performing the bookies.

It's interesting that a couple of weeks into the season, accuracy for the model and for the bookies plummets to just 10-20%. It may be that the early season is harder to predict - we'll need to run a few more seasons to find out. That period certainly screws up any hopes of winning a fortune as the bookmakers do slightly better, even though both are doing badly.

Here's the same data, cumulatively. The chart shows total accuracy across the season so far, with 20th January on the far right hand side.


You can see that the bookies' favourites have won more often than the model's across the whole season so far, ending with just the 1% gap that I described earlier, 51% to 50%.

What's very interesting to me is that the model looks like it's improving slowly across the season and closing the gap; the games may be becoming more predictable as the season goes on. I'm not jumping on that conclusion just yet, but I'm certainly going to keep an eye on it.

If you'd bet using the model, only since the New Year at best market odds, you'd currently be up 19% on your original stake.

Let's finish with the next big improvement for the model - at least, I hope it will bring a significant improvement in predictive power. I'm seeing these developments as positives, even though there's a fair bit of work involved in building them, because our simple view of the world is already doing pretty reasonably and there's huge scope for improvement from here.

At the moment, once a team has the ball in the sim, the opposition can't win it back, the team in possession can only lose it. This is fine in a game against an 'average' team, but taking Arsenal as an example, their passing accuracy was 90% against Sunderland and 81% against Manchester City. I've got no doubt that Man City caused that to happen by harrying their opponents, so the model needs to account for it.

Even more important, it's not every player's passing accuracy that drops against higher quality opponents - some will cope much better than others. We need a way to predict what each individual player's passing accuracy will be, against this week's opponents. I'm working on it.

Stay tuned for predictions for Tuesday's games!

Monday, 21 January 2013

Simulating football matches: An experiment.

Prediction. The holy grail of analysis. Diagnostics are good and they help us to understand the past, but if we can't use that work to get better at predicting the future, then isn't it all a bit pointless? If your theory doesn't predict the future, then it isn't science. Unlike football punditry, you don't get meteorologists diagnosing that last weekend's weather was rubbish because "the Sun must have had an off day". Scientists test their theories through prediction.

In my day job, we predict advertising. You've run adverts for your business before, so we measure the effect of those and then we can tell you how much you'll sell in the future.

Advertising's mostly quite dull. Can we do it for football?

Since Moneyball, a lot of people have been paying close attention to football statistics, writing up analyses and discovering relationships in OPTA's football data. Man City are even trying to tap into those amateur insights through the MCFC Analytics Project.

I couldn't resist diving in, so armed with a subscription to EPL Index (four quid a month for player stats for each individual game? Yes, please) I've been running a little side project and playing with the data.

If you want to predict football, there are a few ways you could go about it. I did have a crack at this project years ago, but with only very top-line data on past game results. You can build a regression model (a bit like the advertising models I build for a living) that uses each team's form to predict the outcome of the next game.

It works something like this...

Predicted result = f(Home Team form, Away Team form, plus lots of other things like goal-scoring and conceding rates...)

All weighted for the quality of opposition over the past few games.
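A toy version of that top-line approach might look like this. Every weighting, rating and threshold here is purely illustrative - a real model would fit them from data rather than hard-code them:

```python
def form(points, opponent_ratings):
    """Points per recent game, weighted by opponent quality,
    so beating a strong side counts for more."""
    return sum(p * q for p, q in zip(points, opponent_ratings)) / len(points)

def predict_result(home_form_data, away_form_data, home_advantage=0.3):
    """Each *_form_data is (recent points, opponent quality ratings)."""
    h = form(*home_form_data) + home_advantage
    a = form(*away_form_data)
    if abs(h - a) < 0.2:  # too close to call, so predict a draw
        return "draw"
    return "home" if h > a else "away"
```

You can already see the limitation the next few paragraphs describe: everything about the team is squashed into one or two form numbers, so there's nowhere to ask "what if this particular player doesn't play?"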

This type of model kind of works. Basically, it will predict things like Man U should beat Wigan, but we already know that. The model I built a few years ago didn't do any better than my own guesses and that's not tremendously useful.

Top-line models like this also have a massive issue in that there are simply too many variables that need to be accounted for. What are the chances that Man U beat Wigan if Van Persie's injured? Our model based purely on past form will struggle with that, especially if he's played all season and we've got no experience with him not present (and just as important, a different player playing) in the team.

You also get very little in the way of explanation with top-line models like this. Man U will win because they usually win and Wigan will lose because (sorry Wigan fans) they usually lose. What can Wigan do about that? Well obviously they need to improve their form. Thanks, Mr. Consultant, you're fired.

Long story short, we need a different technique and the one I've been using is called Agent Based Modelling (ABM).

ABM simulates the world from the bottom up rather than the top down, which in football means simulating the players rather than the result. We set up an artificial game - using real world OPTA statistics about the performance of individual players - and we run the game to find a predicted result. The result is an outcome of the simulation that we can't control directly.

If you're thinking, "Is he trying to build Football Manager using OPTA data?", that's basically the size of it, yes. Told you it was more fun than advertising.

Inside the model, you kick off the game and from then on, it's all down to the simulation. The player with the ball will make a decision, based on what they do most often in real games - pass, shoot, dribble... each decision is randomly generated, but weighted towards the probability of what that player does most often in real life, based on the OPTA data.

If they choose to pass, the simulation checks for a successful pass and then works out who the ball went to, again a randomly generated choice but weighted by real data. It's the same if they shoot, when we work out the chances that their shot went in. If they lose the ball, it transfers to a player on the opposition, again determined by a weighting of... you get the idea. Then the whole thing starts again with the player who has the ball now.

We play the game through, with players passing, shooting, losing the ball etc. and we get a result, which is our prediction for the match.
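The decision step above boils down to a weighted random choice. A toy version in Python - the player names and action frequencies are invented, not real OPTA numbers:

```python
import random
from collections import Counter

# Per-player on-ball action frequencies, as they'd be derived from
# real match data (these particular numbers are made up).
PLAYERS = {
    "Mata":   {"pass": 0.80, "dribble": 0.12, "shoot": 0.08},
    "Torres": {"pass": 0.65, "dribble": 0.15, "shoot": 0.20},
}

def choose_action(player):
    """Pick the next on-ball action, randomly but weighted towards
    what this player does most often in real games."""
    freq = PLAYERS[player]
    actions = list(freq)
    return random.choices(actions, weights=[freq[a] for a in actions])[0]

# Over many touches, the mix of actions converges on the input frequencies.
touches = Counter(choose_action("Mata") for _ in range(1000))
```

The same weighted-choice machinery handles pass completion, pass targets, shot conversion and turnovers - it's the weights that come from the data each time.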

Now, you might say, "but there were loads of randomly generated decisions in the model. If we ran it twice we might get different results", and you'd be absolutely right. It's just like the real world and if the same teams play each other a few times, you can get a different result every time.

What we're after is the probability of winning for each team, so we run the match 1000 times (for now - it's a nice round number) and count up how many times each team wins.

After some teething problems (it wouldn't be interesting if it was too easy) the model's starting to turn up sensible results and I promise I'll share its predictions for the next set of Premier League games (29th and 30th Jan). I'm not standing by those predictions yet, but it will keep me honest and motivated to do the development in public - and to an extent it already has, after I tweeted on Saturday that Norwich might have more chance against Liverpool than the bookies thought and then watched them get battered 5-0. I now know why the model did that and it doesn't do it any more!

The model is a huge oversimplification of a real game but over time, it should help to teach us about what's really important. As a quick example, the model currently doesn't treat crosses any differently from other passes - they're a complete or incomplete pass and that's it. If they're a complete pass, then the player who receives the ball might shoot. But if we keep seeing that teams with traditional wingers win more games than the model would predict, then that might need sorting out.

I'll end for today with a bit about the Chelsea vs. Arsenal game on Sunday, to illustrate what you get from the model and the sorts of things that we might be able to do with it. Here are the teams (no subs yet by the way):


Run that one 1000 times and what happens?

Chelsea:    44%
Arsenal:    26%
Draw:       30%

So Chelsea are predicted to win 44% of the time. We get a scoreline from the model too and here are the 15 most likely, adding up to 94% of all results. An interesting outcome is that although we predict Chelsea to win overall, the single most likely result is 1-1.


The actual result was 2-1 to Chelsea, so the model got the winner right and 2-1 was our third most likely score. That looks potentially ok! We'll only find out if it really works by testing across a lot of games though.

Another way to see if the predictions are sensible is to compare with the bookies' odds. There's probably something wrong if we're not in the same ball-park as professionals taking advantage of the wisdom of their crowds of punters. The bookies had these odds (decimal odds, with implied percentage in brackets):

Chelsea:    1.83 (55%)
Arsenal:    3.5 (29%)
Draw:       3.25 (31%)

On an Arsenal win or a draw, we're almost bang on the market odds. On Chelsea we're below, but bear in mind that the model's odds have to add up to 100%. The bookmakers' odds add up to 114%, which is why bookmakers make money.
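Converting decimal odds to those implied percentages, and getting the 114% total (the bookmaker's "overround"), is just a reciprocal and a sum:

```python
def implied_probability(decimal_odds):
    """The win probability implied by a set of decimal odds."""
    return 1.0 / decimal_odds

book = {"Chelsea": 1.83, "Arsenal": 3.5, "Draw": 3.25}
implied = {outcome: implied_probability(odds) for outcome, odds in book.items()}

# Implied percentages: Chelsea 55%, Arsenal 29%, Draw 31%.
# Their total is the overround - the bookmaker's built-in margin.
overround = sum(implied.values())  # ≈ 1.14
```

A fair book would sum to exactly 100%; the extra 14 points is the margin the punters collectively pay.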

Lastly, just as an example of what else you can do with this type of model, there were some raised eyebrows that Torres started for Chelsea instead of Demba Ba. What if they'd been switched and Ba had started the game? Let's run it another 1000 times...

Chelsea's chance of winning goes up by two percentage points to 46%.

Arsenal's chances go down, right? Wrong. This is why agent based models are good - they can show us things that we don't expect.

With Demba Ba starting for Chelsea, Arsenal's chances of winning actually go up by one percentage point to 27% (to complete the picture, the chances of a draw drop by three points). The balance of how often Ba receives the ball and how often he gives it away compared to Torres makes all the difference. All in all, the predictions barely move, but it's an interesting outcome. We'll see more of these.

OK, that will do for now. Predictions next week and maybe, just maybe, a cheeky punt based on our forecasts.

Oh and if you're running your own project like this, I'd LOVE to hear about it! Stay tuned, this is just the start.

Friday, 11 January 2013

Are we branding? Or selling?

If you've worked in advertising for a while, you'll have come across the question of whether it's ok to compromise the 'creative vision' of a 'brand' ad, by sullying it with practical things like phone numbers and web addresses.

Being a data-type person and so not the sort to think that the majority of TV ads have a great deal of 'creative vision', I don't really understand this question, but it's a question that comes up a lot.

I wrote some time ago about adverts which don't have the product that they're advertising in them, and ads which don't tell you how to buy the product fall into pretty much the same category. The challenge of making an ad is to make something that holds people's attention, so that you can talk to them about a product. Holding people's attention for 30" without the discipline of showing the product is easy. It's a music video. You just show whatever you want and anyone can do that, even me.

So we need an ad with the product in it. A comment on the post I linked to above summed it up perfectly: "Can you describe the advertisement without mentioning the product?" Keep that one in mind as you watch an ad break tonight - it's scary how often the answer is, "Yes you can and by the way, what was the product?"


OK, Mr. Super-Brand so now you've communicated your product to me and I want to buy it. What do I do next? Oh, you've gone.

If it's not blazingly obvious what to do next then you need to tell me, because your ad wasn't that gripping and I'm very, very lazy. Give me a phone number, give me a web address, tell me where to buy your product. Do not make me work hard to buy you.

Apple can get away with 'pure' brand ads that have no direction afterwards (but that always, always, have the product in them), because everybody knows where to buy Apple's stuff. Most brands aren't Apple.

If a creative agency tells you that their advert won't work so well if it includes your phone number, or retail stores, or web address, or (God forbid) the product that you sell, then they're not doing their job properly. Their job is to include all of those things and still make the ad interesting enough that people will pay attention. That's hard. It's why creative agencies cost money.

Here's something that's caught my eye recently; Samsung have taken to sticking their logo in the corner of the screen for virtually the entire duration of a TV spot.



In a laptop ad, this makes perfect sense because let's face it, all laptops look the same. Without that logo, you're asking somebody to really pay close attention to the ad in order to realise whose laptop they're looking at.

Why don't many, many more advertisers do this?

TV channels do it. Their only purpose is to entertain and they still stick their logo in the corner of the screen!

But the advertisers who pay to show their products on those TV channels don't do it. Why on earth not? The only reason can be a feeling that it would make the ad look 'cheap'. Putting your logo on the screen isn't subtle. Maybe we're afraid to be caught actually doing marketing, so we pretend not to.

Me? In my 30" spot, I'd want my product front and centre, my brand logo next to it and a dirty great web address on the screen at the end. When you say that would compromise the 'quality' of the ad, are you saying my brand looks cheap? You might want to try another argument.