
Saturday, 30 March 2013

Model changes and predictions for this weekend

I've been making some changes to my football model over the international break and its prediction performance has got a little better again, although with at least one side-effect that we'll see in a second.

Apart from some robustness testing, the biggest change has been adjusting shot success depending on the opponent. The model handles the number of shots, and which players take them, reasonably well. It simulates the passing of individual players and if you haven't got the ball, you can't shoot, so in the simulation, shot rates naturally increase against worse teams and drop against better ones, without me having to impose that. When you look at simulated shot stats for a whole season, it does this job fairly well.

Where the previous version of the model had a problem is that not all shots are equal. Just for today, let's call this 'The Stoke Effect'.

I've had issues with Stoke in the model since the start. Their pass success rate isn't great and shot conversion rate isn't either, so in the simulation they were predicted to lose a lot more than they did in reality. There is something that Stoke do well though - they force the opposition into taking unsuccessful shots.

I can't impose shot rates on the model, because a big part of shot rates is already simulated through possession. What I can impose is an adjustment factor on the chances of a shot being a goal.
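As a minimal sketch of that mechanic (the function name and the numbers are mine for illustration, not the model's real values), the simulation's goal check just gets scaled by a per-opponent factor:

```python
import random

def shot_is_goal(base_conversion, opponent_factor):
    """Decide whether a simulated shot becomes a goal.

    base_conversion  - the shooter's historical shot-to-goal conversion rate
    opponent_factor  - per-team multiplier (below 1.0 for a side like Stoke
                       that forces opponents into unsuccessful shots)
    """
    return random.random() < min(1.0, base_conversion * opponent_factor)

# Illustrative numbers only: a 15% finisher against a defence that
# suppresses conversion to 80% of normal.
goal = shot_is_goal(0.15, 0.80)
```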

Here are the adjustment factors:

[Image missing: table of shot-success adjustment factors by team]

At the moment, I'm not analysing tactics to find out why those numbers look the way they do; I'm just playing with outcomes. But if you're not fussy about pretty football, Stoke are definitely doing something right. I'm also starting to think "Why don't Stoke get relegated?" might make a fun article for EPL Index, who provide all the stats for this model. Partly as a result of building this model, I'm becoming much more interested in how middling teams get results against the top four (six? Seven, so that Liverpool are included? I think those are the rules) than in Man City's most effective attacking combination.

In terms of prediction, Man City should win most of the time and for the purposes of this model, I'm not all that fussed about how many goals they do it by, as long as the result's correct. What's really interesting is whether we can forecast when they won't win. Adjusting shot success helps a lot here, because it points towards those days when a team has plenty of the ball, but just can't score.

Still here? You could definitely be forgiven for skipping ahead to this bit. Here are this weekend's predictions:

[Image missing: this weekend's predicted result percentages]

You can see the side effect I was talking about right at the start of this post, in the Newcastle win chance. Newcastle would have been predicted to lose anyway, but they also have below average stats for how many shots their opponents convert, so they get penalised very heavily now.

I have no doubt this percentage is a bit low, but it's happening because I use average performance to predict the result of single games. What the model's saying, I think, is that if Newcastle don't change their tactics against Man City from what they normally do (maybe 'park the bus'), they're going to get hammered.

The Swansea prediction also sticks out this week. Bit peculiar, but it would be no fun if we agreed with the bookies every time. Let's see what happens.

Lots more work still to do, but it's good to be making progress... Two big jobs on the list next:

1. As this blog points out, I really need to run the model over more seasons and see how it gets on.
It's a bit tricky because I use a stat on '% of passes in opponent's half' to model attacking pressure, and that stat's only been available on the EPL Index site since this season. Overall it doesn't add a huge amount to the predictions though, so I'm probably just going to turn that feature off and then run over the past 3-4 seasons.

2. The model knows nothing about form at the moment - players perform better or worse depending on their opponents, but their base stats are the same in every game. I'm eyeing this as potentially the next big improvement.
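If form does go in, one plausible approach (my assumption, not something the model does yet) is to decay older games, so recent performances count for more:

```python
def form_weighted_rate(game_rates, decay=0.8):
    """Blend a player's per-game success rates (oldest first) into one
    number, weighting recent games more heavily.

    decay - how quickly old games fade; 1.0 reproduces the season average.
    """
    weights = [decay ** age for age in range(len(game_rates) - 1, -1, -1)]
    return sum(w * r for w, r in zip(weights, game_rates)) / sum(weights)

# Illustrative: a passer whose completion rate has been climbing.
print(form_weighted_rate([0.78, 0.80, 0.85, 0.88]))  # ~0.84, pulled towards recent games
```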

Small bets placed and I'm off to enjoy the British countryside. With live score updates on my phone, obviously!

Friday, 15 March 2013

Football Sim: Predictions for 16/17 March 2013

Here we go again...

No big preamble this time; if you're reading this then you probably know where these forecasts come from by now (I like calling them forecasts; it sounds more scientific than predictions). If you don't, have a read of these posts.

Here are the percentage chances for this week's games:

[Image missing: predicted percentages for this week's fixtures]

A few obviously stick out. Everton and Swansea to win are interesting, but I think both plausible (though maybe not at those percentages). West Brom to beat Stoke, I'm not sure I trust, because the model doesn't really seem to understand how Stoke get as many points as they do. The reasons are probably fairly obvious and to do with Stoke's style of play, but I haven't tried to deal with that yet.

And my betting choices:

Swansea v Arsenal - Home Win
Villa v QPR - Draw
Chelsea v West Ham - Home Win
Everton v Man City - Home Win
Tottenham v Fulham - Home Win
Southampton v Liverpool - Away Win
Man U v Reading - Home Win
Wigan v Newcastle - Draw
Stoke v West Brom - Away Win
Sunderland v Norwich - Draw

On the whole, the model's been running at around 50% of results called correctly and slowly making money through betting simple singles at the same stake on each prediction. Every time I try to mess about with that basic approach, it does worse, so I'm sticking to it for now.

I did make some improvements to the model this week, but overall it's made only a very small difference to accuracy, which was a bit disappointing. I may go for incorporating form next, rather than using each player's average stats across the whole season all the time, but it's going to be a busy week at work so that may have to wait...

Same as last week, I might also tweet an update when the starting line-ups are confirmed - @data_monkey if you're interested. Although last week, the original predictions did marginally better than the update, due to Holt's penalty miss. Maybe go with these as they are!

Monday, 11 March 2013

Football Sim: 9/10 March performance review

The four fixtures played on Saturday seem to have been acknowledged by everybody on my Twitter feed as 'a bit tricky'. I placed small bets on each game again, not really to gamble but because it keeps me honest. If I'm forced to actually pick a result, then there's no making excuses afterwards and claiming the model did ok when it was poor, or that I'd really have gone for a draw in the end, not a home win. When you're only looking at percentage chances of a result, it's easy to convince yourself that some of the bad forecasts were random chance.

From Saturday's forecasts, I'm fairly happy with the percentages that the model turned out, with the exception of QPR. The bookies said QPR would win. Other models said it was close. My model said Sunderland would win.

When Sunderland scored first I was feeling pretty smug, but then QPR got three and that was that. I've only seen the highlights but it looked like QPR deserved it, so I'm going to do some diagnostics on that one and work out what went wrong.

I changed my mind on the predictions at the last minute, after re-running the sim with confirmed starting line-ups. This led to a switch from draws in Norwich v Southampton and West Brom v Swansea, to picking Norwich and Swansea wins for betting purposes (N.b. even though I picked Swansea for a bet, West Brom had the highest chance of winning that game.)

Overall, I had on Saturday:

Reading v Villa - Away Win - Won
QPR v Sunderland - Away Win - Lost
Norwich v Southampton - Home Win - Lost
West Brom v Swansea - Away Win - Lost

And on Sunday:

Liverpool v Tottenham - Home Win - Won
Newcastle v Stoke - Home Win - Won

50% correct, which is pretty much bang on the model's average. We were a missed penalty away from getting Norwich right too.

Betting with £10 stakes again, you'd be £3.30 up. Hardly setting the world on fire but a win's a win and that's wins two weeks in a row (four actually, but I didn't blog the first two so they don't count).

I'm going to spend some time on shot calibration this week, if I can find the time, and will try to post about that at some point. Goalscoring in the model is hard to get right, because there aren't many historical goals per player from which to set the conversion chances. We see hundreds of passes every game but only a couple of goals, and the model is quite sensitive to the settings for each player's chance of a shot being a goal. I'm pretty sure there are some performance gains to be had here.
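One standard way to stabilise those settings, and the sort of thing I'll be looking at (a sketch of the idea, not a decision), is to shrink each player's raw conversion rate towards the league average, so that small samples can't produce silly numbers:

```python
def smoothed_conversion(goals, shots, league_rate=0.10, prior_shots=50):
    """Shrink a player's shot-to-goal rate towards the league average.

    With few shots the estimate stays near league_rate; with many shots
    it approaches the player's own goals/shots. All numbers illustrative.
    """
    return (goals + league_rate * prior_shots) / (shots + prior_shots)

# 3 goals from 12 shots is a 25% raw rate, but on 12 shots we don't
# trust that, so the smoothed estimate lands much closer to 10%.
print(smoothed_conversion(goals=3, shots=12))  # ~0.13
```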

Saturday, 9 March 2013

Football Sim: Predictions for 9/10 March 2013

Looks like potentially a tricky set of fixtures to call this week as many are close to 'standard' home/away/draw percentages. If I get the chance, I'll re-post before 3.00 using the actual starting line-ups to simulate, but for now I've taken them from Fantasy Football Scout as usual.

[Image missing: predicted percentages for this weekend's fixtures]

We're fairly close to the market odds everywhere, with the exception of Sunderland, where my model is predicting an away win against QPR. Harry dropped Taarabt last week and Zamora may be coming back from injury so the QPR line-up is a little in doubt. Let's see if we can squeeze in that rerun of the model before kick off.

For betting, we've got some fairly clear over-indexes to pick (see last week for an explanation) and a couple of games where it's a toss-up between an away win and a draw.
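Roughly speaking (last week's post has the proper explanation), an over-index is where the model gives an outcome a noticeably higher chance than the bookies' implied probability. In sketch form, with made-up numbers:

```python
def edge(model_prob, decimal_odds):
    """How far the model's probability exceeds the bookies' implied chance."""
    return model_prob - 1 / decimal_odds

# Illustrative numbers only.
print(f"{edge(model_prob=0.45, decimal_odds=2.75):+.1%}")  # +8.6%
```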

Clear ones first:

Reading v Villa - Away Win
Liverpool v Tottenham - Home Win
Newcastle v Stoke - Home Win
QPR v Sunderland - Away Win

And the close calls:

Norwich v Southampton
West Brom v Swansea

I'm going for draws in those two, because it's an incredibly close call between draws and away wins and we'll get better odds on draws.

Neck stuck out. See you later!

Tuesday, 29 January 2013

Football Sim: Predictions for 29-Jan-13

OK, here they are! The first time I've let the football simulator loose in public, on games that haven't actually been played yet. Sounds like a recipe for disaster to me, but let's do it anyway.

If you've got no idea what I'm talking about, read this first.

[Image missing: the simulator's predicted percentages for tonight's games]

A few things to be aware of... (in other words, here are my get-out clauses, terms and conditions, caveats, call them what you like...)

I got the teams' predicted starting line-ups here.

Your guess is as good as mine (or Fantasy Football Scout's) what Newcastle's starting line-up is going to be.

I've made the new NUFC Frenchmen into completely average players for the purposes of the sim. They might be better than that and they might be worse!

As per my big "season so far" post yesterday, the model calls winners correctly around 50% of the time on average. Follow my tips at your own risk...

It gets exact scores right around 10% of the time.

Both of those mean that for betting, it seems to just about break even (but not quite). I'm working on it.

That Sunderland prediction looks a little too heavily skewed to a home win for my liking.

But I'm off to put a bet on it anyway.

Monday, 28 January 2013

Football prediction: Simulating the season so far

I've been working for a while on an agent-based simulation of football matches, to see how close it can get to predicting the real results of Premier League games.

Last week, I explained roughly what the model is and how it works.

This week, is the model any good? I'll be honest, I may have cherry-picked that Chelsea result a little for the last post, but this will be a warts-and-all picture of how well the model predicts. Or how it would have predicted the season so far (with one important caveat that we'll come to in a while).

One last thing before we get stuck in... The most obvious use of a model like this (if it works) is to gamble based on its predictions, but I'm building it more for the technical challenge of seeing if I can do it. I'm really interested in using a model like this as a scenario planning tool - what would happen to your season if you signed player x? Or if player y got injured? If the model can be made to work, you could run 'what ifs' and work out the value of players in terms of expected points added to the team's total across a season.

Back to betting, I might have a punt, but I'm not really a gambler. Bet on it if you like (sensibly!) as I start to predict games on Wallpapering Fog and don't forget to add a comment to let us know how you got on. I'll talk about odds a fair bit below, because they're an obvious source of another prediction to compare with the model. Having said that, if you'd bet on every game so far this season using the model's predictions - up to 20th January 2013 - you'd have just about broken even. Improvements to the model from here would make it profitable. Got your attention? Here we go.

What I've been doing this weekend is building some code to run a whole series of games in succession, not just one at a time. Then I fed in the fixtures, starting line-ups and player statistics for each game this season, up to the 20th January, using data from EPL Index. We simulate each individual game 100 times and get an overall predicted likelihood of home win, away win or draw, plus the most likely scoreline.
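The aggregation step looks roughly like this (a sketch assuming a simulate_match function that returns one simulated scoreline; the names are illustrative):

```python
from collections import Counter

def predict_fixture(simulate_match, n_runs=100):
    """Summarise n_runs simulations of one fixture.

    simulate_match - assumed to return (home_goals, away_goals)
                     for a single simulated game.
    """
    scores = Counter(simulate_match() for _ in range(n_runs))
    home_win = sum(n for (h, a), n in scores.items() if h > a) / n_runs
    away_win = sum(n for (h, a), n in scores.items() if h < a) / n_runs
    draw = 1.0 - home_win - away_win
    likeliest_score = scores.most_common(1)[0][0]
    return home_win, draw, away_win, likeliest_score
```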

Remember that caveat I mentioned? Here it is. I'm simulating each player, using their average performance across the whole season so far, which isn't strictly fair. When Fulham played Norwich on the first day of the season, I wouldn't actually have had any 2012/2013 data to feed into the model at the time - only the previous season's numbers. It's something else on the long list of development tasks that need dealing with...

Here are the predictions anyway. Correct calls in green.
Google's determined to open the image below in its G+ gallery, which isn't readable. Here for bigger.

[Image missing: game-by-game predictions for the season so far, correct calls in green]

Overall, the model calls 50% of results correctly, on the criterion that the team it gave the most chance of winning ran out as winners in real life.

I was initially a bit disappointed with that. Only 50%? I was hoping for more.

Then I had a look to see how often the bookies get it right. No doubt this will be incredibly obvious to some readers, but as I said I'm not a gambler. How often did the bookies' favourite win those games? 51%. (odds from football-data.co.uk)

Suddenly 50% doesn't seem all that bad!

A big part of the error comes from draws, both in my model and in the bookies' odds. A draw is almost never the most likely result of a single game, but overall, around 30% of games will end in a draw. My model only called one game as having a draw as the overall most likely outcome - Aston Villa vs. Stoke.

When you simulate game-by-game, you'll predict almost zero draws, which means you'll be wrong 30% of the time before you even start. Predicting a season, where you only simulate each game once would give a 'normal' number of draws, but each individual game's prediction would be much less accurate. It's swings and roundabouts, depending on whether we're trying to predict final league placings, or the result of a single game.
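To make the two prediction rules concrete (toy probabilities, not model output):

```python
import random

probs = {"home": 0.40, "draw": 0.31, "away": 0.29}  # a typical tight game

# Game-by-game rule: call the most likely outcome. A draw almost never
# tops the list, so this rule predicts almost zero draws.
argmax_call = max(probs, key=probs.get)  # "home"

# Season rule: sample one outcome per game. Across many games this gives
# a normal number of draws, but any single call is noisier.
sampled_call = random.choices(list(probs), weights=probs.values())[0]
```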

If you'd bet on the model's prediction for every game, using the same stake, at Bet365's odds, you'd have lost 3% of your money so far. If you'd taken the best odds available in the market each time, not just Bet365's, you'd actually be up 1%. That's not a disaster for a first effort! At least we're not ruined.
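For reference, the bookkeeping behind those percentages is just flat-stake singles; a sketch with invented odds:

```python
def flat_stake_return(bets, stake=10.0):
    """Profit or loss from backing every prediction with the same stake.

    bets - list of (decimal_odds, won) pairs, one per game.
    """
    profit = sum((odds - 1) * stake if won else -stake for odds, won in bets)
    return profit / (stake * len(bets))  # as a fraction of total money staked

# Illustrative: three wins at modest odds, three losses.
bets = [(1.8, True), (2.1, False), (3.2, True),
        (1.9, False), (2.4, True), (2.0, False)]
print(f"{flat_stake_return(bets):+.1%}")  # +23.3%
```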

Let's see what that accuracy looks like. I tweeted this one over the weekend; it shows the ten-game rolling average prediction accuracy for the model and also for the bookies.

[Chart missing: ten-game rolling average prediction accuracy, model vs. bookies]

When the shaded area between the lines is red, the bookies are predicting more results correctly than the model, over the previous ten games. When it's green, the model is out-performing the bookies.

It's interesting that a couple of weeks into the season, accuracy for the model and for the bookies plummets to just 10-20%. It may be that the early season is harder to predict - we'll need to run a few more seasons to find out. That period certainly screws up any hopes of winning a fortune as the bookmakers do slightly better, even though both are doing badly.

Here's the same data, cumulatively. The chart shows total accuracy across the season so far, with 20th January on the far right hand side.

[Chart missing: cumulative prediction accuracy across the season, model vs. bookies]

You can see that the bookies' favourites have won more often than the model's across the whole season so far, ending with just the 1% gap that I described earlier, 51% to 50%.

What's very interesting to me is that the model looks like it's improving slowly across the season and closing the gap; the games may be becoming more predictable as the season goes on. I'm not jumping on that conclusion just yet, but I'm certainly going to keep an eye on it.

If you'd bet using the model, only since the New Year at best market odds, you'd currently be up 19% on your original stake.

Let's finish with the next big improvement for the model - at least, I hope it will bring a significant improvement in predictive power. I'm treating these developments as positives, even though there's a fair bit of work involved in building them, because our simple view of the world is already doing reasonably well and there's huge scope for improvement from here.

At the moment, once a team has the ball in the sim, the opposition can't win it back; the team in possession can only lose it. This is fine in a game against an 'average' team but, taking Arsenal as an example, their passing accuracy was 90% against Sunderland and 81% against Manchester City. I've got no doubt that Man City caused that drop by harrying their opponents, so the model needs to account for it.

Even more important, it's not every player's passing accuracy that drops against higher quality opponents - some will cope much better than others. We need a way to predict what each individual player's passing accuracy will be, against this week's opponents. I'm working on it.
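I haven't settled on a method yet, but the shape of the fix is something like this (names and numbers entirely illustrative):

```python
def adjusted_pass_success(player_rate, opponent_pressure, resilience=0.5):
    """Scale a player's base pass completion rate by opponent quality.

    opponent_pressure - how much this opponent suppresses passing overall
                        (e.g. 0.10 for a hard-pressing side, 0.0 for average)
    resilience        - how well this particular player copes under pressure
                        (0 = fully affected, 1 = unaffected)
    """
    return player_rate * (1 - opponent_pressure * (1 - resilience))

# Illustrative: a 90% passer against a hard-pressing side.
print(adjusted_pass_success(0.90, opponent_pressure=0.10, resilience=0.3))  # 0.837
```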

Stay tuned for predictions for Tuesday's games!

Monday, 21 January 2013

Simulating football matches: An experiment.

Prediction. The holy grail of analysis. Diagnostics are good and they help us to understand the past, but if we can't use that work to get better at predicting the future, then isn't it all a bit pointless? If your theory doesn't predict the future, then it isn't science. Unlike football punditry, you don't get meteorologists diagnosing that last weekend's weather was rubbish because "the Sun must have had an off day". Scientists test their theories through prediction.

In my day job, we predict advertising. You've run adverts for your business before, so we measure the effect of those and then we can tell you how much you'll sell in the future.

Advertising's mostly quite dull. Can we do it for football?

Since Moneyball, a lot of people have been paying close attention to football statistics, writing up analyses and discovering relationships in OPTA's football data. Man City are even trying to tap into those amateur insights through the MCFC Analytics Project.

I couldn't resist diving in, so armed with a subscription to EPL Index (four quid a month for player stats for each individual game? Yes, please) I've been running a little side project and playing with the data.

If you want to predict football, there are a few ways you could go about it. I did have a crack at this years ago, but with only very top-line data on past game results. You can build a regression model (a bit like the advertising models I build for a living) that uses each team's form to predict the outcome of the next game.

It works something like this...

Predicted result = f(Home Team form, Away Team form, plus lots of other things like goal-scoring and conceding rates...)

All weighted for the quality of opposition over the past few games.
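A toy version of that top-down approach might look like this (a sketch of the general idea with invented thresholds, not the model I actually built):

```python
def predict_result(home_form, away_form, home_advantage=0.35):
    """Toy top-down predictor: compare form scores plus home advantage.

    home_form / away_form - e.g. opposition-weighted points per game.
    All thresholds are illustrative.
    """
    diff = home_form + home_advantage - away_form
    if diff > 0.25:
        return "home win"
    if diff < -0.25:
        return "away win"
    return "draw"

print(predict_result(home_form=2.1, away_form=0.9))  # "home win"
```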

This type of model kind of works. Basically, it will predict things like Man U should beat Wigan, but we already know that. The model I built a few years ago didn't do any better than my own guesses and that's not tremendously useful.

Top-line models like this also have a massive issue in that there are simply too many variables that need to be accounted for. What are the chances that Man U beat Wigan if Van Persie's injured? Our model based purely on past form will struggle with that, especially if he's played all season and we've got no experience of the team without him (and, just as importantly, with a different player in his place).

You also get very little in the way of explanation with top-line models like this. Man U will win because they usually win and Wigan will lose because (sorry Wigan fans) they usually lose. What can Wigan do about that? Well obviously they need to improve their form. Thanks, Mr. Consultant, you're fired.

Long story short, we need a different technique and the one I've been using is called Agent Based Modelling (ABM).

ABM simulates the world from the bottom up rather than the top down, which in football means simulating the players rather than the result. We set up an artificial game - using real world OPTA statistics about the performance of individual players - and we run the game to find a predicted result. The result is an outcome of the simulation that we can't control directly.

If you're thinking, "Is he trying to build Football Manager using OPTA data?", that's basically the size of it, yes. Told you it was more fun than advertising.

Inside the model, you kick off the game and from then on, it's all down to the simulation. The player with the ball makes a decision based on what they do most often in real games - pass, shoot, dribble... Each decision is randomly generated, but weighted towards what that player does most often in real life, according to the OPTA data.

If they choose to pass, the simulation checks for a successful pass and then works out who the ball went to, again a randomly generated choice but weighted by real data. It's the same if they shoot, when we work out the chances that their shot went in. If they lose the ball, it transfers to a player on the opposition, again determined by a weighting of... you get the idea. Then the whole thing starts again with the player who has the ball now.

We play the game through, with players passing, shooting, losing the ball etc. and we get a result, which is our prediction for the match.
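Stripped right down, the loop looks something like this (a heavily simplified sketch with made-up rates; the real model works from per-player OPTA stats and carries much more detail):

```python
import random

# Made-up action mix; in the real model this comes from each player's data.
ACTIONS = {"pass": 0.83, "dribble": 0.15, "shoot": 0.02}

def simulate_match(home, away, events=600):
    """home/away: lists of player dicts with 'pass_success' and 'conversion'."""
    score = {"home": 0, "away": 0}
    side, players = "home", home
    player = random.choice(players)                     # kick-off
    for _ in range(events):
        action = random.choices(list(ACTIONS), weights=ACTIONS.values())[0]
        if action == "shoot":
            if random.random() < player["conversion"]:  # did the shot go in?
                score[side] += 1
            turnover = True                             # restart, simplified
        else:
            # Treat a dribble like a pass for simplicity: succeed and the ball
            # stays with the team; fail and it transfers to the opposition.
            turnover = random.random() >= player["pass_success"]
        if turnover:
            side, players = ("away", away) if side == "home" else ("home", home)
        player = random.choice(players)                 # who has it now?
    return score["home"], score["away"]

# Two illustrative teams: the home side passes and finishes slightly better.
home = [{"pass_success": 0.82, "conversion": 0.12}] * 11
away = [{"pass_success": 0.76, "conversion": 0.10}] * 11
results = [simulate_match(home, away) for _ in range(1000)]
print("Home win chance: {:.0%}".format(sum(h > a for h, a in results) / 1000))
```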

Now, you might say, "but there were loads of randomly generated decisions in the model. If we ran it twice we might get different results", and you'd be absolutely right. It's just like the real world and if the same teams play each other a few times, you can get a different result every time.

What we're after is the probability of winning for each team, so we run the match 1000 times (for now - it's a nice round number) and count up how many times each team wins.

After some teething problems (it wouldn't be interesting if it was too easy) the model's starting to turn up sensible results and I promise I'll share its predictions for the next set of Premier League games (29th and 30th Jan). I'm not standing by those predictions yet, but it will keep me honest and motivated to do the development in public and to an extent already has... after tweeting on Saturday that Norwich might have more chance against Liverpool than the bookies thought and then watching them get battered 5-0. I now know why the model did that and it doesn't do it any more!

The model is a huge oversimplification of a real game but over time, it should help to teach us about what's really important. As a quick example, the model currently doesn't treat crosses any differently from other passes - they're a complete or incomplete pass and that's it. If they're a complete pass, then the player who receives the ball might shoot. But if we keep seeing that teams with traditional wingers win more games than the model would predict, then that might need sorting out.

I'll end for today with a bit about the Chelsea vs. Arsenal game on Sunday, to illustrate what you get from the model and the sorts of things that we might be able to do with it. Here are the teams (no subs yet by the way):

[Image missing: starting line-ups for Chelsea vs. Arsenal]

Run that one 1000 times and what happens?

Chelsea:    44%
Arsenal:    26%
Draw:       30%

So Chelsea are predicted to win 44% of the time. We get a scoreline from the model too and here are the 15 most likely, adding up to 94% of all results. An interesting outcome is that although we predict Chelsea to win overall, the single most likely result is 1-1.

[Table missing: the 15 most likely scorelines and their probabilities]

The actual result was 2-1 to Chelsea, so the model got the winner right and 2-1 was our third most likely score. That looks potentially ok! We'll only find out if it really works by testing across a lot of games though.

Another way to see if the predictions are sensible is to compare with the bookies' odds. There's probably something wrong if we're not in the same ball-park as professionals taking advantage of the wisdom of their crowds of punters. The bookies had these odds (decimal odds, with implied percentage in brackets):

Chelsea:    1.83 (55%)
Arsenal:    3.5 (29%)
Draw:       3.25 (31%)

On an Arsenal win, or a draw, we're almost bang on the market odds. On Chelsea we're below, but bear in mind that the model's percentages have to add up to 100%. The bookmakers' implied percentages add up to 114%, which is why bookmakers make money.
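The conversion is just one over the decimal price, and the amount by which the book exceeds 100% is the bookmaker's margin (the overround):

```python
odds = {"Chelsea": 1.83, "Arsenal": 3.50, "Draw": 3.25}

implied = {outcome: 1 / price for outcome, price in odds.items()}
for outcome, p in implied.items():
    print(f"{outcome}: {p:.0%}")                   # 55%, 29%, 31%

print(f"Book total: {sum(implied.values()):.0%}")  # 114% - the overround
```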

Lastly, just as an example of what else you can do with this type of model, there were some raised eyebrows that Torres started for Chelsea instead of Demba Ba. What if they'd been switched and Ba had started the game? Let's run it another 1000 times...

Chelsea's chance of winning goes up by two percentage points to 46%.

Arsenal's chances go down, right? Wrong. This is why agent based models are good - they can show us things that we don't expect.

With Demba Ba starting for Chelsea, Arsenal's chances of winning actually go up by one percentage point to 27% (to complete the picture, the chances of a draw drop by three points). The balance of how often Ba receives the ball and how often he gives it away, compared to Torres, makes all the difference. All in all, the predictions barely move, but it's an interesting outcome. We'll see more of these.

OK, that will do for now. Predictions next week and maybe, just maybe, a cheeky punt based on our forecasts.

Oh and if you're running your own project like this, I'd LOVE to hear about it! Stay tuned, this is just the start.

Tuesday, 15 February 2011

Probably the best strategy in the world

Neil Perkin over at Only Dead Fish has written a nice piece on new measurement techniques and predictive markets. It's an area of marketing measurement that I find fascinating, even if so far I've seen very few real world marketing applications.

Prediction markets are games where you trade shares in future events. The Hollywood Stock Exchange is a famous example, where you 'bet' on the audience that films will achieve at the box office. The idea is that people (on average, in large numbers) are quite good at guessing what other people will do and the outcome of future events. Running a survey and asking people if they plan to see an upcoming film at the cinema is - runs the theory - less accurate than asking those same people whether they think lots of other people will watch it.

In one respect, it's easy to see that the theory works. In horse racing, horses become favourites because people bet that they're going to win, and very often the favourite does win. Odds on Betfair are effectively the punters' averaged view of what they think is going to happen in future. Websites like Political Betting take those market odds and use them as a prediction tool for election outcomes, or for how long the current Prime Minister will last.



If you fancy reading a bit more and playing with some toys, then Inkling is a good place to start.

The advertising applications of prediction markets are exciting. Instead of a focus group asking people if they like a new product, you could ask a sample of respondents if they think other people will buy it. Want to know which mobile phone platform will dominate in five years? Get people to bet on it. Don't ask people if they like your creative, ask whether they think it will be popular.

In terms of their output, prediction markets have some similarities to another research technique that I'm excited about: agent-based modelling. It's a bottom-up approach to modelling where you create an artificial simulated market that contains individuals, give them some rules and then see how they behave. You might set up a simulation for a new product launch and then model how shoppers trial and adopt the product as they are exposed to advertising messages. The crucial difference from top-down modelling, where you analyse past sales, is that the simulated individuals in an agent-based model have an element of randomness in their decision making - they don't necessarily do the same thing every time you run the simulation.


These two new techniques are similar in that their output tries to account for randomness. You don't get a single answer, and in that respect they're much more like reality than a lot of the techniques we use right now. What you get are predicted likelihoods that rank the possible outcomes.

Think about what that means for a minute. An analyst can recommend the best strategy for launching a brand, and report that in 70% of simulations, sales exceeded the target. It's the best strategy, but even in the simulation it often doesn't work. We can tune the strategy to improve our chances, but in the end, randomness in the model means we might fail even though our strategy was a good one.

Weather forecasters often give us predictions this way - they'll say that there's only a 20-30% chance of rain, so you get annoyed when you turn up for your meeting without an umbrella and soaking wet. The forecaster didn't say it wouldn't rain though, so it's your fault really - he said it probably wouldn't rain and you chose to risk it.

I'm incredibly excited about these emerging techniques, but they need some new thinking on both the analyst's and the decision maker's side. We analysts need to work out how to apply new predictive techniques to marketing. Marketers need to recognise that they're going to get some extra information on which to base a decision, not the perfect answer.

That's actually the way that analytics should always have worked, but both sides too often like to pretend otherwise.

If you ask for randomness to be included in marketing analysis, then you're going to get answers that far more often include the word 'probably'.