Prediction. The holy grail of analysis. Diagnostics are good and they help us to understand the past, but if we can't use that work to get better at predicting the future, then isn't it all a bit pointless? If your theory doesn't predict the future, then it isn't science. Unlike football punditry, you don't get meteorologists diagnosing that last weekend's weather was rubbish because "the Sun must have had an off day". Scientists test their theories through prediction.
In my day job, we predict advertising. You've run adverts for your business before, so we measure the effect of those and then we can tell you how much you'll sell in the future.
Advertising's mostly quite dull. Can we do it for football?
Since Moneyball, a lot of people have been paying close attention to football statistics, writing up analyses and discovering relationships in OPTA's football data.
Man City are even trying to tap into those amateur insights through the MCFC Analytics Project.
I couldn't resist diving in, so armed with a subscription to EPL Index (four quid a month for player stats for each individual game? Yes, please) I've been running a little side project and playing with the data.
If you want to predict football, there are a few ways you could go about it. I did have a crack at this project years ago, but with only very top-line data on past game results. You can build a regression model (a bit like the advertising models I build for a living) that uses each team's form to predict the outcome of the next game.
It works something like this...
Predicted result = f(Home Team form, Away Team form, plus lots of other things like goal-scoring and conceding rates...)
All weighted for the quality of opposition over the past few games.
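To make that a bit more concrete, here's a minimal sketch of what "form weighted for the quality of opposition" could look like. It isn't the model I built back then: the function names, the use of opponent strength as a weight and the coefficients are all assumptions for illustration.

```python
def weighted_form(results):
    """Average points per game over recent matches, weighted by opponent strength.

    `results` is a list of (points, opponent_strength) pairs. Points are 3/1/0
    for a win/draw/loss. opponent_strength is a hypothetical positive number,
    for example the opponent's points per game so far this season.
    """
    total_weight = sum(strength for _, strength in results)
    if total_weight == 0:
        return 0.0
    return sum(points * strength for points, strength in results) / total_weight


def predicted_margin(home_results, away_results, home_advantage=0.3, slope=0.5):
    """Toy linear f(): a positive value favours the home side.

    The coefficients are made up. In a real regression they would be
    estimated from past results, alongside many more variables.
    """
    form_gap = weighted_form(home_results) - weighted_form(away_results)
    return home_advantage + slope * form_gap


# Made-up recent results for a strong home side and a struggling away side.
strong = [(3, 1.2), (3, 0.9), (1, 1.8)]
weak = [(0, 1.5), (1, 1.0), (0, 2.0)]
print(predicted_margin(strong, weak))
```

A win against a strong opponent counts for more than a win against a weak one, which is the whole point of the weighting.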
This type of model kind of works. Basically, it will predict things like Man U should beat Wigan, but we already know that. The model I built a few years ago didn't do any better than my own guesses and that's not tremendously useful.
Top-line models like this also have a massive issue in that there are simply too many variables that need to be accounted for. What are the chances that Man U beat Wigan if Van Persie's injured? Our model based purely on past form will struggle with that, especially if he's played all season and we've got no experience of the team without him (or, just as important, with a different player in his place).
You also get very little in the way of explanation with top-line models like this. Man U will win because they usually win and Wigan will lose because (sorry Wigan fans) they usually lose. What can Wigan do about that? Well obviously they need to improve their form. Thanks, Mr. Consultant, you're fired.
Long story short, we need a different technique and the one I've been using is called Agent Based Modelling (ABM).
ABM simulates the world from the bottom up rather than the top down, which in football means simulating the players rather than the result. We set up an artificial game - using real world OPTA statistics about the performance of individual players - and we run the game to find a predicted result. The result is an outcome of the simulation that we can't control directly.
If you're thinking, "Is he trying to build Football Manager using OPTA data?", that's basically the size of it, yes. Told you it was more fun than advertising.
Inside the model, you kick off the game and from then on, it's all down to the simulation. The player with the ball will make a decision, based on what they do most often in real games - pass, shoot, dribble... each decision is randomly generated, but weighted towards the probability of what that player does most often in real life, based on the OPTA data.
If they choose to pass, the simulation checks for a successful pass and then works out who the ball went to, again a randomly generated choice but weighted by real data. It's the same if they shoot, when we work out the chances that their shot went in. If they lose the ball, it transfers to a player on the opposition, again determined by a weighting of... you get the idea. Then the whole thing starts again with the player who has the ball now.
We play the game through, with players passing, shooting, losing the ball etc. and we get a result, which is our prediction for the match.
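The loop described above can be sketched in a few lines of Python. This is a toy version under loud assumptions, not my actual model: the four players, all of their numbers, the uniform choice of who receives the ball and the fixed count of 900 decisions per match are invented for illustration. A real version would take every one of those from the player data.

```python
import random

# Made-up statistics for a two-a-side game; not real OPTA figures.
PLAYERS = {
    "home_mid": {"team": "home", "actions": {"pass": 0.85, "shoot": 0.05, "dribble": 0.10},
                 "pass_success": 0.88, "shot_conversion": 0.08, "dribble_success": 0.55},
    "home_fwd": {"team": "home", "actions": {"pass": 0.60, "shoot": 0.25, "dribble": 0.15},
                 "pass_success": 0.75, "shot_conversion": 0.15, "dribble_success": 0.50},
    "away_mid": {"team": "away", "actions": {"pass": 0.80, "shoot": 0.05, "dribble": 0.15},
                 "pass_success": 0.82, "shot_conversion": 0.07, "dribble_success": 0.50},
    "away_fwd": {"team": "away", "actions": {"pass": 0.65, "shoot": 0.20, "dribble": 0.15},
                 "pass_success": 0.72, "shot_conversion": 0.12, "dribble_success": 0.45},
}


def weighted_pick(weights, rng):
    """Pick one key from a {key: weight} dict with probability proportional to weight."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


def pick_player(team, players, rng, exclude=None):
    """Pick a player on `team`. Uniform here; a real model would weight by data."""
    names = [n for n, p in players.items() if p["team"] == team and n != exclude]
    return rng.choice(names)


def step(holder, players, rng):
    """Simulate one decision. Returns (next_holder, scoring_team_or_None)."""
    me = players[holder]
    other = "away" if me["team"] == "home" else "home"
    action = weighted_pick(me["actions"], rng)
    if action == "shoot":
        scored = rng.random() < me["shot_conversion"]
        # Simplification: goal or miss, the other side gets the ball next.
        return pick_player(other, players, rng), (me["team"] if scored else None)
    if action == "pass":
        if rng.random() < me["pass_success"]:
            return pick_player(me["team"], players, rng, exclude=holder), None
        return pick_player(other, players, rng), None
    # Dribble: keep the ball on success, lose it to the opposition otherwise.
    if rng.random() < me["dribble_success"]:
        return holder, None
    return pick_player(other, players, rng), None


def play_match(players, n_steps=900, seed=None):
    """Play one match as a fixed number of decisions and return the score."""
    rng = random.Random(seed)
    score = {"home": 0, "away": 0}
    holder = pick_player("home", players, rng)  # home side kicks off
    for _ in range(n_steps):
        holder, scorer = step(holder, players, rng)
        if scorer:
            score[scorer] += 1
    return score


print(play_match(PLAYERS, seed=1))
```

Nothing in there sets the result directly. The score only comes out of the individual decisions, which is the bottom-up idea in miniature.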
Now, you might say, "but there were loads of randomly generated decisions in the model. If we ran it twice we might get different results", and you'd be absolutely right. It's just like the real world and if the same teams play each other a few times, you can get a different result every time.
What we're after is the probability of winning for each team, so we run the match 1000 times (for now - it's a nice round number) and count up how many times each team wins.
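In code, the counting step is about as simple as it sounds. Here's a rough sketch, assuming a `simulate_match` function that plays one game and returns the score. The version below is only a placeholder with made-up scoring rates, standing in for the full agent-based simulation, and the function names are mine for illustration.

```python
import random
from collections import Counter


def simulate_match(rng):
    """Placeholder for one full run of the agent-based game.

    Assumption: each side gets 90 one-minute chances to score with a fixed,
    made-up probability. The real simulation plays out player decisions.
    """
    home = sum(rng.random() < 0.017 for _ in range(90))
    away = sum(rng.random() < 0.012 for _ in range(90))
    return home, away


def estimate(simulate, n_runs=1000, seed=None):
    """Run the match n_runs times and count outcomes and scorelines."""
    rng = random.Random(seed)
    outcomes = Counter()
    scores = Counter()
    for _ in range(n_runs):
        home, away = simulate(rng)
        scores[(home, away)] += 1
        if home > away:
            outcomes["home"] += 1
        elif away > home:
            outcomes["away"] += 1
        else:
            outcomes["draw"] += 1
    # Turn the win/draw/loss counts into estimated probabilities.
    probs = {k: outcomes[k] / n_runs for k in ("home", "draw", "away")}
    return probs, scores


probs, scores = estimate(simulate_match, n_runs=1000, seed=1)
print(probs)
print(scores.most_common(5))
```

The same loop gives you the most common scorelines for free, since each run produces a full score and not just a winner.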
After some teething problems (it wouldn't be interesting if it was too easy) the model's starting to turn up sensible results and I promise I'll share its predictions for the next set of Premier League games (29th and 30th Jan). I'm not standing by those predictions yet, but doing the development in public will keep me honest and motivated, and to an extent it already has. On Saturday I tweeted that Norwich might have more chance against Liverpool than the bookies thought, then watched them get battered 5-0. I now know why the model did that and it doesn't do it any more!
The model is a huge oversimplification of a real game but over time, it should help to teach us about what's really important. As a quick example, the model currently doesn't treat crosses any differently from other passes - they're a complete or incomplete pass and that's it. If they're a complete pass, then the player who receives the ball might shoot. But if we keep seeing that teams with traditional wingers win more games than the model would predict, then that might need sorting out.
I'll end for today with a bit about the Chelsea vs. Arsenal game on Sunday, to illustrate what you get from the model and the sorts of things that we might be able to do with it. Here are the teams (no subs yet by the way):
Run that one 1000 times and what happens?
Chelsea: 44%
Arsenal: 26%
Draw: 30%
So Chelsea are predicted to win 44% of the time. We get a scoreline from the model too and here are the 15 most likely, adding up to 94% of all results. An interesting outcome is that although we predict Chelsea to win overall, the single most likely result is 1-1.
The actual result was 2-1 to Chelsea, so the model got the winner right and 2-1 was our third most likely score. That looks potentially ok! We'll only find out if it really works by testing across a lot of games though.
Another way to see if the predictions are sensible is to compare with the bookies' odds. There's probably something wrong if we're not in the same ball-park as professionals taking advantage of the wisdom of their crowds of punters. The bookies had these odds (decimal odds, with implied percentage in brackets):
Chelsea: 1.83 (55%)
Arsenal: 3.5 (29%)
Draw: 3.25 (31%)
On an Arsenal win, or a draw, we're almost bang on the market odds. On Chelsea we're below, but bear in mind that the model's probabilities have to add up to 100%. The bookmakers' implied percentages add up to 114%, and that extra margin is how bookmakers make money.
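If you want to check the arithmetic, the implied percentage is just one divided by the decimal odds. The snippet below uses the three prices quoted above. The last step, scaling everything down so the percentages add up to 100%, is one common way of stripping out the margin and is my assumption here, not something the bookmakers publish.

```python
# Decimal odds quoted above for Chelsea vs. Arsenal.
odds = {"Chelsea": 1.83, "Arsenal": 3.5, "Draw": 3.25}

# Implied probability of each outcome is 1 / decimal odds.
implied = {outcome: 1 / price for outcome, price in odds.items()}

# The implied probabilities sum to more than 1; the excess is the margin.
overround = sum(implied.values())

# Assumption: remove the margin proportionally so the figures sum to 1.
fair = {outcome: p / overround for outcome, p in implied.items()}

for outcome in odds:
    print(f"{outcome}: implied {implied[outcome]:.1%}, margin removed {fair[outcome]:.1%}")
print(f"Total implied: {overround:.1%}")
```

With the margin scaled out like this, Chelsea come down to roughly 48%, Arsenal to roughly 25% and the draw to roughly 27%, which puts the model's 44% for Chelsea a good deal closer to the market than the raw 55% suggests.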
Lastly, just as an example of what else you can do with this type of model, there were some raised eyebrows that Torres started for Chelsea instead of Demba Ba. What if they'd been switched and Ba had started the game? Let's run it another 1000 times...
Chelsea's chance of winning goes up by two percentage points to 46%.
Arsenal's chances go down, right? Wrong. This is why agent based models are good - they can show us things that we don't expect.
With Demba Ba starting for Chelsea, Arsenal's chances of winning actually go up by one percentage point to 27% (to complete the picture, the chances of a draw drop by three points to 27%). The balance of how often Ba receives the ball and how often he gives it away compared to Torres makes all the difference. All in all, the predictions barely move but it's an interesting outcome. We'll see more of these.
OK, that will do for now. Predictions next week and maybe, just maybe, a cheeky punt based on our forecasts.
Oh and if you're running your own project like this, I'd LOVE to hear about it! Stay tuned, this is just the start.