Saturday 30 March 2013

Model changes and predictions for this weekend

I've been making some changes to my football model over the international break and its prediction performance has improved a little again, although with at least one side effect that we'll see in a second.

Apart from some robustness testing, the biggest change has been adjusting shot success depending on the opponent. The model should handle the number of shots, and which players take those shots, reasonably well. It simulates the passing of individual players and if you haven't got the ball, you can't shoot, so in the simulation, shot rates naturally increase against worse teams and drop against better ones, without me having to impose that. When you look at simulated shot stats for a whole season, it does this job fairly well.

Where the previous version of the model had a problem is that all shots aren't equal. Just for today, let's call this 'The Stoke Effect'.

I've had issues with Stoke in the model since the start. Their pass success rate isn't great and shot conversion rate isn't either, so in the simulation they were predicted to lose a lot more than they did in reality. There is something that Stoke do well though - they force the opposition into taking unsuccessful shots.

I can't impose shot rates on the model, because a big part of shot rates is already simulated through possession. What I can impose is an adjustment factor on the chances of a shot being a goal.
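To make that concrete, here's a minimal sketch of how an opponent-based adjustment factor could plug into the shot step of a simulation like this one. The team factors, numbers and function name are all invented for illustration, not taken from my actual code:

```python
import random

# Hypothetical per-team factors: how much an opponent suppresses (or inflates)
# the chance that a shot against them becomes a goal. 1.0 = league average.
OPPONENT_SHOT_FACTOR = {
    "Stoke": 0.80,      # opponents' shots convert less often than average
    "Newcastle": 1.15,  # opponents' shots convert more often than average
}

def shot_is_goal(base_conversion_rate, opponent, rng=random.random):
    """Decide whether a simulated shot becomes a goal, scaling the
    shooter's base conversion rate by the opponent's adjustment factor."""
    factor = OPPONENT_SHOT_FACTOR.get(opponent, 1.0)
    p_goal = min(1.0, base_conversion_rate * factor)
    return rng() < p_goal
```

Teams missing from the lookup just get a neutral factor of 1.0, so the adjustment only bites where the data says a side is unusual.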

Here are the adjustment factors:



At the moment, I'm not analysing tactics to find out why those numbers look the way they do, I'm just playing with outcomes, but if you're not fussy about pretty football then Stoke are definitely doing something right. I'm also starting to think, "Why don't Stoke get relegated?" might make a fun article for EPL Index, who provide all the stats for this model. Partly as a result of building this model, I'm becoming much more interested in how middling teams get results against the top four (six? Seven so that Liverpool are included? I think those are the rules), than in Man City's most effective attacking combination.

In terms of prediction, Man City should win most of the time and for the purposes of this model, I'm not all that fussed about how many goals they do it by, as long as the result's correct. What's really interesting is whether we can forecast when they won't win. Adjusting shot success helps a lot here, because it points towards those days when a team has plenty of the ball, but just can't score.

Still here? You could definitely be forgiven for skipping ahead to this bit. Here are this weekend's predictions:




You can see the side effect I was talking about right at the start of this post, in the Newcastle win chance. Newcastle would have been predicted to lose anyway, but they also have below average stats for how many shots their opponents convert, so they get penalised very heavily now.

I have no doubt this percentage is a bit low, but it's happening because I use average performance to predict the result of single games. What the model's saying, I think, is that if Newcastle don't change their tactics against Man City from what they normally do (maybe 'park the bus'), they're going to get hammered.

The Swansea prediction also sticks out this week. Bit peculiar, but it would be no fun if we agreed with the bookies every time. Let's see what happens.

Lots more work still to do, but it's good to be making progress... Two big jobs on the list next:

1. As this blog points out, I really need to run the model over more seasons and see how it gets on.
It's a bit tricky because I use a stat on '% of passes in opponent's half' to model attacking pressure and that stat's only been available on the EPL Index site since this season. Overall it doesn't add huge amounts to the predictions though, so I'm probably just going to turn that feature off and then run over the past 3-4 seasons.

2. The model knows nothing about form at the moment - players perform better or worse depending on their opponents, but their base stats are the same in every game. I'm eyeing this as potentially the next big improvement.

Small bets placed and I'm off to enjoy the British countryside. With live score updates on my phone, obviously!

Tuesday 26 March 2013

Truly inspirational. Who says it can't work?

I was pointed towards this blog post today, from Dilbert creator Scott Adams.

He's being deliberately provocative with this statement, but it's a good one:

"Management exists to minimize the problems created by its own hiring mistakes."

Or to put it another way, if you hire good people then they don't need managing.

Well, yes.

Easier said than done and very easily dismissed as a trite statement that's unachievable in reality. Of course staff need managing! To suggest that they don't is crazy, surely?

Except that the comments on that blog post link to a company that's done it. Valve software has a flat structure and everybody works on the thing that they, personally, think will add the most value to their company.

Valve isn't a here today, gone tomorrow, dotcom flash in the pan. It's worth billions and is more profitable per employee than Google or Apple.

Here's their employee handbook.


"But when you’re an entertainment company that’s spent the last decade going out of its way to recruit the most intelligent, innovative, talented people on Earth, telling them to sit at a desk and do what they’re told obliterates 99 percent of their value."

Superb.

Thursday 21 March 2013

Eight steps to building analytics that actually get used

Inspired by a couple of football-related posts around the topic of "what is analytics?", I've penned a few thoughts on a slightly different subject. Once you've decided what analytics is, how do you build something that will actually get used? It's often easier said than done... Many big ideas ultimately fail to deliver and I'm convinced there are just as many brilliant insights out there that nobody's paying any attention to.

I'm writing mostly from my experience in the marketing analytics world, but these would guide my approach pretty much anywhere. How do you build a piece of analytical work that ends up delivering something valuable, rather than sitting on a hard drive, gathering virtual dust?



Clearly identify your questions first


This is a useful process in itself. What, specifically, do you want to know?

Now make it even more specific; break your question into pieces. And then possibly more pieces.

What would the world look like if the answer to this individual piece was "Yes"?

Now you can start doing analysis.


Walk before you run

If you try to go from nowhere, to the answer to life the universe and everything in one step then you're almost certain to fail. If only because right now, what you think the ultimate answer should look like is very likely to be wrong and you need to let some groundwork shape your next steps.

Before you collect any new data, what could a good analyst achieve with the data that you have? By the way, if you ask and they say "nothing" then they're not a very good analyst. A good analyst isn't afraid to speculate based on limited data, but they should also be honest and tell you that's what they're doing.


Have an opinion

You've just done a month's statistical modelling work and you can't present it back as an academic might, by saying "the effect of X is probably Y and the effect of Y is probably Z." We analysts like to hedge our bets, but if you want your stuff to get used, you can't do that. You're going to have to put your neck on the line.

You have to have an opinion. As a result of your work, what should be done? State it and state why you believe it, then allow people to disagree based on the evidence.


You need management backup

You know why Moneyball worked? Apart from all the clever numbers, Billy Beane, the A's General Manager believed in analytics, put it front and centre of decision making and didn't seem to mind who he upset along the way.

An angry analyst stamping their foot and demanding to be listened to won't cut it. If you're trying to change an organisation, you need very senior backup and they need to trust their analysts.

Be careful what you wish for. Getting this bit right means that as an analyst you're going to feel some real pressure.


That management backup needs to have an open mind

Building on the previous point, it's no good having an evangelist forcing analytics into a company's decision making, if they're just using numbers as a battering ram, to push the strategy that they had anyway. We're back to trust again, to be earned by analysts and then respected by senior management.


Plain English Answers

As Albert Einstein said, "If you can't explain it to a six year old, you don't understand it yourself."

Dump the jargon and the statistics and find a plain English way to explain your results. Stories are good.


Never talk to IT until you've got a working prototype

I can't stress this enough. The only way to brief an IT development team to build you an analytics product is to say, "See this? It works. Please turn it into a robust product for me."

Any IT people reading, the comment section is below - feel free to let me have both barrels, but this is true for virtually every IT team I've ever come across. IT and analytics are often confused, because they share some skills, but at the beginning, you need to be sure you're talking to analysts, not programmers.


Analysis is a process

Treating pieces of analysis as individual questions that are paid for individually, answered and then put to rest is very rarely the best way to get results. Analyses answer some questions and raise some more. They guide decisions, but could always be built better the second time around. Sometimes, despite everyone's best efforts, a line of enquiry doesn't achieve all that much.

Treat starting a piece of analysis as starting a process of discovery, rather than paying for the answer to a specific question right now, and over a period of time you'll reap the benefits.


Have I missed anything? What's the key reason why a piece of analysis you've done is still in use, or what barrier killed a piece of work that should have been brilliant? I'd love to hear about it in the comments.

Friday 15 March 2013

Football Sim: Predictions for 16/17 March 2013

Here we go again...

No big preamble this time - if you're reading this then you probably know where these forecasts come from by now (I like calling them forecasts; it sounds more scientific than predictions). If you don't, have a read of these posts.

Here are the percentage chances for this week's games:



A few obviously stick out - Everton and Swansea to win are interesting, but I think both plausible (though maybe not at those percentages) and West Brom to beat Stoke, I'm not sure I trust because the model doesn't really seem to understand how Stoke get as many points as they do. Reasons for that are probably fairly obvious and to do with Stoke's style of play, but I haven't tried to deal with it yet.

And my betting choices:

Swansea v Arsenal - Home Win
Villa v QPR - Draw
Chelsea v West Ham - Home Win
Everton v Man City - Home Win
Tottenham v Fulham - Home Win
Southampton v Liverpool - Away Win
Man U v Reading - Home Win
Wigan v Newcastle - Draw
Stoke v West Brom - Away Win
Sunderland v Norwich - Draw

On the whole, the model's been running at around 50% of results called correctly and slowly making money through betting simple singles at the same stake on each prediction. Every time I try to mess about with that basic approach, it does worse, so I'm sticking to it for now.
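The bookkeeping behind that simple-singles approach is trivial to replicate - a sketch, with a made-up function name and odds purely for illustration:

```python
def singles_pnl(bets, stake=10.0):
    """Profit/loss from level-stakes simple singles.
    `bets` is a list of (decimal_odds, won) tuples: a winning bet
    returns stake * (odds - 1), a losing bet costs the stake."""
    pnl = 0.0
    for odds, won in bets:
        pnl += stake * (odds - 1.0) if won else -stake
    return pnl
```

The appeal of this over accumulators or variable staking is that one bad call only ever costs one stake, which makes the model's hit rate easy to read straight off the bank balance.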

I did make some improvements to the model this week, but overall they've made only a very small difference to accuracy, which was a bit disappointing. I may go for incorporating form next, rather than always using each player's average stats across the whole season, but it's going to be a busy week at work so that may have to wait...

Same as last week, I might also tweet an update when the starting line-ups are confirmed - @data_monkey if you're interested. Although last week, the original predictions did marginally better than the update, due to Holt's penalty miss. Maybe go with these as they are!

Monday 11 March 2013

Football Sim: 9/10 March performance review

The four fixtures played on Saturday seem to have been acknowledged by everybody on my Twitter feed as 'a bit tricky'. I placed small bets on each game again, not really to gamble but because it keeps me honest. If I'm forced to actually pick a result, then there's no making excuses afterwards and claiming the model did ok when it was poor, or that I'd really have gone for a draw in the end, not a home win. When you're only looking at percentage chances of a result, it's easy to convince yourself that some of the bad forecasts were random chance.

From Saturday's forecasts, I'm fairly happy with the percentages that the model turned out, with the exception of QPR. The bookies said QPR would win. Other models said it was close. My model said Sunderland would win.

When Sunderland scored first I was feeling pretty smug, but then QPR got three and that was that. I've only seen the highlights but it looked like QPR deserved it, so I'm going to do some diagnostics on that one and work out what went wrong.

I changed my mind on the predictions at the last minute, after re-running the sim with confirmed starting line-ups. This led to a switch from draws in Norwich v Southampton and West Brom v Swansea, to picking Norwich and Swansea wins for betting purposes (N.b. even though I picked Swansea for a bet, West Brom had the highest chance of winning that game.)

Overall, I had on Saturday:

Reading v Villa - Away Win - Won
QPR v Sunderland - Away Win - Lost
Norwich v Southampton - Home Win - Lost
West Brom v Swansea - Away Win - Lost

And on Sunday:

Liverpool v Tottenham - Home Win - Won
Newcastle v Stoke - Home Win - Won

50% correct, which is pretty much bang on the model's average. We were a missed penalty away from getting Norwich right too.

Betting with £10 stakes again, you'd be £3.30 up. Hardly setting the world on fire but a win's a win and that's wins two weeks in a row (four actually, but I didn't blog the first two so they don't count).

I'm going to spend some time with shot calibration this week, if I can find the time, and will try to post about that at some point. Goalscoring in the model is hard to get right, because there aren't that many historical goals per player to set the conversion chances. We see hundreds of passes every game, but only a couple of goals and the model is quite sensitive to the settings for each player's chance of a shot being a goal. I'm pretty sure there are some performance gains to be had here.
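On the small-sample problem, one standard trick is to shrink each player's conversion rate towards the league average - I'm not saying that's what the model does, but as an illustration:

```python
def shrunk_conversion_rate(goals, shots, league_rate, prior_shots=50):
    """Empirical-Bayes-style shrinkage: treat the league average as if it
    came from `prior_shots` pseudo-shots. Players with few real shots
    stay close to the league rate; big samples dominate the prior."""
    return (goals + league_rate * prior_shots) / (shots + prior_shots)
```

A striker with 10 goals from 50 shots against a 10% league rate would be credited with 15% rather than his raw 20%, which stops a lucky half-season from running away with the simulation. The `prior_shots` weight is an invented tuning knob.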

Saturday 9 March 2013

Football Sim: Predictions for 9/10 March 2013

Looks like potentially a tricky set of fixtures to call this week as many are close to 'standard' home/away/draw percentages. If I get the chance, I'll re-post before 3.00 using the actual starting line-ups to simulate, but for now I've taken them from Fantasy Football Scout as usual.


We're fairly close to the market odds everywhere, with the exception of Sunderland, where my model is predicting an away win against QPR. Harry dropped Taarabt last week and Zamora may be coming back from injury so the QPR line-up is a little in doubt. Let's see if we can squeeze in that rerun of the model before kick off.

For betting, we've got some fairly clear over-indexes to pick (see last week for an explanation) and a couple of games where it's a toss-up between an away win and a draw.

Clear ones first:

Reading v Villa - Away Win
Liverpool v Tottenham - Home Win
Newcastle v Stoke - Home Win
QPR v Sunderland - Away Win

And the close calls:

Norwich v Southampton
West Brom v Swansea

I'm going for draws in those two, because it's an incredibly close call between draws and away wins and we'll get better odds on draws.

Neck stuck out. See you later!

Tuesday 5 March 2013

Football Sim: 2/3/4 March performance review

We let the football sim loose properly this weekend, with percentage chance forecasts for each game and I stuck my neck out with betting choices too. How did it go?

Overall, seven out of ten. Literally! Seven out of ten results called correctly, which isn't too bad at all.

Admittedly it wasn't the most difficult week to predict, with the top sides mostly playing teams that they were expected to beat and those games running to form, but still, you can only beat what's in front of you. As the pundit said to the viewer.

Link to last week's percentage chances and predictions if you want to take a look.

If you want to use the model for betting - and Twitter is telling me that some of you definitely do! - then I was trialling a new method last week. The method picks results that are 'over-predicted', so if 25% of games in the model are usually a draw and the model's prediction for a single game is a 30% chance of a draw, then there's a good chance that's what we'll be betting on.

What this method seems to do is steer the model towards predictions that aren't already priced into bookies odds. There's no point in betting on home wins unless they're over-predicted, because the odds are rubbish and you'll just slowly lose money. I've simulated betting just on the highest chance result and the latest version of the model would end a little down across the whole season so far. This new method, as well as allowing us to bet on draws (read last week's post for an explanation), also swings us towards betting on slightly more unlikely events, but ones that have better odds.
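As a rough sketch of how an over-prediction pick like this might work - the baseline percentages and the margin threshold here are invented for illustration, not the model's real numbers:

```python
# Illustrative long-run averages for the model's predictions.
BASELINE = {"home": 0.46, "draw": 0.25, "away": 0.29}

def pick_bet(predicted, margin=0.05):
    """Return the outcome whose predicted probability most exceeds its
    baseline rate by at least `margin`, or None if nothing over-indexes.
    `predicted` maps 'home'/'draw'/'away' to model probabilities."""
    best, best_edge = None, margin
    for outcome, p in predicted.items():
        edge = p - BASELINE[outcome]
        if edge >= best_edge:
            best, best_edge = outcome, edge
    return best
```

Note that this can pick a draw even when the draw isn't the most likely single outcome - which is exactly the point.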

For betting purposes, I picked (not bothering to chase best odds)...


Man City's win last night completed the fixtures and with a £10 stake on each game (£100 total), you'd be up £26.30.

I've said before that the model overcooks Arsenal. It's very frustrating... although possibly more so if you're an Arsenal fan. The simulation is based to a large extent on passing and attack - it doesn't do glaring defensive errors yet. Maybe an indication that if Wenger could sort the Arsenal back four out, they'd be doing an awful lot better. Anybody reading who didn't already know that? Let's move on.

The other two results the model missed were predicted draws at Southampton and Stoke. Both missed by just one goal. For betting purposes, I'm happy with that because if you can call roughly one draw in three, you'll break even and this week, the correct call on Sunderland's draw covered the other two, plus a little extra.
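The draw arithmetic, for anyone who wants to check it (the odds are illustrative - draws are typically priced somewhere around 3.0 to 3.4):

```python
def draws_breakeven_pnl(stake=10.0, draw_odds=3.0):
    """P&L over three level-stakes draw bets if you call one in three:
    one winning return of stake * (odds - 1) against two losing stakes."""
    return stake * (draw_odds - 1.0) - 2 * stake
```

At odds of exactly 3.0 a one-in-three strike rate breaks even, and anything better than that, either in odds or in hit rate, is profit.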

Finally, here's something I'm quite excited about...

What would have happened if you'd followed this £10 betting pattern all season so far? Starting with 100 quid in the bank, you'd now have £380, not including this week's win. If you're thinking it looks slightly odd, this chart shows games being played one at a time, not in weekend blocks of ten.



Blimey.

I'm still not building this primarily as a betting model, but well, it's kind of hard to resist when the chart looks like that.

A couple of (potentially big) caveats in the interests of honesty. I'm still running the full season sim using player stats for the whole of this season. This means the model knows about things like Michu's, Bale's and Van Persie's goalscoring form before the bookies did. That's not strictly fair but from now on, I'm calling results before the event.

I also don't know when the odds that I have (from www.football-data.co.uk/englandm.php) were scraped. The model knows actual starting line-ups for each game and those odds may be from the day before kick-off, when injury niggles etc. were still in doubt.

More this weekend. Let's see what happens...

Monday 4 March 2013

Big Data madness and my football prediction model

If you read Wallpapering Fog on a regular basis, you'll know I'm a Big Data sceptic. I've written about its limitations before, here and here for starters.

The main problems seem to stem from the fact that very often, what non-specialists think is 'Big Data' isn't actually big at all. Your data might not fit easily in an Excel spreadsheet, but that doesn't make it big. Big Data doesn't fit on your laptop.

And the second problem - the really big one - is that having loads and loads of data doesn't usually help very much. I was reminded about that this morning with the news that Netflix has come over all Big Data. What tends to happen (as in the Guardian regional poverty example linked above) is that analysts spend ages processing loads of data and end up with an answer they could have reached much earlier, via a much simpler method.

I had a question recently regarding my football prediction model, as to whether I could use more detailed data. The answer? Maybe, but not yet. I'm nowhere near wringing everything out of Opta's top-line player performance stats, and hugely detailed game event data feeds would very likely bog the analysis down.

My football model is agent-based and runs as a computer simulation, so it can have unpredictable outcomes. It's now as good as the bookies at predicting the results of football games, and it does that with only relatively top-line data as inputs.

For the record, even with all the different things that happen in a football game, you can still fairly effectively predict results (and win at the bookies for the last three weeks in a row) using only...

Pass completion rates
Goalscoring rates
Player dispossession rates
A measure of how good the opposition are at winning the ball back

And essentially, that's it.
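To show just how little that is, here's a toy possession chain using only those four inputs - a loose illustration of the idea rather than my model's actual code, with all the parameter names and rates invented:

```python
import random

def simulate_possession(pass_rate, dispossess_rate, opp_ball_winning,
                        shot_rate, goal_rate, rng=random.random):
    """Toy possession chain built from only the four stats listed above:
    on each touch the player may be dispossessed (helped along by the
    opposition's ball-winning ability), may shoot, or attempts a pass.
    Returns 1 if the possession ends in a goal, 0 otherwise."""
    while True:
        if rng() < dispossess_rate + opp_ball_winning:
            return 0                      # possession lost
        if rng() < shot_rate:
            return 1 if rng() < goal_rate else 0
        if rng() >= pass_rate:
            return 0                      # pass failed, possession over
```

Run a few thousand of these per team per match and goal counts fall out of pass completion, dispossession and conversion rates, with no event-level data anywhere in sight.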

Of course the model isn't perfect and there are tons of improvements to be made, but the crucial point is that if I'd started with Opta's event-level statistics, I'd be nowhere. I'd probably still be trying to pull that feed into a useful database and understand any underlying relationships in the data at all.

I've trained as an economist and more precisely an econometrician, so my instinct is to try to simplify problems. To build the 80% accurate answer, before you go for the 95%, because if you dive straight into complexity, you'll fail. People forget that even Google, the poster child for Big Data, started with a much simpler algorithm than they have now.

Management will learn through experience eventually, but at the moment, (largely) IT staff who are capable of assembling huge amounts of data are promising nirvana, and business owners are listening. Very few companies have any idea what they're going to do with all this information. An unspecific goal of "data mine it" is a business case that should never get past its first review.

This drive to collect and process massive amounts of data, by businesses that don't understand their simple data yet, is madness. Hugely expensive madness.

Friday 1 March 2013

Football Sim: Predictions for 2/3/4 March 2013

Here we go. Actual predictions, before the event this time! Radical idea I know.

Should ensure the model's performance takes a nose-dive...


As usual, if you'd like to know how this is all done, click here and scroll through some past posts.

I get team line-ups from Fantasy Football Scout.

EPL Index provides all the player performance data.

Footballdata.co.uk provides past betting odds and game results.


I'm excited this week because I've managed to improve prediction accuracy in the model after a few false starts where I thought I had, but it turned out to be programming bugs and/or random errors.

The model's now learned about opposition strength and it will modify the expected performance for each player's passing accuracy and number of touches, based on who they're playing against. It means for example that teams will find it much easier to pass the ball to their strikers and create scoring chances against QPR, than against Spurs.

Exactly how it does that I'll save for another day, but it's got regression models in it and I'm quite proud of the way it predicts! It's taken some serious effort to get working.
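While the real regression stays under wraps, the general shape of an opposition-strength adjustment might look something like this - the linear form and the coefficient are stand-ins for whatever the fitted model actually is:

```python
def adjusted_pass_accuracy(base_accuracy, opponent_defence_rating,
                           league_avg_defence=1.0, beta=0.08):
    """Scale a player's season-average pass accuracy by how much tougher
    (or easier) the opponent's defence is than the league average.
    `beta` is an invented sensitivity coefficient; the result is clamped
    to a valid probability."""
    adjustment = beta * (league_avg_defence - opponent_defence_rating)
    return max(0.0, min(1.0, base_accuracy + adjustment))
```

So a player who completes 80% of passes on average drops a few points against a strong defence and gains a few against a weak one - which is what lets strikers see more of the ball against QPR than against Spurs.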

This is what the improvements have done to our cumulative prediction accuracy vs. the bookmakers over the season so far - how often the model's favourite wins vs. how often the bookmakers' favourite does. Green means the model was winning at that point in the season.


It's not a huge leap in performance, but we're fighting for every fraction of a percentage point improvement from here.
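For reference, the running comparison behind a chart like that is easy to compute - a sketch, with a hypothetical function name:

```python
def cumulative_hit_rates(games):
    """For each game, given (model_pick_correct, bookie_pick_correct) as
    0/1 flags, return the running difference in hit rate. Positive means
    the model is ahead ('green') at that point in the season."""
    model_hits = bookie_hits = 0
    diffs = []
    for i, (model_ok, bookie_ok) in enumerate(games, start=1):
        model_hits += model_ok
        bookie_hits += bookie_ok
        diffs.append((model_hits - bookie_hits) / i)
    return diffs
```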

Enough methodology, here are this week's predictions.


Am I going to have a bet this week? Damn right I am. I've got a system. (This can only go well.)

Actually, running across the season so far, this principle would have worked extremely well. I'm going to pick results where the model thinks an event is more likely than usual. E.g. the model averages 23% of matches ending in a draw (bit low that, as I've mentioned before), so we'll bet on a draw if the draw prediction for a single game is significantly higher than 23%.

This has the immediate advantage that it allows us to bet on draws. As I've mentioned before, a draw is almost never the most likely single game outcome, but in total, 25-30% of games end in a draw. If you never bet on a draw, you'll definitely lose over a quarter of the time before you even start.

Overall, this method will call slightly fewer results right than just betting on the model's favourite. But. And it's a big But. It means we call far more long-shots correctly.

For this weekend's games, the new method gives...

Tottenham v Arsenal - Away win.
Villa v Man City - Away win
Chelsea v West Brom - Home win
Everton v Reading - Home win
Sunderland v Fulham - Draw
Wigan v Liverpool - Away win
Man U v Norwich - Home win
Swansea v Newcastle - Home win
Southampton v QPR - Draw
Stoke v West Ham - Draw

And we'll see how that lot turn out next week.