Monday, 15 October 2012

Using Scoring Records To Predict Future Performance.

In a conversation last week with Simon Gleave of fame, I was reminded of the pivotal role that the Poisson distribution has played in explaining a team's win, loss and draw tally from their goal scoring record, and in predicting their future results from their presumed goal expectancy in those subsequent games. As a retrospective tool the Poisson is unspectacular, but in a predictive role its simplicity of use and the light it sheds on the nature of a team's talent and ability are unsurpassed.

There is no shortage of football based posts which describe the actual use and limitations of the Poisson, so I will direct readers to such posts as this one. But for brevity, the Poisson allows you to model the likelihood of any number of discrete events occurring given that we know the average rate at which these events are likely to occur.

So if we think a team is going to average 1.4 goals per 90 minutes against a particular defence we can estimate the probabilities of that team scoring exactly 0, 1, 2, 3 and so on goals. If we repeat the process for their matchday rivals, this allows us to move onto the prediction of assorted scorelines and ultimately game results.
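As a rough illustration of the two steps just described, here is a minimal Python sketch. The 1.4 goal expectancy comes from the example above; the 1.1 figure for the opposing side, the function names and the assumption that the two teams' scores are independent are mine, chosen purely for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k goals when the expected number is `rate`."""
    return exp(-rate) * rate ** k / factorial(k)


def match_probabilities(home_rate, away_rate, max_goals=10):
    """Return (home win, draw, away win) probabilities.

    Assumes each side's goals follow an independent Poisson distribution
    and ignores scorelines with more than `max_goals` for either team.
    """
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Chance of the 1.4 goal team scoring exactly 0, 1, 2, 3 and 4 goals.
goal_probs = [poisson_pmf(k, 1.4) for k in range(5)]

# Hypothetical opponent expected to score 1.1 goals.
home_win, draw, away_win = match_probabilities(1.4, 1.1)

for k, p in enumerate(goal_probs):
    print(f"P({k} goals) = {p:.3f}")
print(f"home {home_win:.3f}, draw {draw:.3f}, away {away_win:.3f}")
```

With an expectancy of 1.4, the chance of a blank is exp(-1.4), or a little under 25%. Summing the scoreline grid above and below the diagonal is all that is needed to turn two goal expectancies into win, draw and loss chances.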

The mathematical steps involved in producing goal probabilities are fairly straightforward, but it is the calculation of each team's future expected goal scoring and conceding rates where much of the hard work lies. Manchester United, as next Saturday will probably confirm, score goals at a much higher rate than Stoke City. Before we can confidently begin to estimate the rate at which United will score against Stoke, we need to incorporate such obvious variables as United's scoring rate over a period of games and Stoke's rate of conceding, together with less obvious figures such as the general rate at which home teams outscore away sides and the level of goalscoring typically seen within the Premiership.
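One common way of combining these ingredients is multiplicative: express the attacking side's scoring rate and the defending side's conceding rate relative to the league average, multiply them back onto that average and then scale for venue. The sketch below assumes that convention, which is not necessarily the exact method used for this post, and every number in it is hypothetical.

```python
def expected_goals(team_scored_avg, opp_conceded_avg, league_avg, venue_factor=1.0):
    """Estimate a team's goal expectancy against a particular opponent.

    team_scored_avg  -- goals per game the team has scored
    opp_conceded_avg -- goals per game the opponent has conceded
    league_avg       -- league-wide goals per team per game
    venue_factor     -- above 1 for a home side, below 1 for an away side
    All four inputs are illustrative assumptions supplied by the user.
    """
    attack_strength = team_scored_avg / league_avg
    defence_weakness = opp_conceded_avg / league_avg
    return attack_strength * defence_weakness * league_avg * venue_factor


# Hypothetical figures: a home side scoring 2.1 a game meets a defence
# conceding a league-average 1.4, with a 15% boost for playing at home.
rate = expected_goals(2.1, 1.4, league_avg=1.4, venue_factor=1.15)
print(round(rate, 3))
```

The resulting rate is what would be fed into the Poisson calculation in place of the 1.4 goals used in the earlier example.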

By combining such rates for each team we can begin to estimate the likelihood of such match outcomes as a United victory and other, much less likely occurrences, such as a Stoke City away win. Which rates we use for each team will depend on how we implement adjustments such as the weight we give to more recent matches, how severely we regress our final figures towards the mean of our choice and, in the case of the Newcastle-Sunderland game, whether or not the game is a derby match. (Such matches tend to be lower scoring and more evenly matched.)
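Two of these adjustments, recency weighting and regression towards the mean, can be sketched in a few lines of Python. The exponential decay, the `prior_games` parameter and the sample figures are assumptions made for illustration rather than the settings used for this post.

```python
def weighted_rate(goals, decay=0.95):
    """Average goals per game, giving more weight to recent matches.

    `goals` is listed oldest first. The latest match gets weight 1 and
    each earlier match is discounted by a further factor of `decay`.
    """
    n = len(goals)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * g for w, g in zip(weights, goals)) / sum(weights)


def regress_to_mean(rate, n_games, league_mean, prior_games=20):
    """Shrink an observed rate towards the league mean.

    Behaves as if `prior_games` extra matches had been played at exactly
    the league mean, so small samples are pulled in more strongly.
    """
    return (rate * n_games + league_mean * prior_games) / (n_games + prior_games)


# Hypothetical seven-game scoring record, oldest match first.
recent = [0, 1, 3, 2, 1, 4, 2]
raw = weighted_rate(recent)
print(round(raw, 2), round(regress_to_mean(raw, len(recent), league_mean=1.4), 2))
```

A decay of 1 reproduces the plain average, and a larger `prior_games` regresses more severely.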

But the most influential choice we will need to make is how many matches we use to derive our average team scoring rates. In the table below I've calculated the chance of each EPL team winning their game over the coming weekend using a Poisson-based approach, with the scoring averages of teams calculated over the previous 32 home and away games, the previous 20, and just the 7 games played so far this season.

The Win% Chances For EPL Sides On Saturday Using A Poisson Calculation & Expected Scoring Rates Over Differing Timescales.

Team          Using Last   Using Last   Using Last   "True"
              32 Games     20 Games     7 Games      Odds
Man Utd           75           70           54         76
Stoke              8            9           18          8
Fulham            60           71           80         53
A Villa           16            9            7         21
Norwich           15            9            4         15
Arsenal           65           75           86         62
QPR               19           15            8         29
Everton           57           66           79         43
Sunderland        45           26           36         38
Newcastle         28           44           30         32
Swansea           54           42           72         48
Wigan             21           32           12         24
Spurs             44           32           21         37
Chelsea           29           41           52         34
WBA               22           23           40         19
Man City          52           52           34         57

The team-specific inputs used for Saturday's matches are merely the averages of the goals scored and conceded by each team over the three different timescales. The purpose of the exercise is to see which timescale produces the most reliable estimate of pre-game win and loss probabilities. In short, is it better to use more recent but smaller samples of a team's attacking and defensive ability, or is more information contained in older but larger samples?

I've used the bookmakers' win odds for each team as the "true" odds comparison. These figures are a great untapped source of information. Whatever your views on gambling, the odds presented by bookmakers, once the overround has been stripped from the prices, give you an incredibly accurate estimate of the true odds of an event occurring. The odds compilers have access to large amounts of data, long experience of setting prices and a huge vested interest in producing accurate odds. Also, we are not at a stage of the season when prices are particularly skewed by expected weight of money. Last season, Bolton's final "must win" visit to Stoke saw the relegation-threatened side priced up as having around a 33% chance of winning, even though the evidence suggested that 22% was a more realistic estimate.
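For anyone wanting to repeat the exercise, stripping the overround can be done in several ways. The simplest is to convert each price to an implied probability and rescale so the three outcomes sum to 100%. The sketch below assumes that proportional approach and uses made-up decimal odds; it is not necessarily how the "true" odds in the table were derived.

```python
def strip_overround(decimal_odds):
    """Convert bookmaker decimal odds into probabilities that sum to 1.

    Each price implies a probability of 1 / odds. These add up to more
    than 1 (the excess is the overround), so each is divided by the total.
    Proportional rescaling is an assumption; other methods exist.
    """
    implied = [1.0 / o for o in decimal_odds]
    book = sum(implied)
    return [p / book for p in implied]


# Hypothetical prices for home win, draw and away win.
prices = [1.25, 6.0, 11.0]
fair = strip_overround(prices)
print([round(p, 3) for p in fair])
```

Here the raw implied probabilities add up to about 105.8%, so the home side's apparent 80% chance falls to roughly 75.6% once the margin is removed.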

Using readily available bookmaking odds as a benchmark does away with the need to produce the copious numbers of estimates that are otherwise required to evaluate a model's effectiveness by comparing predictions against actual outcomes. A fifth of the way into the season, if your win estimate varies greatly from the general bookmaking consensus, then it is almost certainly your model of events that has the flaws.

 Swansea and Stoke are likely to experience very different results on Saturday.

Our naive model has no whistles and bells, but the results are overwhelmingly in favour of taking into account as much goalscoring data as possible, even at the expense of recency. Win predictions produced via the Poisson process using data going back 32 games were closest or joint closest to the bookmakers' estimate in 13 of the 16 cases. They are shown in blue. (I've omitted the matches involving promoted sides, because they require separate adjustments based around the promoted sides' scoring and conceding records from the Championship.)

The 20 game estimates were top or joint top for three of the 16 teams, and the most recent data from the seven games so far this season won out for just Newcastle and Sunderland. Theirs is a local derby, where depressed scoring and home advantage are always factored into prices, so even this meagre victory for recent form alone was a hollow one.
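The tallies above can be reproduced directly from the table. The short Python check below counts, for each timescale, how often its estimate was closest or joint closest to the bookmakers' figure, and also works out the average gap in percentage points. The average gap is an extra summary I've added here, not one used in the comparison above.

```python
# (last 32, last 20, last 7, "true" odds) win percentages from the table.
table = {
    "Man Utd": (75, 70, 54, 76), "Stoke": (8, 9, 18, 8),
    "Fulham": (60, 71, 80, 53), "A Villa": (16, 9, 7, 21),
    "Norwich": (15, 9, 4, 15), "Arsenal": (65, 75, 86, 62),
    "QPR": (19, 15, 8, 29), "Everton": (57, 66, 79, 43),
    "Sunderland": (45, 26, 36, 38), "Newcastle": (28, 44, 30, 32),
    "Swansea": (54, 42, 72, 48), "Wigan": (21, 32, 12, 24),
    "Spurs": (44, 32, 21, 37), "Chelsea": (29, 41, 52, 34),
    "WBA": (22, 23, 40, 19), "Man City": (52, 52, 34, 57),
}
labels = ("32 games", "20 games", "7 games")

closest = dict.fromkeys(labels, 0)    # times closest or joint closest
total_gap = dict.fromkeys(labels, 0)  # summed absolute differences

for team, (*estimates, benchmark) in table.items():
    gaps = [abs(e - benchmark) for e in estimates]
    best = min(gaps)
    for label, gap in zip(labels, gaps):
        total_gap[label] += gap
        if gap == best:
            closest[label] += 1  # joint winners are each credited

mean_gap = {label: total_gap[label] / len(table) for label in labels}
print(closest)
print(mean_gap)
```

The counts come out at 13, 3 and 2, with joint winners credited to both timescales, and the average gaps are 5.0, 9.5 and about 17.7 percentage points for the 32, 20 and 7 game samples respectively.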

So of the three choices, the scoring records over the previous 32 games trounce the two alternatives as the best predictor of future performance. Eight matches comprising 16 teams isn't a huge test sample, but the findings are confirmed over many more matches and multiple leagues and seasons.

Teams can often produce short-term bursts of atypical results in a sport where team scoring very rarely averages much more than two goals a game. More is certainly better if you choose to evaluate teams by goals scored and allowed, even if that means going back to matches played early in a previous season.

1 comment:

1. I agree with your analysis here. My own work in this area suggests 40 and above is the most reliable number of matches to work with, which is in keeping with your data.

However this is only part of the story. Predicting which team is most likely to match the bookies' odds isn't by itself going to make any money. After all, if you always back the favourite you always, eventually, lose everything.