by the exponential distribution. If we have a non-negative random variable X that is the time until the next occurrence in a Poisson process, then X follows an exponential distribution with probability density function
$$f_X(x) = \lambda e^{-\lambda x} = \frac{1}{\beta} e^{-\frac{1}{\beta} x}; \quad x \ge 0, \tag{3.3}$$
where λ represents the average rate of occurrence and β is the average time between occurrences. The mean and variance of an exponentially distributed random variable X are
$$\mu_X = \frac{1}{\lambda} = \beta \quad \text{and} \quad \sigma_X^2 = \frac{1}{\lambda^2} = \beta^2. \tag{3.4}$$
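As a quick numerical check of (3.4), the sketch below draws exponential waiting times and compares the sample mean and variance with β and β². The value β = 30 minutes and the sample size are illustrative assumptions, not figures from the data.

```python
import random
import statistics

rng = random.Random(0)

beta = 30.0        # hypothetical mean waiting time between events, in minutes
lam = 1.0 / beta   # corresponding rate of occurrence (events per minute)

# random.expovariate takes the rate lambda, not the mean beta
waits = [rng.expovariate(lam) for _ in range(200_000)]

sample_mean = statistics.fmean(waits)
sample_var = statistics.pvariance(waits)

print(f"sample mean     {sample_mean:8.2f}   (theory: beta   = {beta:.2f})")
print(f"sample variance {sample_var:8.2f}   (theory: beta^2 = {beta ** 2:.2f})")
```

With a sample this large, both quantities should land close to their theoretical values, though the exact figures depend on the seed.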
Furthermore, there is a connection between the Poisson process and another well-known probability distribution, the continuous uniform distribution [1]. Given that a Poisson process produces a fixed, finite number of events in a time interval, the unordered times (or locations) at which those events occur are independently and uniformly distributed on that interval. The continuous uniform distribution has equally likely outcomes, meaning that its probability density is the same at each point in an interval [A, B]. A continuous random variable X is uniformly distributed on [A, B] if its probability density function is defined by
$$f_X(x) = \frac{1}{B - A}; \quad A \le x \le B. \tag{3.5}$$
In addition, X has mean and variance
$$\mu_X = \frac{A + B}{2} \quad \text{and} \quad \sigma_X^2 = \frac{(B - A)^2}{12}. \tag{3.6}$$
We postulate that goal scoring in football can be modeled by a Poisson process. According to the characteristics described above, if goal scoring for a club happens at a certain rate in a given time period, then a Poisson distribution can be used to model the number of goals scored. Additionally, the waiting time (in minutes) between successive goals can be described using an exponential distribution. Moreover, the time positions (or “minute marks”) in a game at which scoring events transpire may be uniformly distributed. We will explore these relationships in more detail in Section 4.
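The three relationships can be illustrated with a small simulation. The sketch below generates goal times for one club by accumulating exponential gaps over a 90-minute match, then checks that goals per match behave like a Poisson count (mean equal to variance) and that the pooled minute marks match the uniform moments in (3.6). The rate of 1.5 goals per match, the 90-minute match length without stoppage time, and the helper name `goal_times` are assumptions made for illustration only.

```python
import math
import random
import statistics


def goal_times(rate_per_min, match_length, rng):
    """Event times of a homogeneous Poisson process on [0, match_length],
    built by accumulating exponential waiting times."""
    times = []
    t = rng.expovariate(rate_per_min)
    while t <= match_length:
        times.append(t)
        t += rng.expovariate(rate_per_min)
    return times


rng = random.Random(1)
MATCH_LENGTH = 90.0           # minutes; stoppage time ignored (assumption)
GOALS_PER_MATCH = 1.5         # hypothetical average scoring rate
RATE = GOALS_PER_MATCH / MATCH_LENGTH

matches = [goal_times(RATE, MATCH_LENGTH, rng) for _ in range(20_000)]
counts = [len(m) for m in matches]
all_times = [t for m in matches for t in m]

# Poisson count: mean and variance should both be near 1.5
mean_count = statistics.fmean(counts)
var_count = statistics.pvariance(counts)
# P(no goals) under a Poisson distribution is exp(-1.5)
share_scoreless = counts.count(0) / len(counts)
# Uniform minute marks on [0, 90]: mean 45, variance 90^2 / 12 = 675
mean_time = statistics.fmean(all_times)
var_time = statistics.pvariance(all_times)

print(f"goals per match: mean {mean_count:.3f}, variance {var_count:.3f}")
print(f"scoreless share {share_scoreless:.3f} vs exp(-1.5) = {math.exp(-1.5):.3f}")
print(f"minute marks: mean {mean_time:.2f}, variance {var_time:.1f}")
```

Real scoring data need not satisfy these idealized properties, which is why the text treats them as a postulate to be examined.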
3.2 Simulating and Predicting Season Outcomes Using Poisson Regression
The second goal of this research is to use Poisson regression to predict the outcomes of EPL matches. Poisson regression is a member of a broad class of models known as generalized linear models (GLMs) [5]. A generalized linear model has the general form
$$E(Y_i) = \mu_i = g^{-1}(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}). \tag{3.7}$$
There are three main components to a generalized linear model:
1. A random component, indicating the conditional distribution of the response variable Y_i (for the ith of n independently sampled observations), given the values of the explanatory variables. The distribution of Y_i must be a member of an exponential family, such as Gaussian, Binomial, Poisson, or Gamma.
2. A linear predictor (β_0 + β_1 X_1 + β_2 X_2 + ··· + β_k X_k), which is a linear combination of the predictors (the X’s), with the β’s as the regression coefficients to be estimated.
3. A canonical link function g(·), which transforms the expected value of the response variable, E(Y_i) = μ_i, to the linear predictor.
Poisson regression models are generalized linear models with the natural logarithm as the link function. They are used when the response is a count, which is appropriate for our case since the response variable is the number of goals scored. The model assumes that the observed outcome variable follows a Poisson distribution and fits the logarithm of its mean parameter as a linear function of the explanatory variables. The general form of a Poisson regression model is
$$\ln(\mu_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}. \tag{3.8}$$
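To make (3.8) concrete, the following sketch fits a Poisson regression with a single binary predictor by Newton-Raphson on the log-likelihood. It is an illustration under stated assumptions rather than the procedure used in this study: the goal counts are made up, the predictor is a hypothetical home indicator, and the function name `fit_poisson_regression` is introduced here for the example.

```python
import math


def fit_poisson_regression(x, y, iters=25):
    """Fit ln(mu_i) = b0 + b1 * x_i by Newton-Raphson."""
    b0 = math.log(sum(y) / len(y))   # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the Poisson log-likelihood
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix (2 x 2, symmetric)
        i00 = sum(mu)
        i01 = sum(mi * xi for xi, mi in zip(x, mu))
        i11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + I^{-1} * score
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (-i01 * g0 + i00 * g1) / det
    return b0, b1


# Made-up goal counts: six home matches followed by six away matches
home_goals = [2, 1, 3, 0, 2, 1]
away_goals = [1, 0, 1, 2, 0, 0]
x = [1] * len(home_goals) + [0] * len(away_goals)   # 1 = home, 0 = away
y = home_goals + away_goals

b0, b1 = fit_poisson_regression(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"fitted away rate exp(b0)      = {math.exp(b0):.4f}")
print(f"fitted home rate exp(b0 + b1) = {math.exp(b0 + b1):.4f}")
```

With one indicator variable the fitted rates reproduce the two group means (1.5 goals at home and about 0.667 away in this toy data), which gives a convenient check on the fit.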
To make predictions for Premier League matches and to determine what would happen in the 2018–19 season using Poisson regression, we fitted two models to obtain the scoring rates for every EPL team: (1) at home, and (2) away from home. We are interested in evaluating the model equation at different values of the explanatory variables. Since the link function for Poisson regression is the natural logarithm, we back-transform the linear predictor with the exponential function. This gives the home and away mean (expected) scoring rates for every EPL club, aggregated across all opponents.
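The back-transformation can be sketched as follows, assuming a model whose only explanatory variable is the club, coded against a baseline. The coefficient values and club names are hypothetical and are not estimates from the fitted models.

```python
import math

# Hypothetical coefficients on the log scale (not estimates from this study)
intercept = 0.30                  # ln of the baseline club's mean goals per match
club_effects = {
    "Club A": 0.00,               # baseline club, so its effect is zero
    "Club B": 0.25,
    "Club C": -0.40,
}

# Inverting the log link: mu = exp(linear predictor)
scoring_rates = {
    club: math.exp(intercept + effect) for club, effect in club_effects.items()
}

for club, rate in scoring_rates.items():
    print(f"{club}: expected goals per match = {rate:.3f}")
```

Because the link is logarithmic, an effect of 0.25 multiplies the baseline rate by exp(0.25), roughly 1.28, rather than adding 0.25 goals.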
After that, we ran a large number of simulations to generate hypothetical 2018–19 season results, and then analyzed and compared the results for each of the three subsets of season data mentioned in the previous section. For each subset of data, we performed 10,000 simulations. In each simulation, the final score of every team matchup was generated at random from the clubs’ average scoring rates obtained from the fitted Poisson regression models, which yields a random integer for each team’s number of goals scored. The points for every match outcome were then calculated from the goals scored (see Table 1): a side gets 3 points if it scores more than its opponent, 1 point if the final score is a tie, and 0 points if the opponent scores more goals. For each simulated season (out of 10,000 in total for each method), we tallied the points, calculated the goal differentials, and obtained the final standings for EPL clubs (see Table 2). From this information, we tracked various metrics for EPL clubs and used them to evaluate and compare the models and their predictions, which are discussed in the next section.
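A minimal sketch of this simulation loop is given below. It assumes that the home side's goals are drawn from a Poisson distribution with its home rate and the away side's goals from one with its away rate, and it ranks clubs by points and then goal differential only. The four clubs, their rates, the number of simulated seasons, and all function names are hypothetical choices for illustration.

```python
import math
import random
from itertools import permutations

# Hypothetical mean scoring rates (goals per match); not estimates from this study
HOME_RATE = {"Club A": 2.1, "Club B": 1.6, "Club C": 1.3, "Club D": 1.0}
AWAY_RATE = {"Club A": 1.7, "Club B": 1.2, "Club C": 0.9, "Club D": 0.7}


def poisson_sample(rate, rng):
    """Draw a Poisson count with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_season(home_rate, away_rate, rng):
    """Play each ordered (home, away) pairing once; return {club: [points, goal_diff]}."""
    table = {club: [0, 0] for club in home_rate}
    for home, away in permutations(home_rate, 2):
        hg = poisson_sample(home_rate[home], rng)
        ag = poisson_sample(away_rate[away], rng)
        if hg > ag:
            table[home][0] += 3          # home win
        elif hg < ag:
            table[away][0] += 3          # away win
        else:
            table[home][0] += 1          # draw: one point each
            table[away][0] += 1
        table[home][1] += hg - ag
        table[away][1] += ag - hg
    return table


def final_standings(table):
    """Rank by points, then goal differential (further tie-breakers omitted)."""
    return sorted(table, key=lambda c: (table[c][0], table[c][1]), reverse=True)


def simulate_many(home_rate, away_rate, n_seasons, seed=0):
    """Average points per club and number of first-place finishes over many seasons."""
    rng = random.Random(seed)
    total_points = {club: 0 for club in home_rate}
    titles = {club: 0 for club in home_rate}
    for _ in range(n_seasons):
        table = simulate_season(home_rate, away_rate, rng)
        for club, (points, _) in table.items():
            total_points[club] += points
        titles[final_standings(table)[0]] += 1
    mean_points = {club: total / n_seasons for club, total in total_points.items()}
    return mean_points, titles


mean_points, titles = simulate_many(HOME_RATE, AWAY_RATE, n_seasons=2_000)
for club in HOME_RATE:
    print(f"{club}: mean points {mean_points[club]:.1f}, titles {titles[club]}")
```

Scaling this up to 20 clubs and 10,000 seasons only requires larger rate dictionaries and a larger `n_seasons`; metrics other than mean points and title counts can be accumulated in the same loop.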