Paper I · A statistical view on a surrogate model for estimating extreme events with an
application to wind turbines
applications; a few references are [4, 5, 12, 17]. This is one example where knowledge
of the empirical distribution of $Y$,
\[
\hat{P}_Y(\delta_{y_1},\dots,\delta_{y_n}) = \frac{1}{n}\sum_{i=1}^{n}\delta_{y_i}, \tag{1.1}
\]
is valuable. (Here $\delta_y$ denotes the Dirac measure at the point $y$.) If one is interested in
the entire distribution of $Y$, one may use the estimator (1.1) directly or a smoothed
version, for example, replacing $\delta_{y_i}$ by the Gaussian distribution with mean $y_i$ and
variance $\sigma^2 > 0$ (the latter usually referred to as the bandwidth). The problem in
determining (1.1) arises if $Y$ is not observable. Such a situation can happen for several
reasons: for instance, $Y$ may be difficult or expensive to measure, or its importance may
only recently have been recognized, so that one has not collected the historical data
that is needed. Sometimes, a solution to the problem of having a latent
variable could be to set up a suitable simulation environment and, by varying the
conditions of the system, obtain various realizations of $Y$. Since we cannot be sure
that the variations in the simulation environment correspond to the variations in the
physical environment, the realizations of $Y$ are not necessarily drawn from the true
distribution. This is essentially similar to any experimental study and one will have
to rely on the existence of control variables.
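As a small illustration of the estimator (1.1) and its Gaussian-smoothed version, the following sketch evaluates both distribution functions at a point. The sample values and the function names `empirical_cdf` and `smoothed_cdf` are hypothetical and chosen only for this example; `sigma` plays the role of the standard deviation of the Gaussian that replaces each point mass.

```python
import math


def empirical_cdf(sample, t):
    """Distribution function of (1/n) * sum_i delta_{y_i}, evaluated at t."""
    return sum(1 for y in sample if y <= t) / len(sample)


def smoothed_cdf(sample, t, sigma):
    """Distribution function of (1/n) * sum_i N(y_i, sigma^2), evaluated at t."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")

    def normal_cdf(z):
        # Standard normal distribution function via the error function.
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    return sum(normal_cdf((t - y) / sigma) for y in sample) / len(sample)


# Hypothetical observations y_1, ..., y_n (illustration only).
sample = [1.0, 2.0, 2.5, 4.0]
print(empirical_cdf(sample, 2.0))
print(smoothed_cdf(sample, 2.0, sigma=0.5))
```

As `sigma` decreases, the smoothed distribution function approaches the empirical one at every point that is not an observation.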
By assuming the existence of an observable $d$-dimensional vector $X$ of covariates
carrying information about the environment, a typical way to proceed would be
regression/matching, which in turn would form a surrogate model. To be concrete,
given a realization $x$ of $X$, a surrogate model is expected to output (approximately)
$f(x) = \mathbb{E}[Y \mid X = x]$, the conditional mean of $Y$ given $X = x$. Consequently, given
inputs $x_1,\dots,x_n$, the model would produce $f(x_1),\dots,f(x_n)$ as stand-ins for the missing
values $y_1,\dots,y_n$ of $Y$. Building a surrogate for the distribution of $Y$ on top of this could
now be done by replacing $y_i$ by $f(x_i)$ in (1.1) to obtain an estimate
$\hat{P}_Y(\delta_{f(x_1)},\dots,\delta_{f(x_n)})$ of the distribution of $Y$. This surrogate model
for the distribution of $Y$ can thus be seen as a composition of two maps:
\[
(x_1,\dots,x_n) \longrightarrow (\delta_{f(x_1)},\dots,\delta_{f(x_n)}) \longrightarrow \hat{P}_Y(\delta_{f(x_1)},\dots,\delta_{f(x_n)}). \tag{1.2}
\]
In the context of an incomplete data problem, the strategy of replacing unobserved
quantities by the corresponding conditional means is called regression imputation
and will generally not provide a good estimate of the distribution of $Y$. For instance,
while the (unobtainable) estimate in (1.1) converges weakly to the distribution of
$Y$ as the sample size $n$ increases, the one provided by (1.2) converges weakly to the
distribution of the conditional expectation $\mathbb{E}[Y \mid X]$ of $Y$ given $X$. In fact, any of the
so-called single imputation approaches, including regression imputation, usually results
in proxies $\hat{y}_1,\dots,\hat{y}_n$ which exhibit less variance than the original values
$y_1,\dots,y_n$, and in this case $\hat{P}_Y(\delta_{\hat{y}_1},\dots,\delta_{\hat{y}_n})$ will provide
a poor estimate of the distribution of $Y$ (see [15] for details).
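The loss of variance under regression imputation can be seen in a short simulation. This is a minimal sketch under a toy model that is not taken from the paper: $X$ is standard normal, $Y = 2X + \varepsilon$ with independent standard normal noise $\varepsilon$, and the conditional mean $f(x) = 2x$ is assumed to be known exactly. In this toy model the variance of $Y$ is 5, while the variance of $f(X)$ is 4.

```python
import random
import statistics

random.seed(0)
n = 20000

# Toy model (an assumption for illustration): X ~ N(0, 1), Y = 2 X + noise.
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 * xi + random.gauss(0.0, 1.0) for xi in x]  # latent in practice


def f(xi):
    """Conditional mean E[Y | X = xi] in the toy model (assumed known)."""
    return 2.0 * xi


# Regression imputation: each y_i is replaced by f(x_i).
y_hat = [f(xi) for xi in x]

var_y = statistics.pvariance(y)
var_hat = statistics.pvariance(y_hat)

# Fraction of values above a high threshold, from true and imputed values.
threshold = 4.0
tail_true = sum(1 for v in y if v > threshold) / n
tail_imputed = sum(1 for v in y_hat if v > threshold) / n

print(var_y, var_hat)
print(tail_true, tail_imputed)
```

The imputed values have a smaller sample variance and put less mass above the threshold, which is the kind of error that matters when the quantity of interest is an extreme event.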
The reason that the approach (1.2) works unsatisfactorily is that $\delta_{f(X)}$ is an
(unbiased) estimator for the distribution of $\mathbb{E}[Y \mid X]$ rather than of $Y$. For this
reason we will replace $\delta_{f(x)}$ by an estimator for the conditional distribution $\mu_x$
of $Y$ given $X = x$ and maintain the overall structure of (1.2):
\[
(x_1,\dots,x_n) \longrightarrow (\mu_{x_1},\dots,\mu_{x_n}) \longrightarrow \hat{P}_Y(\mu_{x_1},\dots,\mu_{x_n}). \tag{1.3}
\]
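To make the structure of (1.3) concrete, the sketch below averages conditional distributions instead of point masses and compares the resulting tail probability with the one obtained from point masses at the conditional means. It assumes a toy model that is not from the paper: $X$ is standard normal and $\mu_x$ is the Gaussian distribution with mean $f(x) = 2x$ and standard deviation `S = 1`, taken as known here. In practice $\mu_x$ has to be estimated, and the names `mixture_tail` and `point_mass_tail` are hypothetical.

```python
import math
import random


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def f(x):
    """Conditional mean in the toy model (an assumption for illustration)."""
    return 2.0 * x


S = 1.0  # conditional standard deviation in the toy model (assumed known)


def mixture_tail(xs, t):
    """P(Y > t) under (1/n) * sum_i mu_{x_i}, with mu_x = N(f(x), S^2)."""
    return sum(1.0 - normal_cdf((t - f(x)) / S) for x in xs) / len(xs)


def point_mass_tail(xs, t):
    """P(Y > t) under (1/n) * sum_i delta_{f(x_i)}, as in (1.2)."""
    return sum(1 for x in xs if f(x) > t) / len(xs)


random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(20000)]
t = 4.0

# In the toy model Y ~ N(0, 5), so the true tail probability is available.
true_tail = 1.0 - normal_cdf(t / math.sqrt(5.0))

print(true_tail, mixture_tail(xs, t), point_mass_tail(xs, t))
```

In this toy setting the average of conditional distributions recovers the tail probability of $Y$ closely, whereas the point masses at $f(x_i)$ underestimate it.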