Cook's distance

In statistics, particularly in regression diagnostics, Cook's distance (also called Cook's measure or Cook's D) is the most important measure for identifying so-called influential observations after a least-squares regression has been fitted. Cook's distance is named after the American statistician R. Dennis Cook, who introduced the concept in 1977.

Definition

Data points with large residuals (outliers) and/or high leverage can distort both the result and the precision of a regression. Cook's distance measures the effect of omitting a given observation. Data points with a large Cook's distance merit closer examination during the data analysis. Consider the multiple linear regression model in vector-matrix form:

$$\underset{n\times 1}{\mathbf{y}} = \underset{n\times p}{\mathbf{X}}\;\underset{p\times 1}{\boldsymbol{\beta}} + \underset{n\times 1}{\boldsymbol{\varepsilon}},$$

where the error vector follows a multivariate normal distribution, ${\boldsymbol {\varepsilon }}\sim {\mathcal {N}}\left(\mathbf {0} ,\sigma ^{2}\mathbf {I} \right)$, the vector of regression coefficients is ${\boldsymbol {\beta }}=\left(\beta _{0},\beta _{1},\dots ,\beta _{k}\right)^{\top }$ (here $p=k+1$ is the number of unknown parameters to be estimated and $k$ is the number of explanatory variables), and $\mathbf {X}$ is the data matrix. The least-squares estimator is then ${\hat {\boldsymbol {\beta }}}=\left(\mathbf {X} ^{\top }\mathbf {X} \right)^{-1}\mathbf {X} ^{\top }\mathbf {y}$, from which the vector of fitted values of the dependent variable follows as:

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \underbrace{\mathbf{X}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}}_{=\mathbf{P}}\,\mathbf{y} = \mathbf{P}\mathbf{y},$$

where $\mathbf {P} \equiv \mathbf {X} \left(\mathbf {X} ^{\top }\mathbf {X} \right)^{-1}\mathbf {X} ^{\top }$ is the prediction matrix. The $i$-th diagonal element of $\mathbf {P}$ is given by $p_{ii}\equiv \mathbf {x} _{i}^{\top }\left(\mathbf {X} ^{\top }\mathbf {X} \right)^{-1}\mathbf {x} _{i}$, where $\mathbf {x} _{i}^{\top }$ is the $i$-th row of the data matrix $\mathbf {X}$.^[1] The value $p_{ii}$ is also called the leverage of the $i$-th observation. To formalize the influence of a point $(y_{i},\mathbf {x} _{i}^{\top })$, one considers the effect of omitting that point on the estimates ${\hat {\boldsymbol {\beta }}}$ and $\mathbf {\hat {y}} =\mathbf {X} {\hat {\boldsymbol {\beta }}}$. The estimator of ${\boldsymbol {\beta }}$ obtained by omitting the $i$-th observation $(y_{i},\mathbf {x} _{i}^{\top })$ is given by ${\hat {\boldsymbol {\beta }}}_{(i)}=(\mathbf {X} _{(i)}^{\top }\mathbf {X} _{(i)})^{-1}\mathbf {X} _{(i)}^{\top }\mathbf {y} _{(i)}$, where $\mathbf {X} _{(i)}$ and $\mathbf {y} _{(i)}$ denote $\mathbf {X}$ and $\mathbf {y}$ with the $i$-th row removed.^[2] One can compare ${\hat {\boldsymbol {\beta }}}_{(i)}$ with ${\hat {\boldsymbol {\beta }}}$ by means of Cook's distance, which is defined by:^[3]^[4]

$$D_{i} = \frac{(\hat{\boldsymbol{\beta}}_{(i)}-\hat{\boldsymbol{\beta}})^{\top}(\mathbf{X}^{\top}\mathbf{X})(\hat{\boldsymbol{\beta}}_{(i)}-\hat{\boldsymbol{\beta}})}{(k+1)s^{2}} = \frac{(\mathbf{X}\hat{\boldsymbol{\beta}}_{(i)}-\mathbf{X}\hat{\boldsymbol{\beta}})^{\top}(\mathbf{X}\hat{\boldsymbol{\beta}}_{(i)}-\mathbf{X}\hat{\boldsymbol{\beta}})}{(k+1)s^{2}} = \frac{(\hat{\mathbf{y}}_{(i)}-\hat{\mathbf{y}})^{\top}(\hat{\mathbf{y}}_{(i)}-\hat{\mathbf{y}})}{(k+1)s^{2}},$$

where $s^{2}={\hat {\boldsymbol {\varepsilon }}}^{\top }{\hat {\boldsymbol {\varepsilon }}}/(n-k-1)$ is the unbiased estimate of the error variance. The measure $D_{i}$ is proportional to the squared Euclidean distance between ${\hat {\mathbf {y} }}_{(i)}$ and ${\hat {\mathbf {y} }}$. Hence $D_{i}$ is large when the observation $(y_{i},\mathbf {x} _{i}^{\top })$ has a substantial influence on both ${\hat {\boldsymbol {\beta }}}$ and ${\hat {\mathbf {y} }}$.
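The definition can be evaluated literally by refitting the model once per omitted observation. The following is a minimal NumPy sketch under stated assumptions: the function name `cooks_distance_loo` and the toy data are illustrative and not taken from the cited sources, and `X` is assumed to have full column rank and to already contain the intercept column.

```python
import numpy as np

def cooks_distance_loo(X, y):
    """Cook's distance from the definition, using n leave-one-out refits."""
    n, p = X.shape  # p = k + 1 columns, intercept column included
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ beta_hat
    resid = y - y_hat
    s2 = resid @ resid / (n - p)  # unbiased estimate of the error variance
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # boolean mask that drops row i
        beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        diff = X @ beta_i - y_hat  # y_hat_(i) - y_hat over all n rows
        D[i] = diff @ diff / (p * s2)
    return D

# Hypothetical toy data: nine points close to y = 1 + 2x and one
# far-away point at x = 20 that does not follow the line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 20.0])
y = 1.0 + 2.0 * x + 0.1 * np.cos(np.arange(10))
y[-1] = 0.0
X = np.column_stack([np.ones_like(x), x])

D = cooks_distance_loo(X, y)
print(np.round(D, 3))
print("most influential observation:", int(np.argmax(D)))
```

In this toy data set the last observation combines high leverage with a large residual, so it receives by far the largest distance. The loop costs $n$ separate least-squares fits, which is acceptable for small data sets only.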

A computationally simpler expression for $D_{i}$ is given by:^[5]

$$D_{i} = \frac{t_{i}^{2}}{k+1}\left(\frac{p_{ii}}{1-p_{ii}}\right),$$

where $t_{i}={\widehat {\varepsilon }}_{i}/\left(s{\sqrt {1-p_{ii}}}\right)$ are the (internally) studentized residuals.
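The closed form needs only one fit, because $\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(i)} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{x}_{i}\,\hat{\varepsilon}_{i}/(1-p_{ii})$. The sketch below assumes that $t_{i}$ is the internally studentized residual $\hat{\varepsilon}_{i}/(s\sqrt{1-p_{ii}})$ and checks the result against the leave-one-out definition. The function name `cooks_distance_closed_form` and the data are illustrative assumptions, and `X` is assumed to have full column rank.

```python
import numpy as np

def cooks_distance_closed_form(X, y):
    """Cook's distance from studentized residuals and leverages (one fit)."""
    n, p = X.shape  # p = k + 1
    # Thin QR factorization: P = Q Q^T, so p_ii is the squared norm of row i of Q.
    Q, _ = np.linalg.qr(X)
    leverage = np.sum(Q**2, axis=1)
    resid = y - Q @ (Q.T @ y)  # residuals y - P y
    s2 = resid @ resid / (n - p)  # unbiased estimate of the error variance
    t = resid / np.sqrt(s2 * (1.0 - leverage))  # internally studentized residuals
    return t**2 / p * leverage / (1.0 - leverage), leverage

# Hypothetical toy data (same shape of problem as a simple linear regression).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 20.0])
y = 1.0 + 2.0 * x + 0.1 * np.cos(np.arange(10))
y[-1] = 0.0
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

D_closed, leverage = cooks_distance_closed_form(X, y)

# Cross-check against the definition with explicit leave-one-out refits.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)
D_def = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    diff = X @ (beta_i - beta_hat)
    D_def[i] = diff @ diff / (p * s2)

print(np.allclose(D_closed, D_def))
```

The leverages are taken from the QR factors rather than from an explicit $n\times n$ matrix $\mathbf{P}$, which avoids forming $(\mathbf{X}^{\top}\mathbf{X})^{-1}$. Their sum equals $p = k+1$, the trace of $\mathbf{P}$, which is a convenient sanity check.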

Detecting highly influential observations

There are different approaches to choosing the cutoff beyond which an observation counts as highly influential. The simple rule of thumb $D_{i}>1$ has been proposed.^[6] Other authors have suggested $D_{i}>4/n$, where $n$ is the number of observations.^[7]
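Both rules of thumb are simple comparisons against a vector of distances. The sketch below applies them to made-up values; the helper name `flag_influential` and the numbers in `D_example` are hypothetical and serve only to illustrate how the two cutoffs differ.

```python
import numpy as np

def flag_influential(D):
    """Return the indices flagged by the two rules of thumb D_i > 1 and D_i > 4/n."""
    D = np.asarray(D, dtype=float)
    n = D.size  # number of observations
    return {
        "D > 1": np.flatnonzero(D > 1.0),
        "D > 4/n": np.flatnonzero(D > 4.0 / n),
    }

# Hypothetical Cook's distances for n = 10 observations, so 4/n = 0.4.
D_example = np.array([0.01, 0.05, 0.30, 1.80, 0.02, 0.45, 0.03, 0.10, 0.02, 0.01])
flags = flag_influential(D_example)
for rule, idx in flags.items():
    print(rule, idx.tolist())
```

For these values the rule $D_{i}>1$ flags only the observation with index 3, while $D_{i}>4/n$ additionally flags index 5. The second cutoff shrinks as $n$ grows, so it tends to flag more observations in large samples.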

See also

Mahalanobis distance

Literature

Alvin C. Rencher, G. Bruce Schaalje: Linear Models in Statistics. John Wiley & Sons, 2008.

References

[1] Fumio Hayashi: Econometrics. Princeton University Press, 2000, pp. 21–23.
[2] Alvin C. Rencher, G. Bruce Schaalje: Linear Models in Statistics. John Wiley & Sons, 2008, p. 236.
[3] Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, Brian Marx: Regression: Models, Methods and Applications. Springer Science & Business Media, 2013, ISBN 978-3-642-34332-2, p. 165.
[4] Alvin C. Rencher, G. Bruce Schaalje: Linear Models in Statistics. John Wiley & Sons, 2008, p. 237.
[5] Alvin C. Rencher, G. Bruce Schaalje: Linear Models in Statistics. John Wiley & Sons, 2008, p. 237.
[6] R. Dennis Cook, Sanford Weisberg: Residuals and Influence in Regression. Chapman & Hall, New York, 1982, ISBN 0-412-24280-X.
[7] Kenneth A. Bollen, Robert W. Jackman: Regression Diagnostics: An Expository Treatment of Outliers and Influential Cases. In: Modern Methods of Data Analysis. Newbury Park, CA, 1990, ISBN 0-8039-3366-5, pp. 257–259.
