The gkrreg package implements the Gaussian Kernel Robust Regression (GKRReg, or GKRR) method proposed by De Carvalho et al. (2017). The method fits a linear regression model by iteratively re-weighting observations using the Gaussian kernel, so that outliers and leverage points are automatically down-weighted. Convergence is theoretically guaranteed (Propositions 4.1 and 4.2 of the paper).
The package provides:
gkrr() — model fitting with four width hyper-parameter
estimators ("s1", "s2", "s3",
"s4") plus automatic data-driven selection
("auto");gkrr_boot() — bootstrap inference (standard errors,
confidence intervals and p-values);summary.gkrr() — a coefficient table modelled after
summary.lm(), with sandwich inference by default and
bootstrap inference when available;plot.gkrr() — six diagnostic panels emphasising kernel
weights;GKRReg minimises
\[S(\boldsymbol{\beta}) = 2\sum_{i=1}^n \bigl[1 - G(y_i,\, \hat{\mu}_i)\bigr], \qquad G(a,b) = \exp\!\bigl(-{(a-b)^2}/{\gamma^2}\bigr),\]
where \(\hat{\mu}_i = \mathbf{x}_i^\top\boldsymbol{\beta}\) and \(\gamma^2 > 0\) is the kernel width hyper-parameter. An observation with a large residual contributes close to 2 to \(S\); a perfectly fitted observation contributes 0.
Differentiating \(S\) yields an IRWLS problem with kernel weights \(k_{ii}^{(t)} = G(y_i, \hat{\mu}_i^{(t-1)}) \in (0,1]\). The algorithm starts from OLS and alternates between WLS and weight updates until convergence (Propositions 4.1–4.2 of De Carvalho et al. (2017)).
Five options are available via the sigma_method
argument:
| Method | Description | Recommended scenario |
|---|---|---|
"s1" |
Mean of the 0.1 and 0.9 quantiles of OLS squared residuals on a sub-sample (Caputo estimator) | Clean data |
"s2" |
Pairwise median of \((y_i - \hat{\mu}_j^{\text{OLS}})^2\), \(i \neq j\) | Y-space outliers \(\leq 10\%\); X-space outliers \(\leq 15\%\) |
"s3" |
Residual variance \(\sum(y_i - \hat{\mu}_i)^2 / (n - p - 1)\) | Y-space outliers \(\geq 15\%\); leverage points |
"s4" |
AICc-selected bandwidth \(h^2\) via
sm::h.select() |
Large samples (\(n \geq 200\)) |
"auto" |
Selects among S1–S4 by out-of-bag bootstrap MSE | When the outlier scenario is unknown |
When "auto" is used, a message is emitted showing which
method was selected and the comparative OOB MSEs. The selection uses
\(B = 99\) replicates by default,
configurable via auto_args = list(B = ...).
library(gkrreg)
data(belgium_calls)
fit <- gkrr(calls ~ year, data = belgium_calls, sigma_method = "s3")
print(fit)
#>
#> GKRReg -- Gaussian Kernel Robust Regression
#> --------------------------------------------
#> gamma^2 : 31.61 (method: s3)
#> Iterations: 9 [converged]
#>
#> Coefficients:
#> (Intercept) year
#> -6.5764 0.1351
#>
#> (Sandwich inference available -- use summary() for the full table)summary() uses the sandwich variance
estimator by default, producing standard errors, confidence
intervals and Wald z-test p-values with no additional computation:
summary(fit)
#>
#> Call:
#> gkrr(formula = calls ~ year, data = belgium_calls, sigma_method = "s3")
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.6165 -0.1917 -0.0271 3.5213 18.4537
#>
#> Coefficients:
#> Estimate Std. Error CI 95% lower CI 95% upper p-value
#> (Intercept) -6.5764 1.1145 -8.7609 -4.3920 3.621e-09 ***
#> year 0.1351 0.0203 0.0954 0.1748 2.529e-11 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> (Sandwich SE (asymptotic Wald z-test))
#>
#> gamma^2: 31.61 | method: s3 | iterations: 9 | converged: TRUE
#> R-squared: -0.1219 | Weighted R-squared: 0.5499
#> Kernel weights -- min: 0.0000 mean: 0.7501 max: 1.0000
#> Note: sandwich inference may be less reliable here (n = 24 (small sample); 12% of observations received near-zero kernel weight (< 0.01)).
#> Consider bootstrap inference via boot = TRUE in gkrr() or
#> summary(fit, boot = gkrr_boot(fit)).The sandwich covariance matrix is accessible via
vcov():
round(vcov(fit), 6)
#> (Intercept) year
#> (Intercept) 1.242183 -0.022546
#> year -0.022546 0.000410The original paper does not provide a sampling distribution for \(\hat{\boldsymbol{\beta}}\). The WLS covariance matrix \((X^\top \hat{K} X)^{-1}\) from the final IRWLS step treats \(\hat{K}\) as fixed and therefore underestimates the true variance.
gkrreg implements an analytic sandwich variance estimator based on the theory of generalised \(M\)-estimators. At convergence the GKRReg estimating equations are:
\[\sum_{i=1}^n k_{ii}(\boldsymbol{\beta})\,\mathbf{x}_i (y_i - \mathbf{x}_i^\top\boldsymbol{\beta}) = \mathbf{0},\]
and the asymptotic sandwich covariance is \(n^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1}\), where
\[\hat{A} = \frac{1}{n}\,\mathbf{X}^\top \operatorname{diag}\!\left[k_{ii}\!\left(1 - \frac{2\hat{e}_i^2}{\hat{\gamma}^2}\right)\right] \mathbf{X}, \qquad \hat{B} = \frac{1}{n}\,\mathbf{X}^\top \operatorname{diag}\!\left(k_{ii}^2\,\hat{e}_i^2\right)\mathbf{X}.\]
This corresponds to the HC0 class of heteroskedasticity-robust covariance estimators (white1982?). The correction term \((1 - 2\hat{e}_i^2/\hat{\gamma}^2)\) in \(\hat{A}\) arises directly from differentiating the Gaussian kernel with respect to \(\boldsymbol{\beta}\).
When is the sandwich reliable? The sandwich is
consistent and efficient asymptotically, but treats \(\hat{\gamma}^2\) as fixed. For small
samples (\(n < 50\)) or heavy
contamination, bootstrap inference is recommended.
summary() emits a note when these conditions are detected
automatically.
gkrr_boot() runs a pairs bootstrap, re-fitting
the complete GKRReg algorithm — including re-estimating \(\gamma^2\) — on each replicate. This
captures all sources of variability and is more reliable than the
sandwich for small samples or heavy contamination.
Three CI types are available: "percentile",
"normal", and "bca" (bias-corrected and
accelerated, Efron (1987); the default and
recommended). Bootstrap p-values use the centred-t approach:
\[p_j = \frac{2}{B}\,\#\!\left\{|\hat\beta_j^* - \hat\beta_j| \geq |\hat\beta_j|\right\}.\]
Option A — boot = TRUE inside
gkrr()
fit_b <- gkrr(calls ~ year, data = belgium_calls,
sigma_method = "s3",
boot = TRUE,
boot_args = list(B = 999, type = "bca", seed = 1))
summary(fit_b)Option B — separate gkrr_boot()
call
| Scenario | Recommended |
|---|---|
| \(n \geq 50\), mild contamination | Sandwich (fast, deterministic) |
| \(n < 50\) or heavy contamination | Bootstrap |
| Both available and SEs diverge by > 25% | Bootstrap |
When both are available, summary() automatically warns
if the relative discrepancy in standard errors exceeds 25% for any
coefficient.
plot.gkrr() provides six panels. Point size is
inversely proportional to \(k_{ii}\): outliers appear large
and red; well-fitted observations appear small and blue.
oldpar <- par(no.readonly = TRUE)
on.exit(par(oldpar))
par(mfrow = c(1, 3))
plot(fit, which = 1, ask = FALSE) # residuals vs. fitted
plot(fit, which = 3, ask = FALSE) # weight vs. residual + kernel curve
plot(fit, which = 4, ask = FALSE) # weight vs. indexPanel 3 overlays the theoretical curve \(G(e) = \exp(-e^2/\hat{\gamma}^2)\), making
the down-weighting mechanism directly visible. Observations 15–20 in
belgium_calls (a recording error) receive near-zero weights
and are effectively excluded from the fit.
which |
Description |
|---|---|
| 1 | Residuals vs. fitted values |
| 2 | Observed vs. fitted (\(y\) vs. \(\hat{\mu}\)) |
| 3 | Kernel weight vs. residual with theoretical curve |
| 4 | Kernel weight vs. observation index |
| 5 | Normal QQ-plot of residuals, coloured by weight |
| 6 | Convergence of \(S(\boldsymbol{\beta})\) |
data(kootenay)
fit_ols <- lm(newgate ~ libby, data = kootenay)
fit_gkrr <- gkrr(newgate ~ libby, data = kootenay, sigma_method = "s1")
if (requireNamespace("MASS", quietly = TRUE)) {
fit_m <- MASS::rlm(newgate ~ libby, data = kootenay, method = "M")
fit_mm <- MASS::rlm(newgate ~ libby, data = kootenay, method = "MM")
} else { fit_m <- fit_mm <- NULL }
tab <- rbind(OLS = coef(fit_ols),
GKRR = coef(fit_gkrr))
if (!is.null(fit_m))
tab <- rbind(tab, M = coef(fit_m), MM = coef(fit_mm))
print(round(tab, 4))
#> (Intercept) libby
#> OLS 23.1649 -0.0138
#> GKRR 5.4561 0.6216
#> M 23.1942 -0.0186
#> MM 5.4667 0.6209
plot(kootenay$libby, kootenay$newgate,
xlab = "Libby flow", ylab = "Newgate flow",
main = "Kootenay River -- X-space outlier (1934)",
pch = 19, col = "grey60")
points(kootenay["1934","libby"], kootenay["1934","newgate"],
col = "red", pch = 17, cex = 1.6)
abline(fit_ols, col = "black", lwd = 2, lty = 2)
abline(fit_gkrr, col = "firebrick", lwd = 2)
if (!is.null(fit_m)) {
abline(fit_m, col = "darkorange", lwd = 2, lty = 3)
abline(fit_mm, col = "purple", lwd = 2, lty = 4)
legend("topleft", c("OLS","GKRR","M","MM","1934"),
col = c("black","firebrick","darkorange","purple","red"),
lty = c(2,1,3,4,NA), pch = c(NA,NA,NA,NA,17), lwd = 2, bty = "n")
} else {
legend("topleft", c("OLS","GKRR","1934"),
col = c("black","firebrick","red"),
lty = c(2,1,NA), pch = c(NA,NA,17), lwd = 2, bty = "n")
}fit_ols <- lm(calls ~ year, data = belgium_calls)
fit_gkrr <- gkrr(calls ~ year, data = belgium_calls, sigma_method = "s3")
plot(belgium_calls$year + 1900, belgium_calls$calls,
xlab = "Year", ylab = "Calls (tens of millions)",
main = "Belgium International Calls",
pch = 19, col = "grey60")
points(belgium_calls$year[15:20] + 1900, belgium_calls$calls[15:20],
col = "red", pch = 17, cex = 1.4)
abline(fit_ols, col = "black", lwd = 2, lty = 2)
abline(fit_gkrr, col = "firebrick", lwd = 2)
legend("topleft", c("OLS","GKRR (S3)","Outliers (1964-69)"),
col = c("black","firebrick","red"),
lty = c(2,1,NA), pch = c(NA,NA,17), lwd = 2, bty = "n")oldpar <- par(no.readonly = TRUE)
on.exit(par(oldpar))
par(mfrow = c(1, 3))
plot(fit_gkrr, which = 1, ask = FALSE)
plot(fit_gkrr, which = 3, ask = FALSE)
plot(fit_gkrr, which = 4, ask = FALSE)data(delivery)
fit_ols <- lm(delivery_time ~ n_products + distance, data = delivery)
fit_gkrr <- gkrr(delivery_time ~ n_products + distance, data = delivery,
sigma_method = "s3")
round(rbind(OLS = coef(fit_ols),
GKRR = coef(fit_gkrr)), 4)
#> (Intercept) n_products distance
#> OLS 2.3412 1.6159 0.0144
#> GKRR 4.0607 1.3891 0.0145oldpar <- par(no.readonly = TRUE)
on.exit(par(oldpar))
par(mfrow = c(1, 3))
plot(fit_gkrr, which = 2, ask = FALSE)
plot(fit_gkrr, which = 4, ask = FALSE)
plot(fit_gkrr, which = 5, ask = FALSE)data(mammals)
fit_ols <- lm(log_brain ~ log_body, data = mammals)
fit_gkrr <- gkrr(log_brain ~ log_body, data = mammals, sigma_method = "s3")
plot(mammals$log_body, mammals$log_brain,
xlab = "log(body mass, kg)", ylab = "log(brain mass, g)",
main = "Brain vs. Body Mass (62 Mammal Species)",
pch = 19, col = "grey60", cex = 0.8)
elephants <- mammals$species %in% c("African elephant","Asian elephant")
points(mammals$log_body[elephants], mammals$log_brain[elephants],
col = "red", pch = 17, cex = 1.5)
abline(fit_ols, col = "black", lwd = 2, lty = 2)
abline(fit_gkrr, col = "firebrick", lwd = 2)
legend("topleft", c("OLS","GKRR (S3)","Elephants"),
col = c("black","firebrick","red"),
lty = c(2,1,NA), pch = c(NA,NA,17), lwd = 2, bty = "n")sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] gkrreg_0.4.0 rmarkdown_2.31
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.39 R6_2.6.1 fastmap_1.2.0 xfun_0.58
#> [5] maketools_1.3.2 cachem_1.1.0 knitr_1.51 htmltools_0.5.9
#> [9] buildtools_1.0.0 lifecycle_1.0.5 cli_3.6.6 sass_0.4.10
#> [13] jquerylib_0.1.4 sm_2.2-6.0 compiler_4.6.0 sys_3.4.3
#> [17] tools_4.6.0 evaluate_1.0.5 bslib_0.11.0 yaml_2.3.12
#> [21] otel_0.2.0 jsonlite_2.0.0 rlang_1.2.0 MASS_7.3-65