Source Code
You can download the Rmarkdown notebook used to render this article
here.
To download the file, use the button Download raw file on the
upper-right hand of the Code panel.
Overview
The neanderthal dataset contains records of Neanderthal
archaeological sites from Marine Isotope Stage 5e (Last Interglacial,
~130,000-115,000 years ago), combined with 25 paleoclimate and
topographic predictor variables. This dataset was built to support
binary classification, species distribution modelling, and habitat
suitability analysis.
Data Structure
The dataset is an sf POINT dataframe (EPSG 4326) with
227 rows and 28 columns, and no missing data.
dplyr::glimpse(neanderthal)
#> Rows: 227
#> Columns: 28
#> $ presence <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ bio1 <dbl> 93.90861, 101.39613, 98.43701, 80.31912, 67.72340…
#> $ bio10 <dbl> 230.8535, 193.2386, 196.0572, 208.2075, 193.7100,…
#> $ bio11 <dbl> -40.628402, 19.409437, 7.986982, -47.536219, -58.…
#> $ bio12 <dbl> 1016.7842, 634.9449, 651.3601, 481.9939, 598.3581…
#> $ bio13 <dbl> 124.79270, 86.20893, 78.12757, 64.77663, 79.24467…
#> $ bio14 <dbl> 42.9601370, 20.1204209, 23.6468781, 17.5191646, 2…
#> $ bio15 <dbl> 28.78959, 34.29474, 27.51153, 40.68027, 36.12821,…
#> $ bio16 <dbl> 321.4307, 235.2489, 220.1406, 189.4942, 228.7584,…
#> $ bio17 <dbl> 141.42826, 103.49764, 107.45727, 59.94072, 84.456…
#> $ bio18 <dbl> 306.99387, 125.23646, 155.59023, 179.19781, 213.6…
#> $ bio19 <dbl> 174.55667, 150.36287, 140.85947, 60.03907, 84.465…
#> $ bio2 <dbl> 107.35804, 80.55478, 86.42271, 101.75828, 100.366…
#> $ bio3 <dbl> 24.97327, 29.14201, 29.00000, 24.82866, 24.62171,…
#> $ bio4 <dbl> 10434.941, 6795.361, 7319.844, 9877.545, 9766.644…
#> $ bio5 <dbl> 315.4290, 259.6066, 268.1590, 290.8134, 275.5202,…
#> $ bio6 <dbl> -109.914922, -12.303463, -26.610513, -115.376691,…
#> $ bio7 <dbl> 425.3439, 271.9101, 294.7695, 406.1901, 401.8273,…
#> $ bio8 <dbl> 219.387741, 95.264316, 89.376189, 200.248222, 186…
#> $ bio9 <dbl> -23.84609, 52.21129, 49.78763, -41.15274, -58.285…
#> $ topo_aspect <dbl> 203.37818, 149.00686, 169.84060, 142.45523, 149.4…
#> $ topo_diversity_local <dbl> 7.1118334, 6.2437519, 1.2860197, -3.1426907, -0.7…
#> $ topo_diversity <dbl> 78.13295, 74.73658, 73.03239, 67.50587, 71.77473,…
#> $ topo_elev <dbl> 240.97322, 62.10563, 32.56124, 110.95158, 348.351…
#> $ topo_slope <dbl> 2.0600362, 0.9570366, 0.3261872, 1.3632851, 1.354…
#> $ topo_wetness <dbl> -3.142004, -3.236383, -1.925804, -3.006190, -4.24…
#> $ geometry <POINT [°]> POINT (15.8632 46.1654), POINT (1.8785 50.1…
#> $ color <chr> "#F5191C", "#F5191C", "#F5191C", "#F5191C", "#F51…The response variable is presence (see
neanderthal_response), a integer binary variable indicating
Neanderthal presence (1) or pseudo-absence (0).
table(neanderthal$presence)
#>
#> 0 1
#> 178 49The locations of the presence and background records are mapped below.
mapview::mapview(
neanderthal |>
dplyr::mutate(
presence = as.factor(presence)
),
zcol = "presence",
layer.name = "Presence",
col.regions = neanderthal$color
)The dataset includes 25 palaeoclimatic (last interglacial) and topographic predictors.
neanderthal_predictors
#> [1] "bio1" "bio10" "bio11"
#> [4] "bio12" "bio13" "bio14"
#> [7] "bio15" "bio16" "bio17"
#> [10] "bio18" "bio19" "bio2"
#> [13] "bio3" "bio4" "bio5"
#> [16] "bio6" "bio7" "bio8"
#> [19] "bio9" "topo_aspect" "topo_diversity_local"
#> [22] "topo_diversity" "topo_elev" "topo_slope"
#> [25] "topo_wetness"There is a companion raster dataset loaded via
neanderthal_extra().
neanderthal_env <- spatialData::neanderthal_extra()
plot(
x = neanderthal_env,
nc = 3,
col = env_colors,
)
Example Usage
In this section we use neanderthal to model habitat
suitability via an ensemble of univariate GAMs, one per predictor.
The neanderthal data has 49 presences. Fitting a model
with all predictors at once might lead to overparameterization and
overfitting. To avoid this potential issue, we will fit univariate GAM
models between presence and each environmental predictor. Each model in
the ensemble follows this structure:
m <- mgcv::gam(
formula = presence ~ s(
bio1,
k = ceiling(sum(neanderthal$presence)/10)
),
data = sf::st_drop_geometry(neanderthal),
family = stats::quasibinomial(link = "logit"),
weights = collinear::case_weights(
x = neanderthal$presence
),
select = TRUE
)mgcv::gam(): GAM models replace the linear terms used in GLMs with smoothing splines that better represent non-linear relationships between presence and environment.s(..., k = ceiling(sum(neanderthal$presence)/10): smoothing splines capture non-linear relationships between presence and each predictor. The basis dimensionkis set relative to the number of presences, keeping the smooth complexity well within the data budget.quasibinomial(link = "logit"): handles the overdispersion introduced by pseudo-absences and the fractional case weights, which a standardbinomialfamily would reject.weights = case_weights(...): inverse-frequency weights ensure that presences and pseudo-absences contribute equally to model fit regardless of their imbalance.
sample_weights <- collinear::case_weights(x = neanderthal$presence)
table(sample_weights)
#> sample_weights
#> 0.00280898876404494 0.0102040816326531
#> 178 49-
select = TRUE: adds a shrinkage penalty so that uninformative smooths can be penalized to zero, acting as automatic term selection.
After all relevant models are fitted, we will ensemble their predictions using their intrinsic AUC (probability of giving a higher suitability to a presence than to a background point) as weights.
collinear::score_auc(
o = neanderthal$presence,
p = stats::predict(m, type = "response")
)
#> [1] 0.8357028Multicollinearity Filtering
There are many redundant predictors in neanderthal. To
mitigate this issue, here we apply collinear::collinear()
to select a non-redundant subset of the most relevant predictors. The
function ranks all predictors by the AUC of univariate binomial GAM
models with case weights (very similar to the ones we’ll use for the
habitat suitability model!), and then applies a multicollinearity
filtering that preserves the most important ones.
neanderthal_selection <- collinear::collinear(
df = neanderthal,
responses = neanderthal_response,
predictors = neanderthal_predictors,
f = collinear::f_binomial_gam
)
#>
#> collinear::collinear()
#> └── collinear::validate_arg_df()
#> └── collinear::drop_geometry_column(): dropping geometry column from 'df'.
#>
#> collinear::collinear(): setting 'max_cor' to 0.517.
#>
#> collinear::collinear(): setting 'max_vif' to 2.8923.
#>
#> collinear::collinear(): selected predictors:
#> - bio17
#> - bio1
#> - bio7
#> - bio8
#> - topo_slope
#> - bio13
#> - topo_diversity
#> - topo_wetness
#> - topo_aspectThe names of the selected predictors are in the vector below.
neanderthal_selection$presence$selection
#> [1] "bio17" "bio1" "bio7" "bio8"
#> [5] "topo_slope" "bio13" "topo_diversity" "topo_wetness"
#> [9] "topo_aspect"
#> attr(,"validated")
#> [1] TRUEModel Fitting
The code below fits the univariate GAM models. Notice that models with an intrinsic AUC below 0.75 are ignored.
neanderthal_models <- lapply(
X = neanderthal_selection$presence$selection,
FUN = \(x) {
model_formula <- as.formula(
paste0(
"presence ~ s(",
x,
", k = ",
ceiling(sum(neanderthal$presence) / 10),
")"
)
)
model <- mgcv::gam(
formula = model_formula,
data = sf::st_drop_geometry(neanderthal),
family = stats::quasibinomial(link = "logit"),
weights = collinear::case_weights(
x = neanderthal$presence
),
select = TRUE
)
auc <- collinear::score_auc(
o = neanderthal$presence,
p = stats::predict(
object = model,
type = "response"
)
)
model$auc <- auc
model
}
)
names(neanderthal_models) <- neanderthal_selection$presence$selectionLet’s take a look at the performance of these univariate models.
neanderthal_auc <- lapply(
X = neanderthal_models,
FUN = \(x) x$auc
) |>
unlist()
neanderthal_auc
#> bio17 bio1 bio7 bio8 topo_slope
#> 0.8225178 0.8357028 0.8049759 0.7455859 0.7984407
#> bio13 topo_diversity topo_wetness topo_aspect
#> 0.7860582 0.7043109 0.6113277 0.6315065Several of these models have very low intrinsic AUC scores, meaning that they lack the ability to discriminate properly between presences and background points. We are going to ignore all models with an AUC below 0.75.
#keep models with AUC >= 0.75
neanderthal_auc <- neanderthal_auc[neanderthal_auc >= 0.75]
#remove bad models from the models list
neanderthal_models <- neanderthal_models[names(neanderthal_auc)]The response curves below show the predicted probability of Neanderthal presence for the selected univariate models.
lapply(
X = names(neanderthal_models),
FUN = \(x) {
x <- plotmo::plotmo(
object = neanderthal_models[[x]],
type = "response",
level = 0.95,
level.shade = "gray80",
main = x,
caption = "",
pt.col = neanderthal$color,
jitter = 0.3,
smooth.col = 0
)
}) |>
invisible()




Habitat Suitability Map
To build the weighted ensemble of univariate models we first need to
predict each univariate model over the spatRaster object
neanderthal_env.
neanderthal_predictions <- lapply(
X = names(neanderthal_models),
FUN = \(x) {
terra::predict(
object = neanderthal_env,
model = neanderthal_models[[x]],
type = "response",
na.rm = TRUE
)
}
) |>
terra::rast()
names(neanderthal_predictions) <- names(neanderthal_models)The resulting object is a spatRaster where each layer
represents a univariate suitability map.
plot(
x = neanderthal_predictions,
col = neanderthal_colors
)
To build the ensemble, we apply terra::weighted.mean()
to the individual model predictions and their intrinsic AUCs.
neanderthal_suitability <- terra::weighted.mean(
x = neanderthal_predictions,
w = neanderthal_auc
)
names(neanderthal_suitability) <- "suitability"The map below shows the high-suitability values in warmer colors.
mapview(
x = neanderthal_suitability,
layer.name = "Suitability",
col.regions = neanderthal_colors,
na.color = "transparent"
)To finish, we can
#extract predicted values for all records
neanderthal$prediction <- terra::extract(
x = neanderthal_suitability,
y = neanderthal,
ID = FALSE
) |>
dplyr::pull(
suitability
)
#remove NA data (not supported by score_auc())
neanderthal <- na.omit(neanderthal)
#compute intrinsic AUC
collinear::score_auc(
o = neanderthal$presence,
p = neanderthal$prediction
)
#> [1] 0.9091353That’s good to see: the AUC of the ensemble is better than any AUC of the univariate models, which is the whole point of building this kind of ensemble modelling.
Conclusion
The ensemble of univariate GAMs, weighted by cross-validated AUC, provides a robust picture of Neanderthal habitat suitability during the Last Interglacial.
References
Benito, B.M., et al. (2017). The ecological niche and distribution of Neanderthals during the Last Interglacial. Journal of Biogeography, 44, 51–61. https://doi.org/10.1111/jbi.12845
Nielsen, T.K., Benito, B.M., Svenning, J.-C., Sandel, B., McKerracher, L., Riede, F., & Kjærgaard, P.C. (2017). Investigating Neanderthal dispersal above 55°N in Europe during the Last Interglacial Complex. Quaternary International, 431, 88–103. https://doi.org/10.1016/j.quaint.2015.10.039
