Skip to contents

Overview

The quercus dataset contains presence and absence records of eight European Quercus (oak) species combined with 31 environmental predictor variables covering bioclimatic, topographic, vegetation, and human impact dimensions.

Its main features are:

  • Complete case study - No missing values, clean data ready for analysis.
  • Multi-class classification - Eight oak species plus absence points (9 levels).
  • European scope - Records spanning western, central, and southern Europe.
  • Ecological predictors - Climate, topography, vegetation indices, land cover, and human footprint.

This dataset was built to support a range of analytical approaches, such as binary and multi-class classification, niche modelling, and niche overlap analysis.

Setup

The following R libraries are required to run this tutorial:

library(dplyr)
library(sf)
library(mapview)
library(terra)
library(rpart)
library(rpart.plot)
library(collinear)
library(spatialData)

data(
  quercus,
  quercus_response,
  quercus_predictors,
  package = "spatialData"
)

#convert `species` to factor
quercus <- dplyr::mutate(
  quercus,
  species = as.factor(species)
)

quercus_colors <- grDevices::palette.colors(
  n = length(levels(quercus$species)),
  palette = "Okabe-Ito",
  alpha = 0.6
  ) |> 
  rev() |> 
  stats::setNames(levels(quercus$species))

env_colors <- grDevices::hcl.colors(
  n = 100,
  palette = "Zissou 1"
)

Data Structure

The dataset is an sf POINT dataframe with EPSG 4326, 6728 rows and 33 columns, and no missing data. The first 10 records and all columns but geometry are shown below.

quercus |> 
  sf::st_drop_geometry() |> 
  head() |> 
  dplyr::glimpse()
#> Rows: 6
#> Columns: 32
#> $ species               <fct> Quercus robur, Quercus robur, Quercus petraea, Q…
#> $ bio1                  <int> 75, 75, 110, 58, 82, 157
#> $ bio10                 <int> 163, 149, 179, 158, 164, 250
#> $ bio11                 <int> -20, 5, 41, -36, 2, 73
#> $ bio12                 <int> 740, 818, 1006, 587, 582, 510
#> $ bio13                 <int> 106, 96, 105, 73, 67, 69
#> $ bio14                 <int> 34, 43, 61, 28, 30, 4
#> $ bio15                 <int> 40, 27, 13, 33, 23, 51
#> $ bio16                 <int> 294, 279, 296, 210, 194, 178
#> $ bio17                 <int> 107, 138, 212, 92, 103, 29
#> $ bio18                 <int> 294, 205, 212, 160, 194, 29
#> $ bio19                 <int> 115, 162, 285, 118, 125, 172
#> $ bio2                  <int> 90, 65, 103, 64, 68, 124
#> $ bio3                  <int> 31, 28, 41, 22, 27, 37
#> $ bio4                  <int> 7116, 5759, 5368, 7617, 6366, 6904
#> $ bio5                  <int> 227, 193, 249, 206, 213, 353
#> $ bio6                  <int> -62, -35, 0, -74, -31, 22
#> $ bio7                  <int> 289, 228, 249, 280, 244, 331
#> $ topographic_diversity <int> 93, 12, 43, 8, 11, 55
#> $ human_footprint       <dbl> 42.33, 52.65, 32.92, 26.68, 40.08, 36.03
#> $ landcover_veg_bare    <dbl> 0.36, 0.25, 0.00, 0.08, 0.30, 2.40
#> $ landcover_veg_herb    <dbl> 58.85, 91.73, 71.67, 44.06, 53.94, 87.12
#> $ landcover_veg_tree    <dbl> 40.69, 8.02, 28.25, 55.78, 35.20, 10.47
#> $ ndvi_average          <dbl> 0.63, 0.63, 0.73, 0.50, 0.38, 0.48
#> $ ndvi_maximum          <dbl> 0.79, 0.77, 0.80, 0.83, 0.53, 0.70
#> $ ndvi_minimum          <dbl> 0.44, 0.45, 0.63, 0.06, 0.22, 0.27
#> $ ndvi_range            <dbl> 0.36, 0.33, 0.17, 0.77, 0.31, 0.42
#> $ sun_rad_average       <dbl> 4164.84, 3552.85, 4480.35, 3376.14, 3787.15, 511…
#> $ sun_rad_maximum       <dbl> 7515.76, 7189.52, 7601.16, 7099.44, 7301.63, 776…
#> $ sun_rad_minimum       <dbl> 897.09, 356.39, 1271.03, 230.87, 552.49, 2118.26
#> $ sun_rad_range         <dbl> 6618.67, 6833.12, 6330.14, 6868.57, 6749.14, 564…
#> $ topo_slope            <dbl> 3.10, 0.39, 1.25, 0.28, 0.43, 1.88

The response variable is species, a categorical variable with 9 levels (8 species + absence):

quercus |> 
  sf::st_drop_geometry() |> 
  dplyr::group_by(species) |> 
  dplyr::summarise(
    n = dplyr::n()
  ) |> 
  dplyr::arrange(
    dplyr::desc(n)
    )
#> # A tibble: 9 × 2
#>   species               n
#>   <fct>             <int>
#> 1 absence            1899
#> 2 Quercus robur      1660
#> 3 Quercus petraea    1445
#> 4 Quercus ilex        627
#> 5 Quercus cerris      394
#> 6 Quercus faginea     287
#> 7 Quercus pubescens   225
#> 8 Quercus pyrenaica   133
#> 9 Quercus suber        58

The dataset includes 31 predictor variables organized into six categories:

Category Variables Source
Bioclimatic (17) bio1, bio2, …, bio19 (excluding bio8, bio9) WorldClim
NDVI (4) ndvi_average, ndvi_maximum, ndvi_minimum, ndvi_range MODIS
Solar radiation (4) sun_rad_average, sun_rad_maximum, sun_rad_minimum, sun_rad_range WorldClim
Land cover (3) landcover_veg_bare, landcover_veg_herb, landcover_veg_tree MODIS MOD44B
Topographic (2) topo_slope, topographic_diversity SRTM
Human impact (1) human_footprint Venter et al. 2016

The map below shows the spatial distribution of oak species and absence points across Europe:

mapview::mapview(
  quercus,
  zcol = "species",
  layer.name = "Species",
  col.regions = quercus_colors,
  color = NULL
)

The function spatialData::quercus_extra() returns a spatRaster object, named quercus_env below, useful to generate continuous predictions from models trained with the quercus data.

quercus_env <- spatialData::quercus_extra()
#> spatialData::quercus_extra(): Downloading file 'quercus_env.tif'.
plot(
  x = quercus_env,
  col = env_colors,
  maxnl = length(names(quercus_env)),
  nc = 4
  )

Example Usage

In this example we use quercus to identify the environmental conditions that distinguish oak species using recursive partition trees.

Multicollinearity Filtering

Many predictors in this dataset are correlated (e.g., bio5, bio6, bio7 are all derived from temperature extremes). The collinear::collinear() function from the R package collinear selects a non-redundant subset that maximizes predictive power for the response variable.

quercus_selection <- collinear::collinear(
  df = quercus,
  responses = "species",
  predictors = quercus_predictors,
  f = collinear::f_categorical_rf,
  quiet = TRUE
)

The function generates a vector with selected predictor names…

quercus_selection$species$selection
#> [1] "sun_rad_minimum"       "bio14"                 "bio4"                 
#> [4] "landcover_veg_bare"    "ndvi_maximum"          "topographic_diversity"
#> [7] "bio16"                 "human_footprint"       "landcover_veg_tree"   
#> attr(,"validated")
#> [1] TRUE

… and a classification formula ready for use.

quercus_selection$species$formulas$classification
#> species ~ sun_rad_minimum + bio14 + bio4 + landcover_veg_bare + 
#>     ndvi_maximum + topographic_diversity + bio16 + human_footprint + 
#>     landcover_veg_tree
#> <environment: 0x563b5ac97908>

Decision Tree Model

Species occurrence counts are imbalanced — Q. robur has 1,660 records while Q. suber has only 58. Without correction, rare species get absorbed into leaves dominated by common ones. To mitigate this issue, we use the collinear::case_weights() function to generate inverse-frequency weights so that each species contributes equally to the tree splits.

quercus_rpart <- rpart::rpart(
  formula = quercus_selection$species$formulas$classification,
  data = sf::st_drop_geometry(quercus),
  method = "class",
  weights = collinear::case_weights(x = quercus$species)
)

The decision tree below shows which environmental variables and thresholds best separate oak species. Each split represents a condition (e.g., “bio6 < -2.3”), and each leaf shows the most likely species at that combination of conditions.

rpart.plot::rpart.plot(
  quercus_rpart,
  type = 0,
  extra = 8,
  fallen.leaves = TRUE,
  roundint = FALSE,
  cex = 0.7,
  box.palette = as.list(quercus_colors),
  main = "Decision Tree: European Oak Species Classification"
)

The figure shows how different combinations of environmental conditions lead to the dominance of a different Quercus species in each terminal node.

Spatial Prediction

The prediction of the model quercus_rpart over quercus_env generates a spatRaster where each layer represents the presence probability for a single species.

quercus_probability <- predict(
  object = quercus_env,
  model = quercus_rpart,
  type = "prob",
  na.rm = TRUE
)

plot(
  x = quercus_probability,
  col = env_colors
  )

We can extract the class with the highest probability on each cell.

quercus_class <- terra::which.max(
  x = quercus_probability
)
plot(
  x = quercus_class, 
  col = quercus_colors
  )

To facilitate the identification of class labels, we need to add category names with terra::categories() and colors per category using terra::coltab().

quercus_categories_df <- data.frame(
  ID = seq_along(quercus_colors),
  category = names(quercus_colors)
)

communities_colors_df <- data.frame(
  value = seq_along(quercus_colors),
  col = quercus_colors
)

levels(quercus_class) <- quercus_categories_df
terra::coltab(quercus_class) <- communities_colors_df

Now we can replot the map with the correct class labels.

terra::plet(quercus_class)

Notice how binary splits in rpart generate box-like artifacts in the map, which represent specific map splits. For example, the horizontal limit between Quercus cerris and Quercus robur corresponds to the split sun_rad_minimum >= 773. Recursive partition trees are fast to train, and really good at offering a simple and intuitive explanation on why these species are where they are, but is not the best possible method to generate accurate spatial predictions.

Conclusion

The decision tree above gives interpretable climate rules for each oak species, but it is just one entry point into quercus. More flexible methods like random forests or gradient boosting can push accuracy further while the multi-class structure and 31 predictors keep the dataset interesting.