Skip to contents

Overview

The quercus dataset contains presence and absence records of eight European Quercus (oak) species combined with 31 environmental predictor variables covering bioclimatic, topographic, vegetation, and human impact dimensions.

Its main features are:

  • Complete case study - No missing values, clean data ready for analysis.
  • Multi-class classification - Eight oak species plus absence points (9 levels).
  • European scope - Records spanning western, central, and southern Europe.
  • Ecological predictors - Climate, topography, vegetation indices, land cover, and human footprint.

We built this dataset to support a range of analytical approaches, such as binary and multi-class classification, niche modelling, and niche overlap analysis.

Setup

The following R libraries are required to run this tutorial:

The code below loads the example data.

data(
  quercus,
  quercus_response,
  quercus_predictors,
  package = "spatialData"
)

Data Structure

The dataset is an sf data frame with 6728 rows and 33 columns, and no missing data. The first 10 records and all columns but geometry are shown below.

quercus |> 
  sf::st_drop_geometry() |> 
  head() |> 
  dplyr::glimpse()
#> Rows: 6
#> Columns: 32
#> $ species               <chr> "Quercus robur", "Quercus robur", "Quercus petra…
#> $ bio1                  <int> 75, 75, 110, 58, 82, 157
#> $ bio10                 <int> 163, 149, 179, 158, 164, 250
#> $ bio11                 <int> -20, 5, 41, -36, 2, 73
#> $ bio12                 <int> 740, 818, 1006, 587, 582, 510
#> $ bio13                 <int> 106, 96, 105, 73, 67, 69
#> $ bio14                 <int> 34, 43, 61, 28, 30, 4
#> $ bio15                 <int> 40, 27, 13, 33, 23, 51
#> $ bio16                 <int> 294, 279, 296, 210, 194, 178
#> $ bio17                 <int> 107, 138, 212, 92, 103, 29
#> $ bio18                 <int> 294, 205, 212, 160, 194, 29
#> $ bio19                 <int> 115, 162, 285, 118, 125, 172
#> $ bio2                  <int> 90, 65, 103, 64, 68, 124
#> $ bio3                  <int> 31, 28, 41, 22, 27, 37
#> $ bio4                  <int> 7116, 5759, 5368, 7617, 6366, 6904
#> $ bio5                  <int> 227, 193, 249, 206, 213, 353
#> $ bio6                  <int> -62, -35, 0, -74, -31, 22
#> $ bio7                  <int> 289, 228, 249, 280, 244, 331
#> $ topographic_diversity <int> 93, 12, 43, 8, 11, 55
#> $ human_footprint       <dbl> 42.33, 52.65, 32.92, 26.68, 40.08, 36.03
#> $ landcover_veg_bare    <dbl> 0.36, 0.25, 0.00, 0.08, 0.30, 2.40
#> $ landcover_veg_herb    <dbl> 58.85, 91.73, 71.67, 44.06, 53.94, 87.12
#> $ landcover_veg_tree    <dbl> 40.69, 8.02, 28.25, 55.78, 35.20, 10.47
#> $ ndvi_average          <dbl> 0.63, 0.63, 0.73, 0.50, 0.38, 0.48
#> $ ndvi_maximum          <dbl> 0.79, 0.77, 0.80, 0.83, 0.53, 0.70
#> $ ndvi_minimum          <dbl> 0.44, 0.45, 0.63, 0.06, 0.22, 0.27
#> $ ndvi_range            <dbl> 0.36, 0.33, 0.17, 0.77, 0.31, 0.42
#> $ sun_rad_average       <dbl> 4164.84, 3552.85, 4480.35, 3376.14, 3787.15, 511…
#> $ sun_rad_maximum       <dbl> 7515.76, 7189.52, 7601.16, 7099.44, 7301.63, 776…
#> $ sun_rad_minimum       <dbl> 897.09, 356.39, 1271.03, 230.87, 552.49, 2118.26
#> $ sun_rad_range         <dbl> 6618.67, 6833.12, 6330.14, 6868.57, 6749.14, 564…
#> $ topo_slope            <dbl> 3.10, 0.39, 1.25, 0.28, 0.43, 1.88

Response Variable

The response variable is species, a categorical variable with 9 levels (8 species + absence):

quercus |> 
  sf::st_drop_geometry() |> 
  dplyr::group_by(species) |> 
  dplyr::summarise(
    n = dplyr::n()
  ) |> 
  dplyr::arrange(
    dplyr::desc(n)
    )
#> # A tibble: 9 × 2
#>   species               n
#>   <chr>             <int>
#> 1 absence            1899
#> 2 Quercus robur      1660
#> 3 Quercus petraea    1445
#> 4 Quercus ilex        627
#> 5 Quercus cerris      394
#> 6 Quercus faginea     287
#> 7 Quercus pubescens   225
#> 8 Quercus pyrenaica   133
#> 9 Quercus suber        58

Predictor Variables

The dataset includes 31 predictor variables organized into six categories:

Category Variables Source
Bioclimatic (17) bio1, bio2, …, bio19 (excluding bio8, bio9) WorldClim
NDVI (4) ndvi_average, ndvi_maximum, ndvi_minimum, ndvi_range MODIS
Solar radiation (4) sun_rad_average, sun_rad_maximum, sun_rad_minimum, sun_rad_range WorldClim
Land cover (3) landcover_veg_bare, landcover_veg_herb, landcover_veg_tree MODIS MOD44B
Topographic (2) topo_slope, topographic_diversity SRTM
Human impact (1) human_footprint Venter et al. 2016

Interactive Map

The map below shows the spatial distribution of oak species and absence points across Europe:

mapview::mapview(
  quercus,
  zcol = "species",
  layer.name = "Species"
)

Example Usage

In this example we use quercus to identify the environmental conditions that distinguish oak species using recursive partition trees.

Multicollinearity Filtering

Many predictors in this dataset are correlated (e.g., bio5, bio6, bio7 are all derived from temperature extremes). The collinear::collinear() function from the R package collinear selects a non-redundant subset that maximizes predictive power for the response variable.

quercus_selection <- collinear::collinear(
  df = quercus,
  responses = "species",
  predictors = quercus_predictors,
  f = collinear::f_categorical_rf,
  quiet = TRUE
)

quercus_selection$species$selection
#> [1] "sun_rad_minimum"       "bio14"                 "bio4"                 
#> [4] "landcover_veg_bare"    "ndvi_maximum"          "topographic_diversity"
#> [7] "bio16"                 "human_footprint"       "landcover_veg_tree"   
#> attr(,"validated")
#> [1] TRUE

Decision Tree Model

Species occurrence counts are imbalanced — Q. robur has 1,660 records while Q. suber has only 58. Without correction, rare species get absorbed into leaves dominated by common ones. To fix this, we use the collinear::case_weights() function, which generates inverse-frequency weights so that each species contributes equally to the tree splits.

quercus_rpart <- rpart::rpart(
  formula = quercus_selection$species$formulas$classification,
  data = sf::st_drop_geometry(quercus),
  method = "class",
  weights = collinear::case_weights(x = quercus$species)
)

The decision tree below shows which environmental variables and thresholds best separate oak species. Each split represents a condition (e.g., “bio6 < -2.3”), and each leaf shows the most likely species at that combination of conditions.

rpart.plot::rpart.plot(
  quercus_rpart,
  type = 0,
  extra = 8,
  fallen.leaves = TRUE,
  roundint = FALSE,
  cex = 0.7,
  box.palette = as.list(viridis::turbo(n = 9, alpha = 0.5)),
  main = "Decision Tree: European Oak Species Classification"
)

The figure shows how different combinations of environmental conditions lead to the dominance of a different Quercus species.

Prediction Map

Assigning predictions back to the spatial data reveals where each species is most likely to occur according to the recursive partition model.

quercus$predicted <- predict(
  object = quercus_rpart,
  newdata = sf::st_drop_geometry(quercus),
  type = "class"
)

mapview::mapview(
  quercus,
  zcol = "predicted",
  layer.name = "Predicted species",
  col.regions = viridis::turbo(
    n = length(unique(quercus$predicted))
  )
)

Species Probability Map

The package provides the raster dataset quercus_env, downloaded via spatialData::quercus_extra(), useful to generate continuous predictions from models trained with the quercus data.

quercus_env <- spatialData::quercus_extra()
#> spatialData::quercus_extra(): Downloading file 'quercus_env.tif'.
class(quercus_env)
#> [1] "SpatRaster"
#> attr(,"package")
#> [1] "terra"
names(quercus_env)
#>  [1] "bio1"                  "bio10"                 "bio11"                
#>  [4] "bio12"                 "bio13"                 "bio14"                
#>  [7] "bio15"                 "bio16"                 "bio17"                
#> [10] "bio18"                 "bio19"                 "bio2"                 
#> [13] "bio3"                  "bio4"                  "bio5"                 
#> [16] "bio6"                  "bio7"                  "human_footprint"      
#> [19] "landcover_veg_bare"    "landcover_veg_herb"    "landcover_veg_tree"   
#> [22] "ndvi_average"          "ndvi_maximum"          "ndvi_minimum"         
#> [25] "ndvi_range"            "sun_rad_average"       "sun_rad_maximum"      
#> [28] "sun_rad_minimum"       "sun_rad_range"         "topo_slope"           
#> [31] "topographic_diversity"
plot(
  x = quercus_env[["bio1"]],
  col = viridis::turbo(n = 100)
  )

The map below shows the fitted probability of Quercus petraea across Europe.

quercus_probability <- predict(
  object = quercus_env,
  model = quercus_rpart,
  type = "prob",
  na.rm = TRUE
)

plot(
  x = quercus_probability[["Quercus.petraea"]],
  col = viridis::viridis(n = 5)
  )

Conclusion

The decision tree above gives interpretable climate rules for each oak species — but it is just one entry point into quercus. More flexible methods like random forests or gradient boosting can push accuracy further while the multi-class structure and 31 predictors keep the dataset interesting.