Skip to contents

Mediterranean mountain ecosystems are among the most biodiverse and climate-sensitive environments in Europe. Projecting how their plant communities respond to future climate change requires models that can handle multi-class responses and be projected across climate scenarios — which is what this dataset is built for.

Overview

The communities dataset contains spatially-thinned plant community records from the Sierra Nevada mountain range in SE Spain, compiled from the official vegetation map at 1:10,000 scale. It is designed to support multi-class classification, binary presence-absence modelling, and ecological niche analysis using climate and topographic predictors.

Its main features are:

  • Sierra Nevada scope: 7,300 point records covering the Sierra Nevada mountain range (SE Spain, EPSG:25830).
  • Multi-class and binary responses: a community column with 6 levels ("none" + 5 named plant communities) and 5 binary presence-absence columns for individual community types.
  • Climate and topographic predictors: 9 numeric predictors spanning seasonal temperature extremes, rainfall, and terrain attributes.
  • Temporal climate scenario rasters: associated rasters for baseline (2010), 2050, and 2100 climate scenarios are available via download functions.

Setup

The following R libraries and example data are required to run this tutorial.

data(
  communities,
  communities_responses,
  communities_predictors,
  package = "spatialData"
)

Description

The dataset is an sf data frame with 6747 rows and 16 columns. The first 10 records (geometry dropped) are shown below.

communities |>
  sf::st_drop_geometry() |>
  head(10) |>
  dplyr::glimpse()
#> Rows: 10
#> Columns: 15
#> $ pyrenean_oak              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
#> $ juniper_shrubland         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
#> $ pinus_forest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
#> $ alpine_pastures           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
#> $ holm_oak                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
#> $ community                 <fct> Pyrenean oak forests, Pyrenean oak forests, 
#> $ max_temperature_summer    <dbl> 28.88643, 28.50699, 28.25909, 28.50201, 28.0…
#> $ max_temperature_winter    <dbl> 9.457385, 8.020030, 7.829192, 7.315129, 7.34…
#> $ min_temperature_summer    <dbl> 13.15320, 16.14400, 15.84599, 15.66309, 15.6…
#> $ min_temperature_winter    <dbl> -1.49025047, 0.11102799, -0.06090841, -0.615…
#> $ northness                 <dbl> 68.19292, 26.15096, 42.63157, 27.40485, 31.7…
#> $ rainfall_summer           <dbl> 41.79313, 43.88976, 44.59952, 45.00000, 45.5…
#> $ rainfall_winter           <dbl> 193.7088, 196.4996, 221.5068, 193.7104, 210.…
#> $ slope                     <dbl> 30.84319, 27.24990, 36.85854, 27.38869, 34.1…
#> $ topographic_wetness_index <dbl> 5.126790, 6.500056, 4.380448, 5.770209, 7.54…

Response Variables

The dataset contains 6 response variables (named in communities_responses):

Variable Type Description
community factor (6 levels) Multi-class community label: one of 5 named community types, and "none" to represent absence of the target communities
pyrenean_oak binary integer Presence (1) / absence (0) of Pyrenean oak forests
juniper_shrubland binary integer Presence (1) / absence (0) of juniper-broom shrublands
pinus_forest binary integer Presence (1) / absence (0) of Pinus forests
alpine_pastures binary integer Presence (1) / absence (0) of alpine pastures
holm_oak binary integer Presence (1) / absence (0) of holm oak forests

You can use the binary responses for presence-absence modelling, or community for multi-class classification.

The overall number of samples per community type is shown below. The none community refers to samples with absences for all communities.

communities |> 
  sf::st_drop_geometry() |> 
  dplyr::group_by(community) |> 
  dplyr::summarize(
    samples = dplyr::n()
  )
#> # A tibble: 6 × 2
#>   community                samples
#>   <fct>                      <int>
#> 1 none                        2028
#> 2 Holm oak forests             847
#> 3 Pyrenean oak forests         228
#> 4 Pinus forests               1589
#> 5 Juniper-broom shrublands    1789
#> 6 Alpine pastures              266

The map below shows all records coloured by communities$community.

communities_colors <- stats::setNames(
  viridis::turbo(n = length(levels(communities$community))),
  levels(communities$community)
)

mapview::mapview(
  communities,
  zcol = "community",
  layer.name = "community",
  col.regions = communities_colors,
  color = NULL
)

Predictor Variables

The dataset contains 9 numeric predictors (named in communities_predictors) covering two categories:

Variable Category Description
max_temperature_summer Climate Maximum summer temperature (°C)
max_temperature_winter Climate Maximum winter temperature (°C)
min_temperature_summer Climate Minimum summer temperature (°C)
min_temperature_winter Climate Minimum winter temperature (°C)
rainfall_summer Climate Summer rainfall (mm)
rainfall_winter Climate Winter rainfall (mm)
northness Topography Northness index (cosine of aspect, −1 to 1)
slope Topography Terrain slope (degrees)
topographic_wetness_index Topography Topographic wetness index

The VIF analysis below confirms that temperature predictors are collinear — worth filtering before modelling.

collinear::vif_df(
  df = communities,
  predictors = communities_predictors,
  quiet = TRUE
)
#>        vif                 predictor
#> 1 796.7677    max_temperature_summer
#> 2 369.9279    max_temperature_winter
#> 3 198.2054    min_temperature_summer
#> 4  30.3419    min_temperature_winter
#> 5  12.0304                 northness
#> 6   5.6303           rainfall_summer
#> 7   4.0429           rainfall_winter
#> 8   2.2412                     slope
#> 9   1.2312 topographic_wetness_index

The classes in communities$community follow clear gradients for the climatic predictors, as evidenced by the boxplots below.

communities |> 
  tidyr::pivot_longer(
    cols = dplyr::all_of(communities_predictors),
    names_to = "predictor",
    values_to = "value"
  ) |> 
  dplyr::mutate(
    predictor = factor(predictor, levels = c(
      "min_temperature_winter",
      "max_temperature_winter",
      "min_temperature_summer",
      "max_temperature_summer",
      "rainfall_summer",
      "rainfall_winter",
      "slope",
      "topographic_wetness_index",
      "northness"
    ))
  ) |>
  ggplot2::ggplot() + 
  ggplot2::aes(
    x = value,
    y = community,
    fill = community
  ) + 
  ggplot2::facet_wrap(~predictor, scales = "free_x") + 
  ggplot2::geom_boxplot(notch = TRUE, varwidth = TRUE) + 
  ggplot2::scale_fill_viridis_d(option = "turbo") +
  ggplot2::theme_bw() + 
  ggplot2::theme(
    legend.position = "none",
    axis.text.y = ggplot2::element_text(size = 12)
    ) + 
  ggplot2::labs(
    y = "", 
    x = "Predictor Value"
    )

Raster Layers

The communities dataset also has associated environmental rasters for present and future climate scenarios. These raster layers can be downloaded using the dedicated functions shown below.

communities_2010 <- spatialData::communities_extra_2010()
communities_2050 <- spatialData::communities_extra_2050()
communities_2100 <- spatialData::communities_extra_2100()

These functions download the data as multi-band GeoTIFF files.

file.exists("communities_2010.tif")
#> [1] TRUE
file.exists("communities_2050.tif")
#> [1] TRUE
file.exists("communities_2100.tif")
#> [1] TRUE

The files are loaded into R as SpatRaster objects, managed by the R package terra.

The raster variables have the same names and units as the predictors in communities.

The raster layers with climate and topographic conditions for the 2010 baseline in the area of interest are shown below.

plot(communities_2010)

Example

In this example we model the spatial distribution of plant communities in Sierra Nevada via Random Forest classification, and project the model across the 2010 baseline and the 2050 and 2100 future climate scenarios.

With the data characterised, let’s train a multi-class random forest.

Model Training

In this section, we train a probability random forest model with ranger::ranger(). To give each class in community equal influence, we use collinear::case_weights() to assign inverse-frequency case weights per community, so that rare communities are up-weighted relative to common ones.

#remove geometry column
training_df <-  sf::st_drop_geometry(communities)

#train random forest model
communities_rf <- ranger::ranger(
  y = training_df$community,
  x = training_df[, communities_predictors],
  probability = TRUE,
  importance = "permutation",
  num.trees = 1000,
  min.node.size = 30,
  seed = 1,
  case.weights = collinear::case_weights(x = training_df$community)
)

The model output is a list of class ranger containing several objects of interest.

names(communities_rf)
#>  [1] "predictions"               "num.trees"                
#>  [3] "num.independent.variables" "mtry"                     
#>  [5] "min.node.size"             "variable.importance"      
#>  [7] "prediction.error"          "forest"                   
#>  [9] "splitrule"                 "treetype"                 
#> [11] "call"                      "importance.mode"          
#> [13] "num.samples"               "replace"                  
#> [15] "max.depth"

The three most relevant output objects are described below:

  • predictions: Holds the out-of-bag class probabilities. Each prediction is generated by the trees that were not trained on that observation. These predictions help assess model performance in a way that is not as rigorous as spatial cross-validation, but is sufficient for our purposes here.
head(communities_rf$predictions)
#>            none Holm oak forests Pyrenean oak forests Pinus forests
#> [1,] 0.02395833       0.73130805            0.1910958    0.05363777
#> [2,] 0.12019704       0.28399015            0.3571429    0.23866995
#> [3,] 0.02535273       0.02116402            0.7985263    0.15495693
#> [4,] 0.12814876       0.07762973            0.4261738    0.32200690
#> [5,] 0.01785714       0.08888889            0.6915025    0.13825944
#> [6,] 0.02416264       0.12928571            0.4849132    0.04563583
#>      Juniper-broom shrublands Alpine pastures
#> [1,]               0.00000000               0
#> [2,]               0.00000000               0
#> [3,]               0.00000000               0
#> [4,]               0.04604087               0
#> [5,]               0.06349206               0
#> [6,]               0.31600263               0
  • prediction.error: The Brier score of the model, representing the mean squared error between predicted probabilities and actual outcomes.
communities_rf$prediction.error
#> [1] 0.2133352
  • variable.importance: Permutation importance of each predictor, expressed as the increase in model error on the out-of-bag data when the predictor is permuted. Important predictors show higher scores.
#permutation importance
communities_rf$variable.importance |> 
  sort() |> 
  rev() |> 
  round(digits = 3)
#>    max_temperature_winter    min_temperature_winter           rainfall_summer 
#>                     0.161                     0.119                     0.106 
#>           rainfall_winter    max_temperature_summer    min_temperature_summer 
#>                     0.065                     0.060                     0.057 
#>                     slope                 northness topographic_wetness_index 
#>                     0.017                     0.011                     0.003

These scores represent how well a predictor helps the model discriminate between categories from a statistical standpoint rather than from an ecological one.

Model Assessment

This model has a Brier score of 0.213, which is reasonably good, indicating that the model is producing well-calibrated class probabilities.

However, this score says nothing about per-class performance. To understand how well the model predicts each class, we first need to transform the out-of-bag class probabilities to class predictions.

The code below extracts the name of the column in communities_rf$predictions with the highest probability per observation.

#case index of the column with the maximum probability
max_probability_column <- max.col(communities_rf$predictions)

#name of the column with the maximum probability
predicted_class <- colnames(communities_rf$predictions)[max_probability_column]

head(predicted_class)
#> [1] "Holm oak forests"     "Pyrenean oak forests" "Pyrenean oak forests"
#> [4] "Pyrenean oak forests" "Pyrenean oak forests" "Pyrenean oak forests"

Now we use the observed and predicted data to build a confusion matrix showing model performance per class in community.

#generate confusion matrix
confusion_matrix <- table(
  #WARNING! if factor, must be converted to character!
  observed = as.character(training_df$community),
  predicted = predicted_class
) |> 
  as.matrix()

#abbreviate dimension names
colnames(confusion_matrix) <-
  rownames(confusion_matrix) <-
    abbreviate(
      colnames(confusion_matrix),
      minlength = 8
    )

confusion_matrix
#>           predicted
#> observed   Alpnpstr Hlmokfrs Jnpr-brs none Pnsfrsts Pyrnnokf
#>   Alpnpstr      262        0        4    0        0        0
#>   Hlmokfrs        0      631        3   40      102       71
#>   Jnpr-brs       29        1     1597    2      152        8
#>   none           27      185      129 1450      157       80
#>   Pnsfrsts        0      242      159   96     1028       64
#>   Pyrnnokf        0       37       11   10        9      161

Below, we use the diagonal of the confusion matrix to compute the model’s overall recall (sensitivity or true positive rate), and the matrix rows to compute the per-class recall.

sum(diag(confusion_matrix)) / sum(confusion_matrix)
#> [1] 0.7601897
sapply(
  X = colnames(confusion_matrix),
  FUN = function(x){
    confusion_matrix[x, x] / sum(confusion_matrix[x, ])
  }
) |> 
  sort() |> 
  round(digits = 3)
#>            Pinus forests     Pyrenean oak forests                     none 
#>                    0.647                    0.706                    0.715 
#>         Holm oak forests Juniper-broom shrublands          Alpine pastures 
#>                    0.745                    0.893                    0.985

The model achieves above-random recall for all communities, and is particularly accurate at discriminating juniper-broom shrublands and alpine pastures.

Now let’s project the trained model across climate scenarios.

Spatial Predictions

The function terra::predict() applies the trained random forest model to a SpatRaster object whose layers share the same predictor names and units used during training.

probabilities_2010 <- terra::predict(
  object = communities_2010,
  model  = communities_rf,
  na.rm  = TRUE
)

probabilities_2050 <- terra::predict(
  object = communities_2050,
  model  = communities_rf,
  na.rm  = TRUE
)

probabilities_2100 <- terra::predict(
  object = communities_2100,
  model  = communities_rf,
  na.rm  = TRUE
)

The output has one probability layer per class.

plot(
  probabilities_2010, 
  nc = 2, 
  axes = NULL,
  col = viridis::turbo(n = 100)
  )

To transform these probabilities into class predictions, we apply terra::which.max() to retrieve the index of the layer with the maximum probability.

prediction_2010 <- terra::which.max(
  x = probabilities_2010
)

prediction_2050 <- terra::which.max(
  x = probabilities_2050
)

prediction_2100 <- terra::which.max(
  x = probabilities_2100
)

The output is a raster in which each pixel’s value is the index of the layer with the maximum predicted probability.

plot(prediction_2010, col = communities_colors)

To facilitate the plotting of these categorical maps, first we add category names via terra::categories() and a categories data frame, and second we add specific colors per category using terra::coltab().

#categories dataframe
categories_df <- data.frame(
  ID = seq_along(communities_colors),
  category = levels(communities$community)
)

#colors dataframe
colors_df <- data.frame(
  value = seq_along(communities_colors),
  col = communities_colors
)


#assign categories
levels(prediction_2010) <- categories_df
levels(prediction_2050) <- categories_df
levels(prediction_2100) <- categories_df

#assign colors
terra::coltab(prediction_2010) <- colors_df
terra::coltab(prediction_2050) <- colors_df
terra::coltab(prediction_2100) <- colors_df

Let’s take a look at the resulting maps.

terra::plet(prediction_2010)
terra::plet(prediction_2050)
terra::plet(prediction_2100)

These maps are visually compelling, but it is important to think about their meaning. These projections illustrate how the model’s predictions shift under different climate scenarios, but they should not be interpreted as forecasts of future vegetation. Many processes that govern vegetation dynamics are not captured here, including species dispersal limitations, biotic interactions, and community resilience or adaptation. The maps represent the climatic envelope of each community under different climatic conditions as learnt from the observations in communities, and assume that communities can track climate change instantaneously and without constraint.

Model Analysis

The number of pixels per plant community over time can be converted to hectares and plotted together to get a sense of how much the climate envelope of each community is going to shrink or expand in the future.

area_df <- dplyr::bind_rows(
  terra::freq(prediction_2010) |> dplyr::mutate(year = 2010),
  terra::freq(prediction_2050) |> dplyr::mutate(year = 2050),
  terra::freq(prediction_2100) |> dplyr::mutate(year = 2100)
) |>
  dplyr::rename(community = value) |>
  #resolution is 100x100 meters = 1 ha per cell
  dplyr::mutate(hectares = count) |>
  dplyr::select(year, community, hectares) |> 
  dplyr::filter(community != "none")

ggplot2::ggplot(
  data = area_df,
  ggplot2::aes(
    x = year,
    y = hectares,
    colour = community,
    group  = community
  )
) +
  ggplot2::geom_line(linewidth = 1) +
  ggplot2::geom_point(size = 3) +
  ggplot2::scale_colour_manual(values = communities_colors) +
  ggplot2::scale_x_continuous(breaks = c(2010, 2050, 2100)) +
  ggplot2::scale_y_continuous(labels = scales::comma) +
  ggplot2::labs(
    x = "Year",
    y = "Area (hectares)",
    colour = "Community"
  ) +
  ggplot2::theme_bw()

Rather than actual community surface at a given time, these curves can be interpreted as a proxy for climatic stress: the smaller the community’s predicted area with respect to the initial area, the higher the climate stress the community may undergo.

We can also obtain a proxy for biotic stress by assessing the fraction of the original area of each community that will become suitable for another community.

#stack together projections for 2010 and 2100
transition_df <- terra::as.data.frame(
  x = c(prediction_2010, prediction_2050), 
  na.rm = TRUE
  ) |>
  setNames(c("communities_2010", "communities_2050")) |>
  #count pixels transitioning from one community to another
  dplyr::count(
    from = communities_2010, 
    to = communities_2050, 
    name = "hectares"
    ) |>
  dplyr::group_by(from) |>
  dplyr::mutate(
    transition_percent = round(
      x = hectares * 100 / sum(hectares), 
      digits = 1
      )
    ) |>
  dplyr::select(-hectares) |> 
  dplyr::ungroup() |> 
  tidyr::pivot_wider(
    names_from = "to",
    values_from = "transition_percent",
    values_fill = 0
  ) |> 
  dplyr::filter(
    from != "none"
  ) |> 
  dplyr::rename(
    Community = from
  )

knitr::kable(transition_df)
Community none Juniper-broom shrublands Holm oak forests Pinus forests Alpine pastures
Holm oak forests 100.0 0.0 0.0 0.0 0.0
Pyrenean oak forests 99.9 0.0 0.1 0.0 0.0
Pinus forests 92.0 0.0 2.5 5.5 0.0
Juniper-broom shrublands 75.7 5.3 1.2 17.8 0.0
Alpine pastures 12.8 87.1 0.0 0.0 0.1

The transition matrix shows the source community (year 2010) in the first column, the resulting community as columns, and the values represent the percentage of the original area of the source community becoming suitable for a different community in 2050.

For example, the 87.1% of the area of Alpine pastures transitioning to Juniper-broom shrublands can be interpreted as an indicator of biological pressure. Future conditions may facilitate the upwards migration of Juniper-broom species, resulting in a heightened competition with pasture species, leading to the eventual replacement of this community.

Conclusion

The communities dataset offers a rich testbed for multi-class classification and ecological niche modelling in a real Mediterranean mountain system. The random forest workflow above demonstrates how climate gradients help discriminate plant communities, and how a single trained multi-class model can be projected across climate scenarios. The area analysis further highlights which communities are projected to expand or contract under future climate scenarios, while underscoring the importance of interpreting such projections with caution.