Skip to contents

Source Code

You can download the Rmarkdown notebook used to render this article here. To download the file, use the button Download raw file on the upper-right hand of the Code panel.

Overview

The communities dataset contains plant community records from the Sierra Nevada mountain range in SE Spain. The data was compiled from the official vegetation map at 1:10,000 scale. It is designed to support multi-class classification, binary presence-absence modelling, and ecological niche analysis using climate and topographic predictors.

Its main features are:

  • Scope: 7,300 point records covering the Sierra Nevada mountain range (SE Spain, EPSG:25830).
  • Multi-class and binary responses: a community column with 6 levels ("none" + 5 named plant communities) and 5 binary presence-absence columns for individual community types.
  • Climate and topographic predictors: 9 numeric predictors spanning seasonal temperature extremes, rainfall, and terrain attributes.
  • Temporal climate scenario rasters: associated rasters for baseline (2010), 2050, and 2100 climate scenarios are available via download functions.

Setup

The code below loads the R packages and example data required for this tutorial.

The following R libraries and example data are required to run this tutorial.

data(
  communities,
  communities_responses,
  communities_predictors,
  package = "spatialData"
)

communities_colors <- grDevices::palette.colors(
  n = length(levels(communities$community))
  ) |> 
  stats::setNames(
    levels(communities$community)
    )

env_colors <- grDevices::hcl.colors(
  n = 100,
  palette = "Zissou 1"
)

Data Structure

The dataset is an sf data frame with 6747 rows and 16 columns. The first 10 records (geometry dropped) are shown below.

communities |>
  sf::st_drop_geometry() |>
  head(10) |>
  dplyr::glimpse()
#> Rows: 10
#> Columns: 15
#> $ pyrenean_oak              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
#> $ juniper_shrubland         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
#> $ pinus_forest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
#> $ alpine_pastures           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
#> $ holm_oak                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
#> $ community                 <fct> Pyrenean oak forests, Pyrenean oak forests, 
#> $ max_temperature_summer    <dbl> 28.88643, 28.50699, 28.25909, 28.50201, 28.0…
#> $ max_temperature_winter    <dbl> 9.457385, 8.020030, 7.829192, 7.315129, 7.34…
#> $ min_temperature_summer    <dbl> 13.15320, 16.14400, 15.84599, 15.66309, 15.6…
#> $ min_temperature_winter    <dbl> -1.49025047, 0.11102799, -0.06090841, -0.615…
#> $ northness                 <dbl> 68.19292, 26.15096, 42.63157, 27.40485, 31.7…
#> $ rainfall_summer           <dbl> 41.79313, 43.88976, 44.59952, 45.00000, 45.5…
#> $ rainfall_winter           <dbl> 193.7088, 196.4996, 221.5068, 193.7104, 210.…
#> $ slope                     <dbl> 30.84319, 27.24990, 36.85854, 27.38869, 34.1…
#> $ topographic_wetness_index <dbl> 5.126790, 6.500056, 4.380448, 5.770209, 7.54…

The dataset contains 6 response variables (named in communities_responses):

Variable Type Description
community factor (6 levels) Multi-class community label: one of 5 named community types, and "none" to represent absence of the target communities
pyrenean_oak binary integer Presence (1) / absence (0) of Pyrenean oak forests
juniper_shrubland binary integer Presence (1) / absence (0) of juniper-broom shrublands
pinus_forest binary integer Presence (1) / absence (0) of Pinus forests
alpine_pastures binary integer Presence (1) / absence (0) of alpine pastures
holm_oak binary integer Presence (1) / absence (0) of holm oak forests

You can use the binary responses for presence-absence modelling, or community for multi-class classification.

The overall number of samples per community type is shown below. The none community refers to samples with absences for all communities.

communities |> 
  sf::st_drop_geometry() |> 
  dplyr::group_by(community) |> 
  dplyr::summarize(
    samples = dplyr::n()
  )
#> # A tibble: 6 × 2
#>   community                samples
#>   <fct>                      <int>
#> 1 none                        2028
#> 2 Holm oak forests             847
#> 3 Pyrenean oak forests         228
#> 4 Pinus forests               1589
#> 5 Juniper-broom shrublands    1789
#> 6 Alpine pastures              266

The map below shows all records coloured by communities$community.

mapview::mapview(
  communities,
  zcol = "community",
  layer.name = "community",
  col.regions = communities_colors,
  color = NULL
)

The dataset contains 9 numeric predictors (named in communities_predictors) covering two categories:

Variable Category Description
max_temperature_summer Climate Maximum summer temperature (°C)
max_temperature_winter Climate Maximum winter temperature (°C)
min_temperature_summer Climate Minimum summer temperature (°C)
min_temperature_winter Climate Minimum winter temperature (°C)
rainfall_summer Climate Summer rainfall (mm)
rainfall_winter Climate Winter rainfall (mm)
northness Topography Northness index (cosine of aspect, −1 to 1)
slope Topography Terrain slope (degrees)
topographic_wetness_index Topography Topographic wetness index

The communities dataset also has associated environmental rasters for present and future climate scenarios. These raster layers can be downloaded using the dedicated functions shown below.

communities_2010 <- spatialData::communities_extra_2010()
communities_2050 <- spatialData::communities_extra_2050()
communities_2100 <- spatialData::communities_extra_2100()

These functions download the data as multi-band GeoTIFF files. The files are loaded into R as spatRaster objects, managed by the R package terra. The plot below shows communities_2010.

plot(
  x = communities_2010,
  col = env_colors
  )

Example Usage

In this example we model the spatial distribution of plant communities in Sierra Nevada via Random Forest classification, and project the model across the 2010 baseline and the 2050 and 2100 future climate scenarios.

Model Training

We are going to train a probability random forest model with ranger::ranger(). This type of model returns the probability of a case to belong to each class.

To ensure that we give each class in community equal influence, we use collinear::case_weights() to assign inverse-frequency case weights per community, so that rare communities are up-weighted relative to common ones.

communities_model <- ranger::ranger(
  dependent.variable.name = "community",
  data = communities |> 
    sf::st_drop_geometry() |> 
    dplyr::select(
      dplyr::all_of(
        c("community", communities_predictors)
      )
    ),
  probability = TRUE,
  importance = "permutation",
  num.trees = 500,
  min.node.size = 30,
  seed = 1,
  case.weights = collinear::case_weights(
    x = communities$community
    )
  )

The model output is a list of class ranger containing several objects of interest.

names(communities_model)
#>  [1] "predictions"               "num.trees"                
#>  [3] "num.independent.variables" "mtry"                     
#>  [5] "min.node.size"             "variable.importance"      
#>  [7] "prediction.error"          "forest"                   
#>  [9] "splitrule"                 "treetype"                 
#> [11] "call"                      "importance.mode"          
#> [13] "num.samples"               "replace"                  
#> [15] "dependent.variable.name"   "max.depth"

The three most relevant output objects are described below:

  • predictions: Holds the out-of-bag class probabilities. Each prediction is generated by the trees that were not trained on that observation. These predictions help assess model performance in a way that is not as rigorous as spatial cross-validation, but is sufficient for our purposes here.
head(communities_model$predictions)
#>            none Holm oak forests Pyrenean oak forests Pinus forests
#> [1,] 0.00000000       1.00000000            0.0000000    0.00000000
#> [2,] 0.14285714       0.44642857            0.2142857    0.19642857
#> [3,] 0.02876984       0.03174603            0.7227183    0.21676587
#> [4,] 0.14792009       0.09908662            0.3230706    0.36191502
#> [5,] 0.02500000       0.08000000            0.7681034    0.12689655
#> [6,] 0.06604938       0.05000000            0.4246589    0.01754386
#>      Juniper-broom shrublands Alpine pastures
#> [1,]               0.00000000               0
#> [2,]               0.00000000               0
#> [3,]               0.00000000               0
#> [4,]               0.06800766               0
#> [5,]               0.00000000               0
#> [6,]               0.44174789               0
  • prediction.error: The Brier score of the model, representing the mean squared error between predicted probabilities and actual outcomes.
communities_model$prediction.error
#> [1] 0.2146029
  • variable.importance: Permutation importance of each predictor, expressed as the increase in model error on the out-of-bag data when the predictor is permuted. Important predictors show higher scores.
#permutation importance
communities_model$variable.importance |> 
  sort() |> 
  rev() |> 
  round(digits = 3)
#>    max_temperature_winter    min_temperature_winter           rainfall_summer 
#>                     0.159                     0.120                     0.107 
#>           rainfall_winter    max_temperature_summer    min_temperature_summer 
#>                     0.064                     0.059                     0.057 
#>                     slope                 northness topographic_wetness_index 
#>                     0.017                     0.011                     0.003

These scores represent how well a predictor helps the model discriminate between categories from a statistical standpoint rather than from an ecological one.

Model Assessment

This model has a Brier score of 0.215, which is reasonably good, indicating that the model is producing well-calibrated class probabilities.

However, this score says nothing about per-class performance. To understand how well the model predicts each class, we first need to transform the out-of-bag class probabilities to class predictions.

The code below extracts the name of the column in communities_model$predictions with the highest probability per observation, and prints the first cases.

#case index of the column with the maximum probability
max_probability_column <- max.col(communities_model$predictions)

#name of the column with the maximum probability
predicted_class <- colnames(communities_model$predictions)[max_probability_column]

head(predicted_class)
#> [1] "Holm oak forests"         "Holm oak forests"        
#> [3] "Pyrenean oak forests"     "Pinus forests"           
#> [5] "Pyrenean oak forests"     "Juniper-broom shrublands"

Now we use the observed and predicted data to build a confusion matrix showing model performance per class in community.

#generate confusion matrix
confusion_matrix <- table(
  #WARNING! if factor, must be converted to character!
  observed = as.character(communities$community),
  predicted = predicted_class
) |> 
  as.matrix()

#abbreviate dimension names
colnames(confusion_matrix) <-
  rownames(confusion_matrix) <-
    abbreviate(
      colnames(confusion_matrix),
      minlength = 8
    )

confusion_matrix
#>           predicted
#> observed   Alpnpstr Hlmokfrs Jnpr-brs none Pnsfrsts Pyrnnokf
#>   Alpnpstr      260        0        5    1        0        0
#>   Hlmokfrs        0      628        3   39      105       72
#>   Jnpr-brs       31        1     1595    2      153        7
#>   none           27      185      127 1452      159       78
#>   Pnsfrsts        0      240      158   98     1032       61
#>   Pyrnnokf        0       35       10   13       11      154

Below, we use the diagonal of the confusion matrix to compute the model’s overall recall (sensitivity or true positive rate)…

sum(diag(confusion_matrix)) / sum(confusion_matrix)
#> [1] 0.7595669

…and the matrix rows to compute the per-class recall.

sapply(
  X = colnames(confusion_matrix),
  FUN = function(x){
    confusion_matrix[x, x] / sum(confusion_matrix[x, ])
  }
) |> 
  sort() |> 
  round(digits = 3)
#>            Pinus forests     Pyrenean oak forests                     none 
#>                    0.649                    0.691                    0.716 
#>         Holm oak forests Juniper-broom shrublands          Alpine pastures 
#>                    0.741                    0.892                    0.977

The model achieves above-random recall for all communities, and is particularly accurate at discriminating juniper-broom shrublands and alpine pastures.

Now let’s project the trained model across climate scenarios.

Spatial Predictions

The function terra::predict() applies the trained random forest model to a SpatRaster object whose layers share the same predictor names and units used during training.

probabilities_2010 <- terra::predict(
  object = communities_2010,
  model  = communities_model,
  na.rm  = TRUE
)

probabilities_2050 <- terra::predict(
  object = communities_2050,
  model  = communities_model,
  na.rm  = TRUE
)

probabilities_2100 <- terra::predict(
  object = communities_2100,
  model  = communities_model,
  na.rm  = TRUE
)

The output has one probability layer per class.

plot(
  probabilities_2010, 
  nc = 2, 
  axes = NULL,
  col = env_colors
  )

To transform these probabilities into class predictions, we apply terra::which.max() to retrieve the index of the layer with the maximum probability.

prediction_2010 <- terra::which.max(
  x = probabilities_2010
)

prediction_2050 <- terra::which.max(
  x = probabilities_2050
)

prediction_2100 <- terra::which.max(
  x = probabilities_2100
)

The output of a prediction is a raster in which each pixel’s value is the index of the layer with the maximum predicted probability.

plot(
  x = prediction_2010, 
  col = communities_colors
  )

To facilitate the plotting of these categorical maps, first we add category names via terra::categories() and a categories data frame, and second we add specific colors per category using terra::coltab().

#categories dataframe
communities_categories_df <- data.frame(
  ID = seq_along(communities_colors),
  category = levels(communities$community)
)

#colors dataframe
communities_colors_df <- data.frame(
  value = seq_along(communities_colors),
  col = communities_colors
)


#assign categories
levels(prediction_2010) <- communities_categories_df
levels(prediction_2050) <- communities_categories_df
levels(prediction_2100) <- communities_categories_df

#assign colors
terra::coltab(prediction_2010) <- communities_colors_df
terra::coltab(prediction_2050) <- communities_colors_df
terra::coltab(prediction_2100) <- communities_colors_df

Let’s take a look at the resulting maps.

terra::plet(prediction_2010)
terra::plet(prediction_2050)
terra::plet(prediction_2100)

Let’s transforme these pixels into a plot to get a better picture of changes in habitat suitability for each plant community.

#resolution is 100x100 meters = 1 ha per cell
communities_area_df <- dplyr::bind_rows(
  terra::freq(prediction_2010) |> dplyr::mutate(year = 2010),
  terra::freq(prediction_2050) |> dplyr::mutate(year = 2050),
  terra::freq(prediction_2100) |> dplyr::mutate(year = 2100)
) |>
  dplyr::rename(community = value) |>
  dplyr::mutate(hectares = count) |>
  dplyr::select(year, community, hectares) |> 
  dplyr::filter(community != "none")

ggplot2::ggplot(
  data = communities_area_df,
  ggplot2::aes(
    x = year,
    y = hectares,
    colour = community,
    group  = community
  )
) +
  ggplot2::geom_line(linewidth = 1) +
  ggplot2::geom_point(size = 3) +
  ggplot2::scale_colour_manual(values = communities_colors) +
  ggplot2::scale_x_continuous(breaks = c(2010, 2050, 2100)) +
  ggplot2::scale_y_continuous(labels = scales::comma) +
  ggplot2::labs(
    x = "Year",
    y = "Area (hectares)",
    colour = "Community"
  ) +
  ggplot2::theme_bw()

Rather than actual community surface at a given time, these curves can be interpreted as a proxy for climatic stress: the smaller the community’s predicted area with respect to the initial area, the higher the climate stress the community may undergo.

We can also obtain a proxy for community transitions by assessing the fraction of the original area of each community that will become suitable for another community.

#stack together projections for 2010 and 2100
communities_transition_df <- terra::as.data.frame(
  x = c(prediction_2010, prediction_2050), 
  na.rm = TRUE
  ) |>
  setNames(c("communities_2010", "communities_2050")) |>
  dplyr::count(
    from = communities_2010, 
    to = communities_2050, 
    name = "hectares"
    ) |>
  dplyr::group_by(from) |>
  dplyr::mutate(
    transition_percent = round(
      x = hectares * 100 / sum(hectares), 
      digits = 1
      )
    ) |>
  dplyr::select(-hectares) |> 
  dplyr::ungroup() |> 
  tidyr::pivot_wider(
    names_from = "to",
    values_from = "transition_percent",
    values_fill = 0
  ) |> 
  dplyr::filter(
    from != "none"
  ) |> 
  dplyr::rename(
    Community = from
  )

knitr::kable(communities_transition_df)
Community none Juniper-broom shrublands Holm oak forests Pinus forests Alpine pastures
Holm oak forests 100.0 0.0 0.0 0.0 0.0
Pyrenean oak forests 99.9 0.0 0.1 0.0 0.0
Pinus forests 91.9 0.0 2.6 5.4 0.0
Juniper-broom shrublands 76.2 5.0 1.3 17.4 0.0
Alpine pastures 14.3 85.6 0.0 0.0 0.1

The transition matrix shows the source community (year 2010) in the first column, the resulting community as columns, and the values represent the percentage of the original area of the source community becoming suitable for a different community in 2050.

For example, most of the area of Alpine pastures transitioning to Juniper-broom shrublands can be interpreted as an indicator of biological pressure. Environmental change may facilitate the upwards migration of Juniper-broom species, resulting in a heightened competition with pastures, leading to the eventual replacement of this community.

Things to think about

The habitat suitability projections here don’t represent the future reality of the plant communities of Sierra Nevada for at least three reasons:

  • The future climate data used here is the result of an old climate model running on one of many possible scenarios.

  • The model used cannot capture many processes that govern vegetation dynamics, including species dispersal limitations, biotic interactions, and community resilience or adaptation.

  • Random forest cannot predict increased suitability even under improving conditions, unlike parametric models such as GLMs or GAMs.

The communities dataset offers a rich testbed for multi-class classification and ecological niche modelling in a real Mediterranean mountain system. The random forest workflow above demonstrates how a single trained multi-class model can be projected across climate scenarios, while underscoring the importance of interpreting such projections with caution.

References

Benito, B., Lorite, J., & Peñas, J. (2011). Simulating potential effects of climatic warming on altitudinal patterns of key species in Mediterranean-alpine ecosystems. Climatic Change, 108, 471–483. https://doi.org/10.1007/s10584-010-0015-3