Skip to contents

Overview

The andalusia dataset contains presence records of threatened and endemic plant species from Andalusia, Spain — a biodiversity hotspot in south-western Europe renowned for its exceptionally high rates of endemism within the Mediterranean Basin. Records were collected at 400 m resolution in the ETRS89 / UTM zone 30N coordinate system (EPSG:25830), with species presences spatially thinned at the raster cell level to remove spatial redundancy. Species with fewer than 30 records after thinning were excluded.

The dataset pairs occurrence points for multiple threatened and endemic species with randomly sampled background points, and includes 20 environmental predictors derived from Landsat imagery, climate surfaces, and a digital elevation model. This structure is well suited for per-species species distribution models (SDMs), multi-species richness mapping, and spatial conservation prioritisation.

Setup

Data Structure

The dataset is an sf data frame with 8666 rows and 23 columns.

andalusia |> dplyr::glimpse()
#> Rows: 8,666
#> Columns: 23
#> $ species                <chr> "Adenocarpus_gibbsianus", "Adenocarpus_gibbsian…
#> $ presence               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#> $ landsat_band_1         <dbl> 61.25444, 65.91648, 82.49583, 85.40617, 74.6444…
#> $ landsat_band_2         <dbl> 77.78752, 83.29051, 99.06973, 107.76321, 97.494…
#> $ landsat_band_3         <dbl> 81.18155, 85.66795, 97.28080, 109.72297, 91.556…
#> $ landsat_band_4         <dbl> 141.2575, 144.9496, 158.4905, 130.6584, 102.595…
#> $ landsat_band_5         <dbl> 123.6921, 125.1109, 133.4070, 121.9335, 95.1965…
#> $ landsat_band_6         <dbl> 109.41866, 112.08743, 117.67980, 114.50157, 93.…
#> $ landsat_ndvi           <dbl> 27.735855, 27.032965, 24.912889, 7.063870, -3.6…
#> $ rainfall_annual        <dbl> 559.7617, 553.3437, 532.9059, 348.0289, 325.503…
#> $ rainfall_summer        <dbl> 24.53417, 24.47948, 22.63235, 21.57755, 21.2542…
#> $ solar_radiation_summer <dbl> 7685.383, 7684.105, 7663.054, 7643.000, 7642.30…
#> $ solar_radiation_winter <dbl> 2214.927, 2189.019, 2195.299, 2192.499, 2180.77…
#> $ temperature_summer_max <dbl> 31.98804, 31.98711, 31.96779, 30.20288, 29.4001…
#> $ temperature_summer_min <dbl> 18.30772, 18.30594, 18.53732, 18.83981, 18.8364…
#> $ temperature_winter_max <dbl> 15.64293, 15.62554, 15.99173, 16.47264, 16.4793…
#> $ temperature_winter_min <dbl> 6.451786, 6.395597, 6.865064, 8.208864, 8.53619…
#> $ topography_eastness    <dbl> 38.95396, 53.94389, 18.68366, 50.00000, 45.0372…
#> $ topography_elevation   <dbl> 159.015488, 159.414963, 89.350151, 0.000000, 0.…
#> $ topography_northness   <dbl> 44.40618, 50.61015, 45.43943, 50.00000, 57.4226…
#> $ topography_position    <dbl> 9.46080685, 12.75258732, 1.77814519, 0.00000000…
#> $ topography_slope       <dbl> 1.0978156, 1.1250243, 2.0473621, 0.0000000, 0.2…
#> $ geometry               <POINT [m]> POINT (125777.2 4145725), POINT (125377.2…

The species column identifies each record as a named species or "background". The table below shows the number of records per species.

andalusia |>
  sf::st_drop_geometry() |>
  dplyr::group_by(species) |>
  dplyr::summarise(n = dplyr::n()) |>
  dplyr::arrange(dplyr::desc(n))
#> # A tibble: 28 × 2
#>    species                            n
#>    <chr>                          <int>
#>  1 background                      5598
#>  2 Maytenus_senegalensis_europaea   871
#>  3 Abies_pinsapo                    500
#>  4 Linaria_nigricans                344
#>  5 Juniperus_oxycedrus_macrocarpa   165
#>  6 Narcissus_longispathus            98
#>  7 Viola_cazorlensis                 98
#>  8 Cynomorium_coccineum              73
#>  9 Quercus_alpestris                 73
#> 10 Thymus_albicans                   64
#> # ℹ 18 more rows

The map below shows all species presences and background points by colour.

mapview::mapview(
  andalusia,
  zcol = "species",
  layer.name = "Species",
  col.regions = viridis::turbo(n = length(unique(andalusia$species)))
)

The dataset includes 20 remote sensing, topographic, and climate predictors.

andalusia_predictors
#>  [1] "landsat_band_1"         "landsat_band_2"         "landsat_band_3"        
#>  [4] "landsat_band_4"         "landsat_band_5"         "landsat_band_6"        
#>  [7] "landsat_ndvi"           "rainfall_annual"        "rainfall_summer"       
#> [10] "solar_radiation_summer" "solar_radiation_winter" "temperature_summer_max"
#> [13] "temperature_summer_min" "temperature_winter_max" "temperature_winter_min"
#> [16] "topography_eastness"    "topography_elevation"   "topography_northness"  
#> [19] "topography_position"    "topography_slope"

Example Usage

This section demonstrates a complete per-species distribution modelling workflow:

  • Environmental Raster: download and visualise the 20-band raster covering the study area.
  • Multicollinearity Filtering: select a shared, non-redundant predictor set using collinear::collinear().
  • Modelling Rationale: describe the per-species GLM design and threshold strategy.
  • Model Fitting: fit one binomial GLM per species with inverse-frequency case weights.
  • Threshold Selection: choose the optimal binary threshold per model via Youden’s J statistic.
  • Species Richness Model: sum thresholded binary predictions to produce a richness raster.
  • High-Richness Areas: identify and map cells at or above the 75th-percentile richness.

Environmental Raster

The environmental raster (same extent as the occurrence data, 400 m resolution, EPSG:25830) is loaded with andalusia_extra(). When the vignette renders, andalusia_env.tif is read from disk without re-downloading.

andalusia_env <- spatialData::andalusia_extra()
plot(andalusia_env)

Multicollinearity Filtering

A shared predictor set is selected for all species models using collinear::collinear() on the pooled presence response. Using the same predictors across all species ensures that suitability surfaces are directly comparable when summed into a richness layer.

andalusia_selection <- collinear::collinear(
  df         = andalusia,
  responses  = "presence",
  predictors = andalusia_predictors,
  f          = collinear::f_binomial_gam,
  quiet      = TRUE
)
selected_predictors <- andalusia_selection$presence$selection
selected_predictors
#> [1] "temperature_summer_max" "rainfall_summer"        "solar_radiation_summer"
#> [4] "landsat_band_5"         "solar_radiation_winter" "topography_position"   
#> [7] "landsat_ndvi"           "topography_eastness"   
#> attr(,"validated")
#> [1] TRUE

Modelling Rationale

Each species is modelled with a separate binomial GLM. For a given species sp, the training data consists of:

  • All rows where species == sp (labelled sp_presence = 1).
  • All background rows (labelled sp_presence = 0).

Background points typically outnumber any individual species’ presences by a large margin. To prevent the model from being dominated by the background class, collinear::case_weights() assigns inverse-frequency weights so that presences and background contribute equally to the likelihood.

The shared predictor set selected above ensures that all suitability layers rest on the same environmental basis and can be meaningfully summed into a species richness estimate.

Each model is converted from a continuous suitability surface to a binary presence/absence map using Youden’s J statistic (sensitivity + specificity − 1). The threshold that maximises J balances commission and omission errors symmetrically, giving a principled default for binary mapping across species with very different prevalences.

The species richness raster is then the sum of all binary layers: each cell’s value equals the number of species predicted present there.

Model Fitting

The helper function find_threshold() evaluates Youden’s J at 99 candidate thresholds and returns the one with the highest value.

find_threshold <- function(observed, predicted) {
  thresholds <- seq(0.01, 0.99, by = 0.01)
  youden <- vapply(thresholds, function(t) {
    pred_bin <- as.integer(predicted >= t)
    tp <- sum(pred_bin == 1L & observed == 1L)
    tn <- sum(pred_bin == 0L & observed == 0L)
    fp <- sum(pred_bin == 1L & observed == 0L)
    fn <- sum(pred_bin == 0L & observed == 1L)
    if ((tp + fn) == 0L || (tn + fp) == 0L) return(-1)
    (tp / (tp + fn)) + (tn / (tn + fp)) - 1
  }, FUN.VALUE = numeric(1))
  thresholds[which.max(youden)]
}

One binomial GLM is fitted per species. The warning = FALSE chunk option suppresses the non-integer weights warning that stats::glm() emits when non-standard case weights are provided.

species_names <- unique(andalusia$species[andalusia$species != "background"])

species_models <- lapply(
  X = species_names,
  FUN = function(sp) {
    sp_df <- andalusia |>
      dplyr::filter(species == sp | species == "background") |>
      sf::st_drop_geometry() |>
      dplyr::mutate(sp_presence = dplyr::if_else(species == sp, 1L, 0L))
    sp_weights <- collinear::case_weights(x = sp_df$sp_presence)
    sp_formula <- as.formula(
      paste0("sp_presence ~ ", paste(selected_predictors, collapse = " + "))
    )
    stats::glm(
      formula = sp_formula,
      data    = sp_df,
      family  = binomial(link = "logit"),
      weights = sp_weights
    )
  }
)
names(species_models) <- species_names

Threshold Selection

For each species, training-set predictions are scored against observed labels and find_threshold() selects the Youden-optimal cut-off.

species_thresholds <- vapply(
  X = species_names,
  FUN = function(sp) {
    sp_df <- andalusia |>
      dplyr::filter(species == sp | species == "background") |>
      sf::st_drop_geometry() |>
      dplyr::mutate(sp_presence = dplyr::if_else(species == sp, 1L, 0L))
    sp_pred <- stats::predict(species_models[[sp]], newdata = sp_df, type = "response")
    find_threshold(observed = sp_df$sp_presence, predicted = sp_pred)
  },
  FUN.VALUE = numeric(1)
)
names(species_thresholds) <- species_names
species_thresholds
#>          Adenocarpus_gibbsianus                Allium_pruinatum 
#>                            0.44                            0.53 
#> Aquilegia_pyrenaica_cazorlensis           Artemisia_granatensis 
#>                            0.66                            0.90 
#>               Astragalus_edulis                  Atropa_baetica 
#>                            0.81                            0.44 
#>        Campanula_specularioides            Cynomorium_coccineum 
#>                            0.75                            0.51 
#>             Dianthus_hinoxianus                Erodium_rupicola 
#>                            0.45                            0.81 
#>      Hymenostemma_pseudanthemis  Juniperus_oxycedrus_macrocarpa 
#>                            0.66                            0.53 
#>               Linaria_nigricans                 Linaria_tursica 
#>                            0.54                            0.30 
#>               Marsilea_batardae               Marsilea_strigosa 
#>                            0.43                            0.37 
#>  Maytenus_senegalensis_europaea            Moehringia_fontqueri 
#>                            0.50                            0.77 
#>          Narcissus_longispathus Narcissus_nevadensis_nevadensis 
#>                            0.65                            0.45 
#>               Picris_willkommii               Quercus_alpestris 
#>                            0.65                            0.78 
#>           Rosmarinus_tomentosus              Silene_fernandezii 
#>                            0.55                            0.54 
#>                 Thymus_albicans               Viola_cazorlensis 
#>                            0.42                            0.49 
#>                   Abies_pinsapo 
#>                            0.53

Species Richness Model

The richness raster is built by predicting each model onto andalusia_env, thresholding the continuous suitability surface at the species-specific Youden cut-off, and summing the resulting binary layers.

richness_raster <- lapply(
  X = species_names,
  FUN = function(sp) {
    sp_pred <- terra::predict(
      object = andalusia_env,
      model  = species_models[[sp]],
      type   = "response",
      na.rm  = TRUE
    )
    terra::ifel(sp_pred >= species_thresholds[[sp]], 1L, 0L)
  }
) |>
  terra::rast() |>
  terra::app(fun = sum)

names(richness_raster) <- "species_richness"

plot(
  x    = richness_raster,
  col  = viridis::turbo(n = 100),
  main = "Predicted species richness"
)

High-Richness Areas

Areas at or above the 75th percentile of predicted richness are identified and mapped interactively to highlight the most species-rich parts of the study region.

richness_threshold <- terra::global(
  x   = richness_raster,
  fun = function(x) quantile(x, probs = 0.75, na.rm = TRUE)
)[[1]]

high_richness <- terra::ifel(
  test = richness_raster >= richness_threshold,
  yes  = richness_raster,
  no   = NA
)

mapview::mapview(
  high_richness,
  layer.name  = "High species richness",
  col.regions = viridis::turbo(n = 100)
)

Conclusion

The andalusia dataset provides a multi-species presence-background framework well suited to comparative SDM workflows and conservation prioritisation. By fitting one binomial GLM per species with case-weighted training data and a shared, multicollinearity-filtered predictor set, the models remain comparable across species and their outputs can be meaningfully combined into a spatially explicit richness estimate.

The Youden’s J thresholding approach offers a principled, symmetric criterion for converting continuous suitability surfaces into binary maps — an important step when the downstream goal is counting species rather than ranking habitat quality. The resulting richness raster highlights the areas of Andalusia predicted to support the greatest concentration of threatened and endemic plant species, offering a direct input to spatial conservation planning.

References

Benito, B.M., Lorite, J., Pérez-Pérez, R., Gómez-Aparicio, L., & Peñas, J. (2014). Forecasting plant range collapse in a mediterranean hotspot: when dispersal uncertainties matter. Diversity and Distributions, 20(1), 72–83. https://doi.org/10.1111/ddi.12148