Create spatially independent training and testing folds
Source: R/make_spatial_fold.R
Generates two spatially independent data folds by growing a rectangular buffer from a focal point until a specified fraction of records falls inside. Used internally by make_spatial_folds() and rf_evaluate() for spatial cross-validation.
Usage
make_spatial_fold(
  data = NULL,
  dependent.variable.name = NULL,
  xy.i = NULL,
  xy = NULL,
  distance.step.x = NULL,
  distance.step.y = NULL,
  training.fraction = 0.8
)
Arguments
- data
Data frame containing response variable and predictors. Required only for binary response variables.
- dependent.variable.name
Character string with the name of the response variable. Must be a column name in data. Required only for binary response variables.
- xy.i
Single-row data frame with columns "x" (longitude), "y" (latitude), and "id" (record identifier). Defines the focal point from which the buffer grows.
- xy
Data frame with columns "x" (longitude), "y" (latitude), and "id" (record identifier). Contains all spatial coordinates for the dataset.
- distance.step.x
Numeric value specifying the buffer growth increment along the x-axis. Default: NULL (automatically set to 1/1000th of the x-coordinate range).
- distance.step.y
Numeric value specifying the buffer growth increment along the y-axis. Default: NULL (automatically set to 1/1000th of the y-coordinate range).
- training.fraction
Numeric value between 0.1 and 0.9 specifying the fraction of records to include in the training fold. Default: 0.8.
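As a rough illustration of the NULL defaults described above, the step along each axis can be derived from the coordinate range. This is a minimal base R sketch on made-up coordinates, not the package's internal code; the toy `xy` data frame is an assumption.

```r
# Hypothetical coordinates standing in for the `xy` argument
xy <- data.frame(
  id = 1:5,
  x = c(-10, -5, 0, 5, 10),
  y = c(40, 42, 44, 46, 50)
)

# Assumed default: 1/1000th of the range of each coordinate
distance.step.x <- diff(range(xy$x)) / 1000
distance.step.y <- diff(range(xy$y)) / 1000
```

With steps this small relative to the extent of the data, the buffer needs many iterations to reach the target, which presumably trades speed for a fold size close to the requested fraction.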
Value
List with two elements:
- training: Integer vector of record IDs (from xy$id) in the training fold.
- testing: Integer vector of record IDs (from xy$id) in the testing fold.
Details
This function creates spatially independent training and testing folds for spatial cross-validation. The algorithm works as follows:
1. Starts with a small rectangular buffer centered on the focal point (xy.i).
2. Grows the buffer incrementally by distance.step.x and distance.step.y.
3. Continues growing until the buffer contains the desired number of records (training.fraction * total records).
4. Assigns records inside the buffer to training and records outside to testing.
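The steps above can be sketched in base R as follows. This is an illustrative approximation under stated assumptions, not the function's actual source: the helper name `grow_buffer_fold()`, the use of `floor()` for the target count, and the simulated coordinates are all hypothetical.

```r
# Hypothetical helper illustrating the buffer-growing idea
grow_buffer_fold <- function(xy.i, xy, distance.step.x, distance.step.y,
                             training.fraction = 0.8) {
  # Positive steps guarantee the loop terminates
  stopifnot(distance.step.x > 0, distance.step.y > 0)

  # Target number of records inside the buffer (rounding is an assumption)
  records.needed <- floor(training.fraction * nrow(xy))

  # Start with a buffer of one step around the focal point
  buffer.x <- distance.step.x
  buffer.y <- distance.step.y

  repeat {
    # Records falling inside the current rectangle
    inside <- abs(xy$x - xy.i$x) <= buffer.x &
      abs(xy$y - xy.i$y) <= buffer.y
    if (sum(inside) >= records.needed) break

    # Grow the rectangle by one step on each axis
    buffer.x <- buffer.x + distance.step.x
    buffer.y <- buffer.y + distance.step.y
  }

  list(training = xy$id[inside], testing = xy$id[!inside])
}

# Simulated coordinates (assumption, for illustration only)
set.seed(1)
xy <- data.frame(id = 1:100, x = runif(100), y = runif(100))

fold <- grow_buffer_fold(
  xy.i = xy[1, ],
  xy = xy,
  distance.step.x = diff(range(xy$x)) / 1000,
  distance.step.y = diff(range(xy$y)) / 1000,
  training.fraction = 0.6
)
```

Because the rectangle grows in discrete steps, one increment may capture several records at once, so the training fold can end up slightly larger than the requested fraction.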
Special handling for binary response variables:
When data and dependent.variable.name are provided and the response is binary (0/1), the function ensures that training.fraction applies to the number of presences (1s), not total records. This prevents imbalanced sampling in presence-absence models.
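One way to picture the binary case is to change only the stopping rule, so that the buffer grows until it holds the requested share of presences rather than of all records. The sketch below is a guess at that logic under stated assumptions; `grow_buffer_fold_binary()`, its `presence` argument, and the simulated data are hypothetical and do not come from the package.

```r
# Hypothetical variant: the stopping rule counts presences (1s) only
grow_buffer_fold_binary <- function(xy.i, xy, presence, step.x, step.y,
                                    training.fraction = 0.8) {
  stopifnot(step.x > 0, step.y > 0, all(presence %in% c(0, 1)))

  # Target number of presences inside the buffer (rounding is an assumption)
  presences.needed <- floor(training.fraction * sum(presence == 1))

  buffer.x <- step.x
  buffer.y <- step.y

  repeat {
    inside <- abs(xy$x - xy.i$x) <= buffer.x &
      abs(xy$y - xy.i$y) <= buffer.y
    # Stop once enough presences fall inside, regardless of absences
    if (sum(presence[inside] == 1) >= presences.needed) break
    buffer.x <- buffer.x + step.x
    buffer.y <- buffer.y + step.y
  }

  list(training = xy$id[inside], testing = xy$id[!inside])
}

# Simulated presence-absence data (assumption, for illustration only)
set.seed(1)
xy <- data.frame(id = 1:200, x = runif(200), y = runif(200))
presence <- rbinom(200, size = 1, prob = 0.3)

fold <- grow_buffer_fold_binary(
  xy.i = xy[1, ],
  xy = xy,
  presence = presence,
  step.x = diff(range(xy$x)) / 1000,
  step.y = diff(range(xy$y)) / 1000,
  training.fraction = 0.6
)

# Presences captured by the training fold
sum(presence[fold$training] == 1)
```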
Examples
data(plants_df, plants_xy)
# Create spatial fold centered on first coordinate
fold <- make_spatial_fold(
  xy.i = plants_xy[1, ],
  xy = plants_xy,
  training.fraction = 0.6
)
# View training and testing record IDs
fold$training
#> [1] 1 2 4 5 6 7 10 12 14 15 19 20 21 22 23 26 28 31
#> [19] 32 33 34 35 36 37 45 47 48 50 51 54 56 58 59 62 63 64
#> [37] 65 66 67 71 73 74 77 78 79 80 81 83 87 94 95 96 97 98
#> [55] 99 100 102 104 106 107 108 109 110 111 112 113 114 115 118 119 122 125
#> [73] 126 127 129 131 135 136 138 139 141 145 151 153 154 155 156 157 159 160
#> [91] 161 162 163 165 166 168 171 174 177 178 179 181 182 185 186 187 188 189
#> [109] 191 192 193 194 195 198 201 204 205 206 207 208 209 210 212 213 214 215
#> [127] 216 218 219 220 221 222 223 224 226 227
fold$testing
#> [1] 3 8 9 11 13 16 17 18 24 25 27 29 30 38 39 40 41 42 43
#> [20] 44 46 49 52 53 55 57 60 61 68 69 70 72 75 76 82 84 85 86
#> [39] 88 89 90 91 92 93 101 103 105 116 117 120 121 123 124 128 130 132 133
#> [58] 134 137 140 142 143 144 146 147 148 149 150 152 158 164 167 169 170 172 173
#> [77] 175 176 180 183 184 190 196 197 199 200 202 203 211 217 225
# Visualize the spatial split (training = red, testing = blue, center = black)
if (interactive()) {
  plot(plants_xy[c("x", "y")], type = "n", xlab = "", ylab = "")
  points(plants_xy[fold$training, c("x", "y")], col = "red4", pch = 15)
  points(plants_xy[fold$testing, c("x", "y")], col = "blue4", pch = 15)
  points(plants_xy[1, c("x", "y")], col = "black", pch = 15, cex = 2)
}