A convenient wrapper for ranger that extends its output with the Moran's I of the residuals for different distance thresholds, the RMSE and NRMSE (as computed by root_mean_squared_error()), and, optionally, variable importance scores based on a version of the data scaled with scale().
Usage
rf(
data = NULL,
dependent.variable.name = NULL,
predictor.variable.names = NULL,
distance.matrix = NULL,
distance.thresholds = NULL,
xy = NULL,
ranger.arguments = NULL,
scaled.importance = FALSE,
seed = 1,
verbose = TRUE,
n.cores = parallel::detectCores() - 1,
cluster = NULL
)
Arguments
- data
Data frame with a response variable and a set of predictors. Default: NULL
- dependent.variable.name
Character string with the name of the response variable. Must be in the column names of data. If the dependent variable is binary with values 1 and 0, the argument case.weights of ranger is populated by the function case_weights(). Default: NULL
- predictor.variable.names
Character vector with the names of the predictive variables. Every element of this vector must be in the column names of data. Optionally, the result of auto_cor() or auto_vif(). Default: NULL
- distance.matrix
Square matrix with the distances among the records in data. The number of rows of distance.matrix and data must be the same. If not provided, the computation of the Moran's I of the residuals is omitted. Default: NULL
- distance.thresholds
Numeric vector with neighborhood distances. All distances in the distance matrix below each value in distance.thresholds are set to 0 for the computation of Moran's I. If NULL, it defaults to seq(0, max(distance.matrix), length.out = 4). Default: NULL
- xy
(optional) Data frame or matrix with two columns containing coordinates and named "x" and "y". It is not used by this function, but it is stored in the slot ranger.arguments$xy of the model, so it can be used by rf_evaluate() and rf_tuning(). Default: NULL
- ranger.arguments
Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', which is set to 'permutation' rather than 'none'. The ranger arguments x, y, and formula are disabled. Please consult the help file of ranger if you are not familiar with the arguments of this function.
- scaled.importance
Logical, if TRUE, the function scales data with scale and fits a new model to compute scaled variable importance scores. This makes variable importance scores of different models somewhat comparable. Default: FALSE
- seed
Integer, random seed to facilitate reproducibility. If set to a given number, the returned model is always the same. Default: 1
- verbose
Boolean. If TRUE, messages and plots generated during the execution of the function are displayed. Default: TRUE
- n.cores
Integer, number of cores to use. Default: parallel::detectCores() - 1
- cluster
A cluster definition generated with parallel::makeCluster(). This function does not use the cluster, but can pass it on to other functions when using the %>% pipe. It will be stored in the slot cluster of the output list. Default: NULL
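The interplay between distance.matrix and distance.thresholds can be illustrated with a minimal base R sketch. It builds a toy distance matrix, derives the default thresholds with seq(0, max(distance.matrix), length.out = 4), and computes a Moran's I value per threshold. The helpers toy_weights() and toy_moran(), the toy data, and the choice of row-standardized inverse-distance weights are assumptions made for illustration; they are not the package's internal implementation.

```r
# Toy coordinates for five records (hypothetical data)
xy <- data.frame(
  x = c(0, 1, 2, 5, 6),
  y = c(0, 1, 0, 5, 6)
)

# Square, symmetric matrix with the distances among the records
distance.matrix <- as.matrix(dist(xy))

# Thresholds as described for distance.thresholds = NULL
distance.thresholds <- seq(0, max(distance.matrix), length.out = 4)

# Hypothetical helper: row-standardized inverse-distance weights,
# with distances below the threshold set to 0 (assumed weighting scheme)
toy_weights <- function(d, threshold) {
  d[d < threshold] <- 0
  w <- 1 / d
  w[!is.finite(w)] <- 0       # zero distances get zero weight
  diag(w) <- 0                # a record is not its own neighbor
  row.sums <- rowSums(w)
  row.sums[row.sums == 0] <- 1  # keep all-zero rows as zeros
  w / row.sums                # divides row i by row.sums[i]
}

# Hypothetical helper: Moran's I for a vector x and a weights matrix w
toy_moran <- function(x, w) {
  if (sum(w) == 0) return(NA_real_)  # no neighbors left at this threshold
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Toy residuals: positive in one spatial cluster, negative in the other
residuals.toy <- c(1.2, 0.8, 1.0, -1.5, -1.1)

moran.by.threshold <- sapply(
  distance.thresholds,
  function(threshold) {
    toy_moran(residuals.toy, toy_weights(distance.matrix, threshold))
  }
)

data.frame(
  distance.threshold = distance.thresholds,
  moran.i = moran.by.threshold
)
```

Larger thresholds discard progressively more short-distance pairs, so the index at each threshold reflects the association among increasingly distant records.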
Value
A ranger model with several extra slots:
- ranger.arguments: Stores the values of the arguments used to fit the ranger model.
- importance: A list containing a data frame with the predictors ordered by their importance, a ggplot showing the importance values, and local importance scores (difference in accuracy between permuted and non permuted variables for every case, computed on the out-of-bag data).
- performance: performance scores: R squared on out-of-bag data, R squared (cor(observed, predicted) ^ 2), pseudo R squared (cor(observed, predicted)), RMSE, and normalized RMSE (NRMSE).
- residuals: residuals, normality test of the residuals computed with residuals_test(), and spatial autocorrelation of the residuals computed with moran_multithreshold().
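The performance scores listed above can be reproduced by hand from observed and predicted values. The following base R sketch uses made-up vectors. The R squared and pseudo R squared formulas are the ones given in the description of the performance slot. Dividing the RMSE by the interquartile range of the observations is only an assumption about how NRMSE is normalized, so check root_mean_squared_error() for the normalization actually used.

```r
# Hypothetical observed and predicted values
observed  <- c(10, 12, 15, 20, 22, 30)
predicted <- c(11, 13, 14, 18, 24, 27)

# Pseudo R squared: correlation between observations and predictions
pseudo.r.squared <- cor(observed, predicted)

# R squared: squared correlation between observations and predictions
r.squared <- cor(observed, predicted)^2

# Root mean squared error
rmse <- sqrt(mean((observed - predicted)^2))

# Normalized RMSE (assumption: normalized by the interquartile range)
nrmse <- rmse / IQR(observed)

list(
  r.squared = r.squared,
  pseudo.r.squared = pseudo.r.squared,
  rmse = rmse,
  nrmse = nrmse
)
```

Because R squared here is a squared correlation, it measures how well predictions track observations in a linear sense and can be high even when predictions are biased, which is why RMSE is reported alongside it.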
Details
Please read the help file of ranger for further details. Note that the formula interface of ranger is supported through ranger.arguments, but variable interactions are not allowed (see the_feature_engineer() as an alternative).
Examples
#loading example data
data("plant_richness_df")
data("distance_matrix")
#fitting random forest model
out <- rf(
data = plant_richness_df,
dependent.variable.name = "richness_species_vascular",
predictor.variable.names = colnames(plant_richness_df)[5:21],
distance.matrix = distance_matrix,
distance.thresholds = 0,
n.cores = 1
)
#> Model type
#> - Fitted with: ranger()
#> - Response variable: richness_species_vascular
#>
#> Random forest parameters
#> - Type: Regression
#> - Number of trees: 500
#> - Sample size: 227
#> - Number of predictors: 17
#> - Mtry: 4
#> - Minimum node size: 5
#>
#>
#> Model performance
#> - R squared (oob): 0.5622641
#> - R squared (cor(obs, pred)^2): 0.9513407
#> - Pseudo R squared (cor(obs, pred)):0.975367
#> - RMSE (oob): 2229.675
#> - RMSE: 976.0133
#> - Normalized RMSE: 0.281759
#>
#> Model residuals
#> - Stats:
#> ┌──────────┬─────────┬─────────┬────────┬────────┬─────────┐
#> │ Min. │ 1st Q. │ Median │ Mean │ 3rd Q. │ Max. │
#> ├──────────┼─────────┼─────────┼────────┼────────┼─────────┤
#> │ -1751.61 │ -473.69 │ -195.00 │ -31.43 │ 142.83 │ 8504.61 │
#> └──────────┴─────────┴─────────┴────────┴────────┴─────────┘
#> - Normality:
#> - Shapiro-Wilks W: 0.711
#> - p-value : 0
#> - Interpretation : Residuals are not normal
#>
#> - Spatial autocorrelation:
#> ┌──────────┬───────────┬─────────┬──────────────────┐
#> │ Distance │ Moran's I │ P value │ Interpretation │
#> ├──────────┼───────────┼─────────┼──────────────────┤
#> │ 0.0 │ 0.149 │ 0.000 │ Positive spatial │
#> │ │ │ │ correlation │
#> └──────────┴───────────┴─────────┴──────────────────┘
#>
#> Variable importance:
#> ┌─────────────────────────────────┬────────────┐
#> │ Variable │ Importance │
#> ├─────────────────────────────────┼────────────┤
#> │ human_population │ 2026.102 │
#> ├─────────────────────────────────┼────────────┤
#> │ climate_bio1_average │ 1831.466 │
#> ├─────────────────────────────────┼────────────┤
#> │ climate_hypervolume │ 1444.210 │
#> ├─────────────────────────────────┼────────────┤
#> │ human_population_density │ 1359.632 │
#> ├─────────────────────────────────┼────────────┤
#> │ bias_area_km2 │ 1209.525 │
#> ├─────────────────────────────────┼────────────┤
#> │ human_footprint_average │ 976.333 │
#> ├─────────────────────────────────┼────────────┤
#> │ neighbors_count │ 846.447 │
#> ├─────────────────────────────────┼────────────┤
#> │ bias_species_per_record │ 719.243 │
#> ├─────────────────────────────────┼────────────┤
#> │ climate_aridity_index_average │ 695.705 │
#> ├─────────────────────────────────┼────────────┤
#> │ neighbors_area │ 676.818 │
#> ├─────────────────────────────────┼────────────┤
#> │ climate_velocity_lgm_average │ 631.907 │
#> ├─────────────────────────────────┼────────────┤
#> │ neighbors_percent_shared_edge │ 628.314 │
#> ├─────────────────────────────────┼────────────┤
#> │ fragmentation_cohesion │ 619.986 │
#> ├─────────────────────────────────┼────────────┤
#> │ topography_elevation_average │ 615.590 │
#> ├─────────────────────────────────┼────────────┤
#> │ fragmentation_division │ 483.461 │
#> ├─────────────────────────────────┼────────────┤
#> │ climate_bio15_minimum │ 373.095 │
#> ├─────────────────────────────────┼────────────┤
#> │ landcover_herbs_percent_average │ 358.212 │
#> └─────────────────────────────────┴────────────┘
class(out)
#> [1] "rf" "ranger"
#data frame with ordered variable importance
out$importance$per.variable
#> variable importance
#> 1 human_population 2026.102
#> 2 climate_bio1_average 1831.466
#> 3 climate_hypervolume 1444.210
#> 4 human_population_density 1359.632
#> 5 bias_area_km2 1209.525
#> 6 human_footprint_average 976.333
#> 7 neighbors_count 846.447
#> 8 bias_species_per_record 719.243
#> 9 climate_aridity_index_average 695.705
#> 10 neighbors_area 676.818
#> 11 climate_velocity_lgm_average 631.907
#> 12 neighbors_percent_shared_edge 628.314
#> 13 fragmentation_cohesion 619.986
#> 14 topography_elevation_average 615.590
#> 15 fragmentation_division 483.461
#> 16 climate_bio15_minimum 373.095
#> 17 landcover_herbs_percent_average 358.212
#variable importance plot
out$importance$per.variable.plot
#performance
out$performance
#> $r.squared.oob
#> [1] 0.5622641
#>
#> $r.squared
#> [1] 0.9513407
#>
#> $pseudo.r.squared
#> [1] 0.975367
#>
#> $rmse.oob
#> [1] 2229.675
#>
#> $rmse
#> [1] 976.0133
#>
#> $nrmse
#> [1] 0.281759
#>
#> $auc
#> [1] NA
#>
#spatial correlation of the residuals
out$residuals$autocorrelation$per.distance
#plot of the Moran's I of the residuals for different distance thresholds
out$residuals$autocorrelation$plot
#predictions for new data as done with ranger models:
predicted <- stats::predict(
object = out,
data = plant_richness_df,
type = "response"
)$predictions
#alternative data input methods
###############################
#ranger.arguments can contain ranger arguments and any other rf argument
my.ranger.arguments <- list(
data = plant_richness_df,
dependent.variable.name = "richness_species_vascular",
predictor.variable.names = colnames(plant_richness_df)[8:21],
distance.matrix = distance_matrix,
distance.thresholds = c(0, 1000)
)
#fitting model with these ranger arguments
out <- rf(
ranger.arguments = my.ranger.arguments,
n.cores = 1
)
#> Model type
#> - Fitted with: ranger()
#> - Response variable: richness_species_vascular
#>
#> Random forest parameters
#> - Type: Regression
#> - Number of trees: 500
#> - Sample size: 227
#> - Number of predictors: 14
#> - Mtry: 3
#> - Minimum node size: 5
#>
#>
#> Model performance
#> - R squared (oob): 0.5234123
#> - R squared (cor(obs, pred)^2): 0.9419549
#> - Pseudo R squared (cor(obs, pred)):0.9705436
#> - RMSE (oob): 2326.52
#> - RMSE: 1042.679
#> - Normalized RMSE: 0.3010044
#>
#> Model residuals
#> - Stats:
#> ┌──────────┬─────────┬─────────┬────────┬────────┬─────────┐
#> │ Min. │ 1st Q. │ Median │ Mean │ 3rd Q. │ Max. │
#> ├──────────┼─────────┼─────────┼────────┼────────┼─────────┤
#> │ -2044.30 │ -480.86 │ -196.58 │ -32.92 │ 168.62 │ 8253.26 │
#> └──────────┴─────────┴─────────┴────────┴────────┴─────────┘
#> - Normality:
#> - Shapiro-Wilks W: 0.767
#> - p-value : 0
#> - Interpretation : Residuals are not normal
#>
#> - Spatial autocorrelation:
#> ┌──────────┬───────────┬─────────┬──────────────────┐
#> │ Distance │ Moran's I │ P value │ Interpretation │
#> ├──────────┼───────────┼─────────┼──────────────────┤
#> │ 0.0 │ 0.160 │ 0.000 │ Positive spatial │
#> │ │ │ │ correlation │
#> ├──────────┼───────────┼─────────┼──────────────────┤
#> │ 1000.0 │ 0.050 │ 0.000 │ Positive spatial │
#> │ │ │ │ correlation │
#> └──────────┴───────────┴─────────┴──────────────────┘
#>
#> Variable importance:
#> ┌─────────────────────────────────┬────────────┐
#> │ Variable │ Importance │
#> ├─────────────────────────────────┼────────────┤
#> │ human_population │ 1991.952 │
#> ├─────────────────────────────────┼────────────┤
#> │ climate_bio1_average │ 1721.156 │
#> ├─────────────────────────────────┼────────────┤
#> │ climate_hypervolume │ 1581.268 │
#> ├─────────────────────────────────┼────────────┤
#> │ human_population_density │ 1404.890 │
#> ├─────────────────────────────────┼────────────┤
#> │ human_footprint_average │ 1228.636 │
#> ├─────────────────────────────────┼────────────┤
#> │ neighbors_area │ 925.862 │
#> ├─────────────────────────────────┼────────────┤
#> │ neighbors_count │ 863.227 │
#> ├─────────────────────────────────┼────────────┤
#> │ fragmentation_cohesion │ 715.115 │
#> ├─────────────────────────────────┼────────────┤
#> │ neighbors_percent_shared_edge │ 705.063 │
#> ├─────────────────────────────────┼────────────┤
#> │ topography_elevation_average │ 656.831 │
#> ├─────────────────────────────────┼────────────┤
#> │ climate_velocity_lgm_average │ 565.010 │
#> ├─────────────────────────────────┼────────────┤
#> │ climate_bio15_minimum │ 509.466 │
#> ├─────────────────────────────────┼────────────┤
#> │ landcover_herbs_percent_average │ 480.689 │
#> ├─────────────────────────────────┼────────────┤
#> │ fragmentation_division │ 437.163 │
#> └─────────────────────────────────┴────────────┘