Added the function
rf_importance(). It fits models with and without each predictor, compares them via spatial cross-validation with
rf_evaluate(), and returns the increase/decrease in performance when a given variable is included in the model.
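The idea behind this kind of importance score can be sketched in base R. The sketch below is a simplification under stated assumptions: it uses lm() instead of a random forest, plain (non-spatial) k-fold cross-validation instead of rf_evaluate(), and squared correlation as the performance score. The helpers cv_r2() and drop_one_importance() are hypothetical names, not spatialRF functions.

```r
# Leave-one-predictor-out importance, sketched with lm() and k-fold CV.
# cv_r2() and drop_one_importance() are hypothetical helpers for illustration.

# Median out-of-fold R squared for a given set of predictors
cv_r2 <- function(data, response, predictors, folds) {
  r2 <- vapply(
    sort(unique(folds)),
    function(k) {
      train <- data[folds != k, , drop = FALSE]
      test <- data[folds == k, , drop = FALSE]
      model <- lm(reformulate(predictors, response = response), data = train)
      cor(test[[response]], predict(model, newdata = test))^2
    },
    numeric(1)
  )
  median(r2)
}

# Performance with all predictors minus performance without each one
drop_one_importance <- function(data, response, predictors, k = 4, seed = 1) {
  set.seed(seed)
  folds <- sample(rep_len(seq_len(k), nrow(data)))
  full <- cv_r2(data, response, predictors, folds)
  vapply(
    predictors,
    function(p) full - cv_r2(data, response, setdiff(predictors, p), folds),
    numeric(1)
  )
}

importance <- drop_one_importance(mtcars, "mpg", c("wt", "hp", "qsec"))
print(importance)
```

A positive value means the model performs better on held-out data when the predictor is included, and a negative value means it performs worse.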
The default random seed for all functions has changed to 1 to facilitate reproducibility.
rf_evaluate() has a new argument named
grow.testing.folds. When set to
TRUE, it uses
1 - training.fraction instead of
training.fraction to grow the spatial folds, and then flips the names of the training and testing folds. As a result, the testing folds are generally surrounded by the training folds (just the opposite of the default behavior of the function), which might be beneficial for particular spatial structures of the training data. Thanks to Aleksandra Kulawska for the suggestion!
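The flip described above can be illustrated with a small base R sketch. It assumes a nearest-neighbour simplification of fold growing (the records closest to a focal record form the grown fold), which is not necessarily how rf_evaluate() builds its folds. make_fold() is a hypothetical helper.

```r
# make_fold() is a hypothetical helper. It "grows" a fold around a focal
# record by taking the nearest records, a simplification of growing a
# spatial buffer.
make_fold <- function(xy, focal, training.fraction = 0.75,
                      grow.testing.folds = FALSE) {
  n <- nrow(xy)
  # Euclidean distance from every record to the focal record
  d <- sqrt((xy$x - xy$x[focal])^2 + (xy$y - xy$y[focal])^2)
  # Fraction of records used to grow the fold
  grow.fraction <- if (grow.testing.folds) {
    1 - training.fraction
  } else {
    training.fraction
  }
  grown <- order(d)[seq_len(round(grow.fraction * n))]
  rest <- setdiff(seq_len(n), grown)
  # When growing testing folds, the names of the two folds are flipped
  if (grow.testing.folds) {
    list(training = rest, testing = grown)
  } else {
    list(training = grown, testing = rest)
  }
}

set.seed(1)
xy <- data.frame(x = runif(100), y = runif(100))
default.fold <- make_fold(xy, focal = 1)
flipped.fold <- make_fold(xy, focal = 1, grow.testing.folds = TRUE)
```

In both cases the training fold holds the same share of the records. What changes is which fold is the compact one around the focal record: the training fold by default, the testing fold when the folds are flipped.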
Overhaul of the methods used for parallelization. The functions
rf_interactions() can now accept a cluster definition generated with
parallel::makeCluster() via the
cluster argument. Also, models resulting from these functions and
rf() carry the cluster definition with them in the slot
model$cluster, so the cluster definition can be passed from function to function using a pipe, as shown below:
    library(spatialRF)
    library(magrittr)

    #loading the example data
    data(plant_richness_df)
    data("distance_matrix")
    xy <- plant_richness_df[, c("x", "y")]
    dependent.variable.name <- "richness_species_vascular"
    predictor.variable.names <- colnames(plant_richness_df)[5:21]

    #creating cluster
    my.cluster <- parallel::makeCluster(
      4,
      type = "PSOCK"
    )

    #registering cluster (rf functions register it anyway)
    doParallel::registerDoParallel(cl = my.cluster)

    #fitting model
    m <- rf(
      data = plant_richness_df,
      dependent.variable.name = dependent.variable.name,
      predictor.variable.names = predictor.variable.names,
      distance.matrix = distance_matrix,
      xy = xy,
      cluster = my.cluster
    ) %>%
      rf_spatial() %>%
      rf_tuning() %>%
      rf_evaluate() %>%
      rf_repeat()

    #stopping cluster
    parallel::stopCluster(cl = my.cluster)
The system works as follows: if cluster is not provided but model is, the function looks into the model. If there is a cluster definition there, it is used to parallelize computations, but the cluster is not stopped within the function. If there is no cluster in model, the function falls back to the argument n.cores to generate a cluster that is stopped when the function ends its operations.
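The resolution order can be sketched as follows. This is an illustration, not the package's internal code: resolve_cluster() is a hypothetical helper, and it assumes that a cluster passed explicitly via the cluster argument takes precedence and is left running for the caller to stop.

```r
# resolve_cluster() is a hypothetical helper for illustration only.
# It returns the cluster to use and whether the calling function
# should stop it when it finishes.
resolve_cluster <- function(cluster = NULL, model = NULL, n.cores = 1) {
  # 1. Cluster supplied by the user (assumed to take precedence and to
  #    remain the user's responsibility to stop)
  if (!is.null(cluster)) {
    return(list(cluster = cluster, stop.on.exit = FALSE))
  }
  # 2. Cluster carried by the model in its "cluster" slot
  if (!is.null(model) && !is.null(model$cluster)) {
    return(list(cluster = model$cluster, stop.on.exit = FALSE))
  }
  # 3. Fallback: create a temporary cluster from n.cores
  list(
    cluster = parallel::makeCluster(n.cores, type = "PSOCK"),
    stop.on.exit = TRUE
  )
}

# A list standing in for a fitted model that carries a cluster definition
my.cluster <- parallel::makeCluster(1, type = "PSOCK")
model <- list(cluster = my.cluster)

from.model <- resolve_cluster(model = model)
fallback <- resolve_cluster(n.cores = 1)

# Only the temporary cluster is stopped by the function that created it
if (fallback$stop.on.exit) parallel::stopCluster(fallback$cluster)

# The shared cluster is stopped by the user at the end of the script
parallel::stopCluster(my.cluster)
```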
These changes should improve performance when working with several functions in the same script, because these functions do not have to waste time generating their own clusters.
rf_interactions() is now named
rf_repeat() now generates a proper “importance” slot for models fitted with rf_spatial(), and preserves the “evaluation” and “tuning” slots if they exist.
Simplified rf_spatial() by removing options to generate an rf_repeat() model on the fly. rf_repeat() should now only be used at the end of a workflow, as described in the documentation.
Fixed issue with the area of the violin plots generated by plot_importance().
Improved the function rf_interactions() with a new type of interaction (first factor of a PCA between two predictors) and added criteria to reduce multicollinearity among interactions, and between interactions and predictors. The function now also returns data helpful to fit models right away.
Added new residuals diagnostics with the functions residuals_diagnostics() and plot_residuals_diagnostics(). This changed the name of the slot “spatial.autocorrelation.residuals” to “residuals”, which now stores all the information related to the residuals.
All plotting functions now allow changing the color of their key components.
Changed the names of function arguments from ‘x’ to ‘model’ or ‘distance.matrix’ for consistency. This might break code written previously, but I hope argument names are more self-explanatory now.
The function rf_spatial() now fits a non-spatial model first, and only generates spatial predictors for those distance.thresholds that show positive spatial autocorrelation.
Added a new function named filter_spatial_predictors(), which removes redundant spatial predictors within rf_spatial(). It shouldn’t lead to changes in the spatial models fitted with previous versions, but it will make them more parsimonious.
Changed the style of the package’s boxplots.
When using rf_repeat(), the median of the variable importance scores, performance scores, and Moran’s I is reported, instead of the mean.
Added the functions plot_training_data() and plot_moran_training_data() to help explore the training data prior to modeling.
Also fixed an issue where response variables could be identified as binary by mistake.
A bug regarding the predictions generated by
rf() that affected every other function fitting models has been fixed. Previously, the model predictions came from the “predictions” slot produced by
ranger(). Such predictions are produced from the out-of-bag data during model training; they differ from those produced with predict() and lead to lower R-squared values. Now the predictions yielded by rf() are generated with predict(), and therefore you might notice that models fitted with spatialRF functions now perform better than before, because they do.
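The difference between the two kinds of predictions can be seen directly with ranger. The sketch below uses the built-in mtcars data rather than spatialRF data, and a squared-correlation R squared, which is an assumption for illustration and not necessarily the metric computed by rf().

```r
library(ranger)

# Fit a regression forest on a built-in dataset
m <- ranger(mpg ~ ., data = mtcars, num.trees = 500, seed = 1)

# Out-of-bag predictions stored in the fitted model
oob.predictions <- m$predictions

# Predictions generated with predict() on the training data
full.predictions <- predict(m, data = mtcars)$predictions

# Squared correlation between observed and predicted values
# (an assumption made for this sketch)
r_squared <- function(observed, predicted) cor(observed, predicted)^2

r2.oob <- r_squared(mtcars$mpg, oob.predictions)
r2.full <- r_squared(mtcars$mpg, full.predictions)

print(c(oob = r2.oob, predict = r2.full))
```

Each out-of-bag prediction only uses the trees that did not see the record during training, so the out-of-bag R squared is typically the lower of the two.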
print_evaluation() does not use huxtable any longer to print the evaluation results, and only shows the results of the testing model.
Added support for binary data (0 and 1). The function
rf() now tests if the data is binary, and if so, it populates the
case.weights argument of
ranger with the new function
case_weights() to minimize the side effects of unbalanced data.
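One common way to build such weights is to make each case's weight inversely proportional to the size of its class, so that both classes carry the same total weight. The sketch below shows that scheme under the assumption that it resembles what case_weights() does. inverse_frequency_weights() is a hypothetical helper, not the package function.

```r
# inverse_frequency_weights() is a hypothetical helper for illustration.
# Each case gets a weight of 1 / (number of cases in its class).
inverse_frequency_weights <- function(y) {
  stopifnot(all(y %in% c(0, 1)))
  counts <- table(y)
  as.numeric(1 / counts[as.character(y)])
}

# Unbalanced binary response: three zeros and a single one
y <- c(0, 0, 0, 1)
w <- inverse_frequency_weights(y)

# Total weight per class is now balanced
print(tapply(w, y, sum))
```

A vector like w can then be supplied to the case.weights argument of ranger(), so that the minority class is sampled with higher probability when growing each tree.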
Fixed an issue where rf() applied the wrong is.numeric check to the response variable and the predictors, which caused issues with tibbles.
Removed the function scale_robust() from rf(), and replaced it with scale(). It was causing more trouble than benefit.
Modified rf_tuning() to better tune models fitted with rf_spatial().
Minor fixes in several other functions.