NEWS.md
Added the function rf_importance(). It fits models with and without each predictor, compares them via spatial cross-validation with rf_evaluate(), and returns the increase or decrease in performance when a given variable is included in the model.
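The idea of comparing models with and without each predictor can be sketched in base R. This is only an illustration of the principle under simplifying assumptions: it uses lm(), ordinary random folds, and the mtcars dataset, whereas rf_importance() relies on random forest models and spatial folds. The helper cv_r_squared() is hypothetical.

```r
# Illustration only: ordinary k-fold cross-validation with lm() on mtcars.
# spatialRF uses random forest models and spatial folds instead.
set.seed(1)

data <- mtcars
response <- "mpg"
predictors <- c("wt", "hp", "qsec")

# Assign each row to one of 5 folds at random.
folds <- sample(rep(1:5, length.out = nrow(data)))

# Hypothetical helper: cross-validated R squared of a linear model
# fitted with the given predictors.
cv_r_squared <- function(predictor.names) {
  model.formula <- reformulate(predictor.names, response = response)
  predicted <- numeric(nrow(data))
  for (k in sort(unique(folds))) {
    testing <- folds == k
    m <- lm(model.formula, data = data[!testing, ])
    predicted[testing] <- predict(m, newdata = data[testing, ])
  }
  cor(data[[response]], predicted)^2
}

# Performance with all predictors, and without each one in turn.
r2.with <- cv_r_squared(predictors)
r2.without <- vapply(
  predictors,
  function(p) cv_r_squared(setdiff(predictors, p)),
  numeric(1)
)

# Positive values of r2.change: including the variable improves performance.
importance <- data.frame(
  variable = predictors,
  r2.with = r2.with,
  r2.without = r2.without,
  r2.change = r2.with - r2.without
)
importance
```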
The default random seed for all functions has changed from NULL to 1 to facilitate reproducibility.
The function rf_evaluate() has a new argument named grow.testing.folds. When set to TRUE, it uses 1 - training.fraction instead of training.fraction to grow the spatial folds, and then flips the names of the training and testing folds. As a result, the testing folds are generally surrounded by the training folds (the opposite of the function's default behavior), which might be beneficial for particular spatial structures of the training data. Thanks to Aleksandra Kulawska for the suggestion!
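A minimal sketch of the flipping logic, assuming folds grow by adding the nearest neighbours of a focal record (the way the package actually grows folds may differ). The helpers grow_fold() and make_fold() and the simulated coordinates are hypothetical.

```r
set.seed(1)

# Hypothetical coordinates of 100 records.
xy <- data.frame(x = runif(100), y = runif(100))

# Hypothetical helper: grows a fold around the record `focal` by adding its
# nearest neighbours until the fold holds `fraction` of all records.
grow_fold <- function(xy, focal, fraction) {
  d <- sqrt((xy$x - xy$x[focal])^2 + (xy$y - xy$y[focal])^2)
  n.inside <- floor(nrow(xy) * fraction)
  inside <- order(d)[seq_len(n.inside)]
  list(inside = inside, outside = setdiff(seq_len(nrow(xy)), inside))
}

# Hypothetical helper mimicking the two behaviours described above.
make_fold <- function(xy, focal, training.fraction, grow.testing.folds = FALSE) {
  if (grow.testing.folds) {
    # Grow with 1 - training.fraction, then flip the names.
    fold <- grow_fold(xy, focal, 1 - training.fraction)
    list(training = fold$outside, testing = fold$inside)
  } else {
    fold <- grow_fold(xy, focal, training.fraction)
    list(training = fold$inside, testing = fold$outside)
  }
}

default.fold <- make_fold(xy, focal = 1, training.fraction = 0.75)
flipped.fold <- make_fold(
  xy,
  focal = 1,
  training.fraction = 0.75,
  grow.testing.folds = TRUE
)

# Both folds keep 75 training records; only their spatial layout differs.
lengths(default.fold)
lengths(flipped.fold)
```

In the default fold the focal record sits inside the training block; in the flipped fold it sits inside the testing block, which is surrounded by training records.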
Overhaul of the methods used for parallelization. The functions rf_spatial(), rf_repeat(), rf_evaluate(), rf_tuning(), rf_compare(), and rf_interactions() can now accept a cluster definition generated with parallel::makeCluster() via the cluster argument. Also, models resulting from these functions and rf() carry the cluster definition with them in the slot model$cluster, so the cluster definition can be passed from function to function using a pipe, as shown below:
library(spatialRF)
library(magrittr)
#loading the example data
data(plant_richness_df)
data("distance_matrix")
xy <- plant_richness_df[, c("x", "y")]
dependent.variable.name <- "richness_species_vascular"
predictor.variable.names <- colnames(plant_richness_df)[5:21]
#creating cluster
my.cluster <- parallel::makeCluster(
  4,
  type = "PSOCK"
)
#registering cluster (rf functions register it anyway)
doParallel::registerDoParallel(cl = my.cluster)
#fitting model
m <- rf(
  data = plant_richness_df,
  dependent.variable.name = dependent.variable.name,
  predictor.variable.names = predictor.variable.names,
  distance.matrix = distance_matrix,
  xy = xy,
  cluster = my.cluster
) %>%
  rf_spatial() %>%
  rf_tuning() %>%
  rf_evaluate() %>%
  rf_repeat()
#stopping cluster
parallel::stopCluster(cl = my.cluster)
The system works as follows: if cluster is NULL and model is provided, the function looks into the model. If there is a cluster definition there, it is used to parallelize computations, but the cluster is not stopped within the function. If there is no cluster in model, the function falls back to the argument n.cores to generate a cluster that is stopped when the function ends its operations.
These changes should improve performance when working with several functions in the same script, because these functions no longer have to spend time generating their own clusters.
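The fallback logic can be sketched as follows. The helper resolve_cluster() is hypothetical and not part of spatialRF. It reads the rule as: an explicit cluster argument comes first, then a cluster stored in the model, then a temporary cluster built from n.cores. The precedence of the explicit argument is an assumption.

```r
library(parallel)

# Hypothetical helper, not part of spatialRF: decides which cluster to use
# and whether the calling function is responsible for stopping it.
resolve_cluster <- function(cluster = NULL, model = NULL, n.cores = 1) {
  # 1. A cluster passed explicitly is used and left running (assumption).
  if (!is.null(cluster)) {
    return(list(cluster = cluster, stop.on.exit = FALSE))
  }
  # 2. Otherwise, a cluster stored in the model is used and left running.
  if (!is.null(model) && !is.null(model$cluster)) {
    return(list(cluster = model$cluster, stop.on.exit = FALSE))
  }
  # 3. Otherwise, a temporary cluster is created from n.cores.
  list(cluster = makeCluster(n.cores, type = "PSOCK"), stop.on.exit = TRUE)
}

# A cluster owned by the user and carried inside a mock model.
my.cluster <- makeCluster(1, type = "PSOCK")
mock.model <- list(cluster = my.cluster)

from.model <- resolve_cluster(model = mock.model)
from.model$stop.on.exit   # FALSE: the user stops this cluster

# No cluster anywhere: fall back to n.cores.
temporary <- resolve_cluster(n.cores = 1)
temporary$stop.on.exit    # TRUE: the function stops this cluster
stopCluster(temporary$cluster)

stopCluster(my.cluster)
```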
The function rf_interactions() is now named the_feature_engineer().
The function cluster_definition() is now named beowulf_cluster(), and returns a cluster instead of a cluster definition to be used as input for parallel::makeCluster().
rf_repeat() now generates a proper “importance” slot for models fitted with rf_spatial(), and preserves the “evaluation” and “tuning” slots if they exist.
Simplified rf_spatial() by removing options to generate an rf_repeat() model on the fly. rf_repeat() should only be used now at the end of a workflow, as described in the documentation.
Fixed issue with the area of the violin plots generated by plot_importance().
Improved the function rf_interactions(): it offers a new type of interaction (first factor of a PCA between two predictors), applies criteria to reduce multicollinearity among interactions and between interactions and predictors, and returns data that can be used to fit models right away.
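A minimal sketch of a PCA-based interaction using base R's prcomp(), with two simulated predictors. The exact preprocessing and screening criteria applied by the package are not shown here; the correlation check at the end is only one possible way to inspect multicollinearity.

```r
set.seed(1)

# Two hypothetical predictors.
a <- rnorm(50)
b <- 0.5 * a + rnorm(50)

# PCA on the centered and scaled predictors.
pca <- prcomp(cbind(a, b), center = TRUE, scale. = TRUE)

# The first factor (scores on the first principal component)
# is the new interaction variable.
interaction.a.b <- pca$x[, 1]

# Possible screening step: correlation of the new variable with each predictor.
cor(interaction.a.b, cbind(a, b))
```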
Added new residuals diagnostics with the functions residuals_diagnostics() and plot_residuals_diagnostics(). This changed the name of the slot “spatial.autocorrelation.residuals” to “residuals”, which now stores all the information related to the residuals.
All plotting functions now allow changing the color of their key components.
Changed the names of function arguments from ‘x’ to ‘model’ or ‘distance.matrix’ for consistency. This might break code written previously, but I hope argument names are more self-explanatory now.
The function rf_spatial() now fits a non-spatial model first, and only generates spatial predictors for those distance.thresholds that show positive spatial autocorrelation.
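The selection of distance thresholds can be illustrated with a base R computation of Moran's I on simulated residuals. This is a sketch under stated assumptions: the binary neighbourhood weights, the helper moran_i(), and the rule of comparing against the expected value under no autocorrelation are illustrative choices, not necessarily the ones used by the package.

```r
set.seed(1)

# Hypothetical data: coordinates and spatially structured model residuals.
n <- 60
xy <- cbind(x = runif(n), y = runif(n))
model.residuals <- sin(4 * xy[, "x"]) + rnorm(n, sd = 0.2)
distance.matrix <- as.matrix(dist(xy))

# Hypothetical helper: Moran's I with binary weights, where two records are
# neighbours if they are within distance.threshold of each other.
moran_i <- function(x, distance.matrix, distance.threshold) {
  w <- (distance.matrix <= distance.threshold) * 1
  diag(w) <- 0
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

distance.thresholds <- c(0.1, 0.2, 0.4)
moran.i <- vapply(
  distance.thresholds,
  function(d) moran_i(model.residuals, distance.matrix, d),
  numeric(1)
)

# Expected value of Moran's I under no spatial autocorrelation.
expected.i <- -1 / (n - 1)

# Thresholds that would be kept to generate spatial predictors.
selected.thresholds <- distance.thresholds[moran.i > expected.i]
selected.thresholds
```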
Added a new function named filter_spatial_predictors(), which removes redundant spatial predictors within rf_spatial(). It shouldn’t lead to changes in the spatial models fitted with previous versions, but it will make them more parsimonious.
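One common way to remove redundant predictors is a correlation filter, sketched below. The helper drop_redundant(), its cor.threshold value, and the simulated predictors are hypothetical; the criteria used by filter_spatial_predictors() may differ.

```r
set.seed(1)

# Hypothetical spatial predictors; sp_2 is nearly a copy of sp_1.
sp_1 <- rnorm(100)
spatial.predictors <- data.frame(
  sp_1 = sp_1,
  sp_2 = sp_1 + rnorm(100, sd = 0.05),
  sp_3 = rnorm(100)
)

# Hypothetical helper: keeps a predictor only if its absolute correlation
# with every predictor already kept is below cor.threshold.
drop_redundant <- function(df, cor.threshold = 0.75) {
  cor.matrix <- abs(cor(df))
  kept <- character(0)
  for (name in colnames(df)) {
    if (all(cor.matrix[name, kept] < cor.threshold)) {
      kept <- c(kept, name)
    }
  }
  df[, kept, drop = FALSE]
}

filtered.predictors <- drop_redundant(spatial.predictors)
colnames(filtered.predictors)   # "sp_1" "sp_3"
```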
Changed the style of the package’s boxplots.
When using rf_repeat(), the median of the variable importance scores, performance scores, and Moran’s I is reported, instead of the mean.
Added the functions plot_training_data() and plot_moran_training_data() to help explore the training data prior to modeling.
Also fixed an issue where response variables could be identified as binary by mistake.
A bug regarding the predictions generated by rf() that affected every other function fitting models has been fixed. Previously, the model predictions came from the “predictions” slot produced by ranger(). Such predictions are produced from the out-of-bag data during model training, so they differ from those produced with predict() and lead to lower R squared values. Now the predictions yielded by rf() are generated with predict(), and therefore you might notice that models fitted with spatialRF functions now perform better than before, because they do.
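The difference between both kinds of predictions can be checked directly with ranger. The sketch below uses the mtcars dataset as a stand-in and computes R squared as the squared correlation between observed and predicted values, which is an assumption about the metric rather than a description of how rf() computes it.

```r
library(ranger)

# Random forest fitted on a stand-in dataset, with a fixed seed.
m <- ranger(mpg ~ ., data = mtcars, num.trees = 500, seed = 1)

# Out-of-bag predictions stored in the model during training.
oob.predictions <- m$predictions

# Predictions computed with predict() on the training data.
predict.predictions <- predict(m, data = mtcars)$predictions

# R squared as the squared correlation between observed and predicted values.
r_squared <- function(observed, predicted) cor(observed, predicted)^2

r2.oob <- r_squared(mtcars$mpg, oob.predictions)
r2.predict <- r_squared(mtcars$mpg, predict.predictions)

# The out-of-bag value is the lower of the two.
c(oob = r2.oob, predict = r2.predict)
```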
The function print_evaluation() no longer uses huxtable to print the evaluation results, and only shows the results of the testing model.
Added support for binary data (0 and 1). The function rf() now tests if the data is binary, and if so, it populates the case.weights argument of ranger with the new function case_weights() to minimize the side effects of unbalanced data.
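A minimal sketch of how weights can balance a binary response, assuming each case is weighted by the inverse of its class size. The helper balanced_case_weights() is hypothetical and may not match the formula used by case_weights().

```r
# Hypothetical helper: each case is weighted by the inverse of the number of
# cases in its class, so both classes get the same total weight.
balanced_case_weights <- function(x) {
  counts <- table(x)
  as.numeric(1 / counts[as.character(x)])
}

# Unbalanced binary response: eight zeros and two ones.
response <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
case.weights <- balanced_case_weights(response)

# Total weight per class: both classes sum to 1.
tapply(case.weights, response, sum)
```

A vector like this one is what the case.weights argument of ranger expects: one non-negative weight per training case.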
Fixed an issue where rf() applied the wrong is.numeric check to the response variable and the predictors, which caused issues with tibbles.
Removed the function scale_robust() from rf(), and replaced it with scale(). It was causing more trouble than it was worth.