`NEWS.md`

Added the function `rf_importance()`

. It fits models with and without each predictor, compares them via spatial cross validation with `rf_evaluate()`

, and returns the increase/decrease in performance when a given variable is included in the model.

The default random seed for all functions have changed from `NULL`

to `1`

to facilitate reproducibility.

The function `rf_evaluate()`

has a new argument named `grow.testing.folds`

. When set to `TRUE`

, it uses `1 - training.fraction`

instead of `training.fraction`

to grow the spatial folds, and then flips the names of the training and testing folds. As a result, the testing folds are generally surrounded by the training folds (just the opposite of the default behavior of the function), which might be beneficial for particular spatial structures of the training data. Thanks to **Aleksandra Kulawska** for the suggestion!

Overhaul of the methods used for parallelization. The functions `rf_spatial()`

, `rf_repeat()`

, `rf_evaluate()`

, `rf_tuning()`

, `rf_compare()`

, and `rf_interactions()`

can now accept a cluster definition generated with `parallel::makeCluster()`

via the `cluster`

argument. Also, models resulting from these functions and `rf()`

carry the cluster definition with themselves in the slot `model$cluster`

, so the cluster definition can be passed from function to function using a pipe, as shown below:

```
library(spatialRF)
library(magrittr)
#loading the example data
data(plant_richness_df)
data("distance_matrix")
xy <- plant_richness_df[, c("x", "y")]
dependent.variable.name <- "richness_species_vascular"
predictor.variable.names <- colnames(plant_richness_df)[5:21]
#creating cluster
my.cluster <- parallel::makeCluster(
4,
type = "PSOCK"
)
#registering cluster (rf functions register it anyway)
doParallel::registerDoParallel(cl = cluster)
#fitting model
m <- rf(
data = plant_richness_df,
dependent.variable.name = dependent.variable.name,
predictor.variable.names = predictor.variable.names,
distance.matrix = distance_matrix,
xy = xy,
cluster = my.cluster
) %>%
rf_spatial() %>%
rf_tuning() %>%
rf_evaluate() %>%
rf_repeat()
#stopping cluster
parallel::stopCluster(cl = my.cluster)
```

The system works as follows: If `cluster`

is not `NULL`

and `model`

is provided, the function looks into the model. If there is a cluster definition there, it is used to parallelize computations, but the cluster is not stopped within the function. If there is not a cluster in `model`

, then the function falls back to the argument `n.cores`

to generate a cluster that is stopped when the function ends its operations.

These changes should improve performance when working with several functions in the same script, becuase these functions do not have to waste time in generating their own clusters.

The function `rf_interactions()`

is now named `the_feature_engineer()`

.

The function `cluster_definition()`

is now named `beowulf_cluster()`

, and returns a cluster instead of a cluster definition to be used as input for `parallel::makeCluster()`

.

rf_repeat() now generates a proper “importance” slot for models fitted with rf_spatial(), and preserves the “evaluation” and “tuning” slots if they exist.

Simplified rf_spatial() by removing options to generate an rf_repeat() model on the fly. rf_repeat() should only be used now at the end of a workflow, as described in the documentation.

Fixed issue with the area of the violin plots generated by plot_importance().

Improved the function rf_interactions() with a new type of interaction (first factor of a PCA between two predictors), added criteria to reduce multicollinearity among interactions, and between interactions and predictors, and now the function returns data helpful to fit models right away.

Added new residuals diagnostics with the functions residuals_diagnostics() and plot_residuals_diagnostics(). This changed the name of the slot “spatial.autocorrelation.residuals” to “residuals”, that now stores all the information relative to the residuals.

All plotting functions now allow to change the color of their key components.

Changed the names of function arguments from ‘x’ to ‘model’ or ‘distance.matrix’ for consistency. This might break code written previously, but I hope argument names are more self-explanatory now.

The function rf_spatial() now fits a non-spatial model first, and only generates spatial predictors for these distance.thresholds that show positive spatial autocorrelation.

Added a new function named filter_spatial_predictors(), that removes redundant spatial predictors within rf_spatial(). It shouldn’t lead to changes in the spatial models fitted with previous versions, but it will make them more parsimonious.

Changed the style of the package’s boxplots.

When using rf_repeat(), the median of the variable importance scores, performance scores, and Moran’s I is reported, instead of the mean.

Added the functions plot_training_data() and plot_moran_training_data() to help explore the training data prior to modeling.

Also fixed an issue where response variables could be identified as binary by mistake.

A bug regarding the predictions generated by `rf()`

that affected every other function fitting models has been fixed. Previously, the model predictions came from the “predictions” slot produced by `ranger()`

. Such predictions are produced from the out-of-bag data during model training, and are different and lead to lower R squared values than those produced with predict(). Now the predictions yielded by rf() are generated with predict(), and therefore you might notice that now models fitted with spatialRF functions perform better than before, because they do.

The function `print_evaluation()`

does not use huxtable any longer to print the evaluation results, and only shows the results of the testing model.

Added support for binary data (0 and 1). The function `rf()`

now tests if the data is binary, and if so, it populates the `case.weights`

argument of `ranger`

with the new function `case_weights()`

to minimize the side effects of unbalanced data.

Fixed an issue where rf() applied the wrong is.numeric check to the response variable and the predictors that caused issues with tibbles.

Removed the function scale_robust() from rf(), and replaced it with scale(). It was giving more troubles than benefits.