Introduction to statistical learning using mlr3

Source: Chapters 1 and 2 of the textbook Applied Machine Learning Using mlr3 in R.

Simple ML workflow

  1. Define Task (e.g. do we want to do regression or classification? on what data? what are the response/predictors?)
  2. Learn/Predict (specify learners, fit learners to training data, predict on test data)
  3. Evaluate (specify error metrics, compute them for each model)
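These three steps can be sketched end to end on a built-in task (a minimal illustration, assuming mlr3 and rpart are installed; the variable names are our own):

```r
library(mlr3)
set.seed(37)

# 1. Define Task: regression of mpg on the built-in mtcars data
task <- tsk("mtcars")

# 2. Learn/Predict: fit a regression tree on a train split, predict on the rest
learner <- lrn("regr.rpart")
splits <- partition(task)
learner$train(task, splits$train)
prediction <- learner$predict(task, splits$test)

# 3. Evaluate: compute mean squared error on the test predictions
prediction$score(msr("regr.mse"))
```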

Example of benchmarking three models on five classification tasks

library(mlr3)
library(mlr3learners)
set.seed(37)

tasks <- tsks(c("pima", "sonar", "zoo", "spam", "wine"))  # use `as.data.table(mlr_tasks)` to see built-in tasks
learners <- lrns(c("classif.featureless", "classif.rpart", "classif.xgboost"), predict_type = "prob")
resampling <- rsmps("cv")  # 10-fold cross-validation by default
bmr <- benchmark(benchmark_grid(tasks, learners, resampling))
INFO  [16:05:36.680] [mlr3] Running benchmark with 150 resampling iterations
INFO  [16:05:36.716] [mlr3] Applying learner 'classif.featureless' on task 'pima' (iter 1/10)
INFO  [16:05:36.727] [mlr3] Applying learner 'classif.featureless' on task 'pima' (iter 2/10)
...
INFO  [16:07:07.290] [mlr3] Applying learner 'classif.xgboost' on task 'wine' (iter 10/10)
INFO  [16:07:07.427] [mlr3] Finished benchmark
mlr3viz::autoplot(bmr, type = "boxplot")
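Instead of (or in addition to) plotting, a benchmark result can be aggregated numerically, one row per task/learner pair. A self-contained sketch on a smaller design (the single task and `folds = 3` are our choices for speed):

```r
library(mlr3)
set.seed(37)

# small benchmark: two learners on one task, 3-fold CV
design <- benchmark_grid(tsk("sonar"),
                         lrns(c("classif.featureless", "classif.rpart")),
                         rsmp("cv", folds = 3))
bmr_small <- benchmark(design)

# mean accuracy across folds for each task/learner combination
agg <- bmr_small$aggregate(msr("classif.acc"))
agg[, .(task_id, learner_id, classif.acc)]
```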

mlr3 supports many learning algorithms (some with multiple implementations) as Learners. These are primarily provided by the mlr3, mlr3learners and mlr3extralearners packages.

library(tidyverse)  # for select(), used below

Task

In the example above, we used tasks that are built into mlr3.

To create your own regression task, construct a new instance of TaskRegr. The simplest way is as_task_regr(), which converts a data.frame type object to a regression task; the target column is specified via the target argument.

Example: using the datasets::mtcars dataset, suppose we want to predict miles per gallon (target = "mpg") from the number of cylinders ("cyl") and displacement ("disp"):

mtcars  # our data frame
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
tsk_mtcars <- mtcars |> 
    select(mpg, cyl, disp) |> 
    as_task_regr(target = "mpg", id = "cars")
tsk_mtcars
<TaskRegr:cars> (32 x 3)
* Target: mpg
* Properties: -
* Features (2):
  - dbl (2): cyl, disp
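Once a task exists, its fields expose the stored data and metadata (a sketch using the same mtcars subset; `tsk_cars` is our own name, and the construction here uses base subsetting rather than dplyr):

```r
library(mlr3)

tsk_cars <- as_task_regr(mtcars[, c("mpg", "cyl", "disp")],
                         target = "mpg", id = "cars")
tsk_cars$nrow           # number of observations
tsk_cars$feature_names  # features: "cyl" "disp"
tsk_cars$target_names   # target: "mpg"
head(tsk_cars$data())   # the data as a data.table
</imports>
```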

We can plot the task using the mlr3viz package:

mlr3viz::autoplot(tsk_mtcars, type = "pairs")

Task mutators

Mutators modify a given Task:

tsk_mtcars_deep <- tsk_mtcars$clone()  # create a deep copy of the `tsk_mtcars` object
tsk_mtcars_deep$select("cyl")  # keep only one feature
tsk_mtcars_deep$filter(2:3)  # keep only these two rows
tsk_mtcars_deep$data()
     mpg   cyl
   <num> <num>
1:  21.0     6
2:  22.8     4

We created a deep copy of tsk_mtcars, named it tsk_mtcars_deep, then modified the deep copy using the mutators $select() and $filter(). Modifying a deep copy does not change the original tsk_mtcars object:

tsk_mtcars$data() |> str()
Classes 'data.table' and 'data.frame':  32 obs. of  3 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 - attr(*, ".internal.selfref")=<externalptr> 

What happens if we instead create a shallow copy of tsk_mtcars and apply the same mutators $select() and $filter()? Because a Task is an R6 object, a shallow copy is simply a second reference to the same underlying object, so mutating the copy also mutates the original:

tsk_mtcars_shallow <- tsk_mtcars  # create a shallow copy of the `tsk_mtcars` object
tsk_mtcars_shallow$select("cyl")  # keep only one feature
tsk_mtcars_shallow$filter(2:3)  # keep only these two rows
tsk_mtcars_shallow$data()
     mpg   cyl
   <num> <num>
1:  21.0     6
2:  22.8     4
tsk_mtcars$data()  # the above code block has changed `tsk_mtcars`
     mpg   cyl
   <num> <num>
1:  21.0     6
2:  22.8     4

Learner

Objects of class Learner provide a unified interface to many popular machine learning algorithms in R.

# load mtcars task
tsk_mtcars <- tsk("mtcars")
# load a regression tree
lrn_rpart <- lrn("regr.rpart")

We could train and test on the same data, but the resulting error estimate would be optimistically biased and would not reflect the fitted model's true generalization ability.

Here we will do a simple train/test split.

splits <- partition(tsk_mtcars, ratio = 0.67)
str(splits)
List of 3
 $ train     : int [1:21] 1 2 4 5 8 9 11 12 13 14 ...
 $ test      : int [1:11] 3 6 7 10 15 16 18 24 25 29 ...
 $ validation: int(0) 
lrn_rpart$train(tsk_mtcars, row_ids = splits$train)  # train
prediction <- lrn_rpart$predict(tsk_mtcars, row_ids = splits$test)
prediction
<PredictionRegr> for 11 observations:
 row_ids truth response
       3  22.8 27.08750
       6  18.1 17.60769
       7  14.3 17.60769
     ---   ---      ---
      25  19.2 17.60769
      29  15.8 17.60769
      32  21.4 27.08750
prediction$response[1:2]  # can access values
[1] 27.08750 17.60769
mlr3viz::autoplot(prediction)
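A Prediction object can also be converted to a data.table for further analysis (a self-contained sketch reproducing a similar split; the variable names are ours):

```r
library(mlr3)
set.seed(37)

task <- tsk("mtcars")
splits <- partition(task, ratio = 0.67)
learner <- lrn("regr.rpart")
learner$train(task, splits$train)
pred <- learner$predict(task, splits$test)

# one row per test observation: row_ids, truth, response
head(as.data.table(pred))
```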

We can also predict on a completely separate data.frame type object. (The truth column values are all NA, as we did not include a target column in the generated data.)

library(data.table)
mtcars_new <- data.table(cyl = c(5, 6), disp = c(100, 120),
  hp = c(100, 150), drat = c(4, 3.9), wt = c(3.8, 4.1),
  qsec = c(18, 19.5), vs = c(1, 0), am = c(1, 1),
  gear = c(6, 4), carb = c(3, 5))
prediction <- lrn_rpart$predict_newdata(mtcars_new)
prediction
<PredictionRegr> for 2 observations:
 row_ids truth response
       1    NA 17.60769
       2    NA 17.60769

Changing the prediction type

Several regression models can also predict standard errors. If we also want the SE values at each test point:

library(mlr3learners)
lrn_lm <- lrn("regr.lm", predict_type = "se")
lrn_lm$train(tsk_mtcars, splits$train)
lrn_lm$predict(tsk_mtcars, splits$test)
<PredictionRegr> for 11 observations:
 row_ids truth response       se
       3  22.8 30.30380 1.868736
       6  18.1 22.12807 1.823571
       7  14.3 13.21343 2.098653
     ---   ---      ---      ---
      25  19.2 17.26430 1.541597
      29  15.8 23.91043 3.282927
      32  21.4 28.89246 2.015573
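One use of the se column is to form approximate 95% intervals around each point prediction (a sketch under a normality assumption; the interval construction is our own illustration, not part of mlr3):

```r
library(mlr3)
library(mlr3learners)
set.seed(37)

task <- tsk("mtcars")
splits <- partition(task)
lrn_lm <- lrn("regr.lm", predict_type = "se")
lrn_lm$train(task, splits$train)
p <- lrn_lm$predict(task, splits$test)

# approximate 95% interval: response +/- 1.96 standard errors
intervals <- data.frame(truth = p$truth,
                        lower = p$response - 1.96 * p$se,
                        upper = p$response + 1.96 * p$se)
head(intervals)
```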

Baseline learners

For regression, the baseline lrn("regr.featureless") always predicts new values to be the mean (or median, if the robust hyperparameter is set to TRUE) of the target in the training data:

# generate synthetic data in which y does not depend on x
tsk_synth <- as_task_regr(data.frame(x = runif(1000), y = rnorm(1000, 2, 1)), target = "y")
lrn("regr.featureless")$train(tsk_synth, 1:995)$predict(tsk_synth, 996:1000)
<PredictionRegr> for 5 observations:
 row_ids     truth response
     996 1.9830601 2.024395
     997 0.2229826 2.024395
     998 1.5608587 2.024395
     999 0.5014346 2.024395
    1000 1.4223985 2.024395

It is good practice to test all new models against a baseline, and also to include baselines in experiments with many other models. In general,

  • a model that does not outperform a baseline is a ‘bad’ model,
  • but a model is not necessarily ‘good’ if it outperforms the baseline.

Evaluation

# Same decision tree example as above
lrn_rpart <- lrn("regr.rpart")
tsk_mtcars <- tsk("mtcars")
splits <- partition(tsk_mtcars)
lrn_rpart$train(tsk_mtcars, splits$train)
prediction <- lrn_rpart$predict(tsk_mtcars, splits$test)

For regression, common choices are the mean squared error (MSE) and the mean absolute error (MAE):

measures <- msrs(c("regr.mse", "regr.mae"))
prediction$score(measures)
regr.mse regr.mae 
19.09496  3.75000 
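Other measures are available in the mlr_measures dictionary, which can be converted to a table and filtered by task type (a sketch):

```r
library(mlr3)

# list keys of built-in regression measures
as.data.table(mlr_measures)[task_type == "regr", .(key)]
```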

Regression experiment

The following experiment will compare the predictive performance of two models (featureless and decision tree) on the mtcars dataset.

library(mlr3)
set.seed(37)
# load and partition our task
tsk_mtcars <- tsk("mtcars")
splits <- partition(tsk_mtcars)
# load featureless learner
lrn_featureless <- lrn("regr.featureless")
# load decision tree and set hyperparameters
lrn_rpart <- lrn("regr.rpart", cp = 0.2, maxdepth = 5)
# load MSE and MAE measures
measures <- msrs(c("regr.mse", "regr.mae"))
# train learners
lrn_featureless$train(tsk_mtcars, splits$train)
lrn_rpart$train(tsk_mtcars, splits$train)
# make and score predictions
lrn_featureless$predict(tsk_mtcars, splits$test)$score(measures)
 regr.mse  regr.mae 
15.622051  3.229004 
lrn_rpart$predict(tsk_mtcars, splits$test)$score(measures)
 regr.mse  regr.mae 
17.381816  3.211869 
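Note that on this particular split the tree's MSE is actually worse than the baseline's; a single train/test split is a noisy estimate. Resampling gives a more stable comparison (a sketch; the 10-fold choice is ours):

```r
library(mlr3)
set.seed(37)

design <- benchmark_grid(tsk("mtcars"),
                         lrns(c("regr.featureless", "regr.rpart")),
                         rsmp("cv", folds = 10))
agg <- benchmark(design)$aggregate(msr("regr.mse"))
agg[, .(learner_id, regr.mse)]
```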

Classification experiment

The following experiment will compare the predictive performance of two models (featureless and decision tree) on the palmerpenguins::penguins dataset.

library(mlr3)
set.seed(37)
# load and partition our task
tsk_penguins <- tsk("penguins")
splits <- partition(tsk_penguins)
# load featureless learner
lrn_featureless <- lrn("classif.featureless")
# load decision tree and set hyperparameters
lrn_rpart <- lrn("classif.rpart", cp = 0.2, maxdepth = 5)
# load accuracy measure
measure <- msr("classif.acc")
# train learners
lrn_featureless$train(tsk_penguins, splits$train)
lrn_rpart$train(tsk_penguins, splits$train)
# make and score predictions
lrn_featureless$predict(tsk_penguins, splits$test)$score(measure)
classif.acc 
  0.4912281 
lrn_rpart$predict(tsk_penguins, splits$test)$score(measure)
classif.acc 
  0.9385965 
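Accuracy alone hides which classes get mixed up; the prediction's confusion matrix shows this directly (a sketch reproducing the tree fit above):

```r
library(mlr3)
set.seed(37)

tsk_penguins <- tsk("penguins")
splits <- partition(tsk_penguins)
lrn_rpart <- lrn("classif.rpart", cp = 0.2, maxdepth = 5)
lrn_rpart$train(tsk_penguins, splits$train)
p <- lrn_rpart$predict(tsk_penguins, splits$test)
p$confusion  # rows = predicted class, columns = true class
```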

Classification tasks are very similar to regression tasks, except that the target variable is of type factor.

To create your own classification task, construct a new instance of TaskClassif. The simplest way is as_task_classif(), which converts a data.frame type object to a classification task; the target column is specified via the target argument.

as_task_classif(palmerpenguins::penguins, target = "species")
<TaskClassif:palmerpenguins::penguins> (344 x 8)
* Target: species
* Properties: multiclass
* Features (7):
  - int (3): body_mass_g, flipper_length_mm, year
  - dbl (2): bill_depth_mm, bill_length_mm
  - fct (2): island, sex

Plotting is possible with autoplot.TaskClassif; below we plot a comparison between the target column and features.

library(ggplot2)
autoplot(tsk("penguins"), type = "duo") +
  theme(strip.text.y = element_text(angle = -45, size = 8))

Summary

From Chapter 2 of https://mlr3book.mlr-org.com