Goal
Our goal for this exercise sheet is to learn the basics of mlr3 for supervised learning by training a first simple model on training data and by evaluating its performance on hold-out/test data.
German Credit Dataset
The German credit dataset was donated by Prof. Dr. Hans Hofmann of the University of Hamburg in 1994 and contains 1000 data points reflecting bank customers. The goal is to classify people as a good or bad credit risk based on 20 personal, demographic and financial features. The dataset is available at the UCI repository as the Statlog (German Credit Data) Data Set.
Motivation of Risk Prediction
Customers who do not repay their loan on time represent an enormous risk for a bank: first, because they create an unintended gap in the bank’s planning, and second, because collecting the outstanding repayment causes additional time and cost for the bank.
On the other hand, (interest rates for) loans are an important revenue stream for banks. If a person’s loan is rejected, even though they would have met the repayment deadlines, revenue is lost, as well as potential upselling opportunities.
Banks are therefore highly interested in a risk prediction model that accurately predicts the risk of future customers. This is where supervised learning models come into play.
Data Overview
n = 1,000 observations of bank customers
- credit_risk: is the customer a good or bad credit risk?
- age: age in years
- amount: amount asked by applicant
- credit_history: past credit history of applicant at this bank
- duration: duration of the credit in months
- employment_duration: present employment since
- foreign_worker: is applicant a foreign worker?
- housing: type of apartment: rented, owned, for free / no payment
- installment_rate: installment rate in percentage of disposable income
- job: current job information
- number_credits: number of existing credits at this bank
- other_debtors: other debtors/guarantors present?
- other_installment_plans: other installment plans the applicant is paying
- people_liable: number of people being liable to provide maintenance
- personal_status_sex: combination of sex and personal status of applicant
- present_residence: present residence since
- property: properties that applicant has
- purpose: reason customer is applying for a loan
- savings: savings accounts/bonds at this bank
- status: status/balance of checking account at this bank
- telephone: is there any telephone registered for this customer?
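Before modeling, it helps to know how the target classes are distributed. A minimal sketch, assuming the rchallenge package (loaded in the Preprocessing step below) is installed:

```r
# A quick look at the class balance of the target variable
library("rchallenge")  # provides the 'german' dataset
data("german")

table(german$credit_risk)             # absolute class frequencies
prop.table(table(german$credit_risk)) # relative class frequencies
```

The dataset contains 700 good and 300 bad risks, i.e., the classes are imbalanced, which is worth keeping in mind when judging error rates later.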
Preprocessing
We first load the data from the rchallenge
package (you may need to install it first) and get a brief overview.
# install.packages("rchallenge")
library("rchallenge")
data("german")
skimr::skim(german)
Name | german
Number of rows | 1000
Number of columns | 21
Column type frequency:
factor | 18
numeric | 3
Group variables | None
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
status | 0 | 1 | FALSE | 4 | …: 394, no : 274, …: 269, 0<=: 63 |
credit_history | 0 | 1 | FALSE | 5 | no : 530, all: 293, exi: 88, cri: 49 |
purpose | 0 | 1 | FALSE | 10 | fur: 280, oth: 234, car: 181, car: 103 |
savings | 0 | 1 | FALSE | 5 | unk: 603, …: 183, …: 103, 100: 63 |
employment_duration | 0 | 1 | FALSE | 5 | 1 = : 253, 4 <: 174, < 1: 172 |
installment_rate | 0 | 1 | TRUE | 4 | = : 136 |
personal_status_sex | 0 | 1 | FALSE | 4 | mal: 548, fem: 310, fem: 92, mal: 50 |
other_debtors | 0 | 1 | FALSE | 3 | non: 907, gua: 52, co-: 41 |
present_residence | 0 | 1 | TRUE | 4 | >= : 413, 1 <: 308, 4 <: 149, < 1: 130 |
property | 0 | 1 | FALSE | 4 | bui: 332, unk: 282, car: 232, rea: 154 |
other_installment_plans | 0 | 1 | FALSE | 3 | non: 814, ban: 139, sto: 47 |
housing | 0 | 1 | FALSE | 3 | ren: 714, for: 179, own: 107 |
number_credits | 0 | 1 | TRUE | 4 | 1: 633, 2-3: 333, 4-5: 28, >= : 6 |
job | 0 | 1 | FALSE | 4 | ski: 630, uns: 200, man: 148, une: 22 |
people_liable | 0 | 1 | FALSE | 2 | 0 t: 845, 3 o: 155 |
telephone | 0 | 1 | FALSE | 2 | no: 596, yes: 404 |
foreign_worker | 0 | 1 | FALSE | 2 | no: 963, yes: 37 |
credit_risk | 0 | 1 | FALSE | 2 | goo: 700, bad: 300 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
duration | 0 | 1 | 20.90 | 12.06 | 4 | 12.0 | 18.0 | 24.00 | 72 | ▇▇▂▁▁ |
amount | 0 | 1 | 3271.25 | 2822.75 | 250 | 1365.5 | 2319.5 | 3972.25 | 18424 | ▇▂▁▁▁ |
age | 0 | 1 | 35.54 | 11.35 | 19 | 27.0 | 33.0 | 42.00 | 75 | ▇▆▃▁▁ |
Exercises:
Now, we can start building a model. To do so, we need to address the following questions:
- What is the problem we are trying to solve?
- What is an appropriate learning algorithm?
- How do we evaluate “good” performance?
More systematically, in mlr3 they can be expressed via five components:
- The Task definition.
- The Learner definition.
- The training via $train().
- The prediction via $predict().
- The evaluation via $score().
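Put together, these five components form a short end-to-end workflow. The following preview sketch uses the train_set and test_set objects that are created in the exercises below:

```r
library("mlr3verse")

# 1) Task: conceptualize the ML problem
task = as_task_classif(train_set, target = "credit_risk", positive = "good")
# 2) Learner: pick a learning algorithm
learner = lrn("classif.log_reg")
# 3) Training on the task
learner$train(task)
# 4) Prediction on new, unseen data
pred = learner$predict_newdata(test_set)
# 5) Evaluation, by default with the classification error
pred$score()
```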
Split Data in Training and Test Data
Your task is to split the german
dataset into 70 % training data and 30 % test data by randomly sampling rows. Later, we will use the training data to learn an ML model and use the test data to assess its performance.
Recap: Why do we need train and test data?
We use part of the available data (the training data) to train our model. The remaining/hold-out data (test data) is used to evaluate the trained model. This is exactly how we anticipate using the model in practice: We want to fit the model to existing data and then make predictions on new, unseen data points for which we do not know the outcome/target values.
Note: Hold-out splitting requires a dataset that is sufficiently large such that both the training and test dataset are suitable representations of the target population. What “sufficiently large” means depends on the dataset at hand and the complexity of the problem.
The ratio of training to test data is also context dependent. In practice, a 70% to 30% (~ 2:1) ratio is a good starting point.
Hint 1:
Use sample()
to sample 70 % of the data ids as training data ids from row.names(german)
. The remaining row ids are obtained via setdiff()
. Based on the ids, set up two datasets, one for training and one for testing/evaluating.
Set a seed (e.g., set.seed(100L)
) to make your results reproducible.
Hint 2:
# Sample ids for training and test split
set.seed(100L)
train_ids = sample(row.names(german), 0.7*nrow(...))
test_ids = setdiff(..., train_ids)
# Create two datasets based on ids
train_set = german[..., ]
test_set = german[..., ]
We first sample row ids by using sample()
and identify the non-selected rows via setdiff()
.
set.seed(100L)
train_ids = sample(row.names(german), 0.7*nrow(german))
test_ids = setdiff(row.names(german), train_ids)
str(train_ids)
chr [1:700] "714" "503" "358" "624" "985" "718" "919" "470" "966" "516" "823" "838" "98" "903" "7" "183" "299" ...
str(test_ids)
chr [1:300] "3" "5" "8" "17" "20" "22" "25" "29" "33" "35" "39" "40" "41" "43" "44" "49" "52" "57" "58" "59" "62" ...
Based on that, we create two datasets: one for training and one for testing.
train_set = german[train_ids, ]
test_set = german[test_ids, ]
Create a Classification Task
Install and load the mlr3verse
package which is a collection of multiple add-on packages in the mlr3
universe (if you fail installing mlr3verse
, try to install and load only the mlr3
and mlr3learners
packages). Then, create a classification task using the training data as an input and credit_risk
as the target variable (with the class label good
as the positive class). By defining an mlr3
task, we conceptualize the ML problem we want to solve (here we face a classification task). As we have a classification task here, make sure you properly specify the class that should be used as the positive class (i.e., the class label for which we would like to predict probabilities – here good
if you are interested in predicting a probability for the creditworthiness of customers).
Hint 1:
Use e.g. as_task_classif()
to create a classification task.
Hint 2:
library(mlr3verse)
task = as_task_classif(x = ..., target = ..., ... = "good")
To initialize a TaskClassif
object, two equivalent calls exist:
library("mlr3verse")
Loading required package: mlr3
task = TaskClassif$new("german_credit", backend = train_set, target = "credit_risk", positive = "good")
task = as_task_classif(train_set, target = "credit_risk", positive = "good")
task
<TaskClassif:train_set> (700 x 21)
* Target: credit_risk
* Properties: twoclass
* Features (20):
- fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors, other_installment_plans, people_liable, personal_status_sex, property, purpose, savings, status, telephone
- int (3): age, amount, duration
- ord (3): installment_rate, number_credits, present_residence
Train a Model on the Training Dataset
The created Task
contains the data we want to work with. Now that we conceptualized the ML task (i.e., classification) in a Task
object, it is time to train our first supervised learning method. We start with a simple classifier: a logistic regression model. During this course, you will, of course, also gain experience with more complex models.
Fit a logistic regression model to the german_credit
training task.
Hint 1:
Use lrn()
to initialize a Learner
object. The shortcut and therefore the input to this method is "classif.log_reg"
.
To train a model, use the $train()
method of your instantiated learner with the task of the previous exercise as an input.
Hint 2:
logreg = lrn("classif.log_reg")
logreg$train(...)
By using the syntactic sugar method lrn()
, we first initialize a LearnerClassif
model. Using the $train()
method, we estimate the coefficients (i.e., the model parameters) of our logistic regression model.
logreg = lrn("classif.log_reg")
logreg$train(task)
Inspect the Model
Have a look at the coefficients by using summary()
. Name at least two features that have a significant effect on the outcome.
Hint 1:
Use the summary()
method of the model
field of our trained model. By looking at task$positive
, we could see which of the two classes good
or bad
is used as the positive class (i.e., the class to which the model predictions will refer).
Hint 2:
summary(yourmodel$model)
Similar to models fitted via glm()
or lm()
, we can obtain a summary of the coefficients (including p-values) using summary()
.
summary(logreg$model)
Call:
stats::glm(formula = form, family = "binomial", data = data, model = FALSE)

Coefficients:
                                                           Estimate Std. Error z value Pr(>|z|)
(Intercept)                                              -2.342e+00  1.402e+00  -1.671 0.094676 .
age                                                       1.999e-02  1.154e-02   1.733 0.083116 .
amount                                                   -4.414e-05  5.413e-05  -0.815 0.414900
credit_historycritical account/other credits elsewhere   -2.267e-01  7.109e-01  -0.319 0.749824
credit_historyno credits taken/all credits paid back duly 8.517e-01  5.311e-01   1.604 0.108781
credit_historyexisting credits paid back duly till now    1.278e+00  5.799e-01   2.204 0.027546 *
credit_historyall credits at this bank paid back duly     1.444e+00  5.311e-01   2.718 0.006569 **
duration                                                 -3.674e-02  1.124e-02  -3.268 0.001083 **
employment_duration< 1 yr                                -5.909e-01  5.535e-01  -1.068 0.285694
employment_duration1 <= ... < 4 yrs                      -1.410e-01  5.197e-01  -0.271 0.786181
employment_duration4 <= ... < 7 yrs                       2.575e-01  5.676e-01   0.454 0.650134
employment_duration>= 7 yrs                              -2.026e-01  5.284e-01  -0.383 0.701449
foreign_workerno                                         -1.298e+00  7.418e-01  -1.749 0.080235 .
housingrent                                               2.712e-01  2.800e-01   0.969 0.332719
housingown                                                3.154e-01  6.180e-01   0.510 0.609785
installment_rate.L                                       -5.018e-01  2.645e-01  -1.897 0.057784 .
installment_rate.Q                                       -3.089e-01  2.410e-01  -1.282 0.199878
installment_rate.C                                        8.274e-02  2.496e-01   0.332 0.740232
jobunskilled - resident                                   6.929e-01  8.182e-01   0.847 0.397122
jobskilled employee/official                              7.769e-01  7.896e-01   0.984 0.325125
jobmanager/self-empl./highly qualif. employee             7.129e-01  7.923e-01   0.900 0.368240
number_credits.L                                         -1.358e-01  7.661e-01  -0.177 0.859348
number_credits.Q                                          6.595e-02  6.395e-01   0.103 0.917854
number_credits.C                                          7.483e-02  4.877e-01   0.153 0.878071
other_debtorsco-applicant                                 2.186e-01  5.266e-01   0.415 0.678024
other_debtorsguarantor                                    7.834e-01  4.968e-01   1.577 0.114862
other_installment_plansstores                             1.454e-01  5.422e-01   0.268 0.788528
other_installment_plansnone                               3.874e-01  3.077e-01   1.259 0.208020
people_liable0 to 2                                       2.244e-01  3.184e-01   0.705 0.481005
personal_status_sexfemale : non-single or male : single   6.476e-01  4.842e-01   1.338 0.181034
personal_status_sexmale : married/widowed                 1.204e+00  4.825e-01   2.494 0.012621 *
personal_status_sexfemale : single                        8.236e-01  5.724e-01   1.439 0.150153
present_residence.L                                      -3.690e-01  2.592e-01  -1.424 0.154559
present_residence.Q                                       4.980e-01  2.443e-01   2.038 0.041546 *
present_residence.C                                      -4.572e-01  2.462e-01  -1.857 0.063326 .
propertycar or other                                      1.244e-01  3.036e-01   0.410 0.682060
propertybuilding soc. savings agr./life insurance         1.432e-01  2.891e-01   0.495 0.620375
propertyreal estate                                      -2.818e-01  5.615e-01  -0.502 0.615706
purposecar (new)                                          1.774e+00  4.793e-01   3.702 0.000214 ***
purposecar (used)                                         5.974e-01  3.227e-01   1.851 0.064101 .
purposefurniture/equipment                                8.264e-01  3.091e-01   2.674 0.007494 **
purposeradio/television                                  -9.668e-02  8.379e-01  -0.115 0.908140
purposedomestic appliances                                6.065e-01  6.489e-01   0.935 0.349946
purposerepairs                                           -2.953e-01  4.570e-01  -0.646 0.518163
purposevacation                                           1.592e+00  1.272e+00   1.251 0.210816
purposeretraining                                         6.450e-01  4.038e-01   1.597 0.110245
purposebusiness                                           1.340e+00  8.364e-01   1.601 0.109268
savings... < 100 DM                                       2.723e-01  3.449e-01   0.789 0.429824
savings100 <= ... < 500 DM                                1.511e+00  6.117e-01   2.471 0.013477 *
savings500 <= ... < 1000 DM                               1.715e+00  7.436e-01   2.306 0.021093 *
savings... >= 1000 DM                                     1.195e+00  3.255e-01   3.671 0.000242 ***
status... < 0 DM                                          3.827e-01  2.642e-01   1.449 0.147465
status0<= ... < 200 DM                                    9.451e-01  4.254e-01   2.222 0.026301 *
status... >= 200 DM / salary for at least 1 year          1.813e+00  2.866e-01   6.325 2.53e-10 ***
telephoneyes (under customer name)                        4.773e-03  2.427e-01   0.020 0.984311
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 868.33 on 699 degrees of freedom
Residual deviance: 622.78 on 645 degrees of freedom
AIC: 732.78

Number of Fisher Scoring iterations: 5
According to the summary, e.g., credit_history
and status
significantly influence the creditworthiness and the bank’s risk assessment. By looking at task$positive
, we see that the class good
(creditworthy client) is the positive class. This means that a positive sign of a feature’s estimated coefficient indicates a positive influence on being a creditworthy client (while a negative sign indicates a negative influence).
task$positive
[1] "good"
For example, the negative sign of the coefficient of credit_history = critical account/other credits elsewhere indicates a negative influence, and therefore a lower probability of being a creditworthy client, compared to the reference class credit_history = delay in paying off in the past (the level omitted from the summary). Likewise, the positive signs of the coefficients of status >= 200 DM / salary for at least 1 year and status = 0 <= ... < 200 DM indicate a positive influence with respect to the omitted reference class of status.
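The direction of these effects becomes more tangible when the log-odds coefficients are exponentiated into odds ratios. A sketch, assuming logreg has been trained as above (values above 1 increase the odds of the positive class good, values below 1 decrease them):

```r
# Convert the log-odds coefficients of the underlying glm into odds ratios
odds_ratios = exp(coef(logreg$model))

# Strongest effects in each direction
head(sort(odds_ratios, decreasing = TRUE), 5)
head(sort(odds_ratios), 5)
```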
Predict on the Test Dataset
Use the trained model to predict on the hold-out/test dataset.
Hint 1
Since we have a new tabular dataset as an input (and not a task), we need to use $predict_newdata()
(instead of $predict()
) to derive a PredictionClassif
object.
Hint 2
pred = yourmodel$predict_newdata(...)
pred_logreg = logreg$predict_newdata(test_set)
Evaluation
What is the classification error on the test data (300 observations)?
Hint 1:
The classification error gives the rate of observations that were misclassified. Use the $score()
method on the corresponding PredictionClassif
object of the previous exercise.
Hint 2:
pred_logreg$score()
By using the $score()
method, we obtain an estimate for the classification error of our model.
pred_logreg$score()
classif.ce 0.7733333
The classification error is 0.255 – so 25.5 % of the test instances were misclassified by our logistic regression model.
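The aggregate error does not show how the mistakes are distributed across the two classes. The confusion matrix of the prediction object makes this visible (assuming pred_logreg from the previous step):

```r
# Rows are predicted classes, columns are true classes
pred_logreg$confusion
```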
Predicting probabilities instead of labels
Similarly, we can assess the performance of our model using the AUC. However, this requires predicted probabilities instead of predicted labels. Evaluate the model using the AUC. To do so, retrain the model with a learner that returns probabilities.
Hint 1:
You can generate predictions with probabilities by specifying a predict_type
argument inside the lrn()
function call when constructing a learner.
Hint 2:
You can get an overview of performance measures in mlr3 using as.data.table(msr())
.
# Train a learner
logreg = lrn("classif.log_reg", predict_type = "prob")
logreg$train(task)
# Generate predictions
pred_logreg = logreg$predict_newdata(test_set)
# Evaluate performance using AUC
measure = msrs(c("classif.auc"))
pred_logreg$score(measure)
classif.auc 0.2351757
Summary
In this exercise sheet we learned how to fit a logistic regression model on a training task and how to assess its performance on unseen test data with the help of mlr3
. We showed how to split data manually into training and test data, but in most scenarios this is done with a single call to resample() or benchmark(). We will learn more about this in the next sections.
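As a preview of the next sections, the manual split above can be replaced by a single resample() call. A minimal sketch with a 70/30 holdout split, assuming the german data and mlr3verse are loaded:

```r
set.seed(100L)
task = as_task_classif(german, target = "credit_risk", positive = "good")

# Holdout resampling performs the train/test split internally
rr = resample(task, lrn("classif.log_reg"), rsmp("holdout", ratio = 0.7))
rr$aggregate(msr("classif.ce"))
```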