Business Analytics | Classifier Model and Return of Investment Case

Giuliano Sposito included in data science advanced business analytics

2022-01-18 1776 words 9 minutes

/2022-01-18-business-analytics-classifier-model-and-return-of-investment-case/images/post_cover.png

Contents

Can you imagine a real business case where you apply machine learning to build and Random Forest classifier and its accuracy of the model isn’t the (only) main metric to pay attention? In real case scenarios the cost and benefits can affect a model in different aspects, this post exercises a business case where the return of an investment is dependent of the behavior of the precision metric.

Intro

Missed appointments cost the US healthcare system over $150 billion a year. Missed appointments directly cause loss of revenue and under-utilization of precious medical resources. It also leads to long patient waiting times and in the long run, leads to higher medical costs.

In this context lets build a predictive model to estimate if an patient will miss an appointment and use the prediction to take an action to try to avoid the cancellation.

Dataset

We obtained a data set with 7,463 medical appointments over a three year period at a specialized care clinic. In this data set, each row corresponds to an appointment and indicates whether it was cancelled or not.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


library(xlsx)
library(tidyverse)

# the life, the universe an everything else...
set.seed(42)

# data set
rawdata <- xlsx::read.xlsx("./assets/data.xlsx", sheetIndex = 1)

# basic clean_up
appdata <- rawdata %>% 
  as_tibble() %>% 
  janitor::clean_names() %>% 
  mutate_if(is.character, as.factor)

dim(appdata)

1

## [1] 7463   12

Data Overview

Lets see the overall aspect of the data set

1

skimr::skim(appdata)

Table: Table 1: Data summary


Name	appdata
Number of rows	7463
Number of columns	12
_______________________
Column type frequency:
factor	7
numeric	5
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
month	1	FALSE	12	Aug: 815, May: 791, Jun: 721, Mar: 715
weekday	1	FALSE	5	Fri: 1733, Tue: 1715, Thu: 1640, Wed: 1448
gender	1	FALSE	2	F: 5176, M: 2287
marital_status	1	FALSE	4	MAR: 2504, SIN: 2434, OTH: 1505, DIV: 1020
employment	1	FALSE	4	UNE: 3824, RET: 1692, FUL: 1370, OTH: 577
insurance	1	FALSE	4	OTH: 2453, MED: 2366, HMO: 1982, PPO: 662
status	1	FALSE	2	Arr: 5801, Can: 1662

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
date_id	1	525.13	303.16	1	267	505	791	1081	▇▇▇▆▆
lag	1	27.24	24.47	0	13	20	30	126	▇▂▁▁▁
mrn	1	34251294.01	12244201.39	7967755	28165486	41065688	42621914	44150534	▂▁▁▁▇
age	1	54.09	18.52	6	47	57	67	90	▂▂▆▇▂
time_since_reg	1	4754.47	2399.85	179	4374	5686	6196	8421	▂▁▁▇▁

In total, 1662 out of 7463 appointments were cancelled, we can see it from column status (our target).

Model

Since we are interested in appointment cancellations, the target variable is whether an appointment is cancelled or not and success in this particular context means that an appointment is cancelled.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


library(tidymodels)

# training & test data partition
appsplit <- initial_split(appdata)

# basic transformation
apprecp <- appsplit %>% 
  training() %>% 
  recipe(status ~ ., data=.) %>% 
  update_role(date_id, new_role = "id variable") %>% 
  update_role(status, new_role = "outcome") %>% 
  step_log(lag) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  prep() 

# training & test set
app_tr <- juice(apprecp)
app_ts <- bake(apprecp, testing(appsplit))

# fit a model
app_model <- rand_forest(trees = 100, mode="classification") %>% 
  set_engine("ranger") %>% 
  fit(status ~ . - date_id, data=app_tr)

We build a simple and direct random forest model using Ranger implementation through Tidymodels package.

1

app_model

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


## parsnip model object
## 
## Fit time:  436ms 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~100,      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  100 
## Sample size:                      5597 
## Number of independent variables:  30 
## Mtry:                             5 
## Target node size:                 10 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1617746

Evaluation

How good is our prediction model?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


# eval it
app_pred <- predict(app_model, app_ts) %>%  # class outcome
  bind_cols(predict(app_model, app_ts, type = "prob")) %>% # class prob
  bind_cols(select(app_ts,status)) %>%  # true value
  relocate(status, everything())

# performance
cm <- app_pred %>% 
  conf_mat(status, .pred_class)

cm %>% 
  summary() %>% 
  select(-.estimator) %>% 
  knitr::kable()

.metric	.estimate
accuracy	0.7599143
kap	0.1613651
sens	0.9216366
spec	0.2099057
ppv	0.7986779
npv	0.4405941
mcc	0.1774102
j_index	0.1315423
bal_accuracy	0.5657711
detection_prevalence	0.8917471
precision	0.7986779
recall	0.9216366
f_meas	0.8557630

1
2
3
4
5


# AUC
app_pred %T>% 
  roc_auc(.pred_Arrived, truth=status) %>% 
  roc_curve(.pred_Arrived, truth=status) %>% 
  autoplot()

Taking a Business Decision to action

With the prediction in hands, we can make a business action to try revert appointments cancellations. Let’s assume we can make a phone call to anyone predicted as “Cancelled” one or two days early and this will revert around 30% of the cancelled appointments. Is it a viable approach? Is it economically viable? At least, better than let the patient cancel his appointment?

To decide this we need to characterize the value of some variables, for this exercise we can assume:

Cost of a phone call: $5
Revert appointment cancellation rate: 30%
Benefit of an appointment: $60

So, we can calculate the Return of Investment (RoI) of this action:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


# business variable
phone_cost   <- 5
reverse_rate <- .3
benefit      <- 60

# business case

# cost: phone to all patient predicted as "canceled"
total_cost  <- sum(cm$table["Cancelled",]) * phone_cost

# benefit: the reverse of 30% of patient that (in fact) would cancel the appointment
total_benefit <- cm$table["Cancelled","Cancelled"] * reverse_rate * benefit

# return of the investment
RoI <- total_benefit - total_cost

So we come out with this result:

Total Cost: 202 * $5 = $1010
Total Benefit: 89 * 0.3 * $60 = $1602
Return: $1602 - $1010 = $592

As we saw, with a RoI of $592 this action worth to be take.

One aspect essential to pay attention, we call to all patient predict as “cancellation” but we only get return over 30% (reversion rate) of those truly identified as “cancellation”, a.k.a, True Positives. In other words, the cost of the action is function of True Positives plus False Positives and the benefit of the action is function only of True Positives. This is because a False Positive is a patient predicted as “cancellation” but he´ll go to the appointment, so the phone call does not bring any benefit, only cost in those cases.

Improving Business Performance

Can we do better return without improving the model? As the RoI is function of the rate between True Positives and True Positives + False Positives (a.k.a Precision Metric) in proportion of the benefits and cost we can tune our classifier changing the logistic threshold to change the precision towards to try maximize the RoI.

To see this in practice, calculating what happens with our RoI changing the classifier threshold from 0 to 1 in increments of 0.01:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44


# generate the confusion metrics in function of an threshold
genConfMatrix <- function(.threshold, .evalData){
  # reply as a tibble row
  # new truth vs prediction table
  tibble(
      truth=.evalData$status, 
      # the prediction as function of the customized threshold
      estimate=unique(.evalData$status)[as.integer(.evalData$.pred_Cancelled>=.threshold)+1]
    ) %>%
    # gen the confusion matrix
    conf_mat(truth, estimate) %>% 
    return()
}

# calculates the RoI based in the result of a confusion matrix
calcRoi <- function(.cm, .benefit=60, .cost=5, .rev_rate=.3){
  tibble(
    TP = .cm$table["Cancelled","Cancelled"],
    FP = .cm$table["Cancelled","Arrived"] ) %>% 
    mutate(
      # cost: phone to all patient predicted as "canceled"
      cost    = (TP+FP) * .cost,
      # benefit: the reverse of 30% of patient that (in fact) would cancel the appointment
      benefit = TP * .rev_rate * .benefit,
      # return of the investment
      roi = benefit-cost
    ) %>% 
    return()
}

# using threshold from 0 to 1 in 0.01 increments
simulations <- tibble(threshold = seq(0,1,.01)) %>% 
  mutate( 
    # gen the confusion matrix and roi values for this threshold
    cm  = map(threshold, genConfMatrix, .evalData=app_pred),
    roi = map(cm, calcRoi)
  ) %>% 
  unnest(roi)

# what we get
simulations %>% 
  select(-cm) %>% 
  head(10) %>% 
  knitr::kable()

threshold	TP	FP	cost	benefit	roi
0.00	424	1442	9330	7632	-1698
0.01	422	1098	7600	7596	-4
0.02	408	908	6580	7344	764
0.03	404	861	6325	7272	947
0.04	402	843	6225	7236	1011
0.05	401	829	6150	7218	1068
0.06	401	825	6130	7218	1088
0.07	400	818	6090	7200	1110
0.08	398	812	6050	7164	1114
0.09	393	797	5950	7074	1124

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


# visualizing
simulations %>%           
  ggplot(aes(x=threshold, y=roi)) +
    # geom_point(size=2) +
    geom_line() +
    geom_hline(yintercept = 0, linetype="dashed", color="red") +
    geom_vline(xintercept = 0.5, linetype="dashed", color="darkgrey") +
    scale_x_continuous(breaks=seq(0,1,.1)) +
    ylim(c(-100,NA)) +
    labs(title="Return of Investment", subtitle = "Influence of the Threshold Parameter")

We can see the the RoI is better using a threshold between 0.2 and 0.3, and not 0.5 as usually set, more than that, at this level we get worst accuracy for the model, take a look:

1
2
3
4
5
6
7


simulations %>% 
  filter(threshold==.3) %>% 
  pull(cm) %>% 
  .[[1]] %>% 
  summary() %>% 
  select(-.estimator) %>% 
  knitr::kable()

.metric	.estimate
accuracy	0.6918542
kap	0.2451063
sens	0.7330097
spec	0.5518868
ppv	0.8476343
npv	0.3780291
mcc	0.2535561
j_index	0.2848965
bal_accuracy	0.6424483
detection_prevalence	0.6682744
precision	0.8476343
recall	0.7330097
f_meas	0.7861659

Compare this the accuracy metric (0.6918542) with the value obtained in the first calculation above (0.7599143).

Conclusion and Classification Metrics

As we saw, a business decision is not only function of the accuracy of the model, costs and benefits can affect different aspects of the model. Remember to return to real life when applying the model in business case scenarios, to optimize the correct target.

And, be aware of the classification metrics.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


simulations %>% 
  mutate(
    metrics = map(cm, function(.x){
      .x %>% 
        summary() %>% 
        select(-.estimator) %>% 
        pivot_wider(names_from = .metric, values_from = .estimate) %>% 
        return()
    })) %>% 
  unnest(metrics) %>%
  select(threshold, accuracy, sens, spec, precision, recall) %>%
  pivot_longer(cols = -threshold, names_to = "metric", values_to = "value") %>% 
  ggplot(aes(x=threshold, y=value, color=metric))+
  geom_line() +
  labs(title="Main Classification Matrics", subtitle="Behavior in function of the logistic threshold") +
  theme_minimal()