
Using Facebook's Prophet to forecast my weight loss

In this post, we’ll try to forecast my weight using the Forecast and Facebook’s Prophet packages, and see how Facebook’s method performs in a simple forecasting case.

Recently, at the beginning of March, I went to a nutritionist who recommended that I start a diet to lose some weight. As good practice in these situations, short feedback cycles are essential to (re)build good habits, so I started to weigh myself almost daily and to record the values in a spreadsheet to follow my progress.

I kept the records until the end of May, when my vacation started and I traveled for three weeks; now, at the end of June, I have started recording my weight again. In the meantime, I saw Bruno Rodrigues’s post where he tries to forecast his weight using the Forecast package, and I was inspired to do the same with my own data: check how Facebook’s Prophet package performs when predicting my weight at the end of June using the data recorded between March and May.

#setup

library(googlesheets4) # I keep my records in a google spreadsheet
library(tibbletime)    # time-aware tibbles
library(mice)          # mice to fill the gaps in the weight records
library(tsibble)       # tsibble is a 'time aware tibble' to keep time series data
library(lubridate)     # lubridate to manipulate date-time info easily
library(tidyverse)     # tidyr, dplyr and magrittr
library(forecast)      # the 'standard' package to forecast time series
library(prophet)       # the 'facebook' method
library(knitr)         # kable() to print tables (kableExtra is called via ::)

Loading the dataset and filling the gaps

I weigh myself almost daily (but on weekends I’m usually away from home) and keep the weight records in a Google Spreadsheet, so let’s get the data set using the googlesheets4 package and fill the gaps using the mice package.

# download data from google spreadsheets
gs4_auth(email = "gsposito@gmail.com") 
raw_data <- read_sheet("1P1q58DYs4Jy5cXKXCrdl11ru4Rop1Mu7r8fXEraCX9M")

# handle date/weight
raw_data %>%                # the dataset has records for date, weight, fat,
  select(1:2) %>%           # water, muscle and bones; keeping the first two
  mutate(Peso=Peso/10) %>%  # rescale to kilograms
  set_names(c("date","weight")) -> measures 

head(measures, 10) %>%
  kable(align = "c") %>% 
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F, font_size = 11)
date weight
2018-03-05 10.14
2018-03-06 10.15
2018-03-07 10.12
2018-03-08 10.06
2018-03-09 9.99
2018-03-10 9.98
2018-03-12 9.90
2018-03-13 9.94
2018-03-14 9.84
2018-03-15 9.82

We can see there is no data recorded on days 11, 16, 17 and 18, and so on. There is also a big gap in June.

tail(measures) %>%
  kable() %>% 
  kableExtra::kable_styling(font_size = 11)
date weight
2018-05-25 9.07
2018-05-28 8.93
2018-05-29 8.95
2018-05-30 8.95
2018-06-25 8.70
2018-06-29 8.71

Let’s separate the last two points, in June, from the remaining data, so we’ll have something like a “training” and a “test” data set.

# take the June measures as "test" points
weight.target <- measures %>%
  filter( date >= ymd(20180601) ) %>% 
  mutate(date = as.Date(date))

# and the previous ones as "training" points to be used in Forecast and Prophet
measures <- measures %>%
  filter( date < ymd(20180601) )

Let’s make the gaps in the “training” data set explicit, so we can fill them in using mice.

# make the NAs explicit
measures %>%
  mutate( date = as.Date(date) ) %>% 
  as_tsibble() %>%
  fill_gaps() -> measures

head(measures,20) %>%
  kable() %>% 
  kableExtra::kable_styling(font_size = 11)
date weight
2018-03-05 10.14
2018-03-06 10.15
2018-03-07 10.12
2018-03-08 10.06
2018-03-09 9.99
2018-03-10 9.98
2018-03-11 NA
2018-03-12 9.90
2018-03-13 9.94
2018-03-14 9.84
2018-03-15 9.82
2018-03-16 NA
2018-03-17 NA
2018-03-18 NA
2018-03-19 9.78
2018-03-20 9.79
2018-03-21 9.70
2018-03-22 9.70
2018-03-23 NA
2018-03-24 NA

Now, with the NAs explicit in the time series, we can use mice.

# complete values
measures %>%
  mice(method = "pmm", m=5, maxit = 50, seed=42, printFlag= F) %>% # five imputations for each NA
  mice::complete("long") %>%   # fill in the NAs
  group_by(date) %>%           # average them (5 points for each missing date)
  summarise( weight = mean(weight) ) -> measures_completed

# compare original data and imputed values
measures_completed %>%
  inner_join(measures, by="date") %>%   # join with the original (with NA) dataset
  set_names(c("date","imputed","original")) %>%
  tidyr::gather(type,weight,-date) %>%  # pivot it
  ggplot() + geom_point(aes(date,weight,color=type))

Well, the mice package did a remarkable job: the imputed values (the red points) look like real measures. Now, with the data set completed, we can convert it to a time series and use forecast to predict the weight behavior in June.
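Just as an extra sanity check (a sketch of my own, assuming the zoo package is available), we could compare mice’s imputations against a plain linear interpolation baseline on the same dates:

# sketch: compare mice's imputations with simple linear interpolation (zoo)
measures %>%                  # the tsibble still holding the explicit NAs
  as_tibble() %>%
  mutate( linear = zoo::na.approx(weight, na.rm = FALSE) ) %>%
  inner_join(measures_completed, by = "date", suffix = c("", "_mice")) %>%
  filter( is.na(weight) ) %>% # keep only the dates that were imputed
  select( date, linear, mice = weight_mice )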

Forecasting with Forecast Package

The Forecast package implements, among other methods, ARIMA models for time series data. In statistics and econometrics, and in particular in time series analysis, an autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model. Both of these models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting). (more on ARIMA)

Let’s use this model to forecast.

# model the time series
model <- measures_completed %>%
  pull(weight) %>%  # convert to a vector
  as.ts() %>%       # transform into a time series
  auto.arima()      # fit the model

# make the prediction for the next 31 days
prediction <- model %>%
  forecast(h=31) %>%  # forecast the next 31 measures
  as_tibble() %>%     # convert to a tibble
  mutate( date = max(measures_completed$date) + 1:31 ) # add the dates

# prediction dataset
head(prediction) %>%
  kable() %>% 
  kableExtra::kable_styling(font_size = 11)
Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95  date
      8.928799  8.882932  8.974666  8.858651  8.998947  2018-05-31
      8.914873  8.862890  8.966856  8.835372  8.994374  2018-06-01
      8.900948  8.843496  8.958399  8.813083  8.988812  2018-06-02
      8.887022  8.824579  8.949465  8.791524  8.982520  2018-06-03
      8.873097  8.806033  8.940161  8.770531  8.975662  2018-06-04
      8.859171  8.787785  8.930558  8.749995  8.968347  2018-06-05
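Before comparing with the real values, it is worth checking which model auto.arima actually selected and whether its residuals look like white noise; a minimal sketch using the forecast package’s own helpers:

# inspect the automatically selected model
summary(model)        # prints the chosen ARIMA(p,d,q) order and its coefficients
checkresiduals(model) # residual ACF/histogram plots plus a Ljung-Box test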
# plot to compare the prediction with the real values
prediction %>%
  rename( weight = `Point Forecast`) %>% # rename the forecast column
  mutate( origin = "prediction" ) %>%    # mark the data as 'prediction'
  bind_rows( measures_completed %>% mutate(origin="measures") ) %>% # join with real data
  ggplot(aes(x=date)) + 
  geom_point(aes(y=weight,color=origin)) + 
  geom_ribbon(aes(ymin=`Lo 80`, ymax=`Hi 80`), alpha=0.2) +
  geom_ribbon(aes(ymin=`Lo 95`, ymax=`Hi 95`), alpha=0.2) +
  geom_point(data=weight.target, mapping = aes(date, weight)) +
  theme_bw()

The Forecast package did an “OK” job: the first real measure falls within the 80% confidence range and the second within the 95% range, which, for predictions, is a reasonable result. But the point forecasts miss both values, which lie near the edge of the confidence interval.

# comparing the real and predicted values
prediction %>%
  inner_join(weight.target, by="date") %>%
  select(date, `Lo 95`, forecast=`Point Forecast`, weight,`Hi 95`) %>%
  mutate( interval_size = `Hi 95` - `Lo 95` ) %>% 
  kable() %>% 
  kableExtra::kable_styling(font_size = 11)
date         Lo 95     forecast  weight  Hi 95     interval_size
2018-06-25   8.380875  8.580660  8.70    8.780446  0.3995711
2018-06-29   8.311620  8.524958  8.71    8.738296  0.4266767

Can Facebook’s Prophet do a better job?

Prophet

Facebook’s Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
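In the notation of the Prophet paper, the model is roughly y(t) = g(t) + s(t) + h(t) + ε(t), where g(t) is the (piecewise) trend, s(t) the periodic seasonality, h(t) the holiday effects, and ε(t) the error term.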

Facebook implemented the procedure in R and Python and made it public in 2017; the prophet package gives you access to it. If you want to know more about it, check this.

# by definition we need to pass a df with 2 columns: "ds" (datestamp) and "y" (target var)
measures_completed %>%
  set_names(c("ds","y")) %>%
  prophet() -> pmodel

# we use the model to make the prediction
pmodel %>%
  make_future_dataframe(30) %>%
  predict(pmodel,.) -> pprediction

# what does the output format look like?
pprediction %>%
  as_tibble() %>%
  glimpse()
## Rows: 117
## Columns: 16
## $ ds                         <dttm> 2018-03-05, 2018-03-06, 2018-03-07, 2018-0~
## $ trend                      <dbl> 10.155044, 10.123293, 10.091542, 10.059791,~
## $ additive_terms             <dbl> -0.007936320, -0.002856559, -0.009109722, 0~
## $ additive_terms_lower       <dbl> -0.007936320, -0.002856559, -0.009109722, 0~
## $ additive_terms_upper       <dbl> -0.007936320, -0.002856559, -0.009109722, 0~
## $ weekly                     <dbl> -0.007936320, -0.002856559, -0.009109722, 0~
## $ weekly_lower               <dbl> -0.007936320, -0.002856559, -0.009109722, 0~
## $ weekly_upper               <dbl> -0.007936320, -0.002856559, -0.009109722, 0~
## $ multiplicative_terms       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ multiplicative_terms_lower <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ multiplicative_terms_upper <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ yhat_lower                 <dbl> 10.113679, 10.084900, 10.048282, 10.042038,~
## $ yhat_upper                 <dbl> 10.180122, 10.153888, 10.116648, 10.108169,~
## $ trend_lower                <dbl> 10.155044, 10.123293, 10.091542, 10.059791,~
## $ trend_upper                <dbl> 10.155044, 10.123293, 10.091542, 10.059791,~
## $ yhat                       <dbl> 10.147108, 10.120436, 10.082432, 10.077176,~
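The trend and weekly columns above are the additive components the model fitted; as a quick optional check, prophet’s own plotting helpers can show them directly:

# optional: visualize the forecast and its fitted components
plot(pmodel, pprediction)                    # forecast overlaid on the history
prophet_plot_components(pmodel, pprediction) # trend and weekly seasonality panels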

As you can see, Prophet’s prediction gives us a lot of information; let’s check how it performed.

# plot the prediction against the target
pprediction %>%
  as_tibble() %>%
  mutate(ds=as_date(ds)) %>%
  filter(ds > max(measures_completed$date) ) %>%
  select(ds, trend, yhat, yhat_lower, yhat_upper) %>%
  ggplot() + 
  geom_line(aes(x=ds, y=yhat), color="blue") +
  geom_ribbon(aes(x=ds, ymin=yhat_lower, ymax=yhat_upper), alpha=0.2) +
  geom_point(data=measures_completed, aes(x=date, y=weight), color="salmon") +
  geom_point(data=weight.target, mapping=aes(x=date, y=weight), color="black") +
  theme_bw()

Wow, Prophet did an excellent job: the real measures are in the center of the range, and the predicted values are close to them.

# checking the values against the June measures
pprediction %>%
  as_tibble() %>% 
  mutate(date=as_date(ds)) %>%
  inner_join(weight.target, by="date") %>%
  select(date, yhat_lower, yhat, weight, yhat_upper ) %>%
  mutate( interval_size = yhat_upper - yhat_lower ) %>% 
  kable() %>% 
  kableExtra::kable_styling(font_size = 11)
date         yhat_lower  yhat      weight  yhat_upper  interval_size
2018-06-25   8.637765    8.748224  8.70    8.870323    0.2325575
2018-06-29   8.584754    8.723976  8.71    8.872009    0.2872546

As you saw, the prediction points made by Prophet are remarkably close to the real values, and the size of the confidence interval is also very narrow, almost half of Forecast’s.
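To put a number on that, here is a small sketch (reusing the objects created above) computing the absolute error of each model on the two test points:

# sketch: absolute error of each model on the June test points
prediction %>%
  inner_join(weight.target, by = "date") %>%
  transmute( date, arima_error = abs(`Point Forecast` - weight) ) %>%
  inner_join(
    pprediction %>%
      mutate( date = as_date(ds) ) %>%
      inner_join(weight.target, by = "date") %>%
      transmute( date, prophet_error = abs(yhat - weight) ),
    by = "date"
  )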

Of course, we just used these packages “out-of-the-box”; we didn’t tune the Forecast parameters, so maybe it could do a better job, but that doesn’t invalidate Prophet’s results.

Prophet’s performance in this test was great; for sure, this package deserves another post in the future to explore the other features available.
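As a teaser, Prophet ships with built-in time-series cross-validation; a sketch of how it could be applied here (the initial, period and horizon values are just illustrative choices for this small dataset):

# sketch: Prophet's simulated historical forecasts (values are illustrative)
df_cv <- cross_validation(pmodel, initial = 60, period = 7,
                          horizon = 14, units = "days")
performance_metrics(df_cv) # MSE, RMSE, MAE, MAPE and coverage per horizon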