ASK

Introduction

Bellabeat produces health-focused smart products with beautifully designed technology tailored to women’s needs, giving users insight into their activity, sleep, stress, and reproductive health. While Bellabeat has become a leading tech-driven wellness brand for women, co-founder Urška Sršen sees room for further growth through consumer insights, and has tasked the marketing analytics team with analyzing smart device usage data to shape Bellabeat’s marketing strategy.

Stakeholders

Primary

  • Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer
  • Sando Mur: Mathematician and Bellabeat’s co-founder; key member of the Bellabeat executive team

Secondary

  • Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

Guiding questions

These questions will guide the analysis:

  1. What are some trends in smart device usage?

  2. How could these trends apply to Bellabeat customers?

  3. How could these trends help influence Bellabeat marketing strategy?

Business task

Analyze smart device usage data to identify trends in how consumers use non-Bellabeat smart devices, and apply those insights to inform Bellabeat’s marketing strategy.

PREPARE

Using R

We’ll be using R via RStudio for our analysis. R offers robust statistical capabilities and an extensive collection of packages for data manipulation, visualization, and modeling, while RStudio provides an intuitive interface, seamless integration with R, and a comprehensive suite of tools that streamline the workflow. R Markdown is also a great fit for presenting data analysis projects because it integrates code, visualizations, and narrative text, making it easy to share a comprehensive report of the analysis process and findings in one place.

Installing and loading packages

We’ll start by installing the following packages:

#install.packages("tidyverse")
install.packages(c(
  "tidyr", "dplyr", "knitr", "skimr", "stringr", "hms", "janitor",
  "lubridate", "ggplot2", "ggthemes", "ggforce", "gridExtra",
  "patchwork", "paletteer", "crayon", "sqldf", "plotrix"
))
devtools::install_github("nsgrantham/ggdark")

And then loading them:

# Load libraries ----
library("tidyverse")
library("tidyr") 
library("dplyr")  
library("knitr")   
library("skimr")
library("stringr")
library("hms")
library("janitor")
library("lubridate")
library("ggplot2") 
library("ggthemes")
library("ggdark")
library("ggforce")
library("gridExtra")
library("patchwork")
library("paletteer")
library("crayon")       
library("sqldf")
library("plotrix")

Data source

We’ll be using the following data: FitBit Fitness Tracker Data (CC0: Public Domain, available via Mobius). This Kaggle dataset contains personal fitness tracker data from thirty FitBit users. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. The dataset contains 18 CSV files organized in long format.

Let’s consider: Does our data ROCCC?

  • Reliable - LOW: 30 individuals is a very small sample size that is not representative of the entire population of FitBit users.

  • Original - LOW: Amazon Mechanical Turk is a third-party data source.

  • Comprehensive - MED: The dataset contains a wide range of physical fitness data from the participants, but lacks demographic data to make meaningful marketing insights.

  • Current - LOW: The data spans March 2016 through May 2016, which makes it over 7 years old.

  • Cited - LOW: Data was collected from a third party.

As we can see, the quality of this data is low across the board, which is not ideal for analysis. The scenario prompt suggests adding another dataset to help address these limitations, but after searching various databases I have found no other datasets with comparable data.

Looking at the data

We’ll be delving into data spread across fourteen CSV files (four files were dropped because they duplicate data found in other files, just stored in a different format). The data is structured, organized in rows and columns. Most datasets are in long format, with each row corresponding to one time point per session ID, so each session ID has multiple rows of data. Reports can be parsed by export session ID or timestamp. These datasets fall into three categories based on the time increment tracked: daily, hourly, and minute data. Analyzing the data at these different levels allows us to uncover trends and insights applicable to Bellabeat customers.

Let’s take a look at file sizes to see what we want to import so as to not overload RStudio.
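One way such a listing can be produced is sketched below; it assumes the CSVs sit in ../fitbitdata/ and simply prints every CSV it finds in alphabetical order, whereas the listing that follows is trimmed to the fourteen files of interest.

# Sketch: print the size of each CSV file in bytes (path assumed)
csv_files <- list.files("../fitbitdata", pattern = "\\.csv$", full.names = TRUE)
for (f in csv_files) cat(basename(f), "=", file.size(f), "bytes\n")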

## dailyActivity_merged.csv = 111288 bytes
## dailyCalories_merged.csv = 25125 bytes
## dailyIntensities_merged.csv = 70581 bytes
## dailySteps_merged.csv = 25175 bytes
## sleepDay_merged.csv = 18100 bytes
## weightLogInfo_merged.csv = 6725 bytes
## hourlyCalories_merged.csv = 801486 bytes
## hourlyIntensities_merged.csv = 898778 bytes
## hourlySteps_merged.csv = 796562 bytes
## minuteCaloriesNarrow_merged.csv = 66443854 bytes
## minuteIntensitiesNarrow_merged.csv = 46358789 bytes
## minuteMETsNarrow_merged.csv = 47686683 bytes
## minuteSleep_merged.csv = 8848370 bytes
## minuteStepsNarrow_merged.csv = 46531957 bytes

The minute-level files are much too big to handle: the smallest, minuteSleep_merged.csv, comes in at 8848370 bytes, almost 10 times larger than the largest hourly file, hourlyIntensities_merged.csv, at 898778 bytes. It is best, then, to focus on the daily and hourly datasets, which we load below.

daily_activity <- read_csv("../fitbitdata/dailyActivity_merged.csv")
daily_calories <- read_csv("../fitbitdata/dailyCalories_merged.csv")
daily_intensities <- read_csv("../fitbitdata/dailyIntensities_merged.csv")
daily_steps <- read_csv("../fitbitdata/dailySteps_merged.csv")
daily_sleep <- read_csv("../fitbitdata/sleepDay_merged.csv")
weight_log <- read_csv("../fitbitdata/weightLogInfo_merged.csv")
hourly_calories <- read_csv("../fitbitdata/hourlyCalories_merged.csv")
hourly_intensities <- read_csv("../fitbitdata/hourlyIntensities_merged.csv")
hourly_steps <- read_csv("../fitbitdata/hourlySteps_merged.csv")

We’ll also load the smallest minute dataset.

minute_sleep <- read_csv("../fitbitdata/minuteSleep_merged.csv")

PROCESS

Cleaning the data

Let’s clean the data in the ten datasets we’ve chosen to use.

Creating tibbles

We’re going to use head() to get a glimpse of the data structure and content to get a better initial understanding of the variables and values at hand:

Daily Dataframes

head(daily_activity)

## # A tibble: 6 × 15
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
##        <dbl> <chr>             <dbl>         <dbl>           <dbl>
## 1 1503960366 4/12/2016         13162          8.5             8.5 
## 2 1503960366 4/13/2016         10735          6.97            6.97
## 3 1503960366 4/14/2016         10460          6.74            6.74
## 4 1503960366 4/15/2016          9762          6.28            6.28
## 5 1503960366 4/16/2016         12669          8.16            8.16
## 6 1503960366 4/17/2016          9705          6.48            6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>

head(daily_calories)

## # A tibble: 6 × 3
##           Id ActivityDay Calories
##        <dbl> <chr>          <dbl>
## 1 1503960366 4/12/2016       1985
## 2 1503960366 4/13/2016       1797
## 3 1503960366 4/14/2016       1776
## 4 1503960366 4/15/2016       1745
## 5 1503960366 4/16/2016       1863
## 6 1503960366 4/17/2016       1728

head(daily_intensities)

## # A tibble: 6 × 10
##         Id ActivityDay SedentaryMinutes LightlyActiveMinutes FairlyActiveMinutes
##      <dbl> <chr>                  <dbl>                <dbl>               <dbl>
## 1   1.50e9 4/12/2016                728                  328                  13
## 2   1.50e9 4/13/2016                776                  217                  19
## 3   1.50e9 4/14/2016               1218                  181                  11
## 4   1.50e9 4/15/2016                726                  209                  34
## 5   1.50e9 4/16/2016                773                  221                  10
## 6   1.50e9 4/17/2016                539                  164                  20
## # ℹ 5 more variables: VeryActiveMinutes <dbl>, SedentaryActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   VeryActiveDistance <dbl>

head(daily_steps)

## # A tibble: 6 × 3
##           Id ActivityDay StepTotal
##        <dbl> <chr>           <dbl>
## 1 1503960366 4/12/2016       13162
## 2 1503960366 4/13/2016       10735
## 3 1503960366 4/14/2016       10460
## 4 1503960366 4/15/2016        9762
## 5 1503960366 4/16/2016       12669
## 6 1503960366 4/17/2016        9705

head(daily_sleep)

## # A tibble: 6 × 5
##           Id SleepDay        TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <chr>                       <dbl>              <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:0…                 1                327            346
## 2 1503960366 4/13/2016 12:0…                 2                384            407
## 3 1503960366 4/15/2016 12:0…                 1                412            442
## 4 1503960366 4/16/2016 12:0…                 2                340            367
## 5 1503960366 4/17/2016 12:0…                 1                700            712
## 6 1503960366 4/19/2016 12:0…                 1                304            320

head(weight_log)

## # A tibble: 6 × 8
##           Id Date       WeightKg WeightPounds   Fat   BMI IsManualReport   LogId
##        <dbl> <chr>         <dbl>        <dbl> <dbl> <dbl> <lgl>            <dbl>
## 1 1503960366 5/2/2016 …     52.6         116.    22  22.6 TRUE           1.46e12
## 2 1503960366 5/3/2016 …     52.6         116.    NA  22.6 TRUE           1.46e12
## 3 1927972279 4/13/2016…    134.          294.    NA  47.5 FALSE          1.46e12
## 4 2873212765 4/21/2016…     56.7         125.    NA  21.5 TRUE           1.46e12
## 5 2873212765 5/12/2016…     57.3         126.    NA  21.7 TRUE           1.46e12
## 6 4319703577 4/17/2016…     72.4         160.    25  27.5 TRUE           1.46e12

Hourly Dataframes

head(hourly_calories)

## # A tibble: 6 × 3
##           Id ActivityHour          Calories
##        <dbl> <chr>                    <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       81
## 2 1503960366 4/12/2016 1:00:00 AM        61
## 3 1503960366 4/12/2016 2:00:00 AM        59
## 4 1503960366 4/12/2016 3:00:00 AM        47
## 5 1503960366 4/12/2016 4:00:00 AM        48
## 6 1503960366 4/12/2016 5:00:00 AM        48

head(hourly_intensities)

## # A tibble: 6 × 4
##           Id ActivityHour          TotalIntensity AverageIntensity
##        <dbl> <chr>                          <dbl>            <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM             20            0.333
## 2 1503960366 4/12/2016 1:00:00 AM               8            0.133
## 3 1503960366 4/12/2016 2:00:00 AM               7            0.117
## 4 1503960366 4/12/2016 3:00:00 AM               0            0    
## 5 1503960366 4/12/2016 4:00:00 AM               0            0    
## 6 1503960366 4/12/2016 5:00:00 AM               0            0

head(hourly_steps)

## # A tibble: 6 × 3
##           Id ActivityHour          StepTotal
##        <dbl> <chr>                     <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366 4/12/2016 1:00:00 AM        160
## 3 1503960366 4/12/2016 2:00:00 AM        151
## 4 1503960366 4/12/2016 3:00:00 AM          0
## 5 1503960366 4/12/2016 4:00:00 AM          0
## 6 1503960366 4/12/2016 5:00:00 AM          0

Minute Dataframe

head(minute_sleep)

## # A tibble: 6 × 4
##           Id date                 value       logId
##        <dbl> <chr>                <dbl>       <dbl>
## 1 1503960366 4/12/2016 2:47:30 AM     3 11380564589
## 2 1503960366 4/12/2016 2:48:30 AM     2 11380564589
## 3 1503960366 4/12/2016 2:49:30 AM     1 11380564589
## 4 1503960366 4/12/2016 2:50:30 AM     1 11380564589
## 5 1503960366 4/12/2016 2:51:30 AM     1 11380564589
## 6 1503960366 4/12/2016 2:52:30 AM     1 11380564589

Column names

Let’s use colnames() to get an even better idea of our data:

Daily Dataframes

colnames(daily_activity)

##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"

colnames(daily_calories)

## [1] "Id"          "ActivityDay" "Calories"

colnames(daily_intensities)

##  [1] "Id"                       "ActivityDay"             
##  [3] "SedentaryMinutes"         "LightlyActiveMinutes"    
##  [5] "FairlyActiveMinutes"      "VeryActiveMinutes"       
##  [7] "SedentaryActiveDistance"  "LightActiveDistance"     
##  [9] "ModeratelyActiveDistance" "VeryActiveDistance"

colnames(daily_steps)

## [1] "Id"          "ActivityDay" "StepTotal"

colnames(daily_sleep)

## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"

colnames(weight_log)

## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"

Hourly Dataframes

colnames(hourly_calories)

## [1] "Id"           "ActivityHour" "Calories"

colnames(hourly_intensities)

## [1] "Id"               "ActivityHour"     "TotalIntensity"   "AverageIntensity"

colnames(hourly_steps)

## [1] "Id"           "ActivityHour" "StepTotal"

Minute Dataframe

colnames(minute_sleep)

## [1] "Id"    "date"  "value" "logId"

View variables to merge on

We’ll lastly be looking at the Id column and time/date variables for parallels, keeping in mind time/date variable format will be different between the dataframes.

Daily Dataframes

daily_activity %>%
  select(Id, ActivityDate)
## # A tibble: 940 × 2
##            Id ActivityDate
##         <dbl> <chr>       
##  1 1503960366 4/12/2016   
##  2 1503960366 4/13/2016   
##  3 1503960366 4/14/2016   
##  4 1503960366 4/15/2016   
##  5 1503960366 4/16/2016   
##  6 1503960366 4/17/2016   
##  7 1503960366 4/18/2016   
##  8 1503960366 4/19/2016   
##  9 1503960366 4/20/2016   
## 10 1503960366 4/21/2016   
## # ℹ 930 more rows
daily_calories %>%
  select(Id, ActivityDay)
## # A tibble: 940 × 2
##            Id ActivityDay
##         <dbl> <chr>      
##  1 1503960366 4/12/2016  
##  2 1503960366 4/13/2016  
##  3 1503960366 4/14/2016  
##  4 1503960366 4/15/2016  
##  5 1503960366 4/16/2016  
##  6 1503960366 4/17/2016  
##  7 1503960366 4/18/2016  
##  8 1503960366 4/19/2016  
##  9 1503960366 4/20/2016  
## 10 1503960366 4/21/2016  
## # ℹ 930 more rows
daily_intensities %>%
  select(Id, ActivityDay)
## # A tibble: 940 × 2
##            Id ActivityDay
##         <dbl> <chr>      
##  1 1503960366 4/12/2016  
##  2 1503960366 4/13/2016  
##  3 1503960366 4/14/2016  
##  4 1503960366 4/15/2016  
##  5 1503960366 4/16/2016  
##  6 1503960366 4/17/2016  
##  7 1503960366 4/18/2016  
##  8 1503960366 4/19/2016  
##  9 1503960366 4/20/2016  
## 10 1503960366 4/21/2016  
## # ℹ 930 more rows
daily_steps %>%
  select(Id, ActivityDay)
## # A tibble: 940 × 2
##            Id ActivityDay
##         <dbl> <chr>      
##  1 1503960366 4/12/2016  
##  2 1503960366 4/13/2016  
##  3 1503960366 4/14/2016  
##  4 1503960366 4/15/2016  
##  5 1503960366 4/16/2016  
##  6 1503960366 4/17/2016  
##  7 1503960366 4/18/2016  
##  8 1503960366 4/19/2016  
##  9 1503960366 4/20/2016  
## 10 1503960366 4/21/2016  
## # ℹ 930 more rows
daily_sleep %>%
  select(Id, SleepDay)
## # A tibble: 413 × 2
##            Id SleepDay             
##         <dbl> <chr>                
##  1 1503960366 4/12/2016 12:00:00 AM
##  2 1503960366 4/13/2016 12:00:00 AM
##  3 1503960366 4/15/2016 12:00:00 AM
##  4 1503960366 4/16/2016 12:00:00 AM
##  5 1503960366 4/17/2016 12:00:00 AM
##  6 1503960366 4/19/2016 12:00:00 AM
##  7 1503960366 4/20/2016 12:00:00 AM
##  8 1503960366 4/21/2016 12:00:00 AM
##  9 1503960366 4/23/2016 12:00:00 AM
## 10 1503960366 4/24/2016 12:00:00 AM
## # ℹ 403 more rows
weight_log %>%
  select(Id, Date)
## # A tibble: 67 × 2
##            Id Date                 
##         <dbl> <chr>                
##  1 1503960366 5/2/2016 11:59:59 PM 
##  2 1503960366 5/3/2016 11:59:59 PM 
##  3 1927972279 4/13/2016 1:08:52 AM 
##  4 2873212765 4/21/2016 11:59:59 PM
##  5 2873212765 5/12/2016 11:59:59 PM
##  6 4319703577 4/17/2016 11:59:59 PM
##  7 4319703577 5/4/2016 11:59:59 PM 
##  8 4558609924 4/18/2016 11:59:59 PM
##  9 4558609924 4/25/2016 11:59:59 PM
## 10 4558609924 5/1/2016 11:59:59 PM 
## # ℹ 57 more rows

Looking at the column names and row counts, it appears that all of the data in daily_calories, daily_intensities, and daily_steps is already contained in daily_activity, and upon further investigation this is indeed the case.
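One quick way to check this is sketched below; it uses the original column names, since the rename step comes later, and the same pattern applies to daily_calories and daily_intensities.

# Sketch: every (Id, day, steps) row in daily_activity should also appear in daily_steps
daily_activity %>%
  transmute(Id, ActivityDay = ActivityDate, StepTotal = TotalSteps) %>%
  anti_join(daily_steps, by = c("Id", "ActivityDay", "StepTotal")) %>%
  nrow() # expect 0 rows left over

With that confirmed, let’s remove the three redundant dataframes: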

rm("daily_calories", "daily_intensities", "daily_steps")

Since the weight_log dataframe contains only 67 observations, let’s take a closer look at it:

weight_log %>%
  count(Id)
## # A tibble: 8 × 2
##           Id     n
##        <dbl> <int>
## 1 1503960366     2
## 2 1927972279     1
## 3 2873212765     2
## 4 4319703577     2
## 5 4558609924     5
## 6 5577150313     1
## 7 6962181067    30
## 8 8877689391    24

Only 8 users self-reported their weight, anywhere from once to thirty times, which renders the data unusable: too few users, too much variance in reporting frequency, and too little time for meaningful change. Let’s scrub it:

rm("weight_log")

Hourly Dataframes

hourly_calories %>%
  select(Id, ActivityHour)
## # A tibble: 22,099 × 2
##            Id ActivityHour         
##         <dbl> <chr>                
##  1 1503960366 4/12/2016 12:00:00 AM
##  2 1503960366 4/12/2016 1:00:00 AM 
##  3 1503960366 4/12/2016 2:00:00 AM 
##  4 1503960366 4/12/2016 3:00:00 AM 
##  5 1503960366 4/12/2016 4:00:00 AM 
##  6 1503960366 4/12/2016 5:00:00 AM 
##  7 1503960366 4/12/2016 6:00:00 AM 
##  8 1503960366 4/12/2016 7:00:00 AM 
##  9 1503960366 4/12/2016 8:00:00 AM 
## 10 1503960366 4/12/2016 9:00:00 AM 
## # ℹ 22,089 more rows
hourly_intensities %>%
  select(Id, ActivityHour)
## # A tibble: 22,099 × 2
##            Id ActivityHour         
##         <dbl> <chr>                
##  1 1503960366 4/12/2016 12:00:00 AM
##  2 1503960366 4/12/2016 1:00:00 AM 
##  3 1503960366 4/12/2016 2:00:00 AM 
##  4 1503960366 4/12/2016 3:00:00 AM 
##  5 1503960366 4/12/2016 4:00:00 AM 
##  6 1503960366 4/12/2016 5:00:00 AM 
##  7 1503960366 4/12/2016 6:00:00 AM 
##  8 1503960366 4/12/2016 7:00:00 AM 
##  9 1503960366 4/12/2016 8:00:00 AM 
## 10 1503960366 4/12/2016 9:00:00 AM 
## # ℹ 22,089 more rows
hourly_steps %>%
  select(Id, ActivityHour)
## # A tibble: 22,099 × 2
##            Id ActivityHour         
##         <dbl> <chr>                
##  1 1503960366 4/12/2016 12:00:00 AM
##  2 1503960366 4/12/2016 1:00:00 AM 
##  3 1503960366 4/12/2016 2:00:00 AM 
##  4 1503960366 4/12/2016 3:00:00 AM 
##  5 1503960366 4/12/2016 4:00:00 AM 
##  6 1503960366 4/12/2016 5:00:00 AM 
##  7 1503960366 4/12/2016 6:00:00 AM 
##  8 1503960366 4/12/2016 7:00:00 AM 
##  9 1503960366 4/12/2016 8:00:00 AM 
## 10 1503960366 4/12/2016 9:00:00 AM 
## # ℹ 22,089 more rows

Minute Dataframe

minute_sleep %>%
  select(Id, date)
## # A tibble: 188,521 × 2
##            Id date                
##         <dbl> <chr>               
##  1 1503960366 4/12/2016 2:47:30 AM
##  2 1503960366 4/12/2016 2:48:30 AM
##  3 1503960366 4/12/2016 2:49:30 AM
##  4 1503960366 4/12/2016 2:50:30 AM
##  5 1503960366 4/12/2016 2:51:30 AM
##  6 1503960366 4/12/2016 2:52:30 AM
##  7 1503960366 4/12/2016 2:53:30 AM
##  8 1503960366 4/12/2016 2:54:30 AM
##  9 1503960366 4/12/2016 2:55:30 AM
## 10 1503960366 4/12/2016 2:56:30 AM
## # ℹ 188,511 more rows

We can see some time/date formatting inconsistencies between the dataframes, which will be addressed below, and varying numbers of data rows per unique Id.

After removing some dataframes, we are left with the following:

  • daily_activity

  • daily_sleep

  • hourly_calories

  • hourly_intensities

  • hourly_steps

  • minute_sleep

Data cleaning

Having gotten to know the data, we can now pinpoint several data cleaning tasks to proceed with. Here are the primary areas we will focus on:

  1. Standardize naming conventions: we’ll adjust the variable names in adherence to the tidyverse style guide. All variable names will use only lowercase letters and underscores _ (known as snake case) to separate words within a name. This will enhance code readability. (A janitor shortcut for this step is sketched just after this list.)

  2. Standardize date formatting: we’ll make the date-time format consistent across the dataframes. This will help avoid potential issues in subsequent analysis and when we start merging the data.

  3. Create new variables: we’ll create new columns for analysis in each dataframe.

  4. Create merged sleep dataframe: to better analyze the minute_sleep data, we’ll have to do some transformation by reformatting and creating new variables, and aggregate all the sleep data we have into one dataframe.
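On step 1: janitor::clean_names(), from the janitor package loaded earlier, can do the basic snake_case conversion in a single call. We rename manually below because we also want shorter names, but here is a minimal sketch:

# Sketch: automatic snake_case conversion with janitor (alternative to manual renaming)
daily_activity %>%
  janitor::clean_names() %>%
  colnames()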

Daily Dataframes

# daily_activity ----
daily_activity <-
  daily_activity %>% 
  rename(
    date = ActivityDate, 
    total_steps = TotalSteps, 
    total_distance = TotalDistance,
    tracker_distance = TrackerDistance,
    logged_activities_d = LoggedActivitiesDistance,
    very_active_d = VeryActiveDistance, 
    moderately_active_d = ModeratelyActiveDistance, 
    light_active_d = LightActiveDistance, 
    sedentary_d = SedentaryActiveDistance, 
    very_active_m = VeryActiveMinutes, 
    moderately_active_m = FairlyActiveMinutes, 
    lightly_active_m = LightlyActiveMinutes, 
    sedentary_m = SedentaryMinutes, 
    calories = Calories
  ) %>% 
  rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    # reformat variable as POSIXct
    date = parse_date_time(date, "%m/%d/%y"),
    # create new date only variable
    date_ymd = as.Date(date, "%Y/%m/%d"),
    # create new DOW and TOW variables
    day_of_week = weekdays(as.Date(date)),
    part_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend")
  )
# daily_sleep ----
daily_sleep <-
  daily_sleep %>% 
  rename(
    date = SleepDay,
    total_sleep_records = TotalSleepRecords,
    total_minutes_asleep = TotalMinutesAsleep,
    total_time_in_bed = TotalTimeInBed
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    date = parse_date_time(date, "%m/%d/%Y %I:%M:%S %p"),
    date_ymd = as.Date(date, "%Y/%m/%d"),
    day_of_week = weekdays(as.Date(date)),
    part_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend")
  )

Hourly Dataframes

# hourly_calories ----
hourly_calories <-
  hourly_calories %>% 
  rename(
    hour = ActivityHour,
    calories = Calories
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    hour = parse_date_time(hour, "%m/%d/%Y %I:%M:%S %p"),
    date_ymd = as.Date(hour, "%Y/%m/%d"), 
    time = format(hour, format = "%I:%M:%S %p"),
    day_of_week = weekdays(as.Date(hour)),
    part_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
    hour_of_day = as.POSIXct(time, format = "%I:%M:%S %p")
  )
# hourly_intensities ----
hourly_intensities <- 
  hourly_intensities %>% 
  rename(
    hour = ActivityHour,
    total_intensity = TotalIntensity,
    average_intensity = AverageIntensity
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    hour = parse_date_time(hour, "%m/%d/%Y %I:%M:%S %p"),
    date_ymd = as.Date(hour, "%Y/%m/%d"), 
    time = format(hour, format = "%I:%M:%S %p"),
    day_of_week = weekdays(as.Date(hour)),
    part_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
    hour_of_day = as.POSIXct(time, format = "%I:%M:%S %p")
  )
# hourly_steps ----
hourly_steps <- 
  hourly_steps %>% 
  rename(
    hour = ActivityHour,
    step_total = StepTotal
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    hour = parse_date_time(hour, "%m/%d/%Y %I:%M:%S %p"),
    date_ymd = as.Date(hour, "%Y/%m/%d"), 
    time = format(hour, format = "%I:%M:%S %p"),
    day_of_week = weekdays(as.Date(hour)),
    part_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
    hour_of_day = as.POSIXct(time, format = "%I:%M:%S %p")
  )

Minute Dataframes

# minute_sleep ----
minute_sleep <-
  minute_sleep %>% 
  rename(
    sleep_value = value,
    log_id = logId
  ) %>% 
   rename_with(
    tolower, starts_with("Id")
  ) %>% 
  mutate(
    date = parse_date_time(date, "%m/%d/%Y %I:%M:%S %p"),
    date_ymd = as.Date(date, "%Y/%m/%d"),
    time = format(date, format = "%I:%M:00 %p"), 
    day_of_week = weekdays(as.Date(date)),
    part_of_week = case_when(
      day_of_week %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday") ~ "Weekday", 
      day_of_week %in% c("Saturday", "Sunday") ~ "Weekend"),
    hour_of_day = as.POSIXct(time, format = "%I:%M:%S %p"),
    # sleep_id will be used to generate totals per sleep log
    sleep_id = str_c(id, "-", log_id), 
    # Create new variables for sleep values - asleep, restless, and awake
    asleep = ifelse(sleep_value == 1, 1, 0),
    restless = ifelse(sleep_value == 2, 1, 0),
    awake = ifelse(sleep_value == 3, 1, 0)
  )
# Create sleep_summary_0 df ----
# shows totals for 3 sleep values per sleep log
sleep_summary_0 <-
  minute_sleep %>%
  # id_date will be used to generate a daily total
  mutate(
    id_date = str_c(id, "-", date_ymd)
  ) %>% 
  group_by(sleep_id, date_ymd, id_date, id) %>% 
  dplyr::summarize(
    total_asleep = sum(sleep_value == 1),
    total_restless = sum(sleep_value == 2),
    total_awake = sum(sleep_value == 3)
  )
# Create sleep_summary df ----
# that combines totals for each day per id
sleep_summary <-
  sleep_summary_0 %>%
  # date will be used to merge with daily_sleep df
  mutate(
    date = parse_date_time(date_ymd, "%Y/%m/%d")
  ) %>%
  group_by(id_date, date, id) %>%
  dplyr::summarize(
    total_asleep_merged = sum(total_asleep),
    total_restless_merged = sum(total_restless),
    total_awake_merged = sum(total_awake)
  )
# Merge these two daily sleep dfs into one ----
sleep_data <- merge(x = daily_sleep, y = sleep_summary, by = c("id", "date"), all = TRUE)

ANALYZE

Summary statistics

Now that we’ve done some cleaning and formatting, we can look at this data from a broad perspective. We’ll do this by obtaining summary statistics that provide an overview of how the dataframes interrelate and potentially reveal trends within each.

Unique participants

Daily Dataframes

# There are 33 users in the daily activity dataframe
n_distinct(daily_activity$id)
## [1] 33
# There are 24 users in the sleep dataframes
n_distinct(daily_sleep$id)
## [1] 24
n_distinct(sleep_data$id)
## [1] 24

Hourly Dataframes

# There are 33 users in the hourly dataframes
n_distinct(hourly_calories$id)
## [1] 33
n_distinct(hourly_intensities$id)
## [1] 33
n_distinct(hourly_steps$id)
## [1] 33

Minute Dataframe

# There are 24 users in the minute dataframe
n_distinct(minute_sleep$id)
## [1] 24

Total number of observations

Daily Dataframes

The observation counts vary between the daily dataframes:

nrow(daily_activity)
## [1] 940
nrow(daily_sleep)
## [1] 413
nrow(sleep_data)
## [1] 452

Hourly Dataframes

There are 22099 observations in each hourly dataframe:

nrow(hourly_calories)
## [1] 22099
nrow(hourly_intensities)
## [1] 22099
nrow(hourly_steps)
## [1] 22099

Minute Dataframe

There are 188521 observations in the minute dataframe:

nrow(minute_sleep)
## [1] 188521

Overview

Daily Dataframes

# Daily totals for steps, distance, calories ----
daily_activity %>%
  select(
    total_steps,
    total_distance,
    calories
  ) %>%
  summary()
##   total_steps    total_distance      calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:1828  
##  Median : 7406   Median : 5.245   Median :2134  
##  Mean   : 7638   Mean   : 5.490   Mean   :2304  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:2793  
##  Max.   :36019   Max.   :28.030   Max.   :4900
# Minutes per activity level ----
daily_activity %>%
  select(
    very_active_m,
    moderately_active_m,
    lightly_active_m,
    sedentary_m
  ) %>%
  summary()
##  very_active_m    moderately_active_m lightly_active_m  sedentary_m    
##  Min.   :  0.00   Min.   :  0.00      Min.   :  0.0    Min.   :   0.0  
##  1st Qu.:  0.00   1st Qu.:  0.00      1st Qu.:127.0    1st Qu.: 729.8  
##  Median :  4.00   Median :  6.00      Median :199.0    Median :1057.5  
##  Mean   : 21.16   Mean   : 13.56      Mean   :192.8    Mean   : 991.2  
##  3rd Qu.: 32.00   3rd Qu.: 19.00      3rd Qu.:264.0    3rd Qu.:1229.5  
##  Max.   :210.00   Max.   :143.00      Max.   :518.0    Max.   :1440.0
# Sleep totals ----
sleep_data %>%
  select(
    total_minutes_asleep, 
    total_time_in_bed,
    total_asleep_merged,
    total_restless_merged,
    total_awake_merged
  ) %>%
  summary()
##  total_minutes_asleep total_time_in_bed total_asleep_merged
##  Min.   : 58.0        Min.   : 61.0     Min.   :  0.0      
##  1st Qu.:361.0        1st Qu.:403.0     1st Qu.:325.0      
##  Median :433.0        Median :463.0     Median :407.5      
##  Mean   :419.5        Mean   :458.6     Mean   :384.8      
##  3rd Qu.:490.0        3rd Qu.:526.0     3rd Qu.:477.0      
##  Max.   :796.0        Max.   :961.0     Max.   :778.0      
##  NA's   :39           NA's   :39                           
##  total_restless_merged total_awake_merged
##  Min.   :  0.00        Min.   :  0.000   
##  1st Qu.: 11.00        1st Qu.:  0.000   
##  Median : 20.00        Median :  3.000   
##  Mean   : 31.17        Mean   :  4.469   
##  3rd Qu.: 31.00        3rd Qu.:  5.000   
##  Max.   :227.00        Max.   :129.000   
## 

Hourly Dataframes

# Hourly calories summary ----
hourly_calories %>%
  select(calories) %>%
  summary()
##     calories     
##  Min.   : 42.00  
##  1st Qu.: 63.00  
##  Median : 83.00  
##  Mean   : 97.39  
##  3rd Qu.:108.00  
##  Max.   :948.00
# Hourly intensities summary ----
hourly_intensities %>%
  select(
    total_intensity,
    average_intensity
  ) %>%
  summary()
##  total_intensity  average_intensity
##  Min.   :  0.00   Min.   :0.0000   
##  1st Qu.:  0.00   1st Qu.:0.0000   
##  Median :  3.00   Median :0.0500   
##  Mean   : 12.04   Mean   :0.2006   
##  3rd Qu.: 16.00   3rd Qu.:0.2667   
##  Max.   :180.00   Max.   :3.0000
# Hourly steps summary ----
hourly_steps %>%
  select(step_total) %>%
  summary()
##    step_total     
##  Min.   :    0.0  
##  1st Qu.:    0.0  
##  Median :   40.0  
##  Mean   :  320.2  
##  3rd Qu.:  357.0  
##  Max.   :10554.0

Minute Dataframe

# Minute sleep summary ----
minute_sleep %>%
  select(
    sleep_value,
    asleep,
    restless,
    awake
  ) %>%
  summary()
##   sleep_value        asleep          restless           awake       
##  Min.   :1.000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :1.000   Median :1.0000   Median :0.00000   Median :0.0000  
##  Mean   :1.096   Mean   :0.9149   Mean   :0.07438   Mean   :0.0107  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :3.000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000

Some findings

What does this tell us about this sample of people’s activities? A few interesting things:

  • 7,638 – average total steps per day, which is below the generally recommended 10,000 goal.
  • 991.2 – average “Sedentary” minutes per day; this is just over 16.5 hours per day.
  • 192.8 – average “Lightly Active” minutes per day.
  • 34.72 – combined daily average of “Very Active” (21.16) and “Moderately Active” (13.56) minutes per day. Over a full week that is roughly 243 minutes, which suggests that, on average, these individuals exceed the WHO recommendation of 150 minutes of moderate-intensity physical activity per week (see the arithmetic sketch after this list).
  • 91.5% – average percentage of time asleep while in bed. This is calculated by dividing “Total Minutes Asleep” (419.5) by “Total Time In Bed” (458.6).
  • 97.39 – average calories burned per hour.
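These figures follow directly from the summary output above; here is the quick arithmetic as an R sketch:

# Sketch: deriving the headline figures from the summary statistics above
21.16 + 13.56        # 34.72 combined active minutes per day
(21.16 + 13.56) * 7  # roughly 243 active minutes per week, versus the WHO's 150
991.2 / 60           # roughly 16.5 sedentary hours per day
419.5 / 458.6        # roughly 0.915, i.e. ~91.5% of time in bed spent asleep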

Plotting a few explorations

Steps and active minutes

The goal of 10,000 steps per day has been widely proposed across the smart device industry, but is daily step count truly a good metric of health? Let’s consider the relationship between total steps and calories burned:

# Total steps vs calories plot ----
ggplot(data = daily_activity, aes(x = total_steps, y = calories)) +
  geom_point() +
  geom_smooth(method = "lm") + # Linear trendline with 95% confidence interval
  labs(title = "Total steps vs. calories burned") +
  xlab("Total Steps") +
  ylab("Calories") +
  theme_minimal()

cor.test(daily_activity$total_steps, daily_activity$calories, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  daily_activity$total_steps and daily_activity$calories
## t = 22.472, df = 938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5483688 0.6316184
## sample estimates:
##       cor 
## 0.5915681

We see a moderate positive correlation (r ≈ 0.59) between total_steps and calories. This is expected: the more active someone is, the more calories they burn. Let’s further consider the relationship between very_active_m and calories:

# Very active minutes vs calories plot ----
ggplot(data = daily_activity, aes(x = very_active_m, y = calories)) +
  geom_point() +
  geom_smooth(method = "lm") + # Linear trendline with 95% confidence interval
  labs(title = "Very active minutes vs. calories burned") +
  xlab("Very active minutes") +
  ylab("Calories") +
  theme_minimal()

cor.test(daily_activity$very_active_m, daily_activity$calories, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  daily_activity$very_active_m and daily_activity$calories
## t = 23.939, df = 938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5745197 0.6540260
## sample estimates:
##       cor 
## 0.6158383

Again we see a positive correlation (r ≈ 0.62): very active people burn more calories. We also know from the earlier summary statistics that the average very_active_m per day (21.16 minutes) is far lower than the average lightly_active_m per day (192.80 minutes). Let’s look at the relationship between lightly_active_m and calories:

# Lightly active minutes vs calories plot ----
ggplot(data = daily_activity, aes(x = lightly_active_m, y = calories)) +
  geom_point() +
  geom_smooth(method = "lm") + # Linear trendline with  95% confidence interval
  labs(title = "Lightly active minutes vs. calories burned") +
  xlab("Lightly active minutes") +
  ylab("Calories") +
  theme_minimal()

cor.test(daily_activity$lightly_active_m, daily_activity$calories, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  daily_activity$lightly_active_m and daily_activity$calories
## t = 9.1661, df = 938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2269361 0.3443464
## sample estimates:
##       cor 
## 0.2867175

Here the relationship is much weaker (r ≈ 0.29). The correlation is still positive, but possibly negligible, which suggests that lightly active minutes have less impact on the calories burned over the course of a day.

Daily calories

Let’s look at the relationship between calories and day of the week:

# Create limits for plot ----
limits_weekdays <- c(
  "Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
  "Saturday", "Sunday"
  )
# Calories vs day of week plot ----
ggplot(data = daily_activity) +
  geom_point(
    aes(x = day_of_week, y = calories, color = as.factor(part_of_week)),
    position = "jitter", 
    alpha = 0.3
    ) +
  labs(title = "Calories burned throughout the week", color = "Part of the week") +
  scale_x_discrete(
    "Day of the week", 
    limits = limits_weekdays
    ) +
  ylab("Total calories") +
  stat_summary(
    aes(x = day_of_week, y = calories),
    fun = median, 
    geom = "point",
    color = "black", 
    size = 2, 
    alpha = 0.7
    ) +
  theme_minimal()

There doesn’t appear to be a meaningful difference in median calories burned across the days of the week. Because there are some zero or near-zero values, we use the median, represented by the black dot on each day.
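The day-by-day medians behind those black dots can also be tabulated directly; a minimal sketch:

# Sketch: median calories burned by day of the week
daily_activity %>%
  group_by(day_of_week) %>%
  dplyr::summarize(median_calories = median(calories))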

Daily sleep

Let’s look at the relationship between sleep quality and part of the week:

# Total minutes asleep vs part of week ----
ggplot(data = sleep_data) +
  geom_point(
    aes(
    x = weekdays.POSIXt(date), 
    y = total_asleep_merged,
    color = as.factor(part_of_week)
    ),
  position = "jitter",
  alpha = 0.3
  ) +
  labs(title = "Total minutes asleep throughout the week") +
  guides(color = "none") +
  ylab("Minutes asleep") +
  scale_x_discrete(
    "Day of the week",
    limits = limits_weekdays
    ) +
  stat_summary(
    aes(x = weekdays.POSIXt(date), y = total_asleep_merged),
    fun = median, 
    geom = "point", 
    color = "black", 
    size = 2,
    alpha = 0.5
    ) +
  theme_minimal()

# Total minutes restless vs part of week ----
ggplot(data = sleep_data) +
  geom_point(
    aes(
    x = weekdays.POSIXt(date), 
    y = total_restless_merged,
    color = as.factor(part_of_week)
    ),
  position = "jitter",
  alpha = 0.3
  ) +
  labs(title = "Total minutes restless throughout the week") +
  guides(color = "none") +
  ylab("Minutes restless") +
  scale_x_discrete(
    "Day of the week", 
    limits = limits_weekdays
    ) +
  stat_summary(
    aes(x = weekdays.POSIXt(date), y = total_restless_merged),
    fun = mean, 
    geom = "point", 
    color = "black", 
    size = 2,
    alpha = 0.5
    ) +
  theme_minimal()

# Total minutes awake vs part of week ----
ggplot(data = sleep_data) +
  geom_point(
    aes(
      x = weekdays.POSIXt(date), 
      y = total_awake_merged,
      color = as.factor(part_of_week)
      ),
    position = "jitter",
    alpha = 0.3
    ) +
  labs(title = "Total minutes awake throughout the week") +
  guides(color = "none") +
  ylab("Minutes awake") +
  scale_x_discrete(
    "Day of the week",
    limits = limits_weekdays
    ) +
  stat_summary(
    aes(x = weekdays.POSIXt(date), y = total_awake_merged),
    fun = mean, 
    geom = "point", 
    color = "black", 
    size = 2,
    alpha = 0.5
    ) +
  theme_minimal()

Based on the summary points for each day (median for minutes asleep, mean for minutes restless and awake), we can see that:

  • Median total minutes asleep were highest on Sunday, Wednesday, and Saturday
  • Average total minutes awake and total minutes restless were highest on Sunday and Saturday

People do seem to get more sleep on the weekends, but also more time awake and restless, suggesting that sleep quality stays roughly the same regardless of the day.

Merging data

From these initial investigations, it appears that active minutes might provide more valuable customer insights than total steps, calories, or sleep. It will be worth combining these datasets and digging deeper into this particular metric.

daily_data

We’ll start with combining daily_activity and sleep_data into one daily_data dataframe:

# Merging daily data into one df ----
daily_data <-
  merge(
    x = daily_activity, y = sleep_data,
    by = c("id", "date", "date_ymd", "day_of_week", "part_of_week"), 
    all.x = TRUE
  )
# Add two new activity level minute variables to daily data ----
daily_data <-
  daily_data %>%
  mutate(
    daily_mod_very_m = moderately_active_m + very_active_m, 
    daily_light_sed_m = lightly_active_m + sedentary_m,
    daily_all_m = very_active_m + moderately_active_m + lightly_active_m + sedentary_m
  )

hourly_data

We’ll also combine hourly_calories, hourly_intensities, and hourly_steps into one hourly_data dataframe:

# Merging hourly data into one df  ----
hourly_data_0 <-
  merge(
    x = hourly_calories, y = hourly_intensities,
    by = c(
      "id", "hour", "date_ymd", "time", "hour_of_day", 
      "day_of_week", "part_of_week"
    ),
    all.x = TRUE, no.dups = TRUE
  )
hourly_data <-
  merge(
    x = hourly_data_0, y = hourly_steps,
    by = c(
      "id", "hour", "date_ymd", "time", "hour_of_day", 
      "day_of_week", "part_of_week"
    ),
    all.x = TRUE
  )

weekly_data

Finally, we’ll create a weekly_data dataframe using data from daily_data:

# Create new weekly_data_0 df  ----
weekly_data_0 <-
  daily_data %>%
  select(
    id, date_ymd, day_of_week, date, total_steps, total_distance,
    very_active_d, moderately_active_d, light_active_d,
    sedentary_d, very_active_m, moderately_active_m,
    lightly_active_m, sedentary_m, calories
  ) %>%
  mutate(
    week_number = paste0(
      "Week ", (as.numeric(format(date_ymd, "%U"))), " of ", 
      format(date_ymd, "%Y")
      )
  )
# Summarize totals by week in new weekly_data df  ----
weekly_data <-
  weekly_data_0 %>%
  group_by(id, week_number) %>%
  dplyr::summarize(
    weekly_total_steps = sum(total_steps),
    weekly_total_distance = sum(total_distance),
    weekly_very_d = sum(very_active_d),
    weekly_moderate_d = sum(moderately_active_d),
    weekly_light_d = sum(light_active_d),
    weekly_sedentary_d = sum(sedentary_d),
    weekly_very_m = sum(very_active_m),
    weekly_moderate_m = sum(moderately_active_m),
    weekly_mod_very_m = weekly_very_m + weekly_moderate_m,
    weekly_light_m = sum(lightly_active_m),
    weekly_sedentary_m = sum(sedentary_m),
    weekly_light_sed_m = weekly_light_m + weekly_sedentary_m,
    weekly_calories = sum(calories)
  )

Data cleaning, again

Before continuing our analysis with the merged dataframes, let’s do some more data cleaning.

Delete old dataframes

Let’s delete some old dataframes that we no longer need:

# List all objects in current R workspace ----
ls()
##  [1] "daily_activity"     "daily_data"         "daily_sleep"       
##  [4] "hourly_calories"    "hourly_data"        "hourly_data_0"     
##  [7] "hourly_intensities" "hourly_steps"       "limits_weekdays"   
## [10] "minute_sleep"       "sleep_data"         "sleep_summary"     
## [13] "sleep_summary_0"    "weekly_data"        "weekly_data_0"     
## [16] "weight_log"
# Remove objects no longer needed ----
rm("daily_sleep")
rm("hourly_calories", "hourly_intensities", "hourly_steps", "minute_sleep")
rm("hourly_data_0", "sleep_summary_0", "weekly_data_0")
# List all objects again in current R workspace  ----
ls()
## [1] "daily_activity"  "daily_data"      "hourly_data"     "limits_weekdays"
## [5] "sleep_data"      "sleep_summary"   "weekly_data"     "weight_log"

Create ordinal variables for faceting

In order to create plots faceted by day of the week or time of day in the correct order, we’ll need to add a new factor variable to each of the merged dataframes:

# Create day_list(s) for ordered facet grid/wrapping  ----
hourly_data$day_list <-
  factor(hourly_data$day_of_week, levels = c(
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday"
  ))
daily_data$day_list <-
  factor(daily_data$day_of_week, levels = c(
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday"
  ))
# Create time_list for ordered facet grid/wrapping ----
hourly_data$time_list <-
  factor(hourly_data$time,
    levels = c(
      "12:00:00 AM", "01:00:00 AM", "02:00:00 AM", "03:00:00 AM",
      "04:00:00 AM", "05:00:00 AM", "06:00:00 AM", "07:00:00 AM",
      "08:00:00 AM", "09:00:00 AM", "10:00:00 AM", "11:00:00 AM",
      "12:00:00 PM", "01:00:00 PM", "02:00:00 PM", "03:00:00 PM",
      "04:00:00 PM", "05:00:00 PM", "06:00:00 PM", "07:00:00 PM",
      "08:00:00 PM", "09:00:00 PM", "10:00:00 PM", "11:00:00 PM"
    )
  )

Exploring data record limitations

Let’s look at the aggregate of all activity minutes for each day:

daily_data[1:20, c("id", "daily_all_m")]
##            id daily_all_m
## 1  1503960366        1094
## 2  1503960366        1033
## 3  1503960366        1440
## 4  1503960366         998
## 5  1503960366        1040
## 6  1503960366         761
## 7  1503960366        1440
## 8  1503960366        1120
## 9  1503960366        1063
## 10 1503960366        1076
## 11 1503960366        1440
## 12 1503960366        1056
## 13 1503960366         991
## 14 1503960366        1117
## 15 1503960366        1166
## 16 1503960366        1440
## 17 1503960366        1047
## 18 1503960366        1086
## 19 1503960366        1015
## 20 1503960366        1044

We can see that for many days the minutes do not add up to the 1440 minutes in a full day. Perhaps users take off their device during sleep, intense exercise, or other activities where it is safer or more comfortable to remove it. As a precaution, we should take any analysis based on activity minutes with a grain of salt, and use this insight to inform business recommendations on encouraging consistent device usage, or on mitigating the loss of personal analytics for users who do not wear their device 24/7.
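To get a sense of how common these incomplete days are, something like the following could be used (a sketch):

# Sketch: what share of day-level records fall short of a full 1440 minutes?
mean(daily_data$daily_all_m < 1440)
summary(daily_data$daily_all_m)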

Let’s look at hourly observations to get an idea of when the device fails to get observations:

hour_count <- as.data.frame(table(hourly_data$hour_of_day))
colnames(hour_count) <- c("hour_of_the_day", "observation_count")
hour_count$hour_of_the_day <- gsub("^.{0,11}", "", hour_count$hour_of_the_day)
hour_count[hour_count == ""] <- "00:00:00"
hour_count
##    hour_of_the_day observation_count
## 1         00:00:00               934
## 2         01:00:00               933
## 3         02:00:00               933
## 4         03:00:00               933
## 5         04:00:00               932
## 6         05:00:00               932
## 7         06:00:00               931
## 8         07:00:00               931
## 9         08:00:00               931
## 10        09:00:00               931
## 11        10:00:00               929
## 12        11:00:00               927
## 13        12:00:00               922
## 14        13:00:00               921
## 15        14:00:00               921
## 16        15:00:00               915
## 17        16:00:00               907
## 18        17:00:00               906
## 19        18:00:00               906
## 20        19:00:00               906
## 21        20:00:00               906
## 22        21:00:00               905
## 23        22:00:00               904
## 24        23:00:00               903

Now this is very interesting. The observation count is highest at midnight and falls steadily throughout the day, resetting at the start of the next day. This tells a different story from the one we conjectured above from the activity minutes: participants appear to have kept their fitness devices on the whole time, and the gaps in the records come from the devices failing to record toward the end of each day rather than from participants taking them off.

This means data may be lost toward the end of the day (time spent relaxing after work, going to the gym, playing a sport), which is definitely something to keep in mind as we move forward.

Mean or median?

Before we proceed with the new summary statistics, we need to decide whether the mean or the median better represents a typical value.

The better choice depends on how strongly outliers pull the mean. Let’s visualize the distributions with histograms of the new dataframes:

daily_data

# Daily histograms ----
daily_hplot <- ggplot(data = daily_data)
# Histogram for daily_mod_very_m ----
bw <- nclass.FD(daily_data$daily_mod_very_m) # Compute a suggested number of bins (Freedman-Diaconis) for each plot
daily_hplot +
  geom_histogram(aes(x = daily_mod_very_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Daily moderately/very active minutes",
    caption = "Mean: 34.87  |  Median: 21.00"
  ) 

# Histogram for daily_light_sed_m ----
bw <- nclass.FD(daily_data$daily_light_sed_m)
daily_hplot +
  geom_histogram(aes(x = daily_light_sed_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Daily lightly active/sedentary minutes",
    caption = "Mean: 1183  |  Median: 1333"
  )

# Histogram for total_steps ----
bw <- nclass.FD(daily_data$total_steps)
daily_hplot +
  geom_histogram(aes(x = total_steps), bins = bw) +
  labs(
    x = "Total Steps",
    y = "Count",
    title = "Daily total steps",
    caption = "Mean: 7652  |  Median: 7439"
  )

# Histogram for calories ----
bw <- nclass.FD(daily_data$calories)
daily_hplot +
  geom_histogram(aes(x = calories), bins = bw) +
  labs(
    x = "Calories",
    y = "Count",
    title = "Daily total calories",
    caption = "Mean: 2308  |  Median: 2140"
  )

# Histogram for total_minutes_asleep ----
daily_hplot +
  geom_histogram(aes(x = total_minutes_asleep)) +
  labs(
    x = "Total Minutes Asleep",
    y = "Count",
    title = "Daily minutes asleep",
    caption = "Mean: 419.5  |  Median: 433.0"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 530 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Histogram for total_time_in_bed ----
daily_hplot +
  geom_histogram(aes(x = total_time_in_bed)) +
  labs(
    x = "Total Time in Bed",
    y = "Count",
    title = "Daily total time in bed",
    caption = "Mean: 458.6  |  Median: 463.0"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 530 rows containing non-finite outside the scale range
## (`stat_bin()`).

rm("daily_hplot")

Since outliers are present in several of these variables and likely pull the mean, we’ll use the median for the following summary statistics: daily moderately/very active minutes, daily lightly active/sedentary minutes, and daily total steps. We can still use the mean for the sleep variables and calories.
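For reference, the gap between mean and median can be checked directly; a minimal sketch for two of the skewed variables:

# Sketch: mean vs. median for two of the skewed daily variables
daily_data %>%
  dplyr::summarize(
    mean_mod_very = mean(daily_mod_very_m),
    median_mod_very = median(daily_mod_very_m),
    mean_steps = mean(total_steps),
    median_steps = median(total_steps)
  )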

hourly_data

# Hourly histograms ----
hourly_hplot <- ggplot(data = hourly_data)
# Histogram for calories ----
(bw <- nclass.FD(hourly_data$calories))
## [1] 283
hourly_hplot +
  geom_histogram(aes(x = calories), bins = bw) +
  labs(
    title = "Hourly calories",
    caption = "Mean: 97.39  |  Median: 83.00",
    x = "Calories",
    y = "Count"
  )

# Histogram for total_intensity ----
(bw <- nclass.FD(hourly_data$total_intensity))
## [1] 158
hourly_hplot +
  geom_histogram(aes(x = total_intensity), bins = bw) +
  labs(
    title = "Hourly total intensity",
    caption = "Mean: 12.04  |  Median: 3.00",
    x = "Total Intensity",
    y = "Count"
  )

# Histogram for average_intensity ----
(bw <- nclass.FD(hourly_data$average_intensity))
## [1] 158
hourly_hplot +
  geom_histogram(aes(x = average_intensity), bins = bw) +
  labs(
    title = "Hourly average intensity",
    caption = "Mean: 0.2006  |  Median: 0.0500",
    x = "Average Intensity",
    y = "Count"
  )

# Histogram for step_total ----
(bw <- nclass.FD(hourly_data$step_total))
## [1] 415
hourly_hplot +
  geom_histogram(aes(x = step_total), bins = bw) +
  labs(
    x = "Step Total",
    y = "Count",
    title = "Hourly step total",
    caption = "Mean: 320.2  |  Median: 40.0"
  ) 

rm("hourly_hplot")

Using the median looks to be appropriate for all the hourly summary statistics: hourly calories, hourly total intensity, hourly average intensity and hourly step total.

weekly_data

# Weekly histograms ----
weekly_hplot <- ggplot(data = weekly_data)
# Histogram plot for weekly_very_m ----
(bw <- nclass.FD(weekly_data$weekly_very_m))
## [1] 12
weekly_hplot +
  geom_histogram(aes(x = weekly_very_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Weekly very active minutes",
    caption = "Mean: 130.10  |  Median: 68.50"
  )

# Histogram for weekly_moderate_m ----
(bw <- nclass.FD(weekly_data$weekly_moderate_m))
## [1] 15
weekly_hplot +
  geom_histogram(aes(x = weekly_moderate_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Weekly moderately active minutes",
    caption = "Mean: 83.45  |  Median: 65.50"
  )

# Histogram for weekly_light_m ----
(bw <- nclass.FD(weekly_data$weekly_light_m))
## [1] 10
weekly_hplot +
  geom_histogram(aes(x = weekly_light_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Weekly lightly active minutes",
    caption = "Mean: 1182  |  Median: 1188"
  )

# Histogram for weekly_sedentary_m ----
(bw <- nclass.FD(weekly_data$weekly_sedentary_m))
## [1] 9
weekly_hplot +
  geom_histogram(aes(x = weekly_sedentary_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Weekly sedentary minutes",
    caption = "Mean: 6064  |  Median: 5754"
  )

# Histogram for weekly_mod_very_m ----
(bw <- nclass.FD(weekly_data$weekly_mod_very_m))
## [1] 11
weekly_hplot +
  geom_histogram(aes(x = weekly_mod_very_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Weekly moderately/very active minutes",
    caption = "Mean: 213.51  |  Median: 182.00"
  )

# Histogram for weekly_light_sed_m ----
(bw <- nclass.FD(weekly_data$weekly_light_sed_m))
## [1] 7
weekly_hplot +
  geom_histogram(aes(x = weekly_light_sed_m), bins = bw) +
  labs(
    x = "Minutes",
    y = "Count",
    title = "Weekly lightly active/sedentary minutes",
    caption = "Mean: 7246   |  Median: 6956"
  )

# Histogram for weekly_total_steps ----
(bw <- nclass.FD(weekly_data$weekly_total_steps))
## [1] 10
weekly_hplot +
  geom_histogram(aes(x = weekly_total_steps), bins = bw) +
  labs(
    x = "Total Steps",
    y = "Count",
    title = "Weekly total steps",
    caption = "Mean: 46857  |  Median: 44903"
  )

# Histogram for weekly_calories ----
(bw <- nclass.FD(weekly_data$weekly_calories))
## [1] 10
weekly_hplot +
  geom_histogram(aes(x = weekly_calories), bins = bw) +
  labs(
    x = "Calories",
    y = "Count",
    title = "Weekly calories",
    caption = "Mean: 14130  |  Median: 13615"
  )

# Remove plot and bin objects ----
rm("weekly_hplot")
rm(bw)

We can use mean for these five variables: weekly sedentary minutes, weekly lightly active minutes, weekly lightly active/sedentary minutes, weekly total steps and weekly calories. We should use median for the remaining three of the weekly summary statistics: weekly very active minutes, weekly moderate minutes, and weekly moderately/very active minutes.

Summary stats, again

New summary stats

Let’s look at some quick summary statistics we’d want to know about each new dataframe:

daily_data

# Daily steps, distance and calories ----
daily_data %>%
  select(
    total_steps,
    total_distance,
    calories
  ) %>%
  summary()
##   total_steps    total_distance      calories   
##  Min.   :    0   Min.   : 0.000   Min.   :   0  
##  1st Qu.: 3795   1st Qu.: 2.620   1st Qu.:1830  
##  Median : 7439   Median : 5.260   Median :2140  
##  Mean   : 7652   Mean   : 5.503   Mean   :2308  
##  3rd Qu.:10734   3rd Qu.: 7.720   3rd Qu.:2796  
##  Max.   :36019   Max.   :28.030   Max.   :4900
# Active minute levels per category ----
daily_data %>%
  select(
    very_active_m,
    moderately_active_m,
    lightly_active_m,
    sedentary_m,
    daily_mod_very_m,
    daily_light_sed_m
  ) %>%
  summary()
##  very_active_m    moderately_active_m lightly_active_m  sedentary_m    
##  Min.   :  0.00   Min.   :  0.00      Min.   :  0      Min.   :   0.0  
##  1st Qu.:  0.00   1st Qu.:  0.00      1st Qu.:127      1st Qu.: 729.0  
##  Median :  4.00   Median :  7.00      Median :199      Median :1057.0  
##  Mean   : 21.24   Mean   : 13.63      Mean   :193      Mean   : 990.4  
##  3rd Qu.: 32.00   3rd Qu.: 19.00      3rd Qu.:264      3rd Qu.:1229.0  
##  Max.   :210.00   Max.   :143.00      Max.   :518      Max.   :1440.0  
##  daily_mod_very_m daily_light_sed_m
##  Min.   :  0.00   Min.   :   2     
##  1st Qu.:  0.00   1st Qu.: 947     
##  Median : 21.00   Median :1333     
##  Mean   : 34.87   Mean   :1183     
##  3rd Qu.: 57.00   3rd Qu.:1432     
##  Max.   :275.00   Max.   :1440
# Daily overall sleep ----
daily_data %>%
  select(
    total_sleep_records,
    total_minutes_asleep,
    total_time_in_bed
  ) %>%
  summary()
##  total_sleep_records total_minutes_asleep total_time_in_bed
##  Min.   :1.000       Min.   : 58.0        Min.   : 61.0    
##  1st Qu.:1.000       1st Qu.:361.0        1st Qu.:403.0    
##  Median :1.000       Median :433.0        Median :463.0    
##  Mean   :1.119       Mean   :419.5        Mean   :458.6    
##  3rd Qu.:1.000       3rd Qu.:490.0        3rd Qu.:526.0    
##  Max.   :3.000       Max.   :796.0        Max.   :961.0    
##  NA's   :530         NA's   :530          NA's   :530
# Daily sleep by category ----
daily_data %>%
  select(
    total_asleep_merged,
    total_restless_merged,
    total_awake_merged
  ) %>%
  summary()
##  total_asleep_merged total_restless_merged total_awake_merged
##  Min.   : 35.0       Min.   :  0.00        Min.   :  0.000   
##  1st Qu.:353.0       1st Qu.: 13.00        1st Qu.:  1.000   
##  Median :420.0       Median : 21.00        Median :  3.000   
##  Mean   :413.2       Mean   : 32.76        Mean   :  4.208   
##  3rd Qu.:485.0       3rd Qu.: 32.00        3rd Qu.:  5.000   
##  Max.   :778.0       Max.   :227.00        Max.   :110.000   
##  NA's   :530         NA's   :530           NA's   :530

hourly_data

# Hourly calories, intensities, step totals ----
hourly_data %>%
  select(
    calories,
    total_intensity,
    average_intensity,
    step_total
  ) %>%
  summary()
##     calories      total_intensity  average_intensity   step_total     
##  Min.   : 42.00   Min.   :  0.00   Min.   :0.0000    Min.   :    0.0  
##  1st Qu.: 63.00   1st Qu.:  0.00   1st Qu.:0.0000    1st Qu.:    0.0  
##  Median : 83.00   Median :  3.00   Median :0.0500    Median :   40.0  
##  Mean   : 97.39   Mean   : 12.04   Mean   :0.2006    Mean   :  320.2  
##  3rd Qu.:108.00   3rd Qu.: 16.00   3rd Qu.:0.2667    3rd Qu.:  357.0  
##  Max.   :948.00   Max.   :180.00   Max.   :3.0000    Max.   :10554.0

weekly_data

# Weekly data for original active minute totals ----
# very, moderate, light and sedentary
weekly_data %>%
  ungroup() %>%
  select(
    weekly_very_m,
    weekly_moderate_m,
    weekly_light_m,
    weekly_sedentary_m
  ) %>%
  summary()
##  weekly_very_m   weekly_moderate_m weekly_light_m weekly_sedentary_m
##  Min.   :  0.0   Min.   :  0.00    Min.   :   0   Min.   :  642     
##  1st Qu.:  8.0   1st Qu.: 17.00    1st Qu.: 802   1st Qu.: 4778     
##  Median : 68.5   Median : 65.50    Median :1188   Median : 5754     
##  Mean   :130.1   Mean   : 83.45    Mean   :1182   Mean   : 6064     
##  3rd Qu.:182.5   3rd Qu.:122.00    3rd Qu.:1554   3rd Qu.: 7749     
##  Max.   :775.0   Max.   :549.00    Max.   :2590   Max.   :10080
# Weekly data for new combo active minute totals ----
# moderate/very and light/sedentary
weekly_data %>%
  ungroup() %>%
  select(
    weekly_mod_very_m,
    weekly_light_sed_m
  ) %>%
  summary()
##  weekly_mod_very_m weekly_light_sed_m
##  Min.   :   0.00   Min.   : 1014     
##  1st Qu.:  36.25   1st Qu.: 6029     
##  Median : 182.00   Median : 6956     
##  Mean   : 213.51   Mean   : 7246     
##  3rd Qu.: 303.00   3rd Qu.: 9543     
##  Max.   :1058.00   Max.   :10080
# Weekly data for total steps and calories ----
weekly_data %>%
  ungroup() %>%
  select(
    weekly_total_steps,
    weekly_calories
  ) %>%
  summary()
##  weekly_total_steps weekly_calories
##  Min.   :     0     Min.   : 1237  
##  1st Qu.: 30671     1st Qu.:10607  
##  Median : 44903     Median :13615  
##  Mean   : 46857     Mean   :14130  
##  3rd Qu.: 63219     3rd Qu.:17529  
##  Max.   :116758     Max.   :26101

New things to note

Wow! A few new fun things we’ve learned from these summary statistics:

  • 46,857 steps - average total steps per week, which falls below a weekly goal of 70,000 steps (10,000 steps per day x 7 days).
  • 6,064 min. - weekly average of sedentary minutes, which is the highest average of the tracked activity minutes. This is just over 4.2 days per week!
  • 1,182 min. - weekly average of light-intensity (Lightly Active) minutes, the highest among the non-sedentary activity minutes per week.
  • 182.00 min. - weekly median of combined moderate- and vigorous-intensity activity minutes.
  • 65.50 min. - weekly median of moderate-intensity (Moderately Active) activity minutes.
  • 68.5 min. - weekly median of vigorous-intensity (Very Active) activity minutes. It’s surprising that this median is higher than the moderate-intensity median!
  • 91.5% - approximate share of time in bed spent asleep, calculated by dividing the mean “Total Minutes Asleep” (419.5) by the mean “Total Time In Bed” (458.6); see the quick check after this list.
  • 83.00 - median calories burned per hour.
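
A quick arithmetic check of the derived figures above, using the values from the summary tables (a minimal sketch, not part of the original pipeline):

# Quick checks of the derived figures ----
419.5 / 458.6        # share of time in bed spent asleep, ~0.915
6064 / 1440          # weekly sedentary minutes expressed in days, ~4.2
10000 * 7 - 46857    # gap between a 70,000-step weekly goal and the weekly mean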

New exploration

Hour of the day

Let’s look at some possible relationships between the hour of the day and different variables. We’ll create labels_hour_of_day to use for these plots:

# Create labels for hours of the day ----
labels_hour_of_day <- sprintf("%02d:00", 0:24)

Let’s examine which hours of the day accumulate the highest totals for each variable and identify the periods of peak activity:

# step_total observations per hour ----
ggplot(data = hourly_data, aes(x = as.factor(hour_of_day))) +
  geom_bar(aes(weight = step_total), stat = "count") +
  labs(title = "'step_total' observations throughout a 24 hour day") +
  scale_x_discrete(
    "Hour of the day", 
    labels = labels_hour_of_day,
    guide = guide_axis(angle = 45)
    ) +
  scale_y_continuous(name = "Count", labels = scales::comma) +
  theme_minimal()

# calories observations per hour ----
ggplot(data = hourly_data, aes(x = as.factor(hour_of_day))) +
  geom_bar(aes(weight = calories), stat = "count") +
  labs(title = "'calories' observations throughout a 24 hour day") +
  scale_x_discrete(
    "Hour of the day",
    labels = labels_hour_of_day,
    guide = guide_axis(angle = 45)
    ) +
  scale_y_continuous(name = "Count", labels = scales::comma) +
  theme_minimal()

# average_intensity observations per hour ----
ggplot(data = hourly_data, aes(x = as.factor(hour_of_day))) +
  geom_bar(aes(weight = average_intensity), stat = "count") +
  labs(title = "'average_intensity' observations throughout a 24 hour day") +
  scale_x_discrete(
    "Hour of the day", 
    labels = labels_hour_of_day,
    guide = guide_axis(angle = 45)
    ) +
  scale_y_continuous(name = "Count", labels = scales::comma) +
  theme_minimal()

The distributions of all three variables (step_total, calories, and average_intensity) follow similar patterns. The lowest totals occur consistently between 12:00 AM and 4:00 AM, reflecting typical sleeping hours. The two time windows with the highest totals are also consistent across all three charts (a quick numeric cross-check follows the list):

  • 12:00 - 2:00 PM (12:00 - 14:00)
  • 5:00 - 7:00 PM (17:00 - 19:00)
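
As a numeric cross-check on these peak windows, a minimal sketch that sums each variable by hour and lists the top hours (using the same hourly_data columns as the plots above):

# Top hours by summed totals ----
hourly_data %>%
  group_by(hour_of_day) %>%
  summarise(
    steps     = sum(step_total),
    calories  = sum(calories),
    intensity = sum(average_intensity)
  ) %>%
  arrange(desc(steps)) %>%
  slice_head(n = 5)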

Average intensity vs. time

How does average intensity vary over time? Are activities typically more intense in the morning or the evening? Let’s compare this with the patterns observed in the plots above:

# Average intensity throughout the day ----
average_intensity_p <- 
  ggplot(
    data = hourly_data, 
    aes(x = time_list, y = average_intensity, color = average_intensity)
         )
average_intensity_p +
  geom_jitter(width = 0.3, size = 0.2) +
  labs(
    x = "Hour", 
    y = "Average intensity",
    color = "Average Intensity",
    title = "Average intensity throughout the day"
  ) + 
  stat_summary(
    aes(x = time_list, y = average_intensity),
    fun = mean, 
    geom = "point", 
    color = "white", 
    size = 1,
    alpha = 0.7
    ) +
  scale_color_viridis_c() +
  dark_mode(theme_igray()) +
  theme(
    axis.title.x = element_text(
      margin = margin(5, 0, 5, 0),
      size = 12
    ),
    axis.title.y = element_text(
      margin = margin(0, 5, 0, 5),
      size = 12
    ),
    plot.title = element_text(
      margin = margin(10, 0, 10, 0),
      size = 14
    ),
    axis.text.x = element_text(
      angle = 45,
      vjust = 1,
      hjust = 1,
      size = 8,
      color = "lightgrey"
    )
  )
## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().

As anticipated, average intensity starts to increase around 5:00 AM, stays relatively high through the workday, and begins to decline after about 8:00 PM. This trend is illustrated by the hourly means, shown as white dots in each column. The peak hours appear to align with the peak windows observed earlier:

  • 12:00 - 2:00 PM (12:00 - 14:00)
  • 5:00 - 7:00 PM (17:00 - 19:00)
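
To put numbers on the white dots (the hourly means), here’s a minimal sketch that ranks the hours by mean intensity:

# Hours ranked by mean average_intensity ----
hourly_data %>%
  group_by(hour_of_day) %>%
  summarise(mean_intensity = mean(average_intensity)) %>%
  arrange(desc(mean_intensity)) %>%
  slice_head(n = 5)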

Let’s look at day-to-day:

# Average intensity throughout the week ----
average_intensity_p <- 
  ggplot(
    data = hourly_data,
    aes(
      x = reorder(day_list, desc(day_list)),
      y = average_intensity, 
      color = average_intensity
      )
    )
average_intensity_p +
  geom_jitter(width = 0.3, size = 0.1) +
  labs(
    x = "Day", 
    y = "Average intensity",
    color = "Average Intensity",
    title = "Average intensity throughout the week"
  ) + 
  stat_summary(
    aes(
      x = reorder(day_list, desc(day_list)),
      y = average_intensity
      ),
    fun = mean, 
    geom = "point", 
    color = "white", 
    size = 1,
    alpha = 0.7
    ) +
  scale_color_viridis_c() +
  dark_mode(theme_igray()) +
  theme(
    axis.title.x = element_text(
      margin = margin(5, 0, 5, 0),
      size = 12
    ),
    axis.title.y = element_text(
      margin = margin(0, 5, 0, 5),
      size = 12
    ),
    plot.title = element_text(
      margin = margin(10, 0, 10, 0),
      size = 14
    ),
    axis.text.x = element_text(
      size = 12,
      color = "lightgrey"
    ),
    axis.text.y = element_text(
      color = "lightgrey"
    )
  ) +
  coord_flip()

Unlike the consistent hour-by-hour pattern, there isn’t a clear day-of-week trend. This is again visible in the mean for each day, shown as white dots: their positions vary only slightly.
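
For a quick numeric view of those daily means, a minimal sketch using the same day_list column as the plot:

# Mean average_intensity per day of the week ----
hourly_data %>%
  group_by(day_list) %>%
  summarise(mean_intensity = mean(average_intensity))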

Let’s return to intensity by hour of the day and check whether the trends differ between weekdays and the weekend:

# average_intensity throughout a 24 hour day ----
# Drop part_of_week so these points repeat as a grey backdrop in every facet
hd_A <- dplyr::select(hourly_data, -part_of_week)
ggplot(
  data = hourly_data,
  aes(x = as.factor(hour_of_day), y = average_intensity)
  ) +
  geom_jitter(data = hd_A, color = "grey75", width = 0.4, alpha = 0.6, size = 2) +
  geom_jitter(aes(color = part_of_week), width = 0.4, alpha = 0.6, size = 2) +
  scale_color_viridis_d(begin = 0, end = .5) +
  labs(
    title = "Average intensity throughout a 24 hour day") +
  ylab("Average intensity") +
  scale_x_discrete(
    "Hour of the day",
    expand = c(0.0002, 0.0002),
    labels = labels_hour_of_day,
    guide = guide_axis(angle = 45)
    ) +
  stat_summary(
    aes(x = as.factor(hour_of_day), y = average_intensity),
    fun = mean, 
    geom = "point", 
    color = "gold1", 
    size = 4
    ) +
  theme_minimal() + 
  theme(
    legend.position = "none",
    axis.title.x = element_text(
      margin = margin(15, 0, 10, 0),
      size = 16
    ),
    axis.title.y = element_text(
      margin = margin(0, 10, 0, 10),
      size = 16
    ),
    plot.title = element_text(
      margin = margin(10, 0, 10, 0),
      size = 20
    ),
    strip.text.x = element_text(
      margin = margin(10, 0, 10, 0),
      size = 14
    ), 
    axis.text.x = element_text(
      size = 12
    ),
    axis.text.y = element_text(
      size = 12
    )
  ) +
  facet_wrap(vars(part_of_week), nrow = 2)

# total_intensity throughout a 24 hour day ----
# The jitter seed is set inside position_jitter() below, so no set.seed() call
# or separate jitter position object is needed.
hd_total_intensity_p <- 
  ggplot(
    data = hourly_data,
    aes(x = as.factor(hour_of_day), y = total_intensity, color = part_of_week)
    )
hd_total_intensity_p +
  geom_point(
    data = transform(hourly_data, part_of_week = NULL),
    color = "grey75",
    position = position_jitter(width = 0.4, height = 0, seed = 42),
    size = 2
  ) +
  geom_point(
    position = position_jitter(width = 0.4, height = 0, seed = 42),
    size = 2,
    alpha = 0.6
  ) +
  labs(
    title = "Total intensity throughout a 24 hour day") +
  ylab("Total intensity") +
  scale_color_viridis_d(begin = 0, end = .5, guide = "legend") +
  scale_x_discrete(
    "Hour of the day",
    expand = c(0.0002, 0.0002),
    labels = labels_hour_of_day,
    guide = guide_axis(angle = 45)
    ) +
  stat_summary(
    aes(x = as.factor(hour_of_day), y = total_intensity),
    fun = mean,
    geom = "point", 
    color = "gold1",
    size = 4
    ) +
  theme_minimal() + 
  theme(
    legend.position = "none",
    axis.title.x = element_text(
      margin = margin(15, 0, 10, 0),
      size = 16
    ),
    axis.title.y = element_text(
      margin = margin(0, 10, 0, 10),
      size = 16
    ),
    plot.title = element_text(
      margin = margin(10, 0, 10, 0),
      size = 20
    ),
    strip.text.x = element_text(
      margin = margin(10, 0, 10, 0),
      size = 14
    ),
    axis.text.x = element_text(
      size = 12
    ),
    axis.text.y = element_text(
      size = 12
    )
  ) +
  facet_wrap(vars(part_of_week), nrow = 2)

rm("jitter_pos")

Active vs. sedentary

We’re now comparing active minutes and sedentary minutes throughout the week:

# Create theme for plot formatting ----
theme_margin <- 
  theme(
    axis.title.x = element_text(margin = margin(15, 0, 15, 0)),
    axis.title.y = element_text(margin = margin(0, 10, 0, 10)),
    plot.title = element_text(margin = margin(10, 0, 10, 0)),
    strip.text.x = element_text(margin = margin(10, 0, 10, 0))
    )
# Very active minutes vs sedentary minutes ----
# Copy without part_of_week so these points repeat as a grey backdrop in each facet
dd_A <- dplyr::select(daily_data, -part_of_week)
ggplot(daily_data, aes(x = very_active_m, y = sedentary_m)) +
  geom_jitter(data = dd_A, color = "grey75") +
  geom_jitter(
    aes(x = very_active_m, y = sedentary_m),
    color = "#002040", 
    width = 0.4, 
    size = 1
    ) +
  labs(title = "Very active minutes vs. sedentary minutes throughout the week") +
  ylab("Sedentary minutes") +
  xlab("Very active minutes") +
  geom_smooth(
    aes(x = very_active_m, y = sedentary_m),
    color = "#6EE5FF", 
    linetype = "longdash", 
    alpha = 0.4,
    se = FALSE
    ) +
  theme_minimal() + 
  theme_margin +
  facet_grid(part_of_week ~ .)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Moderately active minutes vs sedentary minutes ----
dd_A <- dplyr::select(daily_data, -part_of_week)
ggplot(daily_data, aes(x = moderately_active_m, y = sedentary_m)) +
  geom_jitter(data = dd_A, color = "grey75") +
  geom_jitter(
    aes(x = moderately_active_m, y = sedentary_m),
    color = "#002040", 
    width = 0.4, 
    size = 1
    ) +
  labs(
    title = "Moderately active minutes vs. sedentary minutes throughout the week"
    ) +
  ylab("Sedentary minutes") +
  xlab("Moderately active minutes") +
  geom_smooth(
    aes(x = moderately_active_m, y = sedentary_m),
    color = "#6EE5FF", 
    linetype = "longdash", 
    alpha = 0.6,
    se = FALSE
    ) +
  theme_minimal() + 
  theme_margin +
  facet_grid(part_of_week ~ .)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Lightly active minutes vs sedentary minutes ----
dd_A <- dplyr::select(daily_data, -part_of_week)
ggplot(daily_data, aes(x = lightly_active_m, y = sedentary_m)) +
  geom_jitter(data = dd_A, color = "grey75") +
  geom_jitter(
    aes(x = lightly_active_m, y = sedentary_m),
    color = "#002040", 
    width = 0.4, 
    size = 1
    ) +
  labs(
    title = "Lightly active minutes vs. sedentary minutes throughout the week"
    ) +
  ylab("Sedentary minutes") +
  xlab("Lightly active minutes") +
  geom_smooth(
    aes(x = lightly_active_m, y = sedentary_m),
    color = "#6EE5FF", 
    linetype = "longdash", 
    alpha = 0.6,
    se = FALSE
    ) +
  theme_minimal() + 
  theme_margin +
  facet_grid(part_of_week ~ .)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
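
As a rough numerical companion to these three scatter plots, here’s a minimal sketch of the linear correlation between each activity level and sedentary minutes; this is a simpler summary than the loess curves and not part of the original analysis:

# Correlation of each active-minute level with sedentary minutes ----
daily_data %>%
  ungroup() %>%
  summarise(
    very_vs_sedentary     = cor(very_active_m, sedentary_m),
    moderate_vs_sedentary = cor(moderately_active_m, sedentary_m),
    light_vs_sedentary    = cor(lightly_active_m, sedentary_m)
  )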

Active minutes

A key health metric

Of all the metrics we’ve explored thus far, active minutes will be prioritized in the next steps. We’ll use this section to generate our supporting visualizations and key insights, then close with content recommendations for our key stakeholders based on the analysis.

SHARE

Stakeholder presentation

This is my presentation:

In it, you can find:

  • A summary of my analysis

  • Supporting visualizations and key findings

  • Marketing strategy recommendations for the Bellabeat App

Thank you for making it to the end.

Session info

sessionInfo()
## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: Asia/Bangkok
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] plotrix_3.8-4   sqldf_0.4-11    RSQLite_2.3.5   gsubfn_0.7     
##  [5] proto_1.0.0     crayon_1.5.2    paletteer_1.6.0 patchwork_1.2.0
##  [9] gridExtra_2.3   ggforce_0.4.2   ggdark_0.2.1    ggthemes_5.1.0 
## [13] janitor_2.2.0   hms_1.1.3       skimr_2.1.5     knitr_1.45     
## [17] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
## [21] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
## [25] ggplot2_3.5.0   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.4      xfun_0.42         bslib_0.6.1       lattice_0.21-9   
##  [5] tzdb_0.4.0        vctrs_0.6.5       tools_4.3.2       generics_0.1.3   
##  [9] parallel_4.3.2    fansi_1.0.6       highr_0.10        chron_2.3-61     
## [13] blob_1.2.4        pkgconfig_2.0.3   Matrix_1.6-1.1    lifecycle_1.0.4  
## [17] compiler_4.3.2    farver_2.1.1      munsell_0.5.0     repr_1.1.6       
## [21] snakecase_0.11.1  htmltools_0.5.7   sass_0.4.8        yaml_2.3.8       
## [25] pillar_1.9.0      jquerylib_0.1.4   MASS_7.3-60       cachem_1.0.8     
## [29] nlme_3.1-163      tidyselect_1.2.0  digest_0.6.34     stringi_1.8.3    
## [33] rematch2_2.1.2    labeling_0.4.3    splines_4.3.2     polyclip_1.10-6  
## [37] fastmap_1.1.1     grid_4.3.2        colorspace_2.1-0  cli_3.6.2        
## [41] magrittr_2.0.3    base64enc_0.1-3   utf8_1.2.4        withr_3.0.0      
## [45] scales_1.3.0      bit64_4.0.5       timechange_0.3.0  rmarkdown_2.25   
## [49] bit_4.0.5         memoise_2.0.1     evaluate_0.23     tcltk_4.3.2      
## [53] viridisLite_0.4.2 mgcv_1.9-0        rlang_1.1.3       Rcpp_1.0.12      
## [57] DBI_1.2.2         glue_1.7.0        tweenr_2.0.2      vroom_1.6.5      
## [61] rstudioapi_0.15.0 jsonlite_1.8.8    R6_2.5.1