Introduction

This paper examines the Weight Lifting Exercise Dataset and aims to build a model that can accurately predict the manner in which exercises were performed (as represented by the classe variable). The dataset contains measurements of participants' movements as they performed an exercise, captured by sensors mounted at different points: on the belt, upper arm and forearm, and on the dumbbell itself. Using these measurements, a range of models was constructed to predict the classe of each exercise, with the best-performing model being a Stochastic Gradient Boosting Machine (gbm). This model had an estimated out-of-sample error of just 0.35%.

1. Understanding the Data

The first step undertaken was understanding the data. From the documentation, the following important information about the dataset was noted:

  1. Participants performed sets of dumbbell bicep curls, both correctly and while making a number of common technique mistakes.
  2. The manner in which each repetition was performed is recorded in the classe field, which has five levels (A to E), with A representing correct technique.
  3. Measurements were captured by sensors mounted on the belt, upper arm, forearm and dumbbell.
  4. Rows with new_window == "yes" are summary rows; the aggregate columns they contain are blank or NA on all other rows.

2. Loading in the Data

The next step in the process was loading in the training data, which was read directly from the course URL. Immediately after loading, the data was split into training and validation datasets.

require(caret, quietly = TRUE)
rawTraining <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")

# Hold out 25% of the rows as a validation set for estimating out-of-sample error
set.seed(1000)
inTrain <- createDataPartition(rawTraining$classe, p = 0.75)[[1]]
training <- rawTraining[ inTrain,]
validation <- rawTraining[-inTrain,]
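As a quick sanity check, the dimensions of the two datasets can be inspected (the row counts noted below are those reported later in this report by caret and confusionMatrix):

dim(training)    # 14718 rows
dim(validation)  # 4904 rows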

3. Exploring the Data

In step three, exploration of the dataset was undertaken. This exploration had four goals:

  1. Identify any load-related errors
  2. Search for fields/columns with missing values, NAs, blanks and so on
  3. Understand the structure of the dataset - are the fields numeric/categorical/mixed
  4. Observe any identifiable patterns or correlations within the dataset

The key finding of this initial assessment was that a large number of columns were populated only on the summary rows (new_window == "yes") and consisted entirely of NAs or blanks elsewhere.

To assist with further understanding the dataset, the fields containing mostly NAs and blanks were removed, as well as X (the ID field), user_name, raw_timestamp_part_1, raw_timestamp_part_2, num_window and new_window. Finally, the remaining timestamp field cvtd_timestamp was converted into an hour of the day field for use in model building.

# Remove fields with mostly NAs and blanks. The summary rows
# (new_window == "yes") are excluded first so that the aggregate-only
# columns show up as entirely NA or entirely blank.
training.no <- training[training$new_window == "no",]
NonNACols <- training.no[,!as.vector(apply(is.na(training.no), 2, sum) == nrow(training.no))]
ColsWithValues <- NonNACols[,as.vector(apply(NonNACols=="", 2, sum) < nrow(NonNACols))]
cleanTrain <- training[,colnames(ColsWithValues)]

# Remove unneeded fields: X (the ID field), user_name, raw_timestamp_part_1,
# raw_timestamp_part_2, new_window and num_window (columns 1-4, 6 and 7)
cleanTrain <- cleanTrain[,-c(1, 2, 3, 4, 6, 7)]

# Convert Timestamp into an Hour of Day Field
cleanTrain$cvtd_timestamp <- as.numeric(substr(cleanTrain$cvtd_timestamp, 12, 13))

With the cleaned training dataset (cleanTrain), further exploration was then conducted. The main focus of this exploration was to identify whether any of the numerical fields were highly correlated. This was done as follows:

th <- 0.8 # Absolute threshold for "highly correlated"
correlations <- cor(cleanTrain[,-54])
results <- data.frame(var1 = as.character(), var2 = as.character())

# Walk the lower triangle of the correlation matrix, recording every pair
# whose absolute correlation meets the threshold. Note the parentheses in
# the loop range: without them, 1:ncol(correlations)-1 would run from 0.
for (i in 1:(ncol(correlations) - 1)) {
  startRow <- i + 1
  absCorValues <- as.vector(abs(correlations[c(startRow:nrow(correlations)),i]))
  rawCorValues <- as.vector(correlations[c(startRow:nrow(correlations)),i])
  if (sum(absCorValues >= th) > 0) {
    var1 <- colnames(correlations)[i]
    var2 <- rownames(correlations)[c(rep(FALSE, i), as.vector(absCorValues >= th))]
    corr <- rawCorValues[absCorValues >= th]
    results <- rbind(results, data.frame(var1 = c(rep(var1, length(var2))), var2, corr))
  }
}
results
##                var1             var2       corr
## 1         roll_belt         yaw_belt  0.8145085
## 2         roll_belt total_accel_belt  0.9808390
## 3         roll_belt     accel_belt_y  0.9237447
## 4         roll_belt     accel_belt_z -0.9920088
## 5        pitch_belt     accel_belt_x -0.9659967
## 6        pitch_belt    magnet_belt_x -0.8828401
## 7  total_accel_belt     accel_belt_y  0.9272471
## 8  total_accel_belt     accel_belt_z -0.9748414
## 9      accel_belt_x    magnet_belt_x  0.8914844
## 10     accel_belt_y     accel_belt_z -0.9323955
## 11      gyros_arm_x      gyros_arm_y -0.9184159
## 12      accel_arm_x     magnet_arm_x  0.8134976
## 13     magnet_arm_y     magnet_arm_z  0.8125801
## 14   pitch_dumbbell accel_dumbbell_x  0.8063532
## 15     yaw_dumbbell accel_dumbbell_z  0.8488290
## 16 gyros_dumbbell_x gyros_dumbbell_z -0.9840649
## 17 gyros_dumbbell_x  gyros_forearm_z -0.9345919
## 18 gyros_dumbbell_z  gyros_forearm_z  0.9489422
## 19  gyros_forearm_y  gyros_forearm_z  0.8693196

As the results above show, there are some very strong correlations: 19 pairs with absolute correlation greater than 0.8. The roll_belt field and the accel_belt fields in particular are highly correlated with a range of other variables. More generally, high correlation between movements of the same body part along different axes was observed. Intuitively this makes sense, as movements of those body parts are likely to occur along multiple axes rather than purely along one.
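
As an aside, caret provides findCorrelation(), which searches the same correlation matrix and suggests columns to remove; a minimal sketch using the same 0.8 cutoff:

# Sketch: caret's built-in search over the same correlation matrix
highCorCols <- findCorrelation(correlations, cutoff = th)
colnames(correlations)[highCorCols] # candidate columns to drop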

4. Data Pre-Processing

The significant number of highly correlated fields suggested that principal component analysis (PCA) might summarize the data effectively and reduce its dimensionality. To assess this, the training dataset was pre-processed using the center, scale and pca methods. An original version of the training dataset was also maintained for comparison purposes.

pcaPreProc <- preProcess(cleanTrain[,-54], method = c('center', 'scale', 'pca'), thresh = 0.9)
pcaTrain <- predict(pcaPreProc, cleanTrain[,-54])
pcaTrain <- data.frame(pcaTrain, classe = cleanTrain[,54])
pcaPreProc
## 
## Call:
## preProcess.default(x = cleanTrain[, -54], method = c("center",
##  "scale", "pca"), thresh = 0.9)
## 
## Created from 14718 samples and 53 variables
## Pre-processing: centered, scaled, principal component signal extraction 
## 
## PCA needed 19 components to capture 90 percent of the variance

5. Model Selection

In order to identify the best model, three different model types were tested: Random Forest (method = "rf"), Stochastic Gradient Boosting Machine (method = "gbm") and Penalized Multinomial Regression (method = "multinom"). All models were tested with both the PCA pre-processed data (pcaTrain) and the cleaned but otherwise unprocessed data (cleanTrain). In all cases, the accuracy of each model was estimated using 10-fold cross-validation.

CV10 <- trainControl(method = "cv"
                    , number = 10)

Due to the length of time needed to run the models, they have not all been included here. However, comparing the cross-validated accuracy of each, the Stochastic Gradient Boosting Machine (gbm) model using the unprocessed data produced the best results; its fit is shown below. It should also be noted that the Random Forest model (again using the unprocessed data) achieved very high accuracy, but was narrowly bettered by the gbm model.
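
For completeness, the remaining candidate fits took the same form as the gbm call below. The following is a sketch of those calls (the object names are illustrative, and these long-running fits are not evaluated in this report):

# Sketch of the comparison models (long-running; not evaluated here)
RFModelFit.raw <- train(classe ~ ., method = "rf", trControl = CV10, data = cleanTrain)
RFModelFit.pca <- train(classe ~ ., method = "rf", trControl = CV10, data = pcaTrain)
MNModelFit.raw <- train(classe ~ ., method = "multinom", trControl = CV10,
                        data = cleanTrain, trace = FALSE)
MNModelFit.pca <- train(classe ~ ., method = "multinom", trControl = CV10,
                        data = pcaTrain, trace = FALSE)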

Stochastic Gradient Boosting Machine Model

require(survival, quietly = TRUE)
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:caret':
## 
##     cluster
require(gbm, quietly = TRUE)
## Loaded gbm 2.1.1
require(plyr, quietly = TRUE)
GBMParamGrid <- expand.grid(interaction.depth = c(2, 5, 10)
                            , n.trees = c(100, 200, 500)
                            , shrinkage = 0.1
                            , n.minobsinnode = 10)

GBMModelFit <- train(classe ~ .
                     , method = "gbm"
                     , trControl = CV10
                     , tuneGrid = GBMParamGrid
                     , data = cleanTrain
                     , verbose = FALSE #gbm prints a lot of output if TRUE
)
GBMModelFit
## Stochastic Gradient Boosting 
## 
## 14718 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 13246, 13245, 13248, 13246, 13247, 13247, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD
##    2                 100      0.9092270  0.8851330  0.005966073  0.007542928
##    2                 200      0.9478189  0.9339733  0.004198312  0.005299449
##    2                 500      0.9812470  0.9762780  0.003224092  0.004080115
##    5                 100      0.9712597  0.9636384  0.002719773  0.003444003
##    5                 200      0.9862074  0.9825538  0.002680970  0.003391578
##    5                 500      0.9933410  0.9915769  0.002638823  0.003338232
##   10                 100      0.9875662  0.9842726  0.003202808  0.004051507
##   10                 200      0.9932733  0.9914912  0.003162053  0.003999785
##   10                 500      0.9951075  0.9938113  0.002695908  0.003410208
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 500,
##  interaction.depth = 10, shrinkage = 0.1 and n.minobsinnode = 10.

As can be seen from the results above, the best gbm model (500 trees, interaction depth 10) had a cross-validated accuracy of 99.51%. This corresponds to an error rate of just 0.49%.
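
The winning parameter combination can also be extracted programmatically from the fitted caret object:

# The parameter combination selected by cross-validated accuracy
GBMModelFit$bestTune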

6. Assess Variable Importance

To provide a quick sanity check of the model results, the variables used in the model were assessed for their relative importance to the accuracy of the model.

varImp(GBMModelFit)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 53)
## 
##                   Overall
## roll_belt         100.000
## yaw_belt           61.547
## pitch_forearm      51.701
## magnet_dumbbell_z  45.327
## magnet_dumbbell_y  32.922
## pitch_belt         29.220
## roll_forearm       27.948
## magnet_belt_z      23.647
## gyros_belt_z       16.392
## accel_dumbbell_y   16.047
## roll_dumbbell      15.268
## accel_forearm_z    13.146
## yaw_arm            12.942
## accel_forearm_x    11.162
## magnet_forearm_z   10.922
## accel_dumbbell_x   10.219
## accel_dumbbell_z   10.134
## magnet_forearm_x   10.072
## magnet_dumbbell_x   8.117
## magnet_belt_x       7.401

As can be seen from the results above, the two variables roll_belt and yaw_belt were by some margin the most important variables in the model. Another interesting observation is that readings for the dumbbell along both the y (vertical) and z (forward/backward) axes, but not the x (side-to-side) axis, provided the model with important information about the technique used. Given the exercise being performed (bicep curls) and the variations participants were asked to perform, this makes sense intuitively, as none of the variations would require lateral movement of the dumbbell.

7. Estimate Out-of-Sample Error

After determining the best model and training data to use, the gbm model was then applied to the held-out validation dataset to estimate the out-of-sample error. The first step in this process was to repeat the cleaning undertaken on the training dataset.

# Remove fields with mostly NAs and Blanks
validation.no <- validation[validation$new_window == "no",]
NonNACols.val <- validation.no[,!as.vector(apply(is.na(validation.no), 2, sum) == nrow(validation.no))]
ColsWithValues.val <- NonNACols.val[,as.vector(apply(NonNACols.val=="", 2, sum) < nrow(NonNACols.val))]
cleanValidation <- validation[,colnames(ColsWithValues.val)]

# Remove unneeded fields
cleanValidation <- cleanValidation[,-c(1, 2, 3, 4, 6, 7)]

# Convert Timestamp into an Hour of Day Field
cleanValidation$cvtd_timestamp <- as.numeric(substr(cleanValidation$cvtd_timestamp, 12, 13))
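
Note that this re-derives the column selection from the validation rows themselves. Because both datasets were split from the same raw file the result is identical here, but a more defensive alternative is to reuse the columns already selected from the training data; a sketch (assuming the objects from section 3 are still in scope):

# Sketch: reuse the training-set column selection to guarantee identical features
cleanValidation <- validation[, colnames(ColsWithValues)]
cleanValidation <- cleanValidation[,-c(1, 2, 3, 4, 6, 7)]
cleanValidation$cvtd_timestamp <- as.numeric(substr(cleanValidation$cvtd_timestamp, 12, 13))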

Once the validation data had been cleaned, the best gbm model was applied to it.

GBMPredictions <- predict(GBMModelFit, newdata=cleanValidation)
confusionMatrix(GBMPredictions, cleanValidation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1394    1    0    0    0
##          B    1  947    3    0    0
##          C    0    1  847    0    0
##          D    0    0    5  801    3
##          E    0    0    0    3  898
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9965         
##                  95% CI : (0.9945, 0.998)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9956         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9993   0.9979   0.9906   0.9963   0.9967
## Specificity            0.9997   0.9990   0.9998   0.9980   0.9993
## Pos Pred Value         0.9993   0.9958   0.9988   0.9901   0.9967
## Neg Pred Value         0.9997   0.9995   0.9980   0.9993   0.9993
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2843   0.1931   0.1727   0.1633   0.1831
## Detection Prevalence   0.2845   0.1939   0.1729   0.1650   0.1837
## Balanced Accuracy      0.9995   0.9984   0.9952   0.9972   0.9980

The resulting output (shown above) indicated an accuracy of 99.65% on the validation dataset, i.e. an estimated out-of-sample error of 0.35%. In fact, of the 4,904 rows in the validation dataset, only 17 were incorrectly predicted. This provided a high level of confidence that the model would accurately predict the classe for the 20 records in the test dataset.
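
Equivalently, the estimated out-of-sample error can be computed directly from the predictions:

# Proportion of misclassified validation rows (estimated out-of-sample error)
mean(GBMPredictions != cleanValidation$classe)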

8. Predicting on the Testing Dataset

The final step was to predict the classe for the 20 records in the test dataset. Again, the test data was first loaded and subjected to the same cleaning as the training dataset.

rawTest <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")

# Remove fields with mostly NAs and Blanks
test.no <- rawTest[rawTest$new_window == "no",]
NonNACols.test <- test.no[,!as.vector(apply(is.na(test.no), 2, sum) == nrow(test.no))]
ColsWithValues.test <- NonNACols.test[,as.vector(apply(NonNACols.test=="", 2, sum) < nrow(NonNACols.test))]
cleanTest <- rawTest[,colnames(ColsWithValues.test)]

# Remove unneeded fields
cleanTest <- cleanTest[,-c(1, 2, 3, 4, 6, 7)]

# Convert Timestamp into an Hour of Day Field
cleanTest$cvtd_timestamp <- as.numeric(substr(cleanTest$cvtd_timestamp, 12, 13))

The final step was then to make the predictions.

testPredictions <- predict(GBMModelFit, newdata=cleanTest)
testPredictions
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E