---
output:
pdf_document: default
html_document: default
---
# Cervical Cancer Modeling - Data Analytics Assignment 6
```{r, include=FALSE}
# Install any missing packages, then load them
pkgs <- c("caret", "corrplot", "Metrics", "tidyverse", "e1071", "xgboost", "factoextra", "pROC")
for (pkg in pkgs) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}
# Ensure select() refers to dplyr's version despite masking by other packages
select <- dplyr::select
```
## EDA
I chose the cervical cancer behavior risk dataset from UCI, which contains data on potential behavioral risk factors for cervical cancer along with a binary column, ca_cervix, indicating whether the person has cervical cancer.
```{r}
df <- read_csv("~/DataStore-DataAnalytics/sobar-72.csv")
head(df)
```
No NAs!
```{r}
dim(na.omit(df))
```
```{r, include=FALSE}
summary(df)
```
```{r}
df %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(fill = "red", bins = 15) +
  facet_wrap(~Variable, scales = "free") +
  theme_minimal() +
  ggtitle("Feature Distributions - Cervical Cancer Df")
```
```{r}
# Standardize all features except the outcome; as.numeric() drops the matrix
# structure that scale() returns, keeping downstream tidyverse/cor() calls simple
scaledDf <- df %>%
  mutate(across(-ca_cervix, ~ as.numeric(scale(.x))))
summary(scaledDf)
```
```{r}
scaledDf %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(fill = "red", bins = 15) +
  facet_wrap(~Variable, scales = "free") +
  theme_minimal() +
  ggtitle("Feature Distributions - Cervical Cancer Df (Scaled)")
```
```{r}
corMatrix <- cor(scaledDf)
corrplot(corMatrix, method = "circle", type = "lower", tl.cex = 0.7, number.cex = 0.8, tl.srt = 45, insig = "blank")
```
In this data processing and exploratory analysis, I made a number of alterations to the dataset tailored to its structure and nature. Thankfully, the data was very clean and had no NAs, so there was no need for imputation, interpolation, or NA removal, and the complete dataset was retained.
Since the dataset is so small (72 rows), IQR-based outlier removal was not worthwhile: it would shrink the data even further, and scaling should make the data usable for ML modeling approaches. I did experiment with IQR removal of outliers, but it dropped 13 rows (~18%) of the data.
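For reference, here is a minimal sketch of the IQR filter I experimented with (a 1.5×IQR rule applied per feature); it is not applied to the data used in the rest of the analysis:
```{r, eval=FALSE}
# IQR-based outlier filter (illustrative only; dropping it keeps all 72 rows available for modeling)
within_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  x >= q[1] - 1.5 * iqr & x <= q[2] + 1.5 * iqr
}
df_no_outliers <- df %>%
  filter(if_all(-ca_cervix, within_iqr))
dim(df_no_outliers)
```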
I plotted feature distributions across the entire dataset using 15-bin histograms; this bin count was chosen by visual inspection to represent the distributions well. The features do not all appear normally distributed, though this could partly be an artifact of the plot type. Some are bi-modal (most obviously ca_cervix), which is interesting from a modeling perspective, and many are left-skewed. This, along with putting the features on a common scale so no variable dominates, was the main motivation for standardization; the scaled feature histograms appear somewhat better behaved.
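To put a number on the skewness impression, one quick check (using e1071, which is already loaded) is the per-feature skewness of the raw data:
```{r}
# Per-feature skewness; negative values indicate left-skewed distributions
sort(sapply(df %>% select(-ca_cervix), e1071::skewness))
```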
A correlation plot was generated, and encouragingly a fair number of the features have moderate to strong correlations with the dependent variable. Interestingly, the correlations are all negative, suggesting that higher values of these features are associated with a lower likelihood of cervical cancer on average.
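The feature-outcome correlations can also be read directly off the correlation matrix:
```{r}
# Correlation of each feature with ca_cervix, sorted (ca_cervix itself is 1 by definition)
sort(corMatrix[, "ca_cervix"])
```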
This sets up well for a classification problem in which we identify the factors contributing most to cervical cancer, which points me toward a tree-based approach so that feature importances can be generated.
## Modeling, Validation, and Optimization
Based on the fully numeric data and the binary outcome, I think the best approaches will be clustering and classification. To support these, my third model will be PCA, which lets me reduce noise before modeling and generate some interesting plots of the clusters along the highest-variance directions in the data.
### PCA
```{r}
# PCA on the standardized features; keep enough PCs to explain at least 85% of the variance
pca_model <- prcomp(scaledDf %>% select(-ca_cervix))
pca_var_explained <- summary(pca_model)$importance[2, ]
cum_var_explained <- cumsum(pca_var_explained)
num_pcs <- which(cum_var_explained >= 0.85)[1]
pca_df <- as.data.frame(pca_model$x[, 1:num_pcs])
pca_df$ca_cervix <- df$ca_cervix
```
```{r}
cum_var_explained
```
The PCA was relatively successful in capturing the variance in the dataset. It took 9 PCs to reach the 85% cumulative-variance threshold used above, and even just the first 5 explained over 70%. The first two PCs alone hold a large share of the variance, which makes them very useful for plotting alongside k-means later on.
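A scree plot makes this drop-off easy to see; a quick sketch using factoextra (loaded above):
```{r}
# Scree plot of the percentage of variance explained by each PC
fviz_eig(pca_model, addlabels = TRUE)
```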
### Kmeans on PCA-reduced and original data
I first focus on the original data.
```{r}
set.seed(123)
kmeans_model_full <- kmeans(scaledDf %>% select(-ca_cervix), centers = 2, nstart = 25)
scaledDf$cluster <- as.factor(kmeans_model_full$cluster)
ggplot(scaledDf, aes(x = .data[[names(scaledDf)[1]]], y = .data[[names(scaledDf)[2]]],
                     color = cluster)) +
  geom_point(size = 3, alpha = 0.7) +
  theme_minimal() +
  ggtitle("K-means Clustering on Full Data (First Two Features)") +
  labs(color = "Cluster")
```
Plots were only generated for one feature pairing; facet-wrapping all pairwise feature combinations could not be produced in a meaningful way within a reasonable runtime.
```{r}
ggplot(scaledDf, aes(x = .data[[names(scaledDf)[1]]], y = .data[[names(scaledDf)[2]]],
                     color = cluster, shape = as.factor(ca_cervix))) +
  geom_point(size = 3, alpha = 0.7) +
  theme_minimal() +
  ggtitle("K-means Clustering on Full Data (Colored by Cluster, Shaped by ca_cervix)") +
  labs(color = "Cluster", shape = "Cervical Cancer (0 = No, 1 = Yes)")
```
These two plots are relatively uninteresting and do not show great separation, but that was to be expected; they serve mainly as a reference for comparison with the PCA-reduced approach.
Now we proceed with the PCA-reduced data:
```{r}
set.seed(123)
kmeans_model <- kmeans(pca_df[, 1:2], centers = 2, nstart = 25)
ggplot(pca_df, aes(x = PC1, y = PC2, color = as.factor(kmeans_model$cluster))) +
  geom_point(size = 3, alpha = 0.7) +
  theme_minimal() +
  ggtitle("K-means Clustering on PCA-reduced Data") +
  labs(color = "Cluster")
```
```{r}
ggplot(pca_df, aes(x = PC1, y = PC2, color = as.factor(kmeans_model$cluster), shape = as.factor(ca_cervix))) +
  geom_point(size = 3, alpha = 0.7) +
  theme_minimal() +
  ggtitle("K-means Clustering on PCA-reduced Data (Colored by Cluster, Shaped by ca_cervix)") +
  labs(color = "Cluster", shape = "Cervical Cancer (0 = No, 1 = Yes)")
```
Note that I chose two clusters not from an elbow-style analysis but from the goal of separating the positive and negative cervical cancer classes; since there are two classes, I use two clusters. This worked reasonably well even with just the first two PCs, as can be seen in the second plot: almost all of the positive cancer cases fall in cluster 2 (all but 2 of them). This suggests the data is fairly linearly separable in the principal component space, which bodes well for the modeling approaches. This visual check is the main validation for the k-means and PCA steps, since there are no definitive standalone metrics of usefulness here; inter/intra-cluster distances and cumulative variance explained matter, but their interpretation in the context of the problem matters more.
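As a supplement, one concrete cluster-quality metric is the average silhouette width; a minimal sketch comparing the two solutions, assuming the cluster package (shipped with R) is available:
```{r}
# Average silhouette width for k-means on the full features vs. on the first two PCs
library(cluster)
sil_full <- silhouette(kmeans_model_full$cluster, dist(scaledDf %>% select(-ca_cervix, -cluster)))
sil_pca <- silhouette(kmeans_model$cluster, dist(pca_df[, 1:2]))
c(full_data = mean(sil_full[, "sil_width"]), pca_first_two = mean(sil_pca[, "sil_width"]))
```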
### XGBoost Classification
As discussed previously, I want a tree-based model so that after the predictive modeling I can generate feature importances to view the most contributing factors. I chose XGBoost because it extends trees with a boosting approach that has worked very well for me in the past, with relatively quick training and built-in cross-validation. Boosting lets each new tree learn from the failures of prior iterations, producing a very robust model.
```{r}
set.seed(123)
split_index <- createDataPartition(df$ca_cervix, p = 0.8, list = FALSE)
train_data <- df[split_index, ]
test_data <- df[-split_index, ]
Xtrain <- as.matrix(train_data %>% select(-ca_cervix))
ytrain <- as.matrix(train_data$ca_cervix)
Xtest <- as.matrix(test_data %>% select(-ca_cervix))
ytest <- as.matrix(test_data$ca_cervix)
train_matrix <- xgb.DMatrix(data = Xtrain, label = ytrain)
test_matrix <- xgb.DMatrix(data = Xtest, label = ytest)
```
xgb.cv runs k-fold cross-validation with the parameters fixed in the param list, evaluating every boosting round up to nrounds. Below I select the number of rounds with the best mean AUC (area under the ROC curve).
```{r}
param <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  max_depth = 4,
  eta = 0.1,
  subsample = 0.8,
  colsample_bytree = 0.8
)
cv_model <- xgb.cv(
  params = param,
  data = train_matrix,
  nrounds = 100,
  nfold = 5,
  verbose = 0
)
```
```{r}
xgb_model <- xgboost(
  params = param,
  data = train_matrix,
  nrounds = which.max(cv_model$evaluation_log$test_auc_mean),
  verbose = 0
)
```
```{r}
preds <- predict(xgb_model, test_matrix)
pred_labels <- ifelse(preds > 0.5, 1, 0)
conf_matrix <- confusionMatrix(factor(pred_labels), factor(test_data$ca_cervix))
auc_value <- Metrics::auc(test_data$ca_cervix, preds)
```
```{r}
list(
  AUC = auc_value,
  Accuracy = conf_matrix$overall["Accuracy"],
  F1_Score = conf_matrix$byClass["F1"],
  Balanced_Accuracy = conf_matrix$byClass["Balanced Accuracy"]
)
```
Well... this was somewhat anticlimactic: the model is perfect. While that is the ideal outcome, it makes the modeling problem quite simple. I ensured there was no data leakage in the process, and in fact found a peer-reviewed publication using this same dataset that reports the same near-perfect accuracies with tree-based modeling approaches. [Paper](https://pmc.ncbi.nlm.nih.gov/articles/PMC9185380/)
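For reference, the full confusion matrix on the 14-observation hold-out set can be printed directly:
```{r}
# Confusion matrix table for the hold-out set
conf_matrix$table
```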
I was also planning to try the modeling approach on PCA-reduced data, and I do so below for completeness, but it is arguably unnecessary since the full-feature model already generalizes perfectly. I speculate this will reduce performance for two reasons: tree-based approaches already perform intrinsic feature selection, so pre-reducing the inputs is not usually advised, and the baseline model was already perfect.
```{r}
# Fit PCA on the training data only and project the test set into the same space
# (refitting PCA on the test set would place it in a different basis than the training PCs)
pca_model_train <- prcomp(train_data %>% select(-ca_cervix), scale. = TRUE)
pca_train <- as.data.frame(pca_model_train$x[, 1:num_pcs])
pca_train$ca_cervix <- train_data$ca_cervix
pca_test <- as.data.frame(predict(pca_model_train, newdata = test_data %>% select(-ca_cervix))[, 1:num_pcs])
pca_test$ca_cervix <- test_data$ca_cervix
Xtrain_pca <- as.matrix(pca_train %>% select(-ca_cervix))
ytrain_pca <- as.matrix(pca_train$ca_cervix)
Xtest_pca <- as.matrix(pca_test %>% select(-ca_cervix))
ytest_pca <- as.matrix(pca_test$ca_cervix)
xgbTrain_pca <- xgb.DMatrix(data = Xtrain_pca, label = ytrain_pca)
xgbTest_pca <- xgb.DMatrix(data = Xtest_pca, label = ytest_pca)
# Reuse the number of boosting rounds selected by cross-validation on the full-feature model
xgb_model_pca <- xgboost(
  params = param,
  data = xgbTrain_pca,
  nrounds = which.max(cv_model$evaluation_log$test_auc_mean),
  verbose = 0
)
```
```{r}
preds_pca <- predict(xgb_model_pca, xgbTest_pca)
pred_labels_pca <- ifelse(preds_pca > 0.5, 1, 0)
conf_matrix_pca <- confusionMatrix(factor(pred_labels_pca), factor(ytest_pca))
auc_value_pca <- Metrics::auc(ytest_pca, preds_pca)
list(
  AUC = auc_value_pca,
  Accuracy = conf_matrix_pca$overall["Accuracy"],
  F1_Score = conf_matrix_pca$byClass["F1"],
  Balanced_Accuracy = conf_matrix_pca$byClass["Balanced Accuracy"]
)
```
As could plausibly be expected, reducing the features to principal components cost the model some information and reduced its performance quite sizably, though it still performs well. I find it quite interesting that the model has a perfect AUC (area under the ROC curve) but not perfect accuracy: the model can still separate the classes perfectly, just not at the default 0.5 classification threshold. Had the earlier model not been perfect I would have searched for a better threshold there, so I do so here by plotting the ROC curve.
```{r}
roc_curve_pca <- roc(ytest_pca, preds_pca)
# The false positive rate is 1 - specificity (the raw specificities are not the FPR)
roc_data <- data.frame(
  fpr = rev(1 - roc_curve_pca$specificities),
  tpr = rev(roc_curve_pca$sensitivities)
)
ggplot(roc_data, aes(x = fpr, y = tpr)) +
  geom_line(color = "blue", size = 1.2) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "ROC Curve - XGBoost on PCA-reduced Data", x = "False Positive Rate", y = "True Positive Rate") +
  theme_minimal()
```
Interestingly, I do not see why the 0.5 threshold is not working. I will look into this further at another time.
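One way to follow up is to ask pROC for the threshold that maximizes Youden's J rather than assuming 0.5; a quick sketch:
```{r}
# Candidate classification threshold that maximizes sensitivity + specificity - 1
coords(roc_curve_pca, "best", ret = c("threshold", "sensitivity", "specificity"), best.method = "youden")
```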
Another big advantage of not reducing to principal components is that feature importances are directly accessible from the non-PCA model.
```{r}
importance_matrix <- xgb.importance(model = xgb_model)
xgb.plot.importance(importance_matrix, top_n = 10, rel_to_first = TRUE, xlab = "Relative Importance", main = "Top 10 Feature Importances - XGBoost Model")
```
The variables contributing most (and least) to PC1 are:
```{r}
pc1_loadings <- abs(pca_model$rotation[, 1])
# Identify the top 5 contributing variables
top_5_indices <- order(pc1_loadings, decreasing = TRUE)[1:5]
top_5_variables <- names(pc1_loadings)[top_5_indices]
# Identify the bottom 5 contributing variables
bottom_5_indices <- order(pc1_loadings)[1:5]
bottom_5_variables <- names(pc1_loadings)[bottom_5_indices]
data.frame(Top_Variables = top_5_variables, Bottom_Variables = bottom_5_variables)
```
## Discussion
As described throughout, I would consider this modeling a great success, although the data itself was the driving factor. This dataset was extremely well feature-engineered to capture the relevant risk factors for cervical cancer. The model appears well-poised to generalize to future data, but it is very important to note that the test set had only 14 observations and the dataset as a whole was small. While it is impressive that the model trained so well on so little data, I have concerns about generalizability to a larger population with more variance and potentially entirely new feature interactions. So I would use the classifier with confidence, but only in settings where the data makes sense.
Both the clustering and PCA approaches were great supplements to the classification work and fit smoothly into the modeling process, giving a deeper understanding of the data and its nature. They allowed unique visuals to be generated and gave insight into the explicit features most related to the variance in the data. It is interesting that these do not directly align with the most important features from the classifier (at least for PC1), which is likely part of why the PC-reduced dataset was less effective for classification.
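A quick way to check that overlap directly, using the objects computed above:
```{r}
# Overlap between the top-10 XGBoost importance features and the top-5 PC1 loading variables
intersect(importance_matrix$Feature[1:10], top_5_variables)
```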
The feature importance extraction at the end also gave great insight into the leading risk factors for cervical cancer. This is important from both a public health and a personal perspective: women can try to reduce risky behaviors in their own lives, especially if these conditions run in their families, and national organizations can incentivize shifts away from these risk behaviors to decrease overall prevalence. However, it must be noted that these relationships are not established as causal and are not extremely robust - they are just the extraction from one particular ML model, albeit a very predictive one.