---
output:
  pdf_document: default
  html_document: default
---
```{r, include=FALSE}
## Install and load required packages if they are not already available
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}
if (!require("class")) {
  install.packages("class")
  library(class)
}
```
```{r, include=FALSE}
## read dataset
df <- read_csv("~/DataStore-DataAnalytics/epi_results_2024_pop_gdp.csv")
```
```{r, include=FALSE}
## regional subsets
SSA <- df %>% filter(region == "Sub-Saharan Africa")
FSS <- df %>% filter(region == "Former Soviet States")
```
# Analysis for Sub-Saharan Africa:
```{r}
## Histogram of GDP with an overlaid density estimate
ggplot(SSA, aes(x = gdp)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", bins = 20, alpha = 0.5) +
  geom_density(alpha = 0.7)

## Q-Q plots of GDP against normal and t reference distributions
qqplot(rnorm(250), SSA$gdp, main = "Q-Q plot for normal distribution")
qqline(SSA$gdp)
qqplot(rt(250, df = 5), SSA$gdp, main = "Q-Q plot for t distribution")
qqline(SSA$gdp, distribution = function(p) qt(p, df = 5))
```
# Analysis for Former Soviet States:
```{r}
## Histogram of GDP with an overlaid density estimate
## (after_stat(density) replaces the deprecated ..density.. syntax)
ggplot(FSS, aes(x = gdp)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", bins = 20, alpha = 0.5) +
  geom_density(alpha = 0.7)

## Q-Q plots of GDP against normal and t reference distributions
qqplot(rnorm(250), FSS$gdp, main = "Q-Q plot for normal distribution")
qqline(FSS$gdp)
qqplot(rt(250, df = 5), FSS$gdp, main = "Q-Q plot for t distribution")
qqline(FSS$gdp, distribution = function(p) qt(p, df = 5))
```
# Linear Models:
## Whole Dataset:
```{r}
lmod <- lm(EPI.new ~ log10(gdp), data = df)
## print model output
summary(lmod)
ggplot(df, aes(x = log10(gdp), y = EPI.new)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```
```{r}
lmod <- lm(log1p(PAR.new) ~ log10(population) + log10(gdp), data = df)
## print model output
summary(lmod)
ggplot(df, aes(x = log10(population), y = log1p(PAR.new))) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```
## SSA Subset:
```{r}
lmod <- lm(EPI.new ~ log10(gdp), data = SSA)
## print model output
summary(lmod)
ggplot(SSA, aes(x = log10(gdp), y = EPI.new)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```
```{r}
lmod <- lm(log1p(PAR.new) ~ log10(population) + log10(gdp), data = SSA)
## print model output
summary(lmod)
ggplot(SSA, aes(x = log10(population), y = log1p(PAR.new))) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```
In both cases, the model fit on the full dataset achieved a substantially higher adjusted R-squared than its SSA-only counterpart. While not tailored to any specific task, adjusted R-squared is a reasonable measure of overall explanatory power: unlike multiple R-squared, which can only rise as predictors are added, it penalizes additional predictors and so guards against overfitting. The gap is likely driven largely by sample size, since the full dataset gives the regressions far more observations to fit on.
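To make the comparison concrete, the adjusted R-squared values can be pulled directly out of the fitted model summaries. This is a minimal sketch that refits the two `EPI.new ~ log10(gdp)` models under new names (`lmod_full` and `lmod_ssa` are introduced here for clarity, since `lmod` is overwritten by each chunk above):

```{r}
## Refit both simple models so each can be inspected by name
lmod_full <- lm(EPI.new ~ log10(gdp), data = df)
lmod_ssa  <- lm(EPI.new ~ log10(gdp), data = SSA)

## Adjusted R-squared is stored on the summary object
summary(lmod_full)$adj.r.squared
summary(lmod_ssa)$adj.r.squared
```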
# Classification (KNN):
## Model 1:
```{r}
classDf <- df %>%
  filter(region == "Sub-Saharan Africa" | region == "Former Soviet States") %>%
  select(region, EPI.new, BER.new, MKP.new) %>%
  mutate(region = ifelse(region == "Sub-Saharan Africa", 0, 1))
classDf <- na.omit(classDf)

# Perform 80/20 split
set.seed(123)  # Ensure reproducibility
train_indices <- sample(1:nrow(classDf), size = 0.8 * nrow(classDf))  # 80% for training
train <- classDf[train_indices, ]  # Training set
test <- classDf[-train_indices, ]  # Test set (remaining 20%)
```
```{r}
## kNN -- column 1 is the region label, so it must be excluded from the
## feature matrices or the model would be given the answer as a predictor
knn.predictions <- knn(train[, -1], test[, -1], train$region, k = 4)
## confusion matrix/contingency table
CM <- table(knn.predictions, test$region, dnn = list("predicted", "actual"))
CM
```
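Because kNN is distance-based, predictors on very different scales can dominate the distance calculation. One common refinement, sketched here on the same Model 1 split under the assumption that the `region` label sits in column 1, is to standardize the features using the training set's means and standard deviations for both sets (so no test-set information leaks into the preprocessing):

```{r}
## Standardize predictors with training-set statistics
train_x <- scale(train[, -1])
test_x  <- scale(test[, -1],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

knn.scaled <- knn(train_x, test_x, train$region, k = 4)
table(knn.scaled, test$region, dnn = list("predicted", "actual"))
```

With features already on broadly similar scales this may change little, but it removes one source of arbitrariness from the distance metric.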
## Model 2:
```{r}
classDf <- df %>%
  filter(region == "Sub-Saharan Africa" | region == "Former Soviet States") %>%
  select(region, SPI.old, ECO.old, BDH.old) %>%
  mutate(region = ifelse(region == "Sub-Saharan Africa", 0, 1))
classDf <- na.omit(classDf)

# Perform 80/20 split
set.seed(123)  # Ensure reproducibility
train_indices <- sample(1:nrow(classDf), size = 0.8 * nrow(classDf))  # 80% for training
train <- classDf[train_indices, ]  # Training set
test <- classDf[-train_indices, ]  # Test set (remaining 20%)
```
```{r}
## kNN -- again excluding the region label (column 1) from the features
knn.predictions <- knn(train[, -1], test[, -1], train$region, k = 2)
## confusion matrix/contingency table
CM <- table(knn.predictions, test$region, dnn = list("predicted", "actual"))
CM
```
```{r}
# Calculate accuracy
accuracy <- sum(diag(CM)) / sum(CM)
# Print accuracy
print(paste("Accuracy:", round(accuracy, 4)))
```
The two models performed identically out-of-sample, each reaching a test accuracy of approximately 0.8333. Both test sets are far too small to reliably evaluate the models' true predictive capabilities, however, and the comparison rests on accuracy alone. The agreement is still interesting, because the models draw on different time periods of data (old vs. new indicators) and on three different quantities each.
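One way to make the k = 4 and k = 2 choices above less arbitrary is to sweep several values of k and compare test accuracy on the current split. This is a rough sketch against the Model 2 `train`/`test` objects (with so few test rows the accuracies will be coarse; cross-validation would be more trustworthy here):

```{r}
## Test accuracy for k = 1..10 on the current train/test split,
## again keeping the region label (column 1) out of the features
sapply(1:10, function(k) {
  preds <- knn(train[, -1], test[, -1], train$region, k = k)
  mean(preds == test$region)
})
```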