---
output:
pdf_document: default
html_document: default
---
```{r, include=FALSE}
if (!require("tidyverse")) {
install.packages("tidyverse")
library(tidyverse)
}
if (!require("class")) {
install.packages("class")
library(class)
}
```
```{r, include=FALSE}
## read dataset
df <- read_csv("~/DataStore-DataAnalytics/epi_results_2024_pop_gdp.csv")
```
```{r, include=FALSE}
SSA <- df %>% filter(region=="Sub-Saharan Africa")
FSS <- df %>% filter(region=="Former Soviet States")
```
# Analysis for Sub-Saharan Africa:
```{r}
ggplot(SSA, aes(x = gdp)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", bins = 20, alpha = 0.5) +
  geom_density(alpha = 0.7)
## Q-Q plots of SSA GDP against simulated normal and t (df = 5) quantiles
qqplot(rnorm(250), SSA$gdp, xlab = "Q-Q plot for norm dsn")
qqline(SSA$gdp)
qqplot(rt(250, df = 5), SSA$gdp, xlab = "Q-Q plot for t dsn")
qqline(SSA$gdp, distribution = function(p) qt(p, df = 5))
```
# Analysis for Former Soviet States:
```{r}
ggplot(FSS, aes(x = gdp)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", bins = 20, alpha = 0.5) +
  geom_density(alpha = 0.7)
## Q-Q plots of FSS GDP against simulated normal and t (df = 5) quantiles
qqplot(rnorm(250), FSS$gdp, xlab = "Q-Q plot for norm dsn")
qqline(FSS$gdp)
qqplot(rt(250, df = 5), FSS$gdp, xlab = "Q-Q plot for t dsn")
qqline(FSS$gdp, distribution = function(p) qt(p, df = 5))
```
# Linear Models:
## Whole Dataset:
```{r}
lmod <- lm(EPI.new~log10(gdp), data = df)
## print model output
summary(lmod)
ggplot(df, aes(x = log10(gdp), y = EPI.new)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```
```{r}
lmod <- lm(log1p(PAR.new)~log10(population)+log10(gdp), data = df)
## print model output
summary(lmod)
ggplot(df, aes(x = log10(population), y = log1p(PAR.new))) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```
## SSA Subset:
```{r}
lmod <- lm(EPI.new~log10(gdp), data = SSA)
## print model output
summary(lmod)
ggplot(SSA, aes(x = log10(gdp), y = EPI.new)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```
```{r}
lmod <- lm(log1p(PAR.new)~log10(population)+log10(gdp), data = SSA)
## print model output
summary(lmod)
ggplot(SSA, aes(x = log10(population), y = log1p(PAR.new))) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```
In both cases, the model fit on the whole dataset performed much better than the SSA-only model on adjusted R-squared. While not tailored to any particular task, adjusted R-squared is a reasonable measure of overall predictive power because, unlike multiple R-squared, it does not automatically rise as predictors are added, which guards against rewarding overfit models. The gap is most likely because the full dataset gives the regressions far more observations to fit on.
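For reference, the adjusted R-squared reported by `summary()` penalizes the ordinary $R^2$ for the number of predictors $p$ relative to the number of observations $n$:

$$
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
$$

so an added predictor only raises it if the improvement in fit outweighs the lost degree of freedom; the value can also be read directly from a fit with `summary(lmod)$adj.r.squared`.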
# Classification (KNN):
## Model 1:
```{r}
classDf <- df %>%
  filter(region == "Sub-Saharan Africa" | region == "Former Soviet States") %>%
  select(region, EPI.new, BER.new, MKP.new) %>%
  mutate(region = ifelse(region == "Sub-Saharan Africa", 0, 1))
classDf <- na.omit(classDf)
# Perform 80/20 split
set.seed(123) # Ensure reproducibility
train_indices <- sample(1:nrow(classDf), size = 0.8 * nrow(classDf)) # 80% for training
train <- classDf[train_indices, ] # Training set
test <- classDf[-train_indices, ] # Test set (remaining 20%)
```
```{r}
## kNN (the region label in column 1 is excluded from the predictors)
knn.predictions <- knn(train[, -1], test[, -1], cl = train$region, k = 4)
## confusion matrix/contingency table
CM <- table(knn.predictions, test$region, dnn=list('predicted','actual'))
CM
```
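The same accuracy calculation used for Model 2 below can also be applied to Model 1's confusion matrix:

```{r}
# Test accuracy for Model 1: proportion of correctly classified test observations
accuracy <- sum(diag(CM)) / sum(CM)
print(paste("Model 1 accuracy:", round(accuracy, 4)))
```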
## Model 2:
```{r}
classDf <- df %>%
  filter(region == "Sub-Saharan Africa" | region == "Former Soviet States") %>%
  select(region, SPI.old, ECO.old, BDH.old) %>%
  mutate(region = ifelse(region == "Sub-Saharan Africa", 0, 1))
classDf <- na.omit(classDf)
# Perform 80/20 split
set.seed(123) # Ensure reproducibility
train_indices <- sample(1:nrow(classDf), size = 0.8 * nrow(classDf)) # 80% for training
train <- classDf[train_indices, ] # Training set
test <- classDf[-train_indices, ] # Test set (remaining 20%)
```
```{r}
## kNN (the region label in column 1 is excluded from the predictors)
knn.predictions <- knn(train[, -1], test[, -1], cl = train$region, k = 2)
## confusion matrix/contingency table
CM <- table(knn.predictions, test$region, dnn=list('predicted','actual'))
CM
```
```{r}
# Test accuracy for Model 2: proportion of correctly classified test observations
accuracy <- sum(diag(CM)) / sum(CM)
# Print accuracy
print(paste("Model 2 accuracy:", round(accuracy, 4)))
```
Both models performed identically out of sample, each reaching a test accuracy of approximately 0.8333, although the test sets are far too small to properly evaluate the predictive capability of either model, and the comparison rests on accuracy alone. The result is still interesting because the two models use data from different time periods (old vs. new indicators) and three different variables each.
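A more systematic comparison could wrap the split, fit, and scoring steps into one helper and run it over both feature sets with the same seed. The sketch below is only illustrative (the function `evaluate_knn` is not part of the analysis above) and is not evaluated when knitting:

```{r, eval=FALSE}
# Illustrative helper: score a kNN model on a given set of feature columns.
# Assumes `df`, dplyr (tidyverse), and the `class` package are loaded as above.
evaluate_knn <- function(data, features, k) {
  d <- data %>%
    filter(region == "Sub-Saharan Africa" | region == "Former Soviet States") %>%
    select(region, all_of(features)) %>%
    mutate(region = ifelse(region == "Sub-Saharan Africa", 0, 1)) %>%
    na.omit()
  set.seed(123)  # same split for every feature set
  idx <- sample(1:nrow(d), size = 0.8 * nrow(d))
  train <- d[idx, ]
  test  <- d[-idx, ]
  pred <- knn(train[, -1], test[, -1], cl = train$region, k = k)
  cm <- table(predicted = pred, actual = test$region)
  sum(diag(cm)) / sum(cm)  # return test accuracy
}

evaluate_knn(df, c("EPI.new", "BER.new", "MKP.new"), k = 4)  # Model 1
evaluate_knn(df, c("SPI.old", "ECO.old", "BDH.old"), k = 2)  # Model 2
```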