---
output:
  pdf_document: default
  html_document: default
---
```{r, include=FALSE}
## Install and load required packages if they are not already available
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}
if (!require("class")) {
  install.packages("class")
  library(class)
}
```
```{r, include=FALSE}
## read dataset
df <- read_csv("~/DataStore-DataAnalytics/epi_results_2024_pop_gdp.csv")
```
```{r, include=FALSE}
## regional subsets
SSA <- df %>% filter(region == "Sub-Saharan Africa")
FSS <- df %>% filter(region == "Former Soviet States")
```
# Analysis for Sub-Saharan Africa:
```{r}
## Histogram of GDP with an overlaid density estimate
ggplot(SSA, aes(x = gdp)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", bins = 20, alpha = 0.5) +
  geom_density(alpha = 0.7)

## Q-Q plots of GDP against normal and t reference distributions
qqplot(rnorm(250), SSA$gdp, main = "Q-Q plot for normal distribution")
qqline(SSA$gdp)
qqplot(rt(250, df = 5), SSA$gdp, main = "Q-Q plot for t distribution")
qqline(SSA$gdp, distribution = function(p) qt(p, df = 5))
```
# Analysis for Former Soviet States:
```{r}
## Histogram of GDP with an overlaid density estimate
## (after_stat(density) replaces the deprecated ..density.. syntax)
ggplot(FSS, aes(x = gdp)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", bins = 20, alpha = 0.5) +
  geom_density(alpha = 0.7)

## Q-Q plots of GDP against normal and t reference distributions
qqplot(rnorm(250), FSS$gdp, main = "Q-Q plot for normal distribution")
qqline(FSS$gdp)
qqplot(rt(250, df = 5), FSS$gdp, main = "Q-Q plot for t distribution")
qqline(FSS$gdp, distribution = function(p) qt(p, df = 5))
```
# Linear Models:
## Whole Dataset:
```{r}
lmod <- lm(EPI.new ~ log10(gdp), data = df)
## print model output
summary(lmod)
ggplot(df, aes(x = log10(gdp), y = EPI.new)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```
```{r}
lmod <- lm(log1p(PAR.new) ~ log10(population) + log10(gdp), data = df)
## print model output
summary(lmod)
ggplot(df, aes(x = log10(population), y = log1p(PAR.new))) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```
## SSA Subset:
```{r}
lmod <- lm(EPI.new ~ log10(gdp), data = SSA)
## print model output
summary(lmod)
ggplot(SSA, aes(x = log10(gdp), y = EPI.new)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```
```{r}
lmod <- lm(log1p(PAR.new) ~ log10(population) + log10(gdp), data = SSA)
## print model output
summary(lmod)
ggplot(SSA, aes(x = log10(population), y = log1p(PAR.new))) +
  geom_point() +
  stat_smooth(method = "lm", col = "red")
```
In both cases, the model fit on the full dataset achieved a substantially higher adjusted R-squared than its SSA-only counterpart. While not tailored to any specific task, adjusted R-squared is a reasonable measure of overall explanatory power: unlike multiple R-squared, which can only rise as predictors are added, it penalizes additional predictors and so guards against overfitting. The gap is likely driven largely by sample size, since the full dataset gives the regressions far more observations to fit on.
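To make the comparison concrete, the adjusted R-squared values can be pulled directly out of the fitted model summaries. This is a minimal sketch that refits the two `EPI.new ~ log10(gdp)` models under new names (`lmod_full` and `lmod_ssa` are introduced here for clarity, since `lmod` is overwritten by each chunk above):

```{r}
## Refit both simple models so each can be inspected by name
lmod_full <- lm(EPI.new ~ log10(gdp), data = df)
lmod_ssa  <- lm(EPI.new ~ log10(gdp), data = SSA)

## Adjusted R-squared is stored on the summary object
summary(lmod_full)$adj.r.squared
summary(lmod_ssa)$adj.r.squared
```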
# Classification (KNN):
## Model 1:
```{r}
classDf <- df %>%
  filter(region == "Sub-Saharan Africa" | region == "Former Soviet States") %>%
  select(region, EPI.new, BER.new, MKP.new) %>%
  mutate(region = ifelse(region == "Sub-Saharan Africa", 0, 1))
classDf <- na.omit(classDf)

# Perform 80/20 split
set.seed(123)  # Ensure reproducibility
train_indices <- sample(1:nrow(classDf), size = 0.8 * nrow(classDf))  # 80% for training
train <- classDf[train_indices, ]  # Training set
test <- classDf[-train_indices, ]  # Test set (remaining 20%)
```
```{r}
## kNN -- column 1 is the region label, so it must be excluded from the
## feature matrices or the model would be given the answer as a predictor
knn.predictions <- knn(train[, -1], test[, -1], train$region, k = 4)
## confusion matrix/contingency table
CM <- table(knn.predictions, test$region, dnn = list("predicted", "actual"))
CM
```
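Because kNN is distance-based, predictors on very different scales can dominate the distance calculation. One common refinement, sketched here on the same Model 1 split under the assumption that the `region` label sits in column 1, is to standardize the features using the training set's means and standard deviations for both sets (so no test-set information leaks into the preprocessing):

```{r}
## Standardize predictors with training-set statistics
train_x <- scale(train[, -1])
test_x  <- scale(test[, -1],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

knn.scaled <- knn(train_x, test_x, train$region, k = 4)
table(knn.scaled, test$region, dnn = list("predicted", "actual"))
```

With features already on broadly similar scales this may change little, but it removes one source of arbitrariness from the distance metric.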
## Model 2:
```{r}
classDf <- df %>%
  filter(region == "Sub-Saharan Africa" | region == "Former Soviet States") %>%
  select(region, SPI.old, ECO.old, BDH.old) %>%
  mutate(region = ifelse(region == "Sub-Saharan Africa", 0, 1))
classDf <- na.omit(classDf)

# Perform 80/20 split
set.seed(123)  # Ensure reproducibility
train_indices <- sample(1:nrow(classDf), size = 0.8 * nrow(classDf))  # 80% for training
train <- classDf[train_indices, ]  # Training set
test <- classDf[-train_indices, ]  # Test set (remaining 20%)
```
```{r}
## kNN -- again excluding the region label (column 1) from the features
knn.predictions <- knn(train[, -1], test[, -1], train$region, k = 2)
## confusion matrix/contingency table
CM <- table(knn.predictions, test$region, dnn = list("predicted", "actual"))
CM
```
```{r}
# Calculate accuracy
accuracy <- sum(diag(CM)) / sum(CM)
# Print accuracy
print(paste("Accuracy:", round(accuracy, 4)))
```
The two models performed identically out-of-sample, each reaching a test accuracy of approximately 0.8333. Both test sets are far too small to reliably evaluate the models' true predictive capabilities, however, and the comparison rests on accuracy alone. The agreement is still interesting, because the models draw on different time periods of data (old vs. new indicators) and on three different quantities each.
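One way to make the k = 4 and k = 2 choices above less arbitrary is to sweep several values of k and compare test accuracy on the current split. This is a rough sketch against the Model 2 `train`/`test` objects (with so few test rows the accuracies will be coarse; cross-validation would be more trustworthy here):

```{r}
## Test accuracy for k = 1..10 on the current train/test split,
## again keeping the region label (column 1) out of the features
sapply(1:10, function(k) {
  preds <- knn(train[, -1], test[, -1], train$region, k = k)
  mean(preds == test$region)
})
```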