---
output:
  pdf_document: default
  html_document: default
---
```{r, include=FALSE}
# Install (if needed) and load all required packages
packages <- c("caret", "corrplot", "ggpmisc", "Metrics", "MASS", "tidyverse",
              "e1071", "caTools", "class", "ranger", "xgboost")
for (pkg in packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}
```
```{r}
df <- read_csv("~/DataStore-DataAnalytics/NYC_Citywide_Annualized_Calendar_Sales_Update_20241107.csv")
chelseaDf <- df %>%
  filter(BOROUGH == 1)  # Borough 1 = Manhattan
```
## Relationship Between Sale Price and Year Built
```{r}
chelseaDf %>%
  ggplot(aes(x = `YEAR BUILT`, y = `SALE PRICE`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  stat_poly_eq(aes(label = paste(..eq.label.., ..adj.rr.label.., sep = "~~~")),
               formula = y ~ x,
               parse = TRUE,
               color = "blue") +
  labs(title = "Sale Price vs. Year Built",
       x = "Year Built",
       y = "Sale Price") +
  theme_minimal()
```
## Outlier Detection with IQR and Boxplot
```{r}
Q1 <- quantile(chelseaDf$`SALE PRICE`, 0.25, na.rm = TRUE)
Q3 <- quantile(chelseaDf$`SALE PRICE`, 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1
# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
chelseaDf <- chelseaDf %>%
  mutate(Outlier = `SALE PRICE` < (Q1 - 1.5 * IQR_value) |
                   `SALE PRICE` > (Q3 + 1.5 * IQR_value))
# Filter out the outliers
chelseaDf_no_outliers <- chelseaDf %>%
  filter(Outlier == FALSE)
chelseaDf %>%
  ggplot(aes(y = `SALE PRICE`)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16) +
  labs(title = "Boxplot of Sale Price (Outliers Included)",
       y = "Sale Price") +
  theme_minimal()
chelseaDf_no_outliers %>%
  ggplot(aes(y = `SALE PRICE`)) +
  geom_boxplot() +
  labs(title = "Boxplot of Sale Price (Outliers Removed)",
       y = "Sale Price") +
  theme_minimal()
```
## Correlation Plot of Numeric Features
```{r}
numeric_features <- chelseaDf %>%
  dplyr::select(where(is.numeric)) %>%
  drop_na()
cor_matrix <- cor(numeric_features, use = "complete.obs")
corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.7, tl.col = "black")
```
```{r}
# Drop identifier, sparse, and non-ordinal categorical columns before modeling
modelDf <- chelseaDf_no_outliers %>%
  dplyr::select(-NEIGHBORHOOD, -ADDRESS, -Outlier, -`SALE DATE`,
                -`APARTMENT NUMBER`, -`LAND SQUARE FEET`, -`Census Tract 2020`,
                -`Census Tract`, -`EASE-MENT`, -`NTA Code`, -BOROUGH,
                -`TAX CLASS AS OF FINAL ROLL`, -`TAX CLASS AT TIME OF SALE`,
                -`BUILDING CLASS AS OF FINAL ROLL`,
                -`BUILDING CLASS AT TIME OF SALE`,
                -`BUILDING CLASS CATEGORY`,
                -`ZIP CODE`,
                -NTA)
# Strip thousands separators so gross square footage parses as numeric
modelDf$`GROSS SQUARE FEET` <- as.numeric(gsub(",", "", modelDf$`GROSS SQUARE FEET`))
# Drop zero-variance columns, standardize predictors, and remove NAs
modelDf <- modelDf %>%
  dplyr::select(where(~ n_distinct(.) > 1)) %>%
  mutate(across(where(is.numeric) & !all_of("SALE PRICE"), scale)) %>%
  drop_na()
model1 <- lm(`SALE PRICE` ~ ., data = modelDf)
# Stepwise feature selection under the AIC criterion, in both directions
model2 <- stepAIC(model1, direction = "both", trace = FALSE)
summary(model2)
summary(model2)
```
I did a fair amount of cleaning before modeling. First, I removed outliers in the dependent variable using the IQR rule. I then kept only the columns with meaningful representation in the data, based on intuition and data cleanliness/sparsity; ZIP CODE was dropped because it is a non-ordinal categorical value that I did not want treated as numeric, and GROSS SQUARE FEET was converted to numeric as desired. I dropped features with no variability, as they add no predictive power, removed NAs, and standardized (mean-centered and scaled) the predictors. Finally, I used stepwise selection under the AIC criterion, in which features are kept only if they carry significant predictive value.
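As a minimal illustration (assuming `model1` from the chunk above), the stepwise search can be replayed verbosely to better understand which terms were added or dropped at each step:
```{r, eval=FALSE}
# Re-run the stepwise search with trace = TRUE to print the AIC at each step
step_verbose <- stepAIC(model1, direction = "both", trace = TRUE)
# The anova component records the sequence of added/dropped terms
step_verbose$anova
```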
Based on the adjusted R-squared criterion, the model performed quite poorly (~0.04): it captures only about 4% of the variability in sale price.
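For reference, adjusted R-squared penalizes model size, so it only rises when an added feature earns its keep:

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},$$

where $n$ is the number of observations and $p$ the number of predictors.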
```{r}
df$`GROSS SQUARE FEET` <- as.numeric(gsub(",", "", df$`GROSS SQUARE FEET`))
modelDf <- df %>%
  filter(BOROUGH == 1) %>%
  dplyr::select(NEIGHBORHOOD, `SALE PRICE`, `GROSS SQUARE FEET`) %>%
  dplyr::select(where(~ n_distinct(.) > 1)) %>%
  mutate(across(where(is.numeric) & !all_of("SALE PRICE"), scale)) %>%
  drop_na()
colnames(modelDf) <- c("Neighborhood", "Sale_Price", "GSqFt")
set.seed(123)
# 70/30 train-test split, preserving the class ratio of the target
split <- sample.split(modelDf$Neighborhood, SplitRatio = 0.7)
train_data <- subset(modelDf, split == TRUE)
test_data <- subset(modelDf, split == FALSE)
```
For this dataset I again select Borough 1, as before, but keep only the pertinent columns. I standardize (mean-center and scale) the numeric sale price and gross square footage, properly convert the GSqFt column to numeric, and rename the columns to remove spaces, which cause issues for some packages. I then do a 70/30 train-test split for modeling and evaluation. For every model I report precision, recall, and F1-measure per class, consolidated into a weighted average by class size to better represent overall model success; a helper for those weighted metrics is sketched below.
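A minimal sketch of such a helper (hypothetical; the chunks below compute the same quantities inline). Note it weights by true class size via column sums of the confusion matrix, whereas the inline code below weights by predicted-class counts via row sums:
```{r}
# Hypothetical helper: per-class precision/recall/F1 from a confusion matrix
# (rows = predicted, cols = actual), averaged with weights given by the true
# class sizes (column sums).
weighted_metrics <- function(conf_matrix) {
  precision <- diag(conf_matrix) / rowSums(conf_matrix)
  recall    <- diag(conf_matrix) / colSums(conf_matrix)
  f1        <- 2 * precision * recall / (precision + recall)
  weights   <- colSums(conf_matrix) / sum(conf_matrix)
  c(precision = sum(precision * weights, na.rm = TRUE),
    recall    = sum(recall * weights, na.rm = TRUE),
    f1        = sum(f1 * weights, na.rm = TRUE))
}
```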
```{r}
# Fit naive Bayes and evaluate on the held-out test set
nb_model <- naiveBayes(Neighborhood ~ ., data = train_data)
predictions_nb <- predict(nb_model, test_data)
conf_matrix_nb <- table(Predicted = predictions_nb, Actual = test_data$Neighborhood)
# Per-class metrics from the confusion matrix
precision_nb <- diag(conf_matrix_nb) / rowSums(conf_matrix_nb)
recall_nb <- diag(conf_matrix_nb) / colSums(conf_matrix_nb)
f1_score_nb <- 2 * (precision_nb * recall_nb) / (precision_nb + recall_nb)
# Weighted averages (weights here are predicted-class counts, i.e. row sums)
class_weights <- rowSums(conf_matrix_nb) / sum(conf_matrix_nb)
weighted_precision <- sum(precision_nb * class_weights, na.rm = TRUE)
weighted_recall <- sum(recall_nb * class_weights, na.rm = TRUE)
weighted_f1_score <- sum(f1_score_nb * class_weights, na.rm = TRUE)
print(conf_matrix_nb)
cat("Weighted Precision:", round(weighted_precision, 2), "\n")
cat("Weighted Recall:", round(weighted_recall, 2), "\n")
cat("Weighted F1 Score:", round(weighted_f1_score, 2), "\n")
```
No further processing was necessary for naive Bayes; the results are reported above.
```{r}
set.seed(123)
train_data$Neighborhood <- as.factor(train_data$Neighborhood)
test_data$Neighborhood <- as.factor(test_data$Neighborhood)
ntree_values <- seq(50, 500, by = 50)
# Score each candidate num.trees by accuracy on the training data
# (in-sample, not true cross-validation)
cv_results <- sapply(ntree_values, function(ntree) {
  rf_model <- ranger(Neighborhood ~ .,
                     data = train_data,
                     num.trees = ntree,
                     importance = 'impurity')
  pred <- predict(rf_model, data = train_data)$predictions
  mean(pred == train_data$Neighborhood)
})
optimal_ntree <- ntree_values[which.max(cv_results)]
# Refit with the selected number of trees
rf_model <- ranger(Neighborhood ~ .,
                   data = train_data,
                   num.trees = optimal_ntree,
                   importance = 'impurity')
predictions_rf <- predict(rf_model, data = test_data)$predictions
conf_matrix_rf <- table(Predicted = predictions_rf, Actual = test_data$Neighborhood)
precision_rf <- diag(conf_matrix_rf) / rowSums(conf_matrix_rf)
recall_rf <- diag(conf_matrix_rf) / colSums(conf_matrix_rf)
f1_score_rf <- 2 * (precision_rf * recall_rf) / (precision_rf + recall_rf)
class_weights <- rowSums(conf_matrix_rf) / sum(conf_matrix_rf)
weighted_precision <- sum(precision_rf * class_weights, na.rm = TRUE)
weighted_recall <- sum(recall_rf * class_weights, na.rm = TRUE)
weighted_f1_score <- sum(f1_score_rf * class_weights, na.rm = TRUE)
cat("Optimal num.trees:", optimal_ntree, "\n")
print(conf_matrix_rf)
cat("Weighted Precision:", round(weighted_precision, 2), "\n")
cat("Weighted Recall:", round(weighted_recall, 2), "\n")
cat("Weighted F1 Score:", round(weighted_f1_score, 2), "\n")
```
For ranger's random forest, I needed to convert the target Neighborhood column to a factor, then tuned the number of trees by comparing training-set accuracy across candidate values. Note that this is in-sample evaluation rather than true cross-validation; a proper k-fold version is sketched below.
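A minimal sketch of what genuine k-fold cross-validation over `num.trees` could look like, assuming `train_data` and `ntree_values` from the chunk above (the random fold assignment and `k = 5` are illustrative choices):
```{r, eval=FALSE}
set.seed(123)
k <- 5
# Randomly assign each training row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(train_data)))
cv_accuracy <- sapply(ntree_values, function(ntree) {
  # Average held-out accuracy across the k folds
  mean(sapply(1:k, function(i) {
    fit <- ranger(Neighborhood ~ .,
                  data = train_data[folds != i, ],
                  num.trees = ntree)
    pred <- predict(fit, data = train_data[folds == i, ])$predictions
    mean(pred == train_data$Neighborhood[folds == i])
  }))
})
optimal_ntree_cv <- ntree_values[which.max(cv_accuracy)]
```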
```{r}
set.seed(123)
# Prepare the data: Include 'SALE PRICE'
train_data_xgb <- train_data
test_data_xgb <- test_data
# Convert target variable 'Neighborhood' to a numeric class label
train_data_xgb$Neighborhood <- as.numeric(factor(train_data_xgb$Neighborhood)) - 1
test_data_xgb$Neighborhood <- as.numeric(factor(test_data_xgb$Neighborhood)) - 1
# Prepare the matrices for xgboost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data_xgb[, -which(names(train_data_xgb) == "Neighborhood")]),
                            label = train_data_xgb$Neighborhood)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data_xgb[, -which(names(test_data_xgb) == "Neighborhood")]),
                           label = test_data_xgb$Neighborhood)
# Train the final model; nrounds was originally tuned by CV, which was removed
# for runtime reasons and fixed manually at 100
xgb_model <- xgboost(data = train_matrix,
                     nrounds = 100,
                     objective = "multi:softmax",
                     num_class = length(unique(train_data$Neighborhood)),
                     eval_metric = "merror",
                     verbose = 0)
# Predictions on the test set
predictions_xgb <- predict(xgb_model, test_matrix)
predictions_xgb <- factor(predictions_xgb, levels = 0:(length(unique(train_data$Neighborhood)) - 1))
# Confusion matrix
conf_matrix_xgb <- table(Predicted = predictions_xgb, Actual = test_data$Neighborhood)
# Precision, Recall, and F1 Score per class
precision_xgb <- diag(conf_matrix_xgb) / rowSums(conf_matrix_xgb)
recall_xgb <- diag(conf_matrix_xgb) / colSums(conf_matrix_xgb)
f1_score_xgb <- 2 * (precision_xgb * recall_xgb) / (precision_xgb + recall_xgb)
# Weighted averages
class_weights <- rowSums(conf_matrix_xgb) / sum(conf_matrix_xgb)
weighted_precision <- sum(precision_xgb * class_weights, na.rm = TRUE)
weighted_recall <- sum(recall_xgb * class_weights, na.rm = TRUE)
weighted_f1_score <- sum(f1_score_xgb * class_weights, na.rm = TRUE)
cat("Optimal nrounds:", 100, "\n")
print(conf_matrix_xgb)
cat("Weighted Precision:", round(weighted_precision, 2), "\n")
cat("Weighted Recall:", round(weighted_recall, 2), "\n")
cat("Weighted F1 Score:", round(weighted_f1_score, 2), "\n")
```
For XGBoost, the data needed to be converted into xgb.DMatrix objects, which was done above. Cross-validation was originally used to tune only the number of boosting rounds, but it was removed to alleviate runtime concerns and the value was fixed manually at 100; a sketch of that tuning step is below.
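A minimal sketch of how that tuning could be reinstated with xgboost's built-in cross-validation, assuming `train_matrix` and `train_data` from the chunk above (the nfold value and early-stopping patience are illustrative):
```{r, eval=FALSE}
set.seed(123)
cv <- xgb.cv(data = train_matrix,
             nrounds = 200,
             nfold = 5,
             objective = "multi:softmax",
             num_class = length(unique(train_data$Neighborhood)),
             eval_metric = "merror",
             early_stopping_rounds = 10,  # stop once held-out error stalls
             verbose = 0)
best_nrounds <- cv$best_iteration
```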
The results are as follows:
| Algorithm | Weighted Precision | Weighted Recall | Weighted F1 Score |
|----------------|--------------------|-----------------|-------------------|
| **Naive Bayes** | **0.04** | **0.58** | **0.06** |
| **Random Forest** | **0.18** | **0.31** | **0.20** |
| **XGBoost** | **0.22** | **0.29** | **0.23** |
The most balanced predictivity comes from XGBoost, which has the largest F1-measure and relatively high values on every metric. Still, the F1 score is nowhere near 1, so there is a lot of room for improvement. Random forest was also quite balanced, slightly edging out XGBoost on recall but falling short on precision and overall predictivity. Naive Bayes had a very large weighted recall, but at the expense of a very low weighted precision, which likely means the model over-predicts positives and is not tuned very specifically to the problem.
## Application of Linear Regression to a New Borough
```{r}
# Filter data for Borough 3 (Brooklyn)
brooklynDf <- df %>%
  filter(BOROUGH == 3)
# Remove outliers in Sale Price using the same 1.5*IQR rule
Q1_b <- quantile(brooklynDf$`SALE PRICE`, 0.25, na.rm = TRUE)
Q3_b <- quantile(brooklynDf$`SALE PRICE`, 0.75, na.rm = TRUE)
IQR_value_b <- Q3_b - Q1_b
brooklynDf <- brooklynDf %>%
  mutate(Outlier = `SALE PRICE` < (Q1_b - 1.5 * IQR_value_b) |
                   `SALE PRICE` > (Q3_b + 1.5 * IQR_value_b)) %>%
  filter(Outlier == FALSE)
# Data cleaning and processing identical to the Chelsea pipeline
model2Df <- brooklynDf %>%
  dplyr::select(-NEIGHBORHOOD, -ADDRESS, -Outlier, -`SALE DATE`,
                -`APARTMENT NUMBER`, -`LAND SQUARE FEET`, -`Census Tract 2020`,
                -`Census Tract`, -`EASE-MENT`, -`NTA Code`, -BOROUGH,
                -`TAX CLASS AS OF FINAL ROLL`, -`TAX CLASS AT TIME OF SALE`,
                -`BUILDING CLASS AS OF FINAL ROLL`,
                -`BUILDING CLASS AT TIME OF SALE`,
                -`BUILDING CLASS CATEGORY`,
                -`ZIP CODE`,
                -NTA)
model2Df$`GROSS SQUARE FEET` <- as.numeric(gsub(",", "", model2Df$`GROSS SQUARE FEET`))
model2Df <- model2Df %>%
  dplyr::select(where(~ n_distinct(.) > 1)) %>%
  mutate(across(where(is.numeric) & !all_of("SALE PRICE"), scale)) %>%
  drop_na()
# Predict with the old model2 (stepwise AIC fit on Chelsea)
predictions <- predict(model2, newdata = model2Df)
# Add predictions and residuals to the dataframe
model2Df <- model2Df %>%
  mutate(Predicted_Sale_Price = predictions,
         Residuals = `SALE PRICE` - Predicted_Sale_Price)
# Plot predictions vs actuals
ggplot(model2Df, aes(x = Predicted_Sale_Price, y = `SALE PRICE`)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  labs(title = "Predicted vs Actual Sale Price (Borough 3)",
       x = "Predicted Sale Price",
       y = "Actual Sale Price") +
  theme_minimal()
# Plot residuals
ggplot(model2Df, aes(x = Predicted_Sale_Price, y = Residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red") +
  labs(title = "Residuals vs Predicted Sale Price (Borough 3)",
       x = "Predicted Sale Price",
       y = "Residuals") +
  theme_minimal()
# Model performance metrics
rmse <- sqrt(mean(model2Df$Residuals^2))
mae <- mean(abs(model2Df$Residuals))
r_squared <- cor(model2Df$Predicted_Sale_Price, model2Df$`SALE PRICE`)^2
cat("Model Performance on Borough 3 (Brooklyn):\n")
cat("RMSE:", round(rmse, 2), "\n")
cat("MAE:", round(mae, 2), "\n")
cat("R-squared:", round(r_squared, 2), "\n")
```
2a As these plots and the extremely low R-squared value of 0.01 show, the model performed miserably on this dataset. Both RMSE and MAE are extremely high, and the model is poorly fit across all values of sale price. This is not a shocking result, however: the model was trained on a completely different area and is drawing faulty conclusions on unseen data. A model cannot "learn" patterns from data it was never trained on; in simpler terms, this is a less severe version of expecting a model trained on NYC housing prices to predict prices in rural VA. It should also be noted that I had to reduce the number of categorical features in the original training, specifically to remove ones whose values in the new data were absent from the old (e.g., zip code).
2b Unfortunately, these classification models cannot be applied to the new dataset to predict neighborhood, because it is a different borough. Since the borough differs, all of the neighborhoods differ as well, and classifiers cannot adjust for unseen levels in the target column (unlike continuous regression). The models could potentially be re-fit with the same hyperparameters, but the training data must contain the new factor levels for the models to generalize to them. This makes classification models hyper-specific to their task and far less generalizable; a re-fit on the Brooklyn data is sketched below.
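A minimal sketch of such a re-fit, assuming `brooklynDf` and `optimal_ntree` from the chunks above (the column choices mirror the Borough 1 classifier; this is illustrative, not a tuned model):
```{r, eval=FALSE}
# Re-fit the same random forest specification on Brooklyn's own neighborhoods
brooklyn_clf <- brooklynDf %>%
  dplyr::select(NEIGHBORHOOD, `SALE PRICE`, `GROSS SQUARE FEET`) %>%
  drop_na() %>%
  mutate(NEIGHBORHOOD = as.factor(NEIGHBORHOOD))
colnames(brooklyn_clf) <- c("Neighborhood", "Sale_Price", "GSqFt")
rf_brooklyn <- ranger(Neighborhood ~ .,
                      data = brooklyn_clf,
                      num.trees = optimal_ntree,
                      importance = "impurity")
```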
2c The dataset has hugely different scales across variables that need to be addressed. Using geographical tagging (e.g., PUMS) could be interesting for location, as location seems to be a main driver of sale price. As such, local models are definitely desirable for maintaining high predictive power, despite the inefficiency of training and storing many models. As a whole, I have very little confidence in these results, but new modeling approaches better suited to the individual tasks could yield more confidence and success.
3 This is quite a poor application area for classification models, as they were used for multi-class classification with a huge number of categories, which compounds the difficulty of training the models properly. Because the categories vary across boroughs, it is impossible to cross-apply the models, and doing so would be ill-advised regardless. The same applies to regression: do not apply these models to unseen data, or even to data outside the target variable's range. The gradient-boosted trees worked quite well compared to the other models. Naive Bayes struggles to extract real predictive power from a large number of numerical features and under-fits the problem, while random forest remains a powerful method for classification. As for linear regression, much more thought should go into preprocessing the data for training, and more complex models should be implemented to capture the non-linearity and more intricate relationships between variables that drive sale price variability and predictability.