assignment 4 #21

Open · wants to merge 1 commit into base: main
339 changes: 339 additions & 0 deletions StudentNotebooks/Assignment04/zhaot4-f24-assignment4.Rmd
@@ -0,0 +1,339 @@
---
title: "DAR F24 Project Status Notebook(assignment 4)"
author: "Tianhao Zhao"
date: "`r Sys.Date()`"
output:
html_document:
toc: yes
pdf_document:
toc: yes
subtitle: "CTBench assignment 4"
---

## Weekly Work Summary

**NOTE:** Follow an outline format; use bullets to express individual points.

* RCS ID: zhaot4
* Project Name: CTEval
* Summary of work since last week

* I mainly worked on using the paired t-test to compare the classification performance scores (precision, recall, F1) between the GPT and BERT match models. After that, I tried to find how the candidate and reference lengths correlate with classification performance for these two match models.

* NEW: Summary of github issues added and worked

* Issues that you've submitted
* Issues that you've self-assigned and addressed

* Summary of github commits

* dar-zhaot4


* List of presentations, papers, or other outputs

* Include browsable links

* List of references (if necessary)
* Indicate any use of group shared code base
* Indicate which parts of your described work were done by you or as part of joint efforts

The figure layout that combines three scatter plots into one larger plot was adapted from Yashas Balaji's Assignment 3; it makes the comparison more direct.


* **Required:** Provide illustrating figures and/or tables

## Personal Contribution

* Clearly defined, unique contribution(s) done by you: code, ideas, writing...
* Include github issues you've addressed if any

## Analysis: Match model performance metrics between GPT and BERT

### Question being asked
Compare the classification performance scores (precision, recall, F1) between the GPT and BERT match models using a paired t-test, in order to identify which model is more accurate.


### Data Preparation
Load the updated ct_pub trials.responses data set into a data frame. Then filter the data for the two match models (GPT and BERT) and merge the two subsets into a combined data frame by matching on trial_id.

```{r, result01_data}
# Load the updated trials.responses data
updated.responses.df <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.responses_updated.Rds")
# Convert trial_group and model to factors
updated.responses.df$trial_group <- as.factor(updated.responses.df$trial_group)
updated.responses.df$model <- as.factor(updated.responses.df$model)
# check out the size
dim(updated.responses.df)
# Filter the data for the match-models (e.g., gpt vs BERT)
gpt_data <- subset(updated.responses.df, match_model == "gpt")
bert_data <- subset(updated.responses.df, match_model == "BERT")
# Ensure the data is paired by matching on trial_id
merged_data <- merge(gpt_data, bert_data, by = "trial_id", suffixes = c("_gpt", "_bert"))
```

### Analysis: Methods and results

Use the paired t-test function to compare the classification performance scores (F1, precision, and recall) between the GPT and BERT match models, then use the kable function to present the three paired t-test results in a single, readable table.

F1 score:
```{r, result01_analysis}
# Load the knitr package (used for kable below)
library(knitr)
# Paired t-test based on F1 score
t_test_result <- t.test(merged_data$f1_gpt, merged_data$f1_bert, paired = TRUE)
```
Precision:
```{r}
#Paired t-test based on the precision
t_test_result_precision <- t.test(merged_data$precision_gpt, merged_data$precision_bert, paired = TRUE)
```

Recall:
```{r}
#Paired t-test based on the recall
t_test_result_recall <- t.test(merged_data$recall_gpt, merged_data$recall_bert, paired = TRUE)
```


Display the combined table, which contains the paired t-test results for each metric (precision, F1 score, and recall):
```{r}
# Combine the precision, F1, and recall t-test results into a single data frame
t_test_combined <- data.frame(
  Metric = c("Precision", "F1 Score", "Recall"),
  Statistic = c(t_test_result_precision$statistic, t_test_result$statistic, t_test_result_recall$statistic),
  DF = c(t_test_result_precision$parameter, t_test_result$parameter, t_test_result_recall$parameter),
  p_value = c(t_test_result_precision$p.value, t_test_result$p.value, t_test_result_recall$p.value),
  CI_Lower = c(t_test_result_precision$conf.int[1], t_test_result$conf.int[1], t_test_result_recall$conf.int[1]),
  CI_Upper = c(t_test_result_precision$conf.int[2], t_test_result$conf.int[2], t_test_result_recall$conf.int[2]),
  Mean_Difference = c(t_test_result_precision$estimate, t_test_result$estimate, t_test_result_recall$estimate)
)
# Display the combined table using kable for formatting
kable(t_test_combined, caption = "Paired t-test Results for Precision, F1 Score, and Recall (GPT vs BERT)")
```
Precision:

* Mean difference = 0.1057: on average, GPT's precision is approximately 0.106 (about 10.6 percentage points) higher than BERT's precision across the trials.
* p-value ≈ 0: far smaller than 0.05, so the difference in precision between the GPT and BERT models is statistically significant.
* Confidence interval (0.0986, 0.1127): since both bounds are positive, GPT's precision is consistently higher than BERT's, by an amount between 0.0986 and 0.1127.

F1 score:

* Mean difference = 0.0993: on average, GPT's F1 score is approximately 0.099 (about 9.9 percentage points) higher than BERT's F1 score across the trials.
* p-value ≈ 0: far smaller than 0.05, so the difference in F1 score between the two models is statistically significant.
* Confidence interval (0.0921, 0.1065): since both bounds are positive, GPT's F1 score is consistently higher than BERT's, by an amount between 0.0921 and 0.1065.

Recall:

* Mean difference = 0.1270: on average, GPT's recall is approximately 0.127 (about 12.7 percentage points) higher than BERT's recall across the trials.
* p-value ≈ 0: far smaller than 0.05, so the difference in recall between the two models is statistically significant.
* Confidence interval (0.1180, 0.1359): since both bounds are positive, GPT's recall is consistently higher than BERT's, by an amount between 0.1180 and 0.1359.
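
To complement the p-values with a sense of practical magnitude, a paired-sample effect size can be computed as the mean difference divided by the standard deviation of the differences (Cohen's d for paired data). The sketch below is an addition, not part of the original analysis; it only assumes the merged_data columns created above.

```{r}
# Supplementary sketch (assumes merged_data with *_gpt / *_bert columns, built above):
# Cohen's d for paired samples = mean(difference) / sd(difference)
paired_d <- function(x, y) {
  d <- x - y
  mean(d, na.rm = TRUE) / sd(d, na.rm = TRUE)
}
effect_sizes <- data.frame(
  Metric = c("Precision", "F1 Score", "Recall"),
  Cohens_d = c(
    paired_d(merged_data$precision_gpt, merged_data$precision_bert),
    paired_d(merged_data$f1_gpt, merged_data$f1_bert),
    paired_d(merged_data$recall_gpt, merged_data$recall_bert)
  )
)
kable(effect_sizes, digits = 3, caption = "Paired effect sizes (GPT minus BERT), added as a supplementary check")
```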


### Discussion of results

Based on the paired t-test results, the GPT match model performs better when we focus only on the classification performance scores. For all three metrics (precision, F1 score, and recall), the p-values are effectively 0, indicating that the observed differences between GPT and BERT are highly statistically significant. These results provide strong evidence that GPT outperforms BERT across all three performance metrics. The mean differences suggest that GPT performs about 10% better in terms of precision and F1 score and about 12.7% better in terms of recall compared to BERT. Based on these results, GPT shows a clear advantage over BERT in identifying positive instances (recall), achieving a better balance between precision and recall (F1 score), and producing fewer false positives (precision). These significant differences suggest that GPT may be the more suitable model for tasks that require high precision, recall, and F1 score.


## Analysis: Compare Feature Generation and Model Performance (model comparison)


### Question being asked
Find the correlation between the classification performance metrics (precision, recall, F1) in the responses data and the candidate and reference lengths, in order to see which match model holds up better as the number of reference and candidate features grows.


### Data Preparation

First, load the matches data for both the GPT and BERT match models. Then count the actual number of reference and candidate features in each model's data set and merge those counts into the responses data. Finally, combine the new responses data with the matches data for each match model to obtain the data used in this analysis.

```{r, result02_data}
library(tidyr)
library(ggplot2)
library(dplyr)
library(stringr)
# Load the GPT trials.matches data
gpt_matches.df <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.matches_gpt_updated.Rds")
# Convert model to a factor
gpt_matches.df$model <- as.factor(gpt_matches.df$model)
dim(gpt_matches.df)
# Load the BERT trials.matches data
bert_matches.df <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.matches_BERT_updated.Rds")
# Convert model to a factor
bert_matches.df$model <- as.factor(bert_matches.df$model)
# Count non-NA values in reference and candidate for each trial_id and model
reference_candidate_counts_bert <- bert_matches.df %>%
group_by(trial_id, model) %>%
summarize(
reference_len = sum(!is.na(reference)), # Count non-NA values in reference
candidate_len = sum(!is.na(candidate)), # Count non-NA values in candidate
.groups = 'drop' # Prevents the warning about grouping
)
reference_candidate_counts_gpt <- gpt_matches.df %>%
group_by(trial_id, model) %>%
summarize(
reference_len = sum(!is.na(reference)), # Count non-NA values in reference
candidate_len = sum(!is.na(candidate)), # Count non-NA values in candidate
.groups = 'drop' # Prevents the warning about grouping
)
# Correct model names in the BERT counts (replace "in" with "it") so they match the responses data
reference_candidate_counts_bert <- reference_candidate_counts_bert %>%
  mutate(model = str_replace_all(model, "in", "it"))
# Correct model names in the GPT counts (replace "in" with "it") so they match the responses data
reference_candidate_counts_gpt <- reference_candidate_counts_gpt %>%
  mutate(model = str_replace_all(model, "in", "it"))
# Merge the reference/candidate counts into the BERT responses
bert_with_counts <- bert_data %>%
  left_join(reference_candidate_counts_bert, by = c("trial_id", "model"))
# Merge the reference/candidate counts into the GPT responses
gpt_with_counts <- gpt_data %>%
  left_join(reference_candidate_counts_gpt, by = c("trial_id", "model"))
# Merge the gpt matches with the gpt responses
combined_data_gpt <- merge(gpt_matches.df, gpt_with_counts, by = c("trial_id", "model"), all = TRUE)
# Merge the BERT matches with the BERT responses
combined_data_bert <- merge(bert_matches.df, bert_with_counts, by = c("trial_id", "model"), all = TRUE)
```




### Analysis: Methods and Results


Compute the correlation values (using the cor function) between candidate/reference length and the classification performance metrics, using GPT and BERT as the match models.

GPT:
```{r}
# Correlation between precision and len_candidate
cor_precision_len_candidate <- cor(combined_data_gpt$precision, combined_data_gpt$candidate_len, use = "complete.obs")
# Correlation between recall and len_reference
cor_recall_len_reference <- cor(combined_data_gpt$recall, combined_data_gpt$reference_len, use = "complete.obs")
# Correlation between F1 and both len_candidate and len_reference
# F1 depends on both len_candidate (via precision) and len_reference (via recall), so we calculate both
cor_f1_len_candidate <- cor(combined_data_gpt$f1, combined_data_gpt$candidate_len, use = "complete.obs")
cor_f1_len_reference <- cor(combined_data_gpt$f1, combined_data_gpt$reference_len, use = "complete.obs")
# Print the correlations for analysis
print(paste("Correlation between Precision and len_candidate:", cor_precision_len_candidate))
print(paste("Correlation between Recall and len_reference:", cor_recall_len_reference))
print(paste("Correlation between F1 and len_candidate:", cor_f1_len_candidate))
print(paste("Correlation between F1 and len_reference:", cor_f1_len_reference))
```
* Precision vs len_candidate: the correlation is -0.306, indicating a weak negative relationship between GPT's precision and the number of candidate features.
* Recall vs len_reference: the correlation is -0.577, indicating a moderate negative relationship.
* F1 vs len_candidate: the correlation is -0.226, indicating a weak negative relationship.
* F1 vs len_reference: the correlation is -0.040, indicating almost no relationship between the F1 score and the number of reference features.
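
As a supplementary check (not part of the original analysis), cor.test() attaches a confidence interval and p-value to the point correlations above, which helps qualify the "weak"/"moderate" labels. The sketch only assumes the combined_data_gpt frame built in the data-preparation chunk; cor.test() drops incomplete cases on its own.

```{r}
# Supplementary sketch: significance and confidence intervals for two of the
# correlations reported above (GPT match model).
cor.test(combined_data_gpt$precision, combined_data_gpt$candidate_len)
cor.test(combined_data_gpt$recall, combined_data_gpt$reference_len)
```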

Visualization of these relationships for the GPT match model: each panel shows a scatter plot with a fitted regression line whose slope reflects the direction and rough strength of the relationship.
```{r,warning=FALSE}
# Precision vs len_candidate plot
p1 <- ggplot(combined_data_gpt, aes(x = candidate_len, y = precision)) +
geom_point(color = "lightgreen", alpha = 0.6) +
geom_smooth(method = "lm", color = "black") +
labs(title = "Precision vs len_candidate", x = "len_candidate", y = "Precision") +
theme_minimal()
# Recall vs len_reference plot
p2 <- ggplot(combined_data_gpt, aes(x = reference_len, y = recall)) +
geom_point(color = "lightblue", alpha = 0.6) +
geom_smooth(method = "lm", color = "black") +
labs(title = "Recall vs len_reference", x = "len_reference", y = "Recall") +
theme_minimal()
# F1 Score vs len_candidate (F1 vs len_reference could be plotted the same way)
p3 <- ggplot(combined_data_gpt, aes(x = candidate_len, y = f1)) +
  geom_point(color = "salmon", alpha = 0.6) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "F1 Score vs len_candidate", x = "len_candidate", y = "F1 Score") +
  theme_minimal()
# Arrange the plots in a grid
library(gridExtra)
grid.arrange(p1, p2, p3, ncol = 3, top = "Correlation (GPT): Precision vs len_candidate, Recall vs len_reference, F1 vs len_candidate")
```

BERT:

```{r}
# Correlation between precision and len_candidate for BERT
cor_precision_len_candidate_bert <- cor(combined_data_bert$precision, combined_data_bert$candidate_len, use = "complete.obs")
# Correlation between recall and len_reference for BERT
cor_recall_len_reference_bert <- cor(combined_data_bert$recall, combined_data_bert$reference_len, use = "complete.obs")
# Correlation between F1 and both len_candidate and len_reference for BERT
cor_f1_len_candidate_bert <- cor(combined_data_bert$f1, combined_data_bert$candidate_len, use = "complete.obs")
cor_f1_len_reference_bert <- cor(combined_data_bert$f1, combined_data_bert$reference_len, use = "complete.obs")
# Print the correlations for BERT model
print(paste("Correlation between Precision and len_candidate (BERT):", cor_precision_len_candidate_bert))
print(paste("Correlation between Recall and len_reference (BERT):", cor_recall_len_reference_bert))
print(paste("Correlation between F1 and len_candidate (BERT):", cor_f1_len_candidate_bert))
print(paste("Correlation between F1 and len_reference (BERT):", cor_f1_len_reference_bert))
```
* Precision vs len_candidate: the correlation is -0.309, similar to GPT, indicating a weak negative relationship.
* Recall vs len_reference: the correlation is -0.561, very close to the GPT value.
* F1 vs len_candidate: the correlation is -0.213, showing a weak negative relationship between BERT's F1 score and the number of candidate features.
* F1 vs len_reference: the correlation is -0.140, indicating a weak negative relationship.
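
To make the GPT-vs-BERT comparison easier to scan, the sketch below collects the correlation values already computed above into one table. The table itself is an addition; it uses only objects created earlier in this notebook.

```{r}
# Supplementary sketch: side-by-side view of the correlations computed above.
cor_compare <- data.frame(
  Relationship = c("Precision vs candidate_len", "Recall vs reference_len",
                   "F1 vs candidate_len", "F1 vs reference_len"),
  GPT = c(cor_precision_len_candidate, cor_recall_len_reference,
          cor_f1_len_candidate, cor_f1_len_reference),
  BERT = c(cor_precision_len_candidate_bert, cor_recall_len_reference_bert,
           cor_f1_len_candidate_bert, cor_f1_len_reference_bert)
)
kable(cor_compare, digits = 3, caption = "Correlations with candidate/reference length: GPT vs BERT match models")
```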

Visualization of these relationships for the BERT match model: each panel shows a scatter plot with a fitted regression line whose slope reflects the direction and rough strength of the relationship.
```{r}
# Precision vs len_candidate plot
p1 <- ggplot(combined_data_bert, aes(x = candidate_len, y = precision)) +
geom_point(color = "lightgreen", alpha = 0.6) +
geom_smooth(method = "lm", color = "black") +
labs(title = "Precision vs len_candidate", x = "len_candidate", y = "Precision") +
theme_minimal()
# Recall vs len_reference plot
p2 <- ggplot(combined_data_bert, aes(x = reference_len, y = recall)) +
geom_point(color = "lightblue", alpha = 0.6) +
geom_smooth(method = "lm", color = "black") +
labs(title = "Recall vs len_reference", x = "len_reference", y = "Recall") +
theme_minimal()
# F1 Score vs len_candidate (F1 vs len_reference could be plotted the same way)
p3 <- ggplot(combined_data_bert, aes(x = candidate_len, y = f1)) +
  geom_point(color = "salmon", alpha = 0.6) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "F1 Score vs len_candidate", x = "len_candidate", y = "F1 Score") +
  theme_minimal()
# Arrange the plots in a grid
library(gridExtra)
grid.arrange(p1, p2, p3, ncol = 3, top = "Correlation (BERT): Precision vs len_candidate, Recall vs len_reference, F1 vs len_candidate")
```


### Discussion of results
Both GPT and BERT exhibit similar patterns. As the number of candidate features increases, precision tends to decrease slightly. Recall is more significantly affected by the number of reference features. When there are more reference features, both models struggle to capture all of them, which leads to a decrease in recall. The F1 score, which balances precision and recall, shows weaker correlations overall. This suggests that neither the number of candidate nor the number of reference features has a strong influence on the overall balance between precision and recall. The correlation values indicate that neither model consistently outperforms the other based on these metrics, but both exhibit challenges when dealing with larger sets of candidate or reference features.
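
As a rough joint check of the claim that neither length strongly drives F1, a simple linear model of F1 on both length measures can be fit for each match model. This is an added sketch, not part of the original analysis; it assumes the combined_data_gpt and combined_data_bert frames built earlier.

```{r}
# Supplementary sketch: F1 regressed on both length measures for each match model.
summary(lm(f1 ~ candidate_len + reference_len, data = combined_data_gpt))$coefficients
summary(lm(f1 ~ candidate_len + reference_len, data = combined_data_bert))$coefficients
```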




## Summary and next steps

1. Paired t-test results: Precision: GPT has higher precision than BERT, with a mean paired difference of about 0.106. The t statistic is large (t = 29.36) and the p-value is very small (< 2.2e-16), so the difference is statistically significant; GPT outperforms BERT in terms of precision. F1 Score: GPT also shows a statistically significant advantage in F1 score (mean difference of about 0.099, t = 27.13), indicating a better overall balance between precision and recall. Recall likewise shows a significant difference, with GPT about 0.127 higher on average than BERT; this again favors GPT, suggesting it is better at identifying true positive features.

2. Correlation Test Results: Based on the correlation tests, both models show similar weaknesses in handling large sets of candidate and reference features, but the differences are relatively minor. The correlations do not significantly alter the interpretation of the t-test results.

Overall, GPT appears to be the better performing model for the tasks in this analysis. While both models face challenges when dealing with large feature sets, GPT maintains higher precision, recall, and F1 score, making it a more robust option for this feature-matching task.

For the next step, I'm going to focus fully on the statistics part: I will use the Mann-Whitney U test on precision, recall, and F1 score to find out which generating model (GPT or LLaMA) works better within each of the match models tested in this assignment. A hedged sketch of how that test could be set up is shown below.
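
A minimal sketch of the planned comparison, under the assumption that the generating-model names in updated.responses.df can be split by matching "gpt" and "llama" in the model column; since the actual names may differ, the chunk is not evaluated here.

```{r, eval=FALSE}
# Hypothetical split: assumes generating-model names contain "gpt" or "llama".
gpt_resp   <- subset(updated.responses.df,
                     match_model == "gpt" & grepl("gpt", model, ignore.case = TRUE))
llama_resp <- subset(updated.responses.df,
                     match_model == "gpt" & grepl("llama", model, ignore.case = TRUE))
# Mann-Whitney U test (wilcox.test with paired = FALSE) on F1;
# repeat for precision and recall, and for match_model == "BERT".
wilcox.test(gpt_resp$f1, llama_resp$f1)
```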

911 changes: 911 additions & 0 deletions StudentNotebooks/Assignment04/zhaot4-f24-assignment4.html

Large diffs are not rendered by default.
