assignment 4 #21

Open · wants to merge 1 commit into base: main
339 changes: 339 additions & 0 deletions StudentNotebooks/Assignment04/zhaot4-f24-assignment4.Rmd
@@ -0,0 +1,339 @@
---
title: "DAR F24 Project Status Notebook(assignment 4)"
author: "Tianhao Zhao"
date: "`r Sys.Date()`"
output:
html_document:
toc: yes
pdf_document:
toc: yes
subtitle: "CTBench assignment 4"
---

## Weekly Work Summary

**NOTE:** Follow an outline format; use bullets to express individual points.

* RCS ID: zhaot4
* Project Name: CTEval
* Summary of work since last week

* I mainly worked on using the paired t-test to compare the classification performance scores (precision, recall, F1) between the GPT and BERT match models. After that, I tried to find how the candidate and reference lengths correlate with classification performance for these two match models.

* NEW: Summary of github issues added and worked

* Issues that you've submitted
* Issues that you've self-assigned and addressed

* Summary of github commits

* dar-zhaot4


* List of presentations, papers, or other outputs

* Include browsable links

* List of references (if necessary)
* Indicate any use of group shared code base
* Indicate which parts of your described work were done by you or as part of joint efforts

The figure layout that combines three scatter plots into one larger plot was adapted from Yashas Balaji's Assignment 3; it makes the comparison more direct.


* **Required:** Provide illustrating figures and/or tables

## Personal Contribution

* Clearly defined, unique contribution(s) done by you: code, ideas, writing...
* Include github issues you've addressed if any

## Analysis: Match model performance metrics between GPT and BERT

### Question being asked
Compare the classification performance scores (precision, recall, F1) between the GPT and BERT match models using a paired t-test, in order to identify which model is more accurate.


### Data Preparation
Load the updated ct_pub trials.responses data set into a data frame. Then filter the data for the two match models (GPT and BERT) and merge the two subsets into a combined data frame by matching on trial_id.

```{r, result01_data}
# Load the updated trials.responses data
updated.responses.df <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.responses_updated.Rds")
# Convert trial_group and model to factors
updated.responses.df$trial_group <- as.factor(updated.responses.df$trial_group)
updated.responses.df$model <- as.factor(updated.responses.df$model)
# check out the size
dim(updated.responses.df)
# Filter the data for the match-models (e.g., gpt vs BERT)
gpt_data <- subset(updated.responses.df, match_model == "gpt")
bert_data <- subset(updated.responses.df, match_model == "BERT")
# Ensure the data is paired by matching on trial_id
merged_data <- merge(gpt_data, bert_data, by = "trial_id", suffixes = c("_gpt", "_bert"))
```

### Analysis: Methods and results

Use the paired t-test function to compare the classification performance scores (F1, precision, and recall) between the GPT and BERT match models, then use the kable function to present the three paired t-test results in a single, readable table.

F1 score:
```{r, result01_analysis}
# Load the knitr package (used for kable below)
library(knitr)
# Paired t-test based on F1 score
t_test_result <- t.test(merged_data$f1_gpt, merged_data$f1_bert, paired = TRUE)
```
Precision:
```{r}
#Paired t-test based on the precision
t_test_result_precision <- t.test(merged_data$precision_gpt, merged_data$precision_bert, paired = TRUE)
```

Recall:
```{r}
#Paired t-test based on the recall
t_test_result_recall <- t.test(merged_data$recall_gpt, merged_data$recall_bert, paired = TRUE)
```


Display the combined table, which contains the paired t-test results for each metric (precision, F1 score, and recall):
```{r}
# Combine the precision, F1, and recall t-test results into a single data frame
t_test_combined <- data.frame(
  Metric = c("Precision", "F1 Score", "Recall"),
  Statistic = c(t_test_result_precision$statistic, t_test_result$statistic, t_test_result_recall$statistic),
  DF = c(t_test_result_precision$parameter, t_test_result$parameter, t_test_result_recall$parameter),
  p_value = c(t_test_result_precision$p.value, t_test_result$p.value, t_test_result_recall$p.value),
  CI_Lower = c(t_test_result_precision$conf.int[1], t_test_result$conf.int[1], t_test_result_recall$conf.int[1]),
  CI_Upper = c(t_test_result_precision$conf.int[2], t_test_result$conf.int[2], t_test_result_recall$conf.int[2]),
  Mean_Difference = c(t_test_result_precision$estimate, t_test_result$estimate, t_test_result_recall$estimate)
)
# Display the combined table using kable for formatting
kable(t_test_combined, caption = "Paired t-test Results for Precision, F1 Score, and Recall (GPT vs BERT)")
```
Precision:

* Mean difference = 0.1057: on average, GPT's precision is approximately 0.106 (about 10.6 percentage points) higher than BERT's precision across the trials.
* p-value ≈ 0: far smaller than 0.05, so the difference in precision between the GPT and BERT models is statistically significant.
* Confidence interval (0.0986, 0.1127): since both bounds are positive, GPT's precision is consistently higher than BERT's, by an amount between 0.0986 and 0.1127.

F1 score:

* Mean difference = 0.0993: on average, GPT's F1 score is approximately 0.099 (about 9.9 percentage points) higher than BERT's F1 score across the trials.
* p-value ≈ 0: far smaller than 0.05, so the difference in F1 score between the two models is statistically significant.
* Confidence interval (0.0921, 0.1065): since both bounds are positive, GPT's F1 score is consistently higher than BERT's, by an amount between 0.0921 and 0.1065.

Recall:

* Mean difference = 0.1270: on average, GPT's recall is approximately 0.127 (about 12.7 percentage points) higher than BERT's recall across the trials.
* p-value ≈ 0: far smaller than 0.05, so the difference in recall between the two models is statistically significant.
* Confidence interval (0.1180, 0.1359): since both bounds are positive, GPT's recall is consistently higher than BERT's, by an amount between 0.1180 and 0.1359.
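
To complement the p-values with a sense of practical magnitude, a paired-sample effect size can be computed as the mean difference divided by the standard deviation of the differences (Cohen's d for paired data). The sketch below is an addition, not part of the original analysis; it only assumes the merged_data columns created above.

```{r}
# Supplementary sketch (assumes merged_data with *_gpt / *_bert columns, built above):
# Cohen's d for paired samples = mean(difference) / sd(difference)
paired_d <- function(x, y) {
  d <- x - y
  mean(d, na.rm = TRUE) / sd(d, na.rm = TRUE)
}
effect_sizes <- data.frame(
  Metric = c("Precision", "F1 Score", "Recall"),
  Cohens_d = c(
    paired_d(merged_data$precision_gpt, merged_data$precision_bert),
    paired_d(merged_data$f1_gpt, merged_data$f1_bert),
    paired_d(merged_data$recall_gpt, merged_data$recall_bert)
  )
)
kable(effect_sizes, digits = 3, caption = "Paired effect sizes (GPT minus BERT), added as a supplementary check")
```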


### Discussion of results

Based on the paired t-test results, the GPT match model performs better when we focus only on the classification performance scores. For all three metrics (precision, F1 score, and recall), the p-values are effectively 0, indicating that the observed differences between GPT and BERT are highly statistically significant. These results provide strong evidence that GPT outperforms BERT across all three performance metrics. The mean differences suggest that GPT performs about 10% better in terms of precision and F1 score and about 12.7% better in terms of recall compared to BERT. Based on these results, GPT shows a clear advantage over BERT in identifying positive instances (recall), achieving a better balance between precision and recall (F1 score), and producing fewer false positives (precision). These significant differences suggest that GPT may be the more suitable model for tasks that require high precision, recall, and F1 score.


## Analysis: Compare Feature Generation and Model Performance (model comparison)


### Question being asked
Find the correlation between the classification performance metrics (precision, recall, F1) in the responses data and the candidate and reference lengths, in order to see which match model holds up better as the number of reference and candidate features grows.


### Data Preparation

First, load the matches data for both the GPT and BERT match models. Then count the actual number of reference and candidate features in each model's data set and merge those counts into the responses data. Finally, combine the new responses data with the matches data for each match model to obtain the data used in this analysis.

```{r, result02_data}
library(tidyr)
library(ggplot2)
library(dplyr)
library(stringr)
# Load the GPT trials.matches data
gpt_matches.df <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.matches_gpt_updated.Rds")
# Convert model to a factor
gpt_matches.df$model <- as.factor(gpt_matches.df$model)
dim(gpt_matches.df)
# Load the BERT trials.matches data
bert_matches.df <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.matches_BERT_updated.Rds")
# Convert model to a factor
bert_matches.df$model <- as.factor(bert_matches.df$model)
# Count non-NA values in reference and candidate for each trial_id and model
reference_candidate_counts_bert <- bert_matches.df %>%
group_by(trial_id, model) %>%
summarize(
reference_len = sum(!is.na(reference)), # Count non-NA values in reference
candidate_len = sum(!is.na(candidate)), # Count non-NA values in candidate
.groups = 'drop' # Prevents the warning about grouping
)
reference_candidate_counts_gpt <- gpt_matches.df %>%
group_by(trial_id, model) %>%
summarize(
reference_len = sum(!is.na(reference)), # Count non-NA values in reference
candidate_len = sum(!is.na(candidate)), # Count non-NA values in candidate
.groups = 'drop' # Prevents the warning about grouping
)
# Correct model names in the BERT counts (replace "in" with "it") so they match the responses data
reference_candidate_counts_bert <- reference_candidate_counts_bert %>%
  mutate(model = str_replace_all(model, "in", "it"))
# Correct model names in the GPT counts (replace "in" with "it") so they match the responses data
reference_candidate_counts_gpt <- reference_candidate_counts_gpt %>%
  mutate(model = str_replace_all(model, "in", "it"))
# Merge the reference/candidate counts into the BERT responses
bert_with_counts <- bert_data %>%
  left_join(reference_candidate_counts_bert, by = c("trial_id", "model"))
# Merge the reference/candidate counts into the GPT responses
gpt_with_counts <- gpt_data %>%
  left_join(reference_candidate_counts_gpt, by = c("trial_id", "model"))
# Merge the gpt matches with the gpt responses
combined_data_gpt <- merge(gpt_matches.df, gpt_with_counts, by = c("trial_id", "model"), all = TRUE)
# Merge the BERT matches with the BERT responses
combined_data_bert <- merge(bert_matches.df, bert_with_counts, by = c("trial_id", "model"), all = TRUE)
```




### Analysis: Methods and Results


Compute the correlation values (using the cor function) between candidate/reference length and the classification performance metrics, using GPT and BERT as the match models.

GPT:
```{r}
# Correlation between precision and len_candidate
cor_precision_len_candidate <- cor(combined_data_gpt$precision, combined_data_gpt$candidate_len, use = "complete.obs")
# Correlation between recall and len_reference
cor_recall_len_reference <- cor(combined_data_gpt$recall, combined_data_gpt$reference_len, use = "complete.obs")
# Correlation between F1 and both len_candidate and len_reference
# F1 depends on both len_candidate (via precision) and len_reference (via recall), so we calculate both
cor_f1_len_candidate <- cor(combined_data_gpt$f1, combined_data_gpt$candidate_len, use = "complete.obs")
cor_f1_len_reference <- cor(combined_data_gpt$f1, combined_data_gpt$reference_len, use = "complete.obs")
# Print the correlations for analysis
print(paste("Correlation between Precision and len_candidate:", cor_precision_len_candidate))
print(paste("Correlation between Recall and len_reference:", cor_recall_len_reference))
print(paste("Correlation between F1 and len_candidate:", cor_f1_len_candidate))
print(paste("Correlation between F1 and len_reference:", cor_f1_len_reference))
```
* Precision vs len_candidate: the correlation is -0.306, indicating a weak negative relationship between GPT's precision and the number of candidate features.
* Recall vs len_reference: the correlation is -0.577, indicating a moderate negative relationship.
* F1 vs len_candidate: the correlation is -0.226, indicating a weak negative relationship.
* F1 vs len_reference: the correlation is -0.040, indicating almost no relationship between the F1 score and the number of reference features.
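
As a supplementary check (not part of the original analysis), cor.test() attaches a confidence interval and p-value to the point correlations above, which helps qualify the "weak"/"moderate" labels. The sketch only assumes the combined_data_gpt frame built in the data-preparation chunk; cor.test() drops incomplete cases on its own.

```{r}
# Supplementary sketch: significance and confidence intervals for two of the
# correlations reported above (GPT match model).
cor.test(combined_data_gpt$precision, combined_data_gpt$candidate_len)
cor.test(combined_data_gpt$recall, combined_data_gpt$reference_len)
```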

Visualization of these relationships for the GPT match model: each panel shows a scatter plot with a fitted regression line whose slope reflects the direction and rough strength of the relationship.
```{r,warning=FALSE}
# Precision vs len_candidate plot
p1 <- ggplot(combined_data_gpt, aes(x = candidate_len, y = precision)) +
geom_point(color = "lightgreen", alpha = 0.6) +
geom_smooth(method = "lm", color = "black") +
labs(title = "Precision vs len_candidate", x = "len_candidate", y = "Precision") +
theme_minimal()
# Recall vs len_reference plot
p2 <- ggplot(combined_data_gpt, aes(x = reference_len, y = recall)) +
geom_point(color = "lightblue", alpha = 0.6) +
geom_smooth(method = "lm", color = "black") +
labs(title = "Recall vs len_reference", x = "len_reference", y = "Recall") +
theme_minimal()
# F1 Score vs len_candidate (F1 vs len_reference could be plotted the same way)
p3 <- ggplot(combined_data_gpt, aes(x = candidate_len, y = f1)) +
  geom_point(color = "salmon", alpha = 0.6) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "F1 Score vs len_candidate", x = "len_candidate", y = "F1 Score") +
  theme_minimal()
# Arrange the plots in a grid
library(gridExtra)
grid.arrange(p1, p2, p3, ncol = 3, top = "Correlation (GPT): Precision vs len_candidate, Recall vs len_reference, F1 vs len_candidate")
```

BERT:

```{r}
# Correlation between precision and len_candidate for BERT
cor_precision_len_candidate_bert <- cor(combined_data_bert$precision, combined_data_bert$candidate_len, use = "complete.obs")
# Correlation between recall and len_reference for BERT
cor_recall_len_reference_bert <- cor(combined_data_bert$recall, combined_data_bert$reference_len, use = "complete.obs")
# Correlation between F1 and both len_candidate and len_reference for BERT
cor_f1_len_candidate_bert <- cor(combined_data_bert$f1, combined_data_bert$candidate_len, use = "complete.obs")
cor_f1_len_reference_bert <- cor(combined_data_bert$f1, combined_data_bert$reference_len, use = "complete.obs")
# Print the correlations for BERT model
print(paste("Correlation between Precision and len_candidate (BERT):", cor_precision_len_candidate_bert))
print(paste("Correlation between Recall and len_reference (BERT):", cor_recall_len_reference_bert))
print(paste("Correlation between F1 and len_candidate (BERT):", cor_f1_len_candidate_bert))
print(paste("Correlation between F1 and len_reference (BERT):", cor_f1_len_reference_bert))
```
* Precision vs len_candidate: the correlation is -0.309, similar to GPT, indicating a weak negative relationship.
* Recall vs len_reference: the correlation is -0.561, very close to the GPT value.
* F1 vs len_candidate: the correlation is -0.213, showing a weak negative relationship between BERT's F1 score and the number of candidate features.
* F1 vs len_reference: the correlation is -0.140, indicating a weak negative relationship.
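
To make the GPT-vs-BERT comparison easier to scan, the sketch below collects the correlation values already computed above into one table. The table itself is an addition; it uses only objects created earlier in this notebook.

```{r}
# Supplementary sketch: side-by-side view of the correlations computed above.
cor_compare <- data.frame(
  Relationship = c("Precision vs candidate_len", "Recall vs reference_len",
                   "F1 vs candidate_len", "F1 vs reference_len"),
  GPT = c(cor_precision_len_candidate, cor_recall_len_reference,
          cor_f1_len_candidate, cor_f1_len_reference),
  BERT = c(cor_precision_len_candidate_bert, cor_recall_len_reference_bert,
           cor_f1_len_candidate_bert, cor_f1_len_reference_bert)
)
kable(cor_compare, digits = 3, caption = "Correlations with candidate/reference length: GPT vs BERT match models")
```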

Visualization of these relationships for the BERT match model: each panel shows a scatter plot with a fitted regression line whose slope reflects the direction and rough strength of the relationship.
```{r}
# Precision vs len_candidate plot
p1 <- ggplot(combined_data_bert, aes(x = candidate_len, y = precision)) +
geom_point(color = "lightgreen", alpha = 0.6) +
geom_smooth(method = "lm", color = "black") +
labs(title = "Precision vs len_candidate", x = "len_candidate", y = "Precision") +
theme_minimal()
# Recall vs len_reference plot
p2 <- ggplot(combined_data_bert, aes(x = reference_len, y = recall)) +
geom_point(color = "lightblue", alpha = 0.6) +
geom_smooth(method = "lm", color = "black") +
labs(title = "Recall vs len_reference", x = "len_reference", y = "Recall") +
theme_minimal()
# F1 Score vs len_candidate (F1 vs len_reference could be plotted the same way)
p3 <- ggplot(combined_data_bert, aes(x = candidate_len, y = f1)) +
  geom_point(color = "salmon", alpha = 0.6) +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "F1 Score vs len_candidate", x = "len_candidate", y = "F1 Score") +
  theme_minimal()
# Arrange the plots in a grid
library(gridExtra)
grid.arrange(p1, p2, p3, ncol = 3, top = "Correlation (BERT): Precision vs len_candidate, Recall vs len_reference, F1 vs len_candidate")
```


### Discussion of results
Both GPT and BERT exhibit similar patterns. As the number of candidate features increases, precision tends to decrease slightly. Recall is more significantly affected by the number of reference features. When there are more reference features, both models struggle to capture all of them, which leads to a decrease in recall. The F1 score, which balances precision and recall, shows weaker correlations overall. This suggests that neither the number of candidate nor the number of reference features has a strong influence on the overall balance between precision and recall. The correlation values indicate that neither model consistently outperforms the other based on these metrics, but both exhibit challenges when dealing with larger sets of candidate or reference features.
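
As a rough joint check of the claim that neither length strongly drives F1, a simple linear model of F1 on both length measures can be fit for each match model. This is an added sketch, not part of the original analysis; it assumes the combined_data_gpt and combined_data_bert frames built earlier.

```{r}
# Supplementary sketch: F1 regressed on both length measures for each match model.
summary(lm(f1 ~ candidate_len + reference_len, data = combined_data_gpt))$coefficients
summary(lm(f1 ~ candidate_len + reference_len, data = combined_data_bert))$coefficients
```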




## Summary and next steps

1. Paired t-test results: Precision: GPT has higher precision than BERT, with a mean paired difference of about 0.106. The t statistic is large (t = 29.36) and the p-value is very small (< 2.2e-16), so the difference is statistically significant; GPT outperforms BERT in terms of precision. F1 Score: GPT also shows a statistically significant advantage in F1 score (mean difference of about 0.099, t = 27.13), indicating a better overall balance between precision and recall. Recall likewise shows a significant difference, with GPT about 0.127 higher on average than BERT; this again favors GPT, suggesting it is better at identifying true positive features.

2. Correlation Test Results: Based on the correlation tests, both models show similar weaknesses in handling large sets of candidate and reference features, but the differences are relatively minor. The correlations do not significantly alter the interpretation of the t-test results.

Overall, GPT appears to be the better performing model for the tasks in this analysis. While both models face challenges when dealing with large feature sets, GPT maintains higher precision, recall, and F1 score, making it a more robust option for this feature-matching task.

For the next step, I'm going to focus fully on the statistics part: I will use the Mann-Whitney U test on precision, recall, and F1 score to find out which generating model (GPT or LLaMA) works better within each of the match models tested in this assignment. A hedged sketch of how that test could be set up is shown below.
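
A minimal sketch of the planned comparison, under the assumption that the generating-model names in updated.responses.df can be split by matching "gpt" and "llama" in the model column; since the actual names may differ, the chunk is not evaluated here.

```{r, eval=FALSE}
# Hypothetical split: assumes generating-model names contain "gpt" or "llama".
gpt_resp   <- subset(updated.responses.df,
                     match_model == "gpt" & grepl("gpt", model, ignore.case = TRUE))
llama_resp <- subset(updated.responses.df,
                     match_model == "gpt" & grepl("llama", model, ignore.case = TRUE))
# Mann-Whitney U test (wilcox.test with paired = FALSE) on F1;
# repeat for precision and recall, and for match_model == "BERT".
wilcox.test(gpt_resp$f1, llama_resp$f1)
```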

911 changes: 911 additions & 0 deletions StudentNotebooks/Assignment04/zhaot4-f24-assignment4.html

Large diffs are not rendered by default.
