
Dar balajy #41

Merged
merged 2 commits on Nov 12, 2024
393 changes: 393 additions & 0 deletions StudentNotebooks/Assignment04/Balajy_assignment4.Rmd
@@ -0,0 +1,393 @@
---
title: "DAR F24 Project Status Notebook Week 4"
author: "Yashas Balaji"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: yes
  html_document:
    toc: yes
subtitle: "CTBench"
---

```{r setup, include=FALSE}
# Required R package installation; RUN THIS BLOCK BEFORE ATTEMPTING TO KNIT THIS NOTEBOOK!!!
# This section installs packages if they are not already installed.
# This block will not be shown in the knit file.
if (!require("knitr")) {
install.packages("knitr")
library(knitr)
}
knitr::opts_chunk$set(echo = TRUE)
# Set the default CRAN repository
local({r <- getOption("repos")
r["CRAN"] <- "http://cran.r-project.org"
options(repos=r)
})
if (!require("devtools")) {
install.packages("devtools")
library(devtools)
}
# For package conflict resolution (esp. dplyr functions)
# run con
if (!require("conflicted")) {
devtools::install_github("r-lib/conflicted")
library(conflicted)
}
# Required packages for CTEval analysis
if (!require("rmarkdown")) {
install.packages("rmarkdown")
library(rmarkdown)
}
if (!require("tidyverse")) {
install.packages("tidyverse")
library(tidyverse)
}
# Our preferences
conflicts_prefer(dplyr::summarize())
conflicts_prefer(dplyr::filter())
conflicts_prefer(dplyr::select())
conflicts_prefer(dplyr::mutate())
conflicts_prefer(dplyr::arrange())
if (!require("stringr")) {
install.packages("stringr")
library(stringr)
}
if (!require("ggbiplot")) {
install.packages("ggbiplot")
library(ggbiplot)
}
if (!require("pheatmap")) {
install.packages("pheatmap")
library(pheatmap)
}
if (!require("plotrix")) {
install.packages("plotrix")
library(plotrix)
}
if (!require("kableExtra")) {
install.packages("kableExtra")
library(kableExtra)
}
```
## Instructions (DELETE BEFORE SUBMISSION)

* Use this notebook as a template for your biweekly project status assignment.
* Use the sections starting with **BiWeekly Work Summary** as your outline for your submitted notebook.
* Summarize ALL of your work in this notebook; **if you don't show and/or link to your work here, it doesn't exist for us!**

1. Create a new copy of this notebook in the `AssignmentX` sub-directory of your team's github repository using the following naming convention

* `rcsid_assignmentX.Rmd` and `rcsid_assignmentX.pdf`
* For example, `bennek_assignment03.Rmd`

2. Document **all** the work you did on your assigned project this week **using the outline below.**

3. You MUST include figures and/or tables to illustrate your work. *Screen shots are okay*, but include something!

4. You MUST include links to other important resources (knitted HTML files, Shiny apps). See the guide below for help.

5. Commit the source (`.Rmd`) and knitted (`.html`) versions of your notebook and push to github

6. **Submit a pull request.** Please notify Dr. Erickson if you don't see your notebook merged within one day.

7. **DO NOT MERGE YOUR PULL REQUESTS YOURSELF!!**

See the Grading Rubric for guidance on how the contents of this notebook will be graded on LMS or GradeScope.

## Weekly Work Summary

**NOTE:** Follow an outline format; use bullets to express individual points.

* RCS ID: **Always** include this!
* Project Name: **Always** include this!
* Summary of work since last week

* Describe the important aspects of what you worked on and accomplished

* NEW: Summary of github issues added and worked

* Issues that you've submitted
* Issues that you've self-assigned and addressed

* Summary of github commits

* include branch name(s)
* include browsable links to all external files on github
* Include links to shared Shiny apps

* List of presentations, papers, or other outputs

* Include browsable links

* List of references (if necessary)
* Indicate any use of group shared code base
* Indicate which parts of your described work were done by you or as part of joint efforts

* **Required:** Provide illustrating figures and/or tables

## Personal Contribution

* Clearly defined, unique contribution(s) done by you: code, ideas, writing...
* Include github issues you've addressed if any

## Analysis: Question 1: How do the LLMs perform differently in terms of precision, recall, and F1 scores?

### Question being asked

We are investigating whether different large language models (LLMs), such as GPT-4o and LLaMa3-70B, differ significantly in terms of precision, recall, and F1 scores when generating baseline descriptors for clinical trials. Specifically, the goal is to identify if one model consistently outperforms the other and whether these differences are statistically significant across multiple trials.

### Data Preparation

Here I am using the updated data that Corey reran, selecting only the columns I need for this graph.

```{r, result01_data}
# Include all data processing code (if necessary), clearly commented
pub_data <- readRDS("../../CTBench_source/corrected_data/ct_pub/CT_Pub_data_updated.Rds")
pub_matches <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.matches_gpt_updated.Rds")
pub_responses <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.responses_updated.Rds")
subgroup_data <- pub_responses %>%
  select(trial_group, model, precision, recall, f1) %>%
  mutate(trial_group = as.factor(trial_group),
         model = as.factor(model))
performance_data <- pub_responses %>%
  select(model, precision, recall, f1) %>%
  mutate(model = as.factor(model))
performance_data_long <- performance_data %>%
  gather(key = "Metric", value = "Score", precision, recall, f1)
```

### Analysis: Methods and results

The goal of this analysis is to compare the performance of two LLMs, GPT-4o and LLaMa3-70B, under both three-shot and zero-shot prompting, in terms of precision, recall, and F1 scores across different clinical trials. I visually inspect the distributions of these performance metrics to determine whether there are any notable differences between the models.


```{r, result01_analysis}
# Violin plot with x-axis labels rotated
ggplot(performance_data_long, aes(x = model, y = Score, fill = Metric)) +
  geom_violin(alpha = 0.3) +
  geom_boxplot(width = 0.2, position = position_dodge(width = 0.75)) +
  facet_wrap(~ Metric, scales = "free") +
  labs(title = "Distribution of Precision, Recall, and F1 Scores by Model", x = "Model", y = "Score") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1)) # Rotating x-axis labels
```
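
The question above also asks whether any differences between the models are statistically significant, which the violin plots alone can't establish. As a minimal sketch of one way to check this (not part of the original analysis, and assuming `performance_data` from the data-preparation chunk), pairwise Wilcoxon rank-sum tests on the F1 scores could be run:

```{r, result01_significance_sketch, eval=FALSE}
# Sketch only: pairwise Wilcoxon rank-sum tests comparing F1 scores
# between every pair of models, with Holm correction for multiple testing.
pairwise.wilcox.test(performance_data$f1,
                     performance_data$model,
                     p.adjust.method = "holm")
```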

### Discussion of results

The results are very similar to last week's, with fewer outliers. Once again LLaMa3 zero-shot beats every other model, but this time by a smaller margin than before. There is now one clear worst model: GPT-4o zero-shot.

## Analysis: Question 2 (Subgroup analysis)

### Question being asked

Are certain disease types (e.g., diabetes, cancer) harder for the models to predict than others? My goal is to determine whether the F1 scores of the models vary across different disease types. Specifically, I want to identify whether certain disease types are more challenging and whether these differences are consistent across models.

### Data Preparation

Here, once again, I select only the columns I need from the ct_pub responses data.

```{r, result02_data}
# Include all data processing code (if necessary), clearly commented
subgroup_data <- pub_responses %>%
  select(trial_group, model, precision, recall, f1) %>%
  mutate(trial_group = as.factor(trial_group),
         model = as.factor(model))
```

### Analysis: Methods and Results

The data was grouped by trial group (which represents different disease types) and model. We calculated the mean F1 score for each combination of disease type and model. The F1 score is a harmonic mean of precision and recall, and it provides a balanced measure that takes into account both false positives and false negatives, making it a suitable metric for this analysis.

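For reference, a minimal illustration of that harmonic mean with made-up precision and recall values (not data from this project):

```{r, f1_definition_sketch, eval=FALSE}
# Illustrative only: F1 as the harmonic mean of precision and recall.
precision_ex <- 0.80  # hypothetical precision
recall_ex    <- 0.60  # hypothetical recall
f1_ex <- 2 * precision_ex * recall_ex / (precision_ex + recall_ex)
f1_ex  # approximately 0.686
```
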
To visualize performance across disease types, I plot per-disease error bars. For each disease, the error bar shows the mean F1 score across all models plus or minus one standard deviation, and the colored points mark the average F1 score of each individual model on that disease.

```{r, result02_analysis}
# Ensure that the necessary libraries are loaded
# Step 1: Group the data by disease type and calculate the overall mean F1 score across all models
# Assuming subgroup_data is the dataset containing F1 scores
grouped_data_avg <- subgroup_data %>%
  group_by(trial_group) %>%
  summarise(mean_f1_all_models = mean(f1, na.rm = TRUE),
            sd_f1_all_models = sd(f1, na.rm = TRUE),
            .groups = "drop") # Drop grouping after summarising

# Step 2: Rank the diseases based on the overall mean F1 score
# Select the top 10 best-performing diseases
best_diseases <- grouped_data_avg %>%
  arrange(desc(mean_f1_all_models)) %>%
  slice(1:10) %>%
  pull(trial_group) # Extract disease names

# Select the bottom 10 worst-performing diseases
worst_diseases <- grouped_data_avg %>%
  arrange(mean_f1_all_models) %>%
  slice(1:10) %>%
  pull(trial_group) # Extract disease names

# Step 3: Filter the original dataset for top 10 and bottom 10 diseases
best_data  <- subgroup_data %>% filter(trial_group %in% best_diseases)
worst_data <- subgroup_data %>% filter(trial_group %in% worst_diseases)

# Step 4: Recalculate the F1 scores for each model within the top and bottom 10 diseases
plot_f1_scores <- function(data, title) {
  grouped_data_avg_subset <- data %>%
    group_by(trial_group) %>%
    summarise(mean_f1_all_models = mean(f1, na.rm = TRUE),
              sd_f1_all_models = sd(f1, na.rm = TRUE),
              .groups = "drop")
  data_combined <- data %>%
    group_by(trial_group, model) %>%
    summarise(mean_f1 = mean(f1, na.rm = TRUE), .groups = "drop") %>%
    left_join(grouped_data_avg_subset, by = "trial_group")
  # Step 5: Plot with error bars representing the mean F1 score ± SD across all models
  ggplot(data_combined, aes(x = trial_group, y = mean_f1, group = model, color = model)) +
    geom_point(size = 3) + # Points for each model-specific score
    geom_line(aes(group = model)) + # Lines connecting model scores within each disease group
    # Add error bars for mean F1 ± SD
    geom_errorbar(aes(ymin = mean_f1_all_models - sd_f1_all_models,
                      ymax = mean_f1_all_models + sd_f1_all_models),
                  width = 0.2, color = "black") +
    # Highlight the mean F1 score with a tick mark or point
    geom_point(aes(y = mean_f1_all_models), shape = 3, size = 4, color = "black") +
    labs(title = title, x = "Disease Type", y = "Mean F1 Score") +
    theme_minimal() +
    theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1))
}

# Step 6: Plot the top 10 best-performing diseases
plot_best <- plot_f1_scores(best_data, "F1 Scores by Disease Type and Model with Average Error Bars")
plot_best
```
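
Since `worst_data` is computed above but never plotted, a natural companion view would reuse the `plot_f1_scores` helper on the bottom 10 diseases (a sketch, not part of the original notebook):

```{r, result02_worst_sketch, eval=FALSE}
# Sketch: the same error-bar plot for the 10 worst-performing diseases,
# reusing the plot_f1_scores() helper defined above.
plot_worst <- plot_f1_scores(worst_data,
                             "F1 Scores for the 10 Worst-Performing Disease Types")
plot_worst
```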

### Discussion of results

Here we see that the error bars are very large, so the distribution of the data is very wide. This is surprising given that the means are not very high; such wide error bars could indicate that performance depends more on the individual trial than on the disease, although more analysis would be needed to confirm that. We also once again see LLaMa3 outperforming GPT-4o, with GPT-4o scoring the worst. Once the data is rerun and the evaluation is fixed, it would be interesting to repeat this analysis.

## Analysis: Question 3 (Hallucination Stats)

### Question being asked

How much do the models hallucinate? And are there any patterns among these hallucinations?

### Data Preparation

Here I re-read the data for the ct_pub data frames.

```{r, result03_data}
pub_data <- readRDS("../../CTBench_source/corrected_data/ct_pub/CT_Pub_data_updated.Rds")
pub_matches <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.matches_gpt_updated.Rds")
pub_responses <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.responses_updated.Rds")
```

### Analysis methods used

Right now, this code is bugged: it attempts to find hallucinations by cross-referencing the matched terms against the original reference and candidate lists, but the comparison fails. I will have this fixed soon; I believe the bug is occurring because of a spacing issue, where the terms in the vectors don't always have the same whitespace before and after them.
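
If the mismatch really is leading/trailing whitespace, one possible fix would be to normalize both the original lists and the matched terms before the membership checks. A minimal sketch (not yet applied to the code below):

```{r, whitespace_fix_sketch, eval=FALSE}
# Sketch of the suspected fix: strip backticks, collapse runs of whitespace,
# and trim leading/trailing spaces before any %in% comparisons.
normalize_terms <- function(x) {
  x <- gsub("`", "", x)      # drop backticks, as the code below already does
  x <- gsub("\\s+", " ", x)  # collapse repeated whitespace
  trimws(x)                  # remove leading/trailing spaces
}
# e.g. original_reference <- normalize_terms(original_reference)
#      selected_reference <- normalize_terms(selected_reference)
```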


```{r, result03_analysis}
# Initialize output data frame
output_df <- data.frame(
  trial = character(),
  model = character(),
  num_matches = integer(),
  original_reference = list(),
  original_candidate = list(),
  reference_hallucinations = list(),
  candidate_hallucinations = list(),
  reference_duplicate_matches = list(),
  candidate_duplicate_matches = list(),
  stringsAsFactors = FALSE
)
# Loop through each unique trial in pub_data
for (trial in unique(pub_data$NCTId)) {
  # Extract the corresponding reference and candidate features for the current trial
  original_reference <- unlist(strsplit(as.character(pub_data$Paper_BaselineMeasures_Corrected[pub_data$NCTId == trial]), ","))
  original_candidate <- unlist(strsplit(as.character(pub_responses$gen_response[pub_responses$trial_id == trial]), ","))
  original_reference <- gsub("`", "", original_reference)
  original_candidate <- gsub("`", "", original_candidate)
  # Get all rows from pub_matches corresponding to the current trial
  trial_df2 <- subset(pub_matches, trial_id == trial)
  # Now loop through each unique model for the current trial
  for (m in unique(trial_df2$model)) {
    # Filter the matches for the current trial and model
    trial_model_df2 <- subset(trial_df2, model == m)
    # Initialize counters and lists for this trial-model pair
    num_matches <- 0
    reference_hallucinations <- c()
    candidate_hallucinations <- c()
    reference_duplicate_matches <- c()
    candidate_duplicate_matches <- c()
    # Vectors to track occurrences for duplicate detection
    reference_occurrences <- c()
    candidate_occurrences <- c()
    # Iterate through each row in trial_model_df2 (each match attempt)
    for (row in 1:nrow(trial_model_df2)) {
      selected_reference <- trial_model_df2$reference[row]
      selected_candidate <- trial_model_df2$candidate[row]
      if (!is.na(selected_reference) && !is.na(selected_candidate)) {
        num_matches <- num_matches + 1
      }
      # Track hallucinations: matched terms that do not appear in the originals
      if (!(selected_reference %in% original_reference) && !is.na(selected_reference)) {
        reference_hallucinations <- c(reference_hallucinations, selected_reference)
      }
      if (!(selected_candidate %in% original_candidate) && !is.na(selected_candidate)) {
        candidate_hallucinations <- c(candidate_hallucinations, selected_candidate)
      }
      # Track duplicates
      reference_occurrences <- c(reference_occurrences, selected_reference)
      candidate_occurrences <- c(candidate_occurrences, selected_candidate)
    }
    # Find duplicates by checking for more than one occurrence
    reference_duplicate_matches <- reference_occurrences[duplicated(reference_occurrences) & !is.na(reference_occurrences)]
    candidate_duplicate_matches <- candidate_occurrences[duplicated(candidate_occurrences) & !is.na(candidate_occurrences)]
    # Add the results to output_df
    output_df <- rbind(output_df, data.frame(
      trial = trial,
      model = m,
      num_matches = num_matches,
      original_reference = I(list(original_reference)),
      original_candidate = I(list(original_candidate)),
      reference_hallucinations = I(list(reference_hallucinations)),
      candidate_hallucinations = I(list(candidate_hallucinations)),
      reference_duplicate_matches = I(list(reference_duplicate_matches)),
      candidate_duplicate_matches = I(list(candidate_duplicate_matches)),
      stringsAsFactors = FALSE
    ))
  }
}
# View the output
print(head(output_df))
```
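
Once the matching bug is fixed, a simple way to quantify how much each model hallucinates would be to count the entries in the list columns and aggregate by model. A sketch, assuming `output_df` is populated as above:

```{r, result03_summary_sketch, eval=FALSE}
# Sketch: average number of hallucinated terms per trial, by model.
hallucination_summary <- output_df %>%
  mutate(n_candidate_hallucinations = lengths(candidate_hallucinations),
         n_reference_hallucinations = lengths(reference_hallucinations)) %>%
  group_by(model) %>%
  summarise(mean_candidate_hallucinations = mean(n_candidate_hallucinations),
            mean_reference_hallucinations = mean(n_reference_hallucinations),
            .groups = "drop")
hallucination_summary
```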


