
Dar balajy #41

Merged
merged 2 commits on Nov 12, 2024
393 changes: 393 additions & 0 deletions StudentNotebooks/Assignment04/Balajy_assignment4.Rmd
@@ -0,0 +1,393 @@
---
title: "DAR F24 Project Status Notebook Week 4"
author: "Yashas Balaji"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: yes
  html_document:
    toc: yes
subtitle: "CTBench"
---

```{r setup, include=FALSE}
# Required R package installation; RUN THIS BLOCK BEFORE ATTEMPTING TO KNIT THIS NOTEBOOK!!!
# This section installs packages if they are not already installed.
# This block will not be shown in the knit file.
if (!require("knitr")) {
install.packages("knitr")
library(knitr)
}
knitr::opts_chunk$set(echo = TRUE)
# Set the default CRAN repository
local({r <- getOption("repos")
r["CRAN"] <- "http://cran.r-project.org"
options(repos=r)
})
if (!require("devtools")) {
install.packages("devtools")
library(devtools)
}
# For package conflict resolution (esp. dplyr functions)
# run con
if (!require("conflicted")) {
devtools::install_github("r-lib/conflicted")
library(conflicted)
}
# Required packages for CTEval analysis
if (!require("rmarkdown")) {
install.packages("rmarkdown")
library(rmarkdown)
}
if (!require("tidyverse")) {
install.packages("tidyverse")
library(tidyverse)
}
# Our preferences
conflicts_prefer(dplyr::summarize())
conflicts_prefer(dplyr::filter())
conflicts_prefer(dplyr::select())
conflicts_prefer(dplyr::mutate())
conflicts_prefer(dplyr::arrange())
if (!require("stringr")) {
install.packages("stringr")
library(stringr)
}
if (!require("ggbiplot")) {
install.packages("ggbiplot")
library(ggbiplot)
}
if (!require("pheatmap")) {
install.packages("pheatmap")
library(pheatmap)
}
if (!require("plotrix")) {
install.packages("plotrix")
library(plotrix)
}
if (!require("kableExtra")) {
install.packages("kableExtra")
library(kableExtra)
}
```
## Instructions (DELETE BEFORE SUBMISSION)

* Use this notebook as a template for your biweekly project status assignment.
* Use the sections starting with **BiWeekly Work Summary** as your outline for your submitted notebook.
* Summarize ALL of your work in this notebook; **if you don't show and/or link to your work here, it doesn't exist for us!**

1. Create a new copy of this notebook in the `AssignmentX` sub-directory of your team's github repository using the following naming convention

* `rcsid_assignmentX.Rmd` and `rcsid_assignmentX.pdf`
* For example, `bennek_assignment03.Rmd`

2. Document **all** the work you did on your assigned project this week **using the outline below.**

3. You MUST include figures and/or tables to illustrate your work. *Screen shots are okay*, but include something!

4. You MUST include links to other important resources (knitted HTML files, Shiny apps). See the guide below for help.

5. Commit the source (`.Rmd`) and knitted (`.html`) versions of your notebook and push to github

6. **Submit a pull request.** Please notify Dr. Erickson if you don't see your notebook merged within one day.

7. **DO NOT MERGE YOUR PULL REQUESTS YOURSELF!!**

See the Grading Rubric for guidance on how the contents of this notebook will be graded on LMS or GradeScope.

## Weekly Work Summary

**NOTE:** Follow an outline format; use bullets to express individual points.

* RCS ID: **Always** include this!
* Project Name: **Always** include this!
* Summary of work since last week

* Describe the important aspects of what you worked on and accomplished

* NEW: Summary of github issues added and worked

* Issues that you've submitted
* Issues that you've self-assigned and addressed

* Summary of github commits

* include branch name(s)
* include browsable links to all external files on github
* Include links to shared Shiny apps

* List of presentations, papers, or other outputs

* Include browsable links

* List of references (if necessary)
* Indicate any use of group shared code base
* Indicate which parts of your described work were done by you or as part of joint efforts

* **Required:** Provide illustrating figures and/or tables

## Personal Contribution

* Clearly defined, unique contribution(s) done by you: code, ideas, writing...
* Include github issues you've addressed if any

## Analysis: Question 1: How do the LLMs perform differently in terms of precision, recall, and F1 scores?

### Question being asked

We are investigating whether different large language models (LLMs), such as GPT-4o and LLaMa3-70B, differ significantly in terms of precision, recall, and F1 scores when generating baseline descriptors for clinical trials. Specifically, the goal is to identify if one model consistently outperforms the other and whether these differences are statistically significant across multiple trials.

### Data Preparation

Here I am using the updated data that Corey reran, selecting only the columns I need for this graph.

```{r, result01_data}
# Include all data processing code (if necessary), clearly commented
pub_data <- readRDS("../../CTBench_source/corrected_data/ct_pub/CT_Pub_data_updated.Rds")
pub_matches <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.matches_gpt_updated.Rds")
pub_responses <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.responses_updated.Rds")
subgroup_data <- pub_responses %>%
  select(trial_group, model, precision, recall, f1) %>%
  mutate(trial_group = as.factor(trial_group),
         model = as.factor(model))
performance_data <- pub_responses %>%
  select(model, precision, recall, f1) %>%
  mutate(model = as.factor(model))
performance_data_long <- performance_data %>%
  gather(key = "Metric", value = "Score", precision, recall, f1)
```

### Analysis: Methods and results

The goal of this analysis is to compare the performance of two LLMs, GPT-4o and LLaMa3-70B, under both three-shot and zero-shot prompting, in terms of precision, recall, and F1 scores across different clinical trials. I visually inspect the distributions of these performance metrics to determine whether there are any notable differences between the models.


```{r, result01_analysis}
# Violin plot with x-axis labels rotated
ggplot(performance_data_long, aes(x = model, y = Score, fill = Metric)) +
  geom_violin(alpha = 0.3) +
  geom_boxplot(width = 0.2, position = position_dodge(width = 0.75)) +
  facet_wrap(~ Metric, scales = "free") +
  labs(title = "Distribution of Precision, Recall, and F1 Scores by Model", x = "Model", y = "Score") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1)) # Rotating x-axis labels
```
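
The question above also asks whether any differences between the models are statistically significant, which the violin plots alone can't establish. As a minimal sketch of one way to check this (not part of the original analysis, and assuming `performance_data` from the data-preparation chunk), pairwise Wilcoxon rank-sum tests on the F1 scores could be run:

```{r, result01_significance_sketch, eval=FALSE}
# Sketch only: pairwise Wilcoxon rank-sum tests comparing F1 scores
# between every pair of models, with Holm correction for multiple testing.
pairwise.wilcox.test(performance_data$f1,
                     performance_data$model,
                     p.adjust.method = "holm")
```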

### Discussion of results

The results are very similar to last week's, with fewer outliers. Once again LLaMa3 zero-shot beats every other model, but this time by a smaller margin than before. There is now one clear worst model: GPT-4o zero-shot.

## Analysis: Question 2 (Subgroup analysis)

### Question being asked

Are certain disease types (e.g., diabetes, cancer) harder for the models to predict than others? My goal is to determine whether the F1 scores of the models vary across different disease types. Specifically, I want to identify whether certain disease types are more challenging and whether these differences are consistent across models.

### Data Preparation

Here, once again, I select only the columns I need from the ct_pub responses data.

```{r, result02_data}
# Include all data processing code (if necessary), clearly commented
subgroup_data <- pub_responses %>%
  select(trial_group, model, precision, recall, f1) %>%
  mutate(trial_group = as.factor(trial_group),
         model = as.factor(model))
```

### Analysis: Methods and Results

The data was grouped by trial group (which represents different disease types) and model. We calculated the mean F1 score for each combination of disease type and model. The F1 score is a harmonic mean of precision and recall, and it provides a balanced measure that takes into account both false positives and false negatives, making it a suitable metric for this analysis.

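For reference, a minimal illustration of that harmonic mean with made-up precision and recall values (not data from this project):

```{r, f1_definition_sketch, eval=FALSE}
# Illustrative only: F1 as the harmonic mean of precision and recall.
precision_ex <- 0.80  # hypothetical precision
recall_ex    <- 0.60  # hypothetical recall
f1_ex <- 2 * precision_ex * recall_ex / (precision_ex + recall_ex)
f1_ex  # approximately 0.686
```
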
To visualize performance across disease types, I plot per-disease error bars. For each disease, the error bar shows the mean F1 score across all models plus or minus one standard deviation, and the colored points mark the average F1 score of each individual model on that disease.

```{r, result02_analysis}
# Ensure that the necessary libraries are loaded
# Step 1: Group the data by disease type and calculate the overall mean F1 score across all models
# Assuming subgroup_data is the dataset containing F1 scores
grouped_data_avg <- subgroup_data %>%
  group_by(trial_group) %>%
  summarise(mean_f1_all_models = mean(f1, na.rm = TRUE),
            sd_f1_all_models = sd(f1, na.rm = TRUE),
            .groups = "drop") # Drop grouping after summarising

# Step 2: Rank the diseases based on the overall mean F1 score
# Select the top 10 best-performing diseases
best_diseases <- grouped_data_avg %>%
  arrange(desc(mean_f1_all_models)) %>%
  slice(1:10) %>%
  pull(trial_group) # Extract disease names

# Select the bottom 10 worst-performing diseases
worst_diseases <- grouped_data_avg %>%
  arrange(mean_f1_all_models) %>%
  slice(1:10) %>%
  pull(trial_group) # Extract disease names

# Step 3: Filter the original dataset for top 10 and bottom 10 diseases
best_data  <- subgroup_data %>% filter(trial_group %in% best_diseases)
worst_data <- subgroup_data %>% filter(trial_group %in% worst_diseases)

# Step 4: Recalculate the F1 scores for each model within the top and bottom 10 diseases
plot_f1_scores <- function(data, title) {
  grouped_data_avg_subset <- data %>%
    group_by(trial_group) %>%
    summarise(mean_f1_all_models = mean(f1, na.rm = TRUE),
              sd_f1_all_models = sd(f1, na.rm = TRUE),
              .groups = "drop")
  data_combined <- data %>%
    group_by(trial_group, model) %>%
    summarise(mean_f1 = mean(f1, na.rm = TRUE), .groups = "drop") %>%
    left_join(grouped_data_avg_subset, by = "trial_group")
  # Step 5: Plot with error bars representing the mean F1 score ± SD across all models
  ggplot(data_combined, aes(x = trial_group, y = mean_f1, group = model, color = model)) +
    geom_point(size = 3) + # Points for each model-specific score
    geom_line(aes(group = model)) + # Lines connecting model scores within each disease group
    # Add error bars for mean F1 ± SD
    geom_errorbar(aes(ymin = mean_f1_all_models - sd_f1_all_models,
                      ymax = mean_f1_all_models + sd_f1_all_models),
                  width = 0.2, color = "black") +
    # Highlight the mean F1 score with a tick mark or point
    geom_point(aes(y = mean_f1_all_models), shape = 3, size = 4, color = "black") +
    labs(title = title, x = "Disease Type", y = "Mean F1 Score") +
    theme_minimal() +
    theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1))
}

# Step 6: Plot the top 10 best-performing diseases
plot_best <- plot_f1_scores(best_data, "F1 Scores by Disease Type and Model with Average Error Bars")
plot_best
```
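
Since `worst_data` is computed above but never plotted, a natural companion view would reuse the `plot_f1_scores` helper on the bottom 10 diseases (a sketch, not part of the original notebook):

```{r, result02_worst_sketch, eval=FALSE}
# Sketch: the same error-bar plot for the 10 worst-performing diseases,
# reusing the plot_f1_scores() helper defined above.
plot_worst <- plot_f1_scores(worst_data,
                             "F1 Scores for the 10 Worst-Performing Disease Types")
plot_worst
```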

### Discussion of results

Here we see that the error bars are very large, so the distribution of the data is very wide. This is surprising given that the means are not very high; such wide error bars could indicate that performance depends more on the individual trial than on the disease, although more analysis would be needed to confirm that. We also once again see LLaMa3 outperforming GPT-4o, with GPT-4o scoring the worst. Once the data is rerun and the evaluation is fixed, it would be interesting to repeat this analysis.

## Analysis: Question 3 (Hallucination Stats)

### Question being asked

How much do the models hallucinate? And are there any patterns among these hallucinations?

### Data Preparation

Here I re-read the data for the ct_pub data frames.

```{r, result03_data}
pub_data <- readRDS("../../CTBench_source/corrected_data/ct_pub/CT_Pub_data_updated.Rds")
pub_matches <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.matches_gpt_updated.Rds")
pub_responses <- readRDS("../../CTBench_source/corrected_data/ct_pub/trials.responses_updated.Rds")
```

### Analysis methods used

Right now, this code is bugged: it attempts to find hallucinations by cross-referencing the matched terms against the original reference and candidate lists, but the comparison fails. I will have this fixed soon; I believe the bug is occurring because of a spacing issue, where the terms in the vectors don't always have the same whitespace before and after them.
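
If the mismatch really is leading/trailing whitespace, one possible fix would be to normalize both the original lists and the matched terms before the membership checks. A minimal sketch (not yet applied to the code below):

```{r, whitespace_fix_sketch, eval=FALSE}
# Sketch of the suspected fix: strip backticks, collapse runs of whitespace,
# and trim leading/trailing spaces before any %in% comparisons.
normalize_terms <- function(x) {
  x <- gsub("`", "", x)      # drop backticks, as the code below already does
  x <- gsub("\\s+", " ", x)  # collapse repeated whitespace
  trimws(x)                  # remove leading/trailing spaces
}
# e.g. original_reference <- normalize_terms(original_reference)
#      selected_reference <- normalize_terms(selected_reference)
```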


```{r, result03_analysis}
# Initialize output data frame
output_df <- data.frame(
  trial = character(),
  model = character(),
  num_matches = integer(),
  original_reference = list(),
  original_candidate = list(),
  reference_hallucinations = list(),
  candidate_hallucinations = list(),
  reference_duplicate_matches = list(),
  candidate_duplicate_matches = list(),
  stringsAsFactors = FALSE
)
# Loop through each unique trial in pub_data
for (trial in unique(pub_data$NCTId)) {
  # Extract the corresponding reference and candidate features for the current trial
  original_reference <- unlist(strsplit(as.character(pub_data$Paper_BaselineMeasures_Corrected[pub_data$NCTId == trial]), ","))
  original_candidate <- unlist(strsplit(as.character(pub_responses$gen_response[pub_responses$trial_id == trial]), ","))
  original_reference <- gsub("`", "", original_reference)
  original_candidate <- gsub("`", "", original_candidate)
  # Get all rows from pub_matches corresponding to the current trial
  trial_df2 <- subset(pub_matches, trial_id == trial)
  # Now loop through each unique model for the current trial
  for (m in unique(trial_df2$model)) {
    # Filter the matches for the current trial and model
    trial_model_df2 <- subset(trial_df2, model == m)
    # Initialize counters and lists for this trial-model pair
    num_matches <- 0
    reference_hallucinations <- c()
    candidate_hallucinations <- c()
    reference_duplicate_matches <- c()
    candidate_duplicate_matches <- c()
    # Vectors to track occurrences for duplicate detection
    reference_occurrences <- c()
    candidate_occurrences <- c()
    # Iterate through each row in trial_model_df2 (each match attempt)
    for (row in 1:nrow(trial_model_df2)) {
      selected_reference <- trial_model_df2$reference[row]
      selected_candidate <- trial_model_df2$candidate[row]
      if (!is.na(selected_reference) && !is.na(selected_candidate)) {
        num_matches <- num_matches + 1
      }
      # Track hallucinations: matched terms that do not appear in the originals
      if (!(selected_reference %in% original_reference) && !is.na(selected_reference)) {
        reference_hallucinations <- c(reference_hallucinations, selected_reference)
      }
      if (!(selected_candidate %in% original_candidate) && !is.na(selected_candidate)) {
        candidate_hallucinations <- c(candidate_hallucinations, selected_candidate)
      }
      # Track duplicates
      reference_occurrences <- c(reference_occurrences, selected_reference)
      candidate_occurrences <- c(candidate_occurrences, selected_candidate)
    }
    # Find duplicates by checking for more than one occurrence
    reference_duplicate_matches <- reference_occurrences[duplicated(reference_occurrences) & !is.na(reference_occurrences)]
    candidate_duplicate_matches <- candidate_occurrences[duplicated(candidate_occurrences) & !is.na(candidate_occurrences)]
    # Add the results to output_df
    output_df <- rbind(output_df, data.frame(
      trial = trial,
      model = m,
      num_matches = num_matches,
      original_reference = I(list(original_reference)),
      original_candidate = I(list(original_candidate)),
      reference_hallucinations = I(list(reference_hallucinations)),
      candidate_hallucinations = I(list(candidate_hallucinations)),
      reference_duplicate_matches = I(list(reference_duplicate_matches)),
      candidate_duplicate_matches = I(list(candidate_duplicate_matches)),
      stringsAsFactors = FALSE
    ))
  }
}
# View the output
print(head(output_df))
```
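
Once the matching bug is fixed, a simple way to quantify how much each model hallucinates would be to count the entries in the list columns and aggregate by model. A sketch, assuming `output_df` is populated as above:

```{r, result03_summary_sketch, eval=FALSE}
# Sketch: average number of hallucinated terms per trial, by model.
hallucination_summary <- output_df %>%
  mutate(n_candidate_hallucinations = lengths(candidate_hallucinations),
         n_reference_hallucinations = lengths(reference_hallucinations)) %>%
  group_by(model) %>%
  summarise(mean_candidate_hallucinations = mean(n_candidate_hallucinations),
            mean_reference_hallucinations = mean(n_reference_hallucinations),
            .groups = "drop")
hallucination_summary
```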


