diff --git a/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.Rmd b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.Rmd index 0e795d5..7212bda 100644 --- a/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.Rmd +++ b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.Rmd @@ -18,94 +18,12 @@ output: # DAR Project and Group Members * Project name: CTBench -* Project team members: Evaluation - -# Instructions (DELETE BEFORE SUBMISSION) - -* The first goal of this notebook is to document your _major findings_ to convey them to your client (Dr. Rogers, Dr. Senveratne, or Mr. Neehal) and to preserve them for future use. - -* The second goal of this notebook is to document your _major findings_ with full scientific reproducibility. _Ideally someone should be able to go back years later and understand exactly what you did and reproduce your results._ - -* You can use the appendix to include additional results to improve the readability (for example extra plots) of your notebook or to show your work even if not really a major finding. - -* This is a scientific report written in complete sentences (i.e. not bullets) using good rules of grammar. It should be readable as a paper even if all the code is not shown, and if only the results of running your code are shown. - -* You should have sufficient details for scientific reproducibility including documentation of the code. You will need to describe the analysis methods that can be used together with the code to reproduce your work. This is especially important if you use several R files. - -* The rubric for grading is here [Rubric](https://docs.google.com/spreadsheets/d/e/2PACX-1vSeo5QZbboWwKnEZodmPQLnhr3hf5FrlzAqy4LydnOAsCw6V-YLWnAU8BzkLdmb9TP0zCpufAzI20XJ/pubhtml) - -* A suggested report structure is given below, but you can customize this to meet the needs of your project. 
For your draft notebook, you will design the stucture of your notebook and outline the contents. - -* Every student's final project notebook should be written individually or, in rare cases, as a small group. In many cases, you will discussing joint work located in other notebooks/locations. Talk with professor if you want to do joint notebook. - -* As noted above, your final notebook serves as a written presentation of your work this semester so it must be written like a written document. You should include code but feel free to use use proper R Markdown code chunk syntax to hide code chunks that don't need to be shown. You must describe what you are doing and the results outside of the code chunks. **You report should be readable and understandable by the readers without reading any code.** - -* The R code that executes the results should be embedded in this notebook if possible. - + It's also okay to "source" external scripts from within your notebook. - + You can also describe functionality code and results that are in other locations (like apps). - + PLEASE make sure all source code is in appropriate repository. -* Fall 2024 students may have work that is not appropriate to be embedded on your final notebook - + You should describe the work in the notebook and provide figures generated elsewhere (e.g. screen shots, graphs). - + Indicate if that work has been committed to github. If necessary put details in Appendix including the names of the committed files. -* Your writing style should be suitable for sharing with external partners/mentors and useful to future contributors. Do not assume that your reader is familiar with the technical details of your implementation and code. Again, write as if this is a research paper. -* Focus on results; please don't summarize everything you did this semester! - + Discuss only the *most important* aspects of your work. 
- + Ask yourself *what really matters?* -* **IMPORTANT:** Discuss any insights you found regarding your research. -* If there are limitations to your work, discuss, in detail. -* Include any **background** or **supporting evidence** for your work. - + For example, mention any relevant research articles you found -- and be sure to include references! - -## Things to check before you submit (DELETE BEFORE SUBMITTING) ## -* Have you done all the required components of the notebook in the format required? - -* Is your document readable as a research paper even if all the code is suppressed? - + Try suppressing all the code using hint below and see if this is true. -* Did you proofread your document? Does it use complete sentences and good grammar? -* Is every figure/table clearly labeled and titled? -* Does every figure serve a purpose? - + Does the figure/table have a useful title? **Hint:** What _question_ does the figure answer? - + You can put extra (non-essential) figures/tables in your **Appendix**. - + Is the figured/tables captioned? - + Are the figure/tables and its associated findings discussed in the text? - + Is it clear which figure/tables is being discussed? **Hint:** use captions! -* **CRITICAL:** Have you given enough information for someone to reproduce, understand and extend your results? - + Where can they *find* the data and code that you used? - + Have you *described* the data that used? - + Have you *documented* your code? - + Have you stated where code is located? - + Are your figures/tables *clearly labeled*? - + Did you *discuss each figure and your findings*? - + Did you use good grammar and *proofread* your results? - + Finally, have you *committed* your work to github and made a *pull request*? - -* Summarize ALL of your work that is worthy of being preserved in this notebook; Feel free to include work in the appendix at end. It will not be judged as being part of the research document but rather as additional information to be preserved. 
**if you don't show and/or link to your work here, it doesn't exist for us!** - - -* You **MUST** include figures and/or tables to illustrate your work. *Screen shots or pngs are okay for work generated outside the notebook*. - -* . You **MUST** include links to other important resources (knitted HTMl files, Shiny apps). See the guide below for help. - -5. Commit the source (`.Rmd`), pdf (`.pdf`) and knitted (`.html`) versions of your notebook and push to github. Turn in the pdf version to lms. - - -See LMS for guidance on how the contents of this notebook will be graded. - -**DELETE THE SECTIONS ABOVE!** +* Project team members: Corey Curran, Xiheng Liu, Tianyan Lin, Samuel Park, Tianhao Zhao, Ziyi Bao, Mingyang Li, Soumeek Mishra # 0.0 Preliminaries. -*R Notebooks are meant to be dynamic documents. Provide any relevant technical guidance for users of your notebook. Also take care of any preliminaries, such as required packages. Sample text:* - -This report is generated from an R Markdown file that includes all the R code necessary to produce the results described and embedded in the report. Code blocks can be surpressed from output for readability using the command code `{R, echo=show}` in the code block header. If `show <- FALSE` the code block will be surpressed; if `show <- TRUE` then the code will be show. - -```{r} -# Set to TRUE to expand R code blocks; set to FALSE to collapse R code blocks -show <- TRUE -``` - Executing this R notebook requires some subset of the following packages: @@ -128,6 +46,10 @@ if (!require("ggplot2")) { install.packages("ggplot2") library(ggplot2) } +if (!require("ggcorrplot")) { + install.packages("ggcorrplot") + library(ggcorrplot) +} if (!require("tidyverse")) { install.packages("tidyverse") library(tidyverse) @@ -196,8 +118,6 @@ if (!require("kableExtra")) { # 1.0 Project Introduction -_Describe your project and your approaches at a high level. 
Give enough information that a researcher examing your notebook can understand what this notebook is about. _ - CTBench is a benchmark designed to evaluate the performance of large language models (LLMs) in supporting the design of clinical studies. By leveraging study-specific metadata, CTBench assesses how effectively different LLMs identify the baseline features of a clinical trial, such as demographic details and other key attributes collected at the trial's outset from all participants. The CTBench analysis incorporates two sources of clinical trial data: CT_repo and CT_pub. CT_repo includes selected clinical trials and their attributes sourced from the ClinicalTrials.gov data repository. In contrast, CT_pub features a subset of clinical trials with attributes derived from their corresponding clinical trial publications. @@ -205,59 +125,39 @@ The CTBench analysis incorporates two sources of clinical trial data: CT_repo an Here we will be evaluating the LLMs' performance on various metrics, such as F1, Recall, and Precision. We will also take a look at each model's tendency to hallucinate features. - -```{r } -# Code - -``` - # 2.0 Organization of Report -_Give report organization including list of major findings. Sample is provided. Please be sure to edit appropriately and remove this statement._ - This report is organized as follows: -* Section 3.0. Finding 1: Provide short name and give brief description. We performed a comparison of ying versus yang items using three different approaches: blah1, blah2, and blah3. - - * Section 4.0: Finding 2: Short name and brief desciption. +* Section 3.0. Finding 1: Hallucination metrics -Repeat as necessary +* Section 4.0: Finding 2: Nature of Hallucinations -* Section (X).0 Finding X-2: Short name and brief description. - -* Section (X+1).0 Overall conclusions and suggestions - -* Section (X+2).0 Appendix This section describe the following additional works that may be helpful in the future work: *list subjects*. 
# 3.0 Finding 1: Hallucinations Overview -_Give a highlevel overview of the major finding. What questions were your trying to address, what approaches did you employ, and what happened?_ - The original CTBench evaluation did not take into account that the LLMs can hallucinate, adding or removing features. My goal here is to examine how often these models hallucinate and whether there are patterns to be found. I made multiple visualizations and used the Kruskal-Wallis test to investigate. ## 3.1 Data, Code, and Resources -Here is a list data sets, codes, that are used in your work. Along with brief description and URL where they are located. -_Some examples you can replace. Note all these links must be clickable and live when document submitted. So make sure to do your commits and pull requests._ -1. CT_Pub_data.Rds this is the rds from CT_Pub containing the reference features for CT_Pub -[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_pub/CT_Pub_data.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_pub/CT_Pub_data.Rds) +1. CT_Repo_data.Rds this is the rds from CT_Repo containing the reference features for CT_Repo +[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_repo/CT_Repo_data.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_repo/CT_Repo_data.Rds) 2. trials.matches.Rds is the rds containing the match data [https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_repo/trials.matches.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_repo/trials.matches.Rds). 3. trials.responses.Rds is the rds containing the model response data -[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_pub/trials.matches.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_pub/trials.matches.Rds). 
+[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_repo/trials.responses.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_repo/trials.responses.Rds). 4. functions.R are functions that Corey wrote in order to calculate the hallucination data [https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/StudentData/functions.R](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/StudentData/functions.R) -5. CT-Pub-Hallucination-Metrics.Rds is summarized hallucination data, which includes the trial groups of each trial -[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/StudentData/CT-Pub-Hallucination-Metrics.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/StudentData/CT-Pub-Hallucination-Metrics.Rds) +5. CT-Repo-Hallucination-Metrics.Rds is summarized hallucination data, which includes the trial groups of each trial +[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/StudentData/CT-Repo-Hallucination-Metrics.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/StudentData/CT-Repo-Hallucination-Metrics.Rds) @@ -689,16 +589,18 @@ id_hallucinations_class<-function(trial_df,matches_df){ ``` +Below is the code that loads these data sets and produces the dataframes used for the tests and visualizations. + ```{r } # Code to read in data if appropriate. 
-pub_data <- readRDS("../../Data/Hallucinations/ct_pub/CT_Pub_data.Rds") -pub_matches <- readRDS("../../Data/Hallucinations/ct_pub/trials.matches.Rds") -pub_responses <- readRDS("../../Data/Hallucinations/ct_pub/trials.responses.Rds") +repo_data <- readRDS("../../Data/Hallucinations/ct_repo/CT_Repo_data.Rds") +repo_matches <- readRDS("../../Data/Hallucinations/ct_repo/trials.matches.Rds") +repo_responses <- readRDS("../../Data/Hallucinations/ct_repo/trials.responses.Rds") -metrics <- readRDS("../../StudentData/CT-Pub-Hallucination-Metrics.Rds") +metrics <- readRDS("../../StudentData/CT-Repo-Hallucination-Metrics.Rds") -trial_df <- read.csv("../../CTBench_source/corrected_data/ct_pub/CT-Pub-With-Examples-Corrected-allgen.csv", stringsAsFactors = FALSE) -matches_df <- read.csv("../../CTBench_source/corrected_data/ct_pub/CT-Pub-With-Examples-Corrected-allgpteval.csv", stringsAsFactors = FALSE) +trial_df <- read.csv("../../CTBench_source/corrected_data/ct_repo/CT-Repo-With-Examples-Corrected-allgen.csv", stringsAsFactors = FALSE) +matches_df <- read.csv("../../CTBench_source/corrected_data/ct_repo/CT-Repo-With-Examples-Corrected-allgpteval.csv", stringsAsFactors = FALSE) hall_data <- id_hallucinations_v2(trial_df, matches_df) @@ -746,27 +648,16 @@ hallucinations_long <- hallucinations_by_trial_group %>% ## 3.2 Contribution -_State if this section is sole work or joint work. If joint work describe who you worked with and your contribution. You can also indicate any work by others that you reused._ - This section is my own work, building off of the work that Corey has done on hallucinations. - ## 3.3 Methods Description - -_Describe the data analytics methods you used and why you chose them. -Discuss your data analytics "pipeline" from *data preparation* and *experimental design*, to *methods*, to *results*. Were you able to use pre-existing implementations? 
If the techniques required user-specified parameters, how did you choose what parameter values to use?_ +The analysis aimed to examine hallucination patterns in LLM outputs for clinical trial descriptors, focusing on positive, negative, and multimatch hallucinations. Data preparation involved combining multiple datasets, including CT_Repo_data (reference trial features), trials.matches (matched descriptors between reference and candidate features), and trials.responses (model-generated outputs). Using the id_hallucinations_v2() function, reference and candidate descriptors were parsed to compute hallucination counts. The analytics pipeline consisted of aggregating hallucination data by trial group and model, calculating total and average counts for each hallucination type, and performing statistical tests. Specifically, Kruskal-Wallis tests were employed to assess whether hallucination counts varied significantly by trial group or model. Visualizations included bar charts to illustrate hallucination averages by model and type, as well as heatmaps to identify clustering patterns across trial groups and models. The analysis relied on reshaping data for trial group comparisons and on non-parametric statistical methods to identify significant differences, focusing on the three hallucination types for simplicity and interpretability. A conventional significance threshold (p < 0.05) was applied throughout. ## 3.4 Result and Discussion - - -_For each result, state the method used. Run the code to perform it here (or state how it was run if run elsewhere) -Provide relvant visual illustrations of findings such as tables and graphs. -Then discuss the result. Repeat as necessary. Remember that readers will only read text and results and not code._ - This section examines each model's hallucinations. 
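Before turning to the real data, the Kruskal-Wallis comparison described above can be sketched on a toy example. The model names and counts below are invented purely for illustration; they are not CTBench results.

```r
# Minimal sketch of the Kruskal-Wallis test applied to hallucination counts.
# All numbers here are made up for illustration -- they are NOT CTBench data.
toy <- data.frame(
  model = rep(c("model_A", "model_B"), each = 5),
  num_pos_halls = c(3, 4, 2, 5, 3,    # model_A per-trial counts
                    7, 8, 6, 9, 7)    # model_B per-trial counts
)

# Non-parametric test: do positive-hallucination counts differ by model?
kw <- kruskal.test(num_pos_halls ~ model, data = toy)
kw$p.value < 0.05   # TRUE for this toy data: the two groups clearly differ
```

Because the test only uses ranks, it makes no normality assumption about the count distributions, which is why it suits the per-model and per-trial-group comparisons in this section.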
Hallucinations in the context of language models for clinical study design refer to outputs that deviate from the expected or correct response. There are three primary types of hallucinations: positive, negative, and multimatch.

Positive hallucinations occur when the model invents or adds information that is not present in the reference dataset, such as generating a demographic feature like "Height" when it is not included in the trial's baseline features. These hallucinations can mislead researchers by introducing irrelevant or non-existent features into the study design.

Negative hallucinations, on the other hand, involve the omission of information that is explicitly present in the reference dataset. For example, if the baseline features include "Age" and "Sex/Gender," but the model only outputs "Age," the omission of "Sex/Gender" is a negative hallucination, leading to incomplete or inadequate study designs.

Multimatch hallucinations occur when the model incorrectly matches a single reference feature to multiple generated features or vice versa, violating the expectation of one-to-one correspondence. For instance, if the reference includes "Race," and the model outputs both "Race" and "Ethnicity" as separate features, this creates redundant or conflicting information. These hallucinations introduce ambiguity and complicate data interpretation and analysis.

Each type of hallucination poses unique challenges and can undermine the reliability of the model's output, highlighting the importance of addressing these issues to ensure accurate and effective language model performance in clinical trial design. We will now compare the different LLM models on their numbers of hallucinations. 
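The positive and negative categories defined above can be illustrated with a tiny hypothetical example. The feature lists below are invented; in the actual analysis the matching is done semantically by id_hallucinations_v2() on the CT_Repo data, and multimatch detection additionally requires the matcher's one-to-one match table, so it is not shown here.

```r
# Hypothetical illustration of positive vs. negative hallucinations.
reference <- c("Age", "Sex/Gender", "Race")   # features in the reference trial
generated <- c("Age", "Race", "Height")       # features produced by the model

positive_halls <- setdiff(generated, reference)  # invented feature(s)
negative_halls <- setdiff(reference, generated)  # omitted feature(s)

positive_halls  # "Height"
negative_halls  # "Sex/Gender"
```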
@@ -776,9 +667,9 @@ # Summarize data to get the average number of hallucinations per model hallucinations_by_model_temp <- hall_data %>% group_by(model) %>% - summarise(avg_positive_hallucinations = sum(num_pos_halls), - avg_negative_hallucinations = sum(num_neg_halls), - avg_multimatch_halls = sum(num_multi_halls)) + summarise(avg_positive_hallucinations = mean(num_pos_halls), + avg_negative_hallucinations = mean(num_neg_halls), + avg_multimatch_halls = mean(num_multi_halls)) # Reshape data to long format for combined bar plot hallucinations_long_temp <- hallucinations_by_model_temp %>% @@ -791,30 +682,61 @@ hallucinations_long_temp <- hallucinations_by_model_temp %>% # Create combined bar chart ggplot(hallucinations_long_temp, aes(x = model, y = count, fill = hallucination_type)) + geom_bar(stat = "identity", position = "dodge") + - labs(title = "Total Hallucinations by Model and Type for CT_Pub", x = "Model", y = "Total Hallucinations") + + labs(title = "Average Hallucinations by Model and Type for CT_Repo", x = "Model", y = "Average Hallucinations") + scale_fill_manual(values = c("steelblue", "tomato", "purple"), name = "Hallucination Type") + theme_minimal() ``` +The bar chart shows the **average hallucinations by model and type** for the **CT_Repo** dataset, broken down into positive, negative, and multimatch hallucinations. Across all models, **positive hallucinations** occur the most frequently, while **multimatch hallucinations** are generally the least frequent, with some variation. + +For **positive hallucinations**, all models display a high average. Notably, **GPT4-omni-zs** and **llama3-70b-in-zs** have similar values, but **llama3-70b-in-zs** performs slightly better by having a marginally lower average. **GPT4-omni-ts** and **llama3-70b-in-ts** show no significant deviation from this trend, reinforcing that positive hallucinations remain prevalent across all models. 
-Overall, almost all of the models followed a similar trend as shown by the bar chart. Each model except for llama3-zs had the least number of multimatch hallucinations, the most amount of positive hallucinations, the number of negative hallucinations was in between the two. All the models had roughly the same number of negative hallucinations, with llama3-zs having the least and gpt4o-ts having the most. For multimatch hallucinations, interestingly gpt4o-ts had the least, and llama3-zs had the most. And finally for the positive hallucinations, gpt4o-zs performed the worst, while llama3-zs performed the best. +In the case of **negative hallucinations**, the averages remain consistent across the models. **Llama3-70b-in-ts** exhibits the highest negative hallucinations, while **GPT4-omni-ts** shows the least. The small differences suggest that models perform similarly in terms of omissions, with negative hallucinations staying within a comparable range. + +For **multimatch hallucinations**, the results vary more noticeably. **GPT4-omni-ts** has the lowest average multimatch hallucinations, indicating stronger performance in preventing over-matching. In contrast, **GPT4-omni-zs** and **llama3-70b-in-zs** show higher multimatch averages, with **llama3-70b-in-zs** having the most. This suggests that some models struggle with matching a single reference to multiple descriptors despite being explicitly guided not to do so. + +Overall, the chart reveals a clear trend: **positive hallucinations** dominate all models, followed by negative hallucinations, while multimatch hallucinations are the least common but vary across models. **GPT4-omni-ts** appears to perform best in minimizing multimatch hallucinations, while **llama3-70b-in-zs** performs slightly better for positive hallucinations but exhibits higher multimatch errors. 
This analysis highlights that while all models struggle with positive hallucinations, their performance on other types varies, reflecting specific strengths and weaknesses. With this in mind, let's dig deeper and take a look at how each trial group affects the number of hallucinations. ```{r} -ggplot(hallucinations_long, aes(x = TrialGroup, y = count, color = model_name, group = model_name)) + - geom_line(size = 1) + - geom_point(size = 2) + - facet_wrap(~ hallucination_type, scales = "free_y") + - labs(title = "Total Hallucinations by Trial Group and Model", - x = "Trial Group (Disease)", - y = "Total Hallucination Counts", - color = "Model") + - theme_minimal() + - theme(axis.text.x = element_text(angle = 45, hjust = 1)) +library(pheatmap) # pheatmap is required below; it is not loaded in the preliminaries +# Combine hallucination_type and model_name into one column and reshape the data +hallucination_matrix <- hallucinations_long %>% + unite(hallucination_model, hallucination_type, model_name, sep = " - ") %>% + pivot_wider(names_from = hallucination_model, values_from = count, values_fill = 0) %>% + column_to_rownames(var = "TrialGroup") + +# Reorder columns to group hallucination types +column_order <- colnames(hallucination_matrix) %>% + as_tibble() %>% + mutate(hallucination_type = gsub(" - .*", "", value)) %>% + arrange(hallucination_type) %>% + pull(value) + +# Apply the order to the columns +hallucination_matrix <- hallucination_matrix[, column_order] + +# Ensure the matrix is numeric and replace any NAs +hallucination_matrix <- as.matrix(hallucination_matrix) +hallucination_matrix[is.na(hallucination_matrix)] <- 0 # Replace NA with 0 + +# Transpose the reordered matrix +transposed_matrix <- t(hallucination_matrix) + +# Generate the heatmap +pheatmap( + transposed_matrix, # Transposed and cleaned matrix + cluster_rows = FALSE, # Do not cluster rows; keep the grouped order + cluster_cols = FALSE, # Keep columns in the defined order + scale = "row", # Scale values by row for better visualization + main = "Heatmap of Total Hallucinations by Trial Group and Model", + fontsize_row = 8, # Adjust font size for rows + fontsize_col = 8 # Adjust font size for columns +) + ``` The heatmap above visualizes how the hallucination counts vary across trial groups and models. However, to draw firm conclusions, below I run a Kruskal-Wallis test by both model and trial group for each type of hallucination. @@ -853,11 +775,16 @@ results_df <- do.call(rbind, lapply(results, as.data.frame)) kable(results_df, col.names = c("Hallucination Type", "Test", "Group", "p-value"), caption = "Kruskal-Wallis Test Results for Hallucination Types by Trial Group and Model") ``` -Here we can see that we have p values of under 0.05 for positive hallucinations by model, and negative hallucinations by trial group. This is an interesting finding, and leads to the idea that Positive hallucinations are created completely model dependent, and to minimize these, we need to choose a better model. Negative hallucinations on the other hand, are dependent on trial group. This could indicate that harder trials that the LLMs struggle with, seem to be consistent across models. +The table presents the results of the Kruskal-Wallis test for the different hallucination types (Positive, Negative, and Multimatch) across Trial Groups and Models. The p-values indicate whether there are statistically significant differences within these groups. + +For positive hallucinations, there is a significant difference when grouped by Trial Group (p = 0.0048), suggesting that certain trial groups experience more positive hallucinations than others. However, no significant difference is observed across models (p = 0.9318), indicating that positive hallucinations occur at similar levels regardless of the model used. + +In the case of negative hallucinations, the results show a significant difference across Models (p = 0.0390), meaning that the performance of models varies when it comes to negative hallucinations (omissions). 
Conversely, no significant difference is found across Trial Groups (p = 0.2439), suggesting that trial groups do not strongly influence the occurrence of negative hallucinations. + +For multimatch hallucinations, there is a significant difference across Models (p = 0.0452), indicating variability in how models perform when avoiding multimatch errors. However, the results show no significant difference across Trial Groups (p = 0.3659), suggesting that trial groups do not significantly affect multimatch hallucinations. ## 3.5 Conclusions, Limitations, and Future Work. -**Discuss the significance of your finding. Discuss any limitations that should be addressed in the future. Give suggestions for future work.** This analysis revealed significant insights into the behavior of large language models (LLMs) in generating hallucinations during clinical trial design tasks. The Kruskal-Wallis results indicate that positive hallucinations, where models invent features not present in the reference data, are largely trial-group-dependent and occur at similar rates across models, suggesting that certain trial contexts inherently invite invented features regardless of the model used. In contrast, negative hallucinations, where models omit critical features, and the less frequent multimatch hallucinations both vary significantly across models, so model choice matters most for reducing omissions and over-matching. Despite these findings, there are limitations. 
The study was confined to a specific dataset and models, potentially limiting the generalizability of results. Additionally, while statistical tests like the Kruskal-Wallis provided robust evidence of differences, further qualitative analysis of hallucinated features could reveal more nuanced insights. The study also focused primarily on identifying and quantifying hallucinations without implementing corrective measures to address them. @@ -865,37 +792,159 @@ Future work should address these limitations by expanding the dataset to include a wider variety of trial groups and exploring additional LLMs. Additionally, fine-tuning the prompts with an emphasis on these hallucinations will hopefully reduce them across the board. -# X.0 Finding 1: Name +# 4.0 Finding 2: Hallucination Nature -_These sections can be duplicated for each finding as needed._ +Now that we have seen how the hallucinations are distributed, I want to take a closer look at the relationships between the types of hallucinations and at the descriptors that have been hallucinated. -## X.1 Data, Code, and Resources +## 4.1 Data, Code, and Resources -## X.2 Contribution +1. CT_Repo_data.Rds this is the rds from CT_Repo containing the reference features for CT_Repo +[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_repo/CT_Repo_data.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_repo/CT_Repo_data.Rds) -## X.3 Methods Description +2. trials.matches.Rds is the rds containing the match data +[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_repo/trials.matches.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_repo/trials.matches.Rds). -## X.5 Conclusions and Future Work. +3. trials.responses.Rds is the rds containing the model response data and the list of hallucinations; however, this list combines all 3 types of hallucinations. +[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_repo/trials.responses.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_repo/trials.responses.Rds). -## X.4 Result and Discussion +4. I also use multiple dataframes that I created during the preprocessing in the previous section of the notebook. +Here I preprocess the data for the visualizations. 
-## X.5 Conclusions, Limitations, and Future Work.
+```{r}
+# Summarize average hallucination counts of each type per trial
+hallucination_type_summary <- hall_data %>%
+  group_by(trial_id) %>%
+  summarise(
+    avg_positive   = mean(num_pos_halls, na.rm = TRUE),
+    avg_negative   = mean(num_neg_halls, na.rm = TRUE),
+    avg_multimatch = mean(num_multi_halls, na.rm = TRUE),
+    .groups = "drop"
+  )
+
+# Compute Spearman correlations between the three hallucination types
+hallucination_type_correlations <- hallucination_type_summary %>%
+  select(avg_positive, avg_negative, avg_multimatch) %>%
+  cor(method = "spearman")
+
+# Extract hallucinated descriptors into a consistent format:
+# NULL/NA entries become empty vectors; single values become character
+repo_responses_cleaned <- repo_responses %>%
+  rowwise() %>%
+  mutate(hallucinations_extracted = list(if (is.null(hallucinations) || all(is.na(hallucinations))) {
+    character(0)                 # empty vector for NULL or NA
+  } else if (is.vector(hallucinations)) {
+    hallucinations               # already a vector, return as is
+  } else {
+    as.character(hallucinations) # coerce single values to character
+  })) %>%
+  ungroup()
+
+# Unnest the hallucinated descriptors into long format
+hallucination_long <- repo_responses_cleaned %>%
+  select(NCTId, model, hallucinations_extracted) %>%
+  unnest(hallucinations_extracted) %>%
+  filter(hallucinations_extracted != "") %>%    # remove empty strings
+  rename(descriptor = hallucinations_extracted) # rename for clarity
+
+# Count the frequency of each hallucinated descriptor per model
+descriptor_counts <- hallucination_long %>%
+  group_by(model, descriptor) %>%
+  summarise(count = n(), .groups = "drop") %>%
+  arrange(model, desc(count))
+
+# Keep the top 5 hallucinated descriptors per model and truncate long names
+top_descriptors_by_model <- descriptor_counts %>%
+  group_by(model) %>%
+  slice_max(count, n = 5) %>%
+  ungroup() %>%
+  mutate(descriptor_short = ifelse(nchar(descriptor) > 30,
+                                   paste0(substr(descriptor, 1, 27), "..."),
+                                   descriptor))
+```
+
+## 4.2 Contribution

-# Bibliography
-Provide a listing of references and other sources.

+Using the data that Corey generated, I performed these analyses myself.
+
+## 4.3 Methods Description
+
+To examine the relationships between hallucination types and to identify frequently hallucinated descriptors across models, I used a multi-step analysis pipeline. First, hallucination counts for each trial were summarized to calculate the average number of positive, negative, and multimatch hallucinations. These averages were then used to compute a Spearman correlation matrix, which evaluates the strength and direction of the relationships between the three hallucination types. Spearman correlation was chosen because it is non-parametric and robust to non-linear (but monotonic) relationships, making it suitable for our dataset.

-* Citations from literature. Give each reference a unique name combining first author last name, year, and additional letter if required. e.g.[Bennett22a]. If there is no known author, make something reasonable up.
-* Significant R packages used

+Next, I processed the list of hallucinated descriptors by extracting elements from the hallucinations column of the repo_responses dataset. This involved cleaning and transforming the data to handle different formats (e.g., vectors or single strings) and then unnesting the hallucinated descriptors into long format. Descriptors were grouped by model, and their frequencies were calculated to identify the top hallucinated descriptors for each model. For clarity, the top five descriptors for each model were selected, and lengthy descriptor names were truncated for visualization purposes. These results were plotted as a faceted horizontal bar chart, where each facet represents one model and the x-axis shows the frequency of hallucination.
This approach enabled a clear comparison of the most common hallucinated descriptors across models while maintaining readability.

+## 4.4 Results and Discussion
+
+In earlier sections of this notebook we looked at how the hallucinations were distributed and what they were related to. Now I want to examine how the different types of hallucinations relate to one another. To do this, I construct a correlation matrix using Spearman correlation.
+
+```{r}
+ggcorrplot(hallucination_type_correlations, lab = TRUE, lab_size = 4, colors = c("red", "white", "blue")) +
+  labs(
+    title = "Correlation Between Hallucination Types",
+    subtitle = "Spearman Correlation",
+    x = "Hallucination Type",
+    y = "Hallucination Type"
+  ) +
+  theme_minimal()
+```
+
+The figure presents a Spearman correlation matrix showing the relationships between the three hallucination types: positive, negative, and multimatch. The correlation coefficients are color-coded, with blue representing positive correlations and red representing negative correlations; the intensity indicates the strength of the relationship. The correlation between positive and negative hallucinations is 0.46, indicating a moderate positive relationship. This suggests that trials with more positive hallucinations also tend to have more negative hallucinations, though the relationship is not very strong. The correlation between positive and multimatch hallucinations is 0.04, a very weak positive relationship, meaning the two occur largely independently of each other. Similarly, the correlation between negative and multimatch hallucinations is 0.02, which is negligible, indicating almost no relationship between the two.
+
+Overall, the plot highlights that positive and negative hallucinations have the strongest relationship, while multimatch hallucinations appear largely uncorrelated with both.
This suggests that multimatch hallucinations behave differently and may arise from distinct underlying factors compared to the other hallucination types.
+
+We can also look at the top five hallucinated descriptors per model.
+
+```{r}
+ggplot(top_descriptors_by_model, aes(y = reorder(descriptor_short, count), x = count, fill = model)) +
+  geom_bar(stat = "identity", position = "dodge", width = 0.6) +
+  facet_wrap(~ model, ncol = 1, scales = "free_y") +
+  labs(title = "Top 5 Hallucinated Descriptors by Model (Truncated)",
+       x = "Frequency of Hallucination", y = "Descriptor") +
+  theme_minimal() +
+  theme(
+    axis.text.y = element_text(size = 7),
+    axis.text.x = element_text(size = 10),
+    strip.text = element_text(size = 12, face = "bold"),
+    legend.position = "none"
+  )
+```
+
+For **GPT4-omni-ts**, the most hallucinated descriptors include "ECOG Performance Status," "Baseline Laboratory Values," and "Study Baseline: Cholesterol," with **ECOG Performance Status** being the most frequent. This suggests that GPT4-omni-ts often introduces specific baseline laboratory metrics and performance-status features.
+
+In the case of **GPT4-omni-zs**, descriptors like "Patient history at enrollment," "Medical Condition," and "Body Mass Index (BMI)" are the most hallucinated. These results indicate a tendency to invent patient history and general health-related descriptors, suggesting GPT4-omni-zs focuses on broad clinical and demographic metrics.
+
+For **llama3-70b-in-ts**, the most frequent hallucinations include "ECOG Performance Status," "Body Mass Index (BMI)," and "Karnofsky Performance Status," alongside "Blood Pressure" and "Age." The prominence of "ECOG" and "BMI" highlights a similar pattern of hallucinating performance-based and anthropometric measures.
+
+Finally, for **llama3-70b-in-zs**, "Race/Ethnicity, Customized" stands out with the highest hallucination frequency across all models, far exceeding the others.
Other descriptors, including "Body Mass Index (BMI)," "ECG measurements," and "Serum Creatinine," are also common. The dominance of "Race/Ethnicity" suggests llama3-70b-in-zs struggles with demographic features, particularly those related to race and ethnicity.
+
+It is also important to note how the hallucinations are detected: if the LLM does not include a feature exactly as it was given, the feature is counted as a hallucination. While going through the data manually, I often found features counted as hallucinations even though the LLM had preserved the semantic meaning of the descriptor.
+
+Overall, the plot reveals clear differences in hallucination patterns across models. While **GPT4 models** tend to hallucinate baseline performance and laboratory-related descriptors, **llama3 models** show a stronger tendency to hallucinate anthropometric and demographic features like "BMI" and "Race/Ethnicity." These findings suggest that the nature of hallucinations is model-dependent, with certain descriptor types being particularly prone to hallucination for specific models.
+
+## 4.5 Conclusions, Limitations, and Future Work
+
+The analysis revealed distinct patterns in the relationships between hallucination types and their distribution across models. Positive hallucinations showed a moderate positive correlation with negative hallucinations, suggesting that trials with more fabricated features often also omit relevant ones. However, multimatch hallucinations exhibited weak or negligible correlations with the other types, indicating they may arise from different underlying mechanisms.
+
+The examination of hallucinated descriptors highlighted model-dependent tendencies. GPT4 models primarily hallucinated baseline laboratory and performance-related metrics, such as "ECOG Performance Status" and "Baseline Laboratory Values."
In contrast, llama3 models demonstrated a stronger tendency to hallucinate demographic features, including "BMI" and "Race/Ethnicity, Customized," particularly for llama3-70b-in-zs. These findings suggest that while all models struggle with hallucinations, the nature of these errors varies depending on the model architecture or training data.
+
+One key limitation is that the analysis treated all hallucinations equally, without distinguishing between cases where semantic meaning was preserved and cases where the hallucination was genuinely misleading. This highlights the need for qualitative analysis to assess the context and significance of hallucinated descriptors. Additionally, the dataset is limited to a specific set of models and trials, which may impact the generalizability of the findings.
+
+Future work should focus on refining evaluation techniques to account for semantic equivalence in descriptors and on further investigating the root causes of multimatch hallucinations. Expanding the analysis to include additional models and trial groups can provide a broader understanding of hallucination behaviors and inform strategies to mitigate these errors. Finally, fine-tuning LLMs to prioritize consistency in descriptor generation and to minimize deviations from reference features will be a critical step in improving their reliability for clinical trial design.
+
+
+# Bibliography
+
+No outside sources were used other than the following R packages:
+
+* `ggplot2`
+* `tidyverse`
+* `knitr`
+* `jsonlite`
+* `devtools`
+* `stringr`
+* `pheatmap`
+* `kableExtra`

# Appendix

-*Include here whatever you think is relevant to support the main content of your notebook. For example, you may have only include example figures above in your main text but include additional ones here. Or you may have done a more extensive investigation, and want to put more results here to document your work in the semester.
Be sure to divide appendix into appropriate sections and make the contents clear to the reader using approaches discussed above. * diff --git a/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.pdf b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.pdf index db0951d..ee0d936 100644 Binary files a/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.pdf and b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.pdf differ