diff --git a/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.Rmd b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.Rmd
new file mode 100644
index 0000000..0e795d5
--- /dev/null
+++ b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.Rmd
@@ -0,0 +1,901 @@
---
title: "Data Analytics Research Individual Final Project Report"
author: "Yashas Balaji"
date: "December 2, 2024"
output:
  pdf_document:
    toc: yes
    toc_depth: '3'
  html_notebook: default
  html_document:
    toc: yes
    toc_depth: 3
    toc_float: yes
    number_sections: yes
    theme: united
---

# DAR Project and Group Members

* Project name: CTBench
* Project team members: Evaluation

# 0.0 Preliminaries.

*R Notebooks are meant to be dynamic documents. Provide any relevant technical guidance for users of your notebook. Also take care of any preliminaries, such as required packages. Sample text:*

This report is generated from an R Markdown file that includes all the R code necessary to produce the results described and embedded in the report.
Code blocks can be suppressed from output for readability by using the chunk option `echo=show` in the code block header (i.e. `{r, echo=show}`). If `show <- FALSE` the code block will be suppressed; if `show <- TRUE` the code will be shown.

```{r}
# Set to TRUE to expand R code blocks; set to FALSE to collapse R code blocks
show <- TRUE
```

Executing this R notebook requires some subset of the following packages:

* `ggplot2`
* `tidyverse`
* `knitr`
* `jsonlite`
* `devtools`
* `stringr`
* `pheatmap`
* `kableExtra`

These will be installed and loaded as necessary (code suppressed).

```{r, include=FALSE}
# This code will install required packages if they are not already installed
# ALWAYS INSTALL YOUR PACKAGES LIKE THIS!
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library(tidyverse)
}
if (!require("knitr")) {
  install.packages("knitr")
  library(knitr)
}

knitr::opts_chunk$set(echo = TRUE)

# Set the default CRAN repository
local({r <- getOption("repos")
       r["CRAN"] <- "http://cran.r-project.org"
       options(repos=r)
})

if (!require("jsonlite")) {
  install.packages("jsonlite")
  library(jsonlite)
}

if (!require("devtools")) {
  install.packages("devtools")
  library(devtools)
}

# For package conflict resolution (esp. dplyr functions)
if (!require("conflicted")) {
  devtools::install_github("r-lib/conflicted")
  library(conflicted)
}

# Required packages for CTEval analysis
if (!require("rmarkdown")) {
  install.packages("rmarkdown")
  library(rmarkdown)
}

# Our preferences for masked dplyr verbs
conflicts_prefer(dplyr::summarize())
conflicts_prefer(dplyr::filter())
conflicts_prefer(dplyr::select())
conflicts_prefer(dplyr::mutate())
conflicts_prefer(dplyr::arrange())

if (!require("stringr")) {
  install.packages("stringr")
  library(stringr)
}

if (!require("pheatmap")) {
  install.packages("pheatmap")
  library(pheatmap)
}

if (!require("plotrix")) {
  install.packages("plotrix")
  library(plotrix)
}

if (!require("kableExtra")) {
  install.packages("kableExtra")
  library(kableExtra)
}
```

# 1.0 Project Introduction

_Describe your project and your approaches at a high level. Give enough information that a researcher examining your notebook can understand what this notebook is about._

CTBench is a benchmark designed to evaluate the performance of large language models (LLMs) in supporting the design of clinical studies. By leveraging study-specific metadata, CTBench assesses how effectively different LLMs identify the baseline features of a clinical trial, such as demographic details and other key attributes collected at the trial's outset from all participants.

The CTBench analysis incorporates two sources of clinical trial data: CT_repo and CT_pub. CT_repo includes selected clinical trials and their attributes sourced from the ClinicalTrials.gov data repository. In contrast, CT_pub features a subset of clinical trials with attributes derived from their corresponding clinical trial publications.

Here we evaluate each LLM's performance on metrics such as F1, recall, and precision, and we also examine each model's tendency to hallucinate features; a brief sketch of how these metrics are computed from matched and unmatched features is given at the end of this section.

```{r }
# Code

```
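To make the evaluation metrics concrete, the sketch below shows how precision, recall, and F1 can be computed from the counts of matched and unmatched features for a single trial. The counts here are made up purely for illustration; they are not results from this notebook.

```{r}
# Generic illustration with made-up counts (not actual CTBench results):
# matched feature pairs, unmatched reference features, and unmatched
# candidate (model-generated) features for one hypothetical trial
n_matched <- 8
n_unmatched_reference <- 2
n_unmatched_candidate <- 3

precision <- n_matched / (n_matched + n_unmatched_candidate)
recall <- n_matched / (n_matched + n_unmatched_reference)
f1 <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, F1 = f1)
```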
# 2.0 Organization of Report

This report is organized as follows:

* Section 3.0, Finding 1: Hallucinations Overview. We quantify how often each LLM hallucinates baseline features (by adding, omitting, or multiply matching them), compare the models with summary visualizations, and use Kruskal-Wallis tests to ask whether hallucination counts depend on the model or on the trial group.

* Section 3.5 gives the conclusions, limitations, and suggested future work for this finding.

* Section X.0 is a template for additional findings and is retained as a placeholder for future work.

* The report closes with a bibliography and an appendix.


# 3.0 Finding 1: Hallucinations Overview

_Give a high-level overview of the major finding. What questions were you trying to address, what approaches did you employ, and what happened?_

The original CTBench study did not account for the LLMs hallucinating, that is, adding features that are not in the reference list or dropping features that are. My goal here is to examine how often these models hallucinate and whether there are patterns in when they do. I created several visualizations and used the Kruskal-Wallis test to investigate these questions.


## 3.1 Data, Code, and Resources

Below is a list of the data sets and code used in this work, each with a brief description and the URL where it is located.

_Note that all these links must be clickable and live when the document is submitted, so make sure to do your commits and pull requests._

1. CT_Pub_data.Rds is the RDS file from CT_Pub containing the reference features for the CT_Pub trials.
[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_pub/CT_Pub_data.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_pub/CT_Pub_data.Rds)

2. trials.matches.Rds is the RDS file containing the evaluator match data.
[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_pub/trials.matches.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_pub/trials.matches.Rds)

3. trials.responses.Rds is the RDS file containing the model response data.
[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_pub/trials.responses.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/Data/Hallucinations/ct_pub/trials.responses.Rds)

4. functions.R contains the functions that Corey wrote to calculate the hallucination data.
[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/StudentData/functions.R](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/StudentData/functions.R)

5. CT-Pub-Hallucination-Metrics.Rds is summarized hallucination data, which includes the trial group of each trial.
[https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/StudentData/CT-Pub-Hallucination-Metrics.Rds](https://github.rpi.edu/DataINCITE/DAR-CTEval-F24/blob/main/StudentData/CT-Pub-Hallucination-Metrics.Rds)

The preprocessing for this finding is handled by the helper functions defined in the (suppressed) chunk below, which build on Corey's functions.R listed above; they parse each trial's feature strings and count hallucinations for every trial and model combination.
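In these files each trial's baseline features are stored as a single string in which every feature is wrapped in backticks and separated by commas. The short example below illustrates that format with a made-up string (not actual trial data) and shows how individual features can be pulled out with the same regular-expression approach used by the `extract_elements_v2()` helper in the next chunk.

```{r}
# Illustrative example only: a made-up feature string in the backtick-delimited
# format used by the CT_Pub feature columns
example_string <- "`Age`, `Sex/Gender`, `Body Mass Index (BMI)`"

# Extract everything between pairs of backticks, then strip the backticks
example_features <- gsub("`", "", regmatches(example_string,
                                             gregexpr("`(.*?)`", example_string))[[1]])
example_features
```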
+ +```{r, result02_analysis, include=FALSE} + +# extracting elements ----------------------------------------------------- + +extract_elements_v2 <- function(s) { + # Extract elements enclosed within backticks using regex + pattern <- "`(.*?)`" + elements <- regmatches(s, gregexpr(pattern, s))[[1]] + + # Remove the backticks from the matched elements + elements <- gsub("`", "", elements) + + return(elements) + +} + + + +id_hallucinations_v2<-function(trial_df,matches_df){ + + # takes the raw input from Nafis and returns the counts of each type of + # hallucination for each trial row + + # extracts the model from its corresponding column for each of the models + gptzs_matches<-data.frame(trial_id=matches_df$NCTId,model='gpt4-omni-zs', + matches=matches_df$gpt4o_zs_gen_matches) + + gptts_matches<-data.frame(trial_id=matches_df$NCTId,model='gpt4-omni-ts', + matches=matches_df$gpt4o_ts_gen_matches) + + llamazs_matches<-data.frame(trial_id=matches_df$NCTId, + model='llama3-70b-in-zs', + matches=matches_df$llama3_70b_it_zs_gen_matches) + + llamats_matches<-data.frame(trial_id=matches_df$NCTId, + model='llama3-70b-in-ts', + matches=matches_df$llama3_70b_it_ts_gen_matches) + + # combine the above; essentially transferred from wide to long form + matches<-rbind(gptzs_matches,gptts_matches,llamazs_matches,llamats_matches) + + # remove the trials used for three shot prompting, convert from Json, then + # throw out the old matches column + matches_parsed <- matches %>% + filter(!trial_id %in% c("NCT00000620", "NCT01483560", "NCT04280783")) %>% + mutate(new_matches = lapply(matches, fromJSON)) %>% + select(trial_id, model, new_matches) + + # keep the original df in case this part messes everything up (mostly for + # debugging, no longer needed as it works fine) + matches_parsed_test<-matches_parsed + + # ok, bear with me here + # loop through each row of the matches df + # + # The if conditions account for if any of the lists are empty; otherwise, it + # returns NA values which mess with the later code chunks + for (ind in 1:nrow(matches_parsed)){ + # extract the matched reference features into its own column + matches_parsed_test$matched_reference_features[[ind]]=if(length( + matches_parsed$new_matches[[ind]]$matched_features)>0){ + matches_parsed$new_matches[[ind]]$matched_features[,1]} else {list()} + # extract the matched candidate features into its own column + matches_parsed_test$matched_candidate_features[[ind]]=if(length( + matches_parsed$new_matches[[ind]]$matched_features)>0){ + matches_parsed$new_matches[[ind]]$matched_features[,2]}else{list()} + # extract the remaining reference features into its own column + matches_parsed_test$remaining_reference_features[[ind]]=if(length( + matches_parsed$new_matches[[ind]]$remaining_reference_features)>0){ + matches_parsed$new_matches[[ind]]$remaining_reference_features}else{ + list()} + # extract the remaining candidate features into its own column + matches_parsed_test$remaining_candidate_features[[ind]]=if(length( + matches_parsed$new_matches[[ind]]$remaining_candidate_features)>0){ + matches_parsed$new_matches[[ind]]$remaining_candidate_features}else{list() + } + + # concatenate a sequence of NAs to separate the remaining candidate features + # from the remaining reference features (i.e. 
make it look more like what + # the class had originally for the matches) + matches_parsed_test$reference[[ind]]=as.list(c( + matches_parsed_test$matched_reference_features[[ind]], + matches_parsed_test$remaining_reference_features[[ind]], + rep(NA,length(matches_parsed_test$remaining_candidate_features[[ind]])))) + matches_parsed_test$candidate[[ind]]=as.list(c( + matches_parsed_test$matched_candidate_features[[ind]], + rep(NA,length(matches_parsed_test$remaining_reference_features[[ind]])), + matches_parsed_test$remaining_candidate_features[[ind]])) + } + + # just take the columns with the trial id, generative model, reference feature + # list, and candidate feature list (the ones we just created with the NAs), + # then expand it out and sort by trial id + full_matches<-matches_parsed_test %>% + select(trial_id,model,reference,candidate) %>% + unnest(c(reference,candidate)) %>% + arrange(trial_id) + + + + # going from wide to long form for the trial info dataframe + # remove the trial group (as that was not in the original class data in this + # table) and all generated columns + trial_gptzs<-select(trial_df,-c(TrialGroup,gpt4o_zs_gen,gpt4o_ts_gen, + llama3_70b_it_zs_gen,llama3_70b_it_ts_gen)) + # identify model as gpt 0 shot + trial_gptzs$model='gpt4-omni-zs' + # re-add the gpt 0 shot generated results + trial_gptzs$candidate=trial_df$gpt4o_zs_gen + + # same as above but for gpt 3 shot + trial_gptts<-select(trial_df,-c(TrialGroup,gpt4o_zs_gen,gpt4o_ts_gen, + llama3_70b_it_zs_gen,llama3_70b_it_ts_gen)) + trial_gptts$model='gpt4-omni-ts' + trial_gptts$candidate=trial_df$gpt4o_ts_gen + + # same as above but for llama 0 shot + trial_llamazs<-select(trial_df,-c(TrialGroup,gpt4o_zs_gen,gpt4o_ts_gen, + llama3_70b_it_zs_gen,llama3_70b_it_ts_gen)) + trial_llamazs$model='llama3-70b-in-zs' + trial_llamazs$candidate=trial_df$llama3_70b_it_zs_gen + + # same as above but for llama 3 shot + trial_llamats<-select(trial_df,-c(TrialGroup,gpt4o_zs_gen,gpt4o_ts_gen, + llama3_70b_it_zs_gen,llama3_70b_it_ts_gen)) + trial_llamats$model='llama3-70b-in-ts' + trial_llamats$candidate=trial_df$llama3_70b_it_ts_gen + + # combine the above; it is now long form :) + new_trial_df<-rbind(trial_gptzs,trial_gptts,trial_llamazs,trial_llamats) + + # this is to differentiate between CT-Pub and CT-Repo; the true reference + # features are stored in different column names between the two + # + # In both cases, take the trial id, reference feature list, candidate feature + # list, and generative model columns, and remove the trials used for 3 shot + # prompting + if ('Paper_BaselineMeasures_Corrected' %in% colnames(new_trial_df)){ + trial_features<-new_trial_df %>% + dplyr::select(NCTId,Paper_BaselineMeasures_Corrected,candidate,model) %>% + dplyr::filter(!NCTId %in% c("NCT00000620", "NCT01483560", "NCT04280783")) + colnames(trial_features)<-c('trial_id','true_ref_features', + 'true_can_features','model') + } else { + trial_features<-new_trial_df %>% + dplyr::select(NCTId,API_BaselineMeasures_Corrected,candidate,model) %>% + dplyr::filter(!NCTId %in% c("NCT00000620", "NCT01483560", "NCT04280783")) + colnames(trial_features)<-c('trial_id','true_ref_features', + 'true_can_features','model') + } + + # remove factors from matches df; it was giving me some issues when trying to + # get rid of NAs, so this fixed that + full_matches<-data.frame(matrix(unlist(full_matches),nrow=nrow(full_matches)), + stringsAsFactors=FALSE) + colnames(full_matches)<-c('trial_id','model','reference','candidate') + + # extract the reference features for 
each trial according to the evaluator + eval_ref_features <- full_matches %>% + dplyr::select(trial_id,model,reference) %>% + dplyr::filter(!trial_id%in%c("NCT00000620","NCT01483560","NCT04280783"))%>% + drop_na() + + # surround these features in backticks and add a comma and space after each + eval_ref_features$reference<-paste0("`",eval_ref_features$reference,"`, ") + + # roll up the evaluators reference feature list into a df with 1 row for each + # trial instance + eval_ref_features<-eval_ref_features %>% + dplyr::group_by(trial_id,model) %>% + dplyr::mutate(match_ref_features=paste0(reference,collapse="")) %>% + dplyr::select(trial_id,model,match_ref_features) %>% + dplyr::distinct() + + # extract the candidate features for each trial according to the evaluator + eval_can_features <- full_matches %>% + dplyr::select(trial_id,model,candidate) %>% + dplyr::filter(!trial_id%in%c("NCT00000620","NCT01483560","NCT04280783"))%>% + drop_na() + + # surround these features in backticks and add a comma and space after each + eval_can_features$candidate<-paste0("`",eval_can_features$candidate,"`, ") + + # roll up the evaluators candidate feature list into a df with 1 row for each + # trial instance + eval_can_features<-eval_can_features %>% + dplyr::group_by(trial_id,model) %>% + dplyr::mutate(match_can_features=paste0(candidate,collapse="")) %>% + dplyr::select(trial_id,model,match_can_features) %>% + dplyr::distinct() + + # combine the dfs with the true features and evaluator-reported features + features<-merge(merge(trial_features,eval_ref_features),eval_can_features) + + # loop through each row of this df to count each of the 3 types of + # hallucinations + for (i in 1:nrow(features)){ + # calculate addition hallucinations by counting how many reference features + # the evaluator reported, counting how many of the reference features the + # evaluator reported are in the true reference feature list, then finding + # the difference between those two numbers, then doing the same thing for + # the candidate features, and summing those 2 final numbers + features$num_pos_halls[i]<-(length(extract_elements_v2( + features$match_ref_features[[i]]))-sum(extract_elements_v2( + features$match_ref_features[[i]]) %in% extract_elements_v2( + features$true_ref_features[[i]])))+(length(extract_elements_v2( + features$match_can_features[[i]]))-sum(extract_elements_v2( + features$match_can_features[[i]]) %in% extract_elements_v2( + features$true_can_features[[i]]))) + # calculate removal hallucinations by counting how many true reference + # features there were, counting how many true reference features were + # reported by the evaluator, then finding the difference between those two + # numbers, then doing the same thing for the candidate features, and summing + # those 2 final numbers + features$num_neg_halls[i]<-(length(extract_elements_v2( + features$true_ref_features[[i]]))-sum(extract_elements_v2( + features$true_ref_features[[i]]) %in% extract_elements_v2( + features$match_ref_features[[i]])))+(length(extract_elements_v2( + features$true_can_features[[i]]))-sum(extract_elements_v2( + features$true_can_features[[i]]) %in% extract_elements_v2( + features$match_can_features[[i]]))) + + # calculate the multi-match hallucinations + # create a table of counts for each true reference feature + true_ref_count=table(extract_elements_v2(features$true_ref_features[[i]])) + # create a table of counts for each reference feature according to the + # evaluator + 
match_ref_count=table(extract_elements_v2(features$match_ref_features[[i]])) + # initialize the reference multi-match hallucination counter + multi_halls_ref=c() + # loop through each true reference feature + for (feat1 in extract_elements_v2(features$true_ref_features[[i]])){ + # calculate the multi-match hallucinations for that feature by counting + # how many times it appears in the true reference feature list, counting + # how many times it appears in the evaluators reference feature list, and + # finding the difference between those two numbers. If the difference is + # negative, that is a negative hallucination, not a multi-match, so set + # those to 0 to count correctly + multi_halls_ref[feat1]=max(sum(as.numeric( + match_ref_count[feat1])-as.numeric(true_ref_count[feat1])),0,na.rm=TRUE) + } + + # create a table of counts for each true candidate feature + true_can_count=table(extract_elements_v2(features$true_can_features[[i]])) + # create a table of counts for each candidate feature according to the + # evaluator + match_can_count=table(extract_elements_v2(features$match_can_features[[i]])) + # initialize the reference multi-match hallucination counter + multi_halls_can=c() + # loop through each true reference feature + for (feat2 in extract_elements_v2(features$true_can_features[[i]])){ + # calculate the multi-match hallucinations for that feature by counting + # how many times it appears in the true candidate feature list, counting + # how many times it appears in the evaluators candidate feature list, and + # finding the difference between those two numbers. If the difference is + # negative, that is a negative hallucination, not a multi-match, so set + # those to 0 to count correctly + multi_halls_can[feat2]=max(sum(as.numeric( + match_can_count[feat1])-as.numeric(true_can_count[feat2])),0,na.rm=TRUE) + } + # the number of multi-match hallucinations for the trial is the sum of the + # multi-match hallucinations for each of its features + features$num_multi_halls[[i]]=sum(multi_halls_ref,multi_halls_can) + + } + # the above returned the multi-match hallucinations as a list, which is not + # ideal, so convert it to a number + features$num_multi_halls=as.numeric(features$num_multi_halls) + return(features) +} + + + + + +id_hallucinations_class<-function(trial_df,matches_df){ + + ### This function does not work because the class data does not have the + ### generated features in the trial info dataframe + + + + + # this is to differentiate between CT-Pub and CT-Repo; the true reference + # features are stored in different column names between the two + # + # In both cases, take the trial id, reference feature list, candidate feature + # list, and generative model columns, and remove the trials used for 3 shot + # prompting + if ('Paper_BaselineMeasures_Corrected' %in% colnames(trial_df)){ + trial_features<-trial_df %>% + dplyr::select(NCTId,Paper_BaselineMeasures_Corrected,candidate,model) %>% + dplyr::filter(!NCTId %in% c("NCT00000620", "NCT01483560", "NCT04280783")) + colnames(trial_features)<-c('trial_id','true_ref_features', + 'true_can_features','model') + } else { + trial_features<-trial_df %>% + dplyr::select(NCTId,API_BaselineMeasures_Corrected,candidate,model) %>% + dplyr::filter(!NCTId %in% c("NCT00000620", "NCT01483560", "NCT04280783")) + colnames(trial_features)<-c('trial_id','true_ref_features', + 'true_can_features','model') + } + + # remove factors from matches df; it was giving me some issues when trying to + # get rid of NAs, so this fixed that + 
full_matches<-data.frame(matrix(unlist(matches_df),nrow=nrow(matches_df)), + stringsAsFactors=FALSE) + colnames(full_matches)<-c('trial_id','model','reference','candidate') + + # extract the reference features for each trial according to the evaluator + eval_ref_features <- full_matches %>% + dplyr::select(trial_id,model,reference) %>% + dplyr::filter(!trial_id%in%c("NCT00000620","NCT01483560","NCT04280783"))%>% + drop_na() + + # surround these features in backticks and add a comma and space after each + eval_ref_features$reference<-paste0("`",eval_ref_features$reference,"`, ") + + # roll up the evaluators reference feature list into a df with 1 row for each + # trial instance + eval_ref_features<-eval_ref_features %>% + dplyr::group_by(trial_id,model) %>% + dplyr::mutate(match_ref_features=paste0(reference,collapse="")) %>% + dplyr::select(trial_id,model,match_ref_features) %>% + dplyr::distinct() + + # extract the candidate features for each trial according to the evaluator + eval_can_features <- full_matches %>% + dplyr::select(trial_id,model,candidate) %>% + dplyr::filter(!trial_id%in%c("NCT00000620","NCT01483560","NCT04280783"))%>% + drop_na() + + # surround these features in backticks and add a comma and space after each + eval_can_features$candidate<-paste0("`",eval_can_features$candidate,"`, ") + + # roll up the evaluators candidate feature list into a df with 1 row for each + # trial instance + eval_can_features<-eval_can_features %>% + dplyr::group_by(trial_id,model) %>% + dplyr::mutate(match_can_features=paste0(candidate,collapse="")) %>% + dplyr::select(trial_id,model,match_can_features) %>% + dplyr::distinct() + + # combine the dfs with the true features and evaluator-reported features + features<-merge(merge(trial_features,eval_ref_features),eval_can_features) + + # loop through each row of this df to count each of the 3 types of + # hallucinations + for (i in 1:nrow(features)){ + # calculate addition hallucinations by counting how many reference features + # the evaluator reported, counting how many of the reference features the + # evaluator reported are in the true reference feature list, then finding + # the difference between those two numbers, then doing the same thing for + # the candidate features, and summing those 2 final numbers + features$num_pos_halls[i]<-(length(extract_elements_v2( + features$match_ref_features[[i]]))-sum(extract_elements_v2( + features$match_ref_features[[i]]) %in% extract_elements_v2( + features$true_ref_features[[i]])))+(length(extract_elements_v2( + features$match_can_features[[i]]))-sum(extract_elements_v2( + features$match_can_features[[i]]) %in% extract_elements_v2( + features$true_can_features[[i]]))) + # calculate removal hallucinations by counting how many true reference + # features there were, counting how many true reference features were + # reported by the evaluator, then finding the difference between those two + # numbers, then doing the same thing for the candidate features, and summing + # those 2 final numbers + features$num_neg_halls[i]<-(length(extract_elements_v2( + features$true_ref_features[[i]]))-sum(extract_elements_v2( + features$true_ref_features[[i]]) %in% extract_elements_v2( + features$match_ref_features[[i]])))+(length(extract_elements_v2( + features$true_can_features[[i]]))-sum(extract_elements_v2( + features$true_can_features[[i]]) %in% extract_elements_v2( + features$match_can_features[[i]]))) + + # calculate the multi-match hallucinations + # create a table of counts for each true reference feature + 
true_ref_count=table(extract_elements_v2(features$true_ref_features[[i]])) + # create a table of counts for each reference feature according to the + # evaluator + match_ref_count=table(extract_elements_v2(features$match_ref_features[[i]])) + # initialize the reference multi-match hallucination counter + multi_halls_ref=c() + # loop through each true reference feature + for (feat1 in extract_elements_v2(features$true_ref_features[[i]])){ + # calculate the multi-match hallucinations for that feature by counting + # how many times it appears in the true reference feature list, counting + # how many times it appears in the evaluators reference feature list, and + # finding the difference between those two numbers. If the difference is + # negative, that is a negative hallucination, not a multi-match, so set + # those to 0 to count correctly + multi_halls_ref[feat1]=max(sum(as.numeric( + match_ref_count[feat1])-as.numeric(true_ref_count[feat1])),0,na.rm=TRUE) + } + + # create a table of counts for each true candidate feature + true_can_count=table(extract_elements_v2(features$true_can_features[[i]])) + # create a table of counts for each candidate feature according to the + # evaluator + match_can_count=table(extract_elements_v2(features$match_can_features[[i]])) + # initialize the reference multi-match hallucination counter + multi_halls_can=c() + # loop through each true reference feature + for (feat2 in extract_elements_v2(features$true_can_features[[i]])){ + # calculate the multi-match hallucinations for that feature by counting + # how many times it appears in the true candidate feature list, counting + # how many times it appears in the evaluators candidate feature list, and + # finding the difference between those two numbers. If the difference is + # negative, that is a negative hallucination, not a multi-match, so set + # those to 0 to count correctly + multi_halls_can[feat2]=max(sum(as.numeric( + match_can_count[feat1])-as.numeric(true_can_count[feat2])),0,na.rm=TRUE) + } + # the number of multi-match hallucinations for the trial is the sum of the + # multi-match hallucinations for each of its features + features$num_multi_halls[[i]]=sum(multi_halls_ref,multi_halls_can) + + } + # the above returned the multi-match hallucinations as a list, which is not + # ideal, so convert it to a number + features$num_multi_halls=as.numeric(features$num_multi_halls) + return(features) + + + +} + +``` + +```{r } +# Code to read in data if appropriate. 
pub_data <- readRDS("../../Data/Hallucinations/ct_pub/CT_Pub_data.Rds")
pub_matches <- readRDS("../../Data/Hallucinations/ct_pub/trials.matches.Rds")
pub_responses <- readRDS("../../Data/Hallucinations/ct_pub/trials.responses.Rds")

metrics <- readRDS("../../StudentData/CT-Pub-Hallucination-Metrics.Rds")

trial_df <- read.csv("../../CTBench_source/corrected_data/ct_pub/CT-Pub-With-Examples-Corrected-allgen.csv", stringsAsFactors = FALSE)
matches_df <- read.csv("../../CTBench_source/corrected_data/ct_pub/CT-Pub-With-Examples-Corrected-allgpteval.csv", stringsAsFactors = FALSE)

# Count hallucinations for every trial and model combination
hall_data <- id_hallucinations_v2(trial_df, matches_df)


# Create a mapping table for model names
model_mapping <- data.frame(
  hall_data_model_name = c("gpt4-omni-ts", "gpt4-omni-zs", "llama3-70b-in-ts", "llama3-70b-in-zs"), # Names in hall_data
  metrics_model_name = c("gpt4o_ts_gen_hal", "gpt4o_zs_gen_hal", "llama3_70b_it_ts_gen_hal", "llama3_70b_it_zs_gen_hal") # Corresponding names in metrics
)

# Join hall_data with model_mapping to standardize model names
hall_data_standardized <- hall_data %>%
  left_join(model_mapping, by = c("model" = "hall_data_model_name")) %>%
  mutate(model_name = metrics_model_name) %>% # Replace model_name with standardized name
  select(-metrics_model_name) # Drop the temporary column

# Perform a full join on trial_id and standardized model_name
combined_data <- full_join(
  hall_data_standardized, metrics,
  by = c("trial_id" = "NCTId", "model_name" = "Generation Model")
)

# Step 1: Summarize total hallucinations by trial group and model
hallucinations_by_trial_group <- combined_data %>%
  group_by(TrialGroup, model_name) %>%
  summarise(
    total_positive_hallucinations = sum(num_pos_halls, na.rm = TRUE),
    total_negative_hallucinations = sum(num_neg_halls, na.rm = TRUE),
    total_multimatch_hallucinations = sum(num_multi_halls, na.rm = TRUE)
  )

# Step 2: Reshape the data for visualization
hallucinations_long <- hallucinations_by_trial_group %>%
  pivot_longer(
    cols = starts_with("total_"),
    names_to = "hallucination_type",
    values_to = "count"
  ) %>%
  mutate(hallucination_type = recode(hallucination_type,
                                     "total_positive_hallucinations" = "Positive",
                                     "total_negative_hallucinations" = "Negative",
                                     "total_multimatch_hallucinations" = "Multimatch"))
```


## 3.2 Contribution

_State if this section is sole work or joint work. If joint work describe who you worked with and your contribution. You can also indicate any work by others that you reused._

This section is my own work, building on the hallucination analysis that Corey developed; in particular, the hallucination-counting functions used here are based on his functions.R listed in Section 3.1.


## 3.3 Methods Description

_Describe the data analytics methods you used and why you chose them. Discuss your data analytics "pipeline" from *data preparation* and *experimental design*, to *methods*, to *results*. Were you able to use pre-existing implementations? If the techniques required user-specified parameters, how did you choose what parameter values to use?_

The pipeline for this finding has four stages. First, the corrected CT_Pub trial data and the corresponding evaluator match data (the `CT-Pub-With-Examples-Corrected-allgen.csv` and `CT-Pub-With-Examples-Corrected-allgpteval.csv` files read above) are loaded, and `id_hallucinations_v2()` compares the evaluator-reported feature lists against the corrected reference and candidate feature lists to count three types of hallucinations (positive, negative, and multimatch; defined in Section 3.4) for every trial and model combination. Second, these counts are joined with the trial-group labels from CT-Pub-Hallucination-Metrics.Rds so that each count is associated with a disease area. Third, the counts are summarized by model and by trial group and visualized with bar and line charts. Finally, Kruskal-Wallis tests are used to check whether the distribution of counts differs across models and across trial groups; this nonparametric test was chosen because the hallucination counts are skewed count data for which normality assumptions are questionable. No user-specified tuning parameters were required beyond the conventional 0.05 significance threshold.


## 3.4 Result and Discussion

_For each result, state the method used. Run the code to perform it here (or state how it was run if run elsewhere). Provide relevant visual illustrations of findings such as tables and graphs. Then discuss the result. Repeat as necessary. Remember that readers will only read text and results and not code._

This section examines each model's hallucinations.
In the context of language models for clinical study design, hallucinations are outputs that deviate from the expected or correct response. We distinguish three primary types: positive, negative, and multimatch.

Positive hallucinations occur when the model invents information that is not present in the reference data, for example generating a demographic feature like "Height" when it is not included in the trial's baseline features. These hallucinations can mislead researchers by introducing irrelevant or non-existent features into the study design. Negative hallucinations involve the omission of information that is explicitly present in the reference data; for example, if the baseline features include "Age" and "Sex/Gender" but the model only outputs "Age," the omission of "Sex/Gender" is a negative hallucination, leading to an incomplete study design. Multimatch hallucinations occur when a single reference feature is incorrectly matched to multiple generated features (or vice versa), violating the expected one-to-one correspondence; for instance, if the reference includes "Race" and the model outputs both "Race" and "Ethnicity" as separate matched features, the result is redundant or conflicting information. Each type of hallucination poses its own challenges and can undermine the reliability of the model's output, so quantifying them is important for assessing how dependable LLM-generated trial designs are.

We first compare the models by their total numbers of hallucinations of each type.

```{r}
# Summarize data to get the total number of hallucinations per model
hallucinations_by_model_temp <- hall_data %>%
  group_by(model) %>%
  summarise(total_positive_hallucinations = sum(num_pos_halls),
            total_negative_hallucinations = sum(num_neg_halls),
            total_multimatch_halls = sum(num_multi_halls))

# Reshape data to long format for combined bar plot
hallucinations_long_temp <- hallucinations_by_model_temp %>%
  pivot_longer(cols = starts_with("total_"), names_to = "hallucination_type", values_to = "count") %>%
  mutate(hallucination_type = recode(hallucination_type,
                                     "total_positive_hallucinations" = "Positive",
                                     "total_negative_hallucinations" = "Negative",
                                     "total_multimatch_halls" = "Multimatch"))

# Create combined bar chart of total hallucinations by model and type
ggplot(hallucinations_long_temp, aes(x = model, y = count, fill = hallucination_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total Hallucinations by Model and Type for CT_Pub", x = "Model", y = "Total Hallucinations") +
  scale_fill_manual(values = c("steelblue", "tomato", "purple"),
                    name = "Hallucination Type") +
  theme_minimal()
```

Overall, as the bar chart shows, most of the models follow a similar pattern: with the exception of llama3-zs, each model has the fewest multimatch hallucinations, the most positive hallucinations, and a number of negative hallucinations in between. All the models have roughly the same number of negative hallucinations, with llama3-zs having the fewest and gpt4o-ts the most. For multimatch hallucinations, interestingly, gpt4o-ts has the fewest and llama3-zs the most. Finally, for positive hallucinations, gpt4o-zs performs the worst while llama3-zs performs the best.
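The exact totals behind the bar chart are shown below as a table; this is purely presentational and reuses the summary data frame built in the previous chunk.

```{r}
# Display the per-model hallucination totals from the previous chunk as a table
kable(hallucinations_by_model_temp,
      col.names = c("Model", "Positive", "Negative", "Multimatch"),
      caption = "Total hallucination counts by model and type for CT_Pub")
```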
With this in mind, let's dig deeper and look at how the trial group (disease area) of each trial affects the number of hallucinations.

```{r}
# Line plot of total hallucination counts by trial group, faceted by hallucination type
ggplot(hallucinations_long, aes(x = TrialGroup, y = count, color = model_name, group = model_name)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  facet_wrap(~ hallucination_type, scales = "free_y") +
  labs(title = "Total Hallucinations by Trial Group and Model",
       x = "Trial Group (Disease)",
       y = "Total Hallucination Counts",
       color = "Model") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

This figure shows the total hallucination counts for each trial group and model. To determine whether the apparent differences are statistically meaningful, we run Kruskal-Wallis tests on the counts for each hallucination type, grouping once by trial group and once by model.

```{r}
# Create a list to store the results
results <- list()

# Define hallucination types
hallucination_types <- c("Positive", "Negative", "Multimatch")

# Loop through each hallucination type and perform Kruskal-Wallis tests
for (type in hallucination_types) {
  # Filter data for the current hallucination type
  hallucination_data <- hallucinations_long %>%
    filter(hallucination_type == type)

  # Kruskal-Wallis test for TrialGroup
  kruskal_trialgroup <- kruskal.test(count ~ TrialGroup, data = hallucination_data)
  p_value_kruskal_trialgroup <- kruskal_trialgroup$p.value

  # Kruskal-Wallis test for model_name
  kruskal_model <- kruskal.test(count ~ model_name, data = hallucination_data)
  p_value_kruskal_model <- kruskal_model$p.value

  # Add Kruskal-Wallis results to the list
  results <- append(results, list(
    list(Hallucination_Type = type, Test = "Kruskal-Wallis", Group = "TrialGroup", p_value = p_value_kruskal_trialgroup),
    list(Hallucination_Type = type, Test = "Kruskal-Wallis", Group = "Model", p_value = p_value_kruskal_model)
  ))
}

# Convert the list of results to a data frame
results_df <- do.call(rbind, lapply(results, as.data.frame))

# Display results as a table using knitr::kable
kable(results_df, col.names = c("Hallucination Type", "Test", "Group", "p-value"),
      caption = "Kruskal-Wallis Test Results for Hallucination Types by Trial Group and Model")
```

The tests give p-values below 0.05 for positive hallucinations grouped by model and for negative hallucinations grouped by trial group. This suggests that positive hallucinations are largely model-dependent, so minimizing them is mainly a matter of choosing a better model. Negative hallucinations, on the other hand, depend on the trial group, which could indicate that trials that are hard for one LLM tend to be hard for all of them.

## 3.5 Conclusions, Limitations, and Future Work

**Discuss the significance of your finding. Discuss any limitations that should be addressed in the future. Give suggestions for future work.**

This analysis revealed meaningful insights into the behavior of large language models (LLMs) when they generate hallucinations during clinical trial design tasks. The results indicate that positive hallucinations, where models invent features not present in the reference data, are largely model-dependent, with some models (notably llama3-zs) consistently outperforming others. In contrast, negative hallucinations, where models omit expected features, appear to be trial group-dependent, suggesting that certain trial contexts are inherently more challenging for LLMs regardless of the model used.
Multimatch hallucinations, though less frequent, showed variability across models and warrant further exploration.

Despite these findings, there are limitations. The study was confined to a single dataset and a small set of models, which may limit the generalizability of the results. While the Kruskal-Wallis tests provide evidence of differences, further qualitative analysis of the hallucinated features themselves could reveal more nuanced insights. The study also focused on identifying and quantifying hallucinations without implementing corrective measures to address them.

Future work should address these limitations by expanding the dataset to include a wider variety of trial groups and by evaluating additional LLMs. In addition, fine-tuning the prompts with an explicit emphasis on these hallucination types may reduce them across the board.


# X.0 Finding X: Name

_These sections can be duplicated for each finding as needed._

## X.1 Data, Code, and Resources

## X.2 Contribution

## X.3 Methods Description

## X.4 Result and Discussion

## X.5 Conclusions, Limitations, and Future Work


# Bibliography

Provide a listing of references and other sources.

* Citations from literature. Give each reference a unique name combining the first author's last name, the year, and an additional letter if required, e.g. [Bennett22a]. If there is no known author, make something reasonable up.
* Significant R packages used.


# Appendix

*Include here whatever you think is relevant to support the main content of your notebook. For example, you may have included only example figures above in your main text but include additional ones here. Or you may have done a more extensive investigation and want to put more results here to document your work over the semester. Be sure to divide the appendix into appropriate sections and make the contents clear to the reader using the approaches discussed above.*

diff --git a/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.pdf b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.pdf
new file mode 100644
index 0000000..db0951d
Binary files /dev/null and b/StudentNotebooks/Assignment07_DraftFinalProjectNotebook/finalProjectNotebookF24-balajy.pdf differ