There are no packages required for this notebook. Linked .Rmd and .R files in other GitHub repositories contain their own technical instructions and comments.
The project CTEval consists of analytical and technical methods, along with an R Shiny app, for analyzing the ability of large language models (LLMs) to generate, evaluate, and benchmark reference features of clinical trials (CTs) given CT metadata. Specifically, we focused on prompt engineering to make LLMs produce these reference features and on benchmarking the generated features against the true values of known CTs. The evaluation team focused on analyzing these results to find trends specific to certain features, models, and other independent variables of this project. This notebook focuses on the translation of the code originally written in Python to R and on the implementation of the evaluation and benchmarking of LLMs in an R Shiny app. For context, the R Shiny app is a web app built to provide a platform for those in the CT domain to use the aforementioned features in a user-friendly manner.
This report is organized as follows:
Section 3.0: Translation of the CTEval code: The original codebase is written in Python, where all of the generation, evaluation, and benchmarking of LLMs is performed. To allow R code to perform these same functionalities, I focused on translating the pertinent functions and files to R so that these scripts can be run in R, in an effort to make the process of data generation and analysis more concise.
Section 4.0: Implementation of Evaluation and Benchmarking in the R Shiny App: This section will cover the results of implementing the evaluation and benchmarking features into the R Shiny app.
Section 5.0: Hallucination Mitigation and Prompt Engineering for Llama: This section will cover the prompt engineering methods used to obtain viable results for the evaluation section, specifically from the Llama LLM.
Section 6.0: Overall conclusions and suggestions
Section 7.0: Appendix
The primary goal was to translate the CTEval codebase from Python to R to enhance compatibility with existing R-based workflows. Some questions driving this effort were: How effectively can Python logic and constructs be translated to R? What modifications were necessary to preserve performance and functionality?
The approach consisted of analyzing the Python code to identify core functionalities, including data manipulation, feature matching, and evaluation logic. By leveraging R-specific libraries like dplyr and purrr, I was able to mimic the Python logic.
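As a brief illustration of the kind of mapping involved (a hypothetical snippet with placeholder data and column names, not code from the CTEval codebase), a pandas-style filter-and-apply step translates naturally to dplyr verbs and a purrr mapper:

library(dplyr)
library(purrr)

# Toy data standing in for a generation-results table (placeholder columns).
df <- tibble(
  model    = c("gpt", "llama"),
  features = list(c("Age", "Sex"), c("Age", "BMI", "Race"))
)

# Python: df[df["model"] == "gpt"]["features"].apply(len)
feature_counts <- df %>%
  filter(model == "gpt") %>%                      # boolean-indexing equivalent
  mutate(n_features = map_int(features, length))  # element-wise apply over a list-column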
The outcome was that the functionality of the Python code was successfully replicated and preserved in R.
The main datasets that were used are the CT_Pub_updated.df and CT_Repo_updated.df dataframes. Specifics of each dataframe can be analyzed in the linked .Rmd file.
The code translated for this section is work done solely by me. However, Soumeek Mishra also worked on the translation of the codebase from Python to R, and his code should contain slight differences in the ability to make API calls to a separate set of LLMs.
The work I did for this section did not involve any analytical methods.
Methods I used for the translation included analyzing the functions and goals of the original Python codebase, researching which R libraries could provide support similar to the structures used in Python (API calls, pandas data structures), and using these to produce R code that emulates the results of the original codebase.
This section will contain the results of the major R functions derived from translating the CTEval codebase to R. For reference, the source file can be found here: CTBench_LLM_prompt.Rmd link
The following is a result of running the generation prompt on an LLM for both single-shot and triple-shot generation, specifically run on trial NCT00395746 using the generation model Meta-Llama-3.1-8B-Instruct.
Batch generation is the process of obtaining the generated results for different trials in a "batch", or in bulk, such that features are generated for the whole dataset of trials. The inputs include the dataframe (CT_Pub_updated.df or CT_Repo_updated.df) and the specific model used to generate the candidate features.
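A minimal sketch of such a batch call is shown below; generate_features() is a stand-in name for the single-trial generation call defined in CTBench_LLM_prompt.Rmd, and the gen_response column name is illustrative.

library(purrr)

# Sketch only: apply the single-trial generation call to every row of the
# trial dataframe and store the raw LLM response alongside the metadata.
batch_generate <- function(trials.df, model_name) {
  trials.df$gen_response <- map_chr(
    seq_len(nrow(trials.df)),
    function(i) generate_features(trials.df[i, ], model_name)  # one API call per trial
  )
  trials.df
}

# e.g. results.df <- batch_generate(CT_Pub_updated.df, "Meta-Llama-3.1-8B-Instruct")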
When running batch generation on multiple CT trials, the resulting metadata and generation results are stored in a dataframe with the following format, with additional columns added for evaluation and benchmarking. In more detail, whereas the original CTBench representation contains the NCT ID, generation model, and generated candidate features, the dataframe I developed also contains the matching model, the number of matches, the number of unmatched reference features, the number of unmatched candidate features, and precision, recall, and F1 columns. This provides a common dataframe containing the majority of trial information, which can be parsed to create more use-case-friendly dataframes.
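An illustrative skeleton of this layout is sketched below; the column names are paraphrased from the description above, and the actual names in CTBench_LLM_prompt.Rmd may differ slightly.

# Skeleton of the combined results dataframe (illustrative column names).
results.df <- data.frame(
  NCTId              = character(),  # trial identifier
  gen_model          = character(),  # generation model
  candidate_features = character(),  # generated candidate features
  match_model        = character(),  # matching / evaluator model
  len_matches        = integer(),    # number of matched feature pairs
  len_reference      = integer(),    # number of unmatched reference features
  len_candidate      = integer(),    # number of unmatched candidate features
  precision          = numeric(),
  recall             = numeric(),
  f1                 = numeric(),
  stringsAsFactors   = FALSE
)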
This is an example of the output after running the evaluation prompt on a single trial (using generated output from single-shot generation, with Meta-Llama-3.1-8B-Instruct as the matching LLM). The output consists of an R dataframe object with the matched features.
A second portion contains the unmatched reference features.
The last portion contains the unmatched candidate features.
Batch evaluation, similar to batch generation, runs the evaluation algorithm on all trials contained in the pertinent dataframe.
When running batch evaluation, the number of matches, unmatched reference features, and unmatched candidate features are stored in the aforementioned dataframe created when running batch generation.
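As a sketch, the counts can be written back into the batch dataframe as follows; evaluate_trial() is a stand-in for the evaluation call in the .Rmd, assumed here to return the parsed JSON produced by the evaluator LLM (the field names match the JSON format requested by the evaluation prompts shown later).

# Sketch only: run the evaluation for each trial and store the counts used
# later for benchmarking.
batch_evaluate <- function(results.df, eval_model) {
  for (i in seq_len(nrow(results.df))) {
    eval_out <- evaluate_trial(results.df[i, ], eval_model)
    results.df$len_matches[i]   <- NROW(eval_out$matched_features)                # matched pairs (list of pairs or two-column matrix)
    results.df$len_reference[i] <- length(eval_out$remaining_reference_features)  # unmatched reference features
    results.df$len_candidate[i] <- length(eval_out$remaining_candidate_features)  # unmatched candidate features
  }
  results.df
}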
Lastly, one can perform benchmarking to retrieve the associated precision, recall, and F1 scores for the generation and evaluation of multiple different trials. Similar to both previous examples of batch generation and evaluation, the benchmark metrics will be derived for all trials in the pertinent dataframe, assuming that the relevant information is available; specifically, that the len_matches, len_reference, and len_candidate columns are populated.
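Given those counts, the benchmark metrics follow directly. The sketch below assumes, as described above, that len_reference and len_candidate hold the unmatched counts.

# Per-trial benchmarking from the stored counts.
benchmark_row <- function(len_matches, len_reference, len_candidate) {
  precision <- len_matches / (len_matches + len_candidate)   # matched / all candidate features
  recall    <- len_matches / (len_matches + len_reference)   # matched / all reference features
  f1        <- ifelse(precision + recall > 0,
                      2 * precision * recall / (precision + recall), 0)
  c(precision = precision, recall = recall, f1 = f1)
}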
The translation of the CTEval codebase from Python to R significantly enhances its integration with existing R-based workflows in clinical trial evaluations. While the core functionality of the Python code was successfully preserved, the translation lacks sufficient abstraction, which limits its scalability and ease of use. Future work should focus on creating an object-oriented design that encapsulates all functionalities, enabling better flexibility and extensibility. Additionally, performance benchmarking and optimization should be conducted, especially for handling large datasets.
The major finding was the successful integration of evaluation and benchmarking features into an R Shiny app, enabling an interactive and dynamic environment for analyzing data. The primary questions driving this effort were: How effectively can R Shiny support real-time data generation and benchmarking? How can evaluation metrics be seamlessly incorporated into a user-friendly interface?
The approach involved designing a user-centric interface that could display key evaluation metrics and benchmark results dynamically. Core functionalities like data manipulation and metric calculation were implemented using R Shiny’s reactive framework.
The outcome is a fully functional Shiny app that allows users to evaluate and compare data efficiently. It provides real-time updates and intuitive controls, enhancing both usability and analytical capabilities.
The main datasets that were used are the CT_Pub_updated.df and CT_Repo_updated.df dataframes. Specifics of each dataframe can be analyzed in the linked app.R file.
The web application can simply be launched locally with the following command:

shiny::runApp("app.R")
This app was built in collaboration with Xiheng Liu and Tianyan Lin. Both Xiheng and Tianyan contributed to the bulk of the web application development, developing both the structure and implementations of LLM generation.
Given the codebase structure developed by the two, I implemented the LLM evaluation and benchmarking capabilities. Because the codebase has many moving parts, I worked closely with both Xiheng and Tianyan.
The development of the LLM evaluation and benchmarking features did not rely on analytical methods but instead focused on adapting the functionality to an R Shiny app environment. This process involved careful consideration of how the interactive nature of Shiny apps could be leveraged to enhance usability.
The approach began with analyzing the goals of the original evaluation framework, particularly how user inputs and dynamic updates would be handled. One of the key challenges was ensuring smooth data flow between reactive elements, such as user-generated prompts and real-time results. The intricacies of managing reactivity and state within the Shiny app required precise structuring of the code to avoid unintended updates or performance lags.
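A stripped-down illustration of that pattern (hypothetical input and output IDs, not the app's actual code) is an event-driven reactive that only recomputes when the user explicitly triggers it:

library(shiny)

# Minimal sketch: eventReactive ties recomputation to the button click, so
# typing in the candidate-features box alone never triggers a re-evaluation.
server <- function(input, output, session) {
  evaluation <- eventReactive(input$run_evaluation, {
    list(candidates = input$candidate_features,
         evaluator  = input$evaluator_llm)
  })

  output$status <- renderText({
    paste("Last evaluation used:", evaluation()$evaluator)
  })
}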
Much of the foundational work from Section 3 was reused to streamline the implementation. Functions developed earlier were integrated to handle core operations like data manipulation and evaluation logic, allowing the focus to shift toward building a responsive, user-friendly interface. This method ensured that the essential features of the evaluation process were preserved while adapting them to the interactive capabilities of the Shiny app.
This section will contain a step-by-step walkthrough of the Evaluation and Benchmarking portion of the R Shiny app.
When first loading into the web application, there is a navigation bar located at the top of the screen that contains an “Evaluate” tab amongst the other functionalities included in the app.
On the home screen, there is an “Evaluate” tab on the navigation bar located at the top of the page. By clicking on the tab, one can navigate to the Evaluation portion of the app.
On the evaluation page there are two sidebar tabs. The first is the Options tab, which contains a dropdown menu for the Evaluator LLM choice (set to “Meta-Llama-3.1-8B-Instruct” by default), a “Run Evaluation” button to perform the evaluation and benchmarking, and lastly a text box containing the generated candidate features from the “Generate Descriptors” portion of the app.
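A hedged sketch of how such a sidebar can be declared in Shiny is shown below; the input IDs and labels are illustrative, and the real layout lives in app.R.

library(shiny)

# Options tab sidebar: evaluator choice, run button, and the generated
# candidate features carried over from the "Generate Descriptors" page.
evaluate_sidebar <- sidebarPanel(
  selectInput("evaluator_llm", "Evaluator LLM",
              choices  = c("Meta-Llama-3.1-8B-Instruct"),
              selected = "Meta-Llama-3.1-8B-Instruct"),
  actionButton("run_evaluation", "Run Evaluation"),
  textAreaInput("candidate_features", "Generated candidate features", rows = 8)
)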
The Evaluator LLM list contains all the supported LLMs that can be used to evaluate the generated candidate descriptors against the specified clinical trial’s original reference features. There are multiple prompts used for different models to mitigate issues in the evaluation stage.
When the “Run Evaluation” button is clicked, multiple steps are performed by the server logic of the application. First, the metadata of the trial selected in the “Specify Trial” section of the app is parsed. With this parsed information, multiple system and user prompts are constructed depending on the evaluating LLM. Next, an API call is made to the specified LLM. Once the output is received, multiple error checks and helper functions are used either to fix any invalid JSON output or to retry the evaluation. These results are then stored in reactive values (“reactiveVal” objects), which are dynamic containers provided by the shiny library.
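The following is a simplified outline of this flow; helper names such as build_eval_prompts(), call_llm(), and fix_json(), as well as the selected_trial() reactive, are stand-ins for the corresponding pieces of app.R and functions.R.

# Inside the app's server function (sketch only).
eval_results <- reactiveVal(NULL)

observeEvent(input$run_evaluation, {
  trial_meta <- selected_trial()                                 # metadata from "Specify Trial"
  prompts    <- build_eval_prompts(trial_meta, input$evaluator_llm)
  raw        <- call_llm(prompts, model = input$evaluator_llm)   # API call to the evaluator LLM
  parsed     <- tryCatch(jsonlite::fromJSON(raw),
                         error = function(e) jsonlite::fromJSON(fix_json(raw)))  # repair invalid JSON
  eval_results(parsed)                                           # store in a reactive value
})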
Once the evaluation completes, we can navigate to the “Report Page”. This page includes multiple different results and metrics.
The first of these is the True Matches list, which contains the set of original reference features that are matched with the generated candidate features.
The second section contains features that were hallucinated by the Evaluator LLM.
The third section contains Unmatched Reference Features, which include all the original features in the clinical trial that have no matching counterpart in the generated features.
The fourth section contains Unmatched Candidate Features, which include all generated features that have no corresponding matches in the set of original reference features.
Lastly, the performance of the Evaluator LLM is represented by three metrics: Precision, Recall, and F1 Score.
As a final option, users can download the resulting evaluation categories and benchmarking metrics (True Matches, Hallucinations, Unmatched Generated Features, Unmatched Reference Features, Precision, Recall, and F1).
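A minimal sketch of how such a download can be wired up is shown below; the output ID is hypothetical, and the app's actual handler lives in app.R.

# Inside the server function (sketch only): download the stored evaluation
# results as a JSON file; eval_results() is the reactive value sketched earlier.
output$download_results <- downloadHandler(
  filename = function() paste0("evaluation_", Sys.Date(), ".json"),
  content  = function(file) {
    jsonlite::write_json(eval_results(), file, auto_unbox = TRUE, pretty = TRUE)
  }
)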
The implementation of LLM evaluation and benchmarking in an R Shiny app successfully demonstrated the ability to dynamically evaluate and compare model outputs in an interactive environment. By integrating real-time user input and reactive displays, the app provided a flexible platform for exploring the capabilities of LLMs in generating clinical trial metadata. The work highlighted the importance of adapting previously developed code for use in a dynamic interface, maintaining consistency with the original evaluation goals while offering a more user-friendly experience.
However, significant limitations were encountered, primarily due to the volatile nature of LLM responses. LLM outputs can vary widely across different runs, even when using the same prompts and configurations. This inconsistency posed challenges for benchmarking, as reproducibility is a critical factor in evaluation. To mitigate this, multiple parsing and cleaning methods of generated responses were employed to obtain more robust results. Aggregating these outputs helped reduce variability, providing a more reliable basis for comparison.
Future work should focus on improving abstraction and automation within the Shiny app. Currently, the app relies on distinct functions for various tasks, but creating an overarching object or framework to encapsulate the entire evaluation and benchmarking process could streamline future enhancements. Additionally, exploring ways to further stabilize LLM responses—potentially by fine-tuning the models or implementing advanced prompt engineering techniques—could enhance the reliability and consistency of the results. This would strengthen the app’s utility as a benchmarking tool for LLM performance in clinical trial evaluations.
CT_Pub.df <- readRDS("../../CTBench_source/corrected_data/ct_pub/CT_Pub_data_updated.Rds")
summary(CT_Pub.df, 2)

    NCTId             BriefTitle        EligibilityCriteria  BriefSummary        Conditions        Interventions      PrimaryOutcomes
 Length:103         Length:103         Length:103           Length:103         Length:103         Length:103         Length:103
 Class :character   Class :character   Class :character     Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character     Mode  :character   Mode  :character   Mode  :character   Mode  :character
 API_BaselineMeasures   API_BaselineMeasures_Corrected   Paper_BaselineMeasures   Paper_BaselineMeasures_Corrected
 Length:103             Length:103                       Length:103               Length:103
 Class :character       Class :character                 Class :character         Class :character
 Mode  :character       Mode  :character                 Mode  :character         Mode  :character
The original evaluation prompt from the CTBench codebase can be found in the module.py file: module.py link
Helper functions to fix the output of Evaluator LLMs can be found here: CTBench_LLM_prompt.Rmd link.
The R Shiny app that employs techniques to remove hallucinations can be found here: app.R link
Lastly, the function “RemoveHallucinations_v2” can be found here: functions.R
The prompt engineering for this task was independently designed and developed by me. Hallucination mitigation techniques and tools were developed by Corey Curran.
The methods applied in this work involved a structured and iterative approach to prompt development. Initially, model behavior was analyzed by testing the initial prompts with sample inputs to evaluate the quality and structure of the outputs. This step provided insights into how effectively the prompts guided the models in producing accurate and coherent results. Following this analysis, the prompts were refined iteratively to address identified shortcomings, such as instances of invalid JSON outputs or incomplete feature matching. Finally, a cross-model comparison was conducted to ensure that the prompts maintained consistency in functionality and output format across different models, despite inherent differences in syntax and interpretive emphasis.
For the functions used to remove hallucinations from the Evaluator LLM’s output, I employed the R functions Corey developed. Given that they had already been implemented, my task was simply to make sure all passed parameters were in the correct format.
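For illustration, preparing the parsed evaluator output for RemoveHallucinations_v2 can look like the sketch below. The objects llm_output, trial$reference_features, and parsed_generation$candidates are hypothetical placeholders; per its comments, the function expects a two-column structure of matches plus the original reference and candidate feature vectors.

library(jsonlite)

# Parse the evaluator LLM's JSON reply, keeping the matches as a list of pairs.
parsed <- fromJSON(llm_output, simplifyMatrix = FALSE)

# Coerce into the shapes RemoveHallucinations_v2 expects:
Matches       <- do.call(rbind, parsed$matched_features)  # two-column matrix: reference, candidate
ReferenceList <- trial$reference_features                 # original reference features (character vector)
CandidateList <- parsed_generation$candidates             # generated candidate features (character vector)

res <- RemoveHallucinations_v2(Matches, ReferenceList, CandidateList)
res$precision; res$recall; res$f1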
Through multiple iterations of testing, a prompt for Meta-Llama-3.1-8B-Instruct was developed. Tests were run on both the GPT and Llama prompts; however, only Llama needed modifications to its prompt to produce borderline acceptable results.
The following are the two prompts for reference:
systemPromptText_Evaluation.gpt <- "You are an expert assistant in the medical domain and clinical trial design. You are provided with details of a clinical trial. Your task is to determine which candidate baseline features match any feature in a reference baseline feature list for that trial. You need to consider the context and semantics while matching the features.

For each candidate feature:

1. Identify a matching reference feature based on similarity in context and semantics.
2. Remember the matched pair.
3. A reference feature can only be matched to one candidate feature and cannot be further considered for any consecutive matches.
4. If there are multiple possible matches (i.e. one reference feature can be matched to multiple candidate features or vice versa), choose the most contextually similar one.
5. Also keep track of which reference and candidate features remain unmatched.

Once the matching is complete, provide the results in a JSON format as follows:
{ \"matched_features\":
  [[ \"<reference feature 1>\", \"<candidate feature 1>\"],
   [ \"<reference feature 2>\", \"<candidate feature 2>\"]],
\"remaining_reference_features\":
  [\"<unmatched reference feature 1>\", \"<unmatched reference feature 2>\"],
\"remaining_candidate_features\":
  [\"<unmatched candidate feature 1>\", \"<unmatched candidate feature 2>\"]}

Don't give code, just return the result."

systemPromptText_Evaluation.llama <- "
  You are an expert assistant in the medical domain and clinical trial design. You are provided with details of a clinical trial.
  Your task is to determine which candidate baseline features match any feature in a reference baseline feature list for that trial.
  You need to consider the context and semantics while matching the features.

  For each candidate feature:

  1. Identify a matching reference feature based on similarity in context and semantics.
  2. Remember the matched pair.
  3. A reference feature can only be matched to one candidate feature and cannot be further considered for any consecutive matches.
  4. If there are multiple possible matches (i.e. one reference feature can be matched to multiple candidate features or vice versa), choose the most contextually similar one.
  5. Also keep track of which reference and candidate features remain unmatched.
  6. DO NOT provide the code to accomplish this and ONLY respond with the following JSON. Perform the matching yourself.
  Once the matching is complete, omitting explanations provide the answer only in the following form:
  {\"matched_features\": [[\"<reference feature 1>\" , \"<candidate feature 1>\" ],[\"<reference feature 2>\" , \"<candidate feature 2>\"]],\"remaining_reference_features\": [\"<unmatched reference feature 1>\" ,\"<unmatched reference feature 2>\"],\"remaining_candidate_features\" : [\"<unmatched candidate feature 1>\" ,\"<unmatched candidate feature 2>\"]}
  7. Please generate a valid JSON object, ensuring it fits within a single JSON code block, with all keys and values properly quoted and all elements closed. Do not include line breaks within array elements."
Note the differences between the prompt for Meta Llama and the one for GPT.
The first issue present in the Llama results was that the model returned code that could be used to generate the results, rather than the results themselves. This was mitigated by instructions 6 and 7, which repeatedly ask for valid JSON objects and explicitly instruct the model not to produce code.
Lastly, one observation made regarding the Llama model was that it could not discern whether a newline character should be included within the JSON output or whether it was present only for formatting reasons. To mitigate this issue, all newlines near or within the JSON template provided in the prompt were removed.
Note that no changes were made to the prompt for the OpenAI GPT model, as its translation into R did not pose any issues.
To combat the presence of hallucinations in the output of the evaluator, the main function used was “RemoveHallucinations_v2”. For reference, this function and other related hallucination mitigation and removal functions can be found in functions.R:
RemoveHallucinations_v2<-function(Matches,ReferenceList,CandidateList){
  # Matches should be a list containing the matches, with Matches[1] being from
  # the reference list and Matches[2] being from the candidate list
  # ReferenceList should be the true reference feature list
  # CandidateList should be the true candidate feature list
  #
  # Currently, this extracts all true (non-hallucinated) matches, all addition
  # match hallucinations (just the hallucinated feature, not the whole match),
  # and all multi-match hallucinations (again, just the hallucinated feature),
  # and calculates the corrected metrics.

  # count the number of times each feature appears in each list; useful for
  # multi-match hallucination identification
  Rtab<-as.data.frame(table(ReferenceList))
  Ctab<-as.data.frame(table(CandidateList))
  MRtab<-as.data.frame(table(Matches[,1]))
  MCtab<-as.data.frame(table(Matches[,2]))

  # Extract the matches in which both the reference feature and candidate
  # feature are real original features
  TrueMatches<-Matches[(Matches[,1]%in%ReferenceList)&
                         (Matches[,2]%in%CandidateList),,drop=FALSE]
  # Extract the addition hallucinations i.e. all the matched features which were
  # not in the original lists
  AHallucinations<-c(Matches[!(Matches[,1]%in%ReferenceList),1],
                     Matches[!(Matches[,2]%in%CandidateList),2])

  # initialize empty vectors for the indices in which multi-match hallucinations
  # occur...
  Hindices<-c()
  # ...and for the hallucinations themselves
  MHallucinations<-c()
  # loop through the rows of the matches
  if (length(TrueMatches)>0){
    for (Riter in 1:nrow(TrueMatches)){
      feat<-TrueMatches[Riter,1]
      if (MRtab$Freq[MRtab$Var1==feat]>Rtab$Freq[Rtab$ReferenceList==feat]){
        MRtab$Freq[MRtab$Var1==feat]=MRtab$Freq[MRtab$Var1==feat]-1
        MHallucinations<-c(MHallucinations,feat)
        Hindices<-c(Hindices,Riter)
      }
    }
    for (Citer in 1:nrow(TrueMatches)){
      feat<-TrueMatches[Citer,2]
      if (MCtab$Freq[MCtab$Var1==feat]>Ctab$Freq[Ctab$CandidateList==feat]){
        MCtab$Freq[MCtab$Var1==feat]=MCtab$Freq[MCtab$Var1==feat]-1
        MHallucinations<-c(MHallucinations,feat)
        Hindices<-c(Hindices,Citer)
      }
    }
    if (length(Hindices)>0){
      TrueMatches<-TrueMatches[-Hindices,,drop=FALSE]
    }
  }

  Hallucinations<-c(AHallucinations,MHallucinations)

  precision<-max(nrow(TrueMatches),0,na.rm=TRUE)/length(CandidateList)
  recall<-max(nrow(TrueMatches),0,na.rm=TRUE)/length(ReferenceList)
  f1<-max(2*precision*recall/(precision+recall),0,na.rm=TRUE)

  UnmatchedReferenceFeature<-ReferenceList[!(ReferenceList%in%TrueMatches[,1])]
  UnmatchedCandidateFeature<-CandidateList[!(CandidateList%in%TrueMatches[,2])]

  result<-list(TrueMatches=TrueMatches,Hallucinations=Hallucinations,
               UnmatchedReferenceFeature=UnmatchedReferenceFeature,
               UnmatchedCandidateFeature=UnmatchedCandidateFeature,
               precision=precision,recall=recall,f1=f1)

  return(result)
}
In short, this function takes in the generated matches (from the evaluation prompt), the list of original reference features from the relevant clinical trial, and the generated candidate features (from the generation prompt). Addition and multi-match hallucinations are removed, and a data structure containing the true matches (with hallucinations removed), the addition and multi-match hallucinations, the unmatched reference and candidate features, and the recall, precision, and F1 scores is returned.
The prompt engineering process successfully adapted the Meta LLaMA model to perform context-sensitive feature matching within clinical trial datasets. Through iterative refinement, prompts achieved functional similarity, with adjustments tailored to each model’s specific quirks. The resulting outputs aligned closely with expected formats, particularly after addressing challenges such as invalid JSON and unnecessary code generation in the LLaMA model.
Future work will focus on automating the prompt refinement process through feedback loops and extending the scope to include more complex feature matching scenarios. Additionally, efforts will aim to generalize the prompts for use with a broader range of models while preserving their precision and reliability, possibly even fine-tuning for the complete removal of any hallucinations regardless of model.
The translation of the CTEval codebase from Python to R significantly enhanced the handling of clinical trial evaluations within the R ecosystem. This effort retained the core functionality of the original Python implementation while paving the way for integration into R-specific workflows. However, R’s limitations in managing certain complex operations highlighted the need for a more structured and abstracted framework. Addressing these organizational challenges will be crucial for scaling the implementation to handle larger datasets and support additional features effectively.
The development of the R Shiny app for LLM evaluation and benchmarking further contributed to the ease and interactivity of clinical trial evaluations. The app enabled real-time comparison of LLM outputs, allowing users to explore and analyze results dynamically. Despite the utility of this tool, inconsistencies in LLM responses posed challenges that were mitigated by employing methods to clean and combine outputs, thereby improving result reliability. Future iterations of the app should aim to enhance consistency and incorporate more advanced evaluation metrics.
The prompt engineering process successfully adapted the Meta LLaMA model to perform context-sensitive feature matching within clinical trial datasets. Iterative refinement allowed the prompts to address model-specific challenges, such as invalid JSON generation and unnecessary code output, particularly in the LLaMA model. Additionally, R’s inherent difficulties in handling certain prompt complexities were mitigated by simplifying instructions and introducing structured feedback mechanisms. These refinements ensured outputs aligned closely with the required format and functionality. Future work will focus on automating the refinement process, extending the scope to more complex feature matching scenarios, and generalizing prompts to support a wider range of models while maintaining precision and reliability.
For a more in-depth guide on website navigation for the evaluation page, or a code explanation for the evaluation portion, refer to the following Google document: User Guide Link