From 46ef88ca0c8c3e13f333b4f8ecdb8eb4ffecb1f2 Mon Sep 17 00:00:00 2001
From: Kristin Bennett
Date: Sat, 7 Sep 2024 11:19:33 -0400
Subject: [PATCH] assignment 3 and status notebook updates

---
 .../dar-f24-assignment3-template.Rmd        |  212 +-
 .../dar-f24-assignment3-template.html       | 2548 +++++++++++++++--
 .../dar-f24-assignment3-template.pdf        |  Bin 269331 -> 286880 bytes
 StudentNotebooks/StatusNotebookTemplate.Rmd |  203 ++
 4 files changed, 2656 insertions(+), 307 deletions(-)
 create mode 100644 StudentNotebooks/StatusNotebookTemplate.Rmd

diff --git a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd
index 3d6d769..c66f296 100644
--- a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd
+++ b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd
@@ -3,13 +3,13 @@ title: 'CTBench Eval Project Notebook:'
 author: "Your Name Here"
 date: "`r format(Sys.time(), '%d %B %Y')`"
 output:
-  pdf_document: default
-  word_document:
-    toc: true
   html_document:
     toc: true
     number_sections: true
     df_print: paged
+  word_document:
+    toc: true
+  pdf_document: default
 subtitle: DAR Assignment 3 (Fall 2024)
---
```{r setup, include=FALSE}
@@ -162,19 +162,17 @@ You're now ready to start coding Assignment 3!

# DAR ASSIGNMENT 3 (Part 2): Loading the CTBench Eval Datasets

-In this CTBench there are two sorts of data: clinical trial data and results data. For your conveniences, these dataset have been converted to R Rds files regardless of how they appear originally. .
+In CTBench there are two sorts of data: the clinical trial data used to generate the prompts, and the results data that records the LLM responses and how well they match the reference descriptors. For your convenience, these datasets have been converted to R Rds files regardless of their original format.

* **Data Rds:** These are the datasets that describe each clinical trial and give its baseline descriptors. Each row is a different clinical trial.

-  * `CT_Repo_data.Rds` The descriptors here were taken from the clinicaltrial.gov data repository for clinical trials.
+  * `CT_Repo_data.Rds` This contains selected clinical trials and their attributes, taken from the clinicaltrial.gov data repository for clinical trials.

-  * `CT_Pub_data.Rds` These trials also had descriptors taken from the clinical trial publication.
+  * `CT_Pub_data.Rds` This contains a subset of those clinical trials and their attributes, but the descriptors are taken from the clinical trial publications.

-
-
* **Results Rds:**

@@ -194,40 +192,44 @@ Each of these data structures has one row per trial. The features are given below.
-
+
   Number    Name                                 Notes
  ------ ----------------------------------- -----------------------------------------
   [1]   "NCTId"                              Unique ID of each clinical trial
-  [2]   "BriefTitle"
-  [3]   "EligibilityCriteria"
-  [4]   "BriefSummary"
-  [5]   "Conditions"
-  [6]   "Interventions"
-  [7]   "PrimaryOutcomes"
+  [2]   "BriefTitle"                          Title of Trial
+  [3]   "EligibilityCriteria"                 Eligibility Criteria
+  [4]   "BriefSummary"                        Summary of Trials
+  [5]   "Conditions"                          Health Conditions the Trial addresses
+  [6]   "Interventions"                       Intervention (a.k.a treatments)
+  [7]   "PrimaryOutcomes"                     Measure of success or failure of trial
   [8]   "BaselineMeasures"                    Original Features in ClinicalTrial.gov
   [9]   "BaselineMeasures_Processed"          Cleaned up Features used in CTBench

-Table 1: Features of CT_Repot.df
+Table 1: Features of CT_Repo.df

+
   Number    Name                                 Notes
  ------ ----------------------------------- -----------------------------------------
-  [1]   "NCTId"                              Unique ID of each clinical trial
-  [2]   "BriefTitle"
-  [3]   "EligibilityCriteria"
-  [4]   "BriefSummary"
-  [5]   "Conditions"
-  [6]   "Interventions"
-  [7]   "PrimaryOutcomes"
+  [1]   "NCTId"                              Unique ID of each clinical trial
+  [2]   "BriefTitle"                          Title of Trial
+  [3]   "EligibilityCriteria"                 Eligibility Criteria
+  [4]   "BriefSummary"                        Summary of Trials
+  [5]   "Conditions"                          Health Conditions the Trial addresses
+  [6]   "Interventions"                       Intervention (a.k.a treatments)
+  [7]   "PrimaryOutcomes"                     Measure of success or failure of trial
   [8]   "BaselineMeasures"                    Original Features in ClinicalTrial.gov
   [9]   "Paper_BaselineMeasures"              Original Features in trial paper
   [10]  "Paper_BaselineMeasures_Processed"    Cleaned up Features used in CTBench
Table 2: Features of CT_Pub.df.
+For CT_Repo.df, the reference baseline descriptors used in the experiments are in a comma-separated list in "BaselineMeasures_Processed".
+For CT_Pub.df, the reference baseline descriptors used in the experiments are in a comma-separated list in "Paper_BaselineMeasures_Processed".
+
+First we load the data and see that CT_Pub.df contains 103 trials with 10 dimensions and that CT_Repo.df contains 1693 trials with 9 dimensions. Note that 3 of the trials in each dataset are reserved to be used as examples included in the prompts, a process called three-shot learning.
```{r}
# Load the CT_Pub data
CT_Pub.df<- readRDS("/academics/MATP-4910-F24/DAR-CTEval-F24/Data/CT_Pub_data.Rds")
@@ -240,16 +242,15 @@ CT_Repo.df<- readRDS("/academics/MATP-4910-F24/DAR-CTEval-F24/Data/CT_Repo_data.
dim(CT_Repo.df)

-# Review the structure of our data
-# Messy! Un-comment if you want to see the structure...
-#str(CT_Pub.df)
+# Uncomment these if you want to look at a few examples of each dataset, but don't knit them into your final notebook.
+#head(CT_Pub.df,5)

-#str(CT_Repo.df)
+#head(CT_Repo.df,5)
```

## Load the CTBench Eval _Results_

-This is the results file. In theory there should be one row given for each trial x number of models. We show you how to prepare data for CT_Pub. CT_Repo can be done similarly.
+This is the results file. There is one row for each combination of trial and model that was evaluated. We show you how to prepare the data for CT_Pub. CT_Repo can be done similarly.

These are the features in CT_Pub_responses.df.

@@ -317,42 +318,75 @@ dim(CT_Pub.matches.df)

# Look at some samples of each file
# Comment these out before you make the pdf
-#head(CT_Pub.matches.df)
+#head(CT_Pub.matches.df,5)

-#head(CT_Pub.responses.df)
+#head(CT_Pub.responses.df,5)
```

-The same process can be used for CT_Repo. But take care on the variable names.
+The same process can be used for CT_Repo, but be aware that the variable names may be slightly different.

+# Analysis of Response Data
+
+For each clinical trial, the evaluation program (which calls two different LLMs) calculates several measures of how good the candidate descriptors proposed by each LLM are compared to the reference descriptors.
+
+The BERT score is a measure of the similarity of the candidate and reference descriptors in the latent space of the BERT LLM. This is a quick but very rough way to calculate similarity. You can read about BERT scores here: https://arxiv.org/abs/1904.09675
+
+A more accurate evaluation is done by matching each candidate descriptor with at most 1 reference descriptor. This is done using the LLM GPT-4o. Let matches.len = the number of candidate descriptors matched with reference descriptors, candidate.len = the number of unmatched candidate descriptors, and reference.len = the number of unmatched reference descriptors.
+
+Precision measures the proportion of candidate descriptors that were matched. Recall measures the proportion of reference descriptors that were matched by a candidate descriptor. F1 is a standard measure that combines precision and recall. These calculations have already been done for each trial.
+For example, say reference = "Age, sex (OHM), race/ethnicity, cardiovascular disease, socio-economic status" and candidate = "age, gender, race, SBP, DBP, cholesterol".
+There are three matches: (Age, age), (sex (OHM), gender), (race/ethnicity, race).
+There are two unmatched reference descriptors: cardiovascular disease and socio-economic status.
+There are three unmatched candidate descriptors: SBP, DBP, cholesterol.
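To make the bookkeeping in this toy example concrete, here is a small R sketch (an illustration added for this write-up, not code from the CTBench repository). It splits the comma-separated descriptor strings from the example above and applies the precision/recall/F1 formulas that are spelled out just below; the count of three matches is taken from the example, since the actual matching is done by GPT-4o.

```{r}
# Toy example only: the strings and the match count come from the example above.
reference <- "Age, sex (OHM), race/ethnicity, cardiovascular disease, socio-economic status"
candidate <- "age, gender, race, SBP, DBP, cholesterol"

# Split the comma-separated lists into individual descriptors
reference.list <- trimws(strsplit(reference, ",")[[1]])   # 5 reference descriptors
candidate.list <- trimws(strsplit(candidate, ",")[[1]])   # 6 candidate descriptors

matches.len   <- 3  # (Age, age), (sex (OHM), gender), (race/ethnicity, race)
candidate.len <- length(candidate.list) - matches.len     # 3 unmatched candidate descriptors
reference.len <- length(reference.list) - matches.len     # 2 unmatched reference descriptors

precision <- matches.len / (matches.len + candidate.len)  # 3/6 = 0.5
recall    <- matches.len / (matches.len + reference.len)  # 3/5 = 0.6
f1        <- ifelse(precision == 0 | recall == 0, 0,
                    2 * (precision * recall) / (precision + recall))  # about 0.545

c(precision = precision, recall = recall, f1 = f1)
```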
-# Sample Analysis of Response Data

-The goal here is to see if different subgroups of data have different average precision, averge recall, and averge F1 scores.
+This is how precision, recall, and f1 are calculated:
+precision <- matches.len / (matches.len + candidate.len)
+In the example, precision = 3/(3+3) = 0.5
+
+recall <- matches.len / (matches.len + reference.len)
+In the example, recall = 3/5 = 0.6
+
+f1 <- ifelse(precision == 0 | recall == 0, 0, 2 * (precision * recall) / (precision + recall))
+In the example, f1 = 0.5454545
+
+Side note: one could argue that the candidate descriptors are actually better than these metrics suggest, since blood pressure and cholesterol are tests used to determine cardiovascular disease, but what to do about this is an open question.
+
+The goal in this analysis is to see if different subgroups of data have different average precision, average recall, and average F1 scores.
+
+## How do results differ by model on CT-Pub?
+
+We want to summarize the results across the trials, so we calculate the mean and standard error of the per-trial statistics for each model. The standard error is the standard deviation divided by the square root of the total number of observations. The dplyr package is used to calculate the results in an easy way. Check out the dplyr cheatsheet for more information: https://rstudio.github.io/cheatsheets/html/data-transformation.html

-Calculate mean and std of statistics response of a subgroups based on model.
```{r}
-CT_Pub_model_results.df <- CT_Pub.responses.df |> group_by(model)|> summarize(meanPrecision=mean(precision),sePrecision=std.error(precision),meanRecall=mean(recall),seRecall=std.error(recall),meanF1=mean(f1),sef1=std.error(f1))
+CT_Pub_model_results.df <- CT_Pub.responses.df %>% group_by(model) %>% summarize(meanPrecision=mean(precision),sePrecision=std.error(precision),meanRecall=mean(recall),seRecall=std.error(recall),meanF1=mean(f1),sef1=std.error(f1))

-kable(CT_Pub_model_results.df, caption="Table 3: Differences by Model")
+kable(CT_Pub_model_results.df, caption="Table 3: Differences by Model on CT-Pub")
```
+The models are gpt4-omni and llama3-70. The suffix zs means zero-shot and ts means three-shot prompts. Here we can see differences between the results for different models and prompts, but these differences may not always be statistically significant. No significance tests were done in the original CTBench paper.

-Calculate mean and standard error of response measures for different model and trial combinations and display results in a table.
-```{r}
+## How do results differ by model and trial type on CT-Pub?
+Now we calculate the mean and standard error of the response measures for each combination of model and trial type and display the results in a table.
+```{r}
CT_Pub_MT_results.df <- CT_Pub.responses.df |> group_by(model,trial_group)|> summarize(meanPrecision=mean(precision),sePrecision=std.error(precision),
meanRecall=mean(recall),seRecall=std.error(recall),meanF1=mean(f1),sef1=std.error(f1))

-kable(CT_Pub_MT_results.df, caption="Table 4: Differences by Model and Subgroup")
+kable(CT_Pub_MT_results.df, caption="Table 4: Differences by Model and Subgroup on CT-Pub")
```
+Here we can see that predicting descriptors seems to be harder for some combinations of models and trial types, but further analysis is needed to present the results in a more informative way and to see whether the differences are statistically significant.
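One way to take these tables a step further is sketched below; this is a suggestion rather than part of the original template. It assumes the CT_Pub_model_results.df and CT_Pub.responses.df data frames created above and that the ggplot2 package is available (it may need to be added to the setup chunk). The plot shows mean F1 per model with standard-error bars, and pairwise t-tests give a rough, exploratory check of whether the model differences are statistically significant.

```{r}
# Sketch of a follow-up analysis (not part of the original assignment code).
library(ggplot2)  # assumed to be installed; load it in your setup chunk if you prefer

# Mean F1 by model/prompt with +/- 1 standard error bars
ggplot(CT_Pub_model_results.df, aes(x = model, y = meanF1)) +
  geom_col(fill = "steelblue") +
  geom_errorbar(aes(ymin = meanF1 - sef1, ymax = meanF1 + sef1), width = 0.2) +
  labs(title = "Mean F1 by model and prompt on CT-Pub",
       x = "Model / prompt", y = "Mean F1 (+/- 1 SE)") +
  theme_minimal()

# Rough significance check: pairwise t-tests of F1 between models,
# with a Holm correction for multiple comparisons.
pairwise.t.test(CT_Pub.responses.df$f1, CT_Pub.responses.df$model,
                p.adjust.method = "holm")
```

Because every model is evaluated on the same trials, a paired comparison (pairing F1 scores by trial) or a mixed-effects model would be a more faithful design than the unpaired tests above.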
-# Sample Analysis of Matches Data
+# Do we see differences in results for different candidate descriptors?

This is a sample analysis of the matches data frame. The goal is to count the number of trials, for the 'gpt4-omni-zs' model results, in which each reference term is accurately matched. It does this by using tidyverse to find all the matches rows with a value in reference and model="gpt4-omni-zs". Then it groups by reference and computes the number of trials in which the term occurs and the number of trials where the match failed (i.e. how many have NA for candidate). Then it does a mutate to add the accuracy. Then it sorts the descriptors by the number of trials.
```{r}
@@ -368,14 +402,17 @@ nrow(CT_Pub_reference_count.df)

kable(head(CT_Pub_reference_count.df,20),caption="Table 5: Accuracy of Top 20 descriptors in CT_Pub")
```
+Here we see that some descriptors, like age, are matched easily, while others, like Hypertension, may be more challenging; however, the sample sizes can be very low. CT-Pub is really too small for this type of analysis; CT-Repo would yield more definitive results that may show significant differences. Also, different trials may use different terms to represent similar things. For example, hypertension means that a subject's systolic or diastolic blood pressure exceeds desired thresholds, so many trials may measure blood pressure but only a few of them may have a descriptor called hypertension. So some more thought needs to be put into this type of analysis.

# Your Job

Your job is to do a more in-depth analysis of the results of the two models. Each member of your team can focus on a different question or model. You can use any analyses or visualizations in R that you like.

-Some ideas for questions but feel free to make up your own. Try to make up and answer at least two questions. The additional question can be a follow-up to previous questions. Coordinate with your team.
+Some ideas for questions are given below, but feel free to make up your own. Try to make up and answer at least two questions. The additional questions can be follow-ups to previous questions. Coordinate with your team so you each look at different questions.
+
+Here are some ideas for questions to inspire you.
1) Do the LLM models perform differently in terms of precision, recall, and F1 scores?
2) Are prompts for some disease types (e.g. group_types) harder than others? Does this difference hold across different models? Note we will refer to this as a subgroup analysis.
3) How does the performance compare for CT_Pub and CT_Repo?
@@ -390,31 +427,96 @@ Some ideas for questions but feel free to make up your own. Try to make up and a

-## Question 1: Give title here
-Describe the question you are answering. Note this could be one of the ones above or a question of your own choosing.
+## Analysis: Question 1 (Provide short name)
+### Question being asked

+_Provide in natural language a statement of what question you're trying to answer_

+### Data Preparation

+_Provide in natural language a description of the data you are using for this analysis_

+_Include a step-by-step description of how you prepare your data for analysis_

+_If you're re-using dataframes prepared in another section, simply re-state what data you're using_

+```{r, result01_data}
+# Include all data processing code (if necessary), clearly commented

-Describe how you prepared the data.
-```{r}
-# Insert Code Here
```
-Describe the analysis you performed.

-```{r}
-# Insert Code Here
+### Analysis: Methods and results

+_Describe in natural language a statement of the analysis you're trying to do_

+_Provide clearly commented analysis code; include code for tables and figures!_

+```{r, result01_analysis}
+# Include all analysis code, clearly commented
+# If not possible, screen shots are acceptable.
+# If your contributions included things that are not done in an R-notebook,
+# (e.g. researching, writing, and coding in Python), you still need to do
+# this status notebook in R. Describe what you did here and put any products
+# that you created in github. If you are writing online documents (e.g. overleaf
+# or google docs), you can include links to the documents in this notebook
+# instead of actual text.

```
-Show your results as appropriate in tables or graphs
-```{r}
-# Insert Code Here
+### Discussion of results

+_Provide in natural language a clear discussion of your observations._

+## Analysis: Question 2 (Provide short name)

+### Question being asked

+_Provide in natural language a statement of what question you're trying to answer_

+### Data Preparation

+_Provide in natural language a description of the data you are using for this analysis_

+_Include a step-by-step description of how you prepare your data for analysis_

+_If you're re-using dataframes prepared in another section, simply re-state what data you're using_

+```{r, result02_data}
+# Include all data processing code (if necessary), clearly commented

```
-Discuss your results
+### Analysis: Methods and Results

+_Describe in natural language a statement of the analysis you're trying to do_

+_Provide clearly commented analysis code; include code for tables and figures!_

+```{r, result02_analysis}
+# Include all analysis code, clearly commented
+# If not possible, screen shots are acceptable.
+# If your contributions included things that are not done in an R-notebook,
+# (e.g. researching, writing, and coding in Python), you still need to do
+# this status notebook in R. Describe what you did here and put any products
+# that you created in github (documents, jupyter notebooks, etc). If you are writing online documents (e.g. overleaf
+# or google docs), you can include links to the documents in this notebook
+# instead of actual text.

+```

+### Discussion of results

+_Provide in natural language a clear discussion of your observations._

+## Summary and next steps

-## Question 2: Give title here

+_Provide in natural language a clear summary and your proposed next steps._

-Repeat format of question 1 for additional questions.

# When you're done: SAVE, COMMIT and PUSH YOUR CHANGES!
diff --git a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.html b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.html
index e2c4244..cffe38a 100644
--- a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.html
+++ b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.html