diff --git a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd
index ce93db5..6a2dd88 100644
--- a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd
+++ b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd
@@ -3,13 +3,13 @@ title: 'CTBench Eval Project Notebook:'
 author: "Your Name Here"
 date: "`r format(Sys.time(), '%d %B %Y')`"
 output:
-  pdf_document: default
-  word_document:
-    toc: true
   html_document:
     toc: true
     number_sections: true
     df_print: paged
+  word_document:
+    toc: true
+  pdf_document: default
 subtitle: DAR Assignment 3 (Fall 2024)
 ---
 ```{r setup, include=FALSE}
@@ -77,7 +77,7 @@ https://arxiv.org/abs/2406.17888
 
 ## CTBenchEval Goals
 
-The high-level goals of CTBenchEval project are to:
+The high-level goals of the CTBenchEval project for the semester are to:
 
 1. Imagine you are trying your own LLM approach and want to compare to the published CTBench results.
 
@@ -98,12 +98,12 @@ The high-level goals of CTBenchEval project are to:
 * "How can we make the evaluation software provided effective and easy to use?"
 * "How can we make CTBench readily extensible to more clinical trials?"
 
-# DAR ASSIGNMENT 3 (Introduction): Introductory DAR Notebook
+# DAR ASSIGNMENT 3 (Introduction): Introductory DAR CTBench Eval Notebook
 
 This notebook is broken into these main parts:
 
 * **Part 1:** Preparing your local repo for **DAR Assignment 3**
-* **Part 2:** Loading the CTBench Eval Datasets
+* **Part 2:** Loading and Analyzing the CTBench Eval Datasets
 * **Part 3:** Individual analysis of your team's dataset
 
 **NOTE:** The RPI github repository for all the code and data required for this notebook may be found at:
 
@@ -112,7 +112,8 @@ This notebook is broken into two main parts:
 
 * **Part 4:** Be prepared to discuss your results in your team breakout.
 
-# DAR ASSIGNMENT 3 (Part 1): Preparing your local repo for Assignment 3
+# DAR ASSIGNMENT 3 (Part 1): Preparing your local repo for Assignment 3
+**Delete this Section from your submitted notebook**
 
 In this assignment you'll start by making a copy of the Assignment 3 template notebook, then you'll add to your copy with your original work. The instructions which follow explain how to accomplish this.
 
@@ -175,8 +176,9 @@ You're now ready to start coding Assignment 3!
 * One of the DAR instructors will merge your branch, and your new files will be added to the master branch of the repo. _Do not merge your branch yourself!_
 
 # DAR ASSIGNMENT 3 (Part 2): Loading the CTBench Eval Datasets
+**Delete this Section from your submitted notebook. You can reuse this code as needed in the answers to your questions.**
 
-In this CTBench there are two sorts of data: clinical trial data used to generate the prompts and results data that shows the results. For your conveniences, these dataset have been converted to R Rds files regardless of how they appear originally.
+In CTBench there are two sources of data: the clinical trial data used to generate the prompts, and the results data from the LLMs and their evaluation. For your convenience, these datasets have been converted to R Rds files regardless of their original format.
 
 * **Data Rds:**
 
 These are the datasets that describe each clinical trial and give its baseline features.
 
 * **Results Rds:**
 
 These include the results of various LLMs on each trial.
 
 * `trials.responses.Rds` contains the results of the LLM for each clinical trial and model combination.
- * `trials.matches.Rds` contains the specific matches that the evaluation LLM made between the candidate and references descriptors.
+ * `trials.matches.Rds` contains the specific matches that the evaluation LLM made between the candidate and reference descriptors for each clinical trial and model.
 
 NOTES:
 
@@ -216,10 +218,9 @@ Each of these data structures has one row per trial. The features are given below.
     [5] "Conditions"                 Health Conditions the Trial addresses
     [6] "Interventions"              Intervention (a.k.a. treatments)
     [7] "PrimaryOutcomes"            Measure of success or failure of trial
-    [8] "BaselineMeasures"           Original Features in ClinicalTrial.gov
-    [9] "BaselineMeasures_Processed" Cleaned up Features used in CTBench
-
-
+    [8] "BaselineMeasures"           List of original Features in ClinicalTrial.gov
+    [9] "BaselineMeasures_Processed" List of cleaned up Features used in CTBench
+
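As a quick orientation before the data descriptions below, here is a minimal sketch of loading the two results files named above; the `Data` directory path follows the class convention used later in this notebook.

```r
# Minimal sketch: load the two results Rds files described above
data.dir <- "/academics/MATP-4910-F24/DAR-CTEval-F24/Data"

trials.responses.df <- readRDS(file.path(data.dir, "trials.responses.Rds"))
trials.matches.df   <- readRDS(file.path(data.dir, "trials.matches.Rds"))

# Inspect the structure of each data frame
str(trials.responses.df, max.level = 1)
str(trials.matches.df,   max.level = 1)
```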
In `CT_Pub.df`, which covers trials with published papers, the final features are descriptor lists, two of them drawn from the trial's paper:

| Number | Name | Notes |
|--------|------|-------|
| [8] | "BaselineMeasures" | List of original Features in ClinicalTrial.gov |
| [9] | "Paper_BaselineMeasures" | List of original Features in the trial paper |
| [10] | "Paper_BaselineMeasures_Processed" | List of cleaned up Features from the trial paper used in CTBench |
In `CT_Repo.df`, the reference baseline descriptors used in the experiments are in a comma-separated list in `BaselineMeasures_Processed`.
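A small sketch of working with that comma-separated list follows; the file name `CT_Repo.Rds` is an assumption based on the data-frame naming here, so check the `Data` directory for the actual name.

```r
# Sketch: pull the reference descriptors for one CT_Repo trial
# NOTE: the file name CT_Repo.Rds is an assumption; verify against the Data directory
CT_Repo.df <- readRDS("/academics/MATP-4910-F24/DAR-CTEval-F24/Data/CT_Repo.Rds")

# Split the comma-separated descriptor list of the first trial into a character vector
reference.descriptors <- strsplit(CT_Repo.df$BaselineMeasures_Processed[1], ",")[[1]]
reference.descriptors <- trimws(reference.descriptors)  # drop stray whitespace
head(reference.descriptors)
```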
The features of the matches data structure are given below.

| Number | Name | Notes |
|--------|------|-------|
| [1] | "trial_id" | Unique trial id - same as NCTID |
| [2] | "model" | LLM model used |
| [3] | "reference" | matched reference feature (NA if none) |
| [4] | "candidate" | matched candidate feature (NA if none) |
If the table has an entry such as (trial_A, model_B, NA, candidate_D), this means that for trial A using model B, candidate descriptor D had no match in the reference list.
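For example, here is a sketch (loading `trials.matches.Rds` as above) of pulling out exactly those unmatched candidate descriptors:

```r
# Sketch: candidate descriptors that had no match in the reference list
CT_Pub.matches.df <- readRDS("/academics/MATP-4910-F24/DAR-CTEval-F24/Data/trials.matches.Rds")

unmatched.candidates.df <- subset(CT_Pub.matches.df,
                                  is.na(reference) & !is.na(candidate))
head(unmatched.candidates.df)
```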
```r
# Load the trials.responses
CT_Pub.responses.df <- readRDS("/academics/MATP-4910-F24/DAR-CTEval-F24/Data/trials.responses.Rds")

# convert model and type to factors
# NOTE: trial_group as the "type" column is an assumption; verify with names(CT_Pub.responses.df)
CT_Pub.responses.df$model <- as.factor(CT_Pub.responses.df$model)
CT_Pub.responses.df$trial_group <- as.factor(CT_Pub.responses.df$trial_group)

#head(CT_Pub.responses.df,5)
```
The same process can be used for CT_Repo, but be aware that the variable names may be slightly different.
Analysis of Response Data
For each clinical trial, the evaluation program (which calls two different LLMs) calculates several measures of how good the candidate descriptors proposed by each LLM are as compared to the reference descriptors.
The BERT score is a measure of semantic similarity of the candidate and reference descriptor lists in the latent space of the BERT LLM. This is a quick but very rough way to calculate similarity: 0 is no semantic similarity, and 1 is perfect semantic similarity. You can read about BERT scores here: https://arxiv.org/abs/1904.09675
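For a first look at these scores, a sketch along the following lines works; note that the BERT-score column name used here (`bertscore`) is an assumption, so check `names(CT_Pub.responses.df)` for the actual name.

```r
# Sketch: mean BERT score per model
# NOTE: the column name bertscore is an assumption; verify with names(CT_Pub.responses.df)
library(dplyr)

CT_Pub.responses.df %>%
  group_by(model) %>%
  summarize(meanBert = mean(bertscore, na.rm = TRUE))
```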
A more accurate evaluation is done by matching each candidate descriptor with at most one reference descriptor. This is done using the LLM GPT-4o. Let matches.len = the number of candidate descriptors matched to reference descriptors.
Precision measures the proportion of candidate descriptors that were matched: precision = matches.len / (number of candidate descriptors). Recall measures the proportion of reference descriptors that appear among the candidate descriptors: recall = matches.len / (number of reference descriptors). F1 is a standard measure that combines precision and recall. These calculations have already been done for each trial.
```r
f1 <- ifelse(precision == 0 | recall == 0, 0,
             2 * (precision * recall) / (precision + recall))
## f1 = 0.5454545
```
For this toy example, the entries in CT_Pub.matches.df would look like this:
| trial_id | model | reference | candidate |
|----------|-------|-----------|-----------|
| NCT1 | gpt4-omni-zs | Age | age |
| NCT1 | gpt4-omni-zs | sex(OHM) | gender |
| NCT1 | gpt4-omni-zs | race/ethnicity | race |
| NCT1 | gpt4-omni-zs | socio-economic status | NA |
| NCT1 | gpt4-omni-zs | cardiovascular disease | NA |
| NCT1 | gpt4-omni-zs | NA | SBP |
| NCT1 | gpt4-omni-zs | NA | DBP |
| NCT1 | gpt4-omni-zs | NA | cholesterol |
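As a check on the arithmetic, here is a small sketch that rebuilds this toy table in R and recomputes precision, recall, and F1 (reproducing f1 = 0.5454545).

```r
# Rebuild the toy matches table and recompute the metrics
toy.matches.df <- data.frame(
  trial_id  = "NCT1",
  model     = "gpt4-omni-zs",
  reference = c("Age", "sex(OHM)", "race/ethnicity", "socio-economic status",
                "cardiovascular disease", NA, NA, NA),
  candidate = c("age", "gender", "race", NA, NA, "SBP", "DBP", "cholesterol")
)

matches.len    <- sum(!is.na(toy.matches.df$reference) &
                      !is.na(toy.matches.df$candidate))   # 3 matched pairs
candidates.len <- sum(!is.na(toy.matches.df$candidate))   # 6 candidate descriptors
references.len <- sum(!is.na(toy.matches.df$reference))   # 5 reference descriptors

precision <- matches.len / candidates.len                 # 0.5
recall    <- matches.len / references.len                 # 0.6
f1 <- ifelse(precision == 0 | recall == 0, 0,
             2 * (precision * recall) / (precision + recall))  # 0.5454545
```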
Note: one could argue that the candidate descriptors are actually better than measured by these metrics, since blood pressure and cholesterol are tests used to determine cardiovascular disease. What to do about this is an open question.
The goal in this analysis is to see if different subgroups of data have different average precision, average recall, and average F1 scores.
We want to summarize the results across the trials, so we calculate the mean and standard error of the per-trial statistics for each model.
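The per-model code itself is not shown in this excerpt; a sketch using the same dplyr pattern as the model-by-subgroup code below would be (the name `CT_Pub_M_results.df` is hypothetical):

```r
# Sketch: mean and standard error of the per-trial statistics for each model
library(dplyr)
library(plotrix)   # provides std.error()
library(knitr)

CT_Pub_M_results.df <- CT_Pub.responses.df %>%
  group_by(model) %>%
  summarize(meanPrecision = mean(precision),
            sePrecision   = std.error(precision),
            meanRecall    = mean(recall),
            seRecall      = std.error(recall),
            meanF1        = mean(f1),
            sef1          = std.error(f1))

kable(CT_Pub_M_results.df, caption = "Differences by Model on CT-pub")
```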
| model | meanPrecision | sePrecision | meanRecall | seRecall | meanF1 | sef1 |
|-------|---------------|-------------|------------|----------|--------|------|
| gpt4-omni-ts | 0.4194773 | 0.0165780 | 0.5465613 | 0.0206740 | 0.4519953 | 0.0145372 |
| gpt4-omni-zs | 0.4117923 | 0.0191232 | 0.4988831 | 0.0196749 | 0.4250843 | 0.0150289 |
| llama3-70b-in-ts | 0.4372443 | 0.0146389 | 0.5267929 | 0.0209162 | 0.4550647 | 0.0137912 |
| llama3-70b-in-zs | 0.4694388 | 0.0167251 | 0.5368701 | 0.0222861 | 0.4750490 | 0.0146080 |
Now we calculate the mean and standard error of the response measures for different model and trial type combinations and display the results in a table.
```r
# Done using group_by and summarize commands in dplyr
CT_Pub_MT_results.df <- CT_Pub.responses.df %>%
  group_by(model, trial_group) %>%
  summarize(meanPrecision = mean(precision),
            sePrecision   = std.error(precision),
            meanRecall    = mean(recall),
            seRecall      = std.error(recall),
            meanF1        = mean(f1),
            sef1          = std.error(f1))
## `summarise()` has grouped output by 'model'. You can override using the
## `.groups` argument.

kable(CT_Pub_MT_results.df, caption="Differences by Model and Subgroup on CT-pub")
```
| model | trial_group | meanPrecision | sePrecision | meanRecall | seRecall | meanF1 | sef1 |
|-------|-------------|---------------|-------------|------------|----------|--------|------|
| gpt4-omni-ts | cancer | 0.3376333 | 0.0399497 | 0.5430899 | 0.0392855 | 0.3970881 | 0.0344699 |
| gpt4-omni-ts | chronic kidney disease | 0.4430021 | 0.0303595 | 0.5625353 | 0.0518243 | 0.4789217 | 0.0327797 |
| gpt4-omni-ts | diabetes | 0.4315031 | 0.0261183 | 0.5984520 | 0.0363788 | 0.4815179 | 0.0242987 |
| gpt4-omni-ts | hypertension | 0.4936892 | 0.0421646 | 0.5076353 | 0.0570618 | 0.4708821 | 0.0340076 |
| gpt4-omni-ts | obesity | 0.3882668 | 0.0495100 | 0.4659332 | 0.0487458 | 0.4034206 | 0.0390612 |
| gpt4-omni-zs | cancer | 0.3593896 | 0.0648699 | 0.5177248 | 0.0378228 | 0.3884822 | 0.0416444 |
| gpt4-omni-zs | chronic kidney disease | 0.4498255 | 0.0359125 | 0.4961775 | 0.0422165 | 0.4550535 | 0.0337143 |
| gpt4-omni-zs | diabetes | 0.4398874 | 0.0275178 | 0.5670120 | 0.0350287 | 0.4747453 | 0.0243584 |
| gpt4-omni-zs | hypertension | 0.4110457 | 0.0525773 | 0.4404236 | 0.0524825 | 0.3916399 | 0.0329269 |
| gpt4-omni-zs | obesity | 0.3678517 | 0.0488932 | 0.4016209 | 0.0472745 | 0.3598584 | 0.0359429 |
| llama3-70b-in-ts | cancer | 0.4093769 | 0.0321481 | 0.5666599 | 0.0483330 | 0.4519619 | 0.0284682 |
| llama3-70b-in-ts | chronic kidney disease | 0.4538399 | 0.0367768 | 0.5158242 | 0.0568843 | 0.4591601 | 0.0350896 |
| llama3-70b-in-ts | diabetes | 0.4571022 | 0.0248011 | 0.5708732 | 0.0343273 | 0.4862324 | 0.0235926 |
| llama3-70b-in-ts | hypertension | 0.4983549 | 0.0370043 | 0.4818363 | 0.0599463 | 0.4657995 | 0.0376744 |
| llama3-70b-in-ts | obesity | 0.3603799 | 0.0328818 | 0.4540276 | 0.0437946 | 0.3865056 | 0.0317837 |
| llama3-70b-in-zs | cancer | 0.4138544 | 0.0414752 | 0.6322974 | 0.0492822 | 0.4836421 | 0.0366250 |
| llama3-70b-in-zs | chronic kidney disease | 0.5265988 | 0.0432749 | 0.5701615 | 0.0637368 | 0.5070008 | 0.0362418 |
| llama3-70b-in-zs | diabetes | 0.4925006 | 0.0255036 | 0.5353139 | 0.0350723 | 0.4980289 | 0.0246477 |
| llama3-70b-in-zs | hypertension | 0.5075860 | 0.0488254 | 0.5109607 | 0.0630818 | 0.4757989 | 0.0373848 |
| llama3-70b-in-zs | obesity | 0.3884561 | 0.0340609 | 0.4418456 | 0.0460536 | 0.3914692 | 0.0307577 |
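A quick way to eyeball these model-by-subgroup differences is to plot them; here is a sketch with ggplot2 (assumed available in the class environment) using the summary table above.

```r
# Sketch: mean F1 by model and trial group, with +/- 1 standard error bars
library(ggplot2)

ggplot(CT_Pub_MT_results.df,
       aes(x = trial_group, y = meanF1, fill = model)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_errorbar(aes(ymin = meanF1 - sef1, ymax = meanF1 + sef1),
                position = position_dodge(width = 0.9), width = 0.25) +
  labs(x = "Trial group", y = "Mean F1",
       title = "Mean F1 by model and trial group on CT-pub") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
```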
This is a sample analysis of the matches data frame. The goal is to count, for the 'gpt4-omni-zs' model results, how often each reference descriptor occurs across trials and how often it was matched.

```r
## [1] 843
```
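The code that builds `CT_Pub_reference_count.df` is not shown in this excerpt; one plausible construction, sketched under the assumption that it counts occurrences and matches per reference descriptor, is:

```r
# Sketch (assumed construction): per reference descriptor, count occurrences
# and how often it was matched, for the gpt4-omni-zs model.
# Assumes CT_Pub.matches.df was loaded from trials.matches.Rds as above.
library(dplyr)

CT_Pub_reference_count.df <- CT_Pub.matches.df %>%
  filter(model == "gpt4-omni-zs", !is.na(reference)) %>%
  group_by(reference) %>%
  summarize(n = n(),
            matched = sum(!is.na(candidate)),
            match.rate = matched / n) %>%
  arrange(desc(n))
```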
```r
# these are the top 20 most common descriptors
kable(head(CT_Pub_reference_count.df, 20), caption="Accuracy of Top 20 descriptors in CT_Pub")
```