---
title: 'CTBench Eval Project Notebook:'
author: "Your Name Here"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    toc: true
    number_sections: true
    df_print: paged
  word_document:
    toc: true
  pdf_document: default
subtitle: DAR Assignment 3 (Fall 2024)
---
```{r setup, include=FALSE}
# Setup: load the packages used later in this notebook.
# (Minimal reconstruction of the setup chunk; extend as needed.)
library(dplyr)    # group_by()/summarize() pipelines
library(knitr)    # kable() tables
library(plotrix)  # std.error()
```

The CTBench paper is available on arXiv: https://arxiv.org/abs/2406.17888

## CTBenchEval Goals

The high-level goals of the CTBenchEval project for the semester are to:

1. Imagine you are trying your own LLM approach and want to compare it to the published CTBench results.

* "How can we make the evaluation software provided effective and easy to use?"
* "How can we make CTBench readily extensible to more clinical trials?"

# DAR ASSIGNMENT 3 (Introduction): Introductory DAR CTBench Eval Notebook

This notebook is broken into the following parts:

* **Part 1:** Preparing your local repo for **DAR Assignment 3**
* **Part 2:** Loading and Analyzing the CTBench Eval Datasets
* **Part 3:** Individual analysis of your team's dataset

**NOTE:** The RPI github repository for all the code and data required for this notebook may be found at:

* **Part 4:** Be prepared to discuss your results in your team breakout.

# DAR ASSIGNMENT 3 (Part 1): Preparing your local repo for Assignment 3
**Delete this Section from your submitted notebook**

In this assignment you'll start by making a copy of the Assignment 3 template notebook, then you'll add your original work to your copy. The instructions which follow explain how to accomplish this.

You're now ready to start coding Assignment 3!
* One of the DAR instructors will merge your branch, and your new files will be added to the master branch of the repo. _Do not merge your branch yourself!_

# DAR ASSIGNMENT 3 (Part 2): Loading the CTBench Eval Datasets
**Delete this Section from your submitted notebook. You can reuse this code as needed in the answers to your questions**

In CTBench there are two sources of data: clinical trial data used to generate the prompts, and results data from the LLMs and their evaluation. For your convenience, these datasets have been converted to R Rds files regardless of their original format.

* **Data Rds:**

These are the datasets that describe each clinical trial and give its baseline measures.

* **Results Rds:**

These include the results of the various LLMs on each trial.

* `trials.responses.Rds` contains the results of the LLM for each clinical trial and model combination.
* `trials.matches.Rds` contains the specific matches that the evaluation LLM made between the candidate and reference descriptors for each clinical trial and model.
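
If you want to confirm which Rds files are available, a quick check like the sketch below may help; the directory path is assumed from the `readRDS()` calls used later in this notebook.

```{r}
# List the Rds files in the shared course data directory
# (path assumed from the readRDS() calls later in this notebook)
list.files("/academics/MATP-4910-F24/DAR-CTEval-F24/Data", pattern = "\\.Rds$")
```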

NOTES:

Each of these data structures has one row per trial. The features are given below.
[5] "Conditions" Health Conditions the Trial addresses
[6] "Interventions" Intervention (a.k.a treatments)
[7] "PrimaryOutcomes" Measure of success or failure of trial
[8] "BaselineMeasures" Original Features in ClinicalTrial.gov
[9] "BaselineMeasures_Processed" Cleaned up Features used in CTBench

</table>
[8] "BaselineMeasures" List of original Features in ClinicalTrial.gov
[9] "BaselineMeasures_Processed" List of cleaned up Features used in CTBench
</table>


<table>

Number  Name                                  Notes
------  ------------------------------------  ----------------------------------------------
[5]     "Conditions"                          Health Conditions the Trial addresses
[6]     "Interventions"                       Intervention (a.k.a. treatments)
[7]     "PrimaryOutcomes"                     Measure of success or failure of trial
[8]     "BaselineMeasures"                    List of original Features in ClinicalTrial.gov
[9]     "Paper_BaselineMeasures"              List of original features in trial paper
[10]    "Paper_BaselineMeasures_Processed"    List of cleaned up Features used in trial paper

</table>

* For `CT_Repo.df`, the reference baseline descriptors used in the experiments are in a comma separated list in `BaselineMeasures_Processed`.
* For `CT_Pub.df`, the reference baseline descriptors used in the experiments are in a comma separated list in `Paper_BaselineMeasures_Processed` (see the sketch below).
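
As a quick illustration, the sketch below splits these comma-separated lists into character vectors. It assumes `CT_Pub.df` and `CT_Repo.df` have already been loaded from their Rds files (the loading code is not shown in this excerpt).

```{r}
# Illustrative only: split the comma-separated reference descriptor lists,
# assuming CT_Pub.df and CT_Repo.df have been loaded with readRDS().
pub.reference.lists  <- strsplit(CT_Pub.df$Paper_BaselineMeasures_Processed, ",\\s*")
repo.reference.lists <- strsplit(CT_Repo.df$BaselineMeasures_Processed, ",\\s*")

# Number of reference descriptors per trial in CT_Pub
summary(lengths(pub.reference.lists))
```
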
These are the features in CT_Pub.responses.df.

<table>

Number  Name                         Notes
------  ---------------------------  ---------------------------------------------
[1]     "trial_id"                   Unique trial id - same as NCTID
[2]     "trial_group"                Disease group that the trial addresses
[3]     "model"                      LLM model used
[4]     "gen_response"               Result generated
[5]     "processed_gen_response"     Cleaned up result generated
[6]     "len_matches"                # of matched descriptors
[7]     "len_reference"              # unmatched descriptors from reference
[8]     "len_candidate"              # unmatched descriptors from LLM response
[9]     "precision"                  precision for this trial and model
[10]    "recall"                     recall for this trial and model
[11]    "f1"                         F1 score for this trial and model

</table>

These are the features in CT_Pub.matches.df.
<table>
<caption><span id="tab:table4">Table 4: </span> Features of CT_Pub.matches.df.</caption>

Number  Name             Notes
------  ---------------  -----------------------------------------
[1]     "trial_id"       Unique trial id - same as NCTID
[2]     "model"          LLM model used
[3]     "reference"      matched reference feature (NA if none)
[4]     "candidate"      matched candidate feature (NA if none)

</table>

If the table has an entry such as (trial_id_A, model_id_B, NA, candidate_D), this means that for trial_id A using model B, candidate descriptor D had no match in the reference list.

```{r}
# Load the trials.responses
CT_Pub.responses.df <- readRDS("/academics/MATP-4910-F24/DAR-CTEval-F24/Data/trials.responses.Rds")

# convert model and type to factors
CT_Pub.responses.df$model       <- as.factor(CT_Pub.responses.df$model)
CT_Pub.responses.df$trial_group <- as.factor(CT_Pub.responses.df$trial_group)

# Load the trials.matches (path assumed to parallel trials.responses above)
CT_Pub.matches.df <- readRDS("/academics/MATP-4910-F24/DAR-CTEval-F24/Data/trials.matches.Rds")

dim(CT_Pub.matches.df)
```
The same process can be used for CT_Repo. But be aware that variable names may be slightly different.
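
Since unmatched descriptors appear as `NA` in `CT_Pub.matches.df`, a quick sanity check (a sketch, not part of the original template) is to tally matched and unmatched rows per model:

```{r}
# Sanity check sketch: count matched and unmatched descriptor rows for each model
CT_Pub.matches.df %>%
  group_by(model) %>%
  summarize(n_matched             = sum(!is.na(reference) & !is.na(candidate)),
            n_unmatched_reference = sum(is.na(candidate)),
            n_unmatched_candidate = sum(is.na(reference))) %>%
  kable(caption = "Matched and unmatched descriptor counts by model")
```
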


# Analysis of Response Data

For each clinical trial, the evaluation program (which calls two different LLMs) calculates several measures of how good the candidate descriptors proposed by each LLM are compared to the reference descriptors.

BERTScore is a measure of the semantic similarity between the candidate and reference descriptor lists in the latent space of the BERT language model. It is a quick but very rough way to calculate similarity: 0 means no semantic similarity and 1 means perfect semantic similarity. You can read about BERTScore here: https://arxiv.org/abs/1904.09675

A more accurate evaluation is done by matching each candidate descriptor with at most one reference descriptor. This matching is done using the LLM GPT-4o. Let `len_matches` be the number of candidate descriptors matched with reference descriptors, `len_candidate` the number of unmatched candidate descriptors, and `len_reference` the number of unmatched reference descriptors.

Precision measures the proportion of candidate descriptors that were matched. Recall measures the proportion of the reference descriptors that appear among the candidate descriptors. F1 is a standard measure that combines precision and recall. These calculations have already been done for each trial.
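
Written out with the counts defined above, the metrics are:

$$
\text{precision} = \frac{\text{len\_matches}}{\text{len\_matches} + \text{len\_candidate}}, \qquad
\text{recall} = \frac{\text{len\_matches}}{\text{len\_matches} + \text{len\_reference}}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$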

For example, say reference = "Age, sex (OHM), race/ethnicity, cardiovascular disease, socio-economic status" and candidate = "age, gender, race, SBP, DBP, cholesterol".

* There are three matches: (Age, age), (sex (OHM), gender), (race/ethnicity, race)
* There are two unmatched reference descriptors: cardiovascular disease and socio-economic status.
* There are three unmatched candidate descriptors: SBP, DBP, cholesterol.

This is how precision, recall, and f1 are calculated:

`precision <- len_matches / (len_matches + len_candidate)`
`precision = 0.5`

`recall <- len_matches / (len_matches + len_reference)`
`recall = 0.6`

`f1 <- ifelse(precision == 0 | recall == 0, 0, 2 * (precision * recall) / (precision + recall))`
`f1 = 0.5454545`

For this toy example, the entries in CT_Pub.matches.df would look like this:

<table>
<caption><span id="tab:table5">Table 5: </span> Toy example of CT_Pub.matches.df.</caption>

"trial_id"   "model"         "reference"               "candidate"
-----------  --------------  ------------------------  -------------
NCT1         gpt4-omni-zs    Age                       age
NCT1         gpt4-omni-zs    sex(OHM)                  gender
NCT1         gpt4-omni-zs    race/ethnicity            race
NCT1         gpt4-omni-zs    socio-economic status     NA
NCT1         gpt4-omni-zs    cardiovascular disease    NA
NCT1         gpt4-omni-zs    NA                        SBP
NCT1         gpt4-omni-zs    NA                        DBP
NCT1         gpt4-omni-zs    NA                        cholesterol

</table>
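
To make the bookkeeping concrete, here is a small sketch (not part of the original template) that rebuilds the toy table above as a data frame and recomputes the counts and metrics from it; the data frame `toy.matches.df` is purely illustrative.

```{r}
# Illustrative only: rebuild the toy matches table and recompute the metrics
toy.matches.df <- data.frame(
  trial_id  = rep("NCT1", 8),
  model     = rep("gpt4-omni-zs", 8),
  reference = c("Age", "sex(OHM)", "race/ethnicity",
                "socio-economic status", "cardiovascular disease", NA, NA, NA),
  candidate = c("age", "gender", "race", NA, NA, "SBP", "DBP", "cholesterol")
)

# Count matched and unmatched descriptors
len_matches   <- sum(!is.na(toy.matches.df$reference) & !is.na(toy.matches.df$candidate))
len_reference <- sum(is.na(toy.matches.df$candidate))   # unmatched reference descriptors
len_candidate <- sum(is.na(toy.matches.df$reference))   # unmatched candidate descriptors

# Compute precision, recall, and F1 exactly as described above
precision <- len_matches / (len_matches + len_candidate)
recall    <- len_matches / (len_matches + len_reference)
f1        <- ifelse(precision == 0 | recall == 0, 0,
                    2 * (precision * recall) / (precision + recall))
c(precision = precision, recall = recall, f1 = f1)
```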

**Note:** one could argue that the candidate descriptors are actually better than these metrics suggest, since blood pressure and cholesterol are tests used to determine cardiovascular disease. What to do about this is an open question.

_The goal in this analysis is to see if different subgroups of data have different average precision, average recall, and average F1 scores._

```{r}
# Compute the mean and standard error of precision, recall, and F1 for each model.
# (The grouping and first summarize arguments are assumed to mirror the chunk below.)
CT_Pub_model_results.df <- CT_Pub.responses.df %>%
  group_by(model) %>%
  summarize(meanPrecision=mean(precision),
            sePrecision=std.error(precision),
            meanRecall=mean(recall),
            seRecall=std.error(recall),
            meanF1=mean(f1),sef1=std.error(f1))

kable(CT_Pub_model_results.df, caption="Differences by Model on CT-Pub")
```

Now we calculate the mean and standard error of the response measures for different combinations of model and trial group.

```{r}
# Done using group_by and summarize commands in dplyr
CT_Pub_MT_results.df <- CT_Pub.responses.df %>%
  group_by(model,trial_group) %>%
  summarize(meanPrecision=mean(precision),
            sePrecision=std.error(precision),
            meanRecall=mean(recall),
            seRecall=std.error(recall),
            meanF1=mean(f1),
            sef1=std.error(f1))

kable(CT_Pub_MT_results.df, caption="Differences by Model and Subgroup on CT-Pub")
```

Here we can see that predicting descriptors seems to be harder for some combinations of models and trial types. But further analysis is needed to present the results in a more informative way and see if the differences are statistically significant.
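
One way to make these comparisons easier to read (a sketch, not part of the original template) is to plot the mean F1 with standard-error bars for each model and trial group; this assumes the `ggplot2` package is available in your environment.

```{r}
# Hypothetical visualization: mean F1 with standard-error bars by model and trial group
library(ggplot2)

ggplot(CT_Pub_MT_results.df,
       aes(x = trial_group, y = meanF1, fill = model)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_errorbar(aes(ymin = meanF1 - sef1, ymax = meanF1 + sef1),
                position = position_dodge(width = 0.9), width = 0.2) +
  labs(x = "Trial group", y = "Mean F1", fill = "Model") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
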

```{r}
# Tally each reference descriptor: how often it appears and how often it was matched.
# (The grouping and summarize steps are an assumed reconstruction of this calculation.)
CT_Pub_reference_count.df <- CT_Pub.matches.df %>%
  filter(!is.na(reference)) %>%
  group_by(reference) %>%
  summarize(count = n(), matched = sum(!is.na(candidate)), accuracy = matched / count) %>%
  arrange(desc(count))

nrow(CT_Pub_reference_count.df)

# these are the top 20 most common descriptors.
kable(head(CT_Pub_reference_count.df,20), caption="Accuracy of Top 20 descriptors in CT_Pub")
```

Note that the sample sizes for individual descriptors can be very low; CT-Pub is really too small to do this type of analysis reliably.

# Your Job

Your job is to do a more in-depth analysis of the results of the two models. Each member of your team can focus on a different question or model. You can use any analyses or visualizations in R that you like. We will coach you through this process at the Monday and Wednesday weekly team breakouts.

There are many questions to pursue, and you should feel free to make up your own. _Try to make up and answer at least two questions._ An additional question can be a follow-up to a previous one. Coordinate with your team so you look at different questions.

Here are some ideas for questions to inspire you:

1. Do the LLM models perform differently in terms of precision, recall, and F1 scores, and are these differences statistically significant?
2. Are prompts for some disease types (e.g. trial groups) harder than others? Does this difference hold across different models? Note that we will refer to this as a subgroup analysis.
3. How does the performance compare for CT_Pub and CT_Repo?
4. Can you use multiple regression on the response measures to understand how different factors (e.g. model, trial group, and/or source (Repo or Pub)) affect the results while controlling for the others? Note this can also tell you the significance of any differences. (A starter sketch appears below.)
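
For question 4, a regression along the following lines might be a starting point. This is hypothetical: it assumes you have built a combined data frame `CT_all.responses.df` containing both the CT_Pub and CT_Repo response tables with a `source` column, which is not constructed above.

```{r}
# Hypothetical sketch for a multiple-regression analysis of F1 scores.
# Assumes CT_all.responses.df combines the CT_Pub and CT_Repo response tables
# and has a 'source' factor ("Pub" or "Repo") -- this data frame is not built above.
f1.lm <- lm(f1 ~ model + trial_group + source, data = CT_all.responses.df)
summary(f1.lm)
```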