diff --git a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd
index ce93db5..6a2dd88 100644
--- a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd
+++ b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd
@@ -3,13 +3,13 @@ title: 'CTBench Eval Project Notebook:'
 author: "Your Name Here"
 date: "`r format(Sys.time(), '%d %B %Y')`"
 output:
-  pdf_document: default
-  word_document:
-    toc: true
   html_document:
     toc: true
     number_sections: true
     df_print: paged
+  word_document:
+    toc: true
+  pdf_document: default
 subtitle: DAR Assignment 3 (Fall 2024)
 ---
 ```{r setup, include=FALSE}
@@ -77,7 +77,7 @@ https://arxiv.org/abs/2406.17888
 
 ## CTBenchEval Goals
 
-The high-level goals of CTBenchEval project are to:
+The high-level goals of the CTBenchEval project for the semester are to:
 
 1. Imagine you are trying your own LLM approach and want to compare to the published CTBench results.
 
@@ -98,12 +98,12 @@ The high-level goals of CTBenchEval project are to:
 * "How can we make the evaluation software provided effective and easy to use?"
 * "How can we make CTBench readily extensible to more clinical trials?"
 
-# DAR ASSIGNMENT 3 (Introduction): Introductory DAR Notebook
+# DAR ASSIGNMENT 3 (Introduction): Introductory DAR CTBench Eval Notebook
 
 This notebook is broken into these main parts:
 
 * **Part 1:** Preparing your local repo for **DAR Assignment 3**
-* **Part 2:** Loading the CTBench Eval Datasets
+* **Part 2:** Loading and Analyzing the CTBench Eval Datasets
 * **Part 3:** Individual analysis of your team's dataset
 
 **NOTE:** The RPI github repository for all the code and data required for this notebook may be found at:
 
@@ -112,7 +112,8 @@ This notebook is broken into two main parts:
 
 * **Part 4:** Be prepared to discuss your results in your team breakout.
 
-# DAR ASSIGNMENT 3 (Part 1): Preparing your local repo for Assignment 3
+# DAR ASSIGNMENT 3 (Part 1): Preparing your local repo for Assignment 3
+**Delete this Section from your submitted notebook**
 
 In this assignment you'll start by making a copy of the Assignment 3 template notebook, then you'll add to your copy with your original work. The instructions which follow explain how to accomplish this.
 
@@ -175,8 +176,9 @@ You're now ready to start coding Assignment 3!
 * One of the DAR instructors will merge your branch, and your new files will be added to the master branch of the repo. _Do not merge your branch yourself!_
 
 # DAR ASSIGNMENT 3 (Part 2): Loading the CTBench Eval Datasets
+**Delete this Section from your submitted notebook. You can reuse this code as needed in the answers to your questions.**
 
-In this CTBench there are two sorts of data: clinical trial data used to generate the prompts and results data that shows the results. For your conveniences, these dataset have been converted to R Rds files regardless of how they appear originally.
+In CTBench there are two sources of data: the clinical trial data used to generate the prompts, and the results data from the LLMs and their evaluation. For your convenience, these datasets have been converted to R Rds files regardless of their original format.
 
 * **Data Rds:**
 
 These are the datasets that describe each clinical trial and give its baseline features.
 
 * **Results Rds:**
 
 These include the results of various LLMs on each trial.
 
 * `trials.responses.Rds` contains the results of the LLM for each clinical trial and model combination.
- * `trials.matches.Rds` contains the specific matches that the evaluation LLM made between the candidate and references descriptors.
+ * `trials.matches.Rds` contains the specific matches that the evaluation LLM made between the candidate and reference descriptors for each clinical trial and model.
 
 NOTES:
 
@@ -216,10 +218,9 @@ Each of these data structures has one row per trial. The features are given below.
     [5] "Conditions"                 Health Conditions the Trial addresses
     [6] "Interventions"              Intervention (a.k.a. treatments)
     [7] "PrimaryOutcomes"            Measure of success or failure of trial
-    [8] "BaselineMeasures"           Original Features in ClinicalTrial.gov
-    [9] "BaselineMeasures_Processed" Cleaned up Features used in CTBench
-
-
+    [8] "BaselineMeasures"           List of original Features in ClinicalTrial.gov
+    [9] "BaselineMeasures_Processed" List of cleaned up Features used in CTBench
+
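As a quick orientation before the data descriptions below, here is a minimal sketch of loading the two results files named above; the `Data` directory path follows the class convention used later in this notebook.

```r
# Minimal sketch: load the two results Rds files described above
data.dir <- "/academics/MATP-4910-F24/DAR-CTEval-F24/Data"

trials.responses.df <- readRDS(file.path(data.dir, "trials.responses.Rds"))
trials.matches.df   <- readRDS(file.path(data.dir, "trials.matches.Rds"))

# Inspect the structure of each data frame
str(trials.responses.df, max.level = 1)
str(trials.matches.df,   max.level = 1)
```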
In `CT_Pub.df`, which covers trials with published papers, the final features are descriptor lists, two of them drawn from the trial's paper:

| Number | Name | Notes |
|--------|------|-------|
| [8] | "BaselineMeasures" | List of original Features in ClinicalTrial.gov |
| [9] | "Paper_BaselineMeasures" | List of original Features in the trial paper |
| [10] | "Paper_BaselineMeasures_Processed" | List of cleaned up Features from the trial paper used in CTBench |
In `CT_Repo.df`, the reference baseline descriptors used in the experiments are in a comma-separated list in `BaselineMeasures_Processed`.
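A small sketch of working with that comma-separated list follows; the file name `CT_Repo.Rds` is an assumption based on the data-frame naming here, so check the `Data` directory for the actual name.

```r
# Sketch: pull the reference descriptors for one CT_Repo trial
# NOTE: the file name CT_Repo.Rds is an assumption; verify against the Data directory
CT_Repo.df <- readRDS("/academics/MATP-4910-F24/DAR-CTEval-F24/Data/CT_Repo.Rds")

# Split the comma-separated descriptor list of the first trial into a character vector
reference.descriptors <- strsplit(CT_Repo.df$BaselineMeasures_Processed[1], ",")[[1]]
reference.descriptors <- trimws(reference.descriptors)  # drop stray whitespace
head(reference.descriptors)
```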
The features of the matches data structure are given below.

| Number | Name | Notes |
|--------|------|-------|
| [1] | "trial_id" | Unique trial id - same as NCTID |
| [2] | "model" | LLM model used |
| [3] | "reference" | matched reference feature (NA if none) |
| [4] | "candidate" | matched candidate feature (NA if none) |
If the table has an entry such as (trial_A, model_B, NA, candidate_D), this means that for trial A using model B, candidate descriptor D had no match in the reference list.
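For example, here is a sketch (loading `trials.matches.Rds` as above) of pulling out exactly those unmatched candidate descriptors:

```r
# Sketch: candidate descriptors that had no match in the reference list
CT_Pub.matches.df <- readRDS("/academics/MATP-4910-F24/DAR-CTEval-F24/Data/trials.matches.Rds")

unmatched.candidates.df <- subset(CT_Pub.matches.df,
                                  is.na(reference) & !is.na(candidate))
head(unmatched.candidates.df)
```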
```r
# Load the trials.responses
CT_Pub.responses.df <- readRDS("/academics/MATP-4910-F24/DAR-CTEval-F24/Data/trials.responses.Rds")

# convert model and type to factors
# NOTE: trial_group as the "type" column is an assumption; verify with names(CT_Pub.responses.df)
CT_Pub.responses.df$model <- as.factor(CT_Pub.responses.df$model)
CT_Pub.responses.df$trial_group <- as.factor(CT_Pub.responses.df$trial_group)

#head(CT_Pub.responses.df,5)
```
The same process can be used for CT_Repo, but be aware that the variable names may be slightly different.
Analysis of Response Data
For each clinical trial, the evaluation program (which calls two different LLMs) calculates several measures of how good the candidate descriptors proposed by each LLM are as compared to the reference descriptors.
The BERT score is a measure of semantic similarity of the candidate and reference descriptor lists in the latent space of the BERT LLM. This is a quick but very rough way to calculate similarity: 0 is no semantic similarity, and 1 is perfect semantic similarity. You can read about BERT scores here: https://arxiv.org/abs/1904.09675
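For a first look at these scores, a sketch along the following lines works; note that the BERT-score column name used here (`bertscore`) is an assumption, so check `names(CT_Pub.responses.df)` for the actual name.

```r
# Sketch: mean BERT score per model
# NOTE: the column name bertscore is an assumption; verify with names(CT_Pub.responses.df)
library(dplyr)

CT_Pub.responses.df %>%
  group_by(model) %>%
  summarize(meanBert = mean(bertscore, na.rm = TRUE))
```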
A more accurate evaluation is done by matching each candidate descriptor with at most one reference descriptor. This is done using the LLM GPT-4o. Let matches.len = the number of candidate descriptors matched to reference descriptors.
Precision measures the proportion of candidate descriptors that were matched: precision = matches.len / (number of candidate descriptors). Recall measures the proportion of reference descriptors that appear among the candidate descriptors: recall = matches.len / (number of reference descriptors). F1 is a standard measure that combines precision and recall. These calculations have already been done for each trial.
```r
f1 <- ifelse(precision == 0 | recall == 0, 0,
             2 * (precision * recall) / (precision + recall))
## f1 = 0.5454545
```
For this toy example, the entries in CT_Pub.matches.df would look like this:
| trial_id | model | reference | candidate |
|----------|-------|-----------|-----------|
| NCT1 | gpt4-omni-zs | Age | age |
| NCT1 | gpt4-omni-zs | sex(OHM) | gender |
| NCT1 | gpt4-omni-zs | race/ethnicity | race |
| NCT1 | gpt4-omni-zs | socio-economic status | NA |
| NCT1 | gpt4-omni-zs | cardiovascular disease | NA |
| NCT1 | gpt4-omni-zs | NA | SBP |
| NCT1 | gpt4-omni-zs | NA | DBP |
| NCT1 | gpt4-omni-zs | NA | cholesterol |
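As a check on the arithmetic, here is a small sketch that rebuilds this toy table in R and recomputes precision, recall, and F1 (reproducing f1 = 0.5454545).

```r
# Rebuild the toy matches table and recompute the metrics
toy.matches.df <- data.frame(
  trial_id  = "NCT1",
  model     = "gpt4-omni-zs",
  reference = c("Age", "sex(OHM)", "race/ethnicity", "socio-economic status",
                "cardiovascular disease", NA, NA, NA),
  candidate = c("age", "gender", "race", NA, NA, "SBP", "DBP", "cholesterol")
)

matches.len    <- sum(!is.na(toy.matches.df$reference) &
                      !is.na(toy.matches.df$candidate))   # 3 matched pairs
candidates.len <- sum(!is.na(toy.matches.df$candidate))   # 6 candidate descriptors
references.len <- sum(!is.na(toy.matches.df$reference))   # 5 reference descriptors

precision <- matches.len / candidates.len                 # 0.5
recall    <- matches.len / references.len                 # 0.6
f1 <- ifelse(precision == 0 | recall == 0, 0,
             2 * (precision * recall) / (precision + recall))  # 0.5454545
```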
Note: one could argue that the candidate descriptors are actually better than measured by these metrics, since blood pressure and cholesterol are tests used to determine cardiovascular disease. What to do about this is an open question.
The goal in this analysis is to see if different subgroups of data have different average precision, average recall, and average F1 scores.
We want to summarize the results across the trials, so we calculate the mean and standard error of the per-trial statistics for each model.
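The per-model code itself is not shown in this excerpt; a sketch using the same dplyr pattern as the model-by-subgroup code below would be (the name `CT_Pub_M_results.df` is hypothetical):

```r
# Sketch: mean and standard error of the per-trial statistics for each model
library(dplyr)
library(plotrix)   # provides std.error()
library(knitr)

CT_Pub_M_results.df <- CT_Pub.responses.df %>%
  group_by(model) %>%
  summarize(meanPrecision = mean(precision),
            sePrecision   = std.error(precision),
            meanRecall    = mean(recall),
            seRecall      = std.error(recall),
            meanF1        = mean(f1),
            sef1          = std.error(f1))

kable(CT_Pub_M_results.df, caption = "Differences by Model on CT-pub")
```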
| model | meanPrecision | sePrecision | meanRecall | seRecall | meanF1 | sef1 |
|-------|---------------|-------------|------------|----------|--------|------|
| gpt4-omni-ts | 0.4194773 | 0.0165780 | 0.5465613 | 0.0206740 | 0.4519953 | 0.0145372 |
| gpt4-omni-zs | 0.4117923 | 0.0191232 | 0.4988831 | 0.0196749 | 0.4250843 | 0.0150289 |
| llama3-70b-in-ts | 0.4372443 | 0.0146389 | 0.5267929 | 0.0209162 | 0.4550647 | 0.0137912 |
| llama3-70b-in-zs | 0.4694388 | 0.0167251 | 0.5368701 | 0.0222861 | 0.4750490 | 0.0146080 |
Now we calculate the mean and standard error of the response measures for different model and trial type combinations and display the results in a table.
```r
# Done using group_by and summarize commands in dplyr
CT_Pub_MT_results.df <- CT_Pub.responses.df %>%
  group_by(model, trial_group) %>%
  summarize(meanPrecision = mean(precision),
            sePrecision   = std.error(precision),
            meanRecall    = mean(recall),
            seRecall      = std.error(recall),
            meanF1        = mean(f1),
            sef1          = std.error(f1))
## `summarise()` has grouped output by 'model'. You can override using the
## `.groups` argument.

kable(CT_Pub_MT_results.df, caption="Differences by Model and Subgroup on CT-pub")
```
| model | trial_group | meanPrecision | sePrecision | meanRecall | seRecall | meanF1 | sef1 |
|-------|-------------|---------------|-------------|------------|----------|--------|------|
| gpt4-omni-ts | cancer | 0.3376333 | 0.0399497 | 0.5430899 | 0.0392855 | 0.3970881 | 0.0344699 |
| gpt4-omni-ts | chronic kidney disease | 0.4430021 | 0.0303595 | 0.5625353 | 0.0518243 | 0.4789217 | 0.0327797 |
| gpt4-omni-ts | diabetes | 0.4315031 | 0.0261183 | 0.5984520 | 0.0363788 | 0.4815179 | 0.0242987 |
| gpt4-omni-ts | hypertension | 0.4936892 | 0.0421646 | 0.5076353 | 0.0570618 | 0.4708821 | 0.0340076 |
| gpt4-omni-ts | obesity | 0.3882668 | 0.0495100 | 0.4659332 | 0.0487458 | 0.4034206 | 0.0390612 |
| gpt4-omni-zs | cancer | 0.3593896 | 0.0648699 | 0.5177248 | 0.0378228 | 0.3884822 | 0.0416444 |
| gpt4-omni-zs | chronic kidney disease | 0.4498255 | 0.0359125 | 0.4961775 | 0.0422165 | 0.4550535 | 0.0337143 |
| gpt4-omni-zs | diabetes | 0.4398874 | 0.0275178 | 0.5670120 | 0.0350287 | 0.4747453 | 0.0243584 |
| gpt4-omni-zs | hypertension | 0.4110457 | 0.0525773 | 0.4404236 | 0.0524825 | 0.3916399 | 0.0329269 |
| gpt4-omni-zs | obesity | 0.3678517 | 0.0488932 | 0.4016209 | 0.0472745 | 0.3598584 | 0.0359429 |
| llama3-70b-in-ts | cancer | 0.4093769 | 0.0321481 | 0.5666599 | 0.0483330 | 0.4519619 | 0.0284682 |
| llama3-70b-in-ts | chronic kidney disease | 0.4538399 | 0.0367768 | 0.5158242 | 0.0568843 | 0.4591601 | 0.0350896 |
| llama3-70b-in-ts | diabetes | 0.4571022 | 0.0248011 | 0.5708732 | 0.0343273 | 0.4862324 | 0.0235926 |
| llama3-70b-in-ts | hypertension | 0.4983549 | 0.0370043 | 0.4818363 | 0.0599463 | 0.4657995 | 0.0376744 |
| llama3-70b-in-ts | obesity | 0.3603799 | 0.0328818 | 0.4540276 | 0.0437946 | 0.3865056 | 0.0317837 |
| llama3-70b-in-zs | cancer | 0.4138544 | 0.0414752 | 0.6322974 | 0.0492822 | 0.4836421 | 0.0366250 |
| llama3-70b-in-zs | chronic kidney disease | 0.5265988 | 0.0432749 | 0.5701615 | 0.0637368 | 0.5070008 | 0.0362418 |
| llama3-70b-in-zs | diabetes | 0.4925006 | 0.0255036 | 0.5353139 | 0.0350723 | 0.4980289 | 0.0246477 |
| llama3-70b-in-zs | hypertension | 0.5075860 | 0.0488254 | 0.5109607 | 0.0630818 | 0.4757989 | 0.0373848 |
| llama3-70b-in-zs | obesity | 0.3884561 | 0.0340609 | 0.4418456 | 0.0460536 | 0.3914692 | 0.0307577 |
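A quick way to eyeball these model-by-subgroup differences is to plot them; here is a sketch with ggplot2 (assumed available in the class environment) using the summary table above.

```r
# Sketch: mean F1 by model and trial group, with +/- 1 standard error bars
library(ggplot2)

ggplot(CT_Pub_MT_results.df,
       aes(x = trial_group, y = meanF1, fill = model)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_errorbar(aes(ymin = meanF1 - sef1, ymax = meanF1 + sef1),
                position = position_dodge(width = 0.9), width = 0.25) +
  labs(x = "Trial group", y = "Mean F1",
       title = "Mean F1 by model and trial group on CT-pub") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
```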
This is a sample analysis of the matches data frame. The goal is to count, for the 'gpt4-omni-zs' model results, how often each reference descriptor occurs across trials and how often it was matched.

```r
## [1] 843
```
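The code that builds `CT_Pub_reference_count.df` is not shown in this excerpt; one plausible construction, sketched under the assumption that it counts occurrences and matches per reference descriptor, is:

```r
# Sketch (assumed construction): per reference descriptor, count occurrences
# and how often it was matched, for the gpt4-omni-zs model.
# Assumes CT_Pub.matches.df was loaded from trials.matches.Rds as above.
library(dplyr)

CT_Pub_reference_count.df <- CT_Pub.matches.df %>%
  filter(model == "gpt4-omni-zs", !is.na(reference)) %>%
  group_by(reference) %>%
  summarize(n = n(),
            matched = sum(!is.na(candidate)),
            match.rate = matched / n) %>%
  arrange(desc(n))
```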
```r
# these are the top 20 most common descriptors
kable(head(CT_Pub_reference_count.df, 20), caption="Accuracy of Top 20 descriptors in CT_Pub")
```