diff --git a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd
index ce93db5..6a2dd88 100644
--- a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd
+++ b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.Rmd
@@ -3,13 +3,13 @@ title: 'CTBench Eval Project Notebook:'
 author: "Your Name Here"
 date: "`r format(Sys.time(), '%d %B %Y')`"
 output:
-  pdf_document: default
-  word_document:
-    toc: true
   html_document:
     toc: true
     number_sections: true
     df_print: paged
+  word_document:
+    toc: true
+  pdf_document: default
 subtitle: DAR Assignment 3 (Fall 2024)
 ---
 ```{r setup, include=FALSE}
@@ -77,7 +77,7 @@ https://arxiv.org/abs/2406.17888
 
 ## CTBenchEval Goals
 
-The high-level goals of CTBenchEval project are to:
+The high-level goals of the CTBenchEval project for the semester are to:
 
 1. Imagine you are trying your own LLM approach and want to compare to the published CTBench results.
@@ -98,12 +98,12 @@ The high-level goals of CTBenchEval project are to:
 * "How can we make the evaluation software provided effective and easy to use?"
 * "How can we make CTBench readily extensible to more clinical trials?"
 
-# DAR ASSIGNMENT 3 (Introduction): Introductory DAR Notebook
+# DAR ASSIGNMENT 3 (Introduction): Introductory DAR CTBench Eval Notebook
 
 This notebook is broken into the following parts:
 
 * **Part 1:** Preparing your local repo for **DAR Assignment 3**
-* **Part 2:** Loading the CTBench Eval Datasets
+* **Part 2:** Loading and Analyzing the CTBench Eval Datasets
 * **Part 3:** Individual analysis of your team's dataset
 
 **NOTE:** The RPI github repository for all the code and data required for this notebook may be found at:
@@ -112,7 +112,8 @@ This notebook is broken into two main parts:
 
 * **Part 4:** Be prepared to discuss your results in your team breakout.
 
-# DAR ASSIGNMENT 3 (Part 1): Preparing your local repo for Assignment 3
+# DAR ASSIGNMENT 3 (Part 1): Preparing your local repo for Assignment 3
+**Delete this section from your submitted notebook.**
 
 In this assignment you'll start by making a copy of the Assignment 3 template notebook, then you'll add to your copy with your original work. The instructions which follow explain how to accomplish this.
@@ -175,8 +176,9 @@ You're now ready to start coding Assignment 3!
 * One of the DAR instructors will merge your branch, and your new files will be added to the master branch of the repo. _Do not merge your branch yourself!_
 
 # DAR ASSIGNMENT 3 (Part 2): Loading the CTBench Eval Datasets
+**Delete this section from your submitted notebook. You can reuse this code as needed in the answers to your questions.**
 
-In this CTBench there are two sorts of data: clinical trial data used to generate the prompts and results data that shows the results. For your conveniences, these dataset have been converted to R Rds files regardless of how they appear originally.
+In CTBench there are two sources of data: clinical trial data used to generate the prompts, and results data produced by the LLMs and their evaluation. For your convenience, these datasets have been converted to R Rds files regardless of their original format.
 
 * **Data Rds:**
 
@@ -192,7 +194,7 @@ These are the datasets that describe each clinical trial and give it's baseline
 These include the results of various LLMs on each trial.
 
 * `trials.responses.Rds` contains the results of the LLM for each clinical trial and model combination.
-  * `trials.matches.Rds` contains the specific matches that the evaluation LLM made between the candidate and references descriptors.
+  * `trials.matches.Rds` contains the specific matches that the evaluation LLM made between the candidate and reference descriptors for each clinical trial and model.
 
 NOTES:
@@ -216,10 +218,9 @@ Each of these datastructure has one row per trial. The features are given below.
  [5] "Conditions"                          Health Conditions the Trial addresses
  [6] "Interventions"                       Intervention (a.k.a treatments)
  [7] "PrimaryOutcomes"                     Measure of success or failure of trial
- [8] "BaselineMeasures"                    Original Features in ClinicalTrial.gov
- [9] "BaselineMeasures_Processed"          Cleaned up Features used in CTBench
-
-
+ [8] "BaselineMeasures"                    List of original Features in ClinicalTrial.gov
+ [9] "BaselineMeasures_Processed"          List of cleaned up Features used in CTBench
+
@@ -233,11 +234,10 @@ Each of these datastructure has one row per trial. The features are given below.
  [5] "Conditions"                          Health Conditions the Trial addresses
  [6] "Interventions"                       Intervention (a.k.a treatments)
  [7] "PrimaryOutcomes"                     Measure of success or failure of trial
- [8] "BaselineMeasures"                    Original Features in ClinicalTrial.gov
- [9] "Paper_BaselineMeasures"              Original Features in trial paper
- [10] "Paper_BaselineMeasures_Processed"   Cleaned up Features used in CTBench
-
-
+ [8] "BaselineMeasures" List of original Features in ClinicalTrial.gov + [9] "Paper_BaselineMeasures" List of original features in trial paper + [10] "Paper_BaselineMeasures_Processed" List of cleaned up Features used in trial paper + * For `CT_Repo.df`, the reference baseline descriptors used in the experiments are in a comma separated list in `BaselineMeasures_Processed`. * For `CT_Pub.df`, the reference baseline decsriptors used in the experiments are in a comma separated list in `Paper_BaselineMeasures_Processed`. @@ -273,17 +273,18 @@ These are the features in CT_Pub_responses.df. Number Name Notes ------ ----------------------------------- ----------------------------------------- - [1] "trial_id" Unique trial id - same as NCTID - [2] "trial_group" trial address what group of disease. - [3] "model" LLM model used - [4] "gen_response" Result generated - [5] "processed_gen_response" Cleaned up result generated - [6] "len_matches" # of matching desriptors - [7] "len_reference" # unmatched descriptors from reference - [8] "len_candidate" # unmatched descriptors from LLM response - [9] "precision" precision for this trial and model -[10] "recall" recall for this trial and model -[11] "f1" F1 score for this trial and model + [1] "trial_id" Unique trial id - same as NCTID + [2] "trial_group" trial address what group of disease. + [3] "model" LLM model used + [4] "gen_response" Result generated + [5] "processed_gen_response" Cleaned up result generated + [6] "len_matches" # of matching desriptors + [7] "len_reference" # unmatched descriptors from reference + [8] "len_candidate" # unmatched descriptors from LLM response + [9] "precision" precision for this trial and model +[10] "recall" recall for this trial and model +[11] "f1" F1 score for this trial and model + These are the features in CT_Pub_matches.df. @@ -291,11 +292,11 @@ These are the features in CT_Pub_matches.df. Table 4: Features of CT_Pub.matches.df. Number Name Notes - ------ ----------------------------------- ----------------------------------------- - [1] "trial_id" Unique trial id - same as NCTID - [2] "model" LLM model used - [3] "reference" matched reference feature (NA if none) - [4] "candidate" matched candidate feature (NA if none) + ------ --------------- ----------------------------------------- + [1] "trial_id" Unique trial id - same as NCTID + [2] "model" LLM model used + [3] "reference" matched reference feature (NA if none) + [4] "candidate" matched candidate feature (NA if none) @@ -311,7 +312,7 @@ If the table has an entry such as trial_id_A model_id_B NA candidate D, this means that for trial_id A using model B, candidate descriptor D had no match in the reference list. ```{r} -# Load the trials.responses +# Load the trials.responses CT_Pub.responses.df<- readRDS("/academics/MATP-4910-F24/DAR-CTEval-F24/Data/trials.responses.Rds") # convert model and type to factors @@ -341,20 +342,20 @@ dim(CT_Pub.matches.df) The same process can be used for CT_Repo. But be aware that variable names may be slightly different. -#Analysis of Response Data +# Analysis of Response Data -For each clinical trial, the evaluation program (which calls two different LLM) calculates several measures of how good the candidate descriptor proposed for each LLM are as compared to the reference descriptors. +For each clinical trial, the evaluation program (which calls two different LLMs) calculates several measures of how good the candidate descriptor proposed for each LLM are as compared to the reference descriptors. 
 
-#Analysis of Response Data
+# Analysis of Response Data
 
-For each clinical trial, the evaluation program (which calls two different LLM) calculates several measures of how good the candidate descriptor proposed for each LLM are as compared to the reference descriptors.
+For each clinical trial, the evaluation program (which calls two different LLMs) calculates several measures of how good the candidate descriptors proposed by each LLM are compared to the reference descriptors.
 
-The Bert score is a measure of similarity of the candidate and reference LLM in the the latent space of the Bert LLM. This is a quick but very rough way to calculate similarity. You can read about Bert Scores here. https://arxiv.org/abs/1904.09675
+The Bert score is a measure of the semantic similarity of the candidate and reference descriptor lists in the latent space of the Bert LLM. This is a quick but very rough way to calculate similarity: 0 is no semantic similarity and 1 is perfect semantic similarity. You can read about Bert scores here: https://arxiv.org/abs/1904.09675
 
 A more accurate evaluation is done by matching each candidate descriptor with at most 1 reference descriptor. This is done using the LLM GPT-4o. Let matches.len = number of candidate descriptors matched with the reference descriptors. Let candidate.len = number of unmatched candidate descriptors and reference.len = number of unmatched reference descriptors.
 
-Precision measures the proportion of candidate descriptors that were matched. Recall measure the proportions of the reference descriptor that were in the candidate descriptors. F1 is a standard measure that combines precision and recall. These calculations have already been done for each trial.
+Precision measures the proportion of candidate descriptors that were matched. Recall measures the proportion of the reference descriptors that were matched by candidate descriptors. F1 is a standard measure that combines precision and recall. These calculations have already been done for each trial.
 
 For example, say reference = "Age, sex (OHM), race/ethnicity, cardiovascular disease, socio-economic status" and candidate = "age, gender, race, SBP, DBP, cholesterol".
 
-* There are three matches: (Age, age), (sex(OHM),gender}, (race/ethnicity,race)
-* There are two unmatched reference descriptors: socio-econmic status and blood pressure.
+* There are three matches: <Age, age>, <sex (OHM), gender>, <race/ethnicity, race>
+* There are two unmatched reference descriptors: socio-economic status and cardiovascular disease.
 * There are three unmatched candidate descriptors: SBP, DBP, cholesterol.
 
 This is how precision, recall, and f1 are calculated:
@@ -374,7 +375,23 @@ This is how precision, recall, and f1 are calculated:
 `f1 <- ifelse(precision == 0 | recall == 0, 0, 2 * (precision * recall) / (precision + recall))`
 `f1 = 0.5454545`
 
-**Note:** one could argue that the candidate descriptors are actually better than measured by these metrics since blood pressure and cholesterol are test used to determine cardiovascular disease. What to do it about this is an open question.
+For this toy example, the entries in CT_Pub.matches.df would look like this (the sketch after the table recomputes the metrics directly from these rows):
+
+
+"trial_id"   "model"        "reference"              "candidate"
+ ---------   -------------  -----------------------  --------------
+ NCT1        gpt4-omni-zs   Age                      age
+ NCT1        gpt4-omni-zs   sex(OHM)                 gender
+ NCT1        gpt4-omni-zs   race/ethnicity           race
+ NCT1        gpt4-omni-zs   socio-economic status    NA
+ NCT1        gpt4-omni-zs   cardiovascular disease   NA
+ NCT1        gpt4-omni-zs   NA                       SBP
+ NCT1        gpt4-omni-zs   NA                       DBP
+ NCT1        gpt4-omni-zs   NA                       cholesterol
+
+Table 5: Toy example of CT_Pub.matches.df.
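+As a quick check on the arithmetic above, the following minimal sketch (the `toy.matches.df` object is illustrative only) rebuilds this toy table as a data frame and recomputes precision, recall, and F1 from it, counting a row as a match when both fields are non-NA:
+
+```{r}
+# Toy version of CT_Pub.matches.df for a single trial and model
+toy.matches.df <- data.frame(
+  trial_id  = "NCT1",
+  model     = "gpt4-omni-zs",
+  reference = c("Age", "sex(OHM)", "race/ethnicity", "socio-economic status",
+                "cardiovascular disease", NA, NA, NA),
+  candidate = c("age", "gender", "race", NA, NA, "SBP", "DBP", "cholesterol")
+)
+
+# Count matched and unmatched rows
+matches.len   <- sum(!is.na(toy.matches.df$reference) & !is.na(toy.matches.df$candidate))
+reference.len <- sum(is.na(toy.matches.df$candidate))  # unmatched reference descriptors
+candidate.len <- sum(is.na(toy.matches.df$reference))  # unmatched candidate descriptors
+
+# Same precision, recall, and F1 calculations as above
+precision <- matches.len / (matches.len + candidate.len)
+recall    <- matches.len / (matches.len + reference.len)
+f1 <- ifelse(precision == 0 | recall == 0, 0, 2 * (precision * recall) / (precision + recall))
+c(precision = precision, recall = recall, f1 = f1)  # 0.5, 0.6, 0.5454545
+```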
+
+**Note:** one could argue that the candidate descriptors are actually better than measured by these metrics, since blood pressure and cholesterol are tests used to determine cardiovascular disease. What to do about this is an open question.
 
 _The goal in this analysis is to see if different subgroups of data have different average precision, average recall, and average F1 scores._
@@ -392,7 +409,7 @@ CT_Pub_model_results.df <- CT_Pub.responses.df %>%
             seRecall=std.error(recall),
             meanF1=mean(f1),sef1=std.error(f1))
 
-kable(CT_Pub_model_results.df, caption="Table 3: Differences by Model on CT-Pub")
+kable(CT_Pub_model_results.df, caption="Differences by Model on CT-Pub")
 ```
@@ -404,7 +421,7 @@ Now we calculate calculate mean and standard error of response measures for diff
 
 ```{r}
-# TODO: Use old pipe
+# Done using the group_by and summarize commands in dplyr
 CT_Pub_MT_results.df <- CT_Pub.responses.df %>% 
   group_by(model,trial_group) %>% 
   summarize(meanPrecision=mean(precision),
@@ -414,7 +431,7 @@ CT_Pub_MT_results.df <- CT_Pub.responses.df %>%
             meanF1=mean(f1),
             sef1=std.error(f1))
 
-kable(CT_Pub_MT_results.df, caption="Table 4: Differences by Model and Subgroup on CT-pub")
+kable(CT_Pub_MT_results.df, caption="Differences by Model and Subgroup on CT-Pub")
 ```
 
 Here we can see that predicting descriptors seems to be harder for some combinations of models and trial types. But further analysis is needed to present the results in a more informative way and see if the differences are statistically significant.
@@ -436,7 +453,7 @@ CT_Pub_reference_count.df <- CT_Pub.matches.df %>%
 nrow(CT_Pub_reference_count.df)
 
 # these are top 20 most common descriptors. 
-kable(head(CT_Pub_reference_count.df,20),caption="Table 5: Accuracy of Top 20 descriptors in CT_Pub")
+kable(head(CT_Pub_reference_count.df,20),caption="Accuracy of Top 20 descriptors in CT_Pub")
 ```
@@ -445,13 +462,13 @@ sample sizes can be very low. CT-Pub is really too small to do this type of a
 
 # Your Job
 
-You job is to do a more in-depth analysis of the results of the two models. Each member of your team can focus on a different question or model. You can use any analyses or visualizations in R that you like.
+Your job is to do a more in-depth analysis of the results of the two models. Each member of your team can focus on a different question or model. You can use any analyses or visualizations in R that you like. We will coach you through this process at the Monday and Wednesday weekly team breakouts.
 
 Here are some ideas for questions to pursue, but feel free to make up your own. _Try to make up and answer at least two questions._ The additional questions can be a follow-up to previous questions. Coordinate with your team so you look at different questions.
 
 Here are some ideas for questions to inspire you:
 
-1. Does the LLM models perform differently in terms of precision, recall, and F1 scores?
+1. Do the LLM models perform differently in terms of precision, recall, and F1 scores, and are these differences statistically significant? (See the sketch after this list for one way to start.)
 2. Are prompts for some disease types (e.g., group_types) harder than others? Does this difference hold across different models? Note we will refer to this as a subgroup analysis.
 3. How does the performance compare for CT_Pub and CT_Repo?
 4. Can you use multiple regression on the response to understand how different factors (e.g. model, group_type, and/or source (Repo or Pub)) affect the results controlling for the others? Note this can let you know the significance of any difference too.
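+
+For example, here is one minimal sketch of a starting point for questions 1 and 4, using the `CT_Pub.responses.df` data frame loaded above. Treat it as an illustration rather than a finished analysis; checking model assumptions and doing follow-up comparisons is up to you.
+
+```{r}
+# One-way ANOVA: does mean F1 differ across the LLM/prompt configurations?
+f1.aov <- aov(f1 ~ model, data = CT_Pub.responses.df)
+summary(f1.aov)
+
+# Multiple regression: model differences in F1 while controlling for trial group
+f1.lm <- lm(f1 ~ model + trial_group, data = CT_Pub.responses.df)
+summary(f1.lm)
+```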
diff --git a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.html b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.html
index 7eef820..0f6432b 100644
--- a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.html
+++ b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.html
[Hunks omitted: knitted HTML output of the notebook, whose changes mirror the .Rmd changes above.]
diff --git a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.pdf b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.pdf
index 73f1a4a..9a98f1e 100644
Binary files a/StudentNotebooks/Assignment03/dar-f24-assignment3-template.pdf and b/StudentNotebooks/Assignment03/dar-f24-assignment3-template.pdf differ