---
title: 'CTBench Eval Project Notebook:'
author: "Your Name Here"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    toc: true
    number_sections: true
    df_print: paged
  word_document:
    toc: true
  pdf_document: default
subtitle: DAR Assignment 3 (Fall 2024)
---
```{r setup, include=FALSE}
# Setup: load the packages used later in this notebook.
# (Minimal reconstruction of the setup chunk; extend as needed.)
library(dplyr)    # group_by()/summarize() pipelines
library(knitr)    # kable() tables
library(plotrix)  # std.error()
```

The CTBench paper is available on arXiv: https://arxiv.org/abs/2406.17888

## CTBenchEval Goals

The high-level goals of the CTBenchEval project for the semester are to:

1. Imagine you are trying your own LLM approach and want to compare it to the published CTBench results.

* "How can we make the evaluation software provided effective and easy to use?"
* "How can we make CTBench readily extensible to more clinical trials?"

# DAR ASSIGNMENT 3 (Introduction): Introductory DAR CTBench Eval Notebook

This notebook is broken into the following parts:

* **Part 1:** Preparing your local repo for **DAR Assignment 3**
* **Part 2:** Loading and Analyzing the CTBench Eval Datasets
* **Part 3:** Individual analysis of your team's dataset

**NOTE:** The RPI github repository for all the code and data required for this notebook may be found at:

* **Part 4:** Be prepared to discuss your results in your team breakout.

# DAR ASSIGNMENT 3 (Part 1): Preparing your local repo for Assignment 3
**Delete this Section from your submitted notebook**

In this assignment you'll start by making a copy of the Assignment 3 template notebook, then you'll add your original work to your copy. The instructions which follow explain how to accomplish this.

You're now ready to start coding Assignment 3!
* One of the DAR instructors will merge your branch, and your new files will be added to the master branch of the repo. _Do not merge your branch yourself!_

# DAR ASSIGNMENT 3 (Part 2): Loading the CTBench Eval Datasets
**Delete this Section from your submitted notebook. You can reuse this code as needed in the answers to your questions**

In CTBench there are two sources of data: clinical trial data used to generate the prompts, and results data from the LLMs and their evaluation. For your convenience, these datasets have been converted to R Rds files regardless of their original format.

* **Data Rds:**

These are the datasets that describe each clinical trial and give its baseline measures.

* **Results Rds:**

These include the results of the various LLMs on each trial.

* `trials.responses.Rds` contains the results of the LLM for each clinical trial and model combination.
* `trials.matches.Rds` contains the specific matches that the evaluation LLM made between the candidate and reference descriptors for each clinical trial and model.
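
If you want to confirm which Rds files are available, a quick check like the sketch below may help; the directory path is assumed from the `readRDS()` calls used later in this notebook.

```{r}
# List the Rds files in the shared course data directory
# (path assumed from the readRDS() calls later in this notebook)
list.files("/academics/MATP-4910-F24/DAR-CTEval-F24/Data", pattern = "\\.Rds$")
```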

NOTES:

Each of these data structures has one row per trial. The features are given below.
[5] "Conditions" Health Conditions the Trial addresses
[6] "Interventions" Intervention (a.k.a treatments)
[7] "PrimaryOutcomes" Measure of success or failure of trial
[8] "BaselineMeasures" Original Features in ClinicalTrial.gov
[9] "BaselineMeasures_Processed" Cleaned up Features used in CTBench

</table>
[8] "BaselineMeasures" List of original Features in ClinicalTrial.gov
[9] "BaselineMeasures_Processed" List of cleaned up Features used in CTBench
</table>


<table>

Number  Name                                  Notes
------  ------------------------------------  ----------------------------------------------
[5]     "Conditions"                          Health Conditions the Trial addresses
[6]     "Interventions"                       Intervention (a.k.a. treatments)
[7]     "PrimaryOutcomes"                     Measure of success or failure of trial
[8]     "BaselineMeasures"                    List of original Features in ClinicalTrial.gov
[9]     "Paper_BaselineMeasures"              List of original features in trial paper
[10]    "Paper_BaselineMeasures_Processed"    List of cleaned up Features used in trial paper

</table>

* For `CT_Repo.df`, the reference baseline descriptors used in the experiments are in a comma separated list in `BaselineMeasures_Processed`.
* For `CT_Pub.df`, the reference baseline descriptors used in the experiments are in a comma separated list in `Paper_BaselineMeasures_Processed` (see the sketch below).
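
As a quick illustration, the sketch below splits these comma-separated lists into character vectors. It assumes `CT_Pub.df` and `CT_Repo.df` have already been loaded from their Rds files (the loading code is not shown in this excerpt).

```{r}
# Illustrative only: split the comma-separated reference descriptor lists,
# assuming CT_Pub.df and CT_Repo.df have been loaded with readRDS().
pub.reference.lists  <- strsplit(CT_Pub.df$Paper_BaselineMeasures_Processed, ",\\s*")
repo.reference.lists <- strsplit(CT_Repo.df$BaselineMeasures_Processed, ",\\s*")

# Number of reference descriptors per trial in CT_Pub
summary(lengths(pub.reference.lists))
```
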
These are the features in CT_Pub.responses.df.

<table>

Number  Name                         Notes
------  ---------------------------  ---------------------------------------------
[1]     "trial_id"                   Unique trial id - same as NCTID
[2]     "trial_group"                Disease group that the trial addresses
[3]     "model"                      LLM model used
[4]     "gen_response"               Result generated
[5]     "processed_gen_response"     Cleaned up result generated
[6]     "len_matches"                # of matched descriptors
[7]     "len_reference"              # unmatched descriptors from reference
[8]     "len_candidate"              # unmatched descriptors from LLM response
[9]     "precision"                  precision for this trial and model
[10]    "recall"                     recall for this trial and model
[11]    "f1"                         F1 score for this trial and model

</table>

These are the features in CT_Pub.matches.df.
<table>
<caption><span id="tab:table4">Table 4: </span> Features of CT_Pub.matches.df.</caption>

Number  Name             Notes
------  ---------------  -----------------------------------------
[1]     "trial_id"       Unique trial id - same as NCTID
[2]     "model"          LLM model used
[3]     "reference"      matched reference feature (NA if none)
[4]     "candidate"      matched candidate feature (NA if none)

</table>

If the table has an entry such as (trial_id_A, model_id_B, NA, candidate_D), this means that for trial_id A using model B, candidate descriptor D had no match in the reference list.

```{r}
# Load the trials.responses
CT_Pub.responses.df <- readRDS("/academics/MATP-4910-F24/DAR-CTEval-F24/Data/trials.responses.Rds")

# convert model and type to factors
CT_Pub.responses.df$model       <- as.factor(CT_Pub.responses.df$model)
CT_Pub.responses.df$trial_group <- as.factor(CT_Pub.responses.df$trial_group)

# Load the trials.matches (path assumed to parallel trials.responses above)
CT_Pub.matches.df <- readRDS("/academics/MATP-4910-F24/DAR-CTEval-F24/Data/trials.matches.Rds")

dim(CT_Pub.matches.df)
```
The same process can be used for CT_Repo. But be aware that variable names may be slightly different.
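
Since unmatched descriptors appear as `NA` in `CT_Pub.matches.df`, a quick sanity check (a sketch, not part of the original template) is to tally matched and unmatched rows per model:

```{r}
# Sanity check sketch: count matched and unmatched descriptor rows for each model
CT_Pub.matches.df %>%
  group_by(model) %>%
  summarize(n_matched             = sum(!is.na(reference) & !is.na(candidate)),
            n_unmatched_reference = sum(is.na(candidate)),
            n_unmatched_candidate = sum(is.na(reference))) %>%
  kable(caption = "Matched and unmatched descriptor counts by model")
```
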


# Analysis of Response Data

For each clinical trial, the evaluation program (which calls two different LLMs) calculates several measures of how good the candidate descriptors proposed by each LLM are compared to the reference descriptors.

BERTScore is a measure of the semantic similarity between the candidate and reference descriptor lists in the latent space of the BERT language model. It is a quick but very rough way to calculate similarity: 0 means no semantic similarity and 1 means perfect semantic similarity. You can read about BERTScore here: https://arxiv.org/abs/1904.09675

A more accurate evaluation is done by matching each candidate descriptor with at most one reference descriptor. This matching is done using the LLM GPT-4o. Let `len_matches` be the number of candidate descriptors matched with reference descriptors, `len_candidate` the number of unmatched candidate descriptors, and `len_reference` the number of unmatched reference descriptors.

Precision measures the proportion of candidate descriptors that were matched. Recall measures the proportion of the reference descriptors that appear among the candidate descriptors. F1 is a standard measure that combines precision and recall. These calculations have already been done for each trial.
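
Written out with the counts defined above, the metrics are:

$$
\text{precision} = \frac{\text{len\_matches}}{\text{len\_matches} + \text{len\_candidate}}, \qquad
\text{recall} = \frac{\text{len\_matches}}{\text{len\_matches} + \text{len\_reference}}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$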

For example, say reference = "Age, sex (OHM), race/ethnicity, cardiovascular disease, socio-economic status" and candidate = "age, gender, race, SBP, DBP, cholesterol".

* There are three matches: (Age, age), (sex (OHM), gender), (race/ethnicity, race)
* There are two unmatched reference descriptors: cardiovascular disease and socio-economic status.
* There are three unmatched candidate descriptors: SBP, DBP, cholesterol.

This is how precision, recall, and f1 are calculated:

`precision <- len_matches / (len_matches + len_candidate)`
`precision = 0.5`

`recall <- len_matches / (len_matches + len_reference)`
`recall = 0.6`

`f1 <- ifelse(precision == 0 | recall == 0, 0, 2 * (precision * recall) / (precision + recall))`
`f1 = 0.5454545`

For this toy example, the entries in CT_Pub.matches.df would look like this:

<table>
<caption><span id="tab:table5">Table 5: </span> Toy example of CT_Pub.matches.df.</caption>

"trial_id"   "model"         "reference"               "candidate"
-----------  --------------  ------------------------  -------------
NCT1         gpt4-omni-zs    Age                       age
NCT1         gpt4-omni-zs    sex(OHM)                  gender
NCT1         gpt4-omni-zs    race/ethnicity            race
NCT1         gpt4-omni-zs    socio-economic status     NA
NCT1         gpt4-omni-zs    cardiovascular disease    NA
NCT1         gpt4-omni-zs    NA                        SBP
NCT1         gpt4-omni-zs    NA                        DBP
NCT1         gpt4-omni-zs    NA                        cholesterol

</table>
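
To make the bookkeeping concrete, here is a small sketch (not part of the original template) that rebuilds the toy table above as a data frame and recomputes the counts and metrics from it; the data frame `toy.matches.df` is purely illustrative.

```{r}
# Illustrative only: rebuild the toy matches table and recompute the metrics
toy.matches.df <- data.frame(
  trial_id  = rep("NCT1", 8),
  model     = rep("gpt4-omni-zs", 8),
  reference = c("Age", "sex(OHM)", "race/ethnicity",
                "socio-economic status", "cardiovascular disease", NA, NA, NA),
  candidate = c("age", "gender", "race", NA, NA, "SBP", "DBP", "cholesterol")
)

# Count matched and unmatched descriptors
len_matches   <- sum(!is.na(toy.matches.df$reference) & !is.na(toy.matches.df$candidate))
len_reference <- sum(is.na(toy.matches.df$candidate))   # unmatched reference descriptors
len_candidate <- sum(is.na(toy.matches.df$reference))   # unmatched candidate descriptors

# Compute precision, recall, and F1 exactly as described above
precision <- len_matches / (len_matches + len_candidate)
recall    <- len_matches / (len_matches + len_reference)
f1        <- ifelse(precision == 0 | recall == 0, 0,
                    2 * (precision * recall) / (precision + recall))
c(precision = precision, recall = recall, f1 = f1)
```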

**Note:** one could argue that the candidate descriptors are actually better than these metrics suggest, since blood pressure and cholesterol are tests used to determine cardiovascular disease. What to do about this is an open question.

_The goal in this analysis is to see if different subgroups of data have different average precision, average recall, and average F1 scores._

```{r}
# Compute the mean and standard error of precision, recall, and F1 for each model.
# (The grouping and first summarize arguments are assumed to mirror the chunk below.)
CT_Pub_model_results.df <- CT_Pub.responses.df %>%
  group_by(model) %>%
  summarize(meanPrecision=mean(precision),
            sePrecision=std.error(precision),
            meanRecall=mean(recall),
            seRecall=std.error(recall),
            meanF1=mean(f1),sef1=std.error(f1))

kable(CT_Pub_model_results.df, caption="Differences by Model on CT-Pub")
```

Now we calculate the mean and standard error of the response measures for different combinations of model and trial group.

```{r}
# Done using group_by and summarize commands in dplyr
CT_Pub_MT_results.df <- CT_Pub.responses.df %>%
  group_by(model,trial_group) %>%
  summarize(meanPrecision=mean(precision),
            sePrecision=std.error(precision),
            meanRecall=mean(recall),
            seRecall=std.error(recall),
            meanF1=mean(f1),
            sef1=std.error(f1))

kable(CT_Pub_MT_results.df, caption="Differences by Model and Subgroup on CT-Pub")
```

Here we can see that predicting descriptors seems to be harder for some combinations of models and trial types. But further analysis is needed to present the results in a more informative way and see if the differences are statistically significant.
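
One way to make these comparisons easier to read (a sketch, not part of the original template) is to plot the mean F1 with standard-error bars for each model and trial group; this assumes the `ggplot2` package is available in your environment.

```{r}
# Hypothetical visualization: mean F1 with standard-error bars by model and trial group
library(ggplot2)

ggplot(CT_Pub_MT_results.df,
       aes(x = trial_group, y = meanF1, fill = model)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_errorbar(aes(ymin = meanF1 - sef1, ymax = meanF1 + sef1),
                position = position_dodge(width = 0.9), width = 0.2) +
  labs(x = "Trial group", y = "Mean F1", fill = "Model") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
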

```{r}
# Tally each reference descriptor: how often it appears and how often it was matched.
# (The grouping and summarize steps are an assumed reconstruction of this calculation.)
CT_Pub_reference_count.df <- CT_Pub.matches.df %>%
  filter(!is.na(reference)) %>%
  group_by(reference) %>%
  summarize(count = n(), matched = sum(!is.na(candidate)), accuracy = matched / count) %>%
  arrange(desc(count))

nrow(CT_Pub_reference_count.df)

# these are the top 20 most common descriptors.
kable(head(CT_Pub_reference_count.df,20), caption="Accuracy of Top 20 descriptors in CT_Pub")
```

Note that the sample sizes for individual descriptors can be very low; CT-Pub is really too small to do this type of analysis reliably.

# Your Job

Your job is to do a more in-depth analysis of the results of the two models. Each member of your team can focus on a different question or model. You can use any analyses or visualizations in R that you like. We will coach you through this process at the Monday and Wednesday weekly team breakouts.

There are many questions to pursue, and you should feel free to make up your own. _Try to make up and answer at least two questions._ An additional question can be a follow-up to a previous one. Coordinate with your team so you look at different questions.

Here are some ideas for questions to inspire you:

1. Do the LLM models perform differently in terms of precision, recall, and F1 scores, and are these differences statistically significant?
2. Are prompts for some disease types (e.g. trial groups) harder than others? Does this difference hold across different models? Note that we will refer to this as a subgroup analysis.
3. How does the performance compare for CT_Pub and CT_Repo?
4. Can you use multiple regression on the response measures to understand how different factors (e.g. model, trial group, and/or source (Repo or Pub)) affect the results while controlling for the others? Note this can also tell you the significance of any differences. (A starter sketch appears below.)
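
For question 4, a regression along the following lines might be a starting point. This is hypothetical: it assumes you have built a combined data frame `CT_all.responses.df` containing both the CT_Pub and CT_Repo response tables with a `source` column, which is not constructed above.

```{r}
# Hypothetical sketch for a multiple-regression analysis of F1 scores.
# Assumes CT_all.responses.df combines the CT_Pub and CT_Repo response tables
# and has a 'source' factor ("Pub" or "Repo") -- this data frame is not built above.
f1.lm <- lm(f1 ~ model + trial_group + source, data = CT_all.responses.df)
summary(f1.lm)
```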