Assignment 5 Submission #39

Merged
merged 1 commit into from Nov 12, 2024
221 changes: 221 additions & 0 deletions StudentNotebooks/Assignment05/parks14_assignment05.Rmd
@@ -0,0 +1,221 @@
---
title: "Assignment 5"
author: "Samuel Park"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: yes
  html_document:
    toc: yes
subtitle: "CTBench"
---

## Weekly Work Summary

* RCS ID: parks14
* Project Name: CTEval
* Summary of work since last week

    * Completed translation of the evaluation and benchmarking code from Nafis's CTBenchLLM repo into R. Location: "DAR-CTEval-24/StudentNotebooks/Assignment04/CTBench_LLM_prompt.Rmd"
    * Wrote boilerplate code for the evaluation portion to be implemented into the CTBS Shiny App. Location: "DAR-CTBSApp-24/eval.R"

* List of presentations, papers, or other outputs

    * NA

* Indicate which parts of your described work were done by you or as part of joint efforts

    * My evaluation code in eval.R was written to be integrated into the main code base of app.R for the CTBS Shiny App.


## Personal Contribution

* Translated the single and bulk evaluation and benchmarking functions from the CTBenchLLM repo into R for future use by the team.
* Wrote functions for the evaluation portion of the CTBS Shiny App: helper functions for creating prompts and parsing strings, and, most importantly, the server logic that makes the API call and retrieves the evaluation data.


## Analysis: Translation of Evaluation and Benchmarking Code from Python to R

### Helper functions involved in Evaluation/Benchmarking

```{r}
build_eval_prompt <- function(reference, candidate, qstart) {
  # Define the system message
  system <- "
You are an expert assistant in the medical domain and clinical trial design. You are provided with details of a clinical trial.
Your task is to determine which candidate baseline features match any feature in a reference baseline feature list for that trial.
You need to consider the context and semantics while matching the features.
For each candidate feature:
1. Identify a matching reference feature based on similarity in context and semantics.
2. Remember the matched pair.
3. A reference feature can only be matched to one candidate feature and cannot be further considered for any consecutive matches.
4. If there are multiple possible matches (i.e. one reference feature can be matched to multiple candidate features or vice versa), choose the most contextually similar one.
5. Also keep track of which reference and candidate features remain unmatched.
6. DO NOT provide the code to accomplish this and ONLY respond with the following JSON. Perform the matching yourself.
Once the matching is complete, omitting explanations provide the answer only in the following form:
{\"matched_features\": [[\"<reference feature 1>\" , \"<candidate feature 1>\" ],[\"<reference feature 2>\" , \"<candidate feature 2>\"]],\"remaining_reference_features\": [\"<unmatched reference feature 1>\" ,\"<unmatched reference feature 2>\"],\"remaining_candidate_features\" : [\"<unmatched candidate feature 1>\" ,\"<unmatched candidate feature 2>\"]}
7. Please generate a valid JSON object, ensuring it fits within a single JSON code block, with all keys and values properly quoted and all elements closed. Do not include line breaks within array elements."

  # Start building the question message
  question <- paste("\nHere is the trial information: \n\n", qstart, "\n\n", sep = "")

  # Add the reference features
  question <- paste(question, "Here is the list of reference features: \n\n", sep = "")
  for (i in seq_along(reference)) {
    question <- paste(question, i, ". ", reference[[i]], "\n", sep = "")
  }

  # Add the candidate features
  question <- paste(question, "\nCandidate features: \n\n", sep = "")
  for (i in seq_along(candidate)) {
    question <- paste(question, i, ". ", candidate[[i]], "\n", sep = "")
  }

  return(c(system, question))
}

get_question_from_row <- function(row) {
  # Extract relevant fields from the row
  title                <- row["BriefTitle"]
  brief_summary        <- row["BriefSummary"]
  condition            <- row["Conditions"]
  eligibility_criteria <- row["EligibilityCriteria"]
  intervention         <- row["Interventions"]
  outcome              <- row["PrimaryOutcomes"]

  # Build the question string by concatenating the extracted fields
  question <- ""
  question <- paste(question, "<Title> \n", title, "\n", sep = "")
  question <- paste(question, "<Brief Summary> \n", brief_summary, "\n", sep = "")
  question <- paste(question, "<Condition> \n", condition, "\n", sep = "")
  question <- paste(question, "<Eligibility Criteria> \n", eligibility_criteria, "\n", sep = "")
  question <- paste(question, "<Intervention> \n", intervention, "\n", sep = "")
  question <- paste(question, "<Outcome> \n", outcome, "\n", sep = "")

  return(question)
}

extract_elements <- function(s) {
  # Define the pattern to match text within backticks
  pattern <- "`(.*?)`"

  # Use regmatches and gregexpr to find all matches
  elements <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]

  # Remove the enclosing backticks from the matched elements
  elements <- gsub("`", "", elements)

  return(elements)
}
```
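
To illustrate how these helpers fit together, here is a minimal usage sketch. The trial fields and feature lists below are made-up placeholders, not real CTBench data; the backtick-delimited strings follow the format that `extract_elements` expects.

```{r, eval=FALSE}
# Hypothetical one-row trial record with the columns get_question_from_row() reads
example_row <- c(
  BriefTitle          = "Example Trial of Drug X in Type 2 Diabetes",
  BriefSummary        = "A placeholder summary of the trial.",
  Conditions          = "Type 2 Diabetes",
  EligibilityCriteria = "Adults aged 18-65 with HbA1c between 7% and 10%.",
  Interventions       = "Drug X 10 mg daily vs. placebo",
  PrimaryOutcomes     = "Change in HbA1c at 12 weeks"
)

qstart <- get_question_from_row(example_row)

# Feature lists are stored as backtick-delimited strings
reference <- extract_elements("`Age`, `Sex`, `HbA1c`")
candidate <- extract_elements("`Age (years)`, `BMI`")

prompts <- build_eval_prompt(reference, candidate, qstart)
cat(prompts[1])   # system message for the evaluator LLM
cat(prompts[2])   # question message with trial info and both feature lists
```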

```{r}
# Extracts JSON elements from a JSON-like string
# (fromJSON comes from the jsonlite package)
library(jsonlite)

extract_json <- function(text) {
  # Regular expression to detect JSON objects or arrays, allowing nested structures
  json_pattern <- "\\{(?:[^{}]|(?R))*\\}|\\[(?:[^[\\]]|(?R))*\\]"

  # Extract all matches
  matches <- regmatches(text, gregexpr(json_pattern, text, perl = TRUE))[[1]]

  # Validate JSON strings by attempting to parse
  valid_json <- matches[sapply(matches, function(x) {
    tryCatch({
      fromJSON(x)
      TRUE
    }, error = function(e) FALSE)
  })]

  return(valid_json)
}
```
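
As a quick check of `extract_json`, the sketch below feeds it a made-up evaluator reply that wraps the JSON answer in explanatory text; the matched and unmatched features are placeholders.

```{r, eval=FALSE}
# Hypothetical evaluator reply with surrounding prose around the JSON answer
reply <- 'Here is the matching result:
{"matched_features": [["Age", "Age (years)"]],
 "remaining_reference_features": ["Sex", "HbA1c"],
 "remaining_candidate_features": ["BMI"]}'

json_strings <- extract_json(reply)   # character vector of valid JSON substrings
parsed <- fromJSON(json_strings[1])
parsed$matched_features               # 1 x 2 matrix of matched pairs
```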


```{r}
# Helper function to calculate precision, recall, and F1 scores
match_to_score <- function(matched_pairs, remaining_reference_features, remaining_candidate_features) {
  # Precision: TP / (TP + FP)
  precision <- length(matched_pairs) / (length(matched_pairs) + length(remaining_candidate_features))

  # Recall: TP / (TP + FN)
  recall <- length(matched_pairs) / (length(matched_pairs) + length(remaining_reference_features))

  # F1 score: 2 * (precision * recall) / (precision + recall)
  if (precision == 0 || recall == 0) {
    f1 <- 0
  } else {
    f1 <- 2 * (precision * recall) / (precision + recall)
  }

  # Return a vector with precision, recall, and F1
  return(c(precision, recall, f1))
}
```
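
Continuing the placeholder example above, the counts from a parsed match result feed directly into `match_to_score`; with 1 matched pair, 2 unmatched reference features, and 1 unmatched candidate feature the scores work out as shown in the comments.

```{r, eval=FALSE}
# Placeholder counts: 1 matched pair, 2 unmatched reference, 1 unmatched candidate
matched_pairs                <- list(c("Age", "Age (years)"))
remaining_reference_features <- c("Sex", "HbA1c")
remaining_candidate_features <- c("BMI")

scores <- match_to_score(matched_pairs,
                         remaining_reference_features,
                         remaining_candidate_features)
names(scores) <- c("precision", "recall", "f1")
round(scores, 3)
# precision = 1/2 = 0.5, recall = 1/3 ~ 0.333, F1 = 0.4
```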


### Discussion of results

The code shown above provides a modular set of functions that current and future users of the CTEval project can use to evaluate and compute benchmarking metrics for LLM-generated baseline features.

## Analysis: Evaluation Code for CTBS Shiny App


Given that most of the helper functions are identical to the ones shown above, I have provided only the server-logic portion for the evaluation page of the CTBS Shiny App.

```{r}
library(shiny)

# Run-evaluation button for the Eval page
observeEvent(input$eval, {
  # Look up the selected trial and evaluator model
  row        <- CT_Pub_updated.df[CT_Pub_updated.df$NCTId == input$NCTid, ]
  eval_model <- input$modelEval

  # 1 for GPT prompts, 2 for Llama prompts
  promptChoice <- 1
  if (startsWith(eval_model, "Meta-")) promptChoice <- 2

  qstart         <- get_question_from_row(row)
  reference_list <- extract_elements(row["Paper_BaselineMeasures_Corrected"])

  # Assuming that gen_response will be provided by the zero/triple-shot sections
  candidate_list <- extract_elements(gen_response)

  # Produce the evaluation prompt based on the evaluator LLM choice
  eval_prompts <- build_eval_prompt(reference_list, candidate_list, qstart, promptChoice)
  systemPrompt(eval_prompts[1])
  prompt <- eval_prompts[2]

  # Keep retrying until the evaluator returns JSON that parses successfully
  retry <- TRUE
  while (retry) {
    tryCatch({
      matched_json <- insistent_create_completion(prompt, eval_model)$choices[[1]]$message$content
      json_data    <- extract_json(matched_json)
      temp_df      <- fromJSON(json_data)
      retry        <- FALSE
    },
    error = function(e) {
      print(as.character(e))
    })
  }

  print(temp_df)
})
```
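
The observer above assumes UI inputs with the IDs `NCTid`, `modelEval`, and `eval`. The fragment below is a minimal, hypothetical UI sketch consistent with those IDs; the actual app.R layout and model names may differ.

```{r, eval=FALSE}
library(shiny)

# Hypothetical UI fragment; input IDs match those used in the observer above
evaluationTab <- tabPanel(
  "Evaluation",
  selectInput("NCTid", "Clinical trial (NCTId):", choices = NULL),   # populated server-side
  selectInput("modelEval", "Evaluator model:",
              choices = c("gpt-4o", "Meta-Llama-3-70B-Instruct")),   # placeholder model names
  actionButton("eval", "Run evaluation")
)
```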

### Discussion of results

The function above is intended for use in the Shiny App and is written so that it can be integrated into the app cleanly. Specifically, it defines the server logic that runs when a user presses the evaluation button after running the generation prompts and obtaining the results.

## Summary and next steps

I will be pivoting away from writing pure evaluation and benchmarking code. Instead, I am working with the CTBS Shiny App team to set up an evaluation page so that clients can run generation and evaluate the performance of different LLMs' generations. Additionally, I plan to work with Soumeek to hopefully develop benchmarking metrics that account for hallucination as well.