Assignment 5 Submission #39

Merged
merged 1 commit into from Nov 12, 2024
221 changes: 221 additions & 0 deletions StudentNotebooks/Assignment05/parks14_assignment05.Rmd
@@ -0,0 +1,221 @@
---
title: "Assignment 5"
author: "Samuel Park"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: yes
  html_document:
    toc: yes
subtitle: "CTBench"
---

## Weekly Work Summary

* RCS ID: parks14
* Project Name: CTEval
* Summary of work since last week

    * Completed translation of the evaluation and benchmarking code from Nafis's CTBenchLLM repo into R. Location: "DAR-CTEval-24/StudentNotebooks/Assignment04/CTBench_LLM_prompt.Rmd"
    * Wrote boilerplate code for the evaluation portion to be implemented into the CTBS Shiny App. Location: "DAR-CTBSApp-24/eval.R"

* List of presentations, papers, or other outputs

    * NA

* Indicate which parts of your described work were done by you or as part of joint efforts

    * My evaluation code in eval.R was written to be integrated into the main code base of app.R for the CTBS Shiny App.


## Personal Contribution

* Translated the single and bulk evaluation and benchmarking functions from the CTBenchLLM repo into R for future use by the team.
* Wrote functions for the evaluation portion of the CTBS Shiny App: helper functions for creating prompts and parsing strings, and, most importantly, the server logic that makes the API call and retrieves the evaluation data.


## Analysis: Translation of Evaluation and Benchmarking Code from Python to R

### Helper functions involved in Evaluation/Benchmarking

```{r}
build_eval_prompt <- function(reference, candidate, qstart) {
  # Define the system message
  system <- "
You are an expert assistant in the medical domain and clinical trial design. You are provided with details of a clinical trial.
Your task is to determine which candidate baseline features match any feature in a reference baseline feature list for that trial.
You need to consider the context and semantics while matching the features.
For each candidate feature:
1. Identify a matching reference feature based on similarity in context and semantics.
2. Remember the matched pair.
3. A reference feature can only be matched to one candidate feature and cannot be further considered for any consecutive matches.
4. If there are multiple possible matches (i.e. one reference feature can be matched to multiple candidate features or vice versa), choose the most contextually similar one.
5. Also keep track of which reference and candidate features remain unmatched.
6. DO NOT provide the code to accomplish this and ONLY respond with the following JSON. Perform the matching yourself.
Once the matching is complete, omitting explanations provide the answer only in the following form:
{\"matched_features\": [[\"<reference feature 1>\" , \"<candidate feature 1>\" ],[\"<reference feature 2>\" , \"<candidate feature 2>\"]],\"remaining_reference_features\": [\"<unmatched reference feature 1>\" ,\"<unmatched reference feature 2>\"],\"remaining_candidate_features\" : [\"<unmatched candidate feature 1>\" ,\"<unmatched candidate feature 2>\"]}
7. Please generate a valid JSON object, ensuring it fits within a single JSON code block, with all keys and values properly quoted and all elements closed. Do not include line breaks within array elements."

  # Start building the question message
  question <- paste("\nHere is the trial information: \n\n", qstart, "\n\n", sep = "")

  # Add the reference features
  question <- paste(question, "Here is the list of reference features: \n\n", sep = "")
  for (i in seq_along(reference)) {
    question <- paste(question, i, ". ", reference[[i]], "\n", sep = "")
  }

  # Add the candidate features
  question <- paste(question, "\nCandidate features: \n\n", sep = "")
  for (i in seq_along(candidate)) {
    question <- paste(question, i, ". ", candidate[[i]], "\n", sep = "")
  }

  return(c(system, question))
}

get_question_from_row <- function(row) {
  # Extract relevant fields from the row
  title                <- row["BriefTitle"]
  brief_summary        <- row["BriefSummary"]
  condition            <- row["Conditions"]
  eligibility_criteria <- row["EligibilityCriteria"]
  intervention         <- row["Interventions"]
  outcome              <- row["PrimaryOutcomes"]

  # Build the question string by concatenating the extracted fields
  question <- ""
  question <- paste(question, "<Title> \n", title, "\n", sep = "")
  question <- paste(question, "<Brief Summary> \n", brief_summary, "\n", sep = "")
  question <- paste(question, "<Condition> \n", condition, "\n", sep = "")
  question <- paste(question, "<Eligibility Criteria> \n", eligibility_criteria, "\n", sep = "")
  question <- paste(question, "<Intervention> \n", intervention, "\n", sep = "")
  question <- paste(question, "<Outcome> \n", outcome, "\n", sep = "")

  return(question)
}

extract_elements <- function(s) {
  # Define the pattern to match text within backticks
  pattern <- "`(.*?)`"

  # Use regmatches and gregexpr to find all matches
  elements <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]

  # Remove the enclosing backticks from the matched elements
  elements <- gsub("`", "", elements)

  return(elements)
}
```
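
To illustrate how these helpers fit together, here is a minimal usage sketch. The trial fields and feature lists below are made-up placeholders, not real CTBench data; the backtick-delimited strings follow the format that `extract_elements` expects.

```{r, eval=FALSE}
# Hypothetical one-row trial record with the columns get_question_from_row() reads
example_row <- c(
  BriefTitle          = "Example Trial of Drug X in Type 2 Diabetes",
  BriefSummary        = "A placeholder summary of the trial.",
  Conditions          = "Type 2 Diabetes",
  EligibilityCriteria = "Adults aged 18-65 with HbA1c between 7% and 10%.",
  Interventions       = "Drug X 10 mg daily vs. placebo",
  PrimaryOutcomes     = "Change in HbA1c at 12 weeks"
)

qstart <- get_question_from_row(example_row)

# Feature lists are stored as backtick-delimited strings
reference <- extract_elements("`Age`, `Sex`, `HbA1c`")
candidate <- extract_elements("`Age (years)`, `BMI`")

prompts <- build_eval_prompt(reference, candidate, qstart)
cat(prompts[1])   # system message for the evaluator LLM
cat(prompts[2])   # question message with trial info and both feature lists
```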

```{r}
# Extracts JSON elements from a JSON-like string
# (fromJSON comes from the jsonlite package)
library(jsonlite)

extract_json <- function(text) {
  # Regular expression to detect JSON objects or arrays, allowing nested structures
  json_pattern <- "\\{(?:[^{}]|(?R))*\\}|\\[(?:[^[\\]]|(?R))*\\]"

  # Extract all matches
  matches <- regmatches(text, gregexpr(json_pattern, text, perl = TRUE))[[1]]

  # Validate JSON strings by attempting to parse
  valid_json <- matches[sapply(matches, function(x) {
    tryCatch({
      fromJSON(x)
      TRUE
    }, error = function(e) FALSE)
  })]

  return(valid_json)
}
```
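
As a quick check of `extract_json`, the sketch below feeds it a made-up evaluator reply that wraps the JSON answer in explanatory text; the matched and unmatched features are placeholders.

```{r, eval=FALSE}
# Hypothetical evaluator reply with surrounding prose around the JSON answer
reply <- 'Here is the matching result:
{"matched_features": [["Age", "Age (years)"]],
 "remaining_reference_features": ["Sex", "HbA1c"],
 "remaining_candidate_features": ["BMI"]}'

json_strings <- extract_json(reply)   # character vector of valid JSON substrings
parsed <- fromJSON(json_strings[1])
parsed$matched_features               # 1 x 2 matrix of matched pairs
```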


```{r}
# Helper function to calculate precision, recall, and F1 scores
match_to_score <- function(matched_pairs, remaining_reference_features, remaining_candidate_features) {
  # Precision: TP / (TP + FP)
  precision <- length(matched_pairs) / (length(matched_pairs) + length(remaining_candidate_features))

  # Recall: TP / (TP + FN)
  recall <- length(matched_pairs) / (length(matched_pairs) + length(remaining_reference_features))

  # F1 score: 2 * (precision * recall) / (precision + recall)
  if (precision == 0 || recall == 0) {
    f1 <- 0
  } else {
    f1 <- 2 * (precision * recall) / (precision + recall)
  }

  # Return a vector with precision, recall, and F1
  return(c(precision, recall, f1))
}
```
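
Continuing the placeholder example above, the counts from a parsed match result feed directly into `match_to_score`; with 1 matched pair, 2 unmatched reference features, and 1 unmatched candidate feature the scores work out as shown in the comments.

```{r, eval=FALSE}
# Placeholder counts: 1 matched pair, 2 unmatched reference, 1 unmatched candidate
matched_pairs                <- list(c("Age", "Age (years)"))
remaining_reference_features <- c("Sex", "HbA1c")
remaining_candidate_features <- c("BMI")

scores <- match_to_score(matched_pairs,
                         remaining_reference_features,
                         remaining_candidate_features)
names(scores) <- c("precision", "recall", "f1")
round(scores, 3)
# precision = 1/2 = 0.5, recall = 1/3 ~ 0.333, F1 = 0.4
```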


### Discussion of results

The code shown above provides a modular set of functions that current and future users of the CTEval project can use to evaluate and compute benchmarking metrics for LLM-generated baseline features.

## Analysis: Evaluation Code for CTBS Shiny App


Given that most of the helper functions are identical to the ones shown above, I have provided only the server-logic portion for the evaluation page of the CTBS Shiny App.

```{r}
library(shiny)

# Run-evaluation button for the Eval page
observeEvent(input$eval, {
  # Look up the selected trial and evaluator model
  row        <- CT_Pub_updated.df[CT_Pub_updated.df$NCTId == input$NCTid, ]
  eval_model <- input$modelEval

  # 1 for GPT prompts, 2 for Llama prompts
  promptChoice <- 1
  if (startsWith(eval_model, "Meta-")) promptChoice <- 2

  qstart         <- get_question_from_row(row)
  reference_list <- extract_elements(row["Paper_BaselineMeasures_Corrected"])

  # Assuming that gen_response will be provided by the zero/triple-shot sections
  candidate_list <- extract_elements(gen_response)

  # Produce the evaluation prompt based on the evaluator LLM choice
  eval_prompts <- build_eval_prompt(reference_list, candidate_list, qstart, promptChoice)
  systemPrompt(eval_prompts[1])
  prompt <- eval_prompts[2]

  # Keep retrying until the evaluator returns JSON that parses successfully
  retry <- TRUE
  while (retry) {
    tryCatch({
      matched_json <- insistent_create_completion(prompt, eval_model)$choices[[1]]$message$content
      json_data    <- extract_json(matched_json)
      temp_df      <- fromJSON(json_data)
      retry        <- FALSE
    },
    error = function(e) {
      print(as.character(e))
    })
  }

  print(temp_df)
})
```
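
The observer above assumes UI inputs with the IDs `NCTid`, `modelEval`, and `eval`. The fragment below is a minimal, hypothetical UI sketch consistent with those IDs; the actual app.R layout and model names may differ.

```{r, eval=FALSE}
library(shiny)

# Hypothetical UI fragment; input IDs match those used in the observer above
evaluationTab <- tabPanel(
  "Evaluation",
  selectInput("NCTid", "Clinical trial (NCTId):", choices = NULL),   # populated server-side
  selectInput("modelEval", "Evaluator model:",
              choices = c("gpt-4o", "Meta-Llama-3-70B-Instruct")),   # placeholder model names
  actionButton("eval", "Run evaluation")
)
```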

### Discussion of results

The function above is intended for use in the Shiny App and is written so that it can be integrated into the app cleanly. Specifically, it defines the server logic that runs when a user presses the evaluation button after running the generation prompts and obtaining the results.

## Summary and next steps

I will be pivoting away from writing pure evaluation and benchmarking code. Instead, I am working with the CTBS Shiny App team to set up an evaluation page so that clients can run generation and evaluate the performance of different LLMs' generations. Additionally, I plan to work with Soumeek to hopefully develop benchmarking metrics that account for hallucination as well.