
Welcome to the CTSuggest wiki!

Step 1: Trial Specification

The first step of the CTSuggest app serves as a user-friendly starting point for both reviewing existing clinical trials and creating new custom ones. Once a trial is selected, the following fields are populated with detailed information:

  • Title: The official name of the trial.
  • Brief Summary: A concise overview of what the trial is about.
  • Condition: The health condition or disease that the trial addresses.
  • Eligibility Criteria: The criteria for participant inclusion and exclusion.
  • Intervention: Descriptions of the interventions tested.
  • Outcome: The outcome measures used to evaluate the effect of the intervention.

This step is intuitively divided into two sections: Existing Trials and Custom-Made Trials. The fields listed above appear in both sections.

1.1 Existing Trials

Here, users can easily look up and review details of existing clinical trials. The datasets used are CT-Pub and CT-Repo, which are built from trials publicly available on clinicaltrials.gov. The section begins with a dropdown menu where users can select a clinical trial ID (default: NCT00126737); the dropdown includes a search feature for easier access.

To load the data of an existing trial into these fields, users click the "Load an existing trial" button, which loads all relevant information for use by the LLM.

  • Actual Features: Baseline features determined by medical experts that serve as a reference for the LLM.

The Actual Features field only appears for existing trials.
Users should only click "Load an existing trial" and avoid other actions, to prevent confusing the LLM.
If users modify any data, they must click "Update" to store the changes; note that evaluation will not run on modified data.

1.2 Blank Trials (Custom Made)

Users can create a custom trial by manually filling out the fields and clicking "Create blank trial".

This option excludes Actual Features since there are no reference features.
As a result, Step 3 (Evaluation) will be skipped.

Once all details are entered, users click "Update" to save the trial. The LLM will receive this data for the next step.

Users can export:

  • For existing trials: trial ID, trial state, all fields (including Actual Features)
  • For custom trials: trial state, all fields (excluding Actual Features)

Export is available as a JSON file via the "Download trial information as JSON file" button.
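
As an illustration, here is a minimal sketch of what the exported file might contain for an existing trial. The key names are assumptions for illustration, not the app's actual schema.

import json

# Hypothetical export structure; key names are illustrative only.
trial_export = {
    "trial_id": "NCT00126737",          # omitted for custom trials
    "trial_state": "existing",          # or "custom"
    "title": "...",
    "brief_summary": "...",
    "condition": "...",
    "eligibility_criteria": "...",
    "intervention": "...",
    "outcome": "...",
    "actual_features": ["Age", "Sex"],  # only present for existing trials
}

with open("trial_information.json", "w") as f:
    json.dump(trial_export, f, indent=2)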


Step 2: Generation of Baseline Features

In this step, the LLM (default: gpt-4o, with three-shot learning and explanations enabled) receives the trial data and generates candidate baseline features with explanations.

Two learning approaches are used:

  • Zero-shot learning: No prior examples are shown to the LLM.
  • Three-shot learning: The LLM sees three example trials before generating output.

The three-shot examples (defaults: NCT00000620, NCT01483560, NCT04280783) are user-editable and must be unique.
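
A minimal sketch of how that uniqueness requirement could be enforced, assuming the IDs arrive as a plain list (a hypothetical helper, not the app's code):

def validate_three_shot_ids(example_ids):
    """Ensure exactly three unique trial IDs for three-shot learning."""
    if len(example_ids) != 3:
        raise ValueError("Exactly three example trial IDs are required.")
    if len(set(example_ids)) != 3:
        raise ValueError("Three-shot example IDs must be unique.")
    return example_ids

# The app's defaults pass the check:
validate_three_shot_ids(["NCT00000620", "NCT01483560", "NCT04280783"])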

2.1 Report

Clicking "Generate" calls an LLM to produce a report that includes:

  • Trial ID (a custom ID for custom trials)
  • Title
  • Suggested baseline features (with or without explanations)
  • LLM used
  • In-context learning setting
  • Three-shot example IDs (if used)

Download the full report and prompts via "Download generation report as JSON file".

2.2 Options

Users can configure:

  • LLM selection
  • Zero-shot vs. three-shot learning
  • Custom three-shot examples
  • Enable/disable explanations

Once configured, clicking "Update" will regenerate the report with the selected options.

This allows hands-on experimentation with LLMs and in-context setups.
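
To make this concrete, here is a minimal sketch of what a generation call with these options might look like, using the OpenAI Python client purely as an illustration; the function name and option handling are assumptions, not the app's actual integration.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_features(system_prompt, trial_query, model="gpt-4o"):
    """Send the assembled prompt to the chosen LLM and return its raw reply.

    system_prompt is the zero-shot or three-shot prompt from the Prompts
    section below; trial_query is the '##Question' block built from the
    trial fields loaded in Step 1.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": trial_query},
        ],
    )
    return response.choices[0].message.content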


Step 3: Evaluation Using LLM as a Judge

This step is only available for existing trials. Clicking "Run Evaluation" calls an LLM to compare the generated candidate features against the expert-defined reference features. The LLM returns a list of matched features. Because LLMs are prone to hallucination, this list first goes through an algorithmic hallucination-removal process; the performance metrics (Precision, Recall, and F1) are then calculated. The displayed report shows the correctly matched features, the unmatched features from each input list, and the performance metrics, which are calculated as follows:

$$\text{Precision}=\frac{\text{Number of Correct Matches}}{\text{Number of Candidate Features}}$$

$$\text{Recall}=\frac{\text{Number of Correct Matches}}{\text{Number of Reference Features}}$$

$$\text{F}_1=\frac{2\times\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}$$
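
These metrics follow directly from three counts; a minimal sketch mirroring the formulas above:

def compute_metrics(n_matches, n_candidates, n_references):
    """Precision, recall, and F1 from the evaluation counts."""
    precision = n_matches / n_candidates if n_candidates else 0.0
    recall = n_matches / n_references if n_references else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. 6 correct matches out of 10 candidates and 8 reference features:
# precision = 0.6, recall = 0.75, F1 = 0.9 / 1.35 ≈ 0.667
print(compute_metrics(6, 10, 8))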

3.1 Output Format (JSON)

{
  "matched_features": [
    ["<reference feature 1>", "<candidate feature 1>"],
    ["<reference feature 2>", "<candidate feature 2>"]
  ],
  "remaining_reference_features": ["<unmatched reference 1>", "<unmatched reference 2>"],
  "remaining_candidate_features": ["<unmatched candidate 1>", "<unmatched candidate 2>"]
}
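
Because the judge can report pairs that never appeared in either input list, one simple hallucination filter is to keep only matched pairs whose members occur in the original reference and candidate lists. The sketch below illustrates that idea; the app's actual hallucination-removal algorithm is not documented here.

def remove_hallucinations(result, reference_features, candidate_features):
    """Drop matched pairs containing features absent from the input lists.

    Illustrative only; a full implementation would also recompute the
    remaining_* lists. Comparison is case-insensitive for robustness.
    """
    refs = {f.lower() for f in reference_features}
    cands = {f.lower() for f in candidate_features}
    cleaned = [
        [ref, cand]
        for ref, cand in result["matched_features"]
        if ref.lower() in refs and cand.lower() in cands
    ]
    return {**result, "matched_features": cleaned}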

Prompts

Generation Prompts

These prompts are used in both zero-shot and three-shot generation; the only difference is that in three-shot learning, three additional example trials are provided to the LLM.

4.1 CTBench Prompt

You are a helpful assistant with experience in the clinical domain and clinical trial design. 
You’ll be asked queries related to clinical trials. These inquiries will be delineated by a ‘##Question’ heading. 
Inside these queries, expect to find comprehensive details about the clinical trial structured within specific subsections, 
indicated by ‘<>’ tags. These subsections include essential information such as the trial’s title, brief summary, 
condition under study, inclusion and exclusion criteria, intervention, and outcomes.

In answer to this question, return a list of probable baseline features (each feature should be enclosed within a pair of backticks 
and each feature should be separated by commas from other features) of the clinical trial. Baseline features are the set of baseline 
or demographic characteristics that are assessed at baseline and used in the analysis of the primary outcome measure(s) to characterize 
the study population and assess validity. Clinical trial‑related publications typically include a table of baseline features assessed 
by arm or comparison group and for the entire population of participants in the clinical trial.

Do not give any additional explanations or use any tags or headers, only return the list of baseline features.

4.2 CTSuggest Prompt

You are a helpful assistant with expertise in the clinical domain and clinical trial design. You will be asked queries related to clinical trials, each marked by a '##Question' heading.
Within these queries, you will find comprehensive details about a clinical trial, structured within specific subsections denoted by '<>' tags. These subsections include critical information such as:
- **Title**: The official name of the trial.
- **Brief Summary**: A concise description of the trial’s objectives and methods.
- **Condition**: The medical condition or disease under study.
- **Inclusion and Exclusion Criteria**: Eligibility requirements for participants.
- **Intervention**: The treatment, procedure, or action being studied.
- **Outcomes**: The measures used to evaluate the effect of the intervention.
Your task is to provide a list of probable baseline features of the clinical trial. Baseline features are demographic or clinical characteristics assessed at baseline that are used to analyze the primary outcome measures, characterize the study population, and validate findings. Examples include age, sex, BMI, blood pressure, disease severity, smoking status, and medication history.
Respond only with the list of baseline features in the format:  
{feature 1, feature 2, feature 3, ...}.  
**Guidelines**:
1. **Avoid Explanations**: Do not provide any additional text, explanations, or context.
2. **No Tags or Headers**: Do not include any tags, headings, or formatting other than the list itself.
3. **Avoid Repetition**: Ensure each baseline feature appears only once in the list.

The CTSuggest prompt organizes the requirements into structured critical information and explicit guidelines, and shows exactly what the output should look like. The CTBench prompt asks for features enclosed in backticks and separated by commas, whereas the CTSuggest prompt does not. The primary reason is that, when translating the benchmark into the app, the in-app evaluator did not parse the backtick format reliably, so backticks were left out for now; supporting them remains a future goal for the app. Moreover, the CTSuggest prompt has a cleaner structure that focuses the model on the core task rather than on formatting, and its examples align the model's output with the user's expectations. It is concise and reduces noise, so the model produces a clean, machine-readable list.
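
The parsing difference is easy to see in code. A minimal sketch of how each output style could be parsed (illustrative, not the app's actual parser):

import re

def parse_ctbench_output(text):
    """CTBench style: each feature enclosed in backticks, comma-separated."""
    return re.findall(r"`([^`]+)`", text)

def parse_ctsuggest_output(text):
    """CTSuggest style: a single brace-enclosed, comma-separated list."""
    inner = text.strip().strip("{}")
    return [f.strip() for f in inner.split(",") if f.strip()]

print(parse_ctbench_output("`age`, `sex`, `BMI`"))   # ['age', 'sex', 'BMI']
print(parse_ctsuggest_output("{age, sex, BMI}"))     # ['age', 'sex', 'BMI']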

Three-Shot Prompt Template

 "You are a helpful assistant with expertise in the clinical domain and clinical trial design. You will be asked queries related to clinical trials, each marked by a '##Question' heading.
Within these queries, you will find comprehensive details about a clinical trial, structured within specific subsections denoted by '<>' tags. These subsections include critical information such as:
- **Title**: The official name of the trial.
- **Brief Summary**: A concise description of the trial’s objectives and methods.
- **Condition**: The medical condition or disease under study.
- **Inclusion and Exclusion Criteria**: Eligibility requirements for participants.
- **Intervention**: The treatment, procedure, or action being studied.
- **Outcomes**: The measures used to evaluate the effect of the intervention.
Your task is to provide a list of probable baseline features of the clinical trial. Baseline features are demographic or clinical characteristics assessed at baseline that are used to analyze the primary outcome measures, characterize the study population, and validate findings. Examples include age, sex, BMI, blood pressure, disease severity, smoking status, and medication history.
Respond only with the list of baseline features in the format:  
{feature 1, feature 2, feature 3, ...}
You will be given three examples for reference. Follow the same pattern for your responses.
**Guidelines**:
1. **Avoid Explanations**: Do not provide any additional text, explanations, or context.
2. **No Tags or Headers**: Do not include any tags, headings, or formatting other than the list itself. 
3. **Avoid Repetition**: Ensure each baseline feature appears only once in the list.
---
### **Example 1**
##Question:  
**Title**: <Insert trial title>  
**Brief Summary**: <Insert trial summary>  
**Condition**: <Insert condition>  
**Inclusion and Exclusion Criteria**: <Insert criteria>  
**Intervention**: <Insert intervention>  
**Outcomes**: <Insert outcomes>
##Answer:  
{feature 1, feature 2, feature 3, feature 4, feature 5, ...}
---
### **Example 2**
##Question:  
**Title**: <Insert trial title>  
**Brief Summary**: <Insert trial summary>  
**Condition**: <Insert condition>  
**Inclusion and Exclusion Criteria**: <Insert criteria>  
**Intervention**: <Insert intervention>  
**Outcomes**: <Insert outcomes>
##Answer:  
{feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, ...}
---
### **Example 3**
##Question:  
**Title**: <Insert trial title>  
**Brief Summary**: <Insert trial summary>  
**Condition**: <Insert condition>  
**Inclusion and Exclusion Criteria**: <Insert criteria>  
**Intervention**: <Insert intervention>  
**Outcomes**: <Insert outcomes>
##Answer:  
{feature 1, feature 2, feature 3, feature 4, ...}
---
Now evaluate the next query provided under the '##Question' heading and respond in the same format as the examples.
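
As an illustration, the placeholders in this template could be filled programmatically from the three example trials; the field names below are assumptions, not the app's internal representation.

EXAMPLE_BLOCK = """### **Example {n}**
##Question:
**Title**: {title}
**Brief Summary**: {summary}
**Condition**: {condition}
**Inclusion and Exclusion Criteria**: {criteria}
**Intervention**: {intervention}
**Outcomes**: {outcomes}
##Answer:
{answer}"""

def build_example_blocks(example_trials):
    """Render one block per example trial, assuming each trial is a dict
    with the fields referenced above plus 'answer', its known baseline
    features formatted as a brace-enclosed list."""
    return "\n---\n".join(
        EXAMPLE_BLOCK.format(n=i + 1, **trial)
        for i, trial in enumerate(example_trials)
    )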

4.3 Explanation Prompt

This is the prompt used to generate an explanation for each baseline feature suggested by the LLM.

Provide detailed explanations for each selected baseline feature, linking them directly to the study’s objectives or hypotheses. For instance, explain how age and gender are related to the condition under study, supported by data or literature indicating their relevance. Also, note any statistical models or analyses that demonstrate the impact of these demographics on the study's outcomes.

Evaluation Prompt

This is the prompt that the LLM-as-a-judge receives; it matches the suggested baseline features, which serve as the candidate features, against the actual reference features.

4.4 CTBench Prompt

You are an expert assistant in the medical domain and clinical trial design. You are provided with details of a clinical trial.
Your task is to determine which candidate baseline features match any feature in a reference baseline feature list for that trial. 
You need to consider the context and semantics while matching the features.

For each candidate feature:

    1. Identify a matching reference feature based on similarity in context and semantics.
    2. Remember the matched pair.
    3. A reference feature can only be matched to one candidate feature and cannot be further considered for any consecutive matches.
    4. If there are multiple possible matches (i.e. one reference feature can be matched to multiple candidate features or vice versa), choose the most contextually similar one.
    5. Also keep track of which reference and candidate features remain unmatched.

Once the matching is complete, provide the results in a JSON format as follows:
{
  "matched_features": [
    ["<reference feature 1>", "<candidate feature 1>"],
    ["<reference feature 2>", "<candidate feature 2>"]
  ],
  "remaining_reference_features": [
    "<unmatched reference feature 1>",
    "<unmatched reference feature 2>"
  ],
  "remaining_candidate_features": [
    "<unmatched candidate feature 1>",
    "<unmatched candidate feature 2>"
  ]
}

4.5 CTSuggest Prompt

You are an expert assistant in the medical domain and clinical trial design. You are provided with details of a clinical trial.
Your task is to determine which candidate baseline features match any feature in a reference baseline feature list for that trial. 
You need to consider the context and semantics while matching the features.

For each candidate feature:

    1. Identify a matching reference feature based on similarity in context and semantics.
    2. Remember the matched pair.
    3. A reference feature can only be matched to one candidate feature and cannot be further considered for any consecutive matches.
    4. If there are multiple possible matches (i.e. one reference feature can be matched to multiple candidate features or vice versa), choose the most contextually similar one.
    5. Also keep track of which reference and candidate features remain unmatched.
    6. DO NOT provide the code to accomplish this and ONLY respond with the following JSON. Perform the matching yourself.

Once the matching is complete, omitting explanations provide the answer only in the following form:
{"matched_features":[["<reference feature 1>","<candidate feature 1>"],["<reference feature 2>","<candidate feature 2>"]],"remaining_reference_features":["<unmatched reference feature 1>","<unmatched reference feature 2>"],"remaining_candidate_features":["<unmatched candidate feature 1>","<unmatched candidate feature 2>"]}

7. Please generate a valid JSON object, ensuring it fits within a single JSON code block, with all keys and values properly quoted and all elements closed. Do not include line breaks within array elements.

The key difference here is that the second prompt has stricter formatting rules: the extra constraints, specifically rules 6 and 7, force the evaluator LLM to stick to a precise output format, reducing ambiguity and ensuring the JSON is valid. The second prompt also focuses the response by instructing the model not to provide code or additional explanations, which keeps the evaluator LLM concentrated on the matching itself, eventually producing more matches and better scores. The prompt further specifies that the JSON must fit within a single code block, with all keys and values properly quoted and no line breaks inside array elements, which helps the evaluator LLM avoid confusion and minimizes formatting errors.
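
In practice, the judge's reply may still arrive wrapped in a code block, so a defensive parser helps. A minimal sketch of extracting and validating the JSON (the error handling here is an assumption, not the app's actual code):

import json
import re

def extract_evaluation_json(reply):
    """Pull the JSON object out of the judge's reply and check its keys."""
    # Strip an optional ```json ... ``` fence around the payload.
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", reply, re.DOTALL)
    payload = fenced.group(1) if fenced else reply.strip()
    result = json.loads(payload)  # raises ValueError on malformed JSON
    for key in ("matched_features",
                "remaining_reference_features",
                "remaining_candidate_features"):
        if key not in result:
            raise KeyError(f"Missing expected key: {key}")
    return result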