# Towards a Progression-Aware Autonomous Dialogue Agent
Official implementation for our paper: [Towards a Progression-Aware Autonomous Dialogue Agent](https://arxiv.org/abs/2205.03692).
To run the chat interface and interact with pre-trained models, see [Run Chat Interface](#run-chat-interface).
To reproduce results from the paper, see [Reproduce Paper Results](#reproduce-paper-results).
## Overview
We propose a framework in which dialogue agents can evaluate the progression of a conversation toward or away from desired outcomes, and use this signal to inform planning for subsequent responses. Our framework is composed of three key elements: (1) the notion of a "global" dialogue state (GDS) space, (2) a task-specific progression function (PF) computed in terms of a conversation's trajectory through this space, and (3) a planning mechanism based on dialogue rollouts (Lewis et al., 2017) by which an agent may use progression signals to select its next response. See [our paper](https://arxiv.org/abs/2205.03692) for more details.
![Architecture](images/architecture.png)
This repository contains everything needed to reproduce the results on the donation solicitation task of [Persuasion For Good](https://gitlab.com/ucdavisnlp/persuasionforgood) (Wang et al., 2019) as reported in our paper. Additionally, scripts are provided to train progression models on new datasets.
## Installation
### Dependencies
Python 3.8 or greater is required. To aid reproducibility, all dependency versions are pinned at (or near) those used in the paper, so we recommend installing into a new environment (see the example below). If you wish to upgrade dependencies, note that some library versions are pinned to remain compatible with our existing trained progression models (see comments in [requirements.txt](requirements.txt)): model saving relies on serialization, so compatible library versions are required when the models are deserialized on loading. If you wish to upgrade these libraries anyway, the models can easily be re-trained on the new versions using [the provided scripts](#train-and-evaluate-progression-models).
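For example, a fresh environment can be created with Python's built-in `venv` module (any environment manager, such as conda, works equally well):
```bash
# Create and activate an isolated environment for this repository
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
```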
If PyTorch is not already installed in your environment, please install the configuration of PyTorch appropriate for your environment (OS, CUDA version) before proceeding - see [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/).
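As an illustration only (the exact command depends on your platform, so copy the command generated on the PyTorch site), an install for a CUDA 11.8 system might look like:
```bash
# Example PyTorch install for CUDA 11.8; adjust per pytorch.org for your setup
pip install torch --index-url https://download.pytorch.org/whl/cu118
```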
To install dependencies, run:
```bash
pip install -r requirements.txt
```
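To sanity-check the installation and confirm whether PyTorch can see a GPU, you can run, for example:
```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```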
## Interactive Demos
### Run Chat Interface
We provide an interactive streamlit chat interface that can be used to converse with our fine-tuned DialoGPT model on the Persuasion For Good solicitation task. Rollouts can be dynamically throttled using the provided controls, and GDS / PF plots are displayed in real time during the chat.
To launch the streamlit chat interface, run:
```bash
streamlit run dialogue-progression/demo_streamlit.py
```
![Streamlit Chat Interface](images/chat.png)
By default, the [LACAI/DialoGPT-large-PFG](https://huggingface.co/LACAI/DialoGPT-large-PFG) checkpoint is used for the response generator along with the trained `supervised_large_adapted` progression model used for self-play evaluations in the paper. It is possible to use other response generator checkpoints: for example, we provide a smaller DialoGPT checkpoint, [LACAI/DialoGPT-small-PFG](https://huggingface.co/LACAI/DialoGPT-small-PFG). It is also possible to use other trained progression models and to fix a generation seed.
To launch the streamlit chat interface using alternate response and progression models:
```bash
streamlit run dialogue-progression/demo_streamlit.py -- \
    --agent-modelpath=LACAI/DialoGPT-small-PFG \
    --progression-modelpath=models/progression/persuasion_for_good/unsupervised \
    --random-state=42
```
We have released the following response generator models which can be used with `--agent-modelpath`:

| HuggingFace link |
|------------------|
| [LACAI/DialoGPT-large-PFG](https://huggingface.co/LACAI/DialoGPT-large-PFG) |
| [LACAI/DialoGPT-small-PFG](https://huggingface.co/LACAI/DialoGPT-small-PFG) |
We have released the following progression models which can be used with `--progression-modelpath`: *

| Root model link (use this) | HuggingFace link |
|----------------------------|------------------|
| [unsupervised](models/progression/persuasion_for_good/unsupervised) | N/A |
| [supervised_base](models/progression/persuasion_for_good/supervised_base) | [LACAI/roberta-base-PFG-progression](https://huggingface.co/LACAI/roberta-base-PFG-progression) |
| [supervised_large](models/progression/persuasion_for_good/supervised_large) | [LACAI/roberta-large-PFG-progression](https://huggingface.co/LACAI/roberta-large-PFG-progression) |
| [supervised_large_adapted](models/progression/persuasion_for_good/supervised_large_adapted) | [LACAI/roberta-large-adapted-PFG-progression](https://huggingface.co/LACAI/roberta-large-adapted-PFG-progression) |

\* The models above all originate from initialization seed 2594 in our multi-seed experiments. The `supervised_large_adapted` model of this seed is the one selected for use in the rollout experiments (see Section 4.3.1 in the paper). We released the other three models from the same seed for consistency.
### Run Rollout Comparer
We provide a comparer tool that can be used to inspect rollouts side-by-side during a conversation. This interface supports chatting in much the same manner as the streamlit app; however, the comparer tool additionally supports a mutable dialogue history. This means that you can make small modifications at any point in the conversation and observe how they impact the resulting progression computations and rollout results.
To launch the rollout comparer tool, run:
```bash
python dialogue-progression/demo_server.py
```
Then, navigate to [http://localhost:8080/demo_ui](http://localhost:8080/demo_ui).
![Rollout Comparer Tool](images/rollout_compare.png)
The rollout comparer tool uses the same model defaults as the streamlit chat interface, and can be run with alternate arguments in a similar way:
```bash
python dialogue-progression/demo_server.py \
    --agent-modelpath=LACAI/DialoGPT-small-PFG \
    --progression-modelpath=models/progression/persuasion_for_good/unsupervised \
    --random-state=42
```
## Reproduce Paper Results
### Dataset
To reproduce the results from our paper, first download the contents of the [data folder from Persuasion For Good](https://gitlab.com/ucdavisnlp/persuasionforgood/-/tree/master/data) into [data/persuasion_for_good](data/persuasion_for_good) in this repository. Do not replace the CSV and XLSX files already provided in [data/persuasion_for_good/AnnotatedData](data/persuasion_for_good/AnnotatedData) - these are our manual progression annotations.
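For example, one way to fetch the data (a sketch assuming `git` and GNU `cp` are available; the `-n` flag avoids overwriting the provided annotation files) is:
```bash
# Clone Persuasion For Good and copy its data folder without clobbering our annotations
git clone https://gitlab.com/ucdavisnlp/persuasionforgood.git /tmp/persuasionforgood
cp -rn /tmp/persuasionforgood/data/. data/persuasion_for_good/
```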
### Train and Evaluate Progression Models:
Run the following command to train and evaluate the `unsupervised`, `roberta-base`, `roberta-large`, and `roberta-large-adapted` progression models for all of the 33 initialization seeds used in the paper:
```bash
./run_train_progression_multiseed.sh
```
To train and evaluate an individual **unsupervised** progression model using a single seed (e.g., 42)*, run:
```bash
python dialogue-progression/train_progression.py \
    --run-name=unsupervised_42 \
    --recency-weight=0.3 \
    --kmeans-n-clusters=21 \
    --normalize-embeddings \
    --kmeans-progression-inv-dist \
    --transformers-modelpath=sentence-transformers/all-mpnet-base-v2 \
    --model-random-state=42
python dialogue-progression/progression_model_eval_manual.py \
    --progression-modelpath=models/progression/persuasion_for_good/unsupervised
```
\* Use seed 2594 to replicate the released unsupervised model.
To train and evaluate an individual **supervised** progression model (e.g., based on `roberta-base`) using a single seed (e.g., 42)*, run:
```bash
python dialogue-progression/train_progression.py \
    --run-name=supervised_base_42 \
    --embedding-level=dialog \
    --normalize-embeddings \
    --batch-size=16 \
    --transformers-modelpath=roberta-base \
    --model-random-state=42
python dialogue-progression/progression_model_eval_manual.py \
    --progression-modelpath=models/progression/persuasion_for_good/supervised_base
```
\* Use seed 2594 to replicate the released supervised models.
When training progression models, the following naming convention is typically used for runs (see the example after the table):

| run-name | transformers-modelpath |
|---------------------------------|-----------------------------------------|
| unsupervised_{seed} | sentence-transformers/all-mpnet-base-v2 |
| supervised_base_{seed} | roberta-base |
| supervised_large_{seed} | roberta-large |
| supervised_large_adapted_{seed} | LACAI/roberta-large-dialog-narrative |
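For example, following this convention, a `supervised_large_adapted` run with seed 42 could be launched roughly as below. This is a sketch patterned on the supervised example above; the batch size and other hyperparameters may need adjustment for the larger model (consult `train_progression.py` for the full set of options):
```bash
python dialogue-progression/train_progression.py \
    --run-name=supervised_large_adapted_42 \
    --embedding-level=dialog \
    --normalize-embeddings \
    --batch-size=16 \
    --transformers-modelpath=LACAI/roberta-large-dialog-narrative \
    --model-random-state=42
```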
Each run outputs two CSVs containing the auto and manual eval results for that model and seed. These files are named `{run_name}_progression_df.csv` (auto eval) and `manual_eval_results_df` (manual eval), where `run_name` is the value provided in the `--run-name` argument. If using `run_train_progression_multiseed.sh`, only two CSVs are output in total, each compiling the results for all models and seeds.
**Reference Results**
These CSV outputs can be compared against our provided [auto-eval reference results](results/progression/auto_eval_results_33_seeds.xlsx) and [manual-eval reference results](results/progression/manual_eval_results_33_seeds.xlsx) for each model and seed. The aggregations done in these reference results sheets are the source of Tables 1 & 2 in our paper.
### Rollout Self-Play Experiments:
Self-play experiments can be run using the trained `roberta-large-adapted` progression model used in the paper, provided in [models/progression/persuasion_for_good/supervised_large_adapted](models/progression/persuasion_for_good/supervised_large_adapted). When loading this model, RoBERTa weights are automatically loaded from our [LACAI/roberta-large-adapted-PFG-progression](https://huggingface.co/LACAI/roberta-large-adapted-PFG-progression) checkpoint on the HuggingFace model hub.
Run the following commands to perform the 2x2x3 and 3x3x5 self-play experiments for each of the 5 generation seeds used in the paper:
```bash
./run_rollouts_eval_2_2_3_multiseed.sh
./run_rollouts_eval_3_3_5_multiseed.sh
./run_rollouts_eval_stats.sh
```
Running for all 5 generation seeds can take several days. If desired, the first two commands can be run in different terminals using separate GPUs:
Terminal 1:
```bash
export CUDA_VISIBLE_DEVICES=0
./run_rollouts_eval_2_2_3_multiseed.sh
```
Terminal 2:
```bash
export CUDA_VISIBLE_DEVICES=1
./run_rollouts_eval_3_3_5_multiseed.sh
```
Then when both complete:
```bash
./run_rollouts_eval_stats.sh
```
To run self-play experiments using a specified configuration (e.g., 2x2x3) and a single seed (e.g., 42), run:
```bash
python dialogue-progression/rollouts_eval.py \
    --n-candidates=2 \
    --n-rollouts-per-candidate=2 \
    --n-utterances-per-rollout=3 \
    --outputpath=rollouts_self_play/2_2_3/42 \
    --agent-modelpath=LACAI/DialoGPT-large-PFG \
    --progression-modelpath=models/progression/persuasion_for_good/supervised_large_adapted \
    --model-random-state=42
./run_rollouts_eval_stats.sh
```
The commands above should complete in just a few hours on a single GPU.
**Reference Results**
We have provided reference results which are the [complete outputs](results/rollouts_self_play) of the multi-seed self-play experiments in the paper, including all dialogue text. To compute the aggregation of these results as done in the experiment scripts above, run:
```bash
./run_rollouts_eval_stats.sh results/rollouts_self_play
```
These aggregations are the source of Table 3 in our paper.
## Train Progression Models on New Datasets
Coming soon!
## Reference
If you use our code or models in your work, please cite:
```
@article{sanders2022towards,
  title={Towards a Progression-Aware Autonomous Dialogue Agent},
  author={Sanders, Abraham and Strzalkowski, Tomek and Si, Mei and Chang, Albert and Dey, Deepanshu and Braasch, Jonas and Wang, Dakuo},
  journal={arXiv preprint arXiv:2205.03692},
  year={2022}
}
```