Evaluating T0
This example uses the transformers::run_generation_dataset step to run the T0 model on the XSum summarization data, prompted in 10 different ways, and computes ROUGE scores for each variant. Finally, it computes an overall ROUGE score across all variants combined.
This example uses mostly built-in Tango steps. You will need the datasets and transformers integrations.
The only custom step in this example is RougeScoreStep, which computes ROUGE scores from the generated text.
RougeScoreStep
RougeScoreStep is defined in eval.py:
import logging
from typing import Dict

from torch import Tensor
from torchmetrics.text.rouge import ROUGEScore

from tango import Format, JsonFormat, Step
from tango.common import DatasetDict
from tango.common.tqdm import Tqdm

logger = logging.getLogger(__name__)


@Step.register("rouge_score")
class RougeScoreStep(Step[Dict[str, Tensor]]):
    VERSION = "002"
    FORMAT: Format = JsonFormat()

    def run(  # type: ignore
        self,
        input: DatasetDict,
        input_split: str,
        target_field: str,
        prediction_field: str,
        use_stemmer: bool = True,
    ) -> Dict[str, Tensor]:
        metric = ROUGEScore(
            use_stemmer=use_stemmer,
            rouge_keys=("rouge1", "rouge2", "rougeL"),
            accumulate="avg",
        )

        for instance in Tqdm.tqdm(input[input_split], desc="Calculating scores"):
            target = instance[target_field]
            # An instance can hold several generated texts; score each one
            # against the same target.
            for prediction in instance[prediction_field]:
                metric.update(prediction, target)

        return metric.compute()
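To make the scoring loop above more concrete, here is a minimal sketch of the same iteration pattern using a simplified, hand-written ROUGE-1 F1 in place of torchmetrics. It is not the torchmetrics implementation: it assumes lowercased whitespace tokenization, applies no stemming, and only covers unigram overlap. The helper names (rouge1_f1, average_rouge1) and the toy instances are hypothetical and exist only for illustration.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def rouge1_f1(prediction: str, target: str) -> float:
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings."""
    pred_tokens = Counter(prediction.lower().split())
    target_tokens = Counter(target.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_tokens & target_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(target_tokens.values())
    return 2 * precision * recall / (precision + recall)


def average_rouge1(
    instances: Iterable[Dict[str, Any]],
    target_field: str,
    prediction_field: str,
) -> float:
    """Average the simplified score over every (prediction, target) pair."""
    scores: List[float] = []
    for instance in instances:
        target = instance[target_field]
        # Same nesting as RougeScoreStep.run(): several predictions per target.
        for prediction in instance[prediction_field]:
            scores.append(rouge1_f1(prediction, target))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data using the field names from the config.
toy_instances = [
    {"targets_pretokenized": "the cat sat on the mat", "generation": ["the cat sat on the mat"]},
    {"targets_pretokenized": "a dog barked", "generation": ["a cat meowed"]},
]
toy_score = average_rouge1(toy_instances, "targets_pretokenized", "generation")
```

The real step should produce different numbers, since torchmetrics applies its own tokenization and optional stemming and also reports ROUGE-2 and ROUGE-L.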
Config
The configuration file, config.jsonnet, uses some advanced Jsonnet concepts like std.foldl to create the same configuration for all 10 prompts:
local model = "bigscience/T0_3B";
local batch_size = 8;

local datasets = [
    'xsum_DOC_boils_down_to_simple_idea_that',
    'xsum_DOC_given_above_write_one_sentence',
    'xsum_DOC_how_would_you_rephrase_few_words',
    'xsum_DOC_tldr',
    'xsum_DOC_write_summary_of_above',
    'xsum_article_DOC_summary',
    'xsum_college_roommate_asked_DOC_so_I_recap',
    'xsum_read_below_DOC_write_abstract',
    'xsum_summarize_DOC',
    'xsum_summarize_this_DOC_summary'
];

# This creates three steps for each of the datasets:
# 1. Load the dataset.
# 2. Generate output based on the dataset.
# 3. Evaluate the output against the gold answers.
local dataset_steps = std.foldl(
    function(x, dataset_name) x + {
        ["dataset_" + dataset_name]: {
            "type": "datasets::load",
            "path": "bigscience/P3",
            "name": dataset_name,
        },
        ["generation_" + dataset_name]: {
            "type": "transformers::run_generation_dataset",
            "max_length": 200,
            "input": {"ref": "dataset_" + dataset_name},
            "batch_size": batch_size,
            "model": model,
            "prompt_field": "inputs_pretokenized",
            "output_field": "generation",
            "splits": ["validation"]
        },
        ["eval_" + dataset_name]: {
            "type": "rouge_score",
            "input": {"ref": "generation_" + dataset_name},
            "input_split": "validation",
            "target_field": "targets_pretokenized",
            "prediction_field": "generation"
        }
    },
    datasets,
    {}
);

# In addition to the three steps per dataset, we also combine all the generations and
# evaluate them all together.
{
    "steps": dataset_steps + {
        "all_generations": {
            "type": "dataset_combine",
            "inputs": std.map(
                function(dataset_name) {"ref": "generation_" + dataset_name},
                datasets
            )
        },
        "all_evaluations": {
            "type": "rouge_score",
            "input": {"ref": "all_generations"},
            "input_split": "validation",
            "target_field": "targets_pretokenized",
            "prediction_field": "generation"
        }
    }
}
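For readers less familiar with std.foldl, the fold in the configuration above can be mirrored in Python with functools.reduce. This is only an illustrative sketch of how the accumulator grows by three entries per dataset name. Tango does not read this code, and the function names add_dataset_steps and build_steps are hypothetical.

```python
from functools import reduce
from typing import Any, Dict, List

MODEL = "bigscience/T0_3B"
BATCH_SIZE = 8


def add_dataset_steps(steps: Dict[str, Any], dataset_name: str) -> Dict[str, Any]:
    """Return a copy of `steps` extended with the three steps for one dataset.

    Plays the role of `function(x, dataset_name) x + {...}` in the Jsonnet config.
    """
    return {
        **steps,
        f"dataset_{dataset_name}": {
            "type": "datasets::load",
            "path": "bigscience/P3",
            "name": dataset_name,
        },
        f"generation_{dataset_name}": {
            "type": "transformers::run_generation_dataset",
            "max_length": 200,
            "input": {"ref": f"dataset_{dataset_name}"},
            "batch_size": BATCH_SIZE,
            "model": MODEL,
            "prompt_field": "inputs_pretokenized",
            "output_field": "generation",
            "splits": ["validation"],
        },
        f"eval_{dataset_name}": {
            "type": "rouge_score",
            "input": {"ref": f"generation_{dataset_name}"},
            "input_split": "validation",
            "target_field": "targets_pretokenized",
            "prediction_field": "generation",
        },
    }


def build_steps(datasets: List[str]) -> Dict[str, Any]:
    """Fold over the dataset names, starting from an empty dict, like std.foldl."""
    return reduce(add_dataset_steps, datasets, {})


# Two of the prompt variants from the config, used here as a small example.
example_steps = build_steps(["xsum_DOC_tldr", "xsum_summarize_DOC"])
```

Each pass through the fold receives the steps accumulated so far and returns them with three more keys, so two dataset names yield six steps and the full list of ten yields thirty.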
Run it
You can run the experiment with:
tango run config.jsonnet -i eval -d /tmp/workspace