Evaluating T0

This example uses the transformers::run_generation_dataset step to run the T0 model. It runs T0 on the XSum summarization dataset, prompted in 10 different ways, and computes ROUGE scores for each variant. Finally, it computes an overall ROUGE score across all variants combined.

This example uses mostly built-in Tango steps. You will need the datasets and transformers integrations. The only custom step in this example is the RougeScoreStep, which computes ROUGE scores from the generated text.


RougeScoreStep is defined in eval.py:

import logging
from typing import Dict

from torch import Tensor
from torchmetrics.text.rouge import ROUGEScore

from tango import Format, JsonFormat, Step
from tango.common import DatasetDict
from tango.common.tqdm import Tqdm

logger = logging.getLogger(__name__)

@Step.register("rouge_score")
class RougeScoreStep(Step[Dict[str, Tensor]]):
    VERSION = "002"
    FORMAT: Format = JsonFormat()

    def run(  # type: ignore
        self,
        input: DatasetDict,
        input_split: str,
        target_field: str,
        prediction_field: str,
        use_stemmer: bool = True,
    ) -> Dict[str, Tensor]:
        metric = ROUGEScore(
            use_stemmer=use_stemmer,
            rouge_keys=("rouge1", "rouge2", "rougeL"),
        )

        for instance in Tqdm.tqdm(input[input_split], desc="Calculating scores"):
            target = instance[target_field]
            for prediction in instance[prediction_field]:
                metric.update(prediction, target)

        return metric.compute()
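
To make the scoring loop more concrete, here is a minimal pure-Python sketch of a ROUGE-1 style F1 score, averaged over every (prediction, target) pair in the same nested loop that run() uses. The helper names rouge1_f1 and average_rouge1 are hypothetical and not part of Tango or torchmetrics. The sketch splits on whitespace and skips stemming, so its numbers should not be expected to match ROUGEScore.

```python
from collections import Counter


def rouge1_f1(prediction: str, target: str) -> float:
    """Unigram-overlap F1 between one prediction and one reference (illustrative only)."""
    pred_tokens = prediction.lower().split()
    target_tokens = target.lower().split()
    if not pred_tokens or not target_tokens:
        return 0.0
    # Clipped overlap: a token counts only as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(target_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)


def average_rouge1(instances, target_field: str, prediction_field: str) -> float:
    """Average the score over all predictions of all instances."""
    scores = [
        rouge1_f1(prediction, instance[target_field])
        for instance in instances
        for prediction in instance[prediction_field]
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Each instance is assumed to hold one reference string and a list of generated strings, which is why the inner loop iterates over the prediction field.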


The configuration file, config.jsonnet, uses some advanced Jsonnet concepts like std.foldl to create the same configuration for all 10 prompts:

local model = "bigscience/T0_3B";
local batch_size = 8;

local datasets = [
    # The names of the 10 prompted XSum variants in bigscience/P3 go here.
];

# This creates three steps for each of the datasets:
# 1. Load the dataset.
# 2. Generate output based on the dataset.
# 3. Evaluate the output against the gold answers.
local dataset_steps = std.foldl(
    function(x, dataset_name) x + {
        ["dataset_" + dataset_name]: {
            "type": "datasets::load",
            "path": "bigscience/P3",
            "name": dataset_name,
        },
        ["generation_" + dataset_name]: {
            "type": "transformers::run_generation_dataset",
            "max_length": 200,
            "input": {"ref": "dataset_" + dataset_name},
            "batch_size": batch_size,
            "model": model,
            "prompt_field": "inputs_pretokenized",
            "output_field": "generation",
            "splits": ["validation"]
        },
        ["eval_" + dataset_name]: {
            "type": "rouge_score",
            "input": {"ref": "generation_" + dataset_name},
            "input_split": "validation",
            "target_field": "targets_pretokenized",
            "prediction_field": "generation"
        }
    },
    datasets,
    {}
);

# In addition to the three steps per dataset, we also combine all the generations and
# evaluate them all together.
{
    "steps": dataset_steps + {
        "all_generations": {
            "type": "dataset_combine",
            "inputs": std.map(
                function(dataset_name) {"ref": "generation_" + dataset_name},
                datasets
            )
        },
        "all_evaluations": {
            "type": "rouge_score",
            "input": {"ref": "all_generations"},
            "input_split": "validation",
            "target_field": "targets_pretokenized",
            "prediction_field": "generation"
        }
    }
}
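
For readers less familiar with Jsonnet, the std.foldl pattern above can be sketched in Python with functools.reduce. This is only an illustration of how the fold expands into three steps per dataset. The names prompt_a and prompt_b are placeholders, not real P3 dataset names, and the generation step is abbreviated to a few of its fields.

```python
from functools import reduce

# Placeholder names for illustration; the real prompted XSum names are not listed here.
datasets = ["prompt_a", "prompt_b"]


def add_dataset_steps(steps: dict, dataset_name: str) -> dict:
    """Return a copy of `steps` extended with load, generation, and eval steps."""
    return {
        **steps,
        "dataset_" + dataset_name: {
            "type": "datasets::load",
            "path": "bigscience/P3",
            "name": dataset_name,
        },
        "generation_" + dataset_name: {
            "type": "transformers::run_generation_dataset",
            "input": {"ref": "dataset_" + dataset_name},
            "splits": ["validation"],
        },
        "eval_" + dataset_name: {
            "type": "rouge_score",
            "input": {"ref": "generation_" + dataset_name},
            "input_split": "validation",
        },
    }


# std.foldl(func, arr, init) plays the same role as reduce(func, arr, init).
dataset_steps = reduce(add_dataset_steps, datasets, {})

# The final config adds the combined steps on top of the per-dataset ones.
config = {
    "steps": {
        **dataset_steps,
        "all_generations": {
            "type": "dataset_combine",
            "inputs": [{"ref": "generation_" + name} for name in datasets],
        },
    }
}
```

With two placeholder datasets the fold yields six per-dataset steps, and the list comprehension stands in for the std.map call that collects every generation step as an input.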

Run it

You can run the experiment with:

tango run config.jsonnet -i eval -d /tmp/workspace