🤗 Transformers#

Important

To use this integration you should install tango with the “transformers” extra (e.g. pip install tango[transformers]) or just install the transformers library after the fact (e.g. pip install transformers).

Components for Tango integration with 🤗 Transformers.

This integration provides some useful steps and also registers PyTorch components from the transformers library under the corresponding classes from the torch integration, such as:

  • Model: All transformers “auto” model classes are registered according to their class names (e.g. “transformers::AutoModelForCausalLM::from_pretrained” or “transformers::AutoModelForCausalLM::from_config”).

    For example, to instantiate a pretrained transformer model from params:

    from tango.integrations.torch import Model
    
    model = Model.from_params({
        "type": "transformers::AutoModel::from_pretrained",
        "pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy",
    })
    

    Or to instantiate a transformer model from params without loading pretrained weights:

    from tango.integrations.torch import Model
    
    model = Model.from_params({
        "type": "transformers::AutoModel::from_config",
        "config": {"pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy"},
    })
    

    Tip

    You can see a list of all of the available auto model constructors from transformers by running:

    from tango.integrations.torch import Model
    from tango.integrations.transformers import *
    
    for name in sorted(Model.list_available()):
        if name.startswith("transformers::AutoModel"):
            print(name)
    
    transformers::AutoModel::from_config
    transformers::AutoModel::from_pretrained
    transformers::AutoModelForAudioClassification::from_config
    transformers::AutoModelForAudioClassification::from_pretrained
    transformers::AutoModelForAudioFrameClassification::from_config
    transformers::AutoModelForAudioFrameClassification::from_pretrained
    transformers::AutoModelForAudioXVector::from_config
    transformers::AutoModelForAudioXVector::from_pretrained
    transformers::AutoModelForCTC::from_config
    transformers::AutoModelForCTC::from_pretrained
    transformers::AutoModelForCausalLM::from_config
    transformers::AutoModelForCausalLM::from_pretrained
    transformers::AutoModelForImageClassification::from_config
    transformers::AutoModelForImageClassification::from_pretrained
    transformers::AutoModelForImageSegmentation::from_config
    transformers::AutoModelForImageSegmentation::from_pretrained
    transformers::AutoModelForInstanceSegmentation::from_config
    transformers::AutoModelForInstanceSegmentation::from_pretrained
    transformers::AutoModelForMaskedImageModeling::from_config
    transformers::AutoModelForMaskedImageModeling::from_pretrained
    transformers::AutoModelForMaskedLM::from_config
    transformers::AutoModelForMaskedLM::from_pretrained
    transformers::AutoModelForMultipleChoice::from_config
    transformers::AutoModelForMultipleChoice::from_pretrained
    transformers::AutoModelForNextSentencePrediction::from_config
    transformers::AutoModelForNextSentencePrediction::from_pretrained
    transformers::AutoModelForObjectDetection::from_config
    transformers::AutoModelForObjectDetection::from_pretrained
    transformers::AutoModelForPreTraining::from_config
    transformers::AutoModelForPreTraining::from_pretrained
    transformers::AutoModelForQuestionAnswering::from_config
    transformers::AutoModelForQuestionAnswering::from_pretrained
    transformers::AutoModelForSemanticSegmentation::from_config
    transformers::AutoModelForSemanticSegmentation::from_pretrained
    transformers::AutoModelForSeq2SeqLM::from_config
    transformers::AutoModelForSeq2SeqLM::from_pretrained
    transformers::AutoModelForSequenceClassification::from_config
    transformers::AutoModelForSequenceClassification::from_pretrained
    transformers::AutoModelForSpeechSeq2Seq::from_config
    transformers::AutoModelForSpeechSeq2Seq::from_pretrained
    transformers::AutoModelForTableQuestionAnswering::from_config
    transformers::AutoModelForTableQuestionAnswering::from_pretrained
    transformers::AutoModelForTokenClassification::from_config
    transformers::AutoModelForTokenClassification::from_pretrained
    transformers::AutoModelForVision2Seq::from_config
    transformers::AutoModelForVision2Seq::from_pretrained
    transformers::AutoModelForVisualQuestionAnswering::from_config
    transformers::AutoModelForVisualQuestionAnswering::from_pretrained
    transformers::AutoModelWithLMHead::from_config
    transformers::AutoModelWithLMHead::from_pretrained
    
  • Optimizer: All optimizers from transformers are registered according to their class names (e.g. “transformers::Adafactor”).
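
    You can instantiate one of these optimizers from a config / params like so. This is a minimal sketch, assuming the model’s parameters can be supplied as an extra keyword argument to from_params (in a training step config you would normally specify only the optimizer’s own settings); the Adafactor hyperparameters shown are purely illustrative:

    from tango.integrations.torch import Model, Optimizer
    from tango.integrations.transformers import *
    
    model = Model.from_params({
        "type": "transformers::AutoModel::from_pretrained",
        "pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy",
    })
    
    # Assumption: extra keyword arguments to from_params fill in constructor
    # arguments that are not in the params dict (here, the model's parameters).
    # A manual learning rate requires turning off Adafactor's relative step sizes.
    optimizer = Optimizer.from_params(
        {
            "type": "transformers::Adafactor",
            "lr": 1e-3,
            "scale_parameter": False,
            "relative_step": False,
        },
        params=model.parameters(),
    )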

    Tip

    You can see a list of all of the available optimizers from transformers by running:

    from tango.integrations.torch import Optimizer
    from tango.integrations.transformers import *
    
    for name in sorted(Optimizer.list_available()):
        if name.startswith("transformers::"):
            print(name)
    
    transformers::Adafactor
    transformers::AdamW
    
  • LRScheduler: All learning rate scheduler functions from transformers are registered according to their type names (e.g. “transformers::linear”).
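
    You can instantiate one of these scheduler functions from a config / params in the same way. This is a minimal sketch, assuming the optimizer can be supplied as an extra keyword argument to from_params; the dummy optimizer and step counts are purely illustrative:

    import torch
    
    from tango.integrations.torch import LRScheduler
    from tango.integrations.transformers import *
    
    # A throwaway optimizer just to keep the snippet self-contained.
    optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-4)
    
    # Assumption: the optimizer is passed as an extra keyword argument, and the
    # remaining arguments match the corresponding scheduler function from transformers.
    lr_scheduler = LRScheduler.from_params(
        {
            "type": "transformers::linear",
            "num_warmup_steps": 100,
            "num_training_steps": 1000,
        },
        optimizer=optimizer,
    )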

    Tip

    You can see a list of all of the available scheduler functions from transformers by running:

    from tango.integrations.torch import LRScheduler
    from tango.integrations.transformers import *
    
    for name in sorted(LRScheduler.list_available()):
        if name.startswith("transformers::"):
            print(name)
    
    transformers::constant
    transformers::constant_with_warmup
    transformers::cosine
    transformers::cosine_with_restarts
    transformers::linear
    transformers::polynomial
    
  • DataCollator: All data collators from transformers are registered according to their class name (e.g. “transformers::DefaultDataCollator”).

    You can instantiate any of these from a config / params like so:

    from tango.integrations.torch import DataCollator
    
    collator = DataCollator.from_params({
        "type": "transformers::DataCollatorWithPadding",
        "tokenizer": {
            "pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy",
        },
    })
    

    Tip

    You can see a list of all of the available data collators from transformers by running:

    from tango.integrations.torch import DataCollator
    from tango.integrations.transformers import *
    
    for name in sorted(DataCollator.list_available()):
        if name.startswith("transformers::"):
            print(name)
    
    transformers::DataCollatorForLanguageModeling
    transformers::DataCollatorForPermutationLanguageModeling
    transformers::DataCollatorForSOP
    transformers::DataCollatorForSeq2Seq
    transformers::DataCollatorForTokenClassification
    transformers::DataCollatorForWholeWordMask
    transformers::DataCollatorWithPadding
    transformers::DefaultDataCollator
    

Reference#

class tango.integrations.transformers.Tokenizer(**kwargs)[source]#

A Registrable version of transformers’ PreTrainedTokenizerBase.

default_implementation: Optional[str] = 'auto'#

The default registered implementation just calls transformers.AutoTokenizer.from_pretrained().

class tango.integrations.transformers.Config(**kwargs)[source]#

A Registrable version of transformers’ PretrainedConfig.

default_implementation: Optional[str] = 'auto'#

The default registered implementation just calls transformers.AutoConfig.from_pretrained().
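
For example, a minimal sketch of instantiating a Tokenizer and a Config from params, relying on the default “auto” implementations (the model name is the same dummy model used elsewhere on this page):

from tango.integrations.transformers import Config, Tokenizer

# With no "type" given, the default "auto" implementations are used, i.e.
# AutoTokenizer.from_pretrained() and AutoConfig.from_pretrained().
tokenizer = Tokenizer.from_params({"pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy"})
config = Config.from_params({"pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy"})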

class tango.integrations.transformers.RunGeneration(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, **kwargs)[source]#

A step that runs seq2seq Huggingface models in inference mode.

Tip

Registered as a Step under the name “transformers::run_generation”.
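
For example, a minimal sketch of building this step from params (the model name, prompts, and max_length are purely illustrative; in a real experiment the same dictionary would appear as a step entry in the config, with the prompts usually coming from an upstream step):

from tango import Step

step = Step.from_params({
    "type": "transformers::run_generation",
    "model": "gpt2",                        # any model name that works with transformers
    "prompts": ["Tango is", "Steps are"],   # usually produced by another step
    "max_length": 16,
})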

run(model, prompts, *, tokenizer=None, batch_size=4, max_length=20, temperature=1.0, repetition_penalty=1.0, k=0, p=0.9, prefix='', xlm_language='', seed=42, num_return_sequences=1, fp16=False)[source]#

Run a Huggingface seq2seq model in inference mode.

Parameters
  • model (Union[str, Model]) – The name of the model to run. Any name that works in the transformers library works here. Or, you can directly provide the model to run.

  • prompts (Iterable[str]) – The prompts to run through the model. You can specify prompts directly in the config, but more commonly the prompts are produced by another step that reads a dataset, for example.

  • tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast, None], default: None) – The tokenizer to use.

  • batch_size (int, default: 4) – The number of sequences to process at one time. This has no bearing on the output, so you can change this number without invalidating cached results.

  • max_length (int, default: 20) – The maximum number of tokens/word pieces that the model will generate. For models that extend the prompt, the prefix does not count towards this limit.

  • temperature (float, default: 1.0) – Passed directly to the model’s generate() method. The value used to model the next token probabilities.

  • repetition_penalty (float, default: 1.0) – Passed directly to the model’s generate() method. The parameter for repetition penalty. 1.0 means no penalty.

  • k (int, default: 0) – Passed directly to the model’s generate() method. The number of highest probability vocabulary tokens to keep for top-k filtering.

  • p (float, default: 0.9) – Passed directly to the model’s generate() method. If set to a float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.

  • prefix (str, default: '') – A prefix that gets prepended to all prompts.

  • xlm_language (str, default: '') – For the XLM model, this is a way to specify the language you want to use.

  • seed (int, default: 42) – Random seed

  • num_return_sequences (int, default: 1) – The number of generations to return for each prompt.

  • fp16 (bool, default: False) – Whether to use 16-bit floats.

Return type

Iterable[List[str]]

Returns

Returns an iterator of lists of strings. Each list contains the predictions for one prompt.

FORMAT: Format = <tango.format.JsonFormat object>#

This specifies the format the results of this step will be serialized in. See the documentation for Format for details.

SKIP_ID_ARGUMENTS: Set[str] = {'batch_size'}#

If your run() method takes some arguments that don’t affect the results, list them here. Arguments listed here will not be used to calculate this step’s unique ID, and thus changing those arguments does not invalidate the cache.

For example, you might use this for the batch size in an inference step, where you only care about the model output, not about how many outputs you can produce at the same time.

VERSION: Optional[str] = '001'#

This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn’t invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.

class tango.integrations.transformers.RunGenerationDataset(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, **kwargs)[source]#

A step that runs seq2seq Huggingface models in inference mode.

This is similar to RunGeneration, but it takes a dataset as input and produces a new dataset as output, which contains the predictions in a new field.

Tip

Registered as a Step under the name “transformers::run_generation_dataset”.
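
For example, a minimal sketch that constructs the step directly in Python around a tiny in-memory Hugging Face dataset (the model name and field names are purely illustrative; in a real experiment the input would normally be the output of an upstream dataset-loading step):

import datasets

from tango.integrations.transformers import RunGenerationDataset

# A tiny illustrative dataset with a single "prompt" column.
input_dataset = datasets.DatasetDict({
    "validation": datasets.Dataset.from_dict({"prompt": ["Tango is", "Steps are"]}),
})

step = RunGenerationDataset(
    model="gpt2",               # any model name that works with transformers
    input=input_dataset,
    prompt_field="prompt",
    output_field="generation",  # predictions are written to this new field
    max_length=16,
)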

run(model, input, prompt_field, *, tokenizer=None, output_field=None, splits=None, batch_size=4, max_length=20, temperature=1.0, repetition_penalty=1.0, k=0, p=0.9, prefix='', xlm_language='', seed=42, num_return_sequences=1, fp16=False)[source]#

Augment an input dataset with generations from a Huggingface seq2seq model.

Parameters
  • model (Union[str, Model]) – The name of the model to run. Any name that works in the transformers library works here. Or, you can directly provide the model to run.

  • input (Union[DatasetDict, datasets.DatasetDict]) – The input dataset, either a Tango DatasetDict or a Hugging Face datasets.DatasetDict.

  • prompt_field (str) – The field in the dataset that contains the text of the prompts.

  • tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast, None], default: None) – The tokenizer to use.

  • output_field (Optional[str], default: None) – The field in the dataset that we will write the predictions into. In the result, this field will contain List[str].

  • splits (Union[str, Set[str], None], default: None) – A split, or set of splits, to process. If this is not specified, we will process all splits.

  • batch_size (int, default: 4) – The number of sequences to process at one time. This has no bearing on the output, so you can change this number without invalidating cached results.

  • max_length (int, default: 20) – The maximum number of tokens/word pieces that the model will generate. For models that extend the prompt, the prefix does not count towards this limit.

  • temperature (float, default: 1.0) – Passed directly to the model’s generate() method. The value used to model the next token probabilities.

  • repetition_penalty (float, default: 1.0) – Passed directly to the model’s generate() method. The parameter for repetition penalty. 1.0 means no penalty.

  • k (int, default: 0) – Passed directly to the model’s generate() method. The number of highest probability vocabulary tokens to keep for top-k filtering.

  • p (float, default: 0.9) – Passed directly to the model’s generate() method. If set to a float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.

  • prefix (str, default: '') – A prefix that gets prepended to all prompts.

  • xlm_language (str, default: '') – For the XLM model, this is a way to specify the language you want to use.

  • seed (int, default: 42) – Random seed

  • num_return_sequences (int, default: 1) – The number of generations to return for each prompt.

  • fp16 (bool, default: False) – Whether to use 16-bit floats.

Return type

DatasetDict

Returns

Returns a dataset with an extra field containing the predictions.

FORMAT: Format = <tango.format.SqliteDictFormat object>#

This specifies the format the results of this step will be serialized in. See the documentation for Format for details.

SKIP_ID_ARGUMENTS: Set[str] = {'batch_size'}#

If your run() method takes some arguments that don’t affect the results, list them here. Arguments listed here will not be used to calculate this step’s unique ID, and thus changing those arguments does not invalidate the cache.

For example, you might use this for the batch size in an inference step, where you only care about the model output, not about how many outputs you can produce at the same time.

VERSION: Optional[str] = '002'#

This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn’t invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.