🤗 Transformers#
Important
To use this integration you should install tango with the “transformers” extra (e.g. pip install tango[transformers]) or just install the transformers library after the fact (e.g. pip install transformers).
Components for Tango integration with 🤗 Transformers.
This integration provides some useful steps and also registers PyTorch components from the transformers library under the corresponding class from the torch integration, such as:
Model
: All transformers “auto” model classes are registered according to their class names (e.g. “transformers::AutoModelForCausalLM::from_pretrained” or “transformers::AutoModelForCausalLM::from_config”).
For example, to instantiate a pretrained transformer model from params:
from tango.integrations.torch import Model

model = Model.from_params({
    "type": "transformers::AutoModel::from_pretrained",
    "pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy",
})
Or to instantiate a transformer model from params without loading pretrained weights:
from tango.integrations.torch import Model

model = Model.from_params({
    "type": "transformers::AutoModel::from_config",
    "config": {"pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy"},
})
Tip
You can see a list of all of the available auto model constructors from transformers by running:
from tango.integrations.torch import Model
from tango.integrations.transformers import *

for name in sorted(Model.list_available()):
    if name.startswith("transformers::AutoModel"):
        print(name)
transformers::AutoModel::from_config
transformers::AutoModel::from_pretrained
transformers::AutoModelForAudioClassification::from_config
transformers::AutoModelForAudioClassification::from_pretrained
transformers::AutoModelForAudioFrameClassification::from_config
transformers::AutoModelForAudioFrameClassification::from_pretrained
transformers::AutoModelForAudioXVector::from_config
transformers::AutoModelForAudioXVector::from_pretrained
transformers::AutoModelForCTC::from_config
transformers::AutoModelForCTC::from_pretrained
transformers::AutoModelForCausalLM::from_config
transformers::AutoModelForCausalLM::from_pretrained
transformers::AutoModelForDepthEstimation::from_config
transformers::AutoModelForDepthEstimation::from_pretrained
transformers::AutoModelForDocumentQuestionAnswering::from_config
transformers::AutoModelForDocumentQuestionAnswering::from_pretrained
transformers::AutoModelForImageClassification::from_config
transformers::AutoModelForImageClassification::from_pretrained
transformers::AutoModelForImageSegmentation::from_config
transformers::AutoModelForImageSegmentation::from_pretrained
transformers::AutoModelForInstanceSegmentation::from_config
transformers::AutoModelForInstanceSegmentation::from_pretrained
transformers::AutoModelForMaskGeneration::from_config
transformers::AutoModelForMaskGeneration::from_pretrained
transformers::AutoModelForMaskedImageModeling::from_config
transformers::AutoModelForMaskedImageModeling::from_pretrained
transformers::AutoModelForMaskedLM::from_config
transformers::AutoModelForMaskedLM::from_pretrained
transformers::AutoModelForMultipleChoice::from_config
transformers::AutoModelForMultipleChoice::from_pretrained
transformers::AutoModelForNextSentencePrediction::from_config
transformers::AutoModelForNextSentencePrediction::from_pretrained
transformers::AutoModelForObjectDetection::from_config
transformers::AutoModelForObjectDetection::from_pretrained
transformers::AutoModelForPreTraining::from_config
transformers::AutoModelForPreTraining::from_pretrained
transformers::AutoModelForQuestionAnswering::from_config
transformers::AutoModelForQuestionAnswering::from_pretrained
transformers::AutoModelForSemanticSegmentation::from_config
transformers::AutoModelForSemanticSegmentation::from_pretrained
transformers::AutoModelForSeq2SeqLM::from_config
transformers::AutoModelForSeq2SeqLM::from_pretrained
transformers::AutoModelForSequenceClassification::from_config
transformers::AutoModelForSequenceClassification::from_pretrained
transformers::AutoModelForSpeechSeq2Seq::from_config
transformers::AutoModelForSpeechSeq2Seq::from_pretrained
transformers::AutoModelForTableQuestionAnswering::from_config
transformers::AutoModelForTableQuestionAnswering::from_pretrained
transformers::AutoModelForTokenClassification::from_config
transformers::AutoModelForTokenClassification::from_pretrained
transformers::AutoModelForUniversalSegmentation::from_config
transformers::AutoModelForUniversalSegmentation::from_pretrained
transformers::AutoModelForVideoClassification::from_config
transformers::AutoModelForVideoClassification::from_pretrained
transformers::AutoModelForVision2Seq::from_config
transformers::AutoModelForVision2Seq::from_pretrained
transformers::AutoModelForVisualQuestionAnswering::from_config
transformers::AutoModelForVisualQuestionAnswering::from_pretrained
transformers::AutoModelForZeroShotImageClassification::from_config
transformers::AutoModelForZeroShotImageClassification::from_pretrained
transformers::AutoModelForZeroShotObjectDetection::from_config
transformers::AutoModelForZeroShotObjectDetection::from_pretrained
transformers::AutoModelWithLMHead::from_config
transformers::AutoModelWithLMHead::from_pretrained
Optimizer
: All optimizers from transformers are registered according to their class names (e.g. “transformers::Adafactor”); a usage sketch follows the list below.
Tip
You can see a list of all of the available optimizers from transformers by running:
from tango.integrations.torch import Optimizer
from tango.integrations.transformers import *

for name in sorted(Optimizer.list_available()):
    if name.startswith("transformers::"):
        print(name)
transformers::Adafactor
transformers::AdamW
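As a quick illustration (a minimal sketch, not from the official docs), you can look up one of these registered optimizers by name and construct it around a model's parameters; the toy module and the default hyperparameters below are placeholder assumptions:
import torch

from tango.integrations.torch import Optimizer
from tango.integrations.transformers import *

# A toy module standing in for a real transformer model.
model = torch.nn.Linear(4, 2)

# Look up the registered Adafactor class by name and construct it over the
# model's parameters (Adafactor's default hyperparameters are used here).
optimizer = Optimizer.by_name("transformers::Adafactor")(model.parameters())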
LRScheduler
: All learning rate scheduler functions from transformers are registered according to their type names (e.g. “transformers::linear”); a config sketch follows the list below.
Tip
You can see a list of all of the available scheduler functions from transformers by running:
from tango.integrations.torch import LRScheduler
from tango.integrations.transformers import *

for name in sorted(LRScheduler.list_available()):
    if name.startswith("transformers::"):
        print(name)
transformers::constant
transformers::constant_with_warmup
transformers::cosine
transformers::cosine_with_restarts
transformers::inverse_sqrt
transformers::linear
transformers::polynomial
transformers::reduce_lr_on_plateau
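As a hedged sketch, a scheduler such as “transformers::linear” might appear in a config as below; the keyword arguments are assumed to follow the underlying transformers factory function (get_linear_schedule_with_warmup), and the numbers are placeholders:
# Hypothetical scheduler sub-config; the field names mirror the arguments of
# transformers.get_linear_schedule_with_warmup().
lr_scheduler_config = {
    "type": "transformers::linear",
    "num_warmup_steps": 100,
    "num_training_steps": 10000,
}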
DataCollator
: All data collators from transformers are registered according to their class name (e.g. “transformers::DefaultDataCollator”).
You can instantiate any of these from a config / params like so (a usage sketch follows the list of collators below):
from tango.integrations.torch import DataCollator

collator = DataCollator.from_params({
    "type": "transformers::DataCollatorWithPadding",
    "tokenizer": {
        "pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy",
    },
})
Tip
You can see a list of all of the available data collators from transformers by running:
from tango.integrations.torch import DataCollator
from tango.integrations.transformers import *

for name in sorted(DataCollator.list_available()):
    if name.startswith("transformers::"):
        print(name)
transformers::DataCollatorForLanguageModeling
transformers::DataCollatorForPermutationLanguageModeling
transformers::DataCollatorForSOP
transformers::DataCollatorForSeq2Seq
transformers::DataCollatorForTokenClassification
transformers::DataCollatorForWholeWordMask
transformers::DataCollatorWithPadding
transformers::DefaultDataCollator
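As a rough usage sketch (assuming the collator constructed in the example above), calling the collator on a list of tokenized examples pads them into a single batch of tensors; the token IDs below are arbitrary placeholders:
# Two hand-made examples of different lengths; DataCollatorWithPadding pads them
# to a common length and returns PyTorch tensors by default.
batch = collator([
    {"input_ids": [2, 5, 8]},
    {"input_ids": [2, 5, 9, 4, 3]},
])
print(batch["input_ids"].shape)  # e.g. torch.Size([2, 5])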
- class tango.integrations.transformers.Config(**kwargs)[source]#
A Registrable version of transformers’ PretrainedConfig.
- default_implementation: Optional[str] = 'auto'#
The default registered implementation just calls transformers.AutoConfig.from_pretrained().
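For instance, here is a minimal sketch of building a config from params; it assumes the default “auto” implementation simply forwards its arguments to transformers.AutoConfig.from_pretrained():
from tango.integrations.transformers import Config

# With no "type" specified, the default "auto" implementation is used.
config = Config.from_params({"pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy"})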
- class tango.integrations.transformers.FinetuneStep(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#
Mostly similar to TorchTrainStep with additional preprocessing for data.
Tip
Registered as a Step under the name “transformers::finetune”.
Important
The training loop will use GPU(s) automatically when available, as long as at least device_count CUDA devices are available.
Distributed data parallel training is activated when the device_count is greater than 1.
You can control which CUDA devices to use with the environment variable CUDA_VISIBLE_DEVICES. For example, to only use the GPUs with IDs 0 and 1, set CUDA_VISIBLE_DEVICES=0,1 (and device_count to 2).
Warning
During validation, the validation metric (specified by the val_metric_name parameter) is aggregated by simply averaging across validation batches and distributed processes. This behavior is usually correct when your validation metric is “loss” or “accuracy”, for example, but may not be correct for other metrics like “F1”.
If this is not correct for your metric, you will need to handle the aggregation internally in your model or with a TrainCallback using the TrainCallback.post_val_batch() method. Then set the parameter auto_aggregate_val_metric to False.
Note that correctly aggregating your metric during distributed training will involve distributed communication.
- run(model, tokenizer, training_engine, dataset_dict, train_dataloader, *, train_split='train', validation_split=None, validation_dataloader=None, source_field='source', target_field='target', max_source_length=1024, max_target_length=1024, seed=42, train_steps=None, train_epochs=None, validation_steps=None, grad_accum=1, log_every=10, checkpoint_every=100, validate_every=None, device_count=1, distributed_port=54761, val_metric_name='loss', minimize_val_metric=True, auto_aggregate_val_metric=True, callbacks=None, remove_stale_checkpoints=True)[source]#
Run a basic training loop to train the model.
- Parameters:
model (Lazy[Model]) – The model to train. It should return a dict that includes the loss during training and the val_metric_name during validation.
tokenizer (Tokenizer) – The tokenizer to use for tokenizing source and target sequences.
training_engine (Lazy[TrainingEngine]) – The TrainingEngine to use to train the model.
dataset_dict (DatasetDict) – The train and optional validation data.
train_dataloader (Lazy[DataLoader]) – The data loader that generates training batches. The batches should be dict objects that will be used as kwargs for the model’s forward() method.
train_split (str, default: 'train') – The name of the data split used for training in the dataset_dict. Default is “train”.
validation_split (Optional[str], default: None) – Optional name of the validation split in the dataset_dict. Default is None, which means no validation.
validation_dataloader (Optional[Lazy[DataLoader]], default: None) – An optional data loader for generating validation batches. The batches should be dict objects. If not specified, but validation_split is given, the validation DataLoader will be constructed from the same parameters as the train DataLoader.
source_field (str, default: 'source') – The string name of the field containing the source sequence.
target_field (str, default: 'target') – The string name of the field containing the target sequence.
max_source_length (Optional[int], default: 1024) – The maximum number of tokens in the source sequence.
max_target_length (Optional[int], default: 1024) – The maximum number of tokens in the target sequence.
seed (int, default: 42) – Used to set the RNG states at the beginning of training.
train_steps (Optional[int], default: None) – The number of steps to train for. If not specified, training will stop after a complete iteration through the train_dataloader.
train_epochs (Optional[int], default: None) – The number of epochs to train for. You cannot specify train_steps and train_epochs at the same time.
validation_steps (Optional[int], default: None) – The number of steps to validate for. If not specified, validation will stop after a complete iteration through the validation_dataloader.
grad_accum (int, default: 1) – The number of gradient accumulation steps. Defaults to 1.
Note
This parameter, in conjunction with the settings of your data loader and the number of distributed workers, determines the effective batch size of your training run.
log_every (int, default: 10) – Log every this many steps.
checkpoint_every (int, default: 100) – Save a checkpoint every this many steps.
validate_every (Optional[int], default: None) – Run the validation loop every this many steps.
device_count (int, default: 1) – The number of devices to train on, i.e. the number of distributed data parallel workers.
distributed_port (int, default: 54761) – The port of the distributed process group. Default = “54761”.
val_metric_name (str, default: 'loss') – The name of the validation metric, i.e. the key of the metric in the dictionary returned by the forward pass of the model. Default is “loss”.
minimize_val_metric (bool, default: True) – Whether the validation metric is meant to be minimized (such as the loss). Default is True. When using a metric such as accuracy, you should set this to False.
auto_aggregate_val_metric (bool, default: True) – If True (the default), the validation metric will be averaged across validation batches and distributed processes. This may not be the correct behavior for some metrics (such as F1), in which case you should set this to False and handle the aggregation internally in your model or with a TrainCallback (using TrainCallback.post_val_batch()).
callbacks (Optional[List[Lazy[TrainCallback]]], default: None) – A list of TrainCallback.
remove_stale_checkpoints (bool, default: True) – If True (the default), stale checkpoints will be removed throughout training so that only the latest and best checkpoints are kept.
- Return type:
Model
- Returns:
The trained model on CPU with the weights from the best checkpoint loaded.
- CACHEABLE: Optional[bool] = True#
This provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn’t need to be cached, because HuggingFace datasets already have their own caching mechanism. But it’s still a deterministic step, and all following steps are allowed to cache. If it is None, the step figures out by itself whether it should be cacheable or not.
- DETERMINISTIC: bool = True#
This describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is False, you can still cache the output of the step, but the results might be unexpected. Tango will print a warning in this case.
- FORMAT: Format = <tango.integrations.torch.format.TorchFormat object>#
This specifies the format the results of this step will be serialized in. See the documentation for Format for details.
- SKIP_ID_ARGUMENTS: Set[str] = {'distributed_port', 'log_every'}#
If your run() method takes some arguments that don’t affect the results, list them here. Arguments listed here will not be used to calculate this step’s unique ID, and thus changing those arguments does not invalidate the cache.
For example, you might use this for the batch size in an inference step, where you only care about the model output, not about how many outputs you can produce at the same time.
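To tie the arguments above together, here is a hedged sketch of what a “transformers::finetune” step configuration might look like. Only the argument names documented above come from this page; the nested model, training-engine, and data-loader settings are placeholder assumptions that will depend on your own setup:
# Hypothetical step configuration for "transformers::finetune".
finetune_step = {
    "type": "transformers::finetune",
    "model": {
        "type": "transformers::AutoModelForCausalLM::from_pretrained",
        "pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy",
    },
    "tokenizer": {"pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy"},
    # Placeholder training engine; the optimizer type is one of the registered
    # transformers optimizers listed earlier on this page.
    "training_engine": {
        "optimizer": {"type": "transformers::AdamW", "lr": 5e-5},
    },
    "dataset_dict": {"type": "ref", "ref": "tokenized_data"},  # output of an earlier step
    "train_dataloader": {"batch_size": 8, "shuffle": True},
    "train_epochs": 1,
    "validate_every": 100,
    "device_count": 1,
}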
- class tango.integrations.transformers.FinetuneWrapper(config, *inputs, **kwargs)[source]#
A wrapper PreTrainedModel class that returns either a Seq2SeqLM or a CausalLM model.
- class tango.integrations.transformers.RunGeneration(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#
A step that runs seq2seq Huggingface models in inference mode.
Tip
Registered as a Step under the name “transformers::run_generation”.
- run(model, prompts, *, tokenizer=None, batch_size=4, max_length=20, temperature=1.0, repetition_penalty=1.0, k=0, p=0.9, prefix='', xlm_language='', seed=42, num_return_sequences=1, fp16=False)[source]#
Run a Huggingface seq2seq model in inference mode.
- Parameters:
model (Union[str, Model]) – The name of the model to run. Any name that works in the transformers library works here. Or, you can directly provide the model to run.
prompts (Iterable[str]) – The prompts to run through the model. You can specify prompts directly in the config, but more commonly the prompts are produced by another step that reads a dataset, for example.
tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast, None], default: None) – The tokenizer to run.
batch_size (int, default: 4) – The number of sequences to process at one time. This has no bearing on the output, so you can change this number without invalidating cached results.
max_length (int, default: 20) – The maximum number of tokens/word pieces that the model will generate. For models that extend the prompt, the prefix does not count towards this limit.
temperature (float, default: 1.0) – Passed directly to transformer’s generate() method. The value used to model the next token probabilities.
repetition_penalty (float, default: 1.0) – Passed directly to transformer’s generate() method. The parameter for repetition penalty. 1.0 means no penalty.
k (int, default: 0) – Passed directly to transformer’s generate() method. The number of highest probability vocabulary tokens to keep for top-k-filtering.
p (float, default: 0.9) – Passed directly to transformer’s generate() method. If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
prefix (str, default: '') – A prefix that gets pre-pended to all prompts.
xlm_language (str, default: '') – For the XLM model, this is a way to specify the language you want to use.
seed (int, default: 42) – Random seed.
num_return_sequences (int, default: 1) – The number of generations to return for each prompt.
fp16 (bool, default: False) – Whether to use 16-bit floats.
- Return type:
Iterable[List[str]]
- Returns:
Returns an iterator of lists of strings. Each list contains the predictions for one prompt.
- FORMAT: Format = <tango.format.JsonFormat object>#
This specifies the format the results of this step will be serialized in. See the documentation for Format for details.
- SKIP_ID_ARGUMENTS: Set[str] = {'batch_size'}#
If your run() method takes some arguments that don’t affect the results, list them here. Arguments listed here will not be used to calculate this step’s unique ID, and thus changing those arguments does not invalidate the cache.
For example, you might use this for the batch size in an inference step, where you only care about the model output, not about how many outputs you can produce at the same time.
- VERSION: Optional[str] = '001'#
This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn’t invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.
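For example, a hedged sketch of a “transformers::run_generation” step configuration, using only the argument names documented above; the model name and prompts are placeholders:
# Hypothetical step configuration for "transformers::run_generation".
run_generation_step = {
    "type": "transformers::run_generation",
    "model": "sshleifer/tiny-gpt2",  # placeholder: any transformers model name, or a Model object
    "prompts": ["It was the best of times."],
    "max_length": 30,
    "num_return_sequences": 2,
}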
- class tango.integrations.transformers.RunGenerationDataset(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#
A step that runs seq2seq Huggingface models in inference mode.
This is similar to RunGeneration, but it takes a dataset as input and produces a new dataset as output, which contains the predictions in a new field.
Tip
Registered as a Step under the name “transformers::run_generation_dataset”.
- run(model, input, prompt_field, *, tokenizer=None, output_field=None, splits=None, batch_size=4, max_length=20, temperature=1.0, repetition_penalty=1.0, k=0, p=0.9, prefix='', xlm_language='', seed=42, num_return_sequences=1, fp16=False)[source]#
Augment an input dataset with generations from a Huggingface seq2seq model.
- Parameters:
model (Union[str, Model]) – The name of the model to run. Any name that works in the transformers library works here. Or, you can directly provide the model to run.
input (Union[DatasetDict, DatasetDict]) – The input dataset.
prompt_field (str) – The field in the dataset that contains the text of the prompts.
tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast, None], default: None) – The tokenizer to run.
output_field (Optional[str], default: None) – The field in the dataset that we will write the predictions into. In the result, this field will contain List[str].
splits (Union[str, Set[str], None], default: None) – A split, or set of splits, to process. If this is not specified, we will process all splits.
batch_size (int, default: 4) – The number of sequences to process at one time. This has no bearing on the output, so you can change this number without invalidating cached results.
max_length (int, default: 20) – The maximum number of tokens/word pieces that the model will generate. For models that extend the prompt, the prefix does not count towards this limit.
temperature (float, default: 1.0) – Passed directly to transformer’s generate() method. The value used to model the next token probabilities.
repetition_penalty (float, default: 1.0) – Passed directly to transformer’s generate() method. The parameter for repetition penalty. 1.0 means no penalty.
k (int, default: 0) – Passed directly to transformer’s generate() method. The number of highest probability vocabulary tokens to keep for top-k-filtering.
p (float, default: 0.9) – Passed directly to transformer’s generate() method. If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation.
prefix (str, default: '') – A prefix that gets pre-pended to all prompts.
xlm_language (str, default: '') – For the XLM model, this is a way to specify the language you want to use.
seed (int, default: 42) – Random seed.
num_return_sequences (int, default: 1) – The number of generations to return for each prompt.
fp16 (bool, default: False) – Whether to use 16-bit floats.
- Return type:
DatasetDict
- Returns:
Returns a dataset with an extra field containing the predictions.
- FORMAT: Format = <tango.format.SqliteDictFormat object>#
This specifies the format the results of this step will be serialized in. See the documentation for Format for details.
- SKIP_ID_ARGUMENTS: Set[str] = {'batch_size'}#
If your run() method takes some arguments that don’t affect the results, list them here. Arguments listed here will not be used to calculate this step’s unique ID, and thus changing those arguments does not invalidate the cache.
For example, you might use this for the batch size in an inference step, where you only care about the model output, not about how many outputs you can produce at the same time.
- VERSION: Optional[str] = '002'#
This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn’t invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.
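Analogously, a hedged sketch of a “transformers::run_generation_dataset” step configuration; the “ref” points at a hypothetical earlier step that produces the input dataset, and the field names are placeholders:
# Hypothetical step configuration for "transformers::run_generation_dataset".
run_generation_dataset_step = {
    "type": "transformers::run_generation_dataset",
    "model": "sshleifer/tiny-gpt2",  # placeholder model name
    "input": {"type": "ref", "ref": "my_dataset"},  # output of an earlier step
    "prompt_field": "source",
    "output_field": "prediction",
    "splits": "validation",
}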
- class tango.integrations.transformers.TokenizeText2TextData(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#
A step that tokenizes data containing source and target sequences.
Tip
Registered as a Step under the name “transformers::tokenize_text2text”.
- run(data, tokenizer, num_workers=1, source_field='source', target_field='target', max_source_length=1024, max_target_length=1024, pad_to_max_length=False, ignore_pad_token_for_loss=True, concat_source_target=False)[source]#
Returns a DatasetDict with tokenized source and target fields.
- Parameters:
data (DatasetDict) – The original dataset dict containing the source and target fields.
tokenizer (Tokenizer) – The tokenizer to use.
num_workers (int, default: 1) – The number of workers to use for processing the data.
source_field (str, default: 'source') – The string name of the field containing the source sequence.
target_field (str, default: 'target') – The string name of the field containing the target sequence.
max_source_length (Optional[int], default: 1024) – The maximum number of tokens in the source sequence.
max_target_length (Optional[int], default: 1024) – The maximum number of tokens in the target sequence.
pad_to_max_length (bool, default: False) – Whether to pad to the maximum length when tokenizing.
ignore_pad_token_for_loss (bool, default: True) – Whether to ignore the padded tokens for calculating loss. If set to True, all the pad tokens in the labels are replaced by -100, which is ignored by the loss function.
concat_source_target (bool, default: False) – If the downstream model is decoder-only, like “gpt2”, the source and target sequences need to be concatenated and fed to the model together.
- Return type:
DatasetDict
Tip
If concat_source_target is set to True, we pad all sequences to max length here. Otherwise, we leave it to the appropriate DataCollator object.
- CACHEABLE: Optional[bool] = True#
This provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn’t need to be cached, because HuggingFace datasets already have their own caching mechanism. But it’s still a deterministic step, and all following steps are allowed to cache. If it is None, the step figures out by itself whether it should be cacheable or not.
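As a hedged sketch, a “transformers::tokenize_text2text” step configuration might look like the following; the “ref” points at a hypothetical earlier step whose output contains “source” and “target” fields, and the lengths are placeholders:
# Hypothetical step configuration for "transformers::tokenize_text2text".
tokenize_step = {
    "type": "transformers::tokenize_text2text",
    "data": {"type": "ref", "ref": "raw_data"},  # output of an earlier step
    "tokenizer": {"pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy"},
    "max_source_length": 512,
    "max_target_length": 512,
}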
- class tango.integrations.transformers.Tokenizer(**kwargs)[source]#
A Registrable version of transformers’ PreTrainedTokenizerBase.
- default_implementation: Optional[str] = 'auto'#
The default registered implementation just calls transformers.AutoTokenizer.from_pretrained().
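For instance, a minimal sketch of constructing a tokenizer from params; it assumes the default “auto” implementation simply forwards its arguments to transformers.AutoTokenizer.from_pretrained():
from tango.integrations.transformers import Tokenizer

# With no "type" specified, the default "auto" implementation is used.
tokenizer = Tokenizer.from_params({"pretrained_model_name_or_path": "epwalsh/bert-xsmall-dummy"})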
- tango.integrations.transformers.add_soft_prompt(model, prompt_length, *, only_prompt_is_trainable=True, initialize_from_top_embeddings=5000, random_seed=1940)[source]#
Takes a regular huggingface transformer, and equips it with a soft prompt.
Example:
import transformers

from tango.integrations.transformers import add_soft_prompt

model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
generated = model.generate(tokenizer.encode("It was the best of times.", return_tensors="pt"))
original_output = tokenizer.decode(generated[0])

add_soft_prompt(model, prompt_length=3)
generated = model.generate(tokenizer.encode("It was the best of times.", return_tensors="pt"))
prompted_output = tokenizer.decode(generated[0])
- Parameters:
model (Model) – the original huggingface transformer. This model is augmented in-place!
prompt_length (int) – the length of the soft prompt, in tokens
only_prompt_is_trainable (bool, default: True) – freezes the original model’s weights, leaving only the prompt trainable
initialize_from_top_embeddings (Optional[int], default: 5000) – Prompt embeddings are initialized from a random selection of the top n word piece embeddings from the original model. This is how you set n.
random_seed (int, default: 1940) – random seed used to initialize the prompt embeddings
- Return type:
- tango.integrations.transformers.ia3.modify_with_ia3(transformer, *, config=None, only_ia3_requires_grad=True)[source]#
A function to add IA3 adapters to the given transformer. Code modified from t-few and Qinyuan Ye.
- Parameters:
transformer – A PreTrainedModel to modify.
config (Optional[WithIA3Config], default: None) – A WithIA3Config that specifies the layers to modify.
only_ia3_requires_grad (bool, default: True) – A bool, True if requires_grad should only be set on IA3 parameters in the output model.
- Return type:
PreTrainedModel
Examples
You can use the provided configurations:
from transformers import AutoModelForCausalLM, AutoTokenizer

from tango.integrations.transformers.ia3 import modify_with_ia3, GPT_2_IA3_CONFIG

model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")
model = modify_with_ia3(model, config=GPT_2_IA3_CONFIG)
Or you can write your own configuration with regex matching the layers to modify and their parents:
from transformers import AutoModelForCausalLM, AutoTokenizer

from tango.integrations.transformers.ia3 import WithIA3Config, modify_with_ia3

my_config = WithIA3Config(
    attention_modules=".*attn",
    fused_qkv_layers="c_attn",
    mlp_modules=".*mlp",
    mlp_layers="c_fc",
    ia3_param_names="ia3",
)

model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")
model = modify_with_ia3(model, config=my_config)