🤗 Datasets#
Important
To use this integration you should install tango with the "datasets" extra
(e.g. pip install tango[datasets])
or just install the datasets library after the fact
(e.g. pip install datasets).
Components for Tango integration with 🤗 Datasets.
Example: loading and combining#
Here's an example config that uses the built-in steps from this integration to load, concatenate, and interleave datasets from HuggingFace:
{
  "steps": {
    "train_data": {
      "type": "datasets::load",
      "path": "lhoestq/test",
      "split": "train"
    },
    "dev_data": {
      "type": "datasets::load",
      "path": "lhoestq/test",
      "split": "validation"
    },
    "all_data": {
      "type": "datasets::concatenate",
      "datasets": [
        {
          "type": "ref",
          "ref": "train_data"
        },
        {
          "type": "ref",
          "ref": "dev_data"
        }
      ]
    },
    "mixed_data": {
      "type": "datasets::interleave",
      "datasets": [
        {
          "type": "ref",
          "ref": "train_data"
        },
        {
          "type": "ref",
          "ref": "dev_data"
        }
      ],
      "probabilities": [0.9, 0.1]
    }
  }
}
You could run this with:
tango run config.json
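For reference, roughly the same pipeline can be sketched directly in Python by calling the steps' run() methods; this is only an illustrative equivalent of the config above, and running the config with tango run is the usual workflow.

from tango.integrations.datasets import (
    ConcatenateDatasets,
    InterleaveDatasets,
    LoadDataset,
)

# "datasets::load" from the config above
train_data = LoadDataset().run(path="lhoestq/test", split="train")
dev_data = LoadDataset().run(path="lhoestq/test", split="validation")

# "datasets::concatenate" and "datasets::interleave" from the config above
all_data = ConcatenateDatasets().run(datasets=[train_data, dev_data])
mixed_data = InterleaveDatasets().run(
    datasets=[train_data, dev_data],
    probabilities=[0.9, 0.1],
)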
Reference#
- tango.integrations.datasets.convert_to_tango_dataset_dict(hf_dataset_dict)[source]#
A helper function that can be used to convert a HuggingFace DatasetDict or IterableDatasetDict into a native Tango DatasetDict or IterableDatasetDict.
This is important to do when your dataset dict is the input to another step, for caching reasons.
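A minimal sketch of how this helper might be used, reusing the dataset path from the example above:

import datasets

from tango.integrations.datasets import convert_to_tango_dataset_dict

hf_dataset_dict = datasets.load_dataset("lhoestq/test")
# Convert to Tango's native DatasetDict so downstream steps can hash/cache it.
tango_dataset_dict = convert_to_tango_dataset_dict(hf_dataset_dict)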
- class tango.integrations.datasets.DatasetsFormat(*args, **kwds)[source]#
This format writes a datasets.Dataset or datasets.DatasetDict to disk using datasets.Dataset.save_to_disk().
It is the default Format for the LoadDataset step.
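As a sketch, a custom step could opt into this format by setting its FORMAT attribute. The step name and body below are illustrative, not part of the integration:

import datasets

from tango import Step
from tango.integrations.datasets import DatasetsFormat


@Step.register("filter_empty")  # hypothetical step name
class FilterEmpty(Step):
    FORMAT = DatasetsFormat()  # cached results are written with Dataset.save_to_disk()

    def run(self, dataset: datasets.Dataset) -> datasets.Dataset:
        # Illustrative transformation; assumes the dataset has a "text" column.
        return dataset.filter(lambda example: len(example["text"]) > 0)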
- class tango.integrations.datasets.LoadDataset(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#
This step loads a HuggingFace dataset. A short usage sketch follows this entry.
Tip
Registered as a Step under the name "datasets::load".
Important
If you are loading an IterableDataset or IterableDatasetDict you need to use the LoadStreamingDataset step instead.
- run(path, **kwargs)[source]#
Load the HuggingFace dataset specified by path. path is the canonical name or path to the dataset. Additional keyword arguments are passed as-is to datasets.load_dataset().
- Return type:
Union[DatasetDict, Dataset]
- CACHEABLE: Optional[bool] = True#
This provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn't need to be cached, because HuggingFace datasets already have their own caching mechanism. But it's still a deterministic step, and all following steps are allowed to cache. If it is None, the step figures out by itself whether it should be cacheable or not.
- DETERMINISTIC: bool = True#
This describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is False, you can still cache the output of the step, but the results might be unexpected. Tango will print a warning in this case.
- FORMAT: Format = <tango.integrations.datasets.DatasetsFormat object>#
This specifies the format the results of this step will be serialized in. See the documentation for Format for details.
- VERSION: Optional[str] = '001'#
This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn't invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.
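A minimal sketch of calling this step directly in Python; in a pipeline you would normally declare it in a config as "datasets::load", as in the example at the top of this page:

from tango.integrations.datasets import LoadDataset

step = LoadDataset()
dataset_dict = step.run(path="lhoestq/test")  # a DatasetDict with all splits
train_only = step.run(path="lhoestq/test", split="train")  # extra kwargs go to load_dataset()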
- class tango.integrations.datasets.LoadStreamingDataset(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#
This step loads an iterable/streaming HuggingFace dataset. A short usage sketch follows this entry.
Tip
Registered as a Step under the name "datasets::load_streaming".
- run(path, **kwargs)[source]#
Load the HuggingFace streaming dataset specified by path. path is the canonical name or path to the dataset. Additional keyword arguments are passed as-is to datasets.load_dataset().
- Return type:
Union[IterableDatasetDict, IterableDataset]
- CACHEABLE: Optional[bool] = False#
This provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn't need to be cached, because HuggingFace datasets already have their own caching mechanism. But it's still a deterministic step, and all following steps are allowed to cache. If it is None, the step figures out by itself whether it should be cacheable or not.
- DETERMINISTIC: bool = True#
This describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is False, you can still cache the output of the step, but the results might be unexpected. Tango will print a warning in this case.
- VERSION: Optional[str] = '001'#
This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn't invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.
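A minimal sketch, passing streaming=True as an extra keyword argument so that datasets.load_dataset() returns an iterable dataset (an assumption about how you would typically invoke it, since extra kwargs are forwarded as-is):

from tango.integrations.datasets import LoadStreamingDataset

step = LoadStreamingDataset()
# streaming=True is forwarded to datasets.load_dataset(), yielding an IterableDatasetDict.
streaming_data = step.run(path="lhoestq/test", streaming=True)
first_example = next(iter(streaming_data["train"]))  # examples are streamed lazily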
- class tango.integrations.datasets.InterleaveDatasets(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#
This step interleaves multiple datasets using interleave_datasets(). A short usage sketch follows this entry.
Tip
Registered as a Step under the name "datasets::interleave".
- run(datasets, probabilities=None, seed=None)[source]#
Interleave the list of datasets.
- Return type:
TypeVar(DatasetType, Dataset, IterableDataset)
- CACHEABLE: Optional[bool] = False#
This provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn't need to be cached, because HuggingFace datasets already have their own caching mechanism. But it's still a deterministic step, and all following steps are allowed to cache. If it is None, the step figures out by itself whether it should be cacheable or not.
- DETERMINISTIC: bool = True#
This describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is False, you can still cache the output of the step, but the results might be unexpected. Tango will print a warning in this case.
- VERSION: Optional[str] = '001'#
This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn't invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.
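A sketch of interleaving two splits loaded outside of a config, with a fixed seed for reproducible sampling (the seed value is arbitrary):

import datasets

from tango.integrations.datasets import InterleaveDatasets

splits = datasets.load_dataset("lhoestq/test")
mixed = InterleaveDatasets().run(
    datasets=[splits["train"], splits["validation"]],
    probabilities=[0.9, 0.1],  # sample from "train" about 90% of the time
    seed=42,
)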
- class tango.integrations.datasets.ConcatenateDatasets(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#
This step concatenates multiple datasets using concatenate_datasets(). A short usage sketch follows this entry.
Tip
Registered as a Step under the name "datasets::concatenate".
- run(datasets, info=None, split=None, axis=0)[source]#
Concatenate the list of datasets.
- Return type:
Dataset
- CACHEABLE: Optional[bool] = False#
This provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn't need to be cached, because HuggingFace datasets already have their own caching mechanism. But it's still a deterministic step, and all following steps are allowed to cache. If it is None, the step figures out by itself whether it should be cacheable or not.
- DETERMINISTIC: bool = True#
This describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is False, you can still cache the output of the step, but the results might be unexpected. Tango will print a warning in this case.
- VERSION: Optional[str] = '001'#
This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn't invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.
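A sketch of concatenating two splits row-wise (axis=0, the default) into a single Dataset:

import datasets

from tango.integrations.datasets import ConcatenateDatasets

splits = datasets.load_dataset("lhoestq/test")
combined = ConcatenateDatasets().run(
    datasets=[splits["train"], splits["validation"]],
)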
- class tango.integrations.datasets.DatasetRemixStep(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#
This step can remix splits in a DatasetDict into new splits. A further usage sketch follows this entry.
Tip
Registered as a Step under the name "datasets::dataset_remix".
Examples
input = datasets.load_dataset("lhoestq/test")
new_splits = {
    "all": "train + validation",
    "crossval_train": "train[:1] + validation[1:]",
    "crossval_test": "train[1:] + validation[:1]",
}
step = DatasetRemixStep()
remixed_dataset = step.run(input=input, new_splits=new_splits)
- run(input, new_splits, keep_old_splits=True, shuffle_before=False, shuffle_after=False, random_seed=1532637578)[source]#
Remixes and shuffles a dataset. This is done eagerly with native 🤗 Datasets features.
- Parameters:
input (DatasetDict) – The input dataset that will be remixed.
new_splits (Dict[str, str]) – Specifies the new splits that the output dataset should have. Keys are the names of the new splits. Values refer to the original splits. You can refer to original splits in the following ways:
Mention the original split name to copy it to a new name.
Mention the original split name with Python's slicing syntax to select part of the original split's instances. For example, "train[:1000]" selects the first 1000 instances from the "train" split.
"instances + instances" concatenates the instances into one split.
You can combine these possibilities.
keep_old_splits (bool, default: True) – Whether to keep the splits from the input dataset in addition to the new ones given by new_splits.
shuffle_before (bool, default: False) – Whether to shuffle the input splits before creating the new ones. If you need shuffled instances and you're not sure the input is properly shuffled, use this.
shuffle_after (bool, default: False) – Whether to shuffle the input splits after creating the new ones. If you need shuffled instances and you're slicing or concatenating splits, use this. If you want to be on the safe side, shuffle both before and after.
random_seed (int, default: 1532637578) – Random seed; affects shuffling.
- Return type:
DatasetDict
- Returns:
Returns a new dataset that is appropriately remixed.
- CACHEABLE: Optional[bool] = True#
This provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn't need to be cached, because HuggingFace datasets already have their own caching mechanism. But it's still a deterministic step, and all following steps are allowed to cache. If it is None, the step figures out by itself whether it should be cacheable or not.
- DETERMINISTIC: bool = True#
This describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is False, you can still cache the output of the step, but the results might be unexpected. Tango will print a warning in this case.
- VERSION: Optional[str] = '001'#
This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn't invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.
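Building on the example above, a sketch that drops the original splits and shuffles after slicing, as the parameter documentation recommends (the seed value is arbitrary):

import datasets

from tango.integrations.datasets import DatasetRemixStep

input = datasets.load_dataset("lhoestq/test")
remixed = DatasetRemixStep().run(
    input=input,
    new_splits={
        "crossval_train": "train[:1] + validation[1:]",
        "crossval_test": "train[1:] + validation[:1]",
    },
    keep_old_splits=False,  # keep only the new cross-validation splits
    shuffle_after=True,     # recommended when slicing/concatenating splits
    random_seed=42,
)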