🤗 Datasets#

Important

To use this integration, install Tango with the “datasets” extra (e.g. pip install ai2-tango[datasets]), or install the datasets library separately afterwards (e.g. pip install datasets).

Components for Tango integration with 🤗 Datasets.

Example: loading and combining#

Here’s an example config that uses the built-in steps from this integration to load, concatenate, and interleave datasets from HuggingFace:

{
    "steps": {
        "train_data": {
            "type": "datasets::load",
            "path": "lhoestq/test",
            "split": "train"
        },
        "dev_data": {
            "type": "datasets::load",
            "path": "lhoestq/test",
            "split": "validation"
        },
        "all_data": {
            "type": "datasets::concatenate",
            "datasets": [
                {
                    "type": "ref",
                    "ref": "train_data"
                },
                {
                    "type": "ref",
                    "ref": "dev_data"
                }
            ]
        },
        "mixed_data": {
            "type": "datasets::interleave",
            "datasets": [
                {
                    "type": "ref",
                    "ref": "train_data"
                },
                {
                    "type": "ref",
                    "ref": "dev_data"
                }
            ],
            "probabilities": [0.9, 0.1]
        }
    }
}

Assuming the config is saved as config.json, you can run it with:

tango run config.json
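Before running a config like this, it can help to check that every "ref" names a step that is actually defined under "steps". The following is a minimal sketch using only the standard library. The helpers find_refs and undefined_refs are hypothetical names introduced for illustration and are not part of Tango, and the inline config is a shortened copy of the example above.

```python
import json

# Shortened copy of the example config; only the structure matters here.
CONFIG = """
{
    "steps": {
        "train_data": {"type": "datasets::load", "path": "lhoestq/test", "split": "train"},
        "dev_data": {"type": "datasets::load", "path": "lhoestq/test", "split": "validation"},
        "all_data": {
            "type": "datasets::concatenate",
            "datasets": [
                {"type": "ref", "ref": "train_data"},
                {"type": "ref", "ref": "dev_data"}
            ]
        }
    }
}
"""


def find_refs(node):
    """Yield the target of every {"type": "ref", "ref": ...} object under node."""
    if isinstance(node, dict):
        if node.get("type") == "ref" and "ref" in node:
            yield node["ref"]
        for value in node.values():
            yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)


def undefined_refs(config):
    """Return a sorted list of referenced step names missing from "steps"."""
    steps = config["steps"]
    return sorted({ref for ref in find_refs(steps) if ref not in steps})


config = json.loads(CONFIG)
print(undefined_refs(config))  # an empty list means every ref resolves
```

This check is purely structural: it does not verify that the step types exist or that the arguments are valid for them.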

Reference#

tango.integrations.datasets.convert_to_tango_dataset_dict(hf_dataset_dict)[source]#

A helper function that can be used to convert a HuggingFace DatasetDict or IterableDatasetDict into a native Tango DatasetDict or IterableDatasetDict.

For caching reasons, it is important to do this whenever your dataset dict is passed as input to another step.

class tango.integrations.datasets.DatasetsFormat[source]#

This format writes a datasets.Dataset or datasets.DatasetDict to disk using datasets.Dataset.save_to_disk().

It is the default Format for the LoadDataset step.

class tango.integrations.datasets.LoadDataset(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#

This step loads a HuggingFace dataset.

Tip

Registered as a Step under the name “datasets::load”.

Important

If you are loading an IterableDataset or IterableDatasetDict, use the LoadStreamingDataset step instead.

run(path, **kwargs)[source]#

Load the HuggingFace dataset specified by path.

path is the canonical name or path to the dataset. Additional keyword arguments are passed as-is to datasets.load_dataset().

Return type: Union[DatasetDict, Dataset]

CACHEABLE: Optional[bool] = True#: This provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn’t need to be cached, because HuggingFace datasets already have their own caching mechanism. But it’s still a deterministic step, and all following steps are allowed to cache. If it is None, the step figures out by itself whether it should be cacheable or not.

DETERMINISTIC: bool = True#: This describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is False, you can still cache the output of the step, but the results might be unexpected. Tango will print a warning in this case.

FORMAT: Format = <tango.integrations.datasets.DatasetsFormat object>#: This specifies the format the results of this step will be serialized in. See the documentation for Format for details.

VERSION: Optional[str] = '001'#: This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn’t invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.
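The interplay between VERSION and caching described above can be pictured with a toy cache. This is only an illustrative model under stated assumptions: ToyCache and cache_key are hypothetical names, and the way the key is derived here (hashing the step name, version, and inputs) is an assumption made for the sketch, not a description of how Tango computes step identifiers.

```python
import hashlib
import json


def cache_key(step_name, version, inputs):
    """Derive a key from the step name, its version, and its inputs (toy model)."""
    payload = json.dumps([step_name, version, inputs], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


class ToyCache:
    """A dictionary-backed cache that never evicts old entries."""

    def __init__(self):
        self.entries = {}
        self.computations = 0

    def get_or_compute(self, step_name, version, inputs, compute):
        key = cache_key(step_name, version, inputs)
        if key not in self.entries:
            # Cache miss: run the computation and remember the result.
            self.computations += 1
            self.entries[key] = compute(**inputs)
        return self.entries[key]


cache = ToyCache()
inputs = {"path": "lhoestq/test"}

# Version "001" is computed once, then served from the cache.
cache.get_or_compute("load", "001", inputs, lambda path: f"v1:{path}")
cache.get_or_compute("load", "001", inputs, lambda path: f"v1:{path}")

# Bumping the version forces a recomputation under a new key.
cache.get_or_compute("load", "002", inputs, lambda path: f"v2:{path}")

# Reverting to "001" finds the old entry again without recomputing.
result = cache.get_or_compute("load", "001", inputs, lambda path: f"v1:{path}")
print(result, cache.computations)
```

In this model, changing the version never deletes anything; it only changes which entry is looked up, which is why reverting the version picks up the old result.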

class tango.integrations.datasets.LoadStreamingDataset(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#

This step loads an iterable/streaming HuggingFace dataset.

Tip

Registered as a Step under the name “datasets::load_streaming”.

run(path, **kwargs)[source]#

Load the HuggingFace streaming dataset specified by path.

path is the canonical name or path to the dataset. Additional keyword arguments are passed as-is to datasets.load_dataset().

Return type: Union[IterableDatasetDict, IterableDataset]

CACHEABLE: Optional[bool] = False#: This provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn’t need to be cached, because HuggingFace datasets already have their own caching mechanism. But it’s still a deterministic step, and all following steps are allowed to cache. If it is None, the step figures out by itself whether it should be cacheable or not.

DETERMINISTIC: bool = True#: This describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is False, you can still cache the output of the step, but the results might be unexpected. Tango will print a warning in this case.

VERSION: Optional[str] = '001'#: This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn’t invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.

class tango.integrations.datasets.InterleaveDatasets(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#

This step interleaves multiple datasets using interleave_datasets().

Tip

Registered as a Step under the name “datasets::interleave”.

run(datasets, probabilities=None, seed=None)[source]#

Interleave the list of datasets.

Return type: TypeVar(DatasetType, Dataset, IterableDataset)
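To illustrate what the probabilities and seed arguments control, here is a simplified pure-Python sketch of interleaving over plain lists. It is not the 🤗 Datasets implementation: interleave_sketch is a hypothetical helper, and its stopping rule (stop as soon as the selected source has no examples left) is an assumption chosen for simplicity.

```python
import random


def interleave_sketch(sources, probabilities=None, seed=None):
    """Interleave lists by repeatedly choosing which source to draw from.

    Without probabilities, sources are visited round-robin. With probabilities,
    the next source is sampled with a seeded random generator. The loop stops
    as soon as the chosen source is exhausted (a simplifying assumption).
    """
    if not sources:
        return []
    rng = random.Random(seed)
    iterators = [iter(source) for source in sources]
    indices = list(range(len(sources)))
    output = []
    step = 0
    while True:
        if probabilities is None:
            chosen = step % len(sources)
        else:
            chosen = rng.choices(indices, weights=probabilities, k=1)[0]
        try:
            output.append(next(iterators[chosen]))
        except StopIteration:
            return output
        step += 1


train = [f"train-{i}" for i in range(20)]
dev = [f"dev-{i}" for i in range(5)]

# Roughly nine out of ten draws come from train; the seed makes it repeatable.
mixed = interleave_sketch([train, dev], probabilities=[0.9, 0.1], seed=42)
print(mixed[:10])
```

Within each source the original order is preserved; only the choice of which source to draw from next is randomized.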

CACHEABLE: Optional[bool] = False#: This provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn’t need to be cached, because HuggingFace datasets already have their own caching mechanism. But it’s still a deterministic step, and all following steps are allowed to cache. If it is None, the step figures out by itself whether it should be cacheable or not.

DETERMINISTIC: bool = True#: This describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is False, you can still cache the output of the step, but the results might be unexpected. Tango will print a warning in this case.

VERSION: Optional[str] = '001'#: This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn’t invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.

class tango.integrations.datasets.ConcatenateDatasets(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#

This step concatenates multiple datasets using concatenate_datasets().

Tip

Registered as a Step under the name “datasets::concatenate”.

run(datasets, info=None, split=None, axis=0)[source]#

Concatenate the list of datasets.

Return type: Dataset
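The axis argument selects the direction of concatenation. As a rough mental model (not the library's implementation), treat each dataset as a dictionary of equal-length columns: axis=0 appends rows, while axis=1 places columns side by side. The helper concat_sketch below is hypothetical, and its validation rules are assumptions made for the sketch.

```python
def concat_sketch(tables, axis=0):
    """Concatenate column-oriented tables (dicts mapping column name to list)."""
    if not tables:
        raise ValueError("need at least one table")
    if axis == 0:
        # Vertical: every table must have the same columns; rows are appended.
        columns = list(tables[0])
        for table in tables:
            if set(table) != set(columns):
                raise ValueError("axis=0 requires identical column names")
        return {name: [v for table in tables for v in table[name]] for name in columns}
    if axis == 1:
        # Horizontal: tables need equal row counts and distinct column names.
        result = {}
        n_rows = None
        for table in tables:
            for name, values in table.items():
                if name in result:
                    raise ValueError(f"duplicate column: {name}")
                if n_rows is None:
                    n_rows = len(values)
                elif len(values) != n_rows:
                    raise ValueError("axis=1 requires equal row counts")
                result[name] = list(values)
        return result
    raise ValueError("axis must be 0 or 1")


train = {"text": ["a", "b"], "label": [0, 1]}
dev = {"text": ["c"], "label": [1]}
scores = {"score": [0.5, 0.7]}

print(concat_sketch([train, dev], axis=0))     # rows of dev follow rows of train
print(concat_sketch([train, scores], axis=1))  # score becomes a third column
```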

CACHEABLE: Optional[bool] = False#: This provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn’t need to be cached, because HuggingFace datasets already have their own caching mechanism. But it’s still a deterministic step, and all following steps are allowed to cache. If it is None, the step figures out by itself whether it should be cacheable or not.

DETERMINISTIC: bool = True#: This describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is False, you can still cache the output of the step, but the results might be unexpected. Tango will print a warning in this case.

VERSION: Optional[str] = '001'#: This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn’t invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.

class tango.integrations.datasets.DatasetRemixStep(step_name=None, cache_results=None, step_format=None, step_config=None, step_unique_id_override=None, step_resources=None, step_metadata=None, step_extra_dependencies=None, **kwargs)[source]#

This step can remix splits in a DatasetDict into new splits.

Tip

Registered as a Step under the name “datasets::dataset_remix”.

Examples

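import datasets
from tango.integrations.datasets import DatasetRemixStep

# Load a DatasetDict with "train" and "validation" splits.
input = datasets.load_dataset("lhoestq/test")
# Each value is an expression over the original split names.
new_splits = {
    "all": "train + validation",
    "crossval_train": "train[:1] + validation[1:]",
    "crossval_test": "train[1:] + validation[:1]",
}
step = DatasetRemixStep()
remixed_dataset = step.run(input=input, new_splits=new_splits)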

run(input, new_splits, keep_old_splits=True, shuffle_before=False, shuffle_after=False, random_seed=1532637578)[source]#

Remixes and shuffles a dataset. This is done eagerly with native 🤗 Datasets features.

Parameters:

input (DatasetDict) – The input dataset that will be remixed.
new_splits (Dict[str, str]) –
Specifies the new splits that the output dataset should have. Keys are the names of the new splits. Values refer to the original splits. You can refer to original splits in the following ways:
- Mention the original split name to copy it to a new name.
- Mention the original split name with Python’s slicing syntax to select part of the original split’s instances. For example, "train[:1000]" selects the first 1000 instances from the "train" split.
- "instances + instances" concatenates the instances into one split.
You can combine these possibilities.
keep_old_splits (bool, default: True) – Whether to keep the splits from the input dataset in addition to the new ones given by new_splits.
shuffle_before (bool, default: False) –
Whether to shuffle the input splits before creating the new ones.

If you need shuffled instances and you’re not sure the input is properly shuffled, use this.
shuffle_after (bool, default: False) –
Whether to shuffle the resulting splits after creating the new ones.

If you need shuffled instances and you’re slicing or concatenating splits, use this.

If you want to be on the safe side, shuffle both before and after.
random_seed (int, default: 1532637578) – Random seed used for shuffling.

Return type:

DatasetDict

Returns:

A new DatasetDict containing the remixed splits.
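The split expressions accepted by new_splits can be understood through a small sketch that evaluates them over plain Python lists. This is an approximation for illustration only, not the step's implementation: resolve_split and remix_sketch are hypothetical helpers, shuffling is left out, and the parser handles only split names, optional [start:stop] slices, and "+" concatenation.

```python
import re

# One term: a split name, optionally followed by a [start:stop] slice.
TERM = re.compile(r"^\s*(\w+)\s*(?:\[\s*(-?\d+)?\s*:\s*(-?\d+)?\s*\])?\s*$")


def resolve_split(expression, splits):
    """Evaluate an expression such as "train[:1] + validation[1:]" over lists."""
    result = []
    for term in expression.split("+"):
        match = TERM.match(term)
        if match is None:
            raise ValueError(f"cannot parse term: {term!r}")
        name, start, stop = match.groups()
        if name not in splits:
            raise KeyError(f"unknown split: {name}")
        # Absent bounds become None, matching Python's slice defaults.
        bounds = slice(int(start) if start else None, int(stop) if stop else None)
        result.extend(splits[name][bounds])
    return result


def remix_sketch(splits, new_splits, keep_old_splits=True):
    """Build new splits from expressions, optionally keeping the originals."""
    output = dict(splits) if keep_old_splits else {}
    for name, expression in new_splits.items():
        output[name] = resolve_split(expression, splits)
    return output


splits = {"train": ["t0", "t1", "t2"], "validation": ["v0", "v1", "v2"]}
new_splits = {
    "all": "train + validation",
    "crossval_train": "train[:1] + validation[1:]",
    "crossval_test": "train[1:] + validation[:1]",
}

remixed = remix_sketch(splits, new_splits)
print(sorted(remixed))
print(remixed["crossval_train"])
```

With keep_old_splits=True the result contains both the original splits and the new ones; passing keep_old_splits=False in this sketch returns only the splits named in new_splits.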

CACHEABLE: Optional[bool] = True#: This provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn’t need to be cached, because HuggingFace datasets already have their own caching mechanism. But it’s still a deterministic step, and all following steps are allowed to cache. If it is None, the step figures out by itself whether it should be cacheable or not.

DETERMINISTIC: bool = True#: This describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is False, you can still cache the output of the step, but the results might be unexpected. Tango will print a warning in this case.

VERSION: Optional[str] = '001'#: This is optional, but recommended. Specifying a version gives you a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn’t invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.