Format

Base class

class tango.format.Format

Formats write objects to directories and read them back out.

In the context of Tango, the objects that are written by formats are usually the result of a Step.

Returns a dictionary of parameters that, when turned into a Params object and then fed to .from_params(), will recreate this object.

You don’t need to implement this all the time. Tango will let you know if you need it.

Return type: Dict[str, Any]

abstract read(dir)

Reads an artifact from the directory at dir and returns it.

Return type: TypeVar(T)

abstract write(artifact, dir)

Writes the artifact to the directory at dir.

VERSION: str = NotImplemented

Formats can have versions. The version is part of a step’s unique signature (its unique_id), so when a step’s format changes, the step is recomputed.
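To make the read/write contract concrete, here is a minimal sketch of the interface described above. It does not import Tango: SketchFormat and SketchJsonFormat are hypothetical stand-ins, and the file name data.json is an assumption made for illustration, not necessarily what any Tango format uses.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Generic, TypeVar

T = TypeVar("T")


class SketchFormat(ABC, Generic[T]):
    """Hypothetical stand-in for the Format interface (not Tango's class)."""

    VERSION: str = NotImplemented

    @abstractmethod
    def write(self, artifact: T, dir: Path) -> None:
        """Write the artifact into the directory at dir."""

    @abstractmethod
    def read(self, dir: Path) -> T:
        """Read the artifact back from the directory at dir."""


class SketchJsonFormat(SketchFormat[Any]):
    """Stores the artifact as one JSON file; the file name is an assumption."""

    VERSION = "001"

    def write(self, artifact: Any, dir: Path) -> None:
        with open(Path(dir) / "data.json", "w", encoding="utf-8") as f:
            json.dump(artifact, f)

    def read(self, dir: Path) -> Any:
        with open(Path(dir) / "data.json", "r", encoding="utf-8") as f:
            return json.load(f)


def round_trip(artifact: Any) -> Any:
    """Write an artifact to a temporary directory and read it back."""
    fmt = SketchJsonFormat()
    with tempfile.TemporaryDirectory() as tmp:
        fmt.write(artifact, Path(tmp))
        return fmt.read(Path(tmp))
```

Calling round_trip({"a": [1, 2]}) should return an equal dictionary. A real subclass would follow the same shape: set VERSION and implement read and write.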

Implementations

class tango.format.DillFormat(compress=None)

This format writes the artifact as a single file called “data.dill” using dill (a drop-in replacement for pickle). Optionally, it can compress the data.

This is very flexible, but not always the fastest.

Tip

This format has special support for iterables. If you write an iterator, it will consume the iterator. If you read an iterator, it will read the iterator lazily.

VERSION: str = '001'

Formats can have versions. The version is part of a step’s unique signature (its unique_id), so when a step’s format changes, the step is recomputed.

class tango.format.DillFormatIterator(filename)

An iterator class used to return an iterator from tango.format.DillFormat.read().
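The iterator behavior described in the tip can be sketched as follows. This is an illustration only: it uses the standard-library pickle module in place of dill, and write_items, read_items, and the file name data.pkl are hypothetical names, not part of Tango.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Iterable, Iterator


def write_items(items: Iterable[Any], filename: Path) -> int:
    """Consume the iterable, pickling one item after another into one file."""
    count = 0
    with open(filename, "wb") as f:
        for item in items:
            pickle.dump(item, f)
            count += 1
    return count


def read_items(filename: Path) -> Iterator[Any]:
    """Yield items one at a time; the file is not opened until iteration starts."""
    with open(filename, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "data.pkl"
        source = (n * n for n in range(4))
        write_items(source, path)
        # Writing consumed the generator, so nothing is left in it.
        assert list(source) == []
        lazy = read_items(path)
        first = next(lazy)  # only the first item has been unpickled here
        rest = list(lazy)
        return [first] + rest
```

The practical consequence is the one the tip states: an iterator passed to the writer cannot be reused afterwards, while the reader only loads items as they are requested.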

class tango.format.JsonFormat(compress=None)

This format writes the artifact as a single file in JSON format. Optionally, it can compress the data. This is very flexible, but not always the fastest.

Tip

This format has special support for iterables. If you write an iterator, it will consume the iterator. If you read an iterator, it will read the iterator lazily.

VERSION: str = '002'

Formats can have versions. The version is part of a step’s unique signature (its unique_id), so when a step’s format changes, the step is recomputed.

class tango.format.JsonFormatIterator(filename)

An iterator class used to return an iterator from tango.format.JsonFormat.read().
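As a rough illustration of optional compression combined with lazy reading, the sketch below stores one JSON document per line and optionally wraps the file in gzip. The function names, the file names, and the boolean compress flag are assumptions made for this example; they do not describe the values that JsonFormat's compress parameter accepts or the layout it writes on disk.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Any, Iterable, Iterator


def write_json_lines(items: Iterable[Any], filename: Path, compress: bool = False) -> None:
    """Write one JSON document per line, optionally gzip-compressed."""
    opener = gzip.open if compress else open
    with opener(filename, "wt", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")


def read_json_lines(filename: Path, compress: bool = False) -> Iterator[Any]:
    """Lazily decode one JSON document per line."""
    opener = gzip.open if compress else open
    with opener(filename, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def demo_json(compress: bool) -> list:
    records = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / ("data.jsonl.gz" if compress else "data.jsonl")
        write_json_lines(iter(records), path, compress=compress)
        return list(read_json_lines(path, compress=compress))
```

Both demo_json(True) and demo_json(False) return the same records; compression changes only the bytes on disk.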

class tango.format.SqliteDictFormat

This format works specifically on results of type DatasetDict. It writes those datasets into SQLite databases.

During reading, the advantage is that the dataset can be read lazily. Reading a result that is stored in SqliteDictFormat takes milliseconds. No actual reading takes place until you access individual instances.

During writing, you have to take some care to take advantage of the same trick. Recall that DatasetDict is basically a map, mapping split names to lists of instances. If you ensure that those lists of instances are of type SqliteSparseSequence, then writing the results in SqliteDictFormat can in many cases be instantaneous.

Here is an example of the pattern to use to make writing fast:

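from typing import Dict, Sequence

from tango import Step
from tango.common import DatasetDict
from tango.common.sequences import SqliteSparseSequence
from tango.format import Format, SqliteDictFormat


@Step.register("my_step")
class MyStep(Step[DatasetDict]):

    FORMAT: Format = SqliteDictFormat()
    VERSION = "001"

    def run(self, splits: Dict[str, Sequence]) -> DatasetDict:  # type: ignore[override]
        # `splits` is an illustrative parameter mapping split names to input instances.
        result: Dict[str, Sequence] = {}
        for split_name, instances in splits.items():
            # Each split gets its own SQLite file in the step's work directory.
            output_split = SqliteSparseSequence(self.work_dir / f"{split_name}.sqlite")
            for instance in instances:
                output_split.append(instance)
            result[split_name] = output_split

        metadata = {}
        return DatasetDict(result, metadata)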

Observe how, for each split, we create a SqliteSparseSequence in the step’s work directory (accessible as self.work_dir). This has the added advantage that if the step fails and you have to re-run it, the results that were already written to the SqliteSparseSequence are still there. To take advantage of this, you could replace the inner for loop like this:

output_split = SqliteSparseSequence(self.work_dir / f"{split_name}.sqlite")
for instance in instances[len(output_split):]:      # <-- here is the difference
    output_split.append(instance)
result[split_name] = output_split

This works because when you re-run the step, the work directory will still be there, so output_split is not empty when you open it.
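The resume pattern can be tried out without Tango. The sketch below uses the standard-library sqlite3 module; SketchSqliteSequence is a hypothetical, much simplified stand-in for SqliteSparseSequence, and process simulates a step that fails partway through and is then re-run.

```python
import pickle
import sqlite3
import tempfile
from pathlib import Path
from typing import Any, List


class SketchSqliteSequence:
    """Hypothetical, minimal append-only sequence stored in a SQLite file."""

    def __init__(self, filename: Path) -> None:
        self.conn = sqlite3.connect(str(filename))
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (idx INTEGER PRIMARY KEY, value BLOB)"
        )
        self.conn.commit()

    def __len__(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]

    def append(self, item: Any) -> None:
        # Commit per item so that a crash loses nothing already appended.
        self.conn.execute(
            "INSERT INTO items (idx, value) VALUES (?, ?)",
            (len(self), pickle.dumps(item)),
        )
        self.conn.commit()

    def __getitem__(self, index: int) -> Any:
        row = self.conn.execute(
            "SELECT value FROM items WHERE idx = ?", (index,)
        ).fetchone()
        if row is None:
            raise IndexError(index)
        return pickle.loads(row[0])

    def close(self) -> None:
        self.conn.close()


def process(instances: List[str], filename: Path, fail_after: int = -1) -> int:
    """Append unprocessed instances and return how many were handled in this run."""
    output_split = SketchSqliteSequence(filename)
    done = 0
    try:
        for instance in instances[len(output_split):]:  # skip finished work
            if done == fail_after:
                raise RuntimeError("simulated failure")
            output_split.append(instance.upper())
            done += 1
        return done
    finally:
        output_split.close()


def demo_resume() -> tuple:
    instances = ["a", "b", "c", "d", "e"]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "train.sqlite"
        try:
            process(instances, path, fail_after=2)  # fails after two items
        except RuntimeError:
            pass
        redone = process(instances, path)  # second run handles only the rest
        seq = SketchSqliteSequence(path)
        values = [seq[i] for i in range(len(seq))]
        seq.close()
        return redone, values
```

In this sketch the second run processes three instances instead of five, because the two items written before the simulated failure are still in the file when it is reopened.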

VERSION: str = '003'

Formats can have versions. The version is part of a step’s unique signature (its unique_id), so when a step’s format changes, the step is recomputed.

class tango.format.SqliteSequenceFormat

VERSION: str = '003'

Formats can have versions. The version is part of a step’s unique signature (its unique_id), so when a step’s format changes, the step is recomputed.

class tango.format.TextFormat(compress=None)

This format writes the artifact as a single file in text format. Optionally, it can compress the data. This is very flexible, but not always the fastest.

This format can only write strings or iterables of strings.

Tip

This format has special support for iterables. If you write an iterator, it will consume the iterator. If you read an iterator, it will read the iterator lazily.

Be aware that if your strings contain newlines, you will read out more strings than you wrote. For this reason, it’s often advisable to use JsonFormat instead. With JsonFormat, all special characters are escaped, strings are quoted, but it’s all still human-readable.
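The newline caveat is easy to reproduce with plain Python. The sketch below does not use Tango; it mimics a line-per-string text file and a line-per-document JSON file in memory, and the helper names are hypothetical.

```python
import json
from typing import List


def text_round_trip(strings: List[str]) -> List[str]:
    """Write each string on its own line, then read the text back line by line."""
    written = "".join(s + "\n" for s in strings)
    return written.splitlines()


def json_round_trip(strings: List[str]) -> List[str]:
    """Write each string as one JSON document per line, then decode each line."""
    written = "".join(json.dumps(s) + "\n" for s in strings)
    return [json.loads(line) for line in written.splitlines()]


strings = ["one", "two\nthree"]
as_text = text_round_trip(strings)  # the embedded newline splits the second string
as_json = json_round_trip(strings)  # the newline is escaped, so the string survives
```

Here as_text contains three strings although only two were written, while as_json equals the original list.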

VERSION: str = '001'

Formats can have versions. The version is part of a step’s unique signature (its unique_id), so when a step’s format changes, the step is recomputed.

class tango.format.TextFormatIterator(filename)

An iterator class used to return an iterator from tango.format.TextFormat.read().