Format
Base class
- class tango.format.Format(*args, **kwds)
Formats write objects to directories and read them back out.
In the context of Tango, the objects that are written by formats are usually the result of a Step.
- _to_params()
Returns a dictionary of parameters that, when turned into a Params object and then fed to .from_params(), will recreate this object.
You don't need to implement this all the time. Tango will let you know if you need it.
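The write/read contract and the _to_params() round trip can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not Tango's implementation: SketchJsonFormat, its write() and read() signatures, and the plain-dict from_params() are hypothetical stand-ins, and a plain dictionary takes the place of a Params object.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


class SketchJsonFormat:
    """Hypothetical stand-in for a format; it does not subclass tango.format.Format."""

    def __init__(self, compress: Optional[str] = None):
        if compress not in (None, "gz"):
            raise ValueError(f"unsupported compression: {compress!r}")
        self.compress = compress

    def _path(self, directory: Path) -> Path:
        # The artifact always lives in one file inside the given directory.
        suffix = ".gz" if self.compress == "gz" else ""
        return directory / f"data.json{suffix}"

    def write(self, artifact: Any, directory: Path) -> None:
        opener = gzip.open if self.compress == "gz" else open
        with opener(self._path(directory), "wt", encoding="utf-8") as f:
            json.dump(artifact, f)

    def read(self, directory: Path) -> Any:
        opener = gzip.open if self.compress == "gz" else open
        with opener(self._path(directory), "rt", encoding="utf-8") as f:
            return json.load(f)

    def _to_params(self) -> Dict[str, Any]:
        # Everything needed to rebuild an equivalent object.
        return {"compress": self.compress}

    @classmethod
    def from_params(cls, params: Dict[str, Any]) -> "SketchJsonFormat":
        # Assumption: a plain dict stands in for a Params object here.
        return cls(**params)


def roundtrip_demo() -> Any:
    original = SketchJsonFormat(compress="gz")
    rebuilt = SketchJsonFormat.from_params(original._to_params())
    with tempfile.TemporaryDirectory() as tmp:
        original.write({"answer": 42}, Path(tmp))
        # The rebuilt object can read what the original wrote.
        return rebuilt.read(Path(tmp))
```

The point of the round trip is that the rebuilt object behaves like the original: it finds and decodes the same file because the parameters captured every setting that affects the on-disk layout.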
Implementations
- class tango.format.DillFormat(compress=None)
This format writes the artifact as a single file called "data.dill" using dill (a drop-in replacement for pickle). Optionally, it can compress the data.
This is very flexible, but not always the fastest.
Tip
This format has special support for iterables. If you write an iterator, it will consume the iterator. If you read an iterator, it will read the iterator lazily.
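The behavior described in the tip can be sketched as follows. This is an illustration only, not how DillFormat is implemented: it uses the standard-library pickle module in place of dill, and write_items(), read_items(), and the file name data.pkl are hypothetical names chosen for the example.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Iterable, Iterator


def write_items(items: Iterable[Any], path: Path) -> None:
    """Consume `items` and pickle each element, one after another, into one file."""
    with open(path, "wb") as f:
        for item in items:
            pickle.dump(item, f)


def read_items(path: Path) -> Iterator[Any]:
    """Yield elements one at a time; the file is not opened until iteration starts."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def lazy_demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "data.pkl"
        source = (n * n for n in range(5))
        write_items(source, path)
        leftover = list(source)    # empty: writing consumed the generator
        reader = read_items(path)
        first = next(reader)       # only the first element has been loaded so far
        rest = list(reader)        # the remaining elements are loaded on demand
        return [leftover, first, rest]
```

Because writing exhausts the iterator, a step that needs the values afterwards should read them back from the stored result rather than reuse the original iterator.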
- class tango.format.DillFormatIterator(filename)
An Iterator class that is used to return an iterator from tango.format.DillFormat.read().
- class tango.format.JsonFormat(compress=None)
This format writes the artifact as a single file in JSON format. Optionally, it can compress the data. This is very flexible, but not always the fastest.
Tip
This format has special support for iterables. If you write an iterator, it will consume the iterator. If you read an iterator, it will read the iterator lazily.
- class tango.format.JsonFormatIterator(filename)
An Iterator class that is used to return an iterator from tango.format.JsonFormat.read().
- class tango.format.SqliteDictFormat(*args, **kwds)
This format works specifically on results of type DatasetDict. It writes those datasets into SQLite databases.
During reading, the advantage is that the dataset can be read lazily. Reading a result that is stored in SqliteDictFormat takes milliseconds. No actual reading takes place until you access individual instances.
During writing, you have to take some care to take advantage of the same trick. Recall that DatasetDict is basically a map, mapping split names to lists of instances. If you ensure that those lists of instances are of type SqliteSparseSequence, then writing the results in SqliteDictFormat can in many cases be instantaneous.
Here is an example of the pattern to use to make writing fast:
    @Step.register("my_step")
    class MyStep(Step[DatasetDict]):
        FORMAT: Format = SqliteDictFormat()
        VERSION = "001"

        def run(self, ...) -> DatasetDict:
            result: Dict[str, Sequence] = {}
            for split_name in my_list_of_splits:
                output_split = SqliteSparseSequence(self.work_dir / f"{split_name}.sqlite")
                for instance in instances:
                    output_split.append(instance)
                result[split_name] = output_split
            metadata = {}
            return DatasetDict(result, metadata)
Observe how for each split, we create a SqliteSparseSequence in the step's work directory (accessible with work_dir()). This has the added advantage that if the step fails and you have to re-run it, the previous results that were already written to the SqliteSparseSequence are still there. You could replace the inner for loop like this to take advantage:

    output_split = SqliteSparseSequence(self.work_dir / f"{split_name}.sqlite")
    for instance in instances[len(output_split):]:   # <-- here is the difference
        output_split.append(instance)
    result[split_name] = output_split

This works because when you re-run the step, the work directory will still be there, so output_split is not empty when you open it.
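The resume behavior can be tried out without Tango using a small stand-in. The sketch below is a simplified illustration, not the real SqliteSparseSequence: TinySqliteSequence, process(), and the fail_after parameter are hypothetical, and the stand-in only supports append, len, and integer indexing.

```python
import pickle
import sqlite3
import tempfile
from pathlib import Path
from typing import Any, List


class TinySqliteSequence:
    """Hypothetical, minimal append-only sequence backed by a SQLite file."""

    def __init__(self, path: Path):
        self._conn = sqlite3.connect(str(path))
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS items (idx INTEGER PRIMARY KEY, value BLOB)"
        )
        self._conn.commit()

    def __len__(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]

    def __getitem__(self, index: int) -> Any:
        row = self._conn.execute(
            "SELECT value FROM items WHERE idx = ?", (index,)
        ).fetchone()
        if row is None:
            raise IndexError(index)
        return pickle.loads(row[0])

    def append(self, item: Any) -> None:
        # Commit after every append so finished work survives a crash.
        self._conn.execute(
            "INSERT INTO items (idx, value) VALUES (?, ?)",
            (len(self), pickle.dumps(item)),
        )
        self._conn.commit()

    def close(self) -> None:
        self._conn.close()


def process(instances: List[str], path: Path, fail_after: int = -1) -> int:
    """Store uppercased instances, skipping ones already on disk.

    Returns how many instances were newly processed in this call.
    """
    output = TinySqliteSequence(path)
    done = 0
    try:
        for instance in instances[len(output):]:  # skip what is already stored
            if done == fail_after:
                raise RuntimeError("simulated crash")
            output.append(instance.upper())
            done += 1
    finally:
        output.close()
    return done


def resume_demo() -> list:
    instances = ["a", "b", "c", "d"]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "train.sqlite"
        try:
            process(instances, path, fail_after=2)  # first run dies after two items
        except RuntimeError:
            pass
        newly_done = process(instances, path)       # second run picks up at item three
        stored = TinySqliteSequence(path)
        result = [newly_done, [stored[i] for i in range(len(stored))]]
        stored.close()
        return result
```

Note that slicing by len(output_split) is only correct when instances are produced in the same order on every run, which the sketch assumes.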
- class tango.format.TextFormat(compress=None)
This format writes the artifact as a single file in text format. Optionally, it can compress the data. This is very flexible, but not always the fastest.
This format can only write strings, or iterables of strings.
Tip
This format has special support for iterables. If you write an iterator, it will consume the iterator. If you read an iterator, it will read the iterator lazily.
Be aware that if your strings contain newlines, you will read out more strings than you wrote. For this reason, it's often advisable to use JsonFormat instead. With JsonFormat, all special characters are escaped, strings are quoted, but it's all still human-readable.
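The newline pitfall is easy to reproduce with plain files. The sketch below assumes a one-string-per-line layout for the text case and a one-JSON-value-per-line layout for the escaped case; both layouts and the helper names are illustrative assumptions, not a description of how TextFormat or JsonFormat store data.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Tuple


def roundtrip_plain(strings: List[str], path: Path) -> List[str]:
    """Write one string per line, then read the file back line by line."""
    with open(path, "w", encoding="utf-8") as f:
        for s in strings:
            f.write(s + "\n")
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def roundtrip_json_lines(strings: List[str], path: Path) -> List[str]:
    """Write one JSON-encoded string per line; embedded newlines become \\n escapes."""
    with open(path, "w", encoding="utf-8") as f:
        for s in strings:
            f.write(json.dumps(s) + "\n")
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def newline_demo() -> Tuple[int, bool]:
    strings = ["first", "second\nwith a line break"]
    with tempfile.TemporaryDirectory() as tmp:
        plain = roundtrip_plain(strings, Path(tmp) / "data.txt")
        escaped = roundtrip_json_lines(strings, Path(tmp) / "data.jsonl")
    # Two strings went in; the plain file yields three, the JSON file yields the originals.
    return len(plain), escaped == strings
```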