First Steps#

What is a Step?#

Tango is a Python library for choreographing machine learning research experiments by executing a series of steps. A step can do anything, really, such as prepare a dataset, train a model, send an email to your mother wishing her happy birthday, etc.

Concretely, each step is just a subclass of Step, where the run() method in particular defines what the step actually does. So anything that can be implemented in Python can be run as a step.

Steps can also depend on other steps in that the output of one step can be part of the input to another step. Therefore, the steps that make up an experiment form a directed graph.

The concept of the Step is the bread and butter that makes Tango so general and powerful. So powerful, in fact, that you might be wondering if Tango is Turing-complete? Well, we don’t know yet, but we can say at least that Tango is Tango-complete 😉

Configuration files#

Experiments themselves are defined through JSON, Jsonnet, or YAML configuration files. At a minimum, these files must contain the “steps” field, which should be a mapping of arbitrary (yet unique) step names to the configuration of the corresponding step.

For example, let’s create a config file called config.jsonnet with the following contents:

{
  "steps": {
    "random_name": {
      "type": "random_choice",
      "choices": ["Turing", "Tango", "Larry"],
    },
    "say_hello": {
      "type": "concat_strings",
      "string1": "Hello, ",
      "string2": {
        "type": "ref",
        "ref": "random_name"
      }
    },
    "print": {
      "type": "print",
      "input": {
        "type": "ref",
        "ref": "say_hello"
      }
    }
  }
}

Can you guess what this experiment does?

There are three steps in this experiment graph: “random_name” is the name of one step, “say_hello” is the name of another, and “print” is the name of the last. The “type” parameter within the config of each step tells Tango which Step class implementation to use for that step.

So, within the “random_name” step config

"random_name": {
  "type": "random_choice",
  "choices": ["Turing", "Tango", "Larry"],
}

the "type": "random_choice" part tells Tango to use the Step subclass that is registered by the name “random_choice”.

But wait… what do we mean by registered?

Tango keeps track of an internal registry for certain classes (such as the Step class) that is just a mapping of arbitrary unique names to subclasses. When you look through Tango’s source code, you’ll see things like:

@Step.register("foo")
class Foo(Step):
    ...

This is how subclasses get added to the registry. In this case the subclass Foo is added to the Step registry under the name “foo”, so if you were to use "type": "foo" in your configuration file, Tango would understand that you mean to use the Foo class for the given step.

Tip

Any class that inherits from Registrable can have its own registry.

Now back to our example. The step classes referenced in our configuration file (“random_choice” and “concat_strings”) don’t actually exist in the Tango library (though the “print” step does), but we can easily implement and register them on our own.

Let’s put them in a file called components.py:

# file: components.py

import random
from typing import List

from tango import Step

@Step.register("random_choice")
class RandomChoiceStep(Step):
    DETERMINISTIC = False

    def run(self, choices: List[str]) -> str:
        return random.choice(choices)

@Step.register("concat_strings")
class ConcatStringsStep(Step):
    def run(self, string1: str, string2: str) -> str:
        return string1 + string2

Important

It’s important that you use type hints in your code so that Tango can properly construct Python objects from the corresponding serialized (JSON) objects and warn you when the types don’t match up.

So as long as Tango is able to import this module (components.py) these step implementations will be added to the registry and Tango will know how to instantiate and run them.

Executing an experiment#

At this point we’ve implemented our custom steps (components.py) and created our configuration file config.jsonnet, so we’re ready to actually run this experiment.

For that, just use the tango run command:

$ tango run config.jsonnet -i components

Tip

  • The -i option is short for --include-package, which takes the name of a Python package which Tango will try to import. In this case our custom steps are in components.py, so we need Tango to import this module to find those steps. As long as components.py is in the current directory or somewhere else on the PYTHONPATH, Tango will be able to find and import this module when you pass -i components (note the lack of the .py at the end).

You should see something like this in the output:

Starting new run cute-kitten
● Starting step "random_name"
✓ Finished step "random_name"
● Starting step "say_hello"
✓ Finished step "say_hello"
● Starting step "print"
Hello, Tango
✓ Finished step "print"

Step caching#

This particular experiment didn’t write any results to disk, but in many situations you’ll want to save the output of at least some of your steps.

For example, if you’re using the TorchTrainStep step, the output is a trained model, which is certainly a useful thing to keep around. In other cases, you may not actually care about the direct result of a particular step, but it could still be useful to save it when possible so that Tango doesn’t need to run the step again unnecessarily.

This is where Tango’s caching mechanism comes in.

To demonstrate this, let’s look at another example that pretends to do some expensive computation. Here is the config.jsonnet file:

{
  "steps": {
    "add_numbers": {
      "type": "really_inefficient_addition",
      "num1": 34,
      "num2": 8
    }
  }
}

And let’s implement “really_inefficient_addition”:

# components.py

import time

from tango import Step, JsonFormat
from tango.common import Tqdm


@Step.register("really_inefficient_addition")
class ReallyInefficientAdditionStep(Step):
    DETERMINISTIC = True
    CACHEABLE = True
    FORMAT = JsonFormat()

    def run(self, num1: int, num2: int) -> int:
        for _ in Tqdm.tqdm(range(100), desc="Computing...", total=100):
            time.sleep(0.05)
        return num1 + num2

There are a couple of things to note about this step, other than the obvious inefficiencies; the class variables we’ve defined: DETERMINISTIC, CACHEABLE, and FORMAT.

DETERMINISTIC = True tells Tango that, given particular inputs, the output to this step will always be the same every time it is ran, which has implications on caching. By default, Tango assumes steps are deterministic. You can override this by saying DETERMINISTIC = False. Tango will warn you when you try to cache a non-deterministic step.

CACHEABLE = True tells Tango that it can cache this step and FORMAT = JsonFormat() defines which Format Tango will use to serialize the result of the step.

This time when we run the experiment we’ll designate a specific directory for Tango to use:

$ tango run config.jsonnet -i components -d workspace/
Starting new run live-tarpon
● Starting step "add_numbers"
Computing...: 100%|##########| 100/100 [00:05<00:00, 18.99it/s]
✓ Finished step "add_numbers"
✓ The output for "add_numbers" is in workspace/runs/live-tarpon/add_numbers

The last line in the output tells us where we can find the result of our “add_numbers” step. live-parpon is the name of the run. Run names are randomly generated and may be different on your machine. add_numbers is the name of the step in your config. The whole path is a symlink to a directory, which contains (among other things) a file data.json:

$ cat workspace/runs/live-tarpon/add_numbers/data.json
42

Now look what happens when we run this step again:

$ tango run config.jsonnet -i components -d workspace/
Starting new run modest-shrimp
✓ Found output for "add_numbers" in cache
✓ The output for "add_numbers" is in workspace/runs/modest-shrimp/add_numbers

Tango didn’t have to run our really inefficient addition step this time because it found the previous cached result. It put the results in the result directory for a different run (in our case, the modest-shrimp run), but once again it is a symlink that links to the same results from our first run.

If we changed the inputs to the step in config.jsonnet:

     "add_numbers": {
       "type": "really_inefficient_addition",
       "num1": 34,
-      "num2": 8
+      "num2": 2
     }
   }
 }

And ran it again:

$ tango run config.jsonnet -i components -d workspace/
Starting new run true-parrot
● Starting step "add_numbers"
Computing...: 100%|##########| 100/100 [00:05<00:00, 19.13it/s]
✓ Finished step "add_numbers"
✓ The output for "add_numbers" is in workspace/runs/true-parrot/add_numbers

You’d see that Tango had to run our “add_numbers” step again.

You may have noticed that workspace/runs/true-parrot/add_numbers is now a symlink that points to a different place than it did for the first two runs. That’s because it produced a different result this time. All the result symlinks point into the workspace/cache/ directory, where all the step’s results are cached.

This means that if we ran the experiment again with the original inputs, Tango would still find the cached result and wouldn’t need to rerun the step.

Arbitrary objects as inputs#

FromParams#

So far the inputs to all of the steps in our examples have been built-in Python types that can be deserialized from JSON (e.g. int, str, etc.), but sometimes you need the input to a step to be an instance of an arbitrary Python class.

Tango allows this as well as it can infer from type hints what the class is and how to instantiate it. When writing your own classes, it’s recommended that you have your class inherit from the FromParams class, which will gaurantee that Tango can instantiate it from a config file.

For example, suppose we had a step like this:

from tango import Step
from tango.common import FromParams


class Bar(FromParams):
    def __init__(self, x: int) -> None:
        self.x = x


@Step.register("foo")
class FooStep(Step):
    def run(self, bar: Bar) -> int:
        return bar.x

Tip

If you’ve used AllenNLP before, this will look familiar! In fact, it’s the same system under the hood.

Then we could create a config like this:

{
  "steps": {
    "foo": {
      "type": "foo",
      "bar": {"x": 1}
    }
  }
}

And Tango will figure out how to deserialize {"x": 1} into a Bar instance.

You can also have FromParams objects nested within other FromParams objects or standard containers like list:

from typing import List

from tango import Step
from tango.common import FromParams


class Bar(FromParams):
    def __init__(self, x: int) -> None:
        self.x = x


class Baz(FromParams):
    def __init__(self, bar: Bar) -> None:
        self.bar = bar


@Step.register("foo")
class FooStep(Step):
    def run(self, bars: List[Bar], baz: Baz) -> int:
        return sum([bar.x for bar in bars]) + baz.bar.x

Registrable#

The Registrable class is a special kind of FromParams class that allows you to specify from the config which subclass of an expected class to deserialize into.

This is actually how we’ve been instantiating specific Step subclasses. Because Step inherits from Registrable, we can use the "type" fields in the config file to specify a Step subclass.

This is also very useful when you’re writing a step that requires a certain type as input, but you want to be able to change the exact subclass of the type from your config file. For example, the TorchTrainStep takes Registrable inputs such as Model. Model variants can then be subclasses that are specified in the config file by their registered names. A sketch of this might look like the following:

from tango import Step
from tango.common import FromParams, Registrable

class Model(torch.nn.Module, Registrable):
    ...

@Model.register("variant1")
class Variant1(Model):
    ...

@Model.register("variant2")
class Variant2(Model):
    ...

@Step.register("torch::train")
class TorchTrainerStep(Step):
    def run(self, model: Model, ...) -> Model:
        ...

And a sketch of the config file would be something like this:

{
  "steps": {
    "train": {
      "type": "torch::train",
      "model": {
        "type": "variant1",
      }
    }
  }
}

As in the FromParams example the specifications can be nested, but now we also denote the subclass with the "type": "..." field. To swap models we need only change “variant1” to “variant2” in the config. The value for “type” can either be the name that the class is registered under (e.g. “train” for TorchTrainStep), or the fully qualified class name (e.g. tango.integrations.torch.TorchTrainStep).

You’ll see more examples of this in the next section.