# Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
## Unreleased

### Added

- **Step resources:**
  - Added a `step_resources` parameter to the `Step` class, which should be used to describe the computational resources required to run a step. `Executor` implementations can use this information. For example, if your step needs 2 GPUs, you should set `step_resources=StepResources(gpu_count=2)` (`"step_resources": {"gpu_count": 2}` in the configuration language).
  - Added a `Step.resources()` property method. By default this returns the value specified by the `step_resources` parameter. If your step implementation always requires the same resources, you can just override this method so you don't have to provide the `step_resources` parameter.
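
  A minimal sketch of both styles is below; the step name and the import location of `StepResources` are assumptions here, so check the Tango docs for the exact API:

  ```python
  from tango import Step
  from tango.step import StepResources  # assumed import path for StepResources


  @Step.register("train_big_model")  # hypothetical step
  class TrainBigModel(Step):
      # Style 1: callers pass step_resources=StepResources(gpu_count=2) when
      # constructing the step, or set "step_resources": {"gpu_count": 2} in config.

      # Style 2: if the step always needs the same resources, override
      # resources() instead (shown here as a property, which is an assumption):
      @property
      def resources(self) -> StepResources:
          return StepResources(gpu_count=2)

      def run(self) -> str:  # type: ignore[override]
          return "trained on 2 GPUs"
  ```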
- **Step execution:**
  - Added an `executor` field to the `tango.yml` settings. You can use this to define the executor you want to use by default.
  - Added a Beaker `Executor` to the Beaker integration, registered as an `Executor` with the name "beaker". To use this executor, add these lines to your `tango.yml` file:

    ```yaml
    executor:
      type: beaker
      beaker_workspace: ai2/my-workspace
      clusters:
        - ai2/general-cirrascale
    ```

    See the docs for the `BeakerExecutor` for more information on the input parameters.
### Changed

- **CLI:**
  - The `tango run` command will throw an error if you have uncommitted changes in your repository, unless you use the `--allow-dirty` flag.
  - The `tango run` command will use the lightweight base executor (single process) by default. To use the multi-process executor, set `-j/--parallelism` to 1 or higher, or to -1 to use all available CPU cores.
## v0.11.0 - 2022-08-04

### Added

- Added a Flax integration along with an example config.

## v0.10.1 - 2022-07-26

### Fixed

- Fixed issue where the `StepInfo` config argument could be parsed into a `Step`.
- Restored capability to run tests out-of-tree.
## v0.10.0 - 2022-07-07

### Changed

- Renamed the `workspace` parameter of the `BeakerWorkspace` class to `beaker_workspace`.
- The `Executor` class is now a `Registrable` base class. `MulticoreExecutor` is registered as "multicore".

### Removed

- Removed `StepExecutionMetadata`. Its fields have been absorbed into `StepInfo`.

### Fixed

- Improved `Step.ensure_result()` such that the step's result doesn't have to be read from the cache.
- Fixed an issue with the output from `MulticoreExecutor` such that it's now consistent with the default `Executor` for steps that were found in the cache.
- One of our error messages referred to a configuration file that no longer exists.
- Improved performance of `BeakerWorkspace`.

### Added

- Added the ability to train a straight `Model` instead of just a `Lazy[Model]`.
## v0.9.1 - 2022-06-24

### Fixed

- Fixed non-deterministic behavior in `TorchTrainStep`.
- Fixed bug in `BeakerWorkspace` where `.step_info(step)` would raise a `KeyError` if the step hasn't been registered as part of a run yet.
- Fixed a bug in `BeakerWorkspace` where it would send too many requests to the Beaker service.
- Fixed a bug where `WandbWorkspace.step_finished()` or `.step_failed()` would crash if called from a different process than `.step_starting()`.
- Fixed a bug in `WandbWorkspace.step_finished()` which sometimes led to a `RuntimeError` while caching the result of a step.
## v0.9.0 - 2022-06-01

### Added

- Added a Beaker integration that comes with `BeakerWorkspace`, a remote `Workspace` implementation that uses Beaker Datasets under the hood.
- Added a `datasets::dataset_remix` step that provides the split remixing functionality of `tango.steps.dataset_remix.DatasetRemixStep` for Huggingface `DatasetDict`.
- Added a config and code example of `Registrable` to the First Step docs, with edits for clarity.

### Changed

- If you try to import something from a tango integration that is not fully installed due to missing dependencies, an `IntegrationMissingError` will be raised instead of `ModuleNotFound`.
- You can now set `-j 0` in `tango run` to disable multicore execution altogether.

### Fixed

- Improved how steps and workspaces handle race conditions when different processes are competing to execute the same step. This used to result in a `RuntimeError` with most workspaces, but is now handled gracefully.
- Fixed bug which caused `GradScaler` state to not be saved and loaded with checkpoints.
## v0.8.0 - 2022-05-19

### Added

- Added a Weights & Biases remote `Workspace` implementation: `WandbWorkspace`, registered as "wandb". This can be instantiated from a workspace URL in the form "wandb://entity/project".
- Added a method `Workspace.step_result_for_run()`, which gives the result of a step given the run name and step name within that run.
- Added property `Workspace.url`, which returns a URL for the workspace that can be used to instantiate the exact same workspace using `Workspace.from_url()`. Subclasses must implement this.
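
A minimal sketch of how these pieces fit together (the entity, project, run, and step names are placeholders):

```python
from tango.workspace import Workspace

# Instantiate a WandbWorkspace from a workspace URL.
workspace = Workspace.from_url("wandb://my-entity/my-project")

# `Workspace.url` round-trips back to a URL for the same workspace.
same_workspace = Workspace.from_url(workspace.url)

# Look up the result of a step by run name and step name within that run.
result = workspace.step_result_for_run("my-run", "my-step")
```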
### Changed

- `StepInfo` start and end times will always be in UTC now.
- `WandbTrainCallback` now logs system metrics from each worker process in distributed training.
- `StepCache.__contains__()` and `StepCache.__getitem__()` now accept either a `Step` or a `StepInfo` as an argument (`Union[Step, StepInfo]`).
- Refactored `tango.step_graph.StepGraph` to allow initialization from a `Dict[str, Step]`.
- `Executor.execute_step_graph()` now attempts to execute all steps and summarizes success/failures.
### Fixed

- Fixed bug with `LocalWorkspace.from_parsed_url()` (#278).
- Deprecation warnings will now be logged from the `tango` CLI.
- Fixed the text format in the case of serializing an iterator of strings.
- Added missing default value of `None` to `TangoGlobalSettings.find_or_default()`.
- Mypy has become incompatible with transformers and datasets, so we have to disable the checks in some places.
- The `VERSION` member of step arguments that were wrapped in `Lazy` was not respected. Now it is.
## v0.7.0 - 2022-04-19

### Added

- Added the `-n/--name` option to `tango run`. This option allows the user to give the run an arbitrary name.
- Added a convenience property `.workspace` to the `Step` class that can be called from a step's `.run()` method to get the current `Workspace` being used (see the sketch after this list).
- Gave `FromParams` objects (which includes all `Registrable` objects) the ability to version themselves.
- Added CLI option to run a single step in a config using `--step-name` or `-s`.
- Added a `MulticoreExecutor` that executes steps in parallel.
- Added an `ExecutorOutput` dataclass that is returned by `Executor.execute_step_graph()`.
- `StepGraph` now prints itself in a readable way.
- Tango now automatically detects when it's running under a debugger, and disables multicore support accordingly. Many debuggers can't properly follow sub-processes, so this is a convenience for people who love debuggers.
- Added more models to the stuff we can import from the transformers library.
- Added a new example for finetuning text-to-text models.
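
A minimal sketch of the new `.workspace` property (the step name here is hypothetical):

```python
from tango import Step


@Step.register("workspace_aware")  # hypothetical step
class WorkspaceAware(Step):
    def run(self) -> str:  # type: ignore[override]
        # The new `.workspace` property exposes the current Workspace
        # from inside a step's run() method.
        return f"running in {self.workspace.__class__.__name__}"
```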
### Changed

- Renamed `click_logger` to `cli_logger`, and we now use rich's logging `Handler` as the default handler, which means prettier output, better tracebacks, and you can use rich's markup syntax with the `cli_logger` to easily add style to text.
- Refactored `tango.step_graph.StepGraph` to allow initialization from a `Dict[str, Step]`.
- `Executor.execute_step_graph()` now attempts to execute all steps and summarizes success/failures.
- Upgraded PyTorch version in the `tango` Docker image to the latest, `v1.11.0+cu113`.
- `RunGeneration` now allows a model object as input.

### Fixed

- Fixed bug that mistakenly disallowed fully-qualified names containing `"_"` (underscores) in the config.
- Fixed bug where the `TorchTrainStep` working directory would be left in an unrecoverable state if training failed after saving the final model weights.
- Fixed bug in `FromParams` where `**kwargs` might be passed down to the constructors of arguments.
- Fixed bug in the way dependencies are tracked between steps.
- Fixed bug that caused `MulticoreExecutor` to hang in case of a failing step that was required recursively (not directly) downstream.
- Compatibility with PyTorch Lightning 1.6.
## v0.6.0 - 2022-02-25

### Added

- New example that finetunes a pre-trained ResNet model on the Cats & Dogs dataset.
- Added a `@requires_gpus` decorator for marking tests as needing GPUs. Tests marked with this will be run in the "GPU Tests" workflow on dual k80 GPUs via Beaker.
- Added the `-w/--workspace` option to the `tango run` and `tango server` commands. This option takes a path or URL, and instantiates the workspace from the URL using the newly added `Workspace.from_url()` method.
- Added the `workspace` field to `TangoGlobalSettings`.
- Added the `environment` field to `TangoGlobalSettings` for setting environment variables each time `tango` is run.
- Added a utility function to get a `StepGraph` directly from a file.
- Added the `tango.settings` module and the `tango settings` group of commands.
- A format for storing sequences as `SqliteSparseSequence`.
- A way to massage kwargs before they determine the unique ID of a `Step`.

### Changed

- `local_workspace.ExecutorMetadata` renamed to `StepExecutionMetadata` and now saved as `execution-metadata.json`.
- `tango run` without the option `-w/--workspace` or `-d/--workspace-dir` will now use a `MemoryWorkspace` instead of a `LocalWorkspace` in a temp directory, unless you've specified a default workspace in a `TangoGlobalSettings` file.
- Moved `tango.workspace.MemoryWorkspace` and `tango.local_workspace.LocalWorkspace` to `tango.workspaces.*`.
- Moved `tango.step_cache.MemoryStepCache` and `tango.step_cache.LocalStepCache` to `tango.step_caches.*`.
- Deprecated the `-d/--workspace-dir` command-line option. Please use `-w/--workspace` instead.
### Fixed

- Fixed a small bug where `LocalWorkspace` would fail to capture the conda environment in our Docker image.
- Fixed activation of `FILE_FRIENDLY_LOGGING` when set from the corresponding environment variable.
- Fixed setting the log level via the environment variable `TANGO_LOG_LEVEL`.
- Use relative paths within the `work_dir` for symbolic links to the latest and the best checkpoints in `TorchTrainStep`.
- Fixed some scenarios where Tango can hang after finishing all steps.
- `distributed_port` and `log_every` parameters won't factor into `TorchTrainStep`'s unique ID.
- `MappedSequence` now works with slicing.
- `MappedSequence` now works with Huggingface `Dataset`.
- Uncacheable steps are now visible in the Tango UI.
- Fixed bug in `Registrable.list_available()` where an error might be raised if the default implementation hadn't been explicitly imported.
- Fixed issue where having a default argument to the `run()` method wasn't getting applied to the step's unique ID.
## v0.5.0 - 2022-02-09

### Added

- Added a `TrainingEngine` abstraction to the torch integration.
- Added a FairScale integration with a `FairScaleTrainingEngine` that leverages FairScale's `FullyShardedDataParallel`. This is meant to be used within the `TorchTrainStep`.
- All PyTorch components (such as learning rate schedulers, optimizers, data collators, etc.) from the transformers library are now registered under the corresponding class in the torch integration. For example, the transformers `Adafactor` optimizer is registered as an `Optimizer` under the name "transformers::Adafactor". More details can be found in the documentation for the transformers integration.

### Changed

- Various changes to the parameters of the `TorchTrainStep` due to the introduction of the `TrainingEngine` class.
- Params are logged at the `DEBUG` level instead of `INFO` to reduce noise in logs.
- The waiting message for `FileLock` is now clear about which file it's waiting for.
- Added an easier way to get the default Tango global config.
- Most methods of `TorchTrainCallback` also take an `epoch` parameter now.
- `WandbTrainCallback` now logs peak GPU memory occupied by PyTorch tensors per worker. This is useful because W&B's system metrics only display the total GPU memory reserved by PyTorch, which is always higher than the actual amount of GPU memory occupied by tensors. So these new metrics give a more accurate view into how much memory your training job is actually using.
- Plain old Python functions can now be used in `Lazy` objects.
- `LocalWorkspace` now creates a symlink to the outputs of the latest run.
- Tango is now better at guessing when a step has died and should be re-run.
- Tango is now more lenient about registering the same class under the same name twice.
- When you use `dict` instead of `Dict` in your type annotations, you now get a legible error message. Same for `List`, `Tuple`, and `Set`.
### Fixed

- Fixed a bug in `Registrable` and `FromParams` where registered function constructors would not properly construct arguments that were classes.
- Fixed a bug in `FromParams` that would cause a crash when an argument to the constructor had the name `params`.
- Made `FromParams` more efficient by only trying to parse the params as a `Step` when it looks like it actually could be a step.
- Fixed bug where the `Executor` would crash if the `git` command could not be found.
- Fixed bug where validation settings were not interpreted the right way by the torch trainer.
- When you register the same name twice using `Registrable`, you get an error message. That error message now contains the correct class name.
## v0.4.0 - 2022-01-27

### Changed

- Default log level is `WARNING` instead of `ERROR`.
- The web UI now renders the step graph left-to-right.
- The web UI now shows runs by date, with the most recent run at the top.
- The web UI now shows steps in a color-coded way.
- The `tango run` command now prints user-friendly paths if possible.
- The `--include-package` flag now also accepts paths instead of module names.
- `tango.common.sqlite_sparse_sequence.SqliteSparseSequence` now lives at `tango.common.sequences.SqliteSparseSequence`.

### Fixed

- Ensure tqdm log lines always make it into the log file `out.log`, even when the log level is `WARNING` or `ERROR`.
- Numerous parts of Tango now have documentation when they didn't before.
## v0.4.0rc5 - 2022-01-19

### Added

- Added `TorchEvalStep` to the torch integration, registered as "torch::eval".

### Changed

- Renamed `aggregate_val_metric` to `auto_aggregate_val_metric` in `TorchTrainStep`.
- The `devices` parameter to `TorchTrainStep` was replaced with `device_count: int`.
- Run name printed at the end of a run so it's easier to find.
- Type information added to package data. See PEP 561 for more information.
- A new integration, `transformers`, with two new steps for running seq2seq models.
- Added `logging_tqdm`, if you don't want a progress bar, but you still want to see progress in the logs.
- Added `threaded_generator()`, for wrapping generators so that they run in a separate thread from the generator's consumer.
- Added a new example for evaluating the T0 model on XSum, a summarization task.
- Added `MappedSequence` for functionally wrapping sequences.
- Added `TextFormat`, in case you want to store the output of your steps in raw text instead of JSON.
- Steps can now list arguments in `SKIP_ID_ARGUMENTS` to indicate that the argument should not affect a step's unique ID. This is useful for arguments that affect the execution of a step, but not the output (see the sketch after this list).
- `Step` now implements `__str__`, so steps look pretty in the debugger.
- Added `DatasetCombineStep`, a step that combines multiple datasets into one.
- Added `common.logging.initialize_worker_logging()` function for configuring logging from worker processes/threads.
- Logs from `tango run ...` will be written to a file called `out.log` in the run directory.
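
A minimal sketch of `SKIP_ID_ARGUMENTS` (the step and argument names are hypothetical):

```python
from tango import Step


@Step.register("slow_computation")  # hypothetical step
class SlowComputation(Step):
    # `num_workers` changes how the step runs, not what it produces,
    # so it shouldn't change the step's unique ID.
    SKIP_ID_ARGUMENTS = {"num_workers"}

    def run(self, x: int, num_workers: int = 1) -> int:  # type: ignore[override]
        return x * 2
```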
### Fixed

- Fixed torch `StopEarlyCallback` state not being recovered properly on restarts.
- Fixed file friendly logging by removing special styling characters.
- Ensured exceptions are captured in logs.
- `LocalWorkspace` now works properly with uncacheable steps.
- When a Tango run got killed hard, with `kill -9`, or because the machine lost power, `LocalWorkspace` would sometimes keep a step marked as "running", preventing further executions. This still happens sometimes, but it is now much less likely (and Tango gives you instructions for how to fix it).
- To make all this happen, `LocalWorkspace` now saves step info in a Sqlite database. Unfortunately that means that the workspace format changes, so existing workspace directories won't work properly with it.
- Fixed premature cleanup of temporary directories when using `MemoryWorkspace`.
## v0.4.0rc4 - 2021-12-20

### Fixed

- Fixed a bug where `StepInfo` fails to deserialize when `error` is an exception that can't be pickled.

## v0.4.0rc3 - 2021-12-15

### Added

- Added the `DatasetsFormat` format and `LoadStreamingDataset` step to the `datasets` integration.
- `SqliteDictFormat` for datasets.
- Added `pre_epoch()` and `post_epoch()` callback methods to the PyTorch `TrainCallback`.
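
A hedged sketch of the new epoch hooks; the method signatures shown are assumptions, so consult the `TrainCallback` docs:

```python
from tango.integrations.torch import TrainCallback


@TrainCallback.register("log_epochs")  # hypothetical callback
class LogEpochsCallback(TrainCallback):
    # NOTE: these signatures are assumptions for illustration.
    def pre_epoch(self, step: int, epoch: int) -> None:
        print(f"starting epoch {epoch}")

    def post_epoch(self, step: int, epoch: int) -> None:
        print(f"finished epoch {epoch}")
```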
### Changed

- The `LoadDataset` step from the `datasets` integration is now cacheable, using the `DatasetsFormat` format by default. But this only works with non-streaming datasets. For streaming datasets, you should use the `LoadStreamingDataset` step instead.

### Fixed

- Fixed bug where `KeyboardInterrupt` exceptions were not handled properly by steps and workspaces.
- `WandbTrainCallback` will now use part of the step's unique ID as the name for the W&B run by default, to make it easier to identify which tango step corresponds to each run in W&B.
- `WandbTrainCallback` will save the entire `TrainConfig` object to the W&B config.
## v0.4.0rc2 - 2021-12-13

### Added

- Sample experiment configurations that prove Euler's identity.

### Changed

- Loosened `Click` dependency to include v7.0.
- Loosened `datasets` dependency.
- Tightened `petname` dependency to exclude the next major release, for safety.

### Fixed

- `Workspace`, `MemoryWorkspace`, and `LocalWorkspace` can now be imported directly from the `tango` base module.
- Uncacheable leaf steps would never get executed. This is now fixed.
- We were accidentally treating failed steps as if they were completed.
- The visualization had a problem with showing steps that never executed because a dependency failed.
- Fixed a bug where `Lazy` inputs to a `Step` would fail to resolve arguments that come from the result of another step.
- Fixed a bug in `TorchTrainStep` where some arguments for distributed training (`devices`, `distributed_port`) weren't being set properly.
## v0.4.0rc1 - 2021-11-30

### Added

- Introduced the concept of the `Workspace`, with `LocalWorkspace` and `MemoryWorkspace` as initial implementations.
- Added a stub of a webserver that will be able to visualize runs as they happen.
- Added separate classes for `LightningTrainingTypePlugin`, `LightningPrecisionPlugin`, `LightningClusterEnvironmentPlugin`, and `LightningCheckpointPlugin` for compatibility with `pytorch-lightning>=1.5.0`.
- Added a visualization of workspaces that can show step graphs while they're executing.

### Removed

- Removed the old `LightningPlugin` class.
- Removed the requirement of the `overrides` package.

### Changed

- Made it possible to construct a step graph out of `Step` objects, instead of constructing it out of `StepStub` objects.
- Removed dataset fingerprinting code, since we can now use `Step` to make sure things are cached.
- Made steps deterministic by default.
- Brought back `MemoryStepCache`, so we can run steps without configuring anything.
- W&B `torch::TrainCallback` logs with `step=step+1` now, so that training curves in the W&B dashboard match up with checkpoints saved locally and are easier to read (e.g. step 10000 instead of 9999).
- `filelock >= 3.4` is now required; the parameter `poll_intervall` of `tango.common.file_lock.FileLock.acquire` was renamed to `poll_interval`.

### Fixed

- Fixed bug in `FromParams` where a parameter to a `FromParams` class may not be instantiated correctly if it's a class with a generic type parameter.
## v0.3.6 - 2021-11-12

### Added

- Added a `.log_batch()` method on `torch::TrainCallback` which is given the average loss across distributed workers, but is only called every `log_every` steps.

### Removed

- Removed the `.pre_log_batch()` method on `torch::TrainCallback`.

### Fixed

- Fixed typo in parameter name `remove_stale_checkpoints` in `TorchTrainStep` (previously was `remove_state_checkpoints`).
- Fixed bug in `FromParams` that would cause failures when `from __future__ import annotations` was used with Python older than 3.10. See PEP 563 for details.

## v0.3.5 - 2021-11-05

### Fixed

- Fixed a bug in `FromParams` where the "type" parameter was ignored in some cases where the `Registrable` base class did not directly inherit from `Registrable`.
## v0.3.4 - 2021-11-04

### Added

- Added `StopEarlyCallback`, a `torch::TrainCallback` for early stopping.
- Added parameter `remove_stale_checkpoints` to `TorchTrainStep`.

### Changed

- Minor changes to the `torch::TrainCallback` interface.
- Weights & Biases `torch::TrainCallback` now logs the best validation metric score.

## v0.3.3 - 2021-11-04

### Added

- Added support for PEP 604 in `FromParams`, i.e. writing union types as `X | Y` instead of `Union[X, Y]` (see the sketch after this list).
- [internals] Added a spot for miscellaneous end-to-end integration tests (not to be confused with "tests of integrations") in `tests/end_to_end/`.
- [internals] Core tests now run on all officially supported Python versions.
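
A minimal sketch of the PEP 604 support (the class and its fields are hypothetical); on Python older than 3.10, the `from __future__ import annotations` line makes the `|` syntax legal:

```python
from __future__ import annotations

from tango.common import FromParams


class Corpus(FromParams):
    # `int | None` now works the same as Optional[int] / Union[int, None].
    def __init__(self, path: str, max_docs: int | None = None):
        self.path = path
        self.max_docs = max_docs


corpus = Corpus.from_params({"path": "data/train.txt", "max_docs": 100})
```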
### Fixed

- Fixed a bug in `FromParams` where non-`FromParams` class parameters were not instantiated properly (or at all).
- Fixed a bug in `FromParams` where kwargs were not passed on from a wrapper class to the wrapped class.
- Fixed small bug where some errors from git would be printed when executor metadata is created outside of a git repository.

## v0.3.2 - 2021-11-01

### Fixed

- Fixed a bug with `FromParams` that caused `.from_params()` to fail when the params contained an object that was already instantiated.
- The `tango` command no longer installs a SIGTERM handler, which fixes some bugs with integrations that use multiprocessing.

## v0.3.1 - 2021-10-29

### Changed

- Updated the `LightningTrainStep` to optionally take in a `LightningDataModule` as input.
## v0.3.0 - 2021-10-28

### Added

- Added `IterableDatasetDict`, a version of `DatasetDict` for streaming-like datasets.
- Added a PyTorch Lightning integration with `LightningTrainStep`.

### Fixed

- Fixed bug with `FromParams` and `Lazy` where extra arguments would sometimes be passed down through to a `Lazy` class when they shouldn't.

## v0.2.4 - 2021-10-22

### Added

- Added support for torch 1.10.0.

### Changed

- The `--file-friendly-logging` flag is now an option to the main `tango` command, so it needs to be passed before `run`, e.g. `tango --file-friendly-logging run ...`.

### Fixed

- Fixed bug with `Step.from_params`.
- Ensure logging is initialized in spawned processes during distributed training with `TorchTrainStep`.
## v0.2.3 - 2021-10-21

### Added

- Added support for a global settings file, `tango.yml`.
- Added "include_package" (array of string) param to the config spec.
- Added a custom error, `StopEarly`, that a `TrainCallback` can raise within the `TorchTrainStep` to stop training early without crashing (see the sketch after this list).
- Added step config, tango command, and tango version to executor metadata.
- Executor now also saves pip dependencies and conda environment files to the run directory for each step.
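
A hedged sketch of raising `StopEarly` from a callback; the hook name and signature are assumptions, so consult the `TrainCallback` docs:

```python
from tango.integrations.torch import StopEarly, TrainCallback


@TrainCallback.register("stop_at_step")  # hypothetical callback
class StopAtStep(TrainCallback):
    # NOTE: this hook name/signature is an assumption for illustration.
    def post_batch(self, step: int, batch_loss: float) -> None:
        if step >= 1000:
            raise StopEarly  # ends training cleanly, without crashing
```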
### Fixed

- Ensured `**kwargs` arguments are logged in `FromParams`.

## v0.2.2 - 2021-10-19

### Added

- Added new steps to the `datasets` integration: `ConcatenateDatasets` ("datasets::concatenate") and `InterleaveDatasets` ("datasets::interleave").
- Added `__contains__` and `__iter__` methods on `DatasetDict` so that it is now a `Mapping` class.
- Added a `tango info` command that, among other things, displays which integrations are installed.
## v0.2.1 - 2021-10-18

### Added

- Added the `convert_to_tango_dataset_dict()` function in the `datasets` integration. It's important for step caching purposes to use this to convert a HF `DatasetDict` to a native Tango `DatasetDict` when that `DatasetDict` is part of the input to another step. Otherwise the HF `DatasetDict` will have to be pickled to determine its hash.
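
A minimal sketch of the conversion (the dataset name is just a placeholder):

```python
import datasets
from tango.integrations.datasets import convert_to_tango_dataset_dict

# Load a Huggingface DatasetDict.
hf_dataset_dict = datasets.load_dataset("squad")

# Convert it to a native Tango DatasetDict so that steps taking it as
# input can compute a cheap, deterministic hash instead of pickling it.
dataset_dict = convert_to_tango_dataset_dict(hf_dataset_dict)
```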
### Changed

- `Format.checksum()` is now an abstract method. Subclasses should only compute the checksum on the serialized artifact and nothing else in the directory.
- [internals] Changed the relationship between `Executor`, `StepCache`, and `Step`. `Executor` now owns the `StepCache`, and `Step` never interacts with the `StepCache` directly.
## v0.2.0 - 2021-10-15

### Added

- Added a Weights & Biases integration with a training callback ("wandb::log") for `TorchTrainStep` ("torch::train") that logs training and validation metrics to W&B.

### Fixed

- Fixed `Format.checksum()` when there is a symlink to a directory in the cache folder.

## v0.1.3 - 2021-10-15

### Added

- Added the ability to track a metric other than "loss" for validation in `TorchTrainStep` ("torch::train").

### Fixed

- Final model returned from `TorchTrainStep` ("torch::train") will have the best weights loaded.
- Checkpoints are saved from `TorchTrainStep` ("torch::train") even when there is no validation loop.
- Fixed `TorchTrainStep` ("torch::train") when `validation_split` is `None`.
- Fixed distributed training with `TorchTrainStep` ("torch::train") on GPU devices.

## v0.1.2 - 2021-10-13

### Added

- Added support for YAML configuration files.
## v0.1.1 - 2021-10-12

### Added

- `TorchTrainStep` now displays a progress bar while saving a checkpoint to file.
- The default executor now saves an `executor-metadata.json` file to the directory for each step.

### Changed

- Renamed `DirectoryStepCache` to `LocalStepCache` (registered as "local").
- `LocalStepCache` saves metadata to `cache-metadata.json` instead of `metadata.json`.

### Fixed

- Fixed bug with `TorchTrainStep` during distributed training.
- `FromParams` will now automatically convert strings into `Path` types when the annotation is `Path`.
## v0.1.0 - 2021-10-11

### Added

- Added `StepGraph` and `Executor` abstractions.
- Added a basic PyTorch training step registered as `"torch::train"`, along with other registrable components, such as `Model`, `DataLoader`, `Sampler`, `DataCollator`, `Optimizer`, and `LRScheduler`.
- Added `DatasetRemixStep` in `tango.steps`.
- Added module `tango.common.sequences`.
- Added `DatasetDict` class in `tango.common.dataset_dict`.
- Added 🤗 Datasets integration.
- Added command-line options to set the log level or disable logging completely.

### Changed

- `Step.work_dir`, `Step.unique_id`, `Step.dependencies`, and `Step.recursive_dependencies` are now properties instead of methods.
- The `tango run` command will acquire a lock on the directory to avoid race conditions.
- Integrations can now be installed with `pip install tango[INTEGRATION_NAME]`. For example, `pip install tango[torch]`.
- Added method `Registrable.search_modules()` for automatically finding and importing the modules where a given `name` might be registered.
- `FromParams.from_params()` and `Registrable.resolve_class_name()` will now call `Registrable.search_modules()` to automatically import modules where the type might be defined. Thus, for classes that are defined and registered within any `tango.*` submodules, it is not necessary to explicitly import them.

### Fixed

- `Step` implementations can now take arbitrary `**kwargs` in their `run()` methods.
## v0.0.3 - 2021-09-27

### Added

- Added the `tango` command.

## v0.0.2 - 2021-09-27

### Added

- Ported over core tango components from AllenNLP.

## v0.0.1 - 2021-09-22

### Added

- Added initial project boilerplate.