Changelog#
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Unreleased#
v1.3.2 - 2023-10-27#
Fixed#
Fixed issues with gcloud auth in the Beaker executor.
v1.3.1 - 2023-10-25#
Fixed#
Fixed minor bugs in GSWorkspace().
Changed#
Added CLI-style execution functions for experiments defined in Python.
Added display() to ExecutorOutput for producing a table that summarizes the run.
v1.3.0 - 2023-10-13#
Added#
Added the Workspace.remove_step() method to safely remove steps.
The GSWorkspace() can now be initialized with Google Cloud bucket subfolders.
Changed#
The BeakerExecutor now uses the HEAD commit at the time the executor is instantiated to execute a step, instead of the HEAD commit at the time the step is run.
Fixed#
Removed unnecessary code coverage dev requirements.
Fixed issue where new version of torch caused no LR schedulers to be registered.
Updated pinned versions of jax, jaxlib, and flax.
v1.2.1 - 2023-04-06#
Added#
Added the following workspace methods to support the Tango viz UI: Workspace.search_registered_runs(), Workspace.search_step_info(), Workspace.num_registered_runs(), and Workspace.num_steps().
Fixed#
Fixed a bug where FromParams would fail to parse when an object takes a Step argument directly.
Changed a name so we don't override the built-in name set.
Fixed a bug that would cause O(n^2) memory consumption in dense step graphs.
v1.2.0 - 2023-02-10#
Added#
You can now add arguments to steps without invalidating the cache. See Step.SKIP_DEFAULT_ARGUMENTS.
Fixed integration status messages in the tango info command.
Added abstractions for RemoteClient, RemoteStepCache, and RemoteWorkspace.
Added a GS integration that comes with GSWorkspace, a remote Workspace implementation that uses Google Cloud Storage.
You can now bind functional steps to the underlying Step instance with @step(bind=True), meaning the first argument to the function will be a Step.
Added ShellStep for running arbitrary shell commands.
Added @make_registrable decorator to make arbitrary functions registrable, to make it easier to refer to them in Tango configurations.
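As a rough illustration of why skipping default-valued arguments keeps cached results valid, the sketch below hashes a step's arguments after dropping any argument that is still at a declared default. It is not Tango's implementation: the helper unique_id and the hashing scheme are assumptions made for this example, and only the name SKIP_DEFAULT_ARGUMENTS comes from the entry above.

```python
import hashlib
import json
from typing import Any, Dict


def unique_id(step_name: str, kwargs: Dict[str, Any], skip_default_arguments: Dict[str, Any]) -> str:
    """Hypothetical helper: hash a step's arguments, ignoring those left at a skipped default."""
    effective = {
        name: value
        for name, value in kwargs.items()
        # Drop an argument only when it is listed AND still has its default value.
        if not (name in skip_default_arguments and skip_default_arguments[name] == value)
    }
    payload = json.dumps(effective, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"{step_name}-{digest}"


# Before the new argument existed, the step was called with just `x`.
old_id = unique_id("AddOne", {"x": 1}, {})
# After adding `verbose` (default False), the ID is unchanged while the default is kept...
same_id = unique_id("AddOne", {"x": 1, "verbose": False}, {"verbose": False})
# ...and changes only when the caller overrides the default.
new_id = unique_id("AddOne", {"x": 1, "verbose": True}, {"verbose": False})
```

Here old_id and same_id are equal, so results cached before the argument was introduced would still be found, while new_id differs.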
Fixed#
Jsonnet parsing is now much faster and works on Windows.
Warnings about locks are now reliably printed every 30 seconds.
We now make sure Beaker jobs have the latest version of beaker-py, so that we're compatible with the latest API changes.
Stopping early now works when the metric doesn't change at all.
Fixed bug with FromParams which didn't handle variable length tuples correctly.
Changed#
The default log level for Tango is now warning.
You can specify multiple steps with -s from the tango run command.
v1.1.0 - 2022-12-01#
Added#
Added gpu_type field to StepResources. The BeakerExecutor can use this to determine which clusters to submit a step to.
Added machine field to StepResources. You can set this to "local" when using the BeakerExecutor to force it to run the step locally.
Added --ext-var argument to tango run for setting JSONNET external variables when loading the experiment config.
Added @step() decorator to create Step classes from functions.
Added the transformers::with_soft_prompt integration, to make soft-prompted prefix transformers easy.
Removed#
Removed PyTorch Lightning integration.
Removed tango server command and --serve/--no-serve option for tango run.
Removed source_release.py, which was checked in by accident.
Fixed#
Fixed issue where Executor parallelism option in a Tango settings file would be ignored.
Fixed a bug where the unique ID of a step that depends on a key-value of the result of another step could change if the name of the other step changes.
Fixed a bug where importing certain libraries (like torchmetrics) would mess with our exception handling because they set sys.excepthook for some reason. Now we always reset sys.excepthook after importing.
The type hints for the flax trainer suggested that the training split is optional when in fact it's mandatory.
Made BeakerWorkspace / BeakerStepLock more robust when a job is preempted.
Minor performance improvements for the Beaker executor and workspace.
Fixed bug with step_extra_dependencies where uncacheable dependencies wouldn't be run.
v1.0.2 - 2022-11-14#
Changed#
BeakerScheduler can now return a list of clusters.
v1.0.1 - 2022-10-20#
Fixed#
LightningTrainStep now can take a Lazy model object which results in a guaranteed deterministic hash.
Fixed issue where remote Workspace implementations like WandbWorkspace and BeakerWorkspace would use the same local cache regardless of the W&B / Beaker workspace being used.
Fixed bug with TorchEvalStep when constructing callbacks.
Fixed some import error issues caused when an integration is not installed.
Fixed incorrect reporting of final results in MulticoreExecutor.
Changed#
Wandb step cache retries the API call in case of timeout.
beaker-py >= 1.11 required.
v1.0.0 - 2022-10-05#
Added#
Added step_extra_dependencies input field to Step class that can be used to force a dependency on another step even if the current step doesn't directly depend on the output of the other step. See #418 for more context.
Changed#
beaker-py >= 1.10 required.
Fixed#
Long log lines will be soft-wrapped to ensure that links are clickable.
Fixed a bug where some workspaces could be left in a bad state if a step's Format failed to serialize the step's result in Workspace.step_finished().
Sometimes functions and methods end up as arguments to steps, which means we have to hash them. Instead of taking a hash of the function, we now take a hash of the function's module and name.
Fixed a bug with the Beaker executor where it would hang at the end of a run if a step failed that is a dependency of another step.
Fixed tests to work with new version of transformers.
Fixed Executor.execute_sub_graph_for_step() to be able to run the step's dependencies in parallel.
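The idea of hashing a function by its module and name, mentioned in the entry above, can be sketched as follows. This is a minimal illustration and not Tango's actual hashing code; the helper function_fingerprint and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
from typing import Callable


def function_fingerprint(fn: Callable) -> str:
    """Hypothetical helper: identify a function by where it is defined, not by the object itself."""
    identity = f"{fn.__module__}.{fn.__qualname__}"
    return hashlib.sha256(identity.encode("utf-8")).hexdigest()


def tokenize(text: str) -> list:
    return text.split()


def normalize(text: str) -> str:
    return text.lower()


# The same function always produces the same fingerprint, and it stays the same
# across processes because it depends only on two strings.
fp_first = function_fingerprint(tokenize)
fp_second = function_fingerprint(tokenize)
fp_other = function_fingerprint(normalize)
```

One consequence of this design is that editing a function's body does not change its fingerprint; only moving or renaming it does.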
v0.14.0 - 2022-09-20#
Added#
Added a function to modify a Hugging Face transformer with IA3 adaptors.
Added a BeakerScheduler registrable class, specified as the argument scheduler to BeakerExecutor, which controls the resources assigned to steps run on Beaker. Users can implement their own BeakerScheduler subclasses to customize the resource assignment behavior.
Changed#
In the tango run command, --no-server is now the default. Use --server to start the server.
Fixed#
Made BeakerExecutor more robust to connection, timeout, SSL, and other recoverable HTTP errors.
Made the BeakerStepLock more robust, and as a result BeakerWorkspace is more robust and should require less manual intervention for locks in a bad state.
Fixed a bug with the internal scheduling logic of the BeakerExecutor which could delay submitting some steps in parallel.
Fixed a bug where creating a StepInfo object from params might result in unnecessary imports.
Fixed a bug where canceling the Beaker executor might not work properly.
Fixed a bug where the trainer trains too much when train_epochs is set and you're using gradient accumulation.
Fixed a bug where included modules might not be found when using multiprocessing when they're not on sys.path / PYTHONPATH.
Fixed how the results of uncacheable steps are displayed by tango run.
Beaker executor won't run duplicate cacheable steps at the same time.
v0.13.0 - 2022-09-07#
Added#
You can now reference into a particular index of the result of another step in a config. For example: {type: "ref", ref: "some_previous_step", key: 0}. The key field can be an integer if the result of the referenced step is a list or tuple, or a string if the result of the referenced step is a dictionary.
Added priority parameter to Beaker executor for setting the default task priority for Beaker jobs.
Added Workspace.step_result() method for getting a step's result from the latest run.
tango run will now display a URL to the logs for failed steps when you use the BeakerExecutor.
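To make the "ref" with "key" behaviour described above concrete, here is a small sketch of how such a reference could be resolved against already-computed step results. The resolve_ref function and the results dictionary are hypothetical stand-ins for illustration, not part of Tango's API.

```python
from typing import Any, Dict, Mapping


def resolve_ref(ref: Mapping[str, Any], results: Dict[str, Any]) -> Any:
    """Hypothetical resolver: look up a step's result, then optionally index into it."""
    result = results[ref["ref"]]
    if "key" in ref:
        # An integer key indexes a list or tuple; a string key indexes a dictionary.
        return result[ref["key"]]
    return result


# Pretend these are the cached results of two earlier steps.
results = {
    "some_previous_step": ["train", "dev"],
    "metrics_step": {"accuracy": 0.9},
}

first = resolve_ref({"type": "ref", "ref": "some_previous_step", "key": 0}, results)
accuracy = resolve_ref({"type": "ref", "ref": "metrics_step", "key": "accuracy"}, results)
whole = resolve_ref({"type": "ref", "ref": "metrics_step"}, results)
```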
Changed#
The TorchTrainStep now enables monitoring arbitrary model outputs during training. TorchTrainEngine.forward_train now returns a tuple loss, model_outputs for each micro batch and the list of model outputs for all micro batches in a batch is passed to the TrainCallback.log_batch and TrainCallback.post_batch.
Tango will now automatically search Python modules in the current working directory for registered classes so that you don't always need to use the --include-package setting.
The minimum supported Python version is now 3.8.
Added support for PyTorch Lightning 1.7.x.
The Beaker Executor will no longer live-stream logs from Beaker jobs, but logs will be viewable on Beaker and more readable.
Only the Beaker executor requires a clean working directory.
Fixed#
Fixed a bug that did not allow a wandb artifact's type to be set from a step's metadata dictionary.
Fixed a bug with how the Beaker executor streams log lines from Beaker which sometimes resulted in messages missing some starting characters, and tqdm lines being duplicated.
Fixed a bug in the Beaker workspace where the lock dataset wouldn't be removed if the step was found to be in an invalid state.
Improved cluster choice logic in BeakerExecutor to ensure greater diversity of clusters when submitting many steps at once.
Fixed bug where sub-processes of the multicore executor would use the wrong executor if executor was defined in a tango.yml file.
Deterministic hashes for numpy and torch tensors were not deterministic. Now they are.
v0.12.0 - 2022-08-23#
Added#
Step resources:
Added a step_resources parameter to the Step class which should be used to describe the computational resources required to run a step. Executor implementations can use this information. For example, if your step needs 2 GPUs, you should set step_resources=StepResources(gpu_count=2) ("step_resources": {"gpu_count": 2} in the configuration language).
Added a Step.resources() property method. By default this returns the value specified by the step_resources parameter. If your step implementation always requires the same resources, you can just override this method so you don't have to provide the step_resources parameter.
Step execution:
Added an executor field to the tango.yml settings. You can use this to define the executor you want to use by default.
Added a Beaker Executor to the Beaker integration, registered as an Executor with the name "beaker". To use this executor, add these lines to your tango.yml file:
    executor:
      type: beaker
      beaker_workspace: ai2/my-workspace
      clusters:
        - ai2/general-cirrascale
See the docs for the BeakerExecutor for more information on the input parameters.
Step class:
Added a metadata field to the step class API. This can be set through the class variable METADATA or through the constructor argument step_metadata.
Weights & Biases integration:
You can now change the artifact kind for step result artifacts by adding a field called "artifact_kind" to a step's metadata. For models, setting "artifact_kind" to "model" will add the corresponding artifact to W&B's new model zoo.
Changed#
CLI:
The tango run command will throw an error if you have uncommitted changes in your repository, unless you use the --allow-dirty flag.
The tango run command will use the lightweight base executor (single process) by default. To use the multi-process executor, set -j/--parallelism to 1 or higher or -1 to use all available CPU cores.
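The -j/--parallelism rule described above can be read as a small decision table. The sketch below encodes one plausible reading; pick_executor and its return values are hypothetical, and treating 0 the same as "not set" is an assumption of this example rather than something this entry states.

```python
import os
from typing import Optional, Tuple


def pick_executor(parallelism: Optional[int]) -> Tuple[str, int]:
    """Hypothetical mapping from a -j/--parallelism value to (executor kind, worker count)."""
    if parallelism is None or parallelism == 0:
        # Assumption: no value (or 0) means the lightweight single-process executor.
        return ("default", 1)
    if parallelism == -1:
        # -1 means "use all available CPU cores".
        return ("multicore", os.cpu_count() or 1)
    if parallelism >= 1:
        return ("multicore", parallelism)
    raise ValueError(f"unsupported parallelism value: {parallelism}")


unset = pick_executor(None)
four_workers = pick_executor(4)
all_cores = pick_executor(-1)
```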
Fixed#
Fixed bug where StepInfo environment and platform metadata could be out-of-date if a step is run again due to failure.
Fixed a bug where an unfortunate combination of early stopping and decreasing model performance could result in a crash in the torch trainer.
v0.11.0 - 2022-08-04#
Added#
Added a Flax integration along with an example config.
v0.10.1 - 2022-07-26#
Fixed#
Fixed issue where the StepInfo config argument could be parsed into a Step.
Restored capability to run tests out-of-tree.
v0.10.0 - 2022-07-07#
Changed#
Renamed workspace parameter of BeakerWorkspace class to beaker_workspace.
Executor class is now a Registrable base class. MulticoreExecutor is registered as "multicore".
Removed#
Removed StepExecutionMetadata. Its fields have been absorbed into StepInfo.
Fixed#
Improved Step.ensure_result() such that the step's result doesn't have to be read from the cache.
Fixed an issue with the output from MulticoreExecutor such that it's now consistent with the default Executor for steps that were found in the cache.
One of our error messages referred to a configuration file that no longer exists.
Improved performance of BeakerWorkspace.
Added#
Added the ability to train straight Model instead of just Lazy[Model].
v0.9.1 - 2022-06-24#
Fixed#
Fixed non-deterministic behavior in TorchTrainStep.
Fixed bug in BeakerWorkspace where .step_info(step) would raise a KeyError if the step hasn't been registered as part of a run yet.
Fixed a bug in BeakerWorkspace where it would send too many requests to the beaker service.
Fixed a bug where WandbWorkspace.step_finished() or .step_failed() would crash if called from a different process than .step_starting().
Fixed a bug in WandbWorkspace.step_finished() which led to a RuntimeError sometimes while caching the result of a step.
v0.9.0 - 2022-06-01#
Added#
Added a Beaker integration that comes with BeakerWorkspace, a remote Workspace implementation that uses Beaker Datasets under the hood.
Added a datasets::dataset_remix step that provides the split remixing functionality of tango.steps.dataset_remix.DatasetRemixStep now for Huggingface DatasetDict.
Added a config and code example of Registrable to the First Step docs with edits for clarity.
Changed#
If you try to import something from a tango integration that is not fully installed due to missing dependencies, an IntegrationMissingError will be raised instead of ModuleNotFound.
You can now set -j 0 in tango run to disable multicore execution altogether.
Fixed#
Improved how steps and workspaces handle race conditions when different processes are competing to execute the same step. This would result in a RuntimeError before with most workspaces, but now it's handled gracefully.
Fixed bug which caused GradScaler state to not be saved and loaded with checkpoints.
v0.8.0 - 2022-05-19#
Added#
Added a Weights & Biases remote Workspace implementation: WandbWorkspace, registered as "wandb". This can be instantiated from a workspace URL in the form "wandb://entity/project".
Added a method Workspace.step_result_for_run which gives the result of a step given the run name and step name within that run.
Added property Workspace.url, which returns a URL for the workspace that can be used to instantiate the exact same workspace using Workspace.from_url(). Subclasses must implement this.
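A workspace URL of the form "wandb://entity/project" splits cleanly into a scheme, an entity, and a project with the standard library. The sketch below shows only that parsing step; parse_workspace_url is a hypothetical helper, and how Workspace.from_url() dispatches on the scheme internally is not shown in this changelog.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_workspace_url(url: str) -> Tuple[str, str, str]:
    """Hypothetical helper: split 'scheme://entity/project' into its three parts."""
    parsed = urlparse(url)
    # For 'wandb://entity/project': scheme='wandb', netloc='entity', path='/project'.
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


scheme, entity, project = parse_workspace_url("wandb://entity/project")
```

A registry keyed on the scheme (for example "wandb") could then pick the workspace class and pass it the remaining parts.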
Changed#
StepInfo start and end times will always be in UTC now.
WandbTrainCallback now logs system metrics from each worker process in distributed training.
StepCache.__contains__() and StepCache.__getitem__() now accept either a Step or StepInfo as an argument (Union[Step, StepInfo]).
Refactored tango.step_graph.StepGraph to allow initialization from a Dict[str, Step].
Executor.execute_step_graph() now attempts to execute all steps and summarizes success/failures.
Fixed#
Fixed bug with LocalWorkspace.from_parsed_url() (#278).
Deprecation warnings will now be logged from tango CLI.
Fixed the text format in the case of serializing an iterator of string.
Added missing default value of None to TangoGlobalSettings.find_or_default().
Mypy has become incompatible with transformers and datasets, so we have to disable the checks in some places.
The VERSION member of step arguments that were wrapped in Lazy was not respected. Now it is.
v0.7.0 - 2022-04-19#
Added#
Added the "-n/--name" option to tango run. This option allows the user to give the run an arbitrary name.
Added a convenience property .workspace to Step class that can be called from a step's .run() method to get the current Workspace being used.
Gave FromParams objects (which includes all Registrable objects) the ability to version themselves.
Added CLI option to run a single step in a config using --step-name or -s.
Added a MultiCoreExecutor that executes steps in parallel.
Added an ExecutorOutput dataclass that is returned by Executor.execute_step_graph().
StepGraph now prints itself in a readable way.
Tango now automatically detects when it's running under a debugger, and disables multicore support accordingly. Many debuggers can't properly follow sub-processes, so this is a convenience for people who love debuggers.
Added more models to the stuff we can import from the transformers library.
Added new example for finetuning text-to-text models.
Changed#
Renamed click_logger to cli_logger, and we now use rich's logging Handler as the default handler, which means prettier output, better tracebacks, and you can use rich's markup syntax with the cli_logger to easily add style to text.
Refactored tango.step_graph.StepGraph to allow initialization from a Dict[str, Step].
Executor.execute_step_graph() now attempts to execute all steps and summarizes success/failures.
Upgraded PyTorch version in tango Docker image to latest v1.11.0+cu113.
RunGeneration now allows model object as input.
Fixed#
Fixed bug that mistakenly disallowed fully-qualified names containing "_" (underscores) in the config.
Fixed bug where TorchTrainStep working directory would be left in an unrecoverable state if training failed after saving the final model weights.
Fixed bug in FromParams where **kwargs might be passed down to the constructors of arguments.
Fixed bug in the way dependencies are tracked between steps.
Fixed bug that caused MulticoreExecutor to hang in case of a failing step that was required recursively (not directly) downstream.
Compatibility with PyTorch Lightning 1.6.
v0.6.0 - 2022-02-25#
Added#
New example that finetunes a pre-trained ResNet model on the Cats & Dogs dataset.
Added a "@requires_gpus" decorator for marking tests as needing GPUs. Tests marked with this will be run in the "GPU Tests" workflow on dual k80 GPUs via Beaker.
Added the "-w/--workspace" option to tango run and tango server commands. This option takes a path or URL, and instantiates the workspace from the URL using the newly added Workspace.from_url() method.
Added the "workspace" field to TangoGlobalSettings.
Added the "environment" field to TangoGlobalSettings for setting environment variables each time tango is run.
Added a utility function to get a StepGraph directly from a file.
Added tango.settings module and tango settings group of commands.
A format for storing sequences as SqliteSparseSequence.
A way to massage kwargs before they determine the unique ID of a Step.
Changed#
local_workspace.ExecutorMetadata renamed to StepExecutionMetadata and now saved as execution-metadata.json.
tango run without the option "-w/--workspace" or "-d/--workspace-dir" will now use a MemoryWorkspace instead of a LocalWorkspace in a temp directory, unless you've specified a default workspace in a TangoGlobalSettings file.
Moved tango.workspace.MemoryWorkspace and tango.local_workspace.LocalWorkspace to tango.workspaces.*.
Moved tango.step_cache.MemoryStepCache and tango.step_cache.LocalStepCache to tango.step_caches.*.
Deprecated the -d/--workspace-dir command-line option. Please use -w/--workspace instead.
Fixed#
Fixed a small bug where LocalWorkspace would fail to capture the conda environment in our Docker image.
Fixed activation of FILE_FRIENDLY_LOGGING when set from the corresponding environment variable.
Fixed setting log level via the environment variable TANGO_LOG_LEVEL.
Use relative paths within the work_dir for symbolic links to the latest and the best checkpoints in TorchTrainStep.
Fixed some scenarios where Tango can hang after finishing all steps.
distributed_port and log_every parameters won't factor into TorchTrainStep's unique ID.
MappedSequence now works with slicing.
MappedSequence now works with Huggingface Dataset.
Uncacheable steps are now visible in Tango UI.
Fixed bug in Registrable.list_available() where an error might be raised if the default implementation hadn't been explicitly imported.
Fixed issue where having a default argument to the run() method wasn't getting applied to the step's unique ID.
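For readers unfamiliar with the idea behind a mapped sequence that supports slicing, the following is a minimal stand-in. LazyMappedSequence is a hypothetical class written for this example and is not Tango's MappedSequence; it only shows how a slice can return another lazy wrapper instead of forcing evaluation.

```python
from collections.abc import Sequence
from typing import Any, Callable


class LazyMappedSequence(Sequence):
    """Hypothetical sketch: apply `fn` to items of `inner` only when they are accessed."""

    def __init__(self, fn: Callable[[Any], Any], inner: Sequence):
        self.fn = fn
        self.inner = inner

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slice the underlying sequence and stay lazy.
            return LazyMappedSequence(self.fn, self.inner[index])
        return self.fn(self.inner[index])

    def __len__(self) -> int:
        return len(self.inner)


squares = LazyMappedSequence(lambda x: x * x, [1, 2, 3, 4])
middle = squares[1:3]  # still lazy; nothing has been squared yet
middle_values = list(middle)
last = squares[3]
```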
v0.5.0 - 2022-02-09#
Added#
Added TrainingEngine abstraction to torch integration.
Added FairScale with a FairScaleTrainingEngine that leverages FairScale's FullyShardedDataParallel. This is meant to be used within the TorchTrainStep.
All PyTorch components (such as learning rate schedulers, optimizers, data collators, etc) from the transformers library are now registered under the corresponding class in the torch integration. For example, transformers Adafactor optimizer is registered as an Optimizer under the name "transformers::Adafactor". More details can be found in the documentation for the transformers integration.
Changed#
Various changes to the parameters of the TorchTrainStep due to the introduction of the TrainingEngine class.
Params logged as DEBUG level instead of INFO to reduce noise in logs.
The waiting message for FileLock is now clear about which file it's waiting for.
Added an easier way to get the default Tango global config.
Most methods to TorchTrainCallback also take an epoch parameter now.
WandbTrainCallback now logs peak GPU memory occupied by PyTorch tensors per worker. This is useful because W&B's system metrics only display the total GPU memory reserved by PyTorch, which is always higher than the actual amount of GPU memory occupied by tensors. So these new metrics give a more accurate view into how much memory your training job is actually using.
Plain old Python functions can now be used in Lazy objects.
LocalWorkspace now creates a symlink to the outputs of the latest run.
Tango is now better at guessing when a step has died and should be re-run.
Tango is now more lenient about registering the same class under the same name twice.
When you use dict instead of Dict in your type annotations, you now get a legible error message. Same for List, Tuple, and Set.
Fixed#
Fixed a bug in Registrable and FromParams where registered function constructors would not properly construct arguments that were classes.
Fixed a bug in FromParams that would cause a crash when an argument to the constructor had the name params.
Made FromParams more efficient by only trying to parse the params as a Step when it looks like it actually could be a step.
Fixed bug where Executor would crash if git command could not be found.
Fixed bug where validation settings were not interpreted the right way by the torch trainer.
When you register the same name twice using Registrable, you get an error message. That error message now contains the correct class name.
v0.4.0 - 2022-01-27#
Changed#
Default log level is WARNING instead of ERROR.
The web UI now renders the step graph left-to-right.
The web UI now shows runs by date, with the most recent run at the top.
The web UI now shows steps in a color-coded way.
The tango run command now prints user-friendly paths if possible.
The --include-package flag now also accepts paths instead of module names.
tango.common.sqlite_sparse_sequence.SqliteSparseSequence now lives at tango.common.sequences.SqliteSparseSequence.
Fixed#
Ensure tqdm log lines always make it into the log file out.log even when log level is WARNING or ERROR.
Numerous parts of Tango now have documentation when they didn't before.
v0.4.0rc5 - 2022-01-19#
Added#
Added TorchEvalStep to torch integration, registered as "torch::eval".
Changed#
Renamed aggregate_val_metric to auto_aggregate_val_metric in TorchTrainStep.
devices parameter to TorchTrainStep replaced with device_count: int.
Run name printed at the end of a run so it's easier to find.
Type information added to package data. See PEP 561 for more information.
A new integration, transformers, with two new steps for running seq2seq models.
Added logging_tqdm, if you don't want a progress bar, but you still want to see progress in the logs.
Added threaded_generator(), for wrapping generators so that they run in a separate thread from the generator's consumer.
Added a new example for evaluating the T0 model on XSum, a summarization task.
Added MappedSequence for functionally wrapping sequences.
Added TextFormat, in case you want to store the output of your steps in raw text instead of JSON.
Steps can now list arguments in SKIP_ID_ARGUMENTS to indicate that the argument should not affect a step's unique id. This is useful for arguments that affect the execution of a step, but not the output.
Step now implements __str__, so steps look pretty in the debugger.
Added DatasetCombineStep, a step that combines multiple datasets into one.
Added common.logging.initialize_worker_logging() function for configuring logging from worker processes/threads.
Logs from tango run ... will be written to a file called out.log in the run directory.
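The general technique behind running a generator in a separate thread from its consumer, as threaded_generator() is described above, is a bounded queue between a producer thread and the consuming loop. The sketch below illustrates that technique only; run_in_thread, its max_queue_size parameter, and the sentinel are assumptions of this example and do not reflect the signature or internals of Tango's function.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def run_in_thread(source: Iterable[T], max_queue_size: int = 8) -> Iterator[T]:
    """Hypothetical sketch: produce items from `source` on a background thread."""
    q: "queue.Queue" = queue.Queue(maxsize=max_queue_size)

    def produce() -> None:
        try:
            for item in source:
                q.put(item)  # blocks when the consumer falls behind
        finally:
            # Always signal completion, even if `source` raised.
            # Note: this sketch does not re-raise producer errors in the consumer.
            q.put(_DONE)

    thread = threading.Thread(target=produce, daemon=True)
    thread.start()
    while True:
        item = q.get()
        if item is _DONE:
            break
        yield item
    thread.join()


items = list(run_in_thread(x * 2 for x in range(5)))
```

The bounded queue keeps the producer from running arbitrarily far ahead of the consumer, which matters when items are large.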
Fixed#
Fixed torch StopEarlyCallback state not being recovered properly on restarts.
Fixed file friendly logging by removing special styling characters.
Ensured exceptions are captured in logs.
LocalWorkspace now works properly with uncacheable steps.
When a Tango run got killed hard, with kill -9, or because the machine lost power, LocalWorkspace would sometimes keep a step marked as "running", preventing further executions. This still happens sometimes, but it is now much less likely (and Tango gives you instructions for how to fix it).
To make all this happen, LocalWorkspace now saves step info in a Sqlite database. Unfortunately that means that the workspace format changes and existing workspace directories won't work properly with it.
Fixed premature cleanup of temporary directories when using MemoryWorkspace.
v0.4.0rc4 - 2021-12-20#
Fixed#
Fixed a bug where StepInfo fails to deserialize when error is an exception that can't be pickled.
v0.4.0rc3 - 2021-12-15#
Added#
Added DatasetsFormat format and LoadStreamingDataset step to datasets integration.
SqliteDictFormat for datasets.
Added pre_epoch() and post_epoch() callback methods to PyTorch TrainCallback.
Changed#
LoadDataset step from datasets integration is now cacheable, using the DatasetsFormat format by default. But this only works with non-streaming datasets. For streaming datasets, you should use the LoadStreamingDataset step instead.
Fixed#
Fixed bug where KeyboardInterrupt exceptions were not handled properly by steps and workspaces.
WandbTrainCallback now will use part of the step's unique ID as the name for the W&B run by default, to make it easier to identify which tango step corresponds to each run in W&B.
WandbTrainCallback will save the entire TrainConfig object to the W&B config.
v0.4.0rc2 - 2021-12-13#
Added#
Sample experiment configurations that prove Euler's identity.
Changed#
Loosened Click dependency to include v7.0.
Loosened datasets dependency.
Tightened petname dependency to exclude next major release for safety.
Fixed#
Workspace, MemoryWorkspace, and LocalWorkspace can now be imported directly from the tango base module.
Uncacheable leaf steps would never get executed. This is now fixed.
We were treating failed steps as if they were completed by accident.
The visualization had a problem with showing steps that never executed because a dependency failed.
Fixed a bug where Lazy inputs to a Step would fail to resolve arguments that come from the result of another step.
Fixed a bug in TorchTrainStep where some arguments for distributed training (devices, distributed_port) weren't being set properly.
v0.4.0rc1 - 2021-11-30#
Added#
Introduced the concept of the Workspace, with LocalWorkspace and MemoryWorkspace as initial implementations.
Added a stub of a webserver that will be able to visualize runs as they happen.
Added separate classes for LightningTrainingTypePlugin, LightningPrecisionPlugin, LightningClusterEnvironmentPlugin, LightningCheckpointPlugin for compatibility with pytorch-lightning>=1.5.0.
Added a visualization of workspaces that can show step graphs while they're executing.
Removed#
Removed old LightningPlugin class.
Removed requirement of the overrides package.
Changed#
Made it possible to construct a step graph out of Step objects, instead of constructing it out of StepStub objects.
Removed dataset fingerprinting code, since we can now use Step to make sure things are cached.
Made steps deterministic by default.
Brought back MemoryStepCache, so we can run steps without configuring anything.
W&B torch::TrainCallback logs with step=step+1 now so that training curves in the W&B dashboard match up with checkpoints saved locally and are easier to read (e.g. step 10000 instead of 9999).
filelock >= 3.4 required, parameter poll_intervall to tango.common.file_lock.FileLock.acquire renamed to poll_interval.
Fixed#
Fixed bug in FromParams where a parameter to a FromParams class may not be instantiated correctly if it's a class with a generic type parameter.
v0.3.6 - 2021-11-12#
Added#
Added a .log_batch() method on torch::TrainCallback which is given the average loss across distributed workers, but only called every log_every steps.
Removed#
Removed .pre_log_batch() method on torch::TrainCallback.
Fixed#
Fixed typo in parameter name remove_stale_checkpoints in TorchTrainStep (previously was remove_state_checkpoints).
Fixed bug in FromParams that would cause failures when from __future__ import annotations was used with Python older than 3.10. See PEP 563 for details.
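The difficulty PEP 563 creates for code that inspects annotations is that they arrive as strings rather than types. The sketch below simulates that situation with explicit string annotations (which behave the same way as annotations under the __future__ import) and shows typing.get_type_hints() turning them back into real types. The function make_config is hypothetical, and this is not a description of how FromParams resolves annotations.

```python
import typing


def make_config(size: "int", name: "typing.Optional[str]" = None) -> "dict":
    """Hypothetical function whose annotations are stored as plain strings."""
    return {"size": size, "name": name}


# Raw annotations are just strings, so isinstance/issubclass checks on them would fail.
raw = make_config.__annotations__

# get_type_hints evaluates the strings in the function's global namespace.
resolved = typing.get_type_hints(make_config)
```

After this runs, raw["size"] is the string "int", while resolved["size"] is the int type itself.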
v0.3.5 - 2021-11-05#
Fixed#
Fixed a bug in FromParams where the "type" parameter was ignored in some cases where the Registrable base class did not directly inherit from Registrable.
v0.3.4 - 2021-11-04#
Added#
Added StopEarlyCallback, a torch::TrainCallback for early stopping.
Added parameter remove_stale_checkpoints to TorchTrainStep.
Changed#
Minor changes to torch::TrainCallback interface.
Weights & Biases torch::TrainCallback now logs best validation metric score.
v0.3.3 - 2021-11-04#
Added#
Added support for PEP 604 in FromParams, i.e. writing union types as "X | Y" instead of "Union[X, Y]".
[internals] Added a spot for miscellaneous end-to-end integration tests (not to be confused with "tests of integrations") in tests/end_to_end/.
[internals] Core tests now run on all officially supported Python versions.
Fixed#
Fixed a bug in FromParams where non-FromParams class parameters were not instantiated properly (or at all).
Fixed a bug in FromParams where kwargs were not passed on from a wrapper class to the wrapped class.
Fixed small bug where some errors from git would be printed when executor metadata is created outside of a git repository.
v0.3.2 - 2021-11-01#
Fixed#
Fixed a bug with FromParams that caused .from_params() to fail when the params contained an object that was already instantiated.
tango command no longer installs a SIGTERM handler, which fixes some bugs with integrations that use multiprocessing.
v0.3.1 - 2021-10-29#
Changed#
Updated the LightningTrainStep to optionally take in a LightningDataModule as input.
v0.3.0 - 2021-10-28#
Added#
Added IterableDatasetDict, a version of DatasetDict for streaming-like datasets.
Added a PyTorch Lightning integration with LightningTrainStep.
Fixed#
Fixed bug with FromParams and Lazy where extra arguments would sometimes be passed down through to a Lazy class when they shouldn't.
v0.2.4 - 2021-10-22#
Added#
Added support for torch 1.10.0.
Changed#
--file-friendly-logging flag is now an option to the main tango command, so needs to be passed before run, e.g. tango --file-friendly-logging run ....
Fixed#
Fixed bug with Step.from_params.
Ensure logging is initialized in spawn processes during distributed training with TorchTrainStep.
v0.2.3 - 2021-10-21#
Added#
Added support for global settings file, tango.yml.
Added "include_package" (array of string) param to config spec.
Added a custom error StopEarly that a TrainCallback can raise within the TorchTrainStep to stop training early without crashing.
Added step config, tango command, and tango version to executor metadata.
Executor now also saves pip dependencies and conda environment files to the run directory for each step.
Fixed#
Ensured **kwargs arguments are logged in FromParams.
v0.2.2 - 2021-10-19#
Added#
Added new steps to datasets integration: ConcatenateDatasets ("datasets::concatenate") and InterleaveDatasets ("datasets::interleave").
Added __contains__ and __iter__ methods on DatasetDict so that it is now a Mapping class.
Added tango info command that - among other things - displays which integrations are installed.
v0.2.1 - 2021-10-18#
Added#
Added convert_to_tango_dataset_dict() function in the datasets integration. It's important for step caching purposes to use this to convert a HF DatasetDict to a native Tango DatasetDict when that DatasetDict is part of the input to another step. Otherwise the HF DatasetDict will have to be pickled to determine its hash.
Changed#
Format.checksum() is now an abstract method. Subclasses should only compute checksum on the serialized artifact and nothing else in the directory.
[internals] Changed the relationship between Executor, StepCache, and Step. Executor now owns the StepCache, and Step never interacts with StepCache directly.
v0.2.0 - 2021-10-15#
Added#
Added a Weights & Biases integration with a training callback ("wandb::log") for TorchTrainStep ("torch::train") that logs training and validation metrics to W&B.
Fixed#
Fixed Format.checksum() when there is a symlink to a directory in the cache folder.
v0.1.3 - 2021-10-15#
Added#
Added the ability to track a metric other than "loss" for validation in TorchTrainStep ("torch::train").
Fixed#
Final model returned from TorchTrainStep ("torch::train") will have best weights loaded.
Checkpoints are saved from TorchTrainStep ("torch::train") even when there is no validation loop.
Fixed TorchTrainStep ("torch::train") when validation_split is None.
Fixed distributed training with TorchTrainStep ("torch::train") on GPU devices.
v0.1.2 - 2021-10-13#
Added#
Added support for YAML configuration files.
v0.1.1 - 2021-10-12#
Added#
TorchTrainStep now displays a progress bar while saving a checkpoint to file.
The default executor now saves an "executor-metadata.json" file to the directory for each step.
Changed#
Renamed DirectoryStepCache to LocalStepCache (registered as "local").
LocalStepCache saves metadata to cache-metadata.json instead of metadata.json.
Fixed#
Fixed bug with TorchTrainStep during distributed training.
FromParams will automatically convert strings into Path types now when the annotation is Path.
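Annotation-driven conversion of strings to Path, as described in the entry above, can be illustrated with a few lines of introspection. This is a simplified sketch and not the FromParams implementation; coerce_paths and the example function load are hypothetical names introduced here.

```python
import inspect
from pathlib import Path
from typing import Any, Callable, Dict


def coerce_paths(fn: Callable, params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: turn string values into Path where the parameter is annotated as Path."""
    coerced = dict(params)
    for name, parameter in inspect.signature(fn).parameters.items():
        if parameter.annotation is Path and isinstance(coerced.get(name), str):
            coerced[name] = Path(coerced[name])
    return coerced


def load(data_dir: Path, split: str) -> Path:
    # Relies on `data_dir` being a real Path so that `/` works.
    return data_dir / split


kwargs = coerce_paths(load, {"data_dir": "/tmp/data", "split": "train"})
location = load(**kwargs)
```

Parameters annotated with anything other than Path, such as split here, are passed through unchanged.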
v0.1.0 - 2021-10-11#
Added#
Added StepGraph and Executor abstractions.
Added a basic PyTorch training step registered as "torch::train", along with other registrable components, such as Model, DataLoader, Sampler, DataCollator, Optimizer, and LRScheduler.
Added DatasetRemixStep in tango.steps.
Added module tango.common.sequences.
Added DatasetDict class in tango.common.dataset_dict.
Added 🤗 Datasets integration.
Added command-line options to set log level or disable logging completely.
Changed#
Step.work_dir, Step.unique_id, Step.dependencies, and Step.recursive_dependencies are now properties instead of methods.
tango run command will acquire a lock on the directory to avoid race conditions.
Integrations can now be installed with pip install tango[INTEGRATION_NAME]. For example, pip install tango[torch].
Added method Registrable.search_modules() for automatically finding and importing the modules where a given name might be registered.
FromParams.from_params() and Registrable.resolve_class_name will now call Registrable.search_modules() to automatically import modules where the type might be defined. Thus for classes that are defined and registered within any tango.* submodules it is not necessary to explicitly import them.
Fixed#
Step implementations can now take arbitrary **kwargs in their run() methods.
v0.0.3 - 2021-09-27#
Added#
Added tango command.
v0.0.2 - 2021-09-27#
Added#
Ported over core tango components from AllenNLP.
v0.0.1 - 2021-09-22#
Added#
Added initial project boilerplate.