optimizers
allennlp.training.optimizers
AllenNLP just uses PyTorch optimizers, with a thin wrapper to allow registering them and instantiating them `from_params`.

The available optimizers are:
- adadelta
- adagrad
- adam
- adamw
- huggingface_adamw
- huggingface_adafactor
- sparse_adam
- sgd
- rmsprop
- adamax
- averaged_sgd
ParameterGroupsType¶
ParameterGroupsType = List[Tuple[List[str], Dict[str, Any]]]
make_parameter_groups¶
def make_parameter_groups(
model_parameters: List[Tuple[str, torch.nn.Parameter]],
groups: Optional[ParameterGroupsType] = None
) -> Union[List[Dict[str, Any]], List[torch.nn.Parameter]]
Takes a list of model parameters with associated names (typically coming from something like `model.named_parameters()`), along with a grouping (as specified below), and prepares them to be passed to the `__init__` function of a `torch.Optimizer`. This means separating the parameters into groups with the given regexes, and prepping whatever keyword arguments are given for those regexes in `groups`.

`groups` contains something like:
[
(["regex1", "regex2"], {"lr": 1e-3}),
(["regex3"], {"lr": 1e-4})
]
All of the key-value pairs specified in each of these dictionaries will be passed as-is to the optimizer, with the exception of dictionaries that specify `requires_grad` to be `False`:
[
...
(["regex"], {"requires_grad": False})
]
When a parameter group has `{"requires_grad": False}`, the gradient on all matching parameters will be disabled and that group will be dropped so that it's not actually passed to the optimizer.

Ultimately, the return value of this function is in the right format to be passed directly as the `params` argument to a PyTorch `Optimizer`.
If there are multiple groups specified, this is a list of dictionaries, where each dict contains a "parameter group" and group-specific options, e.g., `{'params': [list of parameters], 'lr': 1e-3, ...}`. Any config option not specified in the additional options (e.g. for the default group) is inherited from the top-level arguments given in the constructor. See: https://pytorch.org/docs/0.3.0/optim.html?#per-parameter-options. See also our `test_optimizer_parameter_groups` test for an example of how this works in this code.
The dictionary's return type is labeled as `Any`, because it can be a `List[torch.nn.Parameter]` (for the "params" key), or anything else (typically a float) for the other keys.
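For illustration, here is a minimal sketch of how `make_parameter_groups` might be used outside of the configuration framework. The toy model and regexes are made up for this example:

```python
import torch

from allennlp.training.optimizers import make_parameter_groups

# A toy stand-in for a real model; parameter names will be "0.weight", "1.weight", "1.bias", ...
model = torch.nn.Sequential(
    torch.nn.Embedding(100, 16),
    torch.nn.Linear(16, 16),
    torch.nn.Linear(16, 2),
)

groups = [
    (["^0\\."], {"lr": 1e-4}),              # embedding parameters get a smaller learning rate
    (["^1\\."], {"requires_grad": False}),  # first linear layer is frozen and dropped
]

parameter_groups = make_parameter_groups(list(model.named_parameters()), groups)

# The result can be passed directly as the `params` argument of a torch optimizer;
# any parameter not matched above ends up in the default group with lr=1e-3.
optimizer = torch.optim.Adam(parameter_groups, lr=1e-3)
```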
Optimizer¶
class Optimizer(torch.optim.Optimizer, Registrable)
This class just allows us to implement `Registrable` for PyTorch optimizers. We do something a little bit different with `Optimizers`, because they are implemented as classes in PyTorch, and we want to use those classes. To make things easy, we just inherit from those classes, using multiple inheritance to also inherit from `Optimizer`. The only reason we do this is to make type inference on parameters possible, so we can construct these objects using our configuration framework. If you are writing your own script, you can safely ignore these classes and just use the `torch.optim` classes directly.
If you are implementing one of these classes, the `model_parameters` and `parameter_groups` arguments to `__init__` are important, and should always be present. The trainer will pass the trainable parameters in the model to the optimizer using the name `model_parameters`, so if you use a different name, your code will crash. Nothing will technically crash if you use a name other than `parameter_groups` for your second argument, it will just be annoyingly inconsistent.

Most subclasses of `Optimizer` take both a `model_parameters` and a `parameter_groups` constructor argument. The `model_parameters` argument does not get an entry in a typical AllenNLP configuration file, but the `parameter_groups` argument does (if you want a non-default value). See the documentation for the `make_parameter_groups` function for more information on how the `parameter_groups` argument should be specified.
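As a sketch of what this looks like in practice, the registered subclasses can be constructed directly with the signatures documented below. The model here is a placeholder; in a real setup the trainer supplies the named parameters of an AllenNLP `Model`:

```python
import torch

from allennlp.training.optimizers import AdamOptimizer

model = torch.nn.Linear(10, 2)  # placeholder for an AllenNLP Model

optimizer = AdamOptimizer(
    model_parameters=[(n, p) for n, p in model.named_parameters() if p.requires_grad],
    parameter_groups=[(["bias"], {"lr": 1e-4})],  # bias parameters use a smaller learning rate
    lr=1e-3,
)

# In a configuration file, the equivalent "optimizer" entry (model_parameters is filled
# in by the trainer, not the config) would look roughly like:
# "optimizer": {"type": "adam", "lr": 1e-3, "parameter_groups": [[["bias"], {"lr": 1e-4}]]}
```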
default_implementation¶
class Optimizer(torch.optim.Optimizer, Registrable):
| ...
| default_implementation = "adam"
default¶
class Optimizer(torch.optim.Optimizer, Registrable):
| ...
| @staticmethod
| def default(model_parameters: List) -> "Optimizer"
MultiOptimizer¶
@Optimizer.register("multi")
class MultiOptimizer(Optimizer):
| def __init__(
| self,
| model_parameters: List[Tuple[str, torch.nn.Parameter]],
| optimizers: Dict[str, Lazy[Optimizer]],
| parameter_groups: ParameterGroupsType
| )
A `MultiOptimizer` creates a dictionary of `Optimizer`s keyed on some 'name'. Each optimizer contains its own set of parameters which are obtained using regex matches for certain model parameters.

This optimizer works by taking in a parameter `optimizers`, which contains a dictionary of optimizers with their keyword arguments, and a parameter `parameter_groups`, which contains regexes and their corresponding optimizer and optional non-default optimizer options for each group. The regexes in a parameter group are assigned to an optimizer based on the optimizer's name, which must match between the `optimizers` entry and the parameter group. You should specify a default optimizer under the name 'default', which will be used for all parameters that didn't obtain a regex match or whose parameter group doesn't name an optimizer.
Parameters¶
- optimizers : Dict[str, Lazy[Optimizer]]
  A dictionary of optimizers to use. Each key is the 'name' that matches the optimizer to a specific parameter group. You should also supply a 'default' entry, which is used for the default parameter group.
- parameter_groups : List[Tuple[List[str], Dict[str, Any]]], optional (default = None)
  See the docstring of `make_parameter_groups` for what this parameter should look like. It should follow the same format as there, except an additional 'optimizer_name' argument should be provided to match this group to its own optimizer. Optimizer options can also be set for this group, which will override the default options.
step¶
class MultiOptimizer(Optimizer):
| ...
| def step(self)
Takes an optimization step for each optimizer.
state_dict¶
class MultiOptimizer(Optimizer):
| ...
| def state_dict(self)
Creates an object `optimizer_state_dict`, which is a dictionary mapping an optimizer key to its `state_dict`. This dictionary is used as the value for 'optimizer' in the 'training_states' dictionary in the `gradient_descent` `Trainer`, e.g.
"optimizer" : {
"optimizer1": `optimizer1_state_dict`,
"optimizer2": `optimizer2_state_dict`
}.
load_state_dict¶
class MultiOptimizer(Optimizer):
| ...
| def load_state_dict(self, training_state: Dict[str, Any])
Loads each optimizer's `state_dict`.
zero_grad¶
class MultiOptimizer(Optimizer):
| ...
| def zero_grad(self, set_to_none: bool = False)
Sets parameter gradients to zero or None.
AdamOptimizer¶
@Optimizer.register("adam")
class AdamOptimizer(Optimizer, torch.optim.Adam):
| def __init__(
| self,
| model_parameters: List[Tuple[str, torch.nn.Parameter]],
| parameter_groups: List[Tuple[List[str], Dict[str, Any]]] = None,
| lr: float = 0.001,
| betas: Tuple[float, float] = (0.9, 0.999),
| eps: float = 1e-08,
| weight_decay: float = 0.0,
| amsgrad: bool = False
| )
Registered as an `Optimizer` with name "adam".
SparseAdamOptimizer¶
@Optimizer.register("sparse_adam")
class SparseAdamOptimizer(Optimizer, torch.optim.SparseAdam):
| def __init__(
| self,
| model_parameters: List[Tuple[str, torch.nn.Parameter]],
| parameter_groups: List[Tuple[List[str], Dict[str, Any]]] = None,
| lr: float = 0.001,
| betas: Tuple[float, float] = (0.9, 0.999),
| eps: float = 1e-08
| )
Registered as an `Optimizer` with name "sparse_adam".
AdamaxOptimizer¶
@Optimizer.register("adamax")
class AdamaxOptimizer(Optimizer, torch.optim.Adamax):
| def __init__(
| self,
| model_parameters: List[Tuple[str, torch.nn.Parameter]],
| parameter_groups: List[Tuple[List[str], Dict[str, Any]]] = None,
| lr: float = 0.002,
| betas: Tuple[float, float] = (0.9, 0.999),
| eps: float = 1e-08,
| weight_decay: float = 0.0
| )
Registered as an `Optimizer` with name "adamax".
AdamWOptimizer¶
@Optimizer.register("adamw")
class AdamWOptimizer(Optimizer, torch.optim.AdamW):
| def __init__(
| self,
| model_parameters: List[Tuple[str, torch.nn.Parameter]],
| parameter_groups: List[Tuple[List[str], Dict[str, Any]]] = None,
| lr: float = 0.001,
| betas: Tuple[float, float] = (0.9, 0.999),
| eps: float = 1e-08,
| weight_decay: float = 0.01,
| amsgrad: bool = False
| )
Registered as an `Optimizer` with name "adamw".
HuggingfaceAdamWOptimizer¶
@Optimizer.register("huggingface_adamw")
class HuggingfaceAdamWOptimizer(Optimizer, transformers.AdamW):
| def __init__(
| self,
| model_parameters: List[Tuple[str, torch.nn.Parameter]],
| parameter_groups: List[Tuple[List[str], Dict[str, Any]]] = None,
| lr: float = 1e-5,
| betas: Tuple[float, float] = (0.9, 0.999),
| eps: float = 1e-08,
| weight_decay: float = 0.0,
| correct_bias: bool = True
| )
Registered as an `Optimizer` with name "huggingface_adamw".
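A common use of `parameter_groups` with this optimizer is to exclude biases and layer-norm weights from weight decay when fine-tuning a transformer. A minimal sketch, assuming typical parameter names (the regexes and values here are illustrative, not something this module prescribes):

```python
import torch

from allennlp.training.optimizers import HuggingfaceAdamWOptimizer

model = torch.nn.Linear(10, 2)  # placeholder; a real transformer has many more named parameters

optimizer = HuggingfaceAdamWOptimizer(
    model_parameters=list(model.named_parameters()),
    parameter_groups=[
        # No weight decay for bias and LayerNorm parameters.
        (["bias", "LayerNorm\\.weight"], {"weight_decay": 0.0}),
    ],
    lr=2e-5,
    weight_decay=0.01,
)
```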
HuggingfaceAdafactor¶
@Optimizer.register("huggingface_adafactor")
class HuggingfaceAdafactor(Optimizer, transformers.Adafactor):
| def __init__(
| self,
| model_parameters: List[Tuple[str, torch.nn.Parameter]],
| parameter_groups: List[Tuple[List[str], Dict[str, Any]]] = None,
| lr: Optional[float] = None,
| eps: Tuple[float, float] = (1e-30, 1e-3),
| clip_threshold: float = 1.0,
| decay_rate: float = -0.8,
| beta1: Optional[float] = None,
| weight_decay: float = 0.0,
| scale_parameter: bool = True,
| relative_step: bool = True,
| warmup_init: bool = False
| )
Registered as an `Optimizer` with name "huggingface_adafactor".
AdagradOptimizer¶
@Optimizer.register("adagrad")
class AdagradOptimizer(Optimizer, torch.optim.Adagrad):
| def __init__(
| self,
| model_parameters: List[Tuple[str, torch.nn.Parameter]],
| parameter_groups: List[Tuple[List[str], Dict[str, Any]]] = None,
| lr: float = 0.01,
| lr_decay: float = 0.0,
| weight_decay: float = 0.0,
| initial_accumulator_value: float = 0.0,
| eps: float = 1e-10
| )
Registered as an `Optimizer` with name "adagrad".
AdadeltaOptimizer¶
@Optimizer.register("adadelta")
class AdadeltaOptimizer(Optimizer, torch.optim.Adadelta):
| def __init__(
| self,
| model_parameters: List[Tuple[str, torch.nn.Parameter]],
| parameter_groups: List[Tuple[List[str], Dict[str, Any]]] = None,
| lr: float = 1.0,
| rho: float = 0.9,
| eps: float = 1e-06,
| weight_decay: float = 0.0
| )
Registered as an `Optimizer` with name "adadelta".
SgdOptimizer¶
@Optimizer.register("sgd")
class SgdOptimizer(Optimizer, torch.optim.SGD):
| def __init__(
| self,
| model_parameters: List[Tuple[str, torch.nn.Parameter]],
| lr: float,
| parameter_groups: List[Tuple[List[str], Dict[str, Any]]] = None,
| momentum: float = 0.0,
| dampening: float = 0,
| weight_decay: float = 0.0,
| nesterov: bool = False
| )
Registered as an `Optimizer` with name "sgd".
RmsPropOptimizer¶
@Optimizer.register("rmsprop")
class RmsPropOptimizer(Optimizer, torch.optim.RMSprop):
| def __init__(
| self,
| model_parameters: List[Tuple[str, torch.nn.Parameter]],
| parameter_groups: List[Tuple[List[str], Dict[str, Any]]] = None,
| lr: float = 0.01,
| alpha: float = 0.99,
| eps: float = 1e-08,
| weight_decay: float = 0.0,
| momentum: float = 0.0,
| centered: bool = False
| )
Registered as an `Optimizer` with name "rmsprop".
AveragedSgdOptimizer¶
@Optimizer.register("averaged_sgd")
class AveragedSgdOptimizer(Optimizer, torch.optim.ASGD):
| def __init__(
| self,
| model_parameters: List[Tuple[str, torch.nn.Parameter]],
| parameter_groups: List[Tuple[List[str], Dict[str, Any]]] = None,
| lr: float = 0.01,
| lambd: float = 0.0001,
| alpha: float = 0.75,
| t0: float = 1000000.0,
| weight_decay: float = 0.0
| )
Registered as an `Optimizer` with name "averaged_sgd".
DenseSparseAdam¶
@Optimizer.register("dense_sparse_adam")
class DenseSparseAdam(Optimizer, torch.optim.Optimizer):
| def __init__(
| self,
| model_parameters: List[Tuple[str, torch.nn.Parameter]],
| parameter_groups: List[Tuple[List[str], Dict[str, Any]]] = None,
| lr=1e-3,
| betas=(0.9, 0.999),
| eps=1e-8
| )
NOTE: This class has been copied verbatim from the separate dense and sparse versions of Adam in PyTorch.

Implements the Adam algorithm with dense & sparse gradients. It has been proposed in Adam: A Method for Stochastic Optimization.

Registered as an `Optimizer` with name "dense_sparse_adam".
Parameters¶
- params : iterable
  An iterable of parameters to optimize or dicts defining parameter groups.
- lr : float, optional (default = 1e-3)
  The learning rate.
- betas : Tuple[float, float], optional (default = (0.9, 0.999))
  Coefficients used for computing running averages of the gradient and its square.
- eps : float, optional (default = 1e-8)
  A term added to the denominator to improve numerical stability.
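The point of this optimizer is that a single instance can update parameters with sparse gradients (e.g. from an embedding created with `sparse=True`) alongside ordinary dense parameters. A small sketch of that situation (the toy model is made up for illustration):

```python
import torch

from allennlp.training.optimizers import DenseSparseAdam

# A sparse-gradient embedding next to a dense linear layer;
# DenseSparseAdam can step both in one call.
model = torch.nn.Sequential(
    torch.nn.Embedding(100, 16, sparse=True),
    torch.nn.Linear(16, 2),
)

optimizer = DenseSparseAdam(model_parameters=list(model.named_parameters()), lr=1e-3)

token_ids = torch.randint(0, 100, (8, 5))
loss = model(token_ids).sum()
loss.backward()   # embedding gradient is sparse, linear gradients are dense
optimizer.step()
optimizer.zero_grad()
```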
step¶
class DenseSparseAdam(Optimizer, torch.optim.Optimizer):
| ...
| def step(self, closure=None)
Performs a single optimization step.
Parameters¶
- closure : callable, optional
  A closure that reevaluates the model and returns the loss.