gradient_descent_trainer
allennlp.training.gradient_descent_trainer
GradientDescentTrainer¶
@Trainer.register("gradient_descent", constructor="from_partial_objects")
class GradientDescentTrainer(Trainer):
| def __init__(
| self,
| model: Model,
| optimizer: torch.optim.Optimizer,
| data_loader: DataLoader,
| patience: Optional[int] = None,
| validation_metric: Union[str, List[str]] = "-loss",
| validation_data_loader: DataLoader = None,
| num_epochs: int = 20,
| serialization_dir: Optional[Union[str, os.PathLike]] = None,
| checkpointer: Optional[Checkpointer] = None,
| cuda_device: Optional[Union[int, torch.device]] = None,
| grad_norm: Union[float, bool] = False,
| grad_clipping: Optional[float] = None,
| learning_rate_scheduler: Optional[LearningRateScheduler] = None,
| momentum_scheduler: Optional[MomentumScheduler] = None,
| moving_average: Optional[MovingAverage] = None,
| callbacks: List[TrainerCallback] = None,
| distributed: bool = False,
| local_rank: int = 0,
| world_size: int = 1,
| num_gradient_accumulation_steps: int = 1,
| use_amp: bool = False,
| enable_default_callbacks: bool = True,
| run_confidence_checks: bool = True,
| grad_scaling: bool = True,
| ddp_wrapped_model: Optional[DdpWrappedModel] = None,
| **kwargs
| ) -> None
A trainer for doing supervised learning with gradient descent. It just takes a labeled dataset and a DataLoader, and uses the supplied Optimizer to learn the weights for your model over some fixed number of epochs. You can also pass in a validation data_loader and enable early stopping. There are many other bells and whistles as well.
Registered as a Trainer with the name "gradient_descent" (and is also the default Trainer).
The constructor that is registered is from_partial_objects; see the arguments to that function for the exact keys that should be used if you are using a configuration file. They largely match the arguments to __init__, and we don't repeat their docstrings in from_partial_objects.
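As a rough illustration of the configuration keys mentioned above, the "trainer" section of a configuration might look like the following, written here as a plain Python dictionary. The key names come from the arguments of from_partial_objects; the specific values (a patience of 3, a grad_norm of 1.0, an "adam" optimizer with a learning rate of 0.001) are assumptions chosen only for the example.

```python
import json

# Hypothetical "trainer" section; the keys mirror from_partial_objects arguments.
trainer_config = {
    "type": "gradient_descent",      # the registered name of this trainer
    "num_epochs": 20,
    "patience": 3,                   # assumed value; enables early stopping
    "validation_metric": "-loss",    # "-" marks a metric that should decrease
    "grad_norm": 1.0,                # assumed value; rescales gradient norms
    "optimizer": {"type": "adam", "lr": 0.001},  # assumed optimizer settings
}

# The model, data loaders and serialization directory are constructed
# separately, so they do not appear under "trainer".
print(json.dumps(trainer_config, indent=2))
```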
Parameters¶
-
model : Model
An AllenNLP model to be optimized. PyTorch Modules can also be optimized if their forward method returns a dictionary with a "loss" key, containing a scalar tensor representing the loss function to be optimized.
If you are training your model using GPUs, your model should already be on the correct device. (If you are using our train command this will be handled for you.)
In a typical AllenNLP configuration file, this parameter does not get an entry under the "trainer" key; it gets constructed separately.
-
optimizer : torch.optim.Optimizer
An instance of a PyTorch Optimizer, instantiated with the parameters of the model to be optimized.
-
data_loader : DataLoader
A DataLoader containing your Dataset, yielding padded indexed batches.
In a typical AllenNLP configuration file, this parameter does not get an entry under the "trainer" key; it gets constructed separately.
-
patience : Optional[int] > 0, optional (default = None)
Number of epochs to be patient before early stopping: training is stopped after patience epochs with no improvement. If given, it must be > 0. If None, early stopping is disabled.
-
validation_metric : Union[str, List[str]], optional (default = "-loss")
Validation metric used to decide whether to stop training early (see patience) and whether to serialize an is_best model each epoch. The metric name must be prefixed with either "+" or "-", which specifies whether the metric should increase or decrease. If you specify more than one metric, the metrics will be summed to make the is_best decision.
-
validation_data_loader : DataLoader, optional (default = None)
A DataLoader to use for the validation set. If None, then use the training DataLoader with the validation data.
In a typical AllenNLP configuration file, this parameter does not get an entry under the "trainer" key; it gets constructed separately.
-
num_epochs : int, optional (default = 20)
Number of training epochs.
-
serialization_dir : str, optional (default = None)
Path to directory for saving and loading model files. Models will not be saved if this parameter is not passed.
In a typical AllenNLP configuration file, this parameter does not get an entry under the "trainer" key; it gets constructed separately.
-
checkpointer : Checkpointer, optional (default = None)
A Checkpointer is responsible for periodically saving model weights. If none is given here, we will construct one with default parameters.
-
cuda_device : Optional[Union[int, torch.device]], optional (default = None)
An integer or torch.device specifying the CUDA device to use for this process. If -1, the CPU is used. If None and you have a GPU available, that GPU will be used.
Note: If you don't intend to use a GPU, but you have one available, you'll need to explicitly set cuda_device=-1.
Note: If you intend to use a GPU, your model already needs to be on the correct device, which you can do with model = model.cuda().
Note: Data parallelism is controlled at the allennlp train level, so each trainer will have a single GPU.
-
grad_norm : Union[float, bool], optional (default = False)
If a float, gradient norms will be rescaled to have a maximum of this value. If True, the gradient norms will be calculated and passed through to any TrainerCallbacks, but won't be rescaled. If False, gradient norms will not be calculated or rescaled.
-
grad_clipping : float, optional (default = None)
If provided, gradients will be clipped during the backward pass to have an (absolute) maximum of this value. If you are getting NaNs in your gradients during training that are not solved by using grad_norm, you may need this.
-
learning_rate_scheduler : LearningRateScheduler, optional (default = None)
If specified, the learning rate will be decayed with respect to this schedule at the end of each epoch (or batch, if the scheduler implements the step_batch method). If you use torch.optim.lr_scheduler.ReduceLROnPlateau, this will use the validation_metric provided to determine if learning has plateaued. To support updating the learning rate on every batch, the scheduler can optionally implement step_batch(batch_num_total), which updates the learning rate given the batch number.
-
momentum_scheduler : MomentumScheduler, optional (default = None)
If specified, the momentum will be updated at the end of each batch or epoch according to the schedule.
-
moving_average : MovingAverage, optional (default = None)
If provided, we will maintain moving averages for all parameters. During training, we keep a shadow variable for each parameter, which maintains the moving average. During evaluation, we back up the original parameters and assign the moving averages to the corresponding parameters. Note that when saving a checkpoint, we save the moving averages of the parameters. This is necessary because we want the saved model to perform as well as the validated model if we load it later, but it may cause problems if you restart training from a checkpoint.
-
callbacks : List[TrainerCallback], optional (default = None)
A list of callbacks that can be called at certain events, e.g. at each batch, at each epoch, and at the start and end of training.
-
distributed : bool, optional (default = False)
If set, PyTorch's DistributedDataParallel is used to train the model on multiple GPUs. This also requires world_size to be greater than 1.
In a typical AllenNLP configuration file, this parameter does not get an entry under the "trainer" key; it gets constructed separately (you need a top-level "distributed" key, next to the "trainer" entry, that specifies a list of "cuda_devices").
-
local_rank : int, optional (default = 0)
This is the unique identifier of the Trainer in a distributed process group. The GPU device id is used as the rank.
In a typical AllenNLP configuration file, this parameter does not get an entry under the "trainer" key; it gets constructed separately.
-
world_size : int, optional (default = 1)
The number of Trainer workers participating in the distributed training.
In a typical AllenNLP configuration file, this parameter does not get an entry under the "trainer" key; it gets constructed separately.
-
num_gradient_accumulation_steps : int, optional (default = 1)
Gradients are accumulated for the given number of steps before doing an optimizer step. This can be useful to accommodate batches that are too large to fit in memory. Refer to Thomas Wolf's post for details on gradient accumulation.
-
use_amp : bool, optional (default = False)
If True, we'll train using Automatic Mixed Precision.
-
enable_default_callbacks : bool, optional (default = True)
When True, the DEFAULT_CALLBACKS will be used in addition to any other callbacks listed in the callbacks parameter. When set to False, DEFAULT_CALLBACKS are not used.
-
run_confidence_checks : bool, optional (default = True)
Determines whether model confidence checks, such as NormalizationBiasVerification, are run.
-
run_sanity_checks : bool, optional (default = True)
This parameter is deprecated. Please use run_confidence_checks instead.
-
grad_scaling : bool, optional (default = True)
When use_amp is True, this determines whether or not to use a GradScaler.
Note: This parameter is ignored when use_amp is False.
-
ddp_wrapped_model : Optional[DdpWrappedModel], optional (default = None)
The model wrapped with a DdpAccelerator for distributed training.
Note: This is required for distributed training.
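The interaction between patience and validation_metric described in the parameter list can be sketched in plain Python. This is a simplified illustration, not the trainer's actual bookkeeping: the helper names combined_score and should_stop_early are hypothetical, and the sketch assumes that each metric is signed so that a larger combined score is always better before the metrics are summed.

```python
from typing import Dict, List, Optional, Union


def combined_score(
    metrics: Dict[str, float], validation_metric: Union[str, List[str]]
) -> float:
    """Sum the metrics so that a larger score is always better (assumption)."""
    specs = [validation_metric] if isinstance(validation_metric, str) else validation_metric
    score = 0.0
    for spec in specs:
        sign, name = spec[0], spec[1:]
        if sign not in ("+", "-"):
            raise ValueError(f'metric "{spec}" must start with "+" or "-"')
        # "+" metrics count as-is; "-" metrics are negated so lower values score higher.
        score += metrics[name] if sign == "+" else -metrics[name]
    return score


def should_stop_early(scores: List[float], patience: Optional[int]) -> bool:
    """True if the last `patience` epochs did not beat the earlier best score."""
    if patience is None or len(scores) <= patience:
        return False  # early stopping disabled, or not enough epochs yet
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) <= best_before


# Per-epoch validation losses, scored with the default "-loss" metric.
losses = [1.0, 0.8, 0.9, 0.85]
scores = [combined_score({"loss": loss}, "-loss") for loss in losses]
print(should_stop_early(scores, patience=2))
```

With a patience of 2, the last two epochs (0.9 and 0.85) never improve on the earlier best loss of 0.8, so the sketch reports that training should stop.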
clip_gradient¶
class GradientDescentTrainer(Trainer):
| ...
| def clip_gradient(self)
Performs gradient clipping. If the model is being trained with mixed precision, the gradients are unscaled first.
rescale_gradients¶
class GradientDescentTrainer(Trainer):
| ...
| def rescale_gradients(self) -> Optional[float]
Performs gradient rescaling. Is a no-op if gradient rescaling is not enabled.
Returns the norm of the gradients if grad_norm is True or a float, otherwise returns None.
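To make the difference between norm rescaling (grad_norm) and value clipping (grad_clipping) concrete, here is a small sketch on plain Python lists. It is not the trainer's code, which operates on PyTorch tensors; the helper names rescale_by_norm and clip_by_value are hypothetical, and the norm is assumed to be the L2 norm over all gradient entries.

```python
import math
from typing import List


def rescale_by_norm(grads: List[float], max_norm: float) -> float:
    """Scale grads in place so their L2 norm is at most max_norm.

    Returns the norm measured before rescaling.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for i, g in enumerate(grads):
            grads[i] = g * scale
    return total_norm


def clip_by_value(grads: List[float], clip_value: float) -> None:
    """Clamp every gradient entry in place to [-clip_value, clip_value]."""
    for i, g in enumerate(grads):
        grads[i] = max(-clip_value, min(clip_value, g))


rescaled = [3.0, -4.0]
norm = rescale_by_norm(rescaled, max_norm=1.0)  # every entry shrinks by the same factor

clipped = [3.0, -4.0]
clip_by_value(clipped, clip_value=3.5)          # only entries beyond the bound change

print(norm, rescaled, clipped)
```

Rescaling keeps the direction of the gradient and only shortens it, while clipping can change the direction because each entry is bounded independently.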
batch_outputs¶
class GradientDescentTrainer(Trainer):
| ...
| def batch_outputs(
| self,
| batch: TensorDict,
| for_training: bool
| ) -> Dict[str, torch.Tensor]
Does a forward pass on the given batch and returns the output dictionary that the model returns, after adding any specified regularization penalty to the loss (if training).
train¶
class GradientDescentTrainer(Trainer):
| ...
| def train(self) -> Dict[str, Any]
Trains the supplied model with the supplied parameters.
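One of the options the training loop honors is num_gradient_accumulation_steps. The sketch below shows the general idea on a toy one-parameter model, without PyTorch: gradients from several small batches are accumulated, each scaled by the number of accumulation steps, before a single update is applied. The names mse_gradient and accumulated_step and the toy data are hypothetical, and the scaling by the number of steps is a common convention assumed here rather than something this page specifies.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (x, y) pair for the toy model y = w * x


def mse_gradient(w: float, batch: List[Example]) -> float:
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_step(w: float, micro_batches: List[List[Example]], lr: float) -> float:
    """Accumulate gradients over all micro-batches, then take one update step."""
    grad = 0.0
    for batch in micro_batches:
        # Dividing by the number of accumulation steps keeps the update on
        # the same scale as one step on the combined batch.
        grad += mse_gradient(w, batch) / len(micro_batches)
    return w - lr * grad


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
full_batch_w = 0.0 - 0.01 * mse_gradient(0.0, data)
accumulated_w = accumulated_step(0.0, [data[:2], data[2:]], lr=0.01)
print(full_batch_w, accumulated_w)
```

With equally sized micro-batches, the accumulated update matches the update computed on the full batch, which is why accumulation can stand in for a batch that does not fit in memory.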
get_checkpoint_state¶
class GradientDescentTrainer(Trainer):
| ...
| def get_checkpoint_state(self) -> Optional[TrainerCheckpoint]
from_partial_objects¶
class GradientDescentTrainer(Trainer):
| ...
| @classmethod
| def from_partial_objects(
| cls,
| model: Model,
| serialization_dir: str,
| data_loader: DataLoader,
| validation_data_loader: DataLoader = None,
| local_rank: int = 0,
| patience: int = None,
| validation_metric: Union[str, List[str]] = "-loss",
| num_epochs: int = 20,
| cuda_device: Optional[Union[int, torch.device]] = None,
| grad_norm: Union[float, bool] = False,
| grad_clipping: float = None,
| distributed: bool = False,
| world_size: int = 1,
| num_gradient_accumulation_steps: int = 1,
| use_amp: bool = False,
| no_grad: List[str] = None,
| optimizer: Lazy[Optimizer] = Lazy(Optimizer.default),
| learning_rate_scheduler: Lazy[LearningRateScheduler] = None,
| momentum_scheduler: Lazy[MomentumScheduler] = None,
| moving_average: Lazy[MovingAverage] = None,
| checkpointer: Optional[Lazy[Checkpointer]] = Lazy(Checkpointer),
| callbacks: List[Lazy[TrainerCallback]] = None,
| enable_default_callbacks: bool = True,
| run_confidence_checks: bool = True,
| grad_scaling: bool = True,
| ddp_accelerator: Optional[DdpAccelerator] = None,
| **kwargs
| ) -> Trainer
This method exists so that we can have a documented method to construct this class using FromParams. If you are not using FromParams or config files, you can safely ignore this method.
The reason we can't just use __init__ with FromParams here is that there are sequential dependencies among this class's arguments. Anything that has a Lazy[] type annotation needs something from one of the non-Lazy arguments. The Optimizer needs to have the parameters from the Model before it's constructed, and the Schedulers need to have the Optimizer. Because of this, the typical way we construct things FromParams doesn't work, so we use Lazy to allow for constructing the objects sequentially.
If you're not using FromParams, you can just construct these arguments in the right order yourself in your code and call the constructor directly.
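The sequential dependency described above can be illustrated with a toy stand-in for Lazy. Everything in this sketch is hypothetical (LazySketch, ToyModel, ToyOptimizer, ToyScheduler and build are not AllenNLP classes); it only shows the ordering: the optimizer is built once the model's parameters exist, and the scheduler is built once the optimizer exists.

```python
from typing import Callable, Generic, List, Tuple, TypeVar

T = TypeVar("T")


class LazySketch(Generic[T]):
    """Hypothetical stand-in for Lazy: holds a constructor and builds on demand."""

    def __init__(self, constructor: Callable[..., T]) -> None:
        self._constructor = constructor

    def construct(self, **kwargs) -> T:
        # The missing arguments are supplied only when they become available.
        return self._constructor(**kwargs)


class ToyModel:
    def __init__(self) -> None:
        self.parameters: List[float] = [0.0, 0.0]


class ToyOptimizer:
    def __init__(self, model_parameters: List[float], lr: float = 0.1) -> None:
        self.model_parameters = model_parameters
        self.lr = lr


class ToyScheduler:
    def __init__(self, optimizer: ToyOptimizer) -> None:
        self.optimizer = optimizer


def build(
    model: ToyModel,
    lazy_optimizer: LazySketch,
    lazy_scheduler: LazySketch,
) -> Tuple[ToyOptimizer, ToyScheduler]:
    # Step 1: the optimizer needs the model's parameters.
    optimizer = lazy_optimizer.construct(model_parameters=model.parameters)
    # Step 2: the scheduler needs the optimizer built in step 1.
    scheduler = lazy_scheduler.construct(optimizer=optimizer)
    return optimizer, scheduler


model = ToyModel()
optimizer, scheduler = build(model, LazySketch(ToyOptimizer), LazySketch(ToyScheduler))
print(scheduler.optimizer is optimizer)
```

When you are not using FromParams, the body of build is all you need: construct the objects in this order yourself and pass them to the constructor.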
get_best_weights_path¶
class GradientDescentTrainer(Trainer):
| ...
| def get_best_weights_path(self) -> Optional[str]
DEFAULT_CALLBACKS¶
DEFAULT_CALLBACKS: Tuple[Type[TrainerCallback]] = (ConsoleLoggerCallback,)
The default callbacks used by GradientDescentTrainer.