Checkpointer( self, serialization_dir: str = None, keep_serialized_model_every_num_seconds: int = None, num_serialized_models_to_keep: int = 2, model_save_interval: float = None, ) -> None
This class implements the functionality for checkpointing your model and trainer state
during training. It is agnostic as to what those states look like (they are typed as
Dict[str, Any]), but they will be fed to
torch.save so they should be serializable
in that sense. They will also be restored as Dict[str, Any], which means the calling
code is responsible for knowing what to do with them.
- num_serialized_models_to_keep :
int, optional (default=2) Number of previous model checkpoints to retain. A value of None or -1 means all checkpoints will be kept.
- keep_serialized_model_every_num_seconds :
int, optional (default=None) If num_serialized_models_to_keep is not None, it is sometimes useful to also keep models at a fixed interval, in addition to the most recent num_serialized_models_to_keep. To do so, specify keep_serialized_model_every_num_seconds as the number of seconds between permanently saved checkpoints. Note that this option is ignored when num_serialized_models_to_keep is None, since in that case all checkpoints are kept anyway.
- model_save_interval :
float, optional (default=None) If provided, serialize models every model_save_interval seconds within single epochs. In all cases, models are also saved at the end of every epoch if serialization_dir is provided.
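To make the interaction between num_serialized_models_to_keep and keep_serialized_model_every_num_seconds concrete, here is a minimal, torch-free sketch of one plausible retention policy: the most recent N checkpoints always survive, and when the seconds threshold is set, a checkpoint saved at least that long after the previous permanently kept one also survives. This is an illustration of the semantics described above, not the library's actual pruning code.

```python
def checkpoints_to_keep(save_times, num_to_keep=2, keep_every_num_seconds=None):
    """Simulate the retention policy on a list of ascending save
    timestamps (in seconds). Returns the sorted timestamps that
    survive pruning.

    - Always keep the most recent `num_to_keep` checkpoints.
    - If `keep_every_num_seconds` is set, additionally keep any
      checkpoint saved at least that many seconds after the previous
      permanently kept one.
    """
    if num_to_keep is None or num_to_keep == -1:
        return list(save_times)  # keep everything
    kept = set(save_times[-num_to_keep:]) if num_to_keep > 0 else set()
    if keep_every_num_seconds is not None:
        last_permanent = save_times[0]  # anchor the interval clock
        for t in save_times[1:]:
            if t - last_permanent >= keep_every_num_seconds:
                kept.add(t)  # promote to a permanent checkpoint
                last_permanent = t
    return sorted(kept)
```

For example, with checkpoints saved every 10 seconds, num_to_keep=2, and keep_every_num_seconds=25, the last two checkpoints survive plus one roughly every 25 seconds.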
Checkpointer.find_latest_checkpoint(self) -> Tuple[str, str]
Return the location of the latest model state and training state files. If there is no valid checkpoint, return None.
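As a sketch of how "latest" can be determined, the snippet below scans a list of filenames for integer-suffixed checkpoint files and picks the highest epoch. The naming convention model_state_epoch_{N}.th / training_state_epoch_{N}.th is an assumption here; check the files your trainer actually writes.

```python
import re

def find_latest_checkpoint(filenames):
    """Given the filenames in a serialization directory, return the
    (model_state, training_state) filename pair with the highest
    integer epoch suffix, or None if no checkpoint is present.

    Assumes the 'model_state_epoch_{N}.th' naming convention; a real
    implementation may also need to handle mid-epoch suffixes.
    """
    epochs = []
    for name in filenames:
        match = re.fullmatch(r"model_state_epoch_([0-9]+)\.th", name)
        if match:
            epochs.append(int(match.group(1)))
    if not epochs:
        return None
    latest = max(epochs)  # numeric max, so epoch 10 beats epoch 9
    return (f"model_state_epoch_{latest}.th",
            f"training_state_epoch_{latest}.th")
```

Sorting numerically rather than lexicographically matters: as strings, "model_state_epoch_9.th" would wrongly sort after "model_state_epoch_10.th".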
Checkpointer.maybe_save_checkpoint( self, trainer: 'allennlp.training.trainer.Trainer', epoch: int, batches_this_epoch: int, ) -> None
Given the amount of time elapsed between the last save and now (tracked internally), the
current epoch, and the number of batches seen so far this epoch, this method decides whether
to save a checkpoint. If it decides to save one, it grabs whatever state it needs out of the
Trainer and saves it.
This function is intended to be called at the end of each batch in an epoch (perhaps because your data is large enough that you don't really have "epochs"). The default implementation only looks at time, not batch or epoch number, though those parameters are available to you if you want to customize the behavior of this function.
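The default time-only decision described above can be sketched as a small timer: save whenever at least model_save_interval seconds have passed since the last save, ignoring the epoch and batch arguments. The class name and injectable clock below are illustrative, not part of the library's API.

```python
import time

class SaveTimer:
    """Minimal sketch of a time-based mid-epoch save decision: save
    whenever at least `model_save_interval` seconds have elapsed since
    the last save. Epoch/batch numbers are accepted but unused,
    mirroring the default behavior described above."""

    def __init__(self, model_save_interval, clock=time.time):
        self._interval = model_save_interval
        self._clock = clock            # injectable for deterministic tests
        self._last_save = clock()

    def should_save(self, epoch=None, batches_this_epoch=None):
        if self._interval is None:
            return False               # mid-epoch saving disabled
        now = self._clock()
        if now - self._last_save >= self._interval:
            self._last_save = now      # reset the interval clock
            return True
        return False
```

Calling this at the end of every batch gives roughly one checkpoint per interval, regardless of how long an "epoch" is.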
Checkpointer.restore_checkpoint(self) -> Tuple[Dict[str, Any], Dict[str, Any]]
Restores a model from a serialization_dir to the last saved checkpoint.
This includes a training state (typically consisting of an epoch count and optimizer state),
which is serialized separately from model parameters. This function should only be used to
continue training - if you wish to load a model for inference/load parts of a model into a new
computation graph, you should use the native PyTorch functions:
model.load_state_dict(torch.load("/path/to/model/weights.th"))
If self._serialization_dir does not exist or does not contain any checkpointed weights,
this function will do nothing and return empty dicts.
Returns
- states : Tuple[Dict[str, Any], Dict[str, Any]]
The model state and the training state.
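The save/restore contract (two Dict[str, Any] states per epoch, empty dicts when nothing has been saved) can be sketched end to end. pickle stands in for torch.save/torch.load so the sketch has no torch dependency, and the model_state_epoch_{N}.th filenames are an assumed convention; both function names are illustrative.

```python
import os
import pickle


def save_checkpoint(serialization_dir, epoch, model_state, training_state):
    """Write the two state dicts for one epoch. `pickle` stands in for
    torch.save here; the real checkpointer serializes with torch.save."""
    with open(os.path.join(serialization_dir,
                           f"model_state_epoch_{epoch}.th"), "wb") as f:
        pickle.dump(model_state, f)
    with open(os.path.join(serialization_dir,
                           f"training_state_epoch_{epoch}.th"), "wb") as f:
        pickle.dump(training_state, f)


def restore_checkpoint(serialization_dir):
    """Return (model_state, training_state) from the latest checkpoint,
    or two empty dicts when the directory is missing or empty."""
    if not os.path.isdir(serialization_dir):
        return {}, {}
    epochs = sorted(
        int(name.split("_epoch_")[1].split(".")[0])
        for name in os.listdir(serialization_dir)
        if name.startswith("model_state_epoch_")
    )
    if not epochs:
        return {}, {}
    latest = epochs[-1]
    with open(os.path.join(serialization_dir,
                           f"model_state_epoch_{latest}.th"), "rb") as f:
        model_state = pickle.load(f)
    with open(os.path.join(serialization_dir,
                           f"training_state_epoch_{latest}.th"), "rb") as f:
        training_state = pickle.load(f)
    return model_state, training_state
```

The calling code decides what the restored dicts mean, e.g. feeding the training state's epoch count and optimizer state back into the trainer before resuming.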