util
allennlp.common.util
Various utilities that don't fit anywhere else.
JsonDict¶
JsonDict = Dict[str, Any]
START_SYMBOL¶
START_SYMBOL = "@start@"
END_SYMBOL¶
END_SYMBOL = "@end@"
PathType¶
PathType = Union[os.PathLike, str]
T¶
T = TypeVar("T")
ContextManagerFunctionReturnType¶
ContextManagerFunctionReturnType = Generator[T, None, None]
sanitize¶
def sanitize(x: Any) -> Any
Sanitize turns PyTorch and Numpy types into basic Python types so they can be serialized into JSON.
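For example (a minimal sketch; it assumes tensors are converted to plain Python scalars and lists):
>>> import torch
>>> sanitize({"loss": torch.tensor(0.25), "ids": torch.tensor([1, 2, 3])})
{'loss': 0.25, 'ids': [1, 2, 3]}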
group_by_count¶
def group_by_count(
iterable: List[Any],
count: int,
default_value: Any
) -> List[List[Any]]
Takes a list and groups it into sublists of size count, using default_value to pad the list at the end if the list is not divisible by count.
For example:
>>> group_by_count([1, 2, 3, 4, 5, 6, 7], 3, 0)
[[1, 2, 3], [4, 5, 6], [7, 0, 0]]
This is a short method, but it's complicated and hard to remember as a one-liner, so we just make a function out of it.
A¶
A = TypeVar("A")
lazy_groups_of¶
def lazy_groups_of(
iterable: Iterable[A],
group_size: int
) -> Iterator[List[A]]
Takes an iterable and batches the individual instances into lists of the specified size. The last list may be smaller if there are instances left over.
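For example, following the behavior described above:
>>> list(lazy_groups_of(range(5), 2))
[[0, 1], [2, 3], [4]]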
pad_sequence_to_length¶
def pad_sequence_to_length(
sequence: Sequence,
desired_length: int,
default_value: Callable[[], Any] = lambda: 0,
padding_on_right: bool = True
) -> List
Takes a list of objects and pads it to the desired length, returning the padded list. The original list is not modified.
Parameters¶
- sequence : List
  A list of objects to be padded.
- desired_length : int
  Maximum length of each sequence. Longer sequences are truncated to this length, and shorter ones are padded to it.
- default_value : Callable, optional (default = lambda: 0)
  Callable that outputs a default value (of any type) to use as padding values. This is a lambda to avoid using the same object when the default value is more complex, like a list.
- padding_on_right : bool, optional (default = True)
  When we add padding tokens (or truncate the sequence), should we do it on the right or the left?
Returns¶
- padded_sequence : List
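For example (a sketch using the defaults described above):
>>> pad_sequence_to_length([1, 2, 3], 5)
[1, 2, 3, 0, 0]
>>> pad_sequence_to_length([1, 2, 3], 5, padding_on_right=False)
[0, 0, 1, 2, 3]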
add_noise_to_dict_values¶
def add_noise_to_dict_values(
dictionary: Dict[A, float],
noise_param: float
) -> Dict[A, float]
Returns a new dictionary with noise added to the value of every key in dictionary. The noise is uniformly distributed within noise_param percent of the value for every value in the dictionary.
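For example (a sketch; the exact values are random, but each should stay within noise_param percent of the original):
>>> original = {"a": 10.0, "b": 2.0}
>>> noisy = add_noise_to_dict_values(original, 0.1)
>>> all(abs(noisy[k] - v) <= 0.1 * v for k, v in original.items())
True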
namespace_match¶
def namespace_match(pattern: str, namespace: str)
Matches a namespace pattern against a namespace string. For example, *tags matches passage_tags and question_tags, and tokens matches tokens but not stemmed_tokens.
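The same examples as doctests:
>>> namespace_match("*tags", "passage_tags")
True
>>> namespace_match("tokens", "stemmed_tokens")
False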
prepare_environment¶
def prepare_environment(params: Params)
Sets random seeds for reproducible experiments. This may not work as expected if you use this from within a Python project in which you have already imported PyTorch. If you use the scripts/run_model.py entry point to train models with this library, your experiments should be reasonably reproducible. If you are using this from your own project, you will want to call this function before importing PyTorch. Complete determinism is very difficult to achieve with libraries doing optimized linear algebra due to massively parallel execution, which is exacerbated by using GPUs.
Parameters¶
- params : Params
  A Params object or dict holding the JSON parameters.
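A minimal usage sketch; the seed key names shown here are assumptions about the expected parameters:
from allennlp.common import Params
prepare_environment(Params({"random_seed": 13, "numpy_seed": 13, "pytorch_seed": 13}))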
LOADED_SPACY_MODELS¶
LOADED_SPACY_MODELS: Dict[Tuple[str, bool, bool, bool], SpacyModelType] = {}
get_spacy_model¶
def get_spacy_model(
spacy_model_name: str,
pos_tags: bool = True,
parse: bool = False,
ner: bool = False
) -> SpacyModelType
In order to avoid loading spacy models a whole bunch of times, we'll save references to them, keyed by the options we used to create the spacy model, so any particular configuration only gets loaded once.
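For example (a sketch; it assumes the en_core_web_sm model is installed, and that repeated calls with the same options return the cached model):
>>> nlp = get_spacy_model("en_core_web_sm", pos_tags=True, parse=False, ner=False)
>>> nlp is get_spacy_model("en_core_web_sm", pos_tags=True, parse=False, ner=False)
True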
pushd¶
@contextmanager
def pushd(
new_dir: PathType,
verbose: bool = False
) -> ContextManagerFunctionReturnType[None]
Changes the current directory to the given path and prepends it to sys.path. This method is intended to be used with a with statement; after the block exits, the current directory is restored to its previous value.
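A minimal usage sketch; /path/to/data is a hypothetical directory:
import os
with pushd("/path/to/data"):
    print(os.getcwd())  # now /path/to/data
# on exit, the previous working directory is restored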
push_python_path¶
@contextmanager
def push_python_path(
path: PathType
) -> ContextManagerFunctionReturnType[None]
Prepends the given path to sys.path. This method is intended to be used with a with statement; after the block exits, the path is removed from sys.path.
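A minimal usage sketch; the path and module names are hypothetical:
with push_python_path("path/to/my_library"):
    import my_module  # importable while the path is on sys.path
# on exit, the path is removed from sys.path again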
import_module_and_submodules¶
def import_module_and_submodules(
package_name: str,
exclude: Optional[Set[str]] = None
) -> None
Import all public submodules under the given package. Primarily useful so that people using AllenNLP as a library can specify their own custom packages and have their custom classes get loaded and registered.
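A minimal usage sketch; the package name is hypothetical:
# Recursively imports my_project.custom_modules and its submodules,
# triggering any class registrations they perform at import time.
import_module_and_submodules("my_project.custom_modules")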
peak_cpu_memory¶
def peak_cpu_memory() -> Dict[int, int]
Get peak memory usage for each worker, as measured by max-resident-set size:
https://unix.stackexchange.com/questions/30940/getrusage-system-call-what-is-maximum-resident-set-size
Only works on OSX and Linux; on other platforms the result will be 0 for every worker.
peak_gpu_memory¶
def peak_gpu_memory() -> Dict[int, int]
Get the peak GPU memory usage in bytes by device.
Returns¶
Dict[int, int]
Keys are device ids as integers. Values are memory usage as integers in bytes. Returns an empty dict if GPUs are not available.
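A small sketch combining this with format_size from this module:
for device, used in peak_gpu_memory().items():
    print(f"cuda:{device}: {format_size(used)}")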
ensure_list¶
def ensure_list(iterable: Iterable[A]) -> List[A]
An Iterable may be a list or a generator. This ensures we get a list without making an unnecessary copy.
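For example:
>>> ensure_list([1, 2, 3])  # already a list; returned as-is
[1, 2, 3]
>>> ensure_list(x * x for x in range(3))  # generator; materialized once
[0, 1, 4]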
is_lazy¶
def is_lazy(iterable: Iterable[A]) -> bool
Checks if the given iterable is lazy, which here just means it's not a list.
int_to_device¶
def int_to_device(device: Union[int, torch.device]) -> torch.device
log_frozen_and_tunable_parameter_names¶
def log_frozen_and_tunable_parameter_names(
model: torch.nn.Module
) -> None
get_frozen_and_tunable_parameter_names¶
def get_frozen_and_tunable_parameter_names(
model: torch.nn.Module
) -> Tuple[Iterable[str], Iterable[str]]
dump_metrics¶
def dump_metrics(
file_path: Optional[str],
metrics: Dict[str, Any],
log: bool = False
) -> None
flatten_filename¶
def flatten_filename(file_path: str) -> str
is_distributed¶
def is_distributed() -> bool
Checks if the distributed process group is available and has been initialized.
is_global_primary¶
def is_global_primary() -> bool
Checks if the distributed process group is the global primary (rank = 0). If the distributed process group is not available or has not been initialized, this trivially returns True.
sanitize_wordpiece¶
def sanitize_wordpiece(wordpiece: str) -> str
Sanitizes wordpieces from BERT, RoBERTa or ALBERT tokenizers.
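For example (a sketch; it assumes the BERT-style ## continuation prefix is stripped):
>>> sanitize_wordpiece("##ing")
'ing'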
sanitize_ptb_tokenized_string¶
def sanitize_ptb_tokenized_string(text: str) -> str
Sanitizes a string that was tokenized using PTBTokenizer.
find_open_port¶
def find_open_port() -> int
Find a random open port on local host.
format_timedelta¶
def format_timedelta(td: timedelta) -> str
Format a timedelta for humans.
format_size¶
def format_size(size: int) -> str
Format a size (in bytes) for humans.
nan_safe_tensor_divide¶
def nan_safe_tensor_divide(numerator, denominator)
Performs division and handles divide-by-zero.
On zero-division, sets the corresponding result elements to zero.
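For example (a sketch of the expected behavior):
>>> import torch
>>> nan_safe_tensor_divide(torch.tensor([1.0, 3.0]), torch.tensor([2.0, 0.0]))
tensor([0.5000, 0.0000])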
shuffle_iterable¶
def shuffle_iterable(
i: Iterable[T],
pool_size: int = 1024
) -> Iterable[T]
cycle_iterator_function¶
def cycle_iterator_function(
iterator_function: Callable[[], Iterable[T]]
) -> Iterator[T]
Functionally equivalent to itertools.cycle(iterator_function()), but this function does not cache the result of calling the iterator like cycle does. Instead, we just call iterator_function() again whenever we get a StopIteration. This should only be preferred over itertools.cycle in cases where you're sure you don't want the caching behavior that's done in itertools.cycle.
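For example, following the behavior described above:
>>> gen = cycle_iterator_function(lambda: [1, 2, 3])
>>> [next(gen) for _ in range(7)]
[1, 2, 3, 1, 2, 3, 1]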
hash_object¶
def hash_object(o: Any) -> str
Returns a character hash code of arbitrary Python objects.
SigTermReceived¶
class SigTermReceived(Exception)
install_sigterm_handler¶
def install_sigterm_handler()
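A minimal usage sketch, assuming the installed handler raises SigTermReceived when the process receives SIGTERM; the helper functions here are hypothetical:
install_sigterm_handler()
try:
    run_training_loop()  # hypothetical long-running work
except SigTermReceived:
    save_checkpoint()  # hypothetical graceful shutdown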