src.corpora.auto.get_lambada

get_lambada(tokenizer: PreTrainedTokenizer, paths: Dict[str, Path], dataset_id: str = 'lambada', dataset_name: Optional[str] = None, validation_ratio: float = 0.0005, seq_len: int = 1024, preprocessing_num_proc: int = 4, stride: int = -1, ignore_train: bool = False) → DatasetDict

Run special tokenization and grouping for the LAMBADA dataset.

Adapted from https://github.com/NVIDIA/Megatron-LM/blob/main/tasks/zeroshot_gpt2/datasets.py
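The exact preprocessing lives in the function above; as a rough sketch of the fixed-length grouping step that parameters like `seq_len` and `stride` typically control, the helper below (hypothetical, not part of this API) chunks a flat token stream into windows, where a `stride` of -1 is assumed to mean non-overlapping windows:

```python
from typing import List

def group_tokens(token_ids: List[int], seq_len: int, stride: int = -1) -> List[List[int]]:
    """Split a flat token stream into fixed-length chunks.

    stride == -1 is taken to mean non-overlapping windows (step == seq_len);
    a positive stride yields overlapping windows, as used for strided
    perplexity evaluation. Ragged tails shorter than seq_len are dropped.
    """
    step = seq_len if stride == -1 else stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunk = token_ids[start:start + seq_len]
        if len(chunk) < seq_len:
            break  # drop the incomplete final window
        chunks.append(chunk)
    return chunks

tokens = list(range(10))
print(group_tokens(tokens, seq_len=4))            # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(group_tokens(tokens, seq_len=4, stride=2))  # four overlapping windows
```

Overlapping windows keep more left context per prediction at the cost of re-tokenizing shared spans; non-overlapping grouping is cheaper and is the usual default for training splits.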