src.corpora.auto.get_auto_dataset

get_auto_dataset(tokenizer: PreTrainedTokenizer, paths: Dict[str, Path], dataset_id: str = 'wikitext', dataset_name: str = 'wikitext-103-raw-v1', validation_ratio: float = 0.0005, seq_len: int = 1024, preprocessing_num_proc: int = 64, stride: int = -1, ignore_train: bool = False) → DatasetDict

Run basic tokenization and grouping to turn a Hugging Face dataset (loaded via the datasets library) into a DatasetDict of fixed-length token blocks, usable as a torch dataset for language modeling.
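The "grouping" step is not spelled out here; a common pattern for this kind of preprocessing (and a plausible reading of seq_len with stride = -1, i.e. non-overlapping blocks) is to concatenate all tokenized texts and re-split them into fixed-length chunks, dropping the ragged tail. The helper below, group_texts, is a hypothetical illustration of that pattern, not the function's actual implementation:

```python
from typing import Dict, List

def group_texts(examples: Dict[str, List[List[int]]], seq_len: int) -> Dict[str, List[List[int]]]:
    """Concatenate tokenized sequences, then split into non-overlapping seq_len blocks."""
    # Flatten every column (e.g. input_ids, attention_mask) into one long list.
    concatenated = {k: [tok for seq in v for tok in seq] for k, v in examples.items()}
    # Drop the trailing remainder so every block has exactly seq_len tokens.
    total_len = (len(concatenated["input_ids"]) // seq_len) * seq_len
    return {
        k: [toks[i : i + seq_len] for i in range(0, total_len, seq_len)]
        for k, toks in concatenated.items()
    }

batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8]]}
blocks = group_texts(batch, seq_len=4)
# blocks["input_ids"] → [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With a positive stride, the same idea would instead slide a seq_len window forward by stride tokens, producing overlapping blocks; that variant is commonly used for perplexity evaluation.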