src.corpora.auto.build_indexed_dataset

build_indexed_dataset(tokenizer: PreTrainedTokenizer, paths: Dict[str, Path], dataset_id: str, dataset_name: Optional[str], dataset_dir: Optional[str], seq_len: int, stride: Optional[int] = None, preprocessing_num_proc: int = 64, ignore_train: bool = False, shuffle_seed: int = 42, train_shuffle_buffer_size: Optional[int] = 10000) → Dict[str, IndexedDataset]

Builds indexed datasets from a dataset dictionary.
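
The signature does not say how seq_len and stride are applied. As a rough illustration only, the sketch below shows one common sliding-window convention for cutting a token stream into fixed-length sequences. The helper chunk_with_stride is hypothetical and not part of src.corpora.auto, and treating stride=None as "step by seq_len" (no overlap) is an assumption, not documented behaviour of build_indexed_dataset.

```python
from typing import List, Optional


def chunk_with_stride(
    token_ids: List[int], seq_len: int, stride: Optional[int] = None
) -> List[List[int]]:
    """Hypothetical helper: split a token stream into windows of seq_len tokens.

    Assumption: when stride is None the windows do not overlap (step == seq_len).
    A trailing remainder shorter than seq_len is dropped.
    """
    step = seq_len if stride is None else stride
    if seq_len <= 0 or step <= 0:
        raise ValueError("seq_len and stride must be positive")
    windows = []
    # Start a new window every `step` tokens while a full window still fits.
    for start in range(0, len(token_ids) - seq_len + 1, step):
        windows.append(token_ids[start:start + seq_len])
    return windows


tokens = list(range(10))
non_overlapping = chunk_with_stride(tokens, seq_len=4)
overlapping = chunk_with_stride(tokens, seq_len=4, stride=2)
print(non_overlapping)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(overlapping)      # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Under this convention a smaller stride yields more, overlapping sequences from the same text. Whether build_indexed_dataset follows it should be checked against the linked source.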