

Configurations are specified using the Quinine library.

Quinine allows users to integrate multiple config files and layer configs on top of each other. It is designed for machine learning projects with large sets of nested hyperparameters.

The easiest way to understand Quinine is to study conf/mistral-micro.yaml, which is presented below.

This config specifies a variety of settings and draws configurations from conf/datasets/wikitext2.yaml, conf/models/mistral-micro.yaml and conf/trainers/gpt2-small.yaml. This cleanly separates the configs for the dataset (e.g. name or number of pre-processing workers), the model (e.g. number of layers), and the trainer (e.g. learning rate), while high-level settings are specified in the main config file.
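
The layering idea can be sketched in plain Python as a recursive dictionary merge. This is only an illustration of the concept, not Quinine's implementation: the helper names `deep_merge` and `layer_configs`, the keys, and the values are all hypothetical, and the sketch assumes that later layers (here, the main config) take precedence over the files they build on.

```python
from copy import deepcopy
from typing import Any, Dict, List


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which `override` is layered on top of `base`.

    Nested dicts are merged key by key; any other value is replaced.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def layer_configs(layers: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge the layers left to right, so later layers win on conflicts."""
    result: Dict[str, Any] = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Hypothetical stand-ins for the dataset, model, and trainer files.
dataset_cfg = {"dataset": {"name": "wikitext", "num_proc": 4}}
model_cfg = {"model": {"n_layer": 2}}
trainer_cfg = {"training_arguments": {"learning_rate": 6e-4}}

# Hypothetical main config: sets a top-level value and overrides one nested value.
main_cfg = {"seed": 21, "dataset": {"num_proc": 8}}

config = layer_configs([dataset_cfg, model_cfg, trainer_cfg, main_cfg])
print(config["dataset"])
```

Merging into a fresh copy keeps each layer unchanged, so the same dataset or trainer layer can be reused by several main configs.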

Most of the defaults in conf/mistral-micro.yaml will work, but you will need to change the Weights & Biases settings and specify the artifacts directories cache_dir and run_dir.
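
As a rough sketch of what changing these settings amounts to, the snippet below updates individual values in a nested config dictionary. The helper `set_by_path`, the dotted-key addressing, the `artifacts` nesting, and the placeholder paths are assumptions made for illustration; they are not a statement about Quinine's API or command-line syntax.

```python
from typing import Any, Dict


def set_by_path(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set a nested value in place, e.g. set_by_path(cfg, "a.b", 1) sets cfg["a"]["b"]."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        # Create intermediate dicts on demand so new sections can be added.
        node = node.setdefault(key, {})
        if not isinstance(node, dict):
            raise TypeError(f"'{key}' is not a nested section in '{dotted_key}'")
    node[keys[-1]] = value


# Hypothetical in-memory config; the paths are placeholders, not real locations.
config = {
    "run_id": None,
    "artifacts": {"cache_dir": "/path/to/default/cache", "run_dir": "/path/to/default/runs"},
}

set_by_path(config, "run_id", "my-first-run")
set_by_path(config, "artifacts.cache_dir", "/path/to/my/cache")
set_by_path(config, "artifacts.run_dir", "/path/to/my/runs")
print(config)
```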

Example config: tutorial-gpt2-micro.yaml

conf/mistral-micro.yaml is a basic configuration file that can be used for an introductory training run:

# tutorial-gpt2-micro.yaml
#   Demo GPT-2 Micro Training Config, currently working with the WikiText103 Dataset, GPT-2 Micro Architecture,
#   and batch size of 2. Runs with DeepSpeed ZeRO-2, with a per-device BSZ of 2.
#   Inheritance and core paths can all be overridden from the command line or by re-writing these files.
# Inherit Dataset, Tokenization, Model, and Training Details
inherit:
    - datasets/wikitext2.yaml
    - models/mistral-micro.yaml
    - trainers/gpt2-small.yaml

# Run ID -- make sure to override!
run_id: null

# Weights & Biases
wandb: stanford-mercury
group: mistral-gpt2

# Artifacts & Caching
artifacts:
    cache_dir: /u/scr/nlp/data/mercury/mistral-demo/artifacts
    run_dir: /u/scr/nlp/data/mercury/mistral-demo/runs

# Save Effective Batch Size for Easy Handling ==> Main Code asserts infra + training_config results in this!
effective_bsz: 32

# Resume from Checkpoint
resume: false
resume_checkpoint: null

# List of frequencies at which to save checkpoints, provided as a list of two-element tuples:
#   - Frequency (`freq`) at which to save checkpoints (# steps)
#   - Bound (`until`) on global step for given frequency (checkpoint every `freq` steps until global step = `until`)
checkpoint_frequency:
    - [2, 10]
    - [10, 100]
    - [50, 2000]
    - [100, 20000]
    - [1000, 400000]

# `torch.distributed` Default Infra Parameters -- to be overwritten by call to `torch.distributed.launch`
local_rank: -1
nnodes: -1
nproc_per_node: -1

# DeepSpeed Default Infra Parameters -- to be overwritten by call to `DeepSpeed`
num_gpus: -1
num_nodes: -1
world_size: -1

# Logging Parameters -- 10 = DEBUG, 20 = INFO, 30 = WARNING, 40 = ERROR, 50 = CRITICAL
log_level: 20

# Random Seed
seed: 21

# Set Eval Stride
online_eval:
    stride: 256
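
To make the checkpoint schedule in the listing above concrete, here is a minimal sketch of one way to read the [freq, until] pairs. It is an interpretation of the comments in the config, not the project's actual checkpointing code: the function names are hypothetical, and the sketch assumes that each `until` bound is inclusive and that no checkpoints are scheduled past the last bound.

```python
from typing import List, Sequence, Tuple

# (freq, until) pairs copied from the config listing above.
SCHEDULE: List[Tuple[int, int]] = [
    (2, 10),
    (10, 100),
    (50, 2000),
    (100, 20000),
    (1000, 400000),
]


def should_checkpoint(step: int, schedule: Sequence[Tuple[int, int]] = SCHEDULE) -> bool:
    """Return True if `step` is a checkpoint step under this reading of the schedule."""
    if step <= 0:
        return False
    for freq, until in schedule:
        # Assumption: the first pair whose bound has not been passed applies.
        if step <= until:
            return step % freq == 0
    # Assumption: past the final bound, no rule applies.
    return False


def checkpoint_steps(max_step: int, schedule: Sequence[Tuple[int, int]] = SCHEDULE) -> List[int]:
    """List every checkpoint step from 1 to `max_step` inclusive."""
    return [s for s in range(1, max_step + 1) if should_checkpoint(s, schedule)]


print(checkpoint_steps(30))  # → [2, 4, 6, 8, 10, 20, 30]
```

Under this reading, checkpoints are dense early in training and become sparser as the global step grows.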