Training With Multiple GPUs

Once you’ve got training working on a single node with a single GPU, you can move on to training with multiple GPUs if your machine has them.
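
If you’re not sure how many GPUs PyTorch can see on your machine, a quick sanity check (a generic PyTorch one-liner, not part of the Mistral tooling itself) is:

conda activate mistral
# prints the number of CUDA devices visible to PyTorch
python -c "import torch; print(torch.cuda.device_count())"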

This can be done in two ways. The first, which we show here, uses torch.distributed.launch, a PyTorch utility for launching multiple processes per node for distributed training. The second uses DeepSpeed, which we go over in our multi-node training section.

To use torch.distributed.launch, run the following command with --nproc_per_node set to the number of GPUs you want to use (in this example we’ll go with 2):

conda activate mistral
cd mistral
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 train.py \
    --config conf/mistral-micro.yaml \
    --training_arguments.fp16 true \
    --training_arguments.per_device_train_batch_size 2 \
    --run_id mistral-micro-multi-gpu
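
Note that newer PyTorch releases deprecate torch.distributed.launch in favor of torchrun. A roughly equivalent invocation is sketched below; whether it works unchanged depends on whether train.py reads its rank from the --local_rank argument (passed by torch.distributed.launch) or from the LOCAL_RANK environment variable (set by torchrun).

# sketch: torchrun equivalent of the command above
# assumption: train.py can pick up LOCAL_RANK from the environment
torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 train.py \
    --config conf/mistral-micro.yaml \
    --training_arguments.fp16 true \
    --training_arguments.per_device_train_batch_size 2 \
    --run_id mistral-micro-multi-gpu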

You should see output similar to single node/single GPU training, except training steps should complete roughly twice as fast.

As noted with single node/single GPU training, you may need to adjust the batch size to avoid out-of-memory (OOM) errors.
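
For example, one common adjustment is to halve the per-device batch size and compensate with gradient accumulation so the effective batch size stays the same. The sketch below assumes gradient_accumulation_steps is exposed through the training_arguments command-line overrides in the same way as per_device_train_batch_size above:

# sketch: halve per-device batch size, double accumulation steps
# assumption: --training_arguments.gradient_accumulation_steps is a valid override
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 train.py \
    --config conf/mistral-micro.yaml \
    --training_arguments.fp16 true \
    --training_arguments.per_device_train_batch_size 1 \
    --training_arguments.gradient_accumulation_steps 2 \
    --run_id mistral-micro-multi-gpu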