To train your model using mixed precision or TF32 with Tensor Cores, or using FP32, perform the following steps with the default parameters of the ResNet-50 v1.5 model on the ImageNet dataset. For specifics concerning training and inference, see the Advanced section.
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Classification/ConvNets
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
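The loop above extracts each per-class archive into a directory of the same name and removes the archive. The sketch below demonstrates the same pattern on small synthetic tars (the class IDs are placeholder WordNet-style names, not real data), so it can be tried without the multi-hundred-gigabyte ImageNet download:

```shell
# Runnable demo of the per-class extraction loop, using synthetic archives
# in a temp directory instead of the real ILSVRC2012 class tars.
set -e
demo=$(mktemp -d)
cd "$demo"
for cls in n01440764 n01443537; do        # hypothetical class IDs
  echo "image data" > "${cls}.JPEG"
  tar -cf "${cls}.tar" "${cls}.JPEG"
  rm -f "${cls}.JPEG"
done
# Same pattern as the command above: one directory per class archive,
# archive removed after extraction.
find . -name "*.tar" | while read NAME ; do
  mkdir -p "${NAME%.tar}"
  tar -xf "${NAME}" -C "${NAME%.tar}"
  rm -f "${NAME}"
done
```

After the loop, each class's images sit under their own directory and no `.tar` files remain.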
cd ..
mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
docker build . -t nvidia_rn50
nvidia-docker run --rm -it -v <path to imagenet>:/data/tfrecords --ipc=host nvidia_rn50
bash ./utils/dali_index.sh /data/tfrecords <index file store location>
Index files can be created once and then reused. It is highly recommended to save them into a persistent location.
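One way to honor the "create once, reuse" advice is to rebuild the index only when the chosen directory is empty. The sketch below is runnable anywhere by substituting a temp directory and a stand-in file for the real index build; in practice `IDX_DIR` would be a persistent mounted path and the commented-out line would invoke `dali_index.sh` as shown above:

```shell
# Sketch: build DALI index files only once into a persistent directory.
# A temp dir and a placeholder file stand in for the real paths and build.
IDX_DIR=$(mktemp -d)/dali_idx      # in practice: a mounted, persistent path
mkdir -p "$IDX_DIR"
if [ -z "$(ls -A "$IDX_DIR")" ]; then
  # bash ./utils/dali_index.sh /data/tfrecords "$IDX_DIR"
  touch "$IDX_DIR/train.idx"       # stand-in for the real index build
fi
```

Subsequent runs find the directory non-empty and skip the rebuild.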
The training scripts are located in the resnet50v1.5/training directory. Ensure ImageNet is mounted in the /data/tfrecords directory. For example, to train on DGX-1 for 90 epochs using AMP, run:
bash ./resnet50v1.5/training/DGX1_RN50_AMP_90E.sh /path/to/result /data
Additionally, features like DALI data preprocessing or TensorFlow XLA can be enabled with the following arguments when running those scripts:
bash ./resnet50v1.5/training/DGX1_RN50_AMP_90E.sh /path/to/result /data --xla --dali
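Since the script writes checkpoints into the result path given as its first argument, one simple convention is a fresh timestamped subdirectory per run so earlier checkpoints are not overwritten. In this sketch a temp directory stands in for `/path/to/result`, and the training invocation itself is left commented out because it requires the repository and dataset:

```shell
# Sketch: give each training run its own timestamped results directory.
RESULTS_ROOT=$(mktemp -d)                      # stand-in for /path/to/result
RESULTS="$RESULTS_ROOT/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$RESULTS"
# bash ./resnet50v1.5/training/DGX1_RN50_AMP_90E.sh "$RESULTS" /data --xla --dali
```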
To evaluate a checkpoint against the validation dataset mounted in the /data/tfrecords directory, run main.py with --mode=evaluate. For example:
python main.py --mode=evaluate --data_dir=/data/tfrecords --batch_size <batch size> --model_dir <model location> --results_dir <output location> [--xla] [--amp]
The optional --xla and --amp flags control XLA and AMP during evaluation, respectively.
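The placeholders in the command above can be filled in from shell variables; the sketch below assembles the evaluation command as a string with assumed example values (batch size 256 and hypothetical /results subdirectories) and only prints it, so it can be inspected before running on a machine with the repository and dataset:

```shell
# Sketch: assemble the evaluation command. Batch size and the two /results
# paths are placeholder values, not prescribed by the repository.
DATA_DIR=/data/tfrecords
MODEL_DIR=/results/rn50_train      # hypothetical checkpoint location
OUT_DIR=/results/rn50_eval         # hypothetical output location
CMD="python main.py --mode=evaluate --data_dir=$DATA_DIR --batch_size 256 --model_dir $MODEL_DIR --results_dir $OUT_DIR --amp"
echo "$CMD"
```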