BertBaseUncasedForNemo

Description: BERT Base model trained on the uncased Wikipedia and BookCorpus dataset with a sequence length of 512.
Publisher: NVIDIA
Latest Version: 1
Modified: April 4, 2023
Size: 511.75 MB

Overview

This is a checkpoint for the BERT Base model trained in NeMo on the uncased English Wikipedia and BookCorpus dataset with a sequence length of 512. It was trained with Apex/Amp optimization level O1.

The model was trained for 2285714 iterations on a DGX-1 with 8 V100 GPUs.

The model achieves an EM/F1 of 82.74/89.79 on SQuAD v1.1 and 71.24/74.32 on SQuAD v2.0. On the GLUE benchmark MRPC task, it achieves an accuracy/F1 of 86.52/90.53.

Please be sure to download the latest version in order to ensure compatibility with the latest NeMo release.

  • BERT-STEP-2285714.pt - pretrained BERT encoder weights
  • BertTokenClassifier-STEP-2285714.pt - pretrained BERT masked language model head weights
  • SequenceClassifier-STEP-2285714.pt - pretrained BERT next sentence prediction head weights
  • bert-config.json - the config file used to initialize BERT network architecture in NeMo
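
As a quick sanity check after downloading, the encoder weights can be inspected with plain PyTorch. The sketch below assumes the .pt file is a standard state dict (or a dict wrapping one) saved with torch.save; the exact layout may vary between NeMo versions.

    # Inspect the downloaded encoder checkpoint (assumed to be a plain state dict).
    import torch

    state = torch.load("BERT-STEP-2285714.pt", map_location="cpu")

    # Unwrap a "state_dict" key if the checkpoint stores metadata around it.
    state_dict = state.get("state_dict", state) if isinstance(state, dict) else state

    # Print a few parameter names/shapes and the total parameter count.
    tensors = {k: v for k, v in state_dict.items() if torch.is_tensor(v)}
    for name, tensor in list(tensors.items())[:5]:
        print(name, tuple(tensor.shape))
    print(f"total parameters: {sum(t.numel() for t in tensors.values()):,}")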

More Details

BERT, or Bidirectional Encoder Representations from Transformers, is a neural approach to pretraining language representations that obtains near state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks, including the GLUE benchmark and the SQuAD question answering dataset. This model is based on the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding and the Hugging Face implementation, and leverages mixed precision arithmetic and Tensor Cores on V100 GPUs for faster training while maintaining target accuracy.

BERT uses the same architecture as the encoder half of the Transformer. Input sequences are projected into an embedding space before being fed into the encoder, and positional and segment encodings are added to the embeddings to preserve positional information. The encoder is a stack of 12 Transformer blocks, each consisting of a multi-head attention layer followed by successive feed-forward and layer-normalization stages. The multi-head attention layer performs self-attention over multiple input representations. The total number of parameters is 110M.
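
For reference, the 110M figure can be reproduced from the standard BERT Base hyperparameters (vocabulary 30522, hidden size 768, 12 layers, feed-forward size 3072, maximum position 512). These values are the usual BERT Base defaults and are assumed here; check bert-config.json for the exact configuration.

    # Back-of-the-envelope parameter count for BERT Base (assumed default hyperparameters).
    vocab_size, max_pos, type_vocab = 30522, 512, 2
    hidden, ffn, layers = 768, 3072, 12

    # Word, position and segment embeddings, plus the embedding LayerNorm.
    embeddings = (vocab_size + max_pos + type_vocab) * hidden + 2 * hidden

    attention = 4 * (hidden * hidden + hidden)            # Q, K, V and output projections
    ffn_block = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)                        # post-attention and post-FFN LayerNorm
    per_layer = attention + ffn_block + layer_norms

    pooler = hidden * hidden + hidden                     # [CLS] pooler used by the NSP head

    total = embeddings + layers * per_layer + pooler
    print(f"{total:,}")                                   # ~109.5M, i.e. the 110M quoted above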

The model is trained using masked language model loss and next sentence prediction loss.
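
As an illustration only (not NeMo's actual implementation), the two objectives are typically combined as a sum of two cross-entropy losses, with unmasked token positions excluded from the masked language model term. All tensor names and shapes below are placeholders.

    # Sketch of the combined pretraining loss: masked LM + next sentence prediction.
    import torch
    import torch.nn.functional as F

    batch, seq_len, hidden, vocab = 8, 512, 768, 30522

    # Stand-ins for the model outputs on one batch.
    mlm_logits = torch.randn(batch, seq_len, vocab)   # token-level (masked LM) head
    nsp_logits = torch.randn(batch, 2)                # sequence-level (NSP) head

    # Masked positions carry the true token id; all other positions are -1 and ignored.
    mlm_labels = torch.full((batch, seq_len), -1, dtype=torch.long)
    mlm_labels[:, 0] = torch.randint(0, vocab, (batch,))
    nsp_labels = torch.randint(0, 2, (batch,))        # is-next vs. random sentence pair

    mlm_loss = F.cross_entropy(mlm_logits.view(-1, vocab), mlm_labels.view(-1), ignore_index=-1)
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
    loss = mlm_loss + nsp_loss                        # total pretraining loss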

Documentation

Source code and the developer guide are available at https://github.com/NVIDIA/NeMo. Refer to the documentation at https://docs.nvidia.com/deeplearning/nemo/neural-modules-release-notes/index.html. The code to pretrain and reproduce this model checkpoint is also available at https://github.com/NVIDIA/NeMo.

This model checkpoint can be used either for finetuning BERT on your custom dataset or for finetuning downstream tasks, including GLUE benchmark tasks, question answering tasks such as SQuAD, joint intent and slot detection, punctuation and capitalization, named entity recognition, and a speech recognition postprocessing model that corrects mistakes. All of these tasks and scripts can be found at https://github.com/NVIDIA/NeMo.

The following examples show how to pretrain BERT and how to finetune it on two downstream tasks, GLUE MRPC and SQuAD.

Usage example 1: Pretraining BERT

  1. Download and preprocess the uncased Wikipedia and BookCorpus dataset:
  • Run the script and extract the preprocessed hdf5 files into $train_data and $eval_data
  2. Run BERT Base with sequence length 512 on a DGX-1 with 8 V100 GPUs:

    cd examples/nlp/language_modeling;

    python -m torch.distributed.launch --nproc_per_node=8 bert_pretraining.py --config_file bert-config.json --train_data $train_data --eval_data $eval_data --num_gpus 8 --batch_size 8 --amp_opt_level "O1" --lr_policy SquareRootAnnealing --beta1 0.9 --beta2 0.999 --lr_warmup_proportion 0.01 --optimizer adam_w --weight_decay 0.01 --lr 0.4375e-4 data_preprocessed --max_predictions_per_seq 80 --num_iters 2285714

Checkpoints will be stored in the args.work_dir folder.

Usage example 2: Using the BERT checkpoint for a downstream task, GLUE benchmark task MRPC

Download BERT-STEP-2285714.pt and bert-config.json.

cd examples/nlp/glue_benchmark;

python glue_benchmark_with_bert.py --data_dir $mrpc_dataset --task_name mrpc --pretrained_bert_model bert-base-uncased --bert_checkpoint /path_to/BERT-STEP-2285714.pt --bert_config /path_to/bert-config.json --lr 2e-5 
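
With this checkpoint, the MRPC run should land close to the accuracy/F1 of 86.52/90.53 quoted in the Overview; exact numbers can vary slightly with random seeds and hyperparameters.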

Usage example 3: Using the BERT checkpoint for the SQuAD question answering downstream task

Download BERT-STEP-2285714.pt and bert-config.json.

SQuAD v1.1

cd examples/nlp/question_answering;

python -m torch.distributed.launch --nproc_per_node=8 question_answering_squad.py --mode train_eval --amp_opt_level O1 --num_gpus 8 --train_file=/path_to/squad/v1.1/train-v1.1.json --eval_file /path_to/squad/v1.1/dev-v1.1.json --bert_checkpoint /path_to/BERT-STEP-2285714.pt --bert_config /path_to/bert-config.json --pretrained_model_name bert-base-uncased --batch_size 3 --num_epochs 2 --lr_policy SquareRootAnnealing --optimizer adam_w --lr 3e-5 --do_lower_case --no_data_cache

SQuAD v2.0

cd examples/nlp/question_answering;

python -m torch.distributed.launch --nproc_per_node=8 question_answering_squad.py --mode train_eval --amp_opt_level O1 --num_gpus 8 --train_file /path_to/squad/v2.0/train-v2.0.json --eval_file /path_to/squad/v2.0/dev-v2.0.json --bert_checkpoint /path_to/BERT-STEP-2285714.pt --bert_config /path_to/bert-config.json --pretrained_model_name=bert-base-uncased --batch_size 3 --num_epochs 2 --lr_policy SquareRootAnnealing --optimizer adam_w --lr 3e-5 --do_lower_case --version_2_with_negative --no_data_cache