7 Basic NLP Models to Empower Your ML Application

An overview of the 7 NLP models.

In the previous post, we have explained what is NLP and its real-world applications. In this post, we will continue to introduce some of the main deep-learning models used in NLP applications.

BERT

Bidirectional Encoder Representations from Transformers (BERT) was first proposed in 2018 by Jacob Devlin in his paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
The main breakthrough of the BERT model is that it scans text in a bidirectional way rather than a left-to-right or combined left-to-right and right-to-left sequence when looking at texts during training.
There are two general types of BERT: BERT (base) and BERT (large). The difference is in configurable parameters: base-110 million parameter, large-345 million.

XLNet

XLNet was published in the paper "XLNet: Generalized Autoregressive Pretraining for Language Understanding" in 2019.
XLNet outperforms BERT by large margins in 20 benchmark tests as it leverages the best of both autoregressive models and bidirectional context modeling. XLNet adopts a newly proposed modeling method called "permutation language modeling".
Unlike traditional tokenization in a language model that predict the word in a sentence based on the context of the previous token, XLNet's permutation language modeling takes into account the interdependency between tokens.
XLNet achieves a performance test result of 2-15% improvement over BERT.

RoBERTa

RoBERTa was proposed in the paper "RoBERTa: A Robustly Optimized BERT Pretraining Approach" in 2019.
RoBERTa makes changes to the architecture and training procedures of BERT. Specifically, RoBERTa removes the next sentence prediction (NSP) objective, uses a much larger dataset than BERT, and replaces static masking with dynamic masking.
RoBERTa achieves a performance test result of 2-20% improvement over BERT.

ALBERT

ALBERT model was proposed in the paper "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations" in 2019.
ALBERT is developed based on the BERT model. Its major breakthrough is that it brings a significant parameter reduction but maintains the same level of performance compared to BERT.
In ALBERT, parameters are shared across 12 layers of transformer encoders, while in the original BERT, each layer of encoders have a unique set of parameters.

StructBERT

StructBERT was proposed in the paper "StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding" in 2019.
StructBERT further extends BERT by incorporating language structure into the training procedure.
StructBERT also introduces the word structural objective (WSO), which helps the model to learn the ordering of words.

T5

T5 was introduced in the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" in 2019. T5 is the short form for "Text-to-Text Transfer Transformer".
A clean, massive, and open-source dataset C4(Colossal Clean Crawled Corpus) is released in T5.
T5 categorizes all NLP tasks as "text-to-text" tasks.
There are five different sizes of T5 model, each with different number of parameters: T5-small (60 million parameters), T5-base (220 million parameters), T5-large (770 million parameters), T5-3B (3 billion parameters), T5-11B (11 billion parameters).

ELECTRA

ELECTRA was proposed in the paper "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators" in 2020.
ELECTRA proposes a new framework for pre-training which combines generator and discriminator.
ELECTRA changes the training method of masked language model to replaced token detection.
ELECTRA performs better on small-sized models.

7 Basic NLP Models to Empower Your ML Application

Cheatsheet for BERT, XLNet, RoBERTa, ALBERT, StructBERT, T5, and ELECTRA.

An overview of the 7 NLP models.

BERT

XLNet

RoBERTa

ALBERT

StructBERT

T5

ELECTRA