SenseBERT: Driving Some Sense Into BERT

This is a short summary of SenseBERT: Driving Some Sense into BERT by Levine et al., 2020.

The paper proposes a method to employ weak-supervision directly on the word sense level, by pre-training a model to predict masked words and their WordNet supersenses, with no use of human annotation. SenseBERT uses WordNet, an organised lexical resource.

SenseBERT uses self supervision, a technique that allows it to learn from large amounts of unannotated text, to train on an input sentence with masked words in order to predict their senses (their actual meaning given their context). This is done by adding masked-word sense prediction as an auxiliary task in pre-training. The model is then exposed to lexical semantics when learning from a large unannotated corpus, thus training a semantic-level language model that predicts missing word meanings. The paper uses supersenses, broader semantic categories, to mitigate WordNet’s fine-grained word sense systems caused by arbitrary sense granularity, blurriness and general subjectivity. Precise word senses can sometimes also suffer from bias, limited and infrequent coverage, and consistency. As such, coarser-grained supersenses have been used to improve several NLP tasks like dependency parsing, question generation, and metaphor detection. There are 45 supersense categories and they act as possible labels for the sense prediction task. Labelling words with a single supersense involves training the network to predict this supersense given a masked word’s context. For words with multiple supsersenses, like ‘bass’, the model is trained to predict any of the word’s senses.

SenseBERT’s pre-training allows word-sense information to be integrated into the model. The base model BERT is made up of an internal Transformer encoder wrapped by an external mapping to the word vocab space W. The Transformer Layers produce new sequences of contextualised embeddings at each layer, and the Transformer itself outputs the final sequence of contextualised word embeddings. The external mapping W transforms the output from the Transformer to create the word-score vectors for masked words in specific positions. W essentially translates the observed vocabulary space into and out of the transformer encoder space, but I’m not completely sure how weight tying works to improve the quality of the scores by making the input more sensitive to the training signal. For SenseBERT, a semantic-level language model is also trained to predict the supersense of every masked word. An external mapping to the words’ supersenses space S is added. Human annotation is not required as WordNet Supersense information is added to S as word-level information is added to W with a loss function, hence implementing word-form and word-sense multi-task learning. A(w), the union of supersenses coupled to the synsets that are mapped to the lemmatized word in WordNet, is used to construct loss functions for the supersense-level language model, and maximise the probability that the predicted sense is in the set of allowed supersenses of the masked word w. The model accounts for overconfidence in early stages of training with a regularization term. It learns a representation of the sentences’ supersenses, allowing it to extract meaningful information. The paper also makes sure to accounts for the model’s predictions of out-of-vocabulary words and for single-supersensed words. At the pre-training stage, it takes as input a sequence of words where 15% are replaced by MASK tokens and outputs word-score vectors for the masked words along with the pre word score.

The paper concludes that SenseBERT does better than the initial BERT model in three cases: on SemEval-SS Frozen, SemEval-SS Fine-tuned, and for words in context. For words in context, it does better than five other models. SenseBERT shows an increase in word-level semantic awareness, outperforming previous existing models on Supersense Disambiguation and word in context tasks. Human supervision and human annotation are not required with the use of WordNet, an external linguistic knowledge source, and the introduction of semantic signals at the pre-training stage. The paper introduces supersenses to transfer learning. Ideas here could be used in ELMo, using embeddings from language models to compute contextualised word representations in a model for a downstream task. SenseBERT may however, be limited by reasoning about physical relations and common sense, long-range coherence and repetition of texts. It may not generalise as well as people and may lead to social impact issues where biased language models are misused to spread fake news.