Semantic Alignment of Linguistic and Visual Understanding using Multi-modal Transformer

--

Also, they don’t understand — writing is language. The use of language. The language to create image, the language to create drama. It requires a skill of learning how to use language.

- John Milius

Vision-language tasks, such as image captioning, visual question answering, and visual commonsense reasoning, serve as rich test-beds for evaluating the reasoning capabilities of visually informed systems.

These tasks require a joint understanding of visual contents, language semantics, and cross-modal alignments.

Fig. 1: Examples of images, questions (Q), and answers (A)

With the success of BERT on a variety of NLP tasks, there has been a surge in building pre-trained models for vision-language tasks, such as ViLBERT and VL-BERT. However, these models suffer from fundamental difficulties in learning effective visually grounded representations and the relationships between different attributes.

Here, I would like to walk through two research works that are breakthroughs in aligning image and text information with multi-modal transformers.

Pixel-BERT

Pixel-BERT is a unified end-to-end framework that aligns image pixels with text through deep multi-modal transformers that jointly learn visual and language embeddings. By aligning semantics at the pixel and text level, Pixel-BERT addresses the limitation of region-based image feature extractors (e.g., Faster R-CNN), which are designed for specific visual tasks (e.g., object detection) and therefore leave an information gap with respect to language understanding. Important visual information, such as the shapes of objects and the spatial relations between overlapping objects, is lost.

Fig. 1 shows some examples that region-based visual features cannot handle well. In Example (A), it is difficult for an object detection model to capture the status of the plane. In Example (B), even though the “girl” and the “ground” can be detected, their regions overlap, so it is hard for a downstream fusion model to judge the actual spatial relation from their bounding boxes alone. Similarly, in Example (C), the visual features of the “giraffe” alone make it difficult to infer the status of the animals.

Fig. 2. Pixel-BERT

Pixel-BERT contains a visual feature embedding module, a sentence feature embedding module, and a cross-modality alignment module. It takes image-sentence pairs as input and outputs the attention features of each input element. Images are passed into the pixel feature embedding module pixel by pixel, and sentences are fed into the sentence feature embedding module token by token. The model can be pre-trained with MLM (masked language modeling) and ITM (image-text matching) tasks and can be flexibly applied to downstream tasks (e.g., VQA, retrieval).
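To make the pre-training objectives concrete, here is a minimal PyTorch-style sketch of how MLM and ITM heads could sit on top of the joint Transformer output. The class and variable names are my own illustration under stated assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelBertPretrainingHeads(nn.Module):
    """Illustrative MLM + ITM heads on top of the cross-modal Transformer output."""

    def __init__(self, hidden_dim=768, vocab_size=30522):
        super().__init__()
        # MLM: predict the original token id for each masked language position
        self.mlm_head = nn.Linear(hidden_dim, vocab_size)
        # ITM: binary classification (matched vs. randomly mismatched image-sentence pair)
        self.itm_head = nn.Linear(hidden_dim, 2)

    def forward(self, joint_features, masked_positions, mlm_labels, itm_labels):
        # joint_features: (batch, seq_len, hidden_dim) output of the joint Transformer
        # masked_positions: boolean mask (batch, seq_len) marking masked language tokens
        cls_feature = joint_features[:, 0]             # [CLS] token feature
        mlm_logits = self.mlm_head(joint_features)     # (batch, seq_len, vocab_size)
        itm_logits = self.itm_head(cls_feature)        # (batch, 2)

        mlm_loss = F.cross_entropy(mlm_logits[masked_positions], mlm_labels)
        itm_loss = F.cross_entropy(itm_logits, itm_labels)
        return mlm_loss + itm_loss
```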

Sentence Feature Embedding

Given a sentence as input, we first split it into a sequence of words and use WordPiece to tokenize each word into tokens. We then adopt an embedding matrix to embed each token into a vector. Here we use w = {w1, w2, …, wn}, wi ∈ R^d, to represent the embedded sequence, where n indicates the sequence length and d is the embedding dimension. The final representation of each token is obtained by adding a position embedding and a semantic embedding:

wˆi = LayerNorm(wi + pi + sw)

where pi indicates the embedding vector at position i, sw is a semantic embedding vector, and LayerNorm is a normalization function.
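A minimal PyTorch sketch of this sentence embedding step, assuming a WordPiece vocabulary lookup, learned position embeddings, and a single learned semantic vector sw shared by all tokens (module names are illustrative):

```python
import torch
import torch.nn as nn

class SentenceEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, max_len=128, dim=768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)       # w_i
        self.pos_emb = nn.Embedding(max_len, dim)             # p_i
        self.semantic_emb = nn.Parameter(torch.zeros(dim))    # s_w, shared by all tokens
        self.layer_norm = nn.LayerNorm(dim)

    def forward(self, token_ids):
        # token_ids: (batch, n) WordPiece token ids
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        w = self.token_emb(token_ids)                  # (batch, n, dim)
        p = self.pos_emb(positions)                    # (n, dim), broadcast over batch
        # w_hat_i = LayerNorm(w_i + p_i + s_w)
        return self.layer_norm(w + p + self.semantic_emb)
```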

Image Feature Embedding

Given an input image I, we first use a CNN backbone to extract its feature map and then flatten the feature along the spatial dimension. We denote the flattened feature as v = {v1, v2, …, vk}, vi ∈ R^d, where k indicates the number of feature pixels. The visual embedding features {vˆ1, vˆ2, …, vˆk} can be computed by

vˆi = vi + sv

where sv is a semantic embedding vector that distinguishes the visual embedding from the language embedding. Since all pixels share the same sv, this embedding vector can be considered a bias term to be combined with the CNN backbone.
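A rough PyTorch sketch of this pixel feature embedding, assuming a ResNet-50 backbone; any projection of the 2048-dimensional CNN features to the Transformer dimension is omitted here, and the names are illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torchvision

class PixelFeatureEmbedding(nn.Module):
    def __init__(self, dim=2048):
        super().__init__()
        backbone = torchvision.models.resnet50()
        # keep everything up to the last conv stage; drop avg-pool and fc
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.semantic_emb = nn.Parameter(torch.zeros(dim))   # s_v, shared by all pixels

    def forward(self, images):
        # images: (batch, 3, H, W)
        feat = self.cnn(images)                  # (batch, dim, h, w)
        v = feat.flatten(2).transpose(1, 2)      # (batch, k, dim), k = h * w pixel features
        return v + self.semantic_emb             # v_hat_i = v_i + s_v
```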

Cross-Modality Module

After obtaining the sentence embedding vectors and pixel features, we combine all vectors to construct the input sequence. We also add two special tokens, [CLS] and [SEP], for learning a joint classification feature and for specifying the token length, respectively. The final input sequence to the joint-learning Transformer is formulated as

{[CLS], wˆ1, wˆ2, …, wˆn, [SEP], vˆ1, vˆ2, …, vˆk}
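A small sketch of how this joint input sequence could be assembled and fed to a standard Transformer encoder; the function name and the encoder configuration are illustrative assumptions, not the paper's exact setup:

```python
import torch

def build_joint_input(cls_emb, sep_emb, sentence_emb, pixel_emb):
    """Concatenate [CLS], sentence tokens, [SEP], and pixel features into one sequence.

    cls_emb, sep_emb: (batch, 1, dim) learned special-token embeddings
    sentence_emb:     (batch, n, dim) w_hat_1..w_hat_n from the sentence embedding module
    pixel_emb:        (batch, k, dim) v_hat_1..v_hat_k from the pixel embedding module
    """
    return torch.cat([cls_emb, sentence_emb, sep_emb, pixel_emb], dim=1)

# the joint sequence is then fed to a standard Transformer encoder, e.g.
# encoder = torch.nn.TransformerEncoder(
#     torch.nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
#     num_layers=12)
# joint_features = encoder(build_joint_input(cls, sep, w_hat, v_hat))
```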

To further check whether Pixel-BERT learns visual representations well through cross-modality attention between language and pixels, some intermediate attention maps on example inputs are shown below.

Fig. 3: Visualization of attention regions extracted from the first Transformer layer of Pixel-BERT. The attention regions are extracted by using a specific token as the query and the pixel features as the keys. Highlighted areas indicate regions with high attention scores.

The visualization results can be found in Fig. 3. From Case (A), we can see that the response areas of the tokens “dog”, “grass”, and “frisbee” are distributed over the correct regions. For Case (B), we can find that although “cutting” is a verb, it attends to the most related region, where the action of “cutting” is performed with a knife. From Case (C), we find that the token “room” attends to the correct region in the image.
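As a simplified, single-head sketch, the attention map for one token could be computed as below, assuming access to the query/key projections of the first Transformer layer; the actual implementation uses multi-head attention and is not published in this exact form:

```python
import torch

def token_to_pixel_attention(q_proj, k_proj, token_feature, pixel_features, fmap_h, fmap_w):
    """Attention scores of one language token over all pixel features.

    q_proj, k_proj:  query/key projection layers (e.g. nn.Linear) of the first Transformer layer
    token_feature:   (dim,) embedding of the chosen token, e.g. "dog"
    pixel_features:  (k, dim) pixel embeddings, k = fmap_h * fmap_w
    """
    q = q_proj(token_feature)                   # (dim,)
    k = k_proj(pixel_features)                  # (k, dim)
    scores = (k @ q) / (q.shape[-1] ** 0.5)     # scaled dot-product, (k,)
    attn = torch.softmax(scores, dim=0)
    return attn.reshape(fmap_h, fmap_w)         # heat-map over the image feature grid
```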

Semantic Aligned Multi-modal Transformer

Pre-trained models for vision-language tasks, such as ViLBERT, VL-BERT, and UNITER, ignore rich visual information such as object attributes and the relationships between objects. Without such information as contextual cues, the core challenge of ambiguity in visual grounding remains difficult to solve. Samformer uses visual scene graphs as the bridge to align vision-language semantics and tackle these challenges. Extracted from the image with a modern scene graph generator, a visual scene graph effectively depicts salient objects and their relationships.

Samformer (Semantic Aligned Multi-modal transFORMER) learns the alignment between the modalities of text, image, and graphical structure. For each object-relation label in the scene graph, the model can easily find the referring text segments in natural language and then learn to align them to the image regions already associated with the scene graph.

Fig. 4: A Visual question-answering example illustrating the effectiveness of using a scene graph as the bridge for cross-modal alignment.

Given an image-text pair (I, w), we first extract the visual scene graph G from the image with a scene graph generator. A scene graph is a directed graph whose nodes represent objects and whose edges depict their pairwise relationships. We embed the tokens in both the text sequence w and the scene graph triplets with a pre-trained BERT embedder. We then extract the visual embedding of each image region, as well as the union region of each triplet, with Faster R-CNN.

Fig. 5: Architecture of the Samformer.

All the embedding vectors are then fed into a transformer network with self-attention mechanisms to infer the alignment, as shown in Figure 5.
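As a rough sketch of the input preparation, the snippet below embeds hypothetical scene graph triplets with a pre-trained BERT encoder via the Hugging Face transformers library; the triplets, and the way region features would be attached, are illustrative assumptions rather than the Samformer code:

```python
import torch
from transformers import BertTokenizer, BertModel

# hypothetical triplets produced by an off-the-shelf scene graph generator
triplets = [("girl", "riding", "horse"), ("horse", "on", "beach")]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def embed_triplet(subj, rel, obj):
    # embed the triplet tag "subj rel obj" with the pre-trained BERT embedder
    ids = tokenizer(f"{subj} {rel} {obj}", return_tensors="pt")
    with torch.no_grad():
        return bert(**ids).last_hidden_state.squeeze(0)   # (num_tokens, 768)

triplet_embeddings = [embed_triplet(*t) for t in triplets]
# region features for each object and for the union box of each triplet would be
# extracted with Faster R-CNN (see the next section) and concatenated with the
# text and triplet embeddings before feeding the joint Transformer.
```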

Faster R-CNN

Fig. 6: Faster R-CNN

Here, I have provided the Faster R-CNN architecture, a region-based object detector used by the models above. Faster R-CNN consists of an RPN as the region proposal algorithm and Fast R-CNN as the detector network. An in-depth discussion of Faster R-CNN is beyond our agenda; for more information, see reference [3].
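For readers who want to try the detector itself, the off-the-shelf Faster R-CNN in torchvision can be run as follows (assuming a recent torchvision release); this is only an illustration of the kind of region detector the models above rely on:

```python
import torch
import torchvision

# Faster R-CNN with a ResNet-50 FPN backbone, pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)           # a dummy RGB image tensor in [0, 1]
with torch.no_grad():
    detections = model([image])[0]        # dict with boxes, labels, scores for one image

print(detections["boxes"].shape, detections["scores"][:5])
```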

Conclusions

We discussed the visual embedding methods commonly used in existing work and the limitations of region-based visual representations. We saw how a CNN-based visual encoder can be combined with multi-modal Transformers to build Pixel-BERT in an end-to-end manner, producing a more accurate and more thorough embedding between visual and linguistic content at the pixel and text level. We also covered Samformer, a semantic aligned multi-modal transformer model for vision-language pre-training, which explicitly aligns visual scene graphs and text using triplet tags together with the visual embeddings.

References:

[1] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers: https://arxiv.org/pdf/2004.00849.pdf

[2] Semantic Aligned Multi-modal Transformer for Vision-Language Understanding: A Preliminary Study on Visual QA: https://aclanthology.org/2021.maiworkshop-1.11.pdf

[3] Faster R-CNN for object detection: https://towardsdatascience.com/faster-r-cnn-for-object-detection-a-technical-summary-474c5b857b46

[4] Basic intuition of Conversational Question Answering Systems (CQA): https://medium.com/@ardeshnabhavik/basic-intuition-of-conversational-question-answering-systems-cqa-cf79bb5fa1d6

[5] Cascading Adaptors to Leverage English Data to Improve Performance of Question Answering for Low-Resource Languages: https://arxiv.org/abs/2112.09866
