# Cross Domain Normalization: NLP In the Visual World (Part 2)

In the previous part we saw how important it is to understand the nature of vision-language interaction. We focused on grounding textual phrases in images and uncovered a crucial mathematical problem we must address when combining different domains with different distributions. In this part we propose a solution: Cross Domain Normalization (CDN). Despite its simplicity, CDN lets us learn much faster and outperform all state-of-the-art results. As we will show, this improvement comes from balancing the way the two domains affect each other.

*Our Models*

As we’ve discussed, our goal is to test how (and if) the statistics of the two domains affect the interaction between them. Thus, we needed a simple model in which we can isolate the required elements. We also wanted our model to represent the underlying structure of today’s solutions. With this in mind, our baseline is a variation of the supervised GroundeR model in [1], which we’ll refer to as Cross Domain Grounder (CDG). It has the same structure as we’ve described:

- Language Model: During training of the overall model, we keep two pre-trained word embeddings, but only one is fine-tuned together with the entire model, while the other remains fixed. Given a phrase p = {w1, w2, …, wn}, we embed each word wt by summing its word vectors from the two embeddings and feed the result to an LSTM cell whose hidden state has 200 dimensions.

- Image Domain: We crop each bounding box and feed it to a VGG16 model, pre-trained on ImageNet, and extract a 4,096-dimensional feature vector from the fully connected fc7 layer. As in previous work, the image model is fixed during training; that is, the VGG16 model is not trained together with the rest of the model. Finally, we concatenate this vector with another vector that encodes the bbox’s spatial location.

- Matching Model: We use attention to compute the matching score between each bbox and the phrase. To do so, we first compute the attention score: a'_i = v_m^T ReLU(W_lang v_lang + W_bbox v_bbox + b), where v_lang and v_bbox are the outputs of the language and image models, W_lang v_lang and W_bbox v_bbox are the outputs of the projection layers (each with 200 dimensions), v_m is a matching vector, and b is a bias. Finally, we normalize the scores of the image’s bboxes using softmax. As our loss, we use cross-entropy over the bbox scores.
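To make the matching step concrete, here is a minimal NumPy sketch of the attention scoring described above. This is not the authors' implementation; the function name and shapes are illustrative (a 200-dimensional projection space, as in the text).

```python
import numpy as np

def attention_scores(v_lang, v_bboxes, W_lang, W_bbox, v_m, b):
    """Score each bounding box against the phrase encoding.

    v_lang:   (d_lang,)      phrase vector from the language LSTM
    v_bboxes: (k, d_img)     one row per bounding-box feature vector
    W_lang:   (200, d_lang)  projection layer for the language vector
    W_bbox:   (200, d_img)   projection layer for the bbox features
    v_m:      (200,)         matching vector
    b:        (200,)         bias
    """
    # a'_i = v_m^T ReLU(W_lang v_lang + W_bbox v_bbox_i + b)
    h = np.maximum(0.0, W_lang @ v_lang + v_bboxes @ W_bbox.T + b)  # (k, 200)
    raw = h @ v_m                                                   # (k,)
    # normalize over the image's boxes with a softmax
    e = np.exp(raw - raw.max())
    return e / e.sum()
```

The softmax output can then be fed directly into a cross-entropy loss against the ground-truth box index.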

One natural way to manipulate the cross domain statistics is by adding common normalization layers over the image and language models. Therefore, we also test **CDG+BN**, which has two Batch Normalization (BN) layers over the outputs of the two models, and **CDG+LN**, which uses Layer Normalization (LN) instead of BN. In addition, we’ve tested **CDG+lang_BN** and **CDG+image_BN**, with Batch Normalization *only* over the language model and *only* over the image one, respectively. The purpose of these two models is to test the effect of setting two different statistics for the two domains. Below you can see our baseline+BN.
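For reference, here is a minimal NumPy sketch of the two normalization schemes the variants use (inference-style, with given scale and bias; not the authors' code). BN normalizes each feature across the batch, while LN normalizes each sample across its features:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x: (batch, features); normalize each feature over the batch dimension
    mu, var = x.mean(axis=0), x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x: (batch, features); normalize each sample over its feature dimension
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```

CDG+BN would apply `batch_norm` to both the language and image outputs, while CDG+lang_BN and CDG+image_BN apply it to only one of them.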

*Cross Domain Normalization (CDN)*

Recent normalization architectures like BN and LN aim to learn good statistics by utilizing two trainable variables: a scaling variable that controls the parameters’ variance and a bias that controls their mean. These architectures rest on a problematic premise: that the same statistics should be set for all parameters, and that if this default behavior is not optimal, the techniques should be able to learn better statistics. But how can we *learn* better statistics if the default behavior yields bad *learning*?

Indeed, in our case the imbalanced interaction between the two domains prevents these methods from learning good statistics. This problem can be solved with CDN. All we need to do is add two LN layers, one over the language model and another over the image model, and scale their outputs. That is:

v_lang ← s_lang · LN(v_lang),  v_image ← s_image · LN(v_image)

where s_lang and s_image are two hyperparameters which we find using random search. Note that CDN also contains *trainable* scaling factors, as it utilizes LN in a constrained form.
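The whole mechanism fits in a few lines. Below is a minimal NumPy sketch (again illustrative, not the authors' implementation) that applies LN to each domain's output and scales it by the corresponding hyperparameter; the trainable LN scale and bias are omitted for clarity:

```python
import numpy as np

def cdn(v_lang, v_image, s_lang, s_image, eps=1e-5):
    """Cross Domain Normalization: LayerNorm each domain's output,
    then scale by a per-domain hyperparameter (found via random search)."""
    def ln(x):
        # normalize each sample over its feature dimension
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)
    return s_lang * ln(v_lang), s_image * ln(v_image)
```

After this step, the language vectors have standard deviation roughly s_lang and the image vectors roughly s_image, which is exactly the knob CDN uses to balance how strongly each domain drives the other's updates.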

While CDN is extremely simple to implement, the theory behind it is rooted in the interaction between different parameters that hold different information – one of the most important factors in DL optimization. Hence, with CDN we see a dramatic improvement in accuracy, generalization and speed (up to 100 times faster than our baseline and 19 times faster than Batch Normalization). With such a high convergence rate, we can find the best values for s_lang and s_image extremely fast. Furthermore, we found the same s_lang and s_image to be optimal across several different datasets! This supports our claim that the source of our problem is the different nature of linguistic and visual information, not the structure of our datasets.

In the next part of this blog series we’ll test our models on today’s popular benchmarks for grounding textual phrases in images. We’ll see the dramatic effect that statistics has on our results and on the dependency between the language and image models update rates. Our experiments will reveal the huge improvement in generalization, speed and accuracy which CDN provides. Finally, we’ll introduce the mathematical relationship between cross domain statistics and the interaction between our domains. Apart from explaining our experimental observations, it will also shed some light on the benefits of normalization techniques.
