April 24, 2019 | By

Cross Domain Normalization: NLP In the Visual World (Part 3)

Previously we’ve discussed the task of grounding textual phrases in images. We saw how important this task is, both for applications that combine visual and linguistic information and for building better language and image models. However, we also found a fundamental mathematical problem caused by the nature of the two domains and predicted that this problem would yield an imbalanced interaction between them. With this in mind, we proposed a new normalization technique: Cross Domain Normalization (CDN).

In this part we’ll test our hypothesis on today’s common datasets for this task. Despite the simplicity of our model, with CDN we outperform all of today’s state-of-the-art models, which are much more complex. Importantly, our competitors also use normalization methods together with many other regularization techniques that require different hyperparameters. In contrast, CDN is the only regularization we’ve used.

We’ll also show empirically how cross domain statistics affect learning and that CDN is indeed the source of our success: it significantly reduces overfit and speeds up convergence in all experiments. We’ll then dive into the mathematical relationship between the language and visual domains which motivated this work.

Cross Domain Statistics

We start with the RefCLEF dataset. It contains a set of images and, for each image, a set of manually gathered bboxes (bounding boxes). Each bbox is paired with one or more phrases that refer to the object bounded by the bbox. We’ve tested our baseline together with CDG+BN and CDG+LN in order to see whether BN (Batch Normalization) and/or LN (Layer Normalization) can stabilize the interaction between the two domains by learning better statistics.

Without normalization, the variances of the language and image embeddings are about 0.01 and 0.8, respectively. To analyse the effect of cross domain statistics we’ve also tested CDG+lang_BN, with a BN layer only over the language model, which increases the language embedding variance from about 0.01 to 1, and CDG+image_BN, with BN only over the image model, which increases the image embedding variance to 1. Finally, we’ve tested CDG+CDN and CDG+SBN. SBN (Scaled Batch Normalization) is similar to CDN but uses BN instead of LN. For SBN we have s_image = 1000 and s_lang = 13.16, and for CDN s_image = 43.5 and s_lang = 6.25.
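To make the difference between these variants concrete, here is a minimal sketch of the scaled-normalization idea. It assumes that CDN amounts to layer-normalizing each embedding and then multiplying it by a fixed per-domain scale; that form is inferred from the description above, not a confirmed implementation. The function names and the toy vectors are hypothetical. Only the two scale values come from the experiment.

```python
import math

def layer_norm(vec, eps=1e-8):
    """Normalize one embedding vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def cdn(vec, scale):
    """Assumed form of CDN: layer normalization followed by a fixed per-domain scale."""
    return [scale * v for v in layer_norm(vec)]

def std(vec):
    """Population standard deviation of a vector."""
    mean = sum(vec) / len(vec)
    return math.sqrt(sum((v - mean) ** 2 for v in vec) / len(vec))

S_IMAGE, S_LANG = 43.5, 6.25  # scales reported for CDG+CDN

# Toy embeddings with very different raw spreads (illustrative values only).
image_emb = [0.9, -1.2, 0.4, 2.1, -0.7, 0.3]
lang_emb = [0.01, -0.02, 0.03, 0.00, -0.01, 0.02]

image_out = cdn(image_emb, S_IMAGE)
lang_out = cdn(lang_emb, S_LANG)

# After the assumed CDN step, each spread is set by the scale, not by the raw input.
print(round(std(image_out), 2), round(std(lang_out), 2))
```

In this sketch the statistics of the two embeddings are fixed a priori by the two scales and no longer depend on how the raw embeddings happen to be distributed. Swapping `layer_norm` for a batch-wise normalization would give the SBN-style variant.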


Model | Test Accuracy (%) | Train Accuracy (%) | Test Loss | Train Loss
CDG | 66.0 | 99.3 | 3.64 | 0.470
CDG+image_BN | 72.7 | 99.0 | 2.94 | 0.130
CDG+lang_BN | 80.0 | 100.0 | 3.03 | 0.002
CDG+BN | 81.9 | 100.0 | 3.15 | 0.004
CDG+LN | 80.0 | 100.0 | 2.75 | 0.014
CDG+SBN | 83.1 | 95.0 | 1.25 | 0.519
CDG+CDN | 84.6 | 95.0 | 1.113 | 0.475

We can see that LN and our BN variants significantly improve accuracy, yet the differences in loss and accuracy between the train and test sets reveal severe overfit. With SBN and CDN we see even better accuracy. Furthermore, CDN reached the best result observed on CDG 100 times faster, and it passed the best result of CDG+BN 5 times faster.
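The size of the overfit can be read off the table as the gap between the train and test numbers. The short script below recomputes those gaps from the reported values. The dictionary layout and the `gaps` helper are hypothetical names introduced only for this illustration.

```python
# (test accuracy %, train accuracy %, test loss, train loss), copied from the table above
results = {
    "CDG":          (66.0, 99.3, 3.64, 0.470),
    "CDG+image_BN": (72.7, 99.0, 2.94, 0.130),
    "CDG+lang_BN":  (80.0, 100.0, 3.03, 0.002),
    "CDG+BN":       (81.9, 100.0, 3.15, 0.004),
    "CDG+LN":       (80.0, 100.0, 2.75, 0.014),
    "CDG+SBN":      (83.1, 95.0, 1.25, 0.519),
    "CDG+CDN":      (84.6, 95.0, 1.113, 0.475),
}

def gaps(row):
    """Return (train - test accuracy, test - train loss) for one table row."""
    test_acc, train_acc, test_loss, train_loss = row
    return round(train_acc - test_acc, 1), round(test_loss - train_loss, 3)

for model, row in results.items():
    acc_gap, loss_gap = gaps(row)
    print(f"{model:13s} accuracy gap {acc_gap:5.1f}   loss gap {loss_gap:6.3f}")
```

The accuracy gap shrinks from about 33 points for CDG and about 18 points for CDG+BN to about 10 points for CDG+CDN, and the loss gap shrinks from above 3 to below 1.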

But more markedly, we see a remarkable reduction in overfit! The figure below shows just how dramatic this reduction is.


The first and second rows show the accuracy and loss plots respectively. With BN the overfit is huge: we can see an increase in loss even when the accuracy improves. This indicates a very low confidence level, which will make it infeasible to identify images in which the referred object is not present, a basic requirement for any real-life application. However, with CDN we see much more stable learning behavior with much lower overfit. The success of CDN indicates that BN and LN cannot find better statistics even though they exist.

The source of this improvement can be seen in the third row, where we plot the average gradient norm w.r.t. the image and language projection layers. The plots confirm our expectations. First, we see that with BN the slope of the language plot is much higher than that of the image plot. Second, we see a clear dependency between the two gradient plots: even small noise in the image plot is amplified in the language one. In other words, the language parameters are highly affected by the image parameters. However, with CDN we see no dependency! The learning is much more stable and smooth!

RefCLEF – Automatically Generated Bounding Boxes

A more realistic experiment is one in which the bboxes are generated automatically without the help of any annotator. Similar to other models with which we compare our results, we use the EdgeBox algorithm to generate 100 bboxes per image. Note that in some cases the ground truth bbox was not generated; hence, the best accuracy we can get is 59.38%. The results can be seen in the table below.

Model | Accuracy (%)
SCRC | 17.93
GroundeR | 26.93
MBC | 28.91
Comprehension | 31.85
EB+QRN (VGGcls-SPAT) | 32.21
CDG | 22.40
CDG+CDN | 33.60
Accuracy upper bound | 59.38


The improvement brought by CDN over CDG is even more significant for the noisy, automatically generated bboxes. CDG+CDN also outperforms GroundeR, whose infrastructure is similar to CDG+BN, by a relative improvement of about 25%. Moreover, it outperforms models whose architectures are much more complex than the simple mechanism of CDG+CDN.
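The relative figures quoted here follow from the accuracy table by simple arithmetic. The snippet below is a small sketch of that calculation. The helper name `relative_improvement` is hypothetical, and the numbers are the reported accuracies.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Accuracies (%) from the table above
GROUNDER, CDG, CDG_CDN, UPPER_BOUND = 26.93, 22.40, 33.60, 59.38

over_grounder = relative_improvement(CDG_CDN, GROUNDER)
over_cdg = relative_improvement(CDG_CDN, CDG)
share_of_upper_bound = 100.0 * CDG_CDN / UPPER_BOUND

print(f"CDG+CDN vs GroundeR: {over_grounder:.1f}% relative improvement")
print(f"CDG+CDN vs CDG:      {over_cdg:.1f}% relative improvement")
print(f"Share of the reachable upper bound: {share_of_upper_bound:.1f}%")
```

Because the ground truth bbox is missing from the candidates in some cases, the last number is the more informative way to read the 33.60% accuracy: it is measured against 59.38%, not against 100%.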

Below you can see the examples from part 1, where CDG+CDN found the right bbox (highlighted in green) while GroundeR found the wrong one (highlighted in red). Unlike GroundeR, with CDN our language model can now understand negation (see picture a). Furthermore, it can understand the relationship between objects much better. In picture c, CDG+CDN can find the object even though the phrase refers to it as a ‘thing’ (remember Chewbacca from the first part of this blog?)




Next we move to RefCOCO, RefCOCO+ and RefCOCOg. The images of these datasets were gathered from COCO. RefCOCO and RefCOCO+ split the test set into two sets: testA, whose images contain multiple people, and testB, whose images contain multiple non-human objects. Just like with RefCLEF, we test our model both with manually annotated bboxes and with automatically generated ones. To generate the candidate bboxes we’ve followed the same practice as our competitors and used Faster-RCNN.

Importantly, with automatically generated bboxes, the models with which we compare our results all use Faster-RCNN pre-trained on COCO’s images as their image models, whereas we use VGG16 pre-trained on ImageNet. Empirically, Faster-RCNN yields better results than VGG16 for this task. Nevertheless, even with our inferior image model, pre-trained on ImageNet and not COCO, our model significantly outperforms theirs. The tables below show the results for the two experiments.

Finally, we compare the generalization capabilities and training speed between CDG with no normalization, with BN and with CDN. The tables below show the results.



Under the ‘train/test loss’ column we see the loss achieved on the train/test sets by the corresponding model, with the same result for CDG+CDN in parentheses. Under the ‘epochs’ column we see the epoch in which the model got its best results, with the epoch in which CDG+CDN achieved (at least) the same result in parentheses.

Our results are consistent throughout all experiments. With CDN our model trains much faster than with BN (up to 19 times faster) or with no normalization at all. CDN also reduces overfit dramatically. Note that in most cases BN actually increases overfit.

While CDN was tested only for the grounding task, the rationale behind it extends far beyond this task. For the first time we can see empirically how co-adaptation is affected by the parameter statistics and how these statistics should be manipulated in order to achieve optimal results. Hence, it’s reasonable to assume that the same phenomenon occurs in many other tasks in which co-adaptation is expected and that CDN will be effective for other multi-domain tasks. We hope to confirm these assumptions in future work. In the meantime, we’ve built our own app to see what our model can really do. The bbox candidates were automatically generated and Mask-RCNN segmented the objects. Here are some examples.


Mathematical Analysis

The empirical results presented above indicate the importance of CDN in cross-domain settings. Our hypothesis was further confirmed by the gradient plots above: CDN removes the observed correlation between the gradients flowing through each domain. In this section we’ll explain the theoretical analysis that motivated our work.

Let X = {x_1, x_2, …, x_n} and Y = {y_1, y_2, …, y_cn} be the parameters of the language and image projection layers respectively, where c >> 1 (in our case, c = 4101/200 ≈ 20). To see how a change in the parameters of one domain affects the parameters of the other, we define:


Here f is some matching function. These two snapshots of the Jacobian indicate the change in the gradient toward one domain that corresponds to changes in the other. For example, changing only y_i will cause the gradient w.r.t. X to change by J^i_{y~x} (the i-th column of J_{y~x}).
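Written out, one reading of this definition that is consistent with the dimensions used in the rest of this section is given below. The exact form is an assumption reconstructed from the prose description (the columns of J_{y~x} are indexed by the parameters of Y, and it has more columns than rows):

```latex
J_{y \sim x} = \frac{\partial\,(\nabla_x f)}{\partial Y} \in \mathbb{R}^{n \times cn},
\qquad
J_{x \sim y} = \frac{\partial\,(\nabla_y f)}{\partial X} \in \mathbb{R}^{cn \times n},
\qquad
\left[ J_{y \sim x} \right]_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial y_j}
```

Under this reading, the i-th column of J_{y~x} collects the mixed second derivatives of f with respect to y_i and every parameter of X.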

Since J_{y~x} projects the cn-dimensional vector ∇_y f into an n-dimensional space, a parameter in X is likely to be affected by many parameters in Y. Hence, while ∇_y f has cn dimensions, their effects accumulate in the parameters of X, whose information lies in a space of at most n dimensions. Note that there’s a dependency between the column vectors of J_{y~x} (as there are more columns than rows) and this dependency controls the accumulated effect.

On the other hand, due to the symmetry of the second derivative we also have J_{y~x} = J_{x~y}^T. Therefore the same dependency exists between the rows of J_{x~y}; in other words, the same dependency indicates how changes in X are distributed over the different components of ∇_y f. It follows that small changes in a few of Y’s parameters are likely to accumulate into a big change in one of X’s parameters. At the same time, a small change in that same parameter of X will generate only small changes in a few of Y’s parameters. Thus, as the number of non-zero column vectors of J_{y~x} increases, the dependency between X and Y becomes stronger and more imbalanced.
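As a toy illustration of this accumulation (not the model used in our experiments), consider a hypothetical bilinear matching function f(x, y) = xᵀAy, for which the two Jacobian blocks are simply A and its transpose. The all-ones matrix A, the tiny sizes and every name below are assumptions chosen so that the individual effects line up and the imbalance is easy to see.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]

n, c = 2, 3  # X has n parameters, Y has c * n parameters

# Toy mixed second derivatives d^2 f / (dx_i dy_j): shape n x cn, all ones.
A = [[1.0] * (c * n) for _ in range(n)]

J_y_to_x = A             # how changes in Y move the gradient w.r.t. X
J_x_to_y = transpose(A)  # how changes in X move the gradient w.r.t. Y (symmetry)

eps = 0.01
delta_y = [eps] * (c * n)          # a small change in every parameter of Y
delta_x = [eps] + [0.0] * (n - 1)  # the same small change in one parameter of X

change_in_grad_x = matvec(J_y_to_x, delta_y)  # each entry sums c * n small effects
change_in_grad_y = matvec(J_x_to_y, delta_x)  # each entry receives one small effect

print(change_in_grad_x)
print(change_in_grad_y)
```

In this toy setting each component of the gradient w.r.t. X moves by about c·n·eps, while each component of the gradient w.r.t. Y moves by only eps, so the imbalance grows with the number of image-side parameters.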

CDN balances the interaction between the two domains by constraining the following ratio:


Here η∇_y f is the gradient w.r.t. Y’s parameters times η, which can be interpreted as the learning rate, and the transformation J_{y~x}(η∇_y f) tells us how the changes in Y’s parameters affect ∇_x f. Similarly, J_{x~y}(η∇_x f) tells us how the changes in X’s parameters affect ∇_y f. Insight into the relation between the cross domain statistics and (**) can be gained by adding two scaling layers above the language and image projection layers:

A simple chain rule yields:

Hence, it follows that:


Note that scaling the outputs of the language and image projection layers (whose parameters are X and Y) is equivalent to scaling X and Y. Hence, having the image embedding variance be much higher than that of the language embedding is equivalent to setting s_x >> s_y. In this setting, even if we have:


 we’ll have that


This is a crucial point for the grounding problem, as the language model commonly involves an RNN whose initialization variance is small (a common practice that aims to avoid the saturated regime), while the image model is commonly pre-trained and fixed during training and its output is likely to have higher variance. Indeed, the gradient plots of CDG+image_BN are the result of having s_x >> s_y, where we see that the slope of the plot of the gradient w.r.t. X (i.e., language) is much bigger. Furthermore, we can see a clear dependency between the slopes of the two plots!

The traditional solution is to normalize both embeddings using common methods like BN or LN. However, having s_x = s_y is not necessarily a good solution: assuming that, on average, a parameter in X is as sensitive to Y’s updates as a parameter in Y is to X’s (i.e., the average absolute value of a dimension in J_{x~y}(η∇_x f) is about the same as in J_{y~x}(η∇_y f)), then ||J_{x~y}(η∇_x f)||2 grows quadratically in c. Therefore, if the interaction between the two domains’ parameters is balanced, then as c increases, the ratio in (**) becomes smaller. It follows that if c > 1, keeping (**) smaller than one is beneficial.

CDG+BN sets the same statistics for both domains. Indeed, we can still see a similar dependency in its gradient plots, which is evidence that the assumption in the previous paragraph doesn’t hold. Note that CDG+lang_BN is similar to CDG+BN since the variance of the image embeddings is about one. Yet, BN manages to reduce the ratio between the slopes. Our measurements of the variances of the two embeddings indicate that BN does learn better statistics: it actually tries to increase s_y and decrease s_x; however, this change is too slow due to the poor learning generated by the imbalanced interaction.

This inspired our work, in which we search for an a priori optimized initialization of s_y and s_x in order to minimize (**), thereby stabilizing the learning behavior from the very beginning of training. Indeed, we found that having s_y >> s_x yields a significant improvement by relaxing the observed dependency between the domains’ gradients.



CDN provides empirical evidence for the ability of normalization to balance the interaction between co-adapting parameters. Note that if the co-adaptation is distributed evenly between parameters (c ≈ 1), normalizing the embeddings with the same variance might be sufficient. We believe that the experimental framework presented in this work, where we create a simple partitioning between co-adapting parameters, can lead to general insight into parameter co-adaptation.

In addition, CDN significantly improves accuracy, speeds up training, and yields better generalization in all our experiments. The simple CDG+CDN model consistently outperforms much more complex models. CDN also sheds light on the effect of normalization and is likely to be beneficial for a broad set of problems that combine multiple domains.


<< Part 2
