Cross Domain Normalization: NLP In the Visual World (Part 1)
In this blog post we’ll tackle an extremely important task for any AI agent that wants to interact in a human-like manner – the task of grounding textual phrases in images. We are given an image together with a set of candidate bounding boxes (bboxes), each bounding one of the image’s objects. We are also given a referential phrase and need to find the bbox that bounds the referred object. In the example below, the referred bounding box is highlighted in green.
While many multi-media applications can benefit from the ability to ground language in the visual world, it turns out we can also use this task to study a fundamental building block of human-like learning while investigating one of the hardest problems in Deep Learning: co-adaptation.
This realization motivated the study I’ve conducted together with Prof. Michael Elhadad from Ben-Gurion University. We introduce Cross Domain Normalization (CDN), which extends normalization mechanisms by manipulating the cross-domain statistics. While CDN is extremely easy to implement, it improves our model’s performance dramatically while significantly reducing overfitting and speeding up convergence (up to 19 times faster than Batch Normalization).
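As a rough preview of the idea, CDN can be sketched as ordinary standardization followed by a domain-specific rescaling. This is a minimal, hypothetical sketch assuming CDN assigns each domain its own target standard deviation (the exact formulation and its derivation come in the next part); the function name and all constants here are my own placeholders:

```python
import numpy as np

def cross_domain_normalize(x, target_std, eps=1e-5):
    # Standardize each feature across the batch, then rescale it to a
    # domain-specific target standard deviation. Unlike Batch Normalization,
    # the two domains receive DIFFERENT target statistics.
    mu = x.mean(axis=0, keepdims=True)
    sigma = x.std(axis=0, keepdims=True)
    return (x - mu) / (sigma + eps) * target_std

rng = np.random.default_rng(1)
lang_proj = rng.normal(0, 5, size=(64, 16))  # toy language projection outputs (batch of 64)
img_proj = rng.normal(0, 2, size=(64, 16))   # toy image projection outputs

lang_out = cross_domain_normalize(lang_proj, target_std=1.0)
img_out = cross_domain_normalize(img_proj, target_std=0.1)  # smaller scale for the visual side
```

Giving the visual side a smaller scale than the linguistic one is exactly the kind of statistic manipulation discussed later in this post.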
Learning Language and Vision Together
Obviously, solving this problem is crucial for numerous applications such as robotics, image retrieval, image Q&A and more. However, it’s also an important milestone in how we humans learn language and vision. Cognitive and psychology studies show that visual aids significantly improve language learning. This is pretty intuitive, but let’s look at an example. Imagine we get the following referential phrase:
‘The man to the right of Chewbacca’
We can’t understand the meaning of this phrase without knowing who or what Chewbacca is. What if instead we see the following image:
We can see the weird hairy dude but we have no idea that this is Chewbacca. But what if we see both the image and the phrase? It’s easy now! There’s only one man in the image and he’s standing to the right of Chewbacca. Not only can we now find Chewbacca, we can also learn how to find him in future images, and we’ve done it with only one example.
Another advantage here is the fact that utilizing the relations between objects is crucial for this task (e.g., ‘the woman to the right of the tall building’). It follows that this problem is a great opportunity for us to learn syntax.
But before you get all fired up, it turns out that there’s a mathematical problem we first need to solve. The problem is rooted in the different nature of linguistic and visual information.
Co-adaptation is a huge problem in Deep Learning, known to be a major cause of overfitting and slow learning. Empirically, it’s usually extremely hard to characterize the nature of co-adaptation. Just imagine having millions of different parameters: there is pretty much an infinite number of ways in which co-adaptation can take place. So what makes our problem so different? To understand that, we need to understand the typical structure of today’s solutions.
In the above image we see:
- The language model (for which an RNN is commonly used) embeds the phrase
- The image model (CNN) embeds the bbox together with some visual context (e.g., the entire image)
- The two projection layers project the two embeddings into a common space
- Finally, the outputs of these two projection layers are given to a matching model which decides whether the phrase refers to the bbox.
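The four components above can be sketched in a few lines of NumPy. This is a minimal illustration only: the dimensions, the linear projections, and the cosine-similarity matching are my own placeholders, not the exact model from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; note the image embedding is larger than the language one,
# reflecting the asymmetry discussed below.
d_lang, d_img, d_common = 8, 32, 16

phrase_emb = rng.normal(size=d_lang)  # stand-in for the RNN's phrase embedding
bbox_emb = rng.normal(size=d_img)     # stand-in for the CNN's bbox (+ context) embedding

# The two projection layers, mapping each domain into the common space
W_lang = rng.normal(size=(d_lang, d_common))
W_img = rng.normal(size=(d_img, d_common))

p = phrase_emb @ W_lang  # language projection
v = bbox_emb @ W_img     # image projection

# A simple matching model: cosine similarity between the two projections
score = float(p @ v / (np.linalg.norm(p) * np.linalg.norm(v)))
```

The two projection matrices (`W_lang` and `W_img` here) are the clearly partitioned sets of co-adapting parameters the rest of the post focuses on.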
From this description it’s clear that the image domain (in green) and the language domain (in blue) must co-adapt. However, all we really need to understand is the interaction between the two projection layers, as they control the learning behavior of the image and language models.
The bottom line here is that we have a clear partitioning between two sets of co-adapting parameters (the parameters of the two projection layers). This makes it much easier to study how these parameters affect each other and how we can manipulate co-adapting parameters. It gets even more interesting due to the different nature of the information which these two domains represent.
Ambiguity vs Clarity
Let’s say you have the phrase ‘man’ with the following image:
It’s easy to find the referred object in the image. Now what if the man in the picture was taller or thinner? What if his hair color or maybe his pose was different? It doesn’t really matter, we would still be able to find the referred object just by looking at the phrase ‘man’!
This example shows a fundamental difference in how similar concepts are represented in the two domains: while a phrase can capture the abstract and ambiguous meaning of the objects it describes, an image must include specific visual features. Thus, in general, the visual representation is likely to hold much more information than the corresponding language representation. Note that this information can be important in certain cases (e.g., ‘the tall man’).
Indeed, image embeddings are commonly much larger than language embeddings, and our dataset will have different visual objects with different embeddings grounded by the same phrase with the same (smaller) language embedding. Therefore, a language parameter is likely to be affected by many visual parameters.
This can have a destructive impact on the learning behavior! Later in this post we’ll analyze the implications mathematically, but first let’s get some intuition.
Imagine we have a language parameter whose learned value depends on the values of n visual parameters. Changing each one of these n parameters will have a cumulative effect on the language feature! On the other hand, the effect of changing the language parameter will be distributed over the n visual parameters! As a result, learning can become very unstable: a small change in the visual parameters will cause a big change in the co-adapting language parameter, which will then cause another change in the visual parameters, and so forth.
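Here is a toy numeric illustration of that asymmetry. It assumes the simplest possible coupling, with the language feature tracking the sum of the visual parameters; that coupling is an assumption for illustration only, not the actual model dynamics.

```python
import numpy as np

n = 100                    # number of visual parameters co-adapting with one language parameter
visual = np.full(n, 0.5)   # toy visual parameters

def language_feature(visual_params):
    # Toy coupling: the language feature tracks the sum of the visual parameters
    return visual_params.sum()

before = language_feature(visual)
visual_perturbed = visual + 0.01  # a small change to EVERY visual parameter...
after = language_feature(visual_perturbed)

cumulative_effect = after - before  # ...accumulates to n * 0.01 = 1.0
per_visual_effect = 0.01 / n        # a language change of 0.01 spread over n parameters
```

A tiny per-parameter change on the visual side adds up to a large shift in the language feature, while the same-sized change on the language side is diluted a hundredfold.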
Our solution is simple. Imagine we only have two co-adapting parameters which can take any value between zero and one:
What will happen if we increase the range of the blue parameter?
Changing the value of the green parameter by 0.1 is significant; after all, it’s 10% of the entire range. However, compared to the range of the blue parameter, 0.1 is extremely small. We found that this decreases the effect that changes made to the green parameter have on the blue one. On the other hand, it turns out that changing the blue parameter by 10% of its range (= 1.0, the entire range of the green parameter) will now have a much bigger effect on the green one.
As a result, if the image’s parameters have lower variance than the language ones, their cumulative effect on the co-adapted language’s parameter will be smaller, which is what we wanted. At the same time, changing a language’s parameter will have bigger effect on the image’s co-adapted parameters. It follows that by setting the right variances for the two domains, we can balance the effect they have on each other.
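The arithmetic behind this balancing act can be made explicit. The specific ranges below are illustrative, and measuring an update relative to the other domain’s range is a simplification of the mathematical analysis that comes later:

```python
# Relative impact of a fixed-size update, measured against the other domain's range
green_range = 1.0   # "green" (e.g., image-side) parameter range
blue_range = 10.0   # "blue" (e.g., language-side) parameter range after increasing its variance

green_step = 0.1               # 10% of the green range
blue_step = 0.1 * blue_range   # 10% of the blue range = 1.0

effect_on_blue = green_step / blue_range    # 0.01: the green update barely moves blue
effect_on_green = blue_step / green_range   # 1.0: the blue update spans green's whole range
```

Tuning the two ranges (i.e., the two domains’ variances) is exactly the knob CDN turns to balance the cross-domain effects.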
Below you can see some examples where the red bounding boxes were produced by our baseline with Batch Normalization and the green bounding boxes are the results of replacing Batch Normalization with Cross Domain Normalization. In the next part we’ll see how all state-of-the-art models are outperformed by our extremely simple baseline just by adding CDN. We’ll see a huge reduction in overfitting and up to 19 times (!!) faster training compared to Batch Normalization. We’ll also develop the mathematical relationship between cross-domain statistics and the effect the two domains have on each other, which will give us important insight into the reasons behind the benefits of today’s normalization techniques.