Weekly Summary and Review #1

Last updated on October 28, 2024

Paper Reading

Balanced Multimodal Learning via On-the-fly Gradient Modulation

Multimodal learning helps to comprehensively understand the world, by integrating different senses. Accordingly, multiple input modalities are expected to boost model performance, but we actually find that they are not fully exploited even when the multimodal model outperforms its uni-modal counterpart. Specifically, in this paper we point out that existing multimodal discriminative models, in which uniform objective is designed for all modalities, could remain under-optimized uni-modal representations, caused by another dominated modality in some scenarios, e.g., sound in blowing wind event, vision in drawing picture event, etc. To alleviate this optimization imbalance, we propose on-the-fly gradient modulation to adaptively control the optimization of each modality, via monitoring the discrepancy of their contribution towards the learning objective. Further, an extra Gaussian noise that changes dynamically is introduced to avoid possible generalization drop caused by gradient modulation. As a result, we achieve considerable improvement over common fusion methods on different multimodal tasks, and this simple strategy can also boost existing multimodal methods, which illustrates its efficacy and versatility. The source code is available at https://github.com/GeWu-Lab/OGM-GE_CVPR2022.

Balanced Multimodal Learning via On-the-fly Gradient Modulation notes that earlier work has observed that different modalities have different convergence rates, which leads to the problem that the modality with better performance dominates the optimization progress (i.e., once the modality that contributes more has converged, the other modalities are held back and cannot converge, producing under-optimized representations).

It also points out that the mini-batching in SGD effectively adds stochastic gradient noise to gradient descent, which may improve the model's generalization and performance.

Their method fuses two modalities, vision and audio, on a simple discriminative task and modifies the gradient back-propagation process.

So multimodal fusion can also be done at the level of gradients!

They compute a discrepancy ratio for each modality (see the paper for the exact definition). The motivation behind the construction is a bit puzzling… A higher discrepancy ratio means that modality dominates. When updating parameters, the modality with the larger discrepancy ratio is penalized, while the one with the smaller ratio is left untouched. This penalty, however, also weakens the stochastic gradient noise (see the derivation in the paper), so a dynamically changing Gaussian noise is added back later to compensate (Gradient Estimation).
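
For intuition, here is a minimal PyTorch-style sketch of this kind of modulation. The variable names, the tanh-based coefficient, and the noise scale are illustrative stand-ins rather than the paper's exact recipe; see their repository for the real implementation.

```python
import math
import torch

def ogm_ge_modulate(params_a, params_v, score_a, score_v, alpha=0.1):
    """Call between loss.backward() and optimizer.step().

    score_a / score_v: per-batch contribution scores of each modality,
    e.g. the summed softmax probability of the correct class computed
    from each modality's logits.
    """
    ratio_a = score_a / (score_v + 1e-8)  # discrepancy ratio, audio vs. visual
    ratio_v = score_v / (score_a + 1e-8)

    # Only the dominant modality (ratio > 1) is slowed down (k < 1);
    # the weaker one keeps k = 1 and is unaffected.
    k_a = 1.0 - math.tanh(alpha * ratio_a) if ratio_a > 1 else 1.0
    k_v = 1.0 - math.tanh(alpha * ratio_v) if ratio_v > 1 else 1.0

    for params, k in ((params_a, k_a), (params_v, k_v)):
        for p in params:
            if p.grad is None:
                continue
            # GE step: re-inject zero-mean Gaussian noise scaled to the
            # gradient's spread, compensating for the stochastic gradient
            # noise that the modulation removed.
            noise = torch.randn_like(p.grad) * p.grad.std()
            p.grad.mul_(k).add_(noise)
```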

Looking at their experimental results, accuracy improves by roughly two points, with the largest gain around five points. The method helps under both Adam and SGD (though it works better with SGD than with Adam, presumably because it was designed around SGD in the first place; redesigning it for Adam's update rule might yield further gains).

Limitations

  • Even with OGM-GE, the multimodal model's best result is still below that of the uni-modal model
  • It is only tested on concatenation- and summation-based fusion; there is no corresponding recipe for more complex fusion schemes. In other words, the method only applies to the "concatenation/summation + discrimination task (classification)" setting. It seems extensible to other fusion methods, which could turn it into a general-purpose technique

The GitHub page lists some other papers and a dataset worth a look. The pipeline is already installed on the server and can be used directly when needed.

What Makes Training Multi-Modal Classification Networks Hard?

What Makes Training Multi-Modal Classification Networks Hard?, like the previous paper, tackles the problem of how to allocate optimization across modalities. It points out two reasons why multimodal models underperform:

  • multi-modal networks are often prone to overfitting due to their increased capacity.
  • different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal.

It also gives a way to characterize (and thus control) overfitting, the **overfitting-to-generalization-ratio (OGR)**:

(Figure: OGR illustration)

$$
OGR=\left\lvert \frac{\Delta O_{N,n}}{\Delta G_{N,n}} \right\rvert = \left\lvert \frac{O_{N+n}-O_{N}}{G_{N+n} - G_{N}} \right\rvert
$$
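
For intuition (my own numbers, reading $O$ as the train-validation gap and $G$ as the generalization error): if between two checkpoints the gap grows by $\Delta O = 0.08$ while the validation loss improves by $\Delta G = 0.04$, then $OGR = \lvert 0.08 / 0.04 \rvert = 2$, i.e., the model is overfitting twice as fast as it is genuinely learning.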

$OGR$ characterizes the quality of the information learned between different checkpoints (see the paper's appendix for the interpretation). It also has some problems:

  1. Optimizing the global $OGR$ directly is computationally too expensive
  2. Early in training, or for an underfit model, the $OGR$ score can also be very small, even better than that of a well-trained model (because the train loss and validation loss barely differ at that point)

To estimate $OGR$, they use a gradient blend trick for a better approximation.

For a multimodal model, the per-modality weights are computed from $OGR$ and used for Gradient Blend (GB) (realized as a weighted sum of the per-modality losses). The paper's algorithms are shown below, with a small sketch after the figures; see the paper for details:

(Figures: Algorithm 1, Algorithm 2, Algorithm 3)
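
The sketch below shows the shape of the blended objective under my reading of the paper: each head (audio, visual, fused) gets a weight roughly proportional to $\Delta G / (\Delta O)^2$ measured between two checkpoints, and the joint loss is the normalized weighted sum. The weight rule and names here are a sketch, not the paper's exact estimation procedure (that is what their algorithms above specify).

```python
def blend_weight(train_before, train_after, val_before, val_after, eps=1e-8):
    """Rough per-head weight: reward generalization gain, penalize the
    squared growth of the train-validation gap (the w ∝ ΔG/ΔO² form)."""
    delta_g = abs(val_before - val_after)  # drop in validation loss
    delta_o = abs((val_after - train_after) - (val_before - train_before))
    return delta_g / (delta_o ** 2 + eps)

def gradient_blend_loss(losses, weights):
    """Joint objective: normalized weighted sum of per-head losses."""
    z = sum(weights)
    return sum(w / z * l for w, l in zip(weights, losses))

# Hypothetical usage with three heads (audio, visual, audio-visual):
# loss = gradient_blend_loss([l_a, l_v, l_av], [w_a, w_v, w_av])
```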

The results:

(Figure: results)

The paper's appendix also mentions SE networks and NL networks, worth a later look.

This paper nevertheless differs from Balanced Multimodal Learning via On-the-fly Gradient Modulation:

  • This one attributes poor performance to some modality overfitting. Its fix is to derive the blend recipe for the modalities (essentially, how much information each acquires) directly from the training and validation losses, thereby controlling the update speed of every modality.
  • The latter argues that one or two dominant modalities converge faster, and once converged they suppress the training of the remaining modalities, hurting performance. Its fix is to slow the dominant modality's updates (dominance is judged by each modality's contribution to the output).

What they share: both dynamically allocate gradient magnitudes across modalities based on some per-modality measurement.

Characterizing and Overcoming the Greedy Nature of Learning in Multi-modal Deep Neural Networks

We hypothesize that due to the greedy nature of learning in multi-modal deep neural networks, these models tend to rely on just one modality while under-fitting the other modalities. Such behavior is counter-intuitive and hurts the models’ generalization, as we observe empirically. To estimate the model’s dependence on each modality, we compute the gain on the accuracy when the model has access to it in addition to another modality. We refer to this gain as the conditional utilization rate. In the experiments, we consistently observe an imbalance in conditional utilization rates between modalities, across multiple tasks and architectures. Since conditional utilization rate cannot be computed efficiently during training, we introduce a proxy for it based on the pace at which the model learns from each modality, which we refer to as the conditional learning speed. We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning. The proposed algorithm improves the model’s generalization on three datasets: Colored MNIST, ModelNet40, and NVIDIA Dynamic Hand Gesture.

The paper adopts intermediate fusion, inserting several multi-modal transfer modules (MMTM) between the two uni-modal branches to pass information across modalities, as shown below. See the paper for the implementation.

(Figure: MMTM architecture)

Accuracy is then used to define how much information one modality gains on top of the other. Intuitively: first train normally, obtaining the two branches with the other modality present, $f_{0}$ and $f_{1}$, with accuracies $A(f_{0})$ and $A(f_{1})$; then cut the cross-modal transfer and let each uni-modal model compute its accuracy alone, denoted $A(f_{0}')$ and $A(f_{1}')$. The conditional utilization rates are then $\mathbf{u}(m_{0}|m_{1}) = \frac{A(f_{1}) - A(f_{1}')}{A(f_{1})}$ and $\mathbf{u}(m_{1}|m_{0}) = \frac{A(f_{0})-A(f_{0}')}{A(f_{0})}$, and $d_{util}(f)=\mathbf{u}(m_{1}|m_{0}) - \mathbf{u}(m_{0}|m_{1}) \in [-1,1]$. When $d_{util}$ is close to $1$ or $-1$, the model $f$ relies on only one of the modalities.
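
In code these quantities are just a few lines; the accuracies below are hypothetical numbers for illustration.

```python
def cond_util(acc_full, acc_cut):
    """u(m_i | m_j): relative accuracy drop of branch j when the incoming
    cross-modal transfer is cut, per the definitions above."""
    return (acc_full - acc_cut) / acc_full

# Hypothetical accuracies: with transfer (f0, f1) and without (f0', f1').
acc_f0, acc_f0_p = 0.90, 0.60
acc_f1, acc_f1_p = 0.88, 0.86

# d_util in [-1, 1]; values near ±1 mean the model leans on one modality only.
d_util = cond_util(acc_f0, acc_f0_p) - cond_util(acc_f1, acc_f1_p)
print(round(d_util, 2))  # 0.31: the model relies noticeably more on m_1
```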

They also propose the greedy learner hypothesis:

A multi-modal learning process is greedy when it produces models that rely on only one of the available modalities. The modality that the multi-modal DNN primarily relies on is the modality that is the fastest to learn from. We hypothesize that a multimodal learning process, in which a multi-modal DNN is trained to minimize the sum of modality-specific losses, is greedy.

This matches the idea in Balanced Multimodal Learning via On-the-fly Gradient Modulation; only the way of characterizing the dependence differs: one calls it unbalanced, the other greedy.

Since the conditional utilization rate can only be computed after training four models, it is unsuitable for adjusting training in real time. They therefore propose a new metric, conditional learning speed (see the paper for the formula). The algorithm is below, followed by a rough sketch:

(Figure: algorithm)
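
Because the conditional-learning-speed formula is deferred to the paper, the sketch below only shows the shape of such a balancing loop, not the paper's actual algorithm: some per-branch speed proxy is monitored, and the gradients of branches that lag behind are scaled up.

```python
def balance_step(branches, speeds, gamma=0.5):
    """Schematic re-balancing: boost the gradients of branches whose
    learning-speed proxy is below the mean. `branches` is a list of
    per-modality parameter lists (torch tensors); `speeds` is a
    stand-in for the paper's conditional learning speed."""
    mean_speed = sum(speeds) / len(speeds)
    for params, s in zip(branches, speeds):
        if s < mean_speed:  # lagging (non-dominant) branch
            boost = 1.0 + gamma * (mean_speed - s) / mean_speed
            for p in params:
                if p.grad is not None:
                    p.grad.mul_(boost)  # accelerate its update
```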

The paper closes with one more observation: strong regularization encourages greediness, i.e., overly strong regularization pushes the model to rely on a single modality and use the others less (an imbalance between modalities). The appendix also walks through the data preparation pipeline, worth referencing.

Comparing the Three Papers

To summarize the similarities and differences of the three papers above:

  • They analyze the problem from two angles, overfitting and imbalance (greediness), which may well be two views of the same thing: from the perspective of an underfit model, the better-performing modality naturally looks overfit, and vice versa.
  • Balanced Multimodal Learning via On-the-fly Gradient Modulation takes each modality's contribution to the probability of the correct prediction in a classification task, divides them to get a relative contribution ratio, and slows the updates of the higher-contributing modality.
  • What Makes Training Multi-Modal Classification Networks Hard? uses the per-modality losses during training to characterize how much information is learned and whether a modality is overfitting, and obtains the per-modality gradient weights by minimizing the gap between the trained loss and the loss under the true data distribution.
  • The last paper computes a ratio/learning speed from the information a modality $m_{1}$ learns on its own versus what it learns about itself from the other modality $m_{2}$ (realized as intermediate fusion), does the same in reverse, and takes the difference of the two values to decide which modality dominates, then speeds up the updates of the non-dominant one (though I don't understand how the algorithm actually achieves this).

Reunion

A get-together with my middle-school classmates! So happy; getting to meet up again with classmates from three or four years ago is really rare… I hope there will be more chances like this in the future 🥰🥰🥰

