Zongwei Wu's thesis defense will take place on November 21 at 2 p.m. in room 111 of the I3M building.
Thesis title:
Depth Attention is (not) All You Need but Helpful for Scenes Understanding
Teams link:
Jury members:
- Nicolas Thome – Sorbonne Université – Reviewer
- Christian Wolf – Naver Labs Europe – Reviewer
- Liming Chen – Ecole Centrale Lyon – Examiner
- David Picard – Ecole des Ponts ParisTech – Examiner
- Guillaume Allibert – Université Côte d’Azur – Supervisor
- Christophe Stolz – Université Bourgogne Franche-Comté – Co-director
- Cédric Demonceaux – Université Bourgogne Franche-Comté – Co-director
Abstract:
Deep learning models can now teach machines to perform a number of tasks, even with better precision than humans. Among the modules of an intelligent machine, perception is the most essential: without it, the other action modules struggle to carry out the target task safely and precisely in complex scenes. Conventional perception systems are based on RGB images, which provide rich texture information about the 3D scene. However, the quality of RGB images depends heavily on environmental factors, which in turn affects the performance of deep learning models. In this thesis, we therefore aim to improve the performance and robustness of RGB models with complementary depth cues by proposing novel RGB-D fusion designs.
Traditionally, RGB-D fusion designs rely on pixel-wise concatenation with addition and convolution. Inspired by the success of attention modules in deep networks, in this thesis we analyze and propose different depth-aware attention modules and demonstrate their effectiveness on basic segmentation tasks such as saliency detection and semantic segmentation. First, we leverage geometric cues and propose a novel depth-wise channel attention. We merge fine-grained details and semantic cues to constrain the channel attention to various local regions, improving model discriminability during feature extraction. Second, we investigate a depth-adapted offset, which serves as a local but deformable spatial attention for convolution. With the help of the depth prior, this approach forces the network to take more relevant pixels into account. Third, we improve contextual awareness within RGB-D fusion by leveraging transformer attention, and show that it can improve model robustness against feature misalignment. Last but not least, we focus on the fusion architecture by proposing an adaptive fusion design. We learn the trade-off between early and late fusion with respect to depth quality, yielding a more robust way to merge RGB-D cues in deep networks. Extensive comparisons on the reference benchmarks validate the effectiveness of the proposed methods compared to other fusion alternatives.
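
To make the contrast between additive fusion and attention-based fusion in the abstract more concrete, here is a minimal sketch in plain Python. It is not the method proposed in the thesis: the function names `additive_fusion` and `depth_gated_fusion`, the toy feature maps, and the gating rule (global average pooling of the depth channel followed by a sigmoid) are all assumptions chosen for illustration, in the spirit of generic channel attention.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def additive_fusion(rgb, depth):
    """Baseline: element-wise addition of RGB and depth features.

    Both inputs are lists of channels; each channel is a flat list of floats.
    """
    return [[r + d for r, d in zip(rc, dc)] for rc, dc in zip(rgb, depth)]


def depth_gated_fusion(rgb, depth):
    """Hypothetical depth-aware channel gating (illustrative assumption).

    For each channel, a scalar gate is computed from the depth features
    (global average pooling, then sigmoid). The gate rescales the RGB
    channel before the depth features are added.
    """
    fused = []
    for rc, dc in zip(rgb, depth):
        gate = sigmoid(sum(dc) / len(dc))
        fused.append([gate * r + d for r, d in zip(rc, dc)])
    return fused


# Toy features: 2 channels, 2 spatial positions each.
rgb = [[1.0, 2.0], [3.0, 4.0]]
depth = [[0.0, 0.0], [1.0, 1.0]]

baseline = additive_fusion(rgb, depth)
gated = depth_gated_fusion(rgb, depth)
```

In this toy setup, a depth channel with a zero mean response halves the corresponding RGB channel (the gate is `sigmoid(0) = 0.5`), whereas a stronger depth response lets more of the RGB signal through. A real model would learn the gate with trainable layers rather than use a fixed rule.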