Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement
Junwen Xiong1*, Yu Zhou1*, Peng Zhang1*†, Lei Xie1, Wei Huang2, Yufei Zha1
School of Computer Science, ASGO, Northwestern Polytechnical University, Xi’an, China1
School of Mathematics and Computer Sciences, Nanchang University, China2
IEEE Transactions on Multimedia (TMM), 2022
Overview
We propose a unified learning framework to jointly achieve active speaker detection and audio-visual speech enhancement. First, we introduce a cross-modal conformer to model audio-visual relationships across spatial-temporal space. Then, a plug-and-play multi-modal layer normalization is designed to alleviate the distribution misalignment of multi-modal features. Finally, a cross-modal circulant fusion scheme is proposed to let the two tasks assist each other intrinsically through audio-visual feature interaction. Compared with other state-of-the-art works, the proposed method shows superior performance for active speaker detection and audio-visual speech enhancement on three benchmark datasets.
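To make the cross-modal conformer concrete, below is a minimal PyTorch sketch of one plausible block in which one modality's queries attend to the other modality's features, followed by a conformer-style convolution module; the module names, dimensions, and layer choices are illustrative assumptions rather than the released implementation.

# Hypothetical sketch of a cross-modal conformer block (not the official ADENet code).
import torch
import torch.nn as nn

class CrossModalConformerBlock(nn.Module):
    """One direction of cross-modal attention, e.g. visual queries attending to audio."""
    def __init__(self, dim=256, heads=4, kernel_size=15):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.dw_conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, query, context):
        # query:   (B, Tq, dim), e.g. visual features
        # context: (B, Tk, dim), e.g. audio features
        q, kv = self.norm_q(query), self.norm_kv(context)
        attn_out, _ = self.cross_attn(q, kv, kv)
        x = query + attn_out                                        # cross-modal attention + residual
        c = self.dw_conv(self.conv_norm(x).transpose(1, 2)).transpose(1, 2)
        x = x + c                                                   # conformer-style conv module + residual
        return x + self.ffn(x)                                      # feed-forward + residual

# Toy usage: 25 visual frames attending to 100 audio frames.
vis, aud = torch.randn(2, 25, 256), torch.randn(2, 100, 256)
fused_vis = CrossModalConformerBlock()(vis, aud)                    # -> (2, 25, 256)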
Method
We propose a unified learning framework to jointly learn active speaker detection and audio-visual speech enhancement. To further achieve mutual learning between audio enhancement and visual detection, a cross-modal circulant fusion scheme is proposed to leverage the complementary cues between the two branches and establish their associations. Active speakers are detected with the help of enhanced audio: the more accurate the detection result, the more reliable the visual information that guides speech enhancement; the cleaner the enhanced sound, the more distinctive the audio embedding, which in turn helps the detection. Through this cyclic mutual learning, the overall performance of both tasks is improved.
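The sketch below illustrates how such a cyclic exchange could be wired between the detection and enhancement branches; the class name, linear projections, and number of iterations are hypothetical, and the paper's actual fusion operator may differ.

# Illustrative sketch of the cyclic feature exchange in cross-modal circulant fusion.
import torch
import torch.nn as nn

class CirculantFusion(nn.Module):
    def __init__(self, dim=256, steps=2):
        super().__init__()
        self.steps = steps
        self.audio_to_visual = nn.Linear(2 * dim, dim)  # enhanced-audio cue -> detection branch
        self.visual_to_audio = nn.Linear(2 * dim, dim)  # detection cue -> enhancement branch

    def forward(self, visual_feat, audio_feat):
        # visual_feat, audio_feat: (B, T, dim), assumed to be temporally aligned
        for _ in range(self.steps):
            # Detection branch: refine visual features with the current audio estimate.
            visual_feat = self.audio_to_visual(torch.cat([visual_feat, audio_feat], dim=-1))
            # Enhancement branch: refine audio features with the refined visual cue.
            audio_feat = self.visual_to_audio(torch.cat([visual_feat, audio_feat], dim=-1))
        return visual_feat, audio_feat  # fed to the detection and enhancement heads respectively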


The overall pipeline of our proposed ADENet model. Its framework is divided into three stages: audio-visual correlation learning, audio contextual learning, and cross-modal circulant fusion. The audio-visual correlation learning and the audio contextual learning aim at modeling the associations between multi-modal data and extracting contextual embeddings in the audio domain, respectively. Then, cross-modal circulant fusion is proposed to integrate correlation features and contextual features for active speaker detection and speech enhancement.
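As a rough picture of how the three stages could fit together, the sketch below reuses the CrossModalConformerBlock and CirculantFusion sketches above; the GRU for audio contextual learning and the head dimensions are assumptions for illustration only, not the released architecture.

# Hypothetical end-to-end wiring of the three stages (assumes the two sketches above are in scope).
import torch
import torch.nn as nn

class ADENetSketch(nn.Module):
    def __init__(self, dim=256, freq_bins=257):
        super().__init__()
        self.correlation = CrossModalConformerBlock(dim)      # audio-visual correlation learning
        self.context = nn.GRU(dim, dim, batch_first=True)     # audio contextual learning (assumed recurrent)
        self.fusion = CirculantFusion(dim)                    # cross-modal circulant fusion
        self.asd_head = nn.Linear(dim, 1)                     # per-frame active speaker score
        self.se_head = nn.Linear(dim, freq_bins)              # e.g. a spectrogram mask for enhancement

    def forward(self, visual_feat, audio_feat):
        # visual_feat, audio_feat: (B, T, dim), assumed resampled to a common frame rate
        corr_v = self.correlation(visual_feat, audio_feat)    # visual features correlated with audio
        ctx_a, _ = self.context(audio_feat)                   # contextual audio embedding
        fused_v, fused_a = self.fusion(corr_v, ctx_a)
        return self.asd_head(fused_v), self.se_head(fused_a)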


Visualization of the feature distribution alignment. (a) shows the distribution and topology of audio and visual features. It is obvious that the visual features are not aligned with the audio ones. Thanks to the multi-modal layer normalization, visual and auditory features are brought into similar distributions, as shown in (b).
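A minimal sketch of how such a normalization could be realized is given below, assuming each modality is standardized per frame and rescaled with its own learnable affine parameters; the exact formulation in the paper may differ.

# Minimal sketch of multi-modal layer normalization (an assumption, not the paper's exact formulation).
import torch
import torch.nn as nn

class MultiModalLayerNorm(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.audio_norm = nn.LayerNorm(dim)    # modality-specific affine parameters
        self.visual_norm = nn.LayerNorm(dim)

    def forward(self, audio_feat, visual_feat):
        # audio_feat, visual_feat: (B, T, dim)
        # Each modality is normalized to zero mean / unit variance per frame and then rescaled,
        # pulling the two feature distributions toward a comparable range, as in panel (b).
        return self.audio_norm(audio_feat), self.visual_norm(visual_feat)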
Results
Qualitative results of audio-visual sound enhancement for both audible and silent objects:
Demo video link: [Google Drive]
Demo video of active speaker detection and speech enhancement results by ADENet.

The demo video can also be found at: [Baidu Netdisk]
BibTeX
  @article{xiong2022look,
    title={Look\&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement},
    author={Xiong, Junwen and Zhou, Yu and Zhang, Peng and Xie, Lei and Huang, Wei and Zha, Yufei},
    journal={arXiv preprint arXiv:2203.02216},
    year={2022}
  }