Integrating audio and video information is an effective and promising approach to speaker tracking. The main open challenge is constructing a stable observation model. To this end, we propose a 3D audio-visual speaker tracker that uses deep metric learning within a two-layer particle filter framework. First, an audio-guided motion model generates candidate samples in a hierarchical structure consisting of an audio layer and a visual layer. Then, a stable observation model built on a purpose-designed Siamese network provides a similarity-based likelihood for calculating particle weights. The speaker position is estimated using an optimal particle set, which integrates the deci...
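The loop summarised in the abstract (audio-guided sampling, similarity-based weighting, estimation from a best-particle subset) can be illustrated with a minimal sketch. This is not the authors' implementation. It collapses the audio and visual layers into a single layer, and every name and parameter in it (`audio_guided_motion`, `similarity_weights`, `estimate_from_top_k`, `pull`, `noise_std`, `k`) is hypothetical. The `similarity` callable is a stand-in for the Siamese network's output, and the Gaussian used in the demo is a toy substitute, assumed only for illustration.

```python
import math
import random


def audio_guided_motion(particles, audio_pos, pull=0.5, noise_std=0.05, rng=random):
    """Move each 3D particle part of the way toward an audio position
    estimate, then add Gaussian noise (assumed form of the motion model)."""
    return [
        tuple(x + pull * (a - x) + rng.gauss(0.0, noise_std)
              for x, a in zip(p, audio_pos))
        for p in particles
    ]


def similarity_weights(particles, similarity):
    """Turn per-particle similarity scores into normalised weights.
    `similarity` stands in for a learned similarity-based likelihood."""
    scores = [max(similarity(p), 0.0) for p in particles]
    total = sum(scores)
    if total == 0.0:
        # No particle matches at all: fall back to uniform weights.
        return [1.0 / len(particles)] * len(particles)
    return [s / total for s in scores]


def estimate_from_top_k(particles, weights, k):
    """Weighted mean of the k highest-weight particles
    (an assumed reading of 'optimal particle set')."""
    ranked = sorted(zip(weights, particles), key=lambda wp: wp[0], reverse=True)[:k]
    norm = sum(w for w, _ in ranked)
    return tuple(sum(w * p[d] for w, p in ranked) / norm for d in range(3))


def track_step(particles, audio_pos, similarity, k=20, rng=random):
    """One filter iteration: propagate, weight, estimate, resample."""
    particles = audio_guided_motion(particles, audio_pos, rng=rng)
    weights = similarity_weights(particles, similarity)
    estimate = estimate_from_top_k(particles, weights, k)
    resampled = rng.choices(particles, weights=weights, k=len(particles))
    return resampled, estimate


def run_demo(seed=0, steps=10, n_particles=200):
    """Track a static toy speaker; returns (estimate, true_position)."""
    rng = random.Random(seed)
    true_pos = (1.0, 2.0, 1.5)      # hypothetical speaker position in metres
    audio_pos = (1.05, 1.95, 1.5)   # hypothetical, slightly biased audio estimate

    def toy_similarity(p):
        # Toy substitute for a network score: Gaussian in distance to the speaker.
        return math.exp(-math.dist(p, true_pos) ** 2 / (2 * 0.3 ** 2))

    particles = [tuple(rng.uniform(0.0, 3.0) for _ in range(3))
                 for _ in range(n_particles)]
    estimate = None
    for _ in range(steps):
        particles, estimate = track_step(particles, audio_pos, toy_similarity, rng=rng)
    return estimate, true_pos


if __name__ == "__main__":
    est, truth = run_demo()
    print("estimate:", est, "error:", math.dist(est, truth))
```

In a real tracker the similarity score would presumably come from comparing an image patch at each particle's projected location against a reference template, which is the role the abstract assigns to the Siamese network.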
PhD Thesis. This thesis concerns the problem of target localization and tracking in an indoor environm...
Audio-visual tracking of multiple speakers requires estimating the state (e.g. velocity and locatio...
Tracking an unknown and time-varying number of targets (e.g., speakers) in indoor environments using...
We propose an audio-visual fusion algorithm for 3D speaker tracking from a localised multi-modal sen...
Audio-visual tracking of an unknown number of concurrent speakers in 3D is a challenging task, espec...
Abstract: The problem of tracking multiple moving speakers in indoor environments has received much a...
Abstract: The problem of tracking multiple moving speakers in indoor environments has recently receiv...
Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance ...
The problem of tracking multiple moving speakers in indoor environments has received much attention....
In this thesis, a novel approach is proposed for multi-speaker tracking by integrating audio and vis...
We present a robust and efficient audio-visual (AV) approach to speaker tracking in a room environme...
It is often advantageous to track objects in a scene using multimodal information when such informat...
We propose a multi-modal object tracking algorithm that combines appearance, motion and audio inform...
It is often advantageous to track objects in a scene using multimodal information when such informat...
In this paper, we present a novel approach for tracking a lecturer during the course of his speech. ...