The 27th IPP Symposium

Speaker Detection Using Boosted Dynamic Bayesian Networks

Jim Rehg, Compaq

Advanced user interfaces based on speech and vision pose a challenging inference problem: the actions and intentions of multiple people must be estimated from sequences of noisy and ambiguous sensor data. In this talk I describe some recent applications of dynamic Bayesian network (DBN) models to the user interface of a Smart Kiosk. The Smart Kiosk provides information and entertainment to multiple people in public spaces and is based on vision, speech and touch sensing. I present a DBN architecture for speaker detection, that is, inferring when a user is speaking to the kiosk. This architecture fuses off-the-shelf visual and audio sensors (face, skin, texture, mouth motion and silence detectors) with contextual cues from the interface itself. Experimental results confirm the importance of temporal duration and context in accurate classification. A novel application of boosting is shown to improve classifier performance. This is joint work with Vladimir Pavlovic and Ashutosh Garg.