Bubble cooperative networks for identifying important speech cues

Predicting the intelligibility of noisy recordings is difficult and most current algorithms treat all speech energy as equally important to intelligibility. Our previous work on human perception used a listening test paradigm and correlational analysis to show that some energy is more important to intelligibility than other energy and that the audibility of certain regions containing no speech energy at all can be important for correctly identifying certain words. In this paper, we propose a system called the “Bubble Cooperative Network” (BCN), which aims to predict important areas directly from clean speech. Given such a prediction, noise is added to the utterance in unimportant regions and then presented to a recognizer. The goal of the BCN is to add as much noise as possible without affecting recognition performance, encouraging it to identify important regions precisely. Our results show that this is possible for a simple speech recognizer that compares a noisy test utterance with a clean reference utterance. The predicted masks show patterns that are similar to analyses derived from human listening tests, but with better generalization and less context-dependence than previous approaches. Figure