About Me


I am currently a final-year PhD student in the Department of Computer Science at the CUNY Graduate Center, supervised by Professor Michael Mandel. My research focuses on machine learning and speech processing.

Research summary:

An automatic speech recognition (ASR) system can make predictions based on irrelevant signals when it learns spurious associations between the labels and noise in the training data. This leads to poor performance when the model is deployed in the real world. My research therefore aims to find the time-frequency regions of the spectrogram that the ASR system pays attention to when producing a transcription. These regions are called audible importance maps or attention maps. Attention maps can enhance a model’s interpretability and performance [1][2], or boost accuracy when only a small training set is available [3].

I have published three conference papers (Interspeech, ICASSP) and one journal paper (TASLP) on this topic. In the first paper, I proposed an adversarial approach to predicting 2-D importance maps (IMs) without ground-truth IMs. The importance maps are 2-D because they indicate important time-frequency points. In contrast, the 1-D attention maps commonly used in ASR show only which time steps are important. The importance maps show patterns that are similar to analyses derived from human listening tests while exhibiting better generalization.
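
To make the difference between a 2-D importance map and a 1-D attention map concrete, here is a minimal Python sketch. The toy values, the names `importance_map` and `collapse_to_time`, and the choice to collapse over frequency by taking the maximum are illustrative assumptions, not the method used in the papers.

```python
# Toy 2-D importance map: rows are frequency bins, columns are time frames.
# Values are made up for illustration (higher = more important).
importance_map = [
    [0.0, 0.9, 0.0, 0.0],  # low-frequency bin
    [0.0, 0.0, 0.0, 0.8],  # mid-frequency bin
    [0.0, 0.0, 0.0, 0.0],  # high-frequency bin
]


def collapse_to_time(im):
    """Reduce a 2-D map to one score per time frame (max over frequency)."""
    n_frames = len(im[0])
    return [max(row[t] for row in im) for t in range(n_frames)]


# A 1-D view in the spirit of time-step attention.
time_attention = collapse_to_time(importance_map)
print(time_attention)
```

The 1-D view still says that the second and fourth frames matter, but it no longer says which frequency bin carried the cue. The 2-D map keeps that information.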

In the second paper, I developed a new evaluation metric, the first to evaluate importance maps in a structured prediction task. Unlike existing metrics, it takes into account both the accuracy of the other words in a sentence and the predicted important speech energy.

In the third paper, I developed a data augmentation method for speech based on importance maps.

In the journal article, I compared the important time-frequency regions that humans, non-neural-network ASR, and neural-network ASR focus on. My analysis concluded that the importance maps of neural-network ASR are much more similar to the human ones than those of non-neural-network ASR. However, neural-network ASR does not capture all the cues that human listeners use. On this basis, the analysis suggests that ASR performance in noisy conditions can be improved by adapting the model to pay closer attention to the cues used by human listeners.

[1] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015.
[2] J. Schlemper, O. Oktay, M. Schaap, M. P. Heinrich, B. Kainz, B. Glocker, and D. Rueckert, “Attention gated networks: Learning to leverage salient regions in medical images,” Medical Image Anal., vol. 53, pp. 197–207, 2019.
[3] C. F. Flores, A. Gonzalez-Garcia, J. van de Weijer, and B. Raducanu, “Saliency for fine-grained object recognition in domains with scarce training data,” Pattern Recognition, vol. 94, pp. 62–73, 2019.

Recent Posts

Introduction to Tensorflow
Restore model


Email: vtrinh@gradcenter.cuny.edu and anhtv1@gmail.com
Skype: tvanh512