Accepted Papers

Title; Authors
Flow-ER: a Flow-based Embedding Regularization Strategy for Robust Speech Representation Learning; Kang, Woo Hyun*; Alam, Jahangir; Fathan, Abderrahim
Continual Self-supervised Domain Adaptation for End-to-end Speaker Diarization; Coria, Juan Manuel*; Bredin, Hervé; Ghannay, Sahar; Rosset, Sophie
Fine Grained Spoken Document Summarization Through Text Segmentation; Kotey, Samantha*; Dahyot, Rozenn; Harte, Naomi
Joint speaker diarisation and tracking in switching state-space model; Wong, Jeremy H. M.*; Gong, Yifan
Diarisation using location tracking with agglomerative clustering; Wong, Jeremy H. M.*; Abramovski, Igor; Xiao, Xiong; Gong, Yifan
Exploration of Language-Specific Self-Attention Parameters for Multilingual End-to-End Speech Recognition; Houston, Brady*; Kirchhoff, Katrin
End-to-End Multi-speaker ASR with Independent Vector Analysis; Scheibler, Robin*; Zhang, Wangyou; Chang, Xuankai; Watanabe, Shinji; Qian, Yanmin
An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition; Yang, Chao-Han Huck*; Chen, I-Fan; Stolcke, Andreas; Siniscalchi, Sabato M; Lee, Chin-hui
THE CLEVER HANS EFFECT IN VOICE SPOOFING DETECTION; Chettri, Bhusan*
Distribution-based Emotion Recognition in Conversation; Wu, Wen*; Zhang, Chao; Woodland, Phil
JOIST: A Joint Speech and Text Streaming Model For ASR; Sainath, Tara*; Prabhavalkar, Rohit; Bapna, Ankur; Zhang, Yu; Huo, Zhouyuan; Chen, Zhehuai; Li, Bo; Wang, Weiran; Strohman, Trevor
Mixture of Domain Experts for Language Understanding: An Analysis of Modularity, Task Performance, and Memory Tradeoffs; Kleiner, Benjamin*; FitzGerald, Jack; Khan, Haidar; Tur, Gokhan
DUAL LEARNING FOR LARGE VOCABULARY ON-DEVICE ASR; Peyser, Charles C*; Huang, Ronny; Sainath, Tara; Prabhavalkar, Rohit; Picheny, Michael; Cho, Kyunghyun
Untied Positional Encodings for Efficient Transformer-based Speech Recognition; Samarakoon, Lahiru T*; Fung, Ivan
PHONE-LEVEL PRONUNCIATION SCORING FOR L1 USING WEIGHTED-DYNAMIC TIME WARPING; SINI, Aghilas*; Perquin, Antoine; Lolive, Damien; Delhay, Arnaud
MASC: Massive Arabic Speech Corpus; Al-Fetyani, Mohammad*; AlBarham, Mohammad; Abandah, Gheith A.; Alsharkawi, Adham; Dawas, Maha
Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection; Chen, Xuanjun*; Wu, Haibin; Lee, Hung-yi; Meng, Helen; Jang, Roger
StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models; Li, Yinghao A*; Han, Cong; Mesgarani, Nima
Generative Models for Improved Naturalness, Intelligibility, and Voicing of Whispered Speech; Wagner, Dominik*; Bayerl, Sebastian P; Cordourier, Hector; Bocklet, Tobias
Improving generalizability of distilled self-supervised speech processing models under distorted settings; Huang, Kuan-Po*; FU, YU-KUAN; Hsu, Tsu-Yuan; Ritter Gutierrez, Fabian Alejandro; Wang, Fan-Lin; Tseng, Liang-Hsuan; Zhang, Yu; Lee, Hung-yi
AN ANALYSIS OF THE EFFECTS OF DECODING ALGORITHMS ON FAIRNESS IN OPEN-ENDED LANGUAGE GENERATION; Dhamala, Jwala*; Kumar, Varun; Gupta, Rahul; Chang, Kai-Wei; Galstyan, Aram
SIMD-SIZE AWARE WEIGHT REGULARIZATION FOR FAST NEURAL VOCODING ON CPU; Kanagawa, Hiroki*; Ijima, Yusuke
IMPROVED NOISY ITERATIVE PSEUDO-LABELING FOR SEMI-SUPERVISED SPEECH RECOGNITION; Li, Tian*; Meng, Qingliang; Sun, Yujian
Towards End-to-end Unsupervised Speech Recognition; Liu, Alexander H*; Hsu, Wei-Ning; Auli, Michael; Baevski, Alexei
Inter-KD: Intermediate Knowledge Distillation for CTC-Based Automatic Speech Recognition; Yoon, Ji Won; Woo, Beom Jun; Ahn, Sunghwan; Lee, Hyeonseung; Kim, Nam Soo*
MULTI-STAGE PROGRESSIVE AUDIO BANDWIDTH EXTENSION; Wen, Liang*; Wang, Lizhong; Zhang, Ying; Choi, Kwang Pyo
Spatial-DCCRN: DCCRN Equipped with Frame-level Angle Feature and Hybrid Filtering for Multi-channel Speech Enhancement; Lv, Shubo*; Fu, Yihui; Ju, Yukai; Xie, Lei; Zhu, Weixin; Rao, Wei; Wang, Yannan
ASBERT: ASR-SPECIFIC SELF-SUPERVISED LEARNING WITH SELF-TRAINING; Kim, Hyungyong; Kim, Byeong-Yeol*; Yu, Seung Woo; Lim, Youshin; Lim, Yunkyu; Lee, Hanbin
Code-switched language modelling using a code predictive LSTM in under-resourced South African languages; Jansen Van Vuren, Joshua M*; Niesler, Thomas
Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss; Georgiou, Efthymios*; Kritsis, Kosmas; Paraskevopoulos, Georgios; Katsamanis, Athanasios; Katsouros, Vassilis; Potamianos, Alexandros
HOW TO BOOST ANTI-SPOOFING WITH X-VECTORS; Ma, Xinyue*; Zhang, Shanshan; Huang, Shen; Gao, Ji; Hu, Ying; HE, Liang
Speed-Robust Keyword Spotting via Soft Self-Attention on Multi-Scale Features; Ding, Chaoyue*; Li, Jiakui; Zong, Martin; Li, Baoxiang
Can we use Common Voice to train a Multi-Speaker TTS system?; Ogun, Sewade O*; Colotte, Vincent; Vincent, Emmanuel
TRANSFORMER-BASED LIP-READING WITH REGULARIZED DROPOUT AND RELAXED ATTENTION; Li, Zhengyang*; Lohrenz, Timo; Dunkelberg, Matthias; Fingscheidt, Tim
A DATA-DRIVEN INVESTIGATION OF NOISE-ADAPTIVE UTTERANCE GENERATION WITH LINGUISTIC MODIFICATION; Chingacham, Anupama*; Demberg, Vera; Klakow, Dietrich
Flickering reduction with partial hypothesis reranking for streaming ASR; Bruguier, Antoine*; Qiu, David; Strohman, Trevor; He, Yanzhang
Match to Win: Analysing Sequences Lengths for Efficient Self-supervised Learning in Speech and Audio; Gao, Yan*; Fernandez-Marques, Javier; Parcollet, Titouan; Gusmao, Pedro; Lane, Nicholas
Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization; Horiguchi, Shota*; Takashima, Yuki; Watanabe, Shinji; Garcia, Paola
GUIDED CONTRASTIVE SELF-SUPERVISED PRE-TRAINING FOR AUTOMATIC SPEECH RECOGNITION; Khare, Aparna*; Wu, Minhua; Bhati, Saurabhchand; Droppo, Jasha; Maas, Roland
Exploring a unified ASR for multiple south Indian languages leveraging multilingual acoustic and language models; C. S., ANOOP*; A G, Ramakrishnan
HMM vs. CTC for Automatic Speech Recognition: Comparison Based on Full-Sum Training from Scratch; Raissi, Tina*; Zhou, Wei; Berger, Simon; Schlüter, Ralf; Ney, Hermann
Exploring Efficient-tuning Methods in Self-supervised Speech Models; Chen, Zih-Ching; Fu, Chin-Lun; Liu, Chih Ying; Li, Shang-Wen; Lee, Hung-yi*
A MULTI-MODAL ARRAY OF INTERPRETABLE FEATURES TO EVALUATE LANGUAGE AND SPEECH PATTERNS IN DIFFERENT NEUROLOGICAL DISORDERS; Favaro, Anna*; Motley, Chelsie; Cao, Tianyu; Iglesias, Miguel; Butala, Ankur; Oh, Esther S.; Stevens, Robert; Villalba, Jesús; Dehak, Najim; Moro-Velazquez, Laureano
A Truly Multilingual First Pass and Monolingual Second Pass Streaming On-Device ASR System; Mavandadi, Sepand*; Li, Bo; Zhang, Chao; Farris, Brian; Sainath, Tara; Strohman, Trevor
Scaling Up Deliberation for Multilingual ASR; Hu, Ke*; Sainath, Tara; Li, Bo
On the Use of Semantically-Aligned Speech Representation for Spoken Language Understanding; Laperrière, Gaëlle; Pelloin, Valentin; Rouvier, Mickael; Stafylakis, Themos; Estève, Yannick*
MULTILINGUAL SPEECH EMOTION RECOGNITION WITH MULTI-GATING MECHANISM AND NEURAL ARCHITECTURE SEARCH; Wang, Zihan*; Meng, Qi; Lan, Haifeng; Zhang, Xinrui; Guo, Kehao; Gupta, Akshat
Improving Semi-supervised E2E ASR using CycleGAN and Inter-domain Losses; Li, Chia-Yu*; Thang, Vu
STOP: A DATASET FOR SPOKEN TASK ORIENTED SEMANTIC PARSING; Tomasello, Paden*; Shrivastava, Akshat; Lazar, Daniel A; Hsu, Po-chun; Le, Duc; Sagar, Adithya; Elkahky, Ali; Copet, Jade; Hsu, Wei-Ning; Adi, Yossi; Algayres, Robin; Nguyen, Tu Anh; Dupoux, Emmanuel; Zettlemoyer, Luke; Mohamed, Abdel-rahman
Alternate Intermediate Conditioning with Syllable-level and Character-level Targets for Japanese ASR; Fujita, Yusuke*; Komatsu, Tatsuya; Kida, Yusuke
Exploiting information from native data for non-native automatic pronunciation assessment; Lin, Binghuai; Wang, Liyuan*
Fully Unsupervised Training of Few-Shot Keyword Spotting; Kim, Minchan*; Lee, Dongjune; Mun, Sung Hwan; Han, Min Hyun; Kim, Nam Soo
FLEURS: FEW-SHOT LEARNING EVALUATION OF UNIVERSAL REPRESENTATIONS OF SPEECH; Conneau, Alexis; Ma, Min*; Khanuja, Simran; Zhang, Yu; Axelrod, Vera; Dalmia, Siddharth; Riesa, Jason; Rivera, Clara; Bapna, Ankur
SUB-8-BIT QUANTIZATION FOR ON-DEVICE SPEECH RECOGNITION: A REGULARIZATION-FREE APPROACH; Zhen, Kai*; Radfar, Martin; Nguyen, Hieu D; Strimel, Grant; Mouchtaris, Athanasios; Susanj, Nathan
FREQUENCY AND MULTI-SCALE SELECTIVE KERNEL ATTENTION FOR SPEAKER VERIFICATION; Mun, Sung Hwan*; Jung, Jee-weon; Han, Min Hyun; Kim, Nam Soo
MFCCA: Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario; Yu, Fan*; Zhang, Shiliang; Guo, Pengcheng; Liang, Yuhao; Du, Zhihao; Lin, Yuxiao; Xie, Lei
WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration; Koizumi, Yuma*; Yatabe, Kohei; Zen, Heiga; Bacchiani, Michiel
Modular Hybrid Autoregressive Transducer; Meng, Zhong*; Chen, Tongzhou; Prabhavalkar, Rohit; Zhang, Yu; Wang, Yuan; Audhkhasi, Kartik; Emond, Jesse; Strohman, Trevor; Ramabhadran, Bhuvana; Huang, Ronny; Variani, Ehsan; Huang, Yinghui; Moreno, Pedro
SpeechCLIP: Integrating Speech with Pre-trained Vision and Language Model; Shih, Yi-Jen*; Wang, Hsuan-Fu; Chang, Heng-Jui; Berry, Layne; Lee, Hung-yi; Harwath, David
A CONTEXT-AWARE KNOWLEDGE TRANSFERRING STRATEGY FOR CTC-BASED ASR; Lu, Ke-Han*; CHEN, Kuan-Yu
Efficient dynamic filter for robust and low computational feature extraction; Kim, Donghyeon*; Kwak, Jeong-gi; Ko, Hanseok
Exploring WavLM on Speech Enhancement; Song, Hyungchan*; Chen, Sanyuan; Chen, Zhuo; Wu, Yu; Yoshioka, Takuya; Tang, Min; Shin, Jong Won; Liu, Shujie
Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition; Shen, Peng*; Lu, Xugang; Kawai, Hisashi
YFACC: A Yoruba Speech-Image Dataset for Cross-lingual Keyword Localisation through Visual Grounding; Olaleye, Kayode K*; Oneață, Dan; Kamper, Herman
ON THE USE OF MODALITY-SPECIFIC LARGE-SCALE PRE-TRAINED ENCODERS FOR MULTIMODAL SENTIMENT ANALYSIS; Ando, Atsushi*; Masumura, Ryo; Takashima, Akihiko; Suzuki, Satoshi; Makishima, Naoki; Suzuki, Keita; Moriya, Takafumi; Ashihara, Takanori; Sato, Hiroshi
Fast Entropy-Based Methods of Word-Level Confidence Estimation for End-To-End Automatic Speech Recognition; Laptev, Aleksandr*; Ginsburg, Boris
BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications; Zuluaga Gomez, Juan Pablo*; Sarfjoo, Seyyed Saeed; Prasad, Amrutha; Nigmatulina, Iuliia; Motlicek, Petr; Ondrej, Karel; Ohneiser, Oliver; Helmke, Hartmut
An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition; Moritz, Niko*; Seide, Frank; Le, Duc; Mahadeokar, Jay; Fuegen, Christian
How Does Pre-trained Wav2Vec2.0 Perform on Domain-Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications; Zuluaga Gomez, Juan Pablo*; Prasad, Amrutha; Nigmatulina, Iuliia; Sarfjoo, Seyyed Saeed; Motlicek, Petr; Kleinert, Matthias; Helmke, Hartmut; Ohneiser, Oliver; Zhan, Qingran
Learning to Jointly Transcribe and Subtitle for End-to-End Spontaneous Speech Recognition; Poncelet, Jakob*; Van hamme, Hugo
GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models; Baas, Matthew*; Kamper, Herman
CONFORMER-BASED ON-DEVICE STREAMING SPEECH RECOGNITION WITH KD COMPRESSION AND TWO-PASS ARCHITECTURE; Park, Jinhwan*; Jin, Sichen; Park, Junmo; Kim, Sungsoo; Sandhyana, Dhairya; Lee, Changheon; Han, Myoungji; Lee, Jungin; Jung, Seokyeong; Han, Chang Woo; Kim, Chanwoo
Improving Noise Robustness for Spoken Content Retrieval using semi-supervised ASR and N-best transcripts for BERT-based ranking models; Moriya, Yasufumi*; Jones, Gareth
TEA-PSE 2.0: SUB-BAND NETWORK FOR REAL-TIME PERSONALIZED SPEECH ENHANCEMENT; Ju, Yukai*; Zhang, Shimin; Rao, Wei; Wang, Yannan; Yu, Tao; Xie, Lei; Shang, Shi-dong
An Analysis of Semantically-Aligned Speech-Text Embeddings; Huzaifah, Muhammad*; Kukanov, Ivan
Learning Invariant Representation and Risk Minimized for Unsupervised Accent Domain Adaptation; Zhao, Chendong*; Wang, Jianzong; Qu, Xiaoyang; Wang, Haoqian; Xiao, Jing
LIMUSE: LIGHTWEIGHT MULTI-MODAL SPEAKER EXTRACTION; Liu, Qinghua; Huang, Yating; Hao, Yunzhe; Xu, Jiaming*; Xu, Bo
Towards visually prompted keyword localisation for zero-resource spoken languages; Nortje, Leanne*; Kamper, Herman
AN ATTENTION-BASED BACKEND ALLOWING EFFICIENT FINE-TUNING OF TRANSFORMER MODELS FOR SPEAKER VERIFICATION; Peng, Junyi*; Plchot, Oldrich; Stafylakis, Themos; Mosner, Ladislav; Burget, Lukas; Cernocky, Jan
Distilling Sequence-to-Sequence Voice Conversion Models For Streaming Conversion Applications; Tanaka, Kou*; Kameoka, Hirokazu; Kaneko, Takuhiro; Seki, Shogo
A Hybrid Acoustic Echo Reduction Approach Using Kalman Filtering and Informed Source Extraction With Improved Training; Mack, Wolfgang*; Habets, Emanuel
Learning accent representation with multi-level VAE towards controllable speech synthesis; Melechovsky, Jan*; Mehrish, Ambuj; Herremans, Dorien; Sisman, Berrak
INTER-DECODER: USING ATTENTION-DECODER LOSSES AS INTERMEDIATE REGULARIZATION FOR CTC-BASED SPEECH RECOGNITION; Komatsu, Tatsuya*; Fujita, Yusuke
Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora; Li, Yuanchao*; Mohamied, Yumnah; Bell, Peter; Lai, Catherine
STREAMING BILINGUAL END TO END ASR MODEL USING ATTENTION OVER MULTIPLE SOFTMAX; Joshi, Vikas V*; Agrawal, Purvi; Mehta, Rupesh; Patil, Aditya
Weak-Supervised Dysarthria-invariant Features for Spoken Language Understanding using an FHVAE and Adversarial Training; Qi, Jinzi*; Van hamme, Hugo
Monotonic segmental attention for automatic speech recognition; Zeyer, Albert*; Schmitt, Robin; Zhou, Wei; Schlüter, Ralf; Ney, Hermann
Automatic Rating of Spontaneous Speech for Low-Resource Languages; Getman, Yaroslav*; Al-Ghezi, Ragheb; Voskoboinik, Ekaterina; Singh, Mittul; Kurimo, Mikko
SVLDL: Improved Speaker Age Estimation Using Selective Variance Label Distribution Learning; Kang, Zuheng*; Wang, Jianzong; Peng, Junqing; Xiao, Jing
On the Efficiency of Integrating Self-supervised Learning and Meta-learning for User-defined Few-shot Keyword Spotting; Wu, Yuan-Kuei*; Kao, Wei-Tsung; Lee, Hung-yi; Chen, Chia-Ping; Chen, Zhi-Sheng; Tsai, Yu-Pao
Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection; Cornell, Samuele*; Balestri, Thomas; Senechal, Thibaud
Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion; Ma, Ding*; Violeta, Lester Phillip G; Kobayashi, Kazuhiro; Toda, Tomoki
Combining Contrastive and Non-Contrastive Losses for Fine-Tuning Pretrained Models in Speech Analysis; Lux, Florian*; Chen, Ching-Yi; Thang, Vu
Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech; Lux, Florian*; Koch, Julia; Thang, Vu
Accelerator-Aware Training for Transducer-based Speech Recognition; Swaminathan, Rupak Vignesh*; Mumtaj Shakiah, Suhaila; Nguyen, Hieu D; Chinta, Raviteja; Afzal, Tariq; Susanj, Nathan; Mouchtaris, Athanasios; Strimel, Grant; Rastrow, Ariya
End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation; Masuyama, Yoshiki*; Chang, Xuankai; Cornell, Samuele; Watanabe, Shinji; Ono, Nobutaka
NON-AUTOREGRESSIVE END-TO-END APPROACHES FOR JOINT AUTOMATIC SPEECH RECOGNITION AND SPOKEN LANGUAGE UNDERSTANDING; LI, Mohan*; Doddipatla, Rama S
Residual Adapters for Targeted Updates in RNN-Transducer Based Speech Recognition System; Han, Sungjun; Baby, Deepak; Mendelev, Valentin*
Remap, warp and attend: Non-parallel many-to-many accent conversion with Normalizing Flows; Ezzerg, Abdelhamid*; Merritt, Thomas; Yanagisawa, Kayoko; Bilinski, Piotr; Proszewska, Magdalena; Pokora, Kamil; Korzeniowski, Renard; Barra-Chicote, Roberto; Korzekwa, Daniel
Domain Adaptation of low-resource Target-Domain models using well-trained ASR Conformer Models; Sukhadia, Vrunda N*; Umesh, S
N-BEST HYPOTHESES RERANKING FOR TEXT-TO-SQL SYSTEMS; Zeng, Lu*; Parthasarathi, Sree Hari Krishnan; Hakkani-Tur, Dilek Z
VSAMETER: EVALUATION OF A NEW OPEN-SOURCE TOOL TO MEASURE VOWEL SPACE AREA AND RELATED METRICS; Cao, Tianyu*; Moro-Velazquez, Laureano; Żelasko, Piotr; Villalba, Jesús; Dehak, Najim
On Compressing Sequences for Self-Supervised Speech Models; Meng, Yen*; Chen, Hsuan-Jui; Shi, Jiatong; Watanabe, Shinji; Garcia, Paola; Lee, Hung-yi; Tang, Hao
G-AUGMENT: SEARCHING FOR THE META-STRUCTURE OF DATA AUGMENTATION POLICIES FOR ASR; Wang, Yuan*; Cubuk, Ekin D; Rosenberg, Andrew; Cheng, Shuyang; Weiss, Ron J; Ramabhadran, Bhuvana; Moreno, Pedro; Le, Quoc; Park, Daniel S
Low-Latency Speech Separation Guided Diarization for Telephone Conversations; Morrone, Giovanni*; Cornell, Samuele; Raj, Desh; Serafini, Luca; Zovato, Enrico; Brutti, Alessio; Squartini, Stefano
JOINT OPTIMIZATION OF DIFFUSION PROBABILISTIC-BASED MULTICHANNEL SPEECH ENHANCEMENT WITH FAR-FIELD SPEAKER VERIFICATION; Dowerah, Sandipana*; Serizel, Romain; Jouvet, Denis; Mohammadamini, Mohammad; Matrouf, Driss
IMPROVED NORMALIZING FLOW-BASED SPEECH ENHANCEMENT USING AN ALL-POLE GAMMATONE FILTERBANK FOR CONDITIONAL INPUT REPRESENTATION; Strauss, Martin*; Torcoli, Matteo; Edler, Bernd
Adaptive-FSN: Integrating full-band extraction and adaptive sub-band encoding for monaural speech enhancement; TSAO, YU-SHENG*; Hsun, Ho Kuan; Hung, Jeih-weih; Chen, Berlin
Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations; Stafylakis, Themos*; Mošner, Ladislav; Kakouros, Sofoklis; Plchot, Oldřich; Burget, Lukas; Cernocky, Jan Honza
Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR; Chen, Zhehuai*; Bapna, Ankur; Rosenberg, Andrew; Zhang, Yu; Ramabhadran, Bhuvana; Moreno, Pedro; Chen, Nanxin
Damage Control during Domain Adaptation for Transducer Based Automatic Speech Recognition; Majumdar, Somshubra*; Acharya, Shantanu; Lavrukhin, Vitaly; Ginsburg, Boris
Internal Language Model Personalization of E2E Automatic Speech Recognition Using Random Encoder Features; Stooke, Adam*; Sim, Khe C; Chua, Mason; Munkhdalai, Tsendsuren; Strohman, Trevor
On granularity of prosodic representations in expressive text-to-speech; Babiański, Mikołaj*; Pokora, Kamil; Shah, Raahil; Sienkiewicz, Rafał; Korzekwa, Daniel; Klimkov, Viacheslav
A STUDY ON THE INTEGRATION OF PRE-TRAINED SSL, ASR, LM AND SLU MODELS FOR SPOKEN LANGUAGE UNDERSTANDING; Peng, Yifan*; Arora, Siddhant; Higuchi, Yosuke; Ueda, Yushi; Kumar, Sujay; Ganesan, Karthik; Dalmia, Siddharth; Chang, Xuankai; Watanabe, Shinji
Phoneme Segmentation Using Self-Supervised Speech Models; Strgar, Luke*; Harwath, David
UNIFIED END-TO-END SPEECH RECOGNITION AND ENDPOINTING FOR FAST AND EFFICIENT SPEECH SYSTEMS; Bijwadia, Shaan*; Chang, Shuo-yiin; Sainath, Tara; Li, Bo; Zhang, Chao; He, Yanzhang
Textual Data Augmentation for Arabic-English Code-Switching Speech Recognition; Hussain, Amir*; Chowdhury, Shammur; Abdelali, Ahmed; Dehak, Najim; Ali, Ahmed; Khudanpur, Sanjeev
Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy; Meyer, Sarina*; Tilli, Pascal; Denisov, Pavel; Lux, Florian; Koch, Julia; Thang, Vu
Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition; Hamed, Injy*; Hussain, Amir; Chellah, Oumnia; Chowdhury, Shammur; Mubarak, Hamdy; Sitaram, Sunayana; Habash, Nizar; Ali, Ahmed
Four-in-One: A Joint Approach to Inverse Text Normalization, Punctuation, Capitalization, and Disfluency for Automatic Speech Recognition; Tan, Sharman W; Behre, Piyush*; Kibre, Nick; Alphonso, Issac; Chang, Shawn
INVESTIGATING THE IMPORTANT TEMPORAL MODULATIONS FOR DEEP-LEARNING-BASED SPEECH ACTIVITY DETECTION; Vuong, Tyler*; Madaan, Nikhil; Panda, Rohan; Stern, Richard M
UNSUPERVISED DOMAIN ADAPTATION OF NEURAL PLDA USING SEGMENT PAIRS FOR SPEAKER VERIFICATION; Ülgen, İsmail Rasim*; Arslan, Mustafa Levent
Context-aware Neural Confidence Estimation for Rare Word Speech Recognition; Qiu, David*; Munkhdalai, Tsendsuren; He, Yanzhang; Sim, Khe C
NAM+: TOWARDS SCALABLE END-TO-END CONTEXTUAL BIASING FOR ADAPTIVE ASR; Wu, Zelin*; Munkhdalai, Tsendsuren; Pundak, Golan; Sim, Khe C; Li, David; Rondon, Pat; Sainath, Tara
INVESTIGATING ACTIVE-LEARNING-BASED TRAINING DATA SELECTION FOR SPEECH SPOOFING COUNTERMEASURE; Wang, Xin*; Yamagishi, Junichi
Learning a Dual-Mode Speech Recognition Model via Self-Pruning; Liu, Chunxi*; Shangguan, Yuan; Yang, Haichuan; Shi, Yangyang; Krishnamoorthi, Raghuraman; Kalinli, Ozlem
Learning mask scalars for improved robust automatic speech recognition; Narayanan, Arun*; Walker, James; Panchapagesan, Sankaran; Howard, Nathan; Koizumi, Yuma
Efficient Text Analysis with Pre-trained Neural Network Models; Cui, Jia*; Lu, Heng; Wang, Wenjie; Kang, Shiyin; He, Liqiang; Li, Guangzhi; Yu, Dong
Empirical Analysis of Training Strategies of Transformer-based Japanese Chit-chat Systems; Sugiyama, Hiroaki*; Mizukami, Masahiro; Arimoto, Tsunehiro; Narimatsu, Hiromi; Chiba, Yuya; Nakajima, Hideharu; Meguro, Toyomi
Response Timing Estimation for Spoken Dialog Systems based on Syntactic Completeness Prediction; Sakuma, Jin*; Fujie, Shinya; Kobayashi, Tetsunori
E-Branchformer: Branchformer with Enhanced merging for speech recognition; Kim, Kwangyoun*; Wu, Felix; Peng, Yifan; Pan, Jing; Sridhar, Prashant; Han, Kyu Jeong; Watanabe, Shinji
A comprehensive study on self-supervised distillation for speaker representation learning; Chen, Zhengyang*; Qian, Yao; Han, Bing; Qian, Yanmin; Zeng, Michael
TDOA ESTIMATION OF SPEECH SOURCE IN NOISY REVERBERANT ENVIRONMENTS; Bu, Suliang; Zhao, Tuo*; Zhao, Yunxin
On the Utility of Self-supervised Models for Prosody-related Tasks; Lin, Guan-Ting*; Feng, Chi Luen; Huang, Wei-Ping; Tseng, Yuan; Li, Chen An; Lin, Tzu-Han; Lee, Hung-yi; Ward, Nigel
vTTS: visual-text to speech; Nakano, Yoshifumi; Saeki, Takaaki; Takamichi, Shinnosuke*; Sudoh, Katsuhito; Saruwatari, Hiroshi
SPEECH EMOTION RECOGNITION WITH COMPLEMENTARY ACOUSTIC REPRESENTATIONS; Zhang, Xiaoming*; Zhang, Fan; Cui, Xiaodong; Zhang, Wei
EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers; Maiti, Soumi*; Ueda, Yushi; Watanabe, Shinji; Zhang, Chunlei; Yu, Meng; Zhang, Shixiong; Xu, Yong
A ZERO-SHOT APPROACH TO IDENTIFYING CHILDREN’S SPEECH IN AUTOMATIC GENDER CLASSIFICATION; Saraf, Amruta; Sivaraman, Ganesh*; Khoury, Elie
STREAMING, FAST AND ACCURATE ON-DEVICE INVERSE TEXT NORMALIZATION FOR AUTOMATIC SPEECH RECOGNITION; Gaur, Yashesh*; Kibre, Nick; Xue, Jian; Shu, Kangyuan; Wang, Yuhui; Alphonso, Issac; Li, Jinyu; Gong, Yifan
Personalization of CTC Speech Recognition Models; Dingliwal, Saket*; Sunkara, Monica; Bodapati, Sravan Babu; Ronanki, Srikanth; Farris, Jeff; Kirchhoff, Katrin
How Do Phonological Properties Affect Bilingual Automatic Speech Recognition?; Jain, Shelly*; Yadavalli, Aditya; Mirishkar, Sai Ganesh; Vuppala, Anil
Building Markovian Generative Architectures over Pretrained LM Backbones for Efficient Task-Oriented Dialog Systems; Liu, Hong*; Cai, Yucheng; Ou, Zhijian; Huang, Yi; Feng, Junlan
Disentangled Speech Representation Learning for One-Shot Cross-lingual Voice Conversion Using $\beta$-VAE; Lu, Hui*; Wang, Disong; Wu, Xixin; Wu, Zhiyong; Liu, Xunying; Meng, Helen
PROFICIENCY ASSESSMENT OF L2 SPOKEN ENGLISH USING WAV2VEC 2.0; Bannò, Stefano*; Matassoni, Marco
IMPROVING LUXEMBOURGISH SPEECH RECOGNITION WITH CROSS-LINGUAL SPEECH REPRESENTATIONS; Nguyen, Le Minh*; Nayak, Shekhar; Coler, Matt
Macro-block dropout for improved regularization in training end-to-end speech recognition models; Kim, Chanwoo*; Indurti, Sathish; Park, Jinhwan; Sung, Wonyong
Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation; Chevi, Rendi*; Prasojo, Radityo Eko; Aji, Alham Fikri; Tjandra, Andros; Sakti, Sakriani
AUTOMATIC PREDICTION OF INTELLIGIBILITY OF WORDS AND PHONEMES PRODUCED ORALLY BY JAPANESE LEARNERS OF ENGLISH; Minematsu, Nobuaki*; Zhu, Chuanbo; Kunihara, Takuya; Saito, Daisuke; Nakanishi, Noriko
PADA: PRUNING ASSISTED DOMAIN ADAPTATION FOR SELF-SUPERVISED SPEECH REPRESENTATIONS; Lodagala, Vasista Sai*; Ghosh, Sreyan; Umesh, S
CCC-WAV2VEC 2.0: CLUSTERING AIDED CROSS CONTRASTIVE SELF-SUPERVISED LEARNING OF SPEECH REPRESENTATIONS; Lodagala, Vasista Sai*; Ghosh, Sreyan; Umesh, S
Effective Mispronunciation Detection and Diagnosis Leveraging Heterogeneous Information Cues; Yan, Bi-Cheng*; Wang, Hsin-Wei; Chen, Berlin
SUPERB @ SLT 2022: Challenge on Generalization and Efficiency of Self-Supervised Speech Representation Learning; Feng, Tzu-hsun*; Dong, Annie; Yeh, Ching-Feng; Yang, Shu-wen; Lin, Tzu-Quan; Shi, Jiatong; Chang, Kai-Wei; Huang, Zili; Wu, Haibin; Chang, Xuankai; Watanabe, Shinji; Mohamed, Abdel-rahman; Li, Shang-Wen; Lee, Hung-yi
AVSE CHALLENGE: AUDIO-VISUAL SPEECH ENHANCEMENT CHALLENGE; Aldana, Andrea L*; Valentini, Cassia; Klejch, Ondrej; Gogate, Mandar; Dashtipour, Kia K; Hussain, Amir; Bell, Peter

Demo Papers

Title; Authors
ISPEAK: INTERACTIVE SPOKEN LANGUAGE UNDERSTANDING SYSTEM FOR CHILDREN WITH SPEECH AND LANGUAGE DISORDERS; Lin, Baihan; Zhang, Xinxin
ON-DEVICE STREAMING TARGET-SPEAKER ASR WITH NEURAL TRANSDUCER; Moriya, Takafumi; Sato, Hiroshi; Ochiai, Tsubasa; Delcroix, Marc; Asami, Taichi
VOICE-ENABLED AUDIOVISUAL AGENT FOR QUESTION ANSWERING IN ENGLISH AND ARABIC; Saz, Oscar; Abdellah, Ahmed; McArthur, Luca; McKenna, Daniel; Shelley, Simon; Zhang, Xinyue
LUX-ASR: BUILDING AN ASR SYSTEM FOR THE LUXEMBOURGISH LANGUAGE; Gilles, Peter; Hosseini-Kivanani, Nina; Hillah, Leopold