Citation: Ge X, Han J, Long Y, et al. PercepNet+: A Phase and SNR Aware PercepNet for Real-Time Speech Enhancement[J]. arXiv preprint arXiv:2203.02263, 2022.
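Table 2 compares the performance of all the techniques proposed in PercepNet+ on the VCTK test set and on our simulated D-NOISE test set. On the VCTK test set, the proposed PercepNet+ significantly improves over PercepNet, raising PESQ from 2.46 to 2.65 and STOI from 93.43% to 95.68%. Specifically, the additional complex-valued features and gains yield absolute increases of 0.08 in PESQ and 1.11% in STOI. With the help of the SNR estimator, we obtain a further improvement of 0.04 in PESQ and 0.27% in STOI. When SNR-switched post-processing (PP) and the over-attenuation loss are applied, PESQ and STOI reach 2.62 and 95.49%, respectively. Finally, the updated TF-GRU improves performance further, giving the final scores of 2.65 and 95.68%. On the D-NOISE test set, all the proposed PercepNet+ techniques also bring consistent gains, with overall improvements of 0.15 in PESQ and 2.93% in STOI. In addition, the proposed PercepNet+ has 8.5M trainable parameters, 0.5M more than PercepNet, and a real-time factor (RTF) of 0.351, measured on a single thread of a machine with an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz. We therefore conclude that PercepNet+ substantially outperforms PercepNet without a significant increase in the number of network parameters.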
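The real-time factor quoted above is conventionally the ratio of processing time to the duration of the audio processed, so a value below 1 means the system runs faster than real time. The following is a minimal sketch of how such a figure could be measured, not the authors' benchmarking code: the function names, the `dummy_enhance` stand-in, and the 16 kHz sample rate are all illustrative assumptions.

```python
import time
from typing import Callable, List


def rtf_from_times(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(
    enhance: Callable[[List[float]], List[float]],
    samples: List[float],
    sample_rate: int,
    repeats: int = 3,
) -> float:
    """Time `enhance` on `samples` and return the best-of-`repeats` RTF."""
    audio_seconds = len(samples) / sample_rate
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        enhance(samples)
        best = min(best, time.perf_counter() - start)
    return rtf_from_times(best, audio_seconds)


def dummy_enhance(samples: List[float]) -> List[float]:
    """Hypothetical stand-in for a speech enhancement model (assumption)."""
    return [0.5 * s for s in samples]


if __name__ == "__main__":
    one_second = [0.0] * 16000  # 1 s of silence at an assumed 16 kHz rate
    print("measured RTF:", measure_rtf(dummy_enhance, one_second, 16000))
```

Read this way, an RTF of 0.351 would correspond to roughly 3.51 s of single-thread compute per 10 s of audio. Taking the best of several runs is one common choice for reducing timing noise; averaging would be equally defensible.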