Figure 1: Acoustic matching in XR.
Figure 2: Conditional wave-u-net structure.
Figure 3: Data generation and training.
In this work, we propose a time-domain, end-to-end acoustic space transfer approach that matches the acoustic properties of two arbitrarily chosen reverberant speech signals. Our method, CWUNET (conditional Wave-U-Net), consists of three main components, shown in Figure 2.
We train the proposed approach with two different loss functions: a multi-resolution STFT loss (CWUNET-stft) and a multi-resolution log-mel loss (CWUNET-mel). We compare it against two baselines: (1) weighted prediction error (WPE) dereverberation [Yoshioka et al., 2010] combined with DNN-based blind single-channel RIR estimation [Steinmetz et al., 2021] (WPE+FINS), and (2) DNN-based dereverberation [Schröter et al., 2022] combined with the same RIR estimation method (DFNET+FINS). Additionally, we evaluate our models against a semi-oracle condition that convolves the estimated RIRs with the true anechoic signals (oracle+FINS).
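To illustrate the first training objective, the sketch below implements a common formulation of the multi-resolution STFT loss: a spectral-convergence term plus a log-magnitude L1 term, averaged over several FFT resolutions. This is a minimal NumPy sketch under our own assumptions (Hann window, the three resolution pairs shown, the `1e-8` floors); the paper's exact configuration may differ.

```python
import numpy as np

def stft_mag(x, fft_size, hop):
    """Magnitude STFT via framed FFT with a Hann window."""
    win = np.hanning(fft_size)
    n_frames = 1 + (len(x) - fft_size) // hop
    frames = np.stack([x[i * hop : i * hop + fft_size] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def multires_stft_loss(pred, target,
                       resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Spectral convergence + log-magnitude L1, averaged over FFT resolutions.

    The resolution pairs (fft_size, hop) are illustrative, not the paper's.
    """
    loss = 0.0
    for fft_size, hop in resolutions:
        P = stft_mag(pred, fft_size, hop)
        T = stft_mag(target, fft_size, hop)
        # spectral convergence: relative Frobenius error of the magnitudes
        sc = np.linalg.norm(T - P) / (np.linalg.norm(T) + 1e-8)
        # L1 distance between log magnitudes (floored to avoid log(0))
        log_mag = np.mean(np.abs(np.log(T + 1e-8) - np.log(P + 1e-8)))
        loss += sc + log_mag
    return loss / len(resolutions)

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
print(multires_stft_loss(x, x))                            # zero for identical signals
print(multires_stft_loss(x, rng.standard_normal(16000)))   # positive for mismatched signals
```

The CWUNET-mel variant would replace the linear magnitudes with mel-filterbank energies before the log; the comparison over multiple window sizes is what prevents the model from overfitting a single time-frequency trade-off.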
Below, we present audio samples from the listening test comparing our models against the baselines. The results show that CWUNET-stft outperforms both baselines and achieves the best scores among all non-oracle models. CWUNET-mel, while effective at replicating the reverberation, introduces metallic ringing artifacts that the objective metrics do not capture. The WPE+FINS condition received the lowest scores, consistent with the objective evaluations.
Figure 4: Results of the MUSHRA listening test: individual ratings overlaid with error bars. The mean of each condition is plotted as a point; the error bars represent the 95% confidence interval around the mean. Asterisks and n.s. indicate statistical significance (n.s. – not significant, * p<0.05, *** p<0.001). All remaining pairwise comparisons between conditions reached the highest significance level (***).