Conditioned Wave-U-Net for Acoustic Matching of Speech in Shared XR Environments

Joanna Luberadzka1, Enric Gusó1,2, Umut Sayin1

1 Eurecat, Centre Tecnològic de Catalunya, Tecnologies Multimèdia, Barcelona

2 Universitat Pompeu Fabra, Music Technology Group, Barcelona

ABSTRACT

A mismatch in acoustics between users is a challenge for interaction in shared XR environments. It can be mitigated through acoustic matching, which traditionally involves dereverberation followed by convolution with a room impulse response (RIR) of the target space. However, the target RIR is usually unavailable in such settings. We propose to tackle this problem in an end-to-end manner using a Wave-U-Net encoder-decoder network with potential for real-time operation. We use FiLM layers to condition this network on an embedding extracted by a separate reverb encoder, in order to match the acoustic properties between two arbitrarily chosen signals. We demonstrate that this approach outperforms two baseline methods and provides the flexibility to both dereverberate and re-reverberate audio signals.
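For reference, a minimal sketch of the traditional two-stage pipeline the abstract mentions. The function name is illustrative, and `dereverberate` stands in for any dereverberation method; in the XR scenario, the `target_rir` is exactly what is unavailable.

```python
from scipy.signal import fftconvolve

def traditional_acoustic_match(source, target_rir, dereverberate):
    """Two-stage acoustic matching: strip the source room's reverberation,
    then imprint the target room by convolving with its RIR."""
    dry = dereverberate(source)            # stage 1: any dereverberation method
    wet = fftconvolve(dry, target_rir)     # stage 2: convolve with the target RIR
    return wet[: len(source)]              # trim to the input length
```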

Figure 1: Acoustic matching in XR.


Figure 2: Conditioned Wave-U-Net structure.


Figure 3: Data generation and training.

In this work, we propose a time-domain, end-to-end acoustic space transfer approach that matches the acoustic properties between two arbitrarily chosen reverberant speech signals. Our method, CWUNET, consists of three main components (see Figure 2): a Wave-U-Net encoder-decoder that transforms the input speech, a reverb encoder that extracts a reverberation embedding from the reference signal, and FiLM layers that condition the Wave-U-Net on this embedding. A sketch of these components is shown below.
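The following is a minimal PyTorch sketch of the three components; the layer counts, kernel sizes, and embedding width are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: predicts a per-channel scale (gamma)
    and shift (beta) from the conditioning embedding and applies them to
    the feature map."""
    def __init__(self, emb_dim, n_channels):
        super().__init__()
        self.proj = nn.Linear(emb_dim, 2 * n_channels)

    def forward(self, x, emb):                     # x: (B, C, T), emb: (B, E)
        gamma, beta = self.proj(emb).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)

class ReverbEncoder(nn.Module):
    """Maps a reference reverberant signal to a fixed-size reverb embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, 15, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, 15, stride=4), nn.ReLU(),
            nn.Conv1d(64, emb_dim, 15, stride=4), nn.ReLU())

    def forward(self, ref):                        # ref: (B, 1, T)
        return self.conv(ref).mean(dim=-1)         # average pool -> (B, E)

class ConditionedWaveUNet(nn.Module):
    """Wave-U-Net whose levels are FiLM-conditioned on the reverb embedding;
    input length must be divisible by 2**depth."""
    def __init__(self, depth=3, base=32, emb_dim=128):
        super().__init__()
        ch = [1] + [base * 2 ** i for i in range(depth)]   # e.g. [1, 32, 64, 128]
        self.down = nn.ModuleList(
            nn.Conv1d(ch[i], ch[i + 1], 15, stride=2, padding=7) for i in range(depth))
        self.film = nn.ModuleList(FiLM(emb_dim, ch[i + 1]) for i in range(depth))
        self.up = nn.ModuleList(
            nn.ConvTranspose1d(ch[i + 1], ch[i + 1], 2, stride=2) for i in range(depth))
        self.merge = nn.ModuleList(
            nn.Conv1d(ch[i + 1] + ch[i], ch[i], 5, padding=2) for i in range(depth))

    def forward(self, x, emb):                     # x: (B, 1, T)
        skips = []
        for down, film in zip(self.down, self.film):
            skips.append(x)                        # keep each level's input as skip
            x = torch.relu(film(down(x), emb))     # downsample, then FiLM-condition
        for i in reversed(range(len(self.down))):
            x = self.up[i](x)                      # upsample by 2
            x = self.merge[i](torch.cat([x, skips[i]], dim=1))
            if i > 0:
                x = torch.relu(x)
        return x                                   # matched signal, (B, 1, T)

# Usage: re-render `src` with the acoustics of the space heard in `ref`.
net, enc = ConditionedWaveUNet(), ReverbEncoder()
src = torch.randn(2, 1, 16384)                     # input reverberant speech
ref = torch.randn(2, 1, 16384)                     # reference from the target space
out = net(src, enc(ref))                           # (2, 1, 16384)
```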

We train the proposed approach with two different loss functions: a multi-resolution STFT loss (CWUNET-stft) and a multi-resolution log-mel loss (CWUNET-mel). We compare it against two baselines: (1) weighted prediction error (WPE) dereverberation [Yoshioka et al., 2010] combined with DNN-based blind single-channel RIR estimation [Steinmetz et al., 2021] (WPE+FINS), and (2) DNN-based dereverberation [Schröter et al., 2022] combined with the same RIR estimation method (DFNET+FINS). Additionally, we evaluate our models against a semi-oracle case that combines the estimated RIRs with anechoic signals (oracle+FINS).
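A multi-resolution STFT loss is commonly computed as a spectral-convergence term plus a log-magnitude term, summed over several FFT resolutions; the sketch below follows that common formulation, and the exact variant and resolutions used for CWUNET-stft are assumptions here.

```python
import torch
import torch.nn.functional as F

def multi_res_stft_loss(pred, target,
                        fft_sizes=(512, 1024, 2048),
                        hop_sizes=(128, 256, 512)):
    """Multi-resolution STFT loss over (B, T) waveforms: a spectral-convergence
    term plus a log-magnitude L1 term, averaged over several resolutions.
    (A log-mel variant would apply a mel filterbank to the magnitudes
    before taking the log.)"""
    loss = 0.0
    for n_fft, hop in zip(fft_sizes, hop_sizes):
        win = torch.hann_window(n_fft, device=pred.device)
        S_pred = torch.stft(pred, n_fft, hop, window=win, return_complex=True).abs()
        S_true = torch.stft(target, n_fft, hop, window=win, return_complex=True).abs()
        sc = torch.norm(S_true - S_pred, p="fro") / torch.norm(S_true, p="fro")
        mag = F.l1_loss(S_pred.clamp(min=1e-7).log(), S_true.clamp(min=1e-7).log())
        loss = loss + sc + mag
    return loss / len(fft_sizes)
```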

Below, we present audio samples from the listening test comparing the performance of our models against the baselines. The results show that our CWUNET-stft model outperforms the baselines and achieves the best results among all non-oracle models. The CWUNET-mel model, while effective at replicating reverberation, introduces metallic ringing artifacts that are not captured by objective metrics. The WPE+FINS condition received the lowest scores, consistent with the objective evaluations.

Another Figure

Figure 4: Results of the MUSHRA listening test. Individual ratings are overlaid with error bars; the mean of each condition is plotted as a point, and the error bars represent the 95% confidence interval around the mean. Asterisks and n.s. indicate statistical significance (n.s. – not significant, * p<0.05, *** p<0.001). All remaining pairwise comparisons between conditions reached the highest significance level (***).

Paper: link will be added after publication

Code: https://github.com/joaluba/CWUNET