Authors:
(1) Juan F. Montesinos, Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain {juanfelipe.montesinos@upf.edu};
(2) Olga Slizovskaia, Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain {olga.slizovskaia@upf.edu};
(3) Gloria Haro, Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain {gloria.haro@upf.edu}.
V. CONCLUSIONS
We have presented Solos, a new audio-visual dataset of music recordings of soloists, suitable for different self-supervised learning tasks such as source separation with the mix-and-separate strategy, sound localization, cross-modal generation and finding audio-visual correspondences. The dataset covers 13 different instruments, which are common in chamber orchestras and are the same ones included in the University of Rochester Multi-Modal Music Performance (URMP) dataset [1]. The characteristics of URMP (a small dataset of real performances with ground-truth individual stems) make it suitable for testing purposes; however, to the best of our knowledge, there is to date no large-scale dataset covering the same set of instruments. Two audio-visual source separation networks based on the U-Net architecture have been trained on the new dataset and evaluated on URMP, showing the impact of training on the same set of instruments as the test set. Moreover, Solos provides body skeletons and timestamps of the video intervals in which the hands are sufficiently visible. This information can be useful for training and for learning to solve the sound localization task.
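For concreteness, the mix-and-separate strategy mentioned above can be sketched in a few lines of PyTorch. The snippet below is a minimal, illustrative sketch rather than the exact training pipeline of this paper: the `model` interface (a U-Net-like network that takes a mixture magnitude spectrogram plus visual features for one source and returns a soft mask in [0, 1]), the STFT settings and the binary cross-entropy loss on ideal ratio masks are all assumptions made for illustration.

```python
# Minimal sketch of one mix-and-separate training step (PyTorch).
# Assumptions (not from the paper): `model` outputs a sigmoid mask in [0, 1];
# STFT parameters and the BCE loss on ideal ratio masks are illustrative.
import torch
import torch.nn.functional as F


def mix_and_separate_step(model, audio_a, audio_b, visual_a, visual_b,
                          n_fft=1022, hop_length=256):
    """Mix two solo recordings and learn to recover a mask for each source."""
    window = torch.hann_window(n_fft, device=audio_a.device)
    mixture = audio_a + audio_b  # the individual stems act as ground truth

    def magnitude(wave):
        spec = torch.stft(wave, n_fft, hop_length=hop_length,
                          window=window, return_complex=True)
        return spec.abs()

    mix_mag = magnitude(mixture)
    mag_a, mag_b = magnitude(audio_a), magnitude(audio_b)

    # Ideal ratio masks computed from the ground-truth stems.
    eps = 1e-8
    irm_a = mag_a / (mag_a + mag_b + eps)
    irm_b = mag_b / (mag_a + mag_b + eps)

    # The network predicts one mask per source, conditioned on its visual stream.
    pred_a = model(mix_mag.unsqueeze(1), visual_a)
    pred_b = model(mix_mag.unsqueeze(1), visual_b)

    loss = (F.binary_cross_entropy(pred_a, irm_a.unsqueeze(1)) +
            F.binary_cross_entropy(pred_b, irm_b.unsqueeze(1)))
    return loss
```

At inference time, under the same assumptions, the predicted mask would be applied to the mixture magnitude spectrogram and the separated waveform recovered via the inverse STFT using the mixture phase.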
REFERENCES
[1] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, “Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications,” IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 522–535, Feb 2019.
[2] B. Li, K. Dinesh, Z. Duan, and G. Sharma, “See and listen: Score-informed association of sound tracks to players in chamber music performance videos,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2906–2910.
[3] E. C. Cherry, “Some experiments on the recognition of speech, with one and with two ears,” The Journal of the Acoustical Society of America, vol. 25, no. 5, pp. 975–979, 1953.
[4] A. Hyvärinen and E. Oja, “Independent component analysis: Algorithms and applications,” Neural Networks, vol. 13, no. 4-5, pp. 411–430, 2000.
[5] M. Zibulevsky and B. A. Pearlmutter, “Blind source separation by sparse decomposition in a signal dictionary,” Neural Computation, vol. 13, no. 4, pp. 863–882, 2001.
[6] T. Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1066–1074, 2007.
[7] D. P. W. Ellis, “Prediction-driven computational auditory scene analysis,” Ph.D. dissertation, Massachusetts Institute of Technology, 1996.
[8] P. Smaragdis, B. Raj, and M. Shashanka, “A probabilistic latent variable model for acoustic modeling,” Advances in models for acoustic processing, NIPS, vol. 148, pp. 8–1, 2006.
[9] P. Chandna, M. Miron, J. Janer, and E. Gómez, “Monoaural audio source separation using deep convolutional neural networks,” in International Conference on Latent Variable Analysis and Signal Separation, 2017, pp. 258–266.
[10] D. Stoller, S. Ewert, and S. Dixon, “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” arXiv preprint arXiv:1806.03185, 2018.
[11] J. R. Hershey and J. R. Movellan, “Audio vision: Using audio-visual synchrony to locate sounds,” in Advances in Neural Information Processing Systems, 2000, pp. 813–819.
[12] E. Kidron, Y. Y. Schechner, and M. Elad, “Pixels that sound,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, 2005, pp. 88–95.
[13] T. Darrell, J. W. Fisher, and P. Viola, “Audio-visual segmentation and the cocktail party effect,” in Advances in Multimodal Interfaces (ICMI 2000), 2000, pp. 32–40.
[14] D. Sodoyer, J.-L. Schwartz, L. Girin, J. Klinkisch, and C. Jutten, “Separation of audio-visual speech sources: a new approach exploiting the audio-visual coherence of speech stimuli,” EURASIP Journal on Advances in Signal Processing, vol. 2002, no. 11, p. 382823, 2002.
[15] B. Rivet, L. Girin, and C. Jutten, “Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 96–108, 2007.
[16] B. Li, C. Xu, and Z. Duan, “Audiovisual source association for string ensembles through multi-modal vibrato analysis,” Proc. Sound and Music Computing (SMC), 2017.
[17] S. Parekh, S. Essid, A. Ozerov, N. Q. Duong, P. Pérez, and G. Richard, “Guiding audio source separation by video object information,” in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 61–65.
[18] R. Gao and K. Grauman, “Co-separating sounds of visual objects,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3879–3888.
[19] H. Zhao, C. Gan, W.-C. Ma, and A. Torralba, “The sound of motions,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1735–1744.
[20] X. Xu, B. Dai, and D. Lin, “Recursive visual sound separation using minus-plus net,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 882–891.
[21] B. Li, K. Dinesh, C. Xu, G. Sharma, and Z. Duan, “Online audio-visual source association for chamber music performances,” Transactions of the International Society for Music Information Retrieval, vol. 2, no. 1, 2019.
[22] R. Arandjelović and A. Zisserman, “Objects that sound,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[23] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba, “The sound of pixels,” in The European Conference on Computer Vision (ECCV), September 2018.
[24] A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” arXiv preprint arXiv:1804.03641, 2018.
[25] B. Korbar, D. Tran, and L. Torresani, “Cooperative learning of audio and video models from self-supervised synchronization,” in Advances in Neural Information Processing Systems, 2018, pp. 7763–7774.
[26] T.-H. Oh, T. Dekel, C. Kim, I. Mosseri, W. T. Freeman, M. Rubinstein, and W. Matusik, “Speech2face: Learning the face behind a voice,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7539–7548.
[27] L. Chen, S. Srivastava, Z. Duan, and C. Xu, “Deep cross-modal audio-visual generation,” in Proceedings of the Thematic Workshops of ACM Multimedia 2017, 2017, pp. 349–357.
[28] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg, “Visual to sound: Generating natural sound for videos in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3550–3558.
[29] E. Shlizerman, L. M. Dery, H. Schoen, and I. Kemelmacher-Shlizerman, “Audio to body dynamics,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[30] S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik, “Learning individual styles of conversational gesture,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3497–3506.
[31] H. Zhou, Z. Liu, X. Xu, P. Luo, and X. Wang, “Vision-infused deep audio inpainting,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[32] C. Gan, D. Huang, H. Zhao, J. B. Tenenbaum, and A. Torralba, “Music gesture for visual sound separation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10478–10487.
[33] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh, “OpenPose: Realtime multi-person 2D pose estimation using part affinity fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[34] C. S. J. Doire and O. Okubadejo, “Interleaved multitask learning for audio source separation with independent databases,” arXiv preprint arXiv:1908.05182, 2019.
[35] F. Yu, V. Koltun, and T. Funkhouser, “Dilated residual networks,” in Computer Vision and Pattern Recognition (CVPR), 2017.
[36] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep U-Net convolutional networks,” in 18th International Society for Music Information Retrieval Conference, 2017, pp. 23–27.
[37] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[38] G. Liu, J. Si, Y. Hu, and S. Li, “Photographic image synthesis with improved U-Net,” in 2018 Tenth International Conference on Advanced Computational Intelligence (ICACI), March 2018, pp. 402–407.
[39] X. Mao, C. Shen, and Y.-B. Yang, “Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections,” in Advances in Neural Information Processing Systems, 2016, pp. 2802–2810.
[40] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint arXiv:1611.07004, 2016.
[41] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[42] “Chapter 7: Frequency domain processing,” in Digital Signal Processing System Design, 2nd ed., N. Kehtarnavaz, Ed. Burlington: Academic Press, 2008, pp. 175–196.
[43] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.