Yuma Koizumi, with the Media Intelligence Laboratories of the Service Innovation Laboratory Group, and Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, and Kunio Kashino, with the Communication Science Laboratories of the Science and Core Technology Laboratory Group, won 1st place in the Audio Captioning task at the DCASE 2020 Challenge held from March to July this year.
The DCASE* Challenge is an annual international competition officially recognized by the IEEE Audio and Acoustic Signal Processing Technical Committee, and this year's event was the sixth. "Automated audio captioning" is a new task that DCASE introduced this year. The challenge is to automatically generate appropriate and accurate text descriptions or explanations for given audio signals of various non-speech sounds. Ten teams from around the world competed in the task.
NTT is one of the earliest research institutes in the world to work on the verbalization of sounds. To tackle the task, we took full advantage of the algorithms and knowledge accumulated by the above members and combined various ideas ranging from pre-processing to post-processing and automated meta-parameter tuning.
Automated audio captioning is an emerging field, and no method for achieving it has yet been established. The capability to describe all kinds of sounds in text could bring many benefits to our lives in the near future. NTT will therefore continue its research to further strengthen the technology.