(NTT Press Release)
December 14, 2015
Nippon Telegraph and Telephone Corporation (NTT; head office: Chiyoda-ku, Tokyo, Japan; President and CEO: Hiroo Unoura) has achieved the highest recognition accuracy at CHiME-3, an international speech recognition challenge. The challenge featured speech recognition in noisy public environments, including cafés, street intersections, public transport (buses) and pedestrian areas, recorded using a 6-channel tablet-based microphone array. The top score was achieved by combining distortionless speech enhancement with deep-learning speech recognition techniques.
NTT will present the details of its achievement at the 2015 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015), held December 13-17, 2015, in Scottsdale, Arizona, USA.
In recent years, speech recognition techniques have advanced rapidly, fueled by progress in deep learning, and are now widely used in voice-operable devices such as smartphones. Current speech recognition techniques are mainly used in relatively quiet environments. If they could also be used in noisy public environments, the usability of voice-operable devices would be greatly extended. This requires further advances in speech recognition techniques.
To accelerate such advances, the 3rd CHiME Speech Separation and Recognition Challenge (CHiME-3) was organized this year. CHiME-3 addressed speech recognition in noisy public environments, including cafés, street junctions, public transport (buses) and pedestrian areas, recorded using a 6-channel tablet-based microphone array. The task was so challenging that the speech recognition accuracy obtained with a current deep-learning speech recognition technique was only 66.6%. CHiME-3 attracted a great deal of attention; 25 research institutes from around the world participated.
Among the 25 systems submitted to CHiME-3, the speech recognition system developed by NTT (Fig. 1) achieved the highest recognition accuracy: 94.2% (Fig. 2). NTT has long recognized the importance of noisy speech recognition for more useful voice services and has established many advanced techniques for it. Building on these, NTT newly developed distortionless speech enhancement and deep-learning speech recognition techniques and achieved the best-performing system in CHiME-3.
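To put the two accuracy figures above in perspective, the short sketch below converts them into error rates and computes the relative error reduction. It assumes that the error rate is simply 100% minus the reported recognition accuracy, which this release does not state explicitly; the function names are illustrative only.

```python
def error_rate(accuracy_percent):
    """Assumption: error rate (%) = 100 - recognition accuracy (%)."""
    return 100.0 - accuracy_percent


def relative_error_reduction(baseline_accuracy, improved_accuracy):
    """Percentage of the baseline errors removed by the improved system."""
    baseline_error = error_rate(baseline_accuracy)
    improved_error = error_rate(improved_accuracy)
    return 100.0 * (baseline_error - improved_error) / baseline_error


# Figures quoted in this release.
baseline_accuracy = 66.6  # current deep-learning technique on the CHiME-3 task
ntt_accuracy = 94.2       # NTT's submitted system

reduction = relative_error_reduction(baseline_accuracy, ntt_accuracy)
print(f"Baseline error rate: {error_rate(baseline_accuracy):.1f}%")
print(f"NTT error rate:      {error_rate(ntt_accuracy):.1f}%")
print(f"Relative reduction:  {reduction:.1f}%")
```

Under that assumption, the error rate falls from 33.4% to 5.8%, which corresponds to removing roughly 83% of the baseline errors.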
With the deep-learning speech recognition unit alone, NTT achieved a speech recognition accuracy of 84.4% (Fig. 2).
The speech enhancement unit suppresses the noise and reverberant components, which are the main causes of degraded recognition performance in noisy environments. However, a deep-learning speech recognition system is very sensitive to the speech distortion that speech enhancement pre-processing can introduce. To handle this issue, NTT has also successfully developed a distortionless speech enhancement technique, which can, in principle, suppress the noise and reverberant components without distorting the speech components in a recording (Fig. 5). By combining distortionless speech enhancement with the above speech recognition unit, NTT improved its system's speech recognition accuracy to 94.2% for the CHiME-3 task.
NTT will continue to refine the newly developed technologies, with the aim of introducing them into its speech recognition services around 2018. Future plans include assessing performance with fewer microphones and implementing the above techniques in real time.
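This release does not specify the algorithm behind the distortionless speech enhancement. As an illustration of what a distortionless constraint means for a microphone array, the sketch below uses the textbook minimum variance distortionless response (MVDR) beamformer, w = R^-1 d / (d^H R^-1 d), for a single frequency bin. It should be read as a generic example and not as NTT's method. The 3-microphone setup, the steering vectors, the noise level and all function names are hypothetical values chosen for the demonstration.

```python
import cmath


def solve(matrix, rhs):
    """Solve matrix * x = rhs for a small complex system (Gaussian elimination)."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in reversed(range(n)):
        residual = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = residual / aug[r][r]
    return x


def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(ai.conjugate() * bi for ai, bi in zip(a, b))


def mvdr_weights(noise_cov, steering):
    """MVDR weights: minimize output noise power subject to w^H d = 1."""
    r_inv_d = solve(noise_cov, steering)
    denom = inner(steering, r_inv_d)
    return [v / denom for v in r_inv_d]


# Hypothetical 3-microphone example (phases in radians, one frequency bin).
target = [cmath.exp(-1j * 0.4 * m) for m in range(3)]      # desired speaker
interferer = [cmath.exp(-1j * 1.5 * m) for m in range(3)]  # directional noise

# Noise covariance: unit-power interferer plus weak uncorrelated sensor noise.
noise_cov = [
    [interferer[i] * interferer[j].conjugate() + (0.1 if i == j else 0.0)
     for j in range(3)]
    for i in range(3)
]

weights = mvdr_weights(noise_cov, target)
target_gain = inner(weights, target)
interferer_gain = inner(weights, interferer)

print(f"|gain| toward target:     {abs(target_gain):.3f}")
print(f"|gain| toward interferer: {abs(interferer_gain):.3f}")
```

In this toy setup the gain toward the target direction stays at 1, so the target passes through undistorted, while the gain toward the interferer is strongly attenuated. A real system would also need to estimate the steering vector and noise statistics from the recording, which this sketch takes as given.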
Nippon Telegraph and Telephone Corporation
Science and Core Technology Laboratory Group, Public Relations
Information is current as of the date of issue of the individual press release.
Please be advised that information may be outdated after that point.