Create the Video Subtitles Based on Voice Recognition Technology: Test for Some Programs at VTV

Authors

  • Huu Phong Nguyen Information and Television Technology Center, Vietnam
  • Nguyen Quoc Bao Vo Posts and Telecommunications Institute of Technology, Vietnam
  • Minh Trung Tran Vietnam Television, Vietnam

Corresponding author's email:

phongsolo@gmail.com

DOI:

https://doi.org/10.54644/jte.71B.2022.1128

Keywords:

STT, WER, VOD, OTT, CC

Abstract

This paper presents the trial results of a Speech-To-Text (STT) recognition tool applied to VOD (Video On Demand) content of the VTVgo system at Vietnam Television. To evaluate the accuracy of the STT tool, the word error rate (WER), a standard metric for automatic speech recognition and machine translation systems, was used to measure performance. Test results for 10 different types of TV programs, totaling 1,065 hours of video, were analyzed. The WER reached low levels, from 2.8% to 4.3%, for genres such as news, the 19h bulletin, and weather forecasts, where most speakers and presenters (MCs) read in standard voices in the studio, the dialogue comes from a single speaker, and there is little interference from outside noise. In addition, to illustrate the video subtitle application, we conducted tests on the VTVgo system and integrated an optional subtitle display tool into the VTVgo app. The tests ran on the Android platform, on Smart TVs and smartphones, to demonstrate the feasibility of video subtitles on OTT (Over The Top) digital content distribution platforms.
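For context, the WER metric referred to above is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of words in the reference. A minimal sketch follows (written in Python for illustration only; it is not the authors' evaluation code, and the sample transcripts are hypothetical).

# Minimal sketch of the WER computation (illustrative only).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word over a 4-word reference gives WER = 0.25 (25%).
print(wer("ban tin thoi su", "ban tin thoi tiet"))

At this scale, the reported WER of 2.8% to 4.3% corresponds to roughly 3 to 4 misrecognized words per 100 reference words.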


Author Biographies

Huu Phong Nguyen, Information and Television Technology Center, Vietnam

Phong Nguyen-Huu received the B.E. degree in Telecommunications Engineering from the University of Transport and Communications – Campus 2 (UTC2), Vietnam, in 2006, and the Master's degree in Telecommunications from the Ho Chi Minh City (HCMC) Posts and Telecommunications Institute of Technology (PTIT), Vietnam, in 2014. Since August 2016, he has been working toward the Ph.D. degree at the Faculty of Telecommunications, Ho Chi Minh City University of Technology (HCMUT). He is currently working for Vietnam Television (VTV). His research interests include mobile communication networks (two-way communications, full-duplex transmission), energy harvesting, audio/video coding, and broadcast technology.

Nguyen Quoc Bao Vo, Posts and Telecommunications Institute of Technology, Vietnam

Vo Nguyen Quoc Bao received the Ph.D. degree in electrical engineering from the University of Ulsan, South Korea, in 2010. Dr. Bao is an Associate Professor of Wireless Communications at the Posts and Telecommunications Institute of Technology (PTIT), Vietnam, where he currently serves as Director of the Wireless Communication Laboratory (WCOMM). He is a Senior Member of the IEEE. He is the Technical Editor-in-Chief of the REV Journal on Electronics and Communications, and also serves as an Editor of Transactions on Emerging Telecommunications Technologies (Wiley ETT) and the VNU Journal of Computer Science and Communication Engineering. He served as a Technical Program Co-Chair for ATC (2013, 2014), NAFOSTED-NICS (2014, 2015, 2016), REV-ECIT 2015, ComManTel (2014, 2015), and SigComTel 2017. His research interests include wireless communications and information theory, with current emphasis on MIMO systems, cooperative and cognitive communications, physical-layer security, and energy harvesting.

Minh Trung Tran, Vietnam Television, Vietnam

Tran Minh Trung received his B.Sc. degree from the University of Natural Sciences, Vietnam, in 1998. He is currently working for the Vietnam Television station in the southern region. He is interested in television technology and its applications in everyday life.

Published

30-08-2022

How to Cite

[1]
H. P. Nguyen, N. Q. B. Vo, and M. T. Tran, “Create the Video Subtitles Based on Voice Recognition Technology: Test for Some Programs at VTV”, JTE, vol. 17, no. Special Issue 02, pp. 38–48, Aug. 2022.
