№2, 2025

SARA: SYNTHETIC AZERBAIJANI RECORDING ALIGNMENT – A UNIFIED TTS FRAMEWORK TRAINED ON SYNTHETIC AND REAL SPEECH

Mirakram Aghalarov, Kavsar Huseynova, Elvin Mammadov, Gurban Guliyev

Speech synthesis is a crucial part of modern assistive AI (Artificial Intelligence). While this field has seen rapid development, solutions for low-resource languages lag behind due to a lack of data. This paper presents a novel way to overcome this problem for the Azerbaijani language and proposes a model that can generate fluent Azerbaijani speech. Our solution uses self-supervised learning with a multilingual Text-to-Speech (TTS) model to pretrain the VITS (Variational Inference for Text to Speech) architecture from scratch. An additional fine-tuning dataset was obtained by segmenting voiced news articles, together with their texts, using Whisper. The datasets obtained for pre-training and fine-tuning have been released as open source for further development and other use cases. The model was evaluated with Mean Opinion Scores (MOS) from both LLMs (Large Language Models) and human judges, which confirmed the efficiency of the approach and the usability of the proposed model for further applications. The model is available at: https://huggingface.co/BHOSAI/SARA_TTS (pp. 75-80).
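The fine-tuning pipeline described above can be sketched roughly as follows: Whisper's transcription of a news recording yields timestamped segments, and each segment's [start, end] window is cut from the audio and paired with its text. This is a minimal illustrative sketch, not the authors' actual code; the function name `cut_segments` is hypothetical, and in practice the `segments` list would come from `whisper.transcribe()["segments"]` rather than being constructed by hand.

```python
def cut_segments(audio, sample_rate, segments):
    """Pair each timestamped transcript segment with its audio slice.

    `segments` follows the shape of whisper.transcribe()["segments"]:
    dicts with "start"/"end" in seconds and a "text" field.
    """
    pairs = []
    for seg in segments:
        # Convert second offsets to sample indices and slice the waveform.
        lo = int(seg["start"] * sample_rate)
        hi = int(seg["end"] * sample_rate)
        pairs.append((seg["text"].strip(), audio[lo:hi]))
    return pairs

# Toy example: 3 seconds of silence at 16 kHz, two hand-written segments
# standing in for real Whisper output on an Azerbaijani news recording.
audio = [0.0] * (3 * 16000)
segments = [
    {"start": 0.0, "end": 1.5, "text": " Salam dünya."},
    {"start": 1.5, "end": 3.0, "text": " Xəbərlər başlayır."},
]
pairs = cut_segments(audio, 16000, segments)
print(len(pairs), len(pairs[0][1]))  # 2 24000
```

Each resulting (text, waveform) pair is then a ready training example for TTS fine-tuning.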

Keywords: Speech Synthesis, Azerbaijani TTS, Deep Learning, Voice Processing, Artificial Intelligence.
References
  • Khanam, F., Munmun, F. A., Ritu, N. A., Saha, A. K., & Firoz, M. (2022). Text to speech synthesis: A systematic review, deep learning based architecture and future research direction. Journal of Advances in Information Technology, 13(5), 398–412. https://doi.org/10.12720/jait.13.5.398-412
  • Kim, J., Kong, J., & Son, J. (2021). Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the International Conference on Machine Learning (ICML) (pp. 5530-5540). https://doi.org/10.48550/arXiv.2106.06103
  • Kim, J., Kim, S., Kong, J., & Yoon, S. (2020). Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. Proceedings of the 34th International Conference on Neural Information Processing Systems (pp. 8067–8077). https://doi.org/10.48550/arXiv.2005.11129
  • Kim, M., Jeong, M., Choi, B. J., Ahn, S., Lee, J. Y., & Kim, N. S. (2022). Transfer learning framework for low-resource text-to-speech using a large-scale unlabeled speech corpus. Proceedings of Interspeech 2022 (pp. 788–792). https://doi.org/10.21437/Interspeech.2022-225
  • Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-Zarandi, M., Baevski, A., Adi, Y., Zhang, X., Hsu, W.-N., Conneau, A., & Auli, M. (2024). Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25, 1–52. https://doi.org/10.48550/arXiv.2305.13516
  • Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2019). FastSpeech: Fast, robust and controllable text to speech. Proceedings of the 33rd International Conference on Neural Information Processing Systems (pp. 3171-3180). https://doi.org/10.48550/arXiv.1905.09263
  • Valizada, A., Jafarova, S., Sultanov, E., & Rustamov, S. (2021). Development and evaluation of speech synthesis system based on deep learning models. Symmetry, 13(5), 819. https://doi.org/10.3390/sym13050819
  • Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. Proceedings of Interspeech 2017 (pp. 4006–4010). https://doi.org/10.21437/Interspeech.2017-1452
  • Zeinalov, T., Sen, B., & Aslanova, F. (2024). Text-to-Speech in Azerbaijani language via transfer learning in a low resource environment. Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP) (pp. 434-438). https://aclanthology.org/2024.icnlsp-1.44/