№2, 2025

SARA: SYNTHETIC AZERBAIJANI RECORDING ALIGNMENT – A UNIFIED TTS FRAMEWORK TRAINED ON SYNTHETIC AND REAL SPEECH

Mirakram Aghalarov, Kavsar Huseynova, Elvin Mammadov, Gurban Guliyev

Speech synthesis is a crucial part of modern assistive AI (Artificial Intelligence). While this field has seen rapid development, solutions for low-resource languages lag behind due to a lack of data. This paper presents a novel way to overcome this problem for the Azerbaijani language and proposes a model that can generate fluent Azerbaijani speech. Our solution uses self-supervised learning with a multilingual Text-to-Speech (TTS) model to pretrain the VITS (Variational Inference for Text to Speech) architecture from scratch. An additional fine-tuning dataset was obtained by segmenting voiced news articles, together with their texts, using Whisper. The datasets obtained for pre-training and fine-tuning have been released as open source for further development and other use cases. The model was evaluated with Mean Opinion Scores (MOS) from both LLMs (Large Language Models) and human judges, which confirmed the efficiency of the approach and the usability of the proposed model for further applications. The model is available at: https://huggingface.co/BHOSAI/SARA_TTS (pp. 75-80).
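The fine-tuning pipeline described above can be sketched roughly as follows: Whisper's transcription of a news recording yields timestamped segments, and each segment's [start, end] window is cut from the audio and paired with its text. This is a minimal illustrative sketch, not the authors' actual code; the function name `cut_segments` is hypothetical, and in practice the `segments` list would come from `whisper.transcribe()["segments"]` rather than being constructed by hand.

```python
def cut_segments(audio, sample_rate, segments):
    """Pair each timestamped transcript segment with its audio slice.

    `segments` follows the shape of whisper.transcribe()["segments"]:
    dicts with "start"/"end" in seconds and a "text" field.
    """
    pairs = []
    for seg in segments:
        # Convert second offsets to sample indices and slice the waveform.
        lo = int(seg["start"] * sample_rate)
        hi = int(seg["end"] * sample_rate)
        pairs.append((seg["text"].strip(), audio[lo:hi]))
    return pairs

# Toy example: 3 seconds of silence at 16 kHz, two hand-written segments
# standing in for real Whisper output on an Azerbaijani news recording.
audio = [0.0] * (3 * 16000)
segments = [
    {"start": 0.0, "end": 1.5, "text": " Salam dünya."},
    {"start": 1.5, "end": 3.0, "text": " Xəbərlər başlayır."},
]
pairs = cut_segments(audio, 16000, segments)
print(len(pairs), len(pairs[0][1]))  # 2 24000
```

Each resulting (text, waveform) pair is then a ready training example for TTS fine-tuning.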

Keywords: Speech Synthesis, Azerbaijani TTS, Deep Learning, Voice Processing, Artificial Intelligence.
References
  • Khanam, F., Munmun, F. A., Ritu, N. A., Saha, A. K., & Firoz, M. (2022). Text to speech synthesis: A systematic review, deep learning based architecture and future research direction. Journal of Advances in Information Technology, 13(5), 398–412. https://doi.org/10.12720/jait.13.5.398-412
  • Kim, J., Kong, J., & Son, J. (2021). Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the International Conference on Machine Learning (ICML) (pp. 5530-5540). https://doi.org/10.48550/arXiv.2106.06103
  • Kim, J., Kim, S., Kong, J., & Yoon, S. (2020). Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. Proceedings of the 34th International Conference on Neural Information Processing Systems (pp. 8067–8077). https://doi.org/10.48550/arXiv.2005.11129
  • Kim, M., Jeong, M., Choi, B. J., Ahn, S., Lee, J. Y., & Kim, N. S. (2022). Transfer learning framework for low-resource text-to-speech using a large-scale unlabeled speech corpus. Proceedings of Interspeech 2022 (pp. 788–792). https://doi.org/10.21437/Interspeech.2022-225
  • Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-Zarandi, M., Baevski, A., Adi, Y., Zhang, X., Hsu, W.-N., Conneau, A., & Auli, M. (2024). Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25, 1–52. https://doi.org/10.48550/arXiv.2305.13516
  • Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2019). FastSpeech: Fast, robust and controllable text to speech. Proceedings of the 33rd International Conference on Neural Information Processing Systems (pp. 3171-3180). https://doi.org/10.48550/arXiv.1905.09263
  • Valizada, A., Jafarova, S., Sultanov, E., & Rustamov, S. (2021). Development and evaluation of speech synthesis system based on deep learning models. Symmetry, 13(5), 819. https://doi.org/10.3390/sym13050819
  • Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. Proceedings of Interspeech 2017 (pp. 4006–4010). https://doi.org/10.21437/Interspeech.2017-1452
  • Zeinalov, T., Sen, B., & Aslanova, F. (2024). Text-to-Speech in Azerbaijani language via transfer learning in a low resource environment. Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP) (pp. 434-438). https://aclanthology.org/2024.icnlsp-1.44/