№1, 2026

SMS DATASET FOR MULTI-CLASS CLASSIFICATION OF HAM, SPAM, AND SMISHING IN AZERBAIJANI LANGUAGE

Vusal Shahbazov

The evolving landscape of digital communication calls for specialized datasets, particularly in under-resourced languages. This study introduces the Azerbaijani SMS Collection, a novel multi-class dataset designed for classifying Azerbaijani SMS messages into three categories: legitimate (ham), promotional (spam), and malicious phishing (smishing). To demonstrate the dataset's utility, it was evaluated with a broad range of machine learning and deep learning models. The traditional machine learning methods were Logistic Regression, Linear SVM, Passive Aggressive Classifier, Multinomial Naive Bayes, Decision Tree, Random Forest, and K-Nearest Neighbors; the deep learning architectures were Convolutional Neural Networks and Recurrent Neural Networks. Among all evaluated models, Convolutional Neural Networks achieved the strongest overall performance, with an accuracy of 0.9393 and an F1-score of 0.8909, while the Passive Aggressive Classifier was the best-performing traditional algorithm, with an accuracy of 0.9338 and an F1-score of 0.8821. This research provides a foundation for future studies in Azerbaijani text classification and contributes to efforts to improve SMS communication security and combat digital fraud in the region (pp. 32-39).

Keywords: Azerbaijani SMS, SMS dataset, Dataset creation, Multi-class classification, Smishing, Azerbaijani language, Data collection
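
The abstract reports both accuracy and F1-score for each model. As a minimal sketch of how the two metrics can diverge on a three-class task, the snippet below computes them in plain Python. The averaging scheme (macro), the label names, and the toy labels are all assumptions made for illustration; the abstract does not state how F1 was averaged, and none of the values come from the Azerbaijani SMS Collection.

```python
# Illustrative only: label names and macro averaging are assumptions,
# not details confirmed by the abstract.
LABELS = ("ham", "spam", "smishing")


def accuracy(y_true, y_pred):
    """Fraction of messages whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members contributes 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up toy labels: 8 ham, 1 spam, 1 smishing (missed by the classifier).
y_true = ["ham"] * 8 + ["spam", "smishing"]
y_pred = ["ham"] * 8 + ["spam", "ham"]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

In this toy case a single missed smishing message barely moves accuracy but pulls the macro-averaged F1 down sharply, because each class counts equally regardless of its size. That is one reason to read both numbers together when classes are imbalanced.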