SMS DATASET FOR MULTI-CLASS CLASSIFICATION OF HAM, SPAM, AND SMISHING IN AZERBAIJANI LANGUAGE

Vusal Shahbazov

DOI: https://doi.org/10.25045/jpit.v17.i1.04

№1, 2026

vusa.013@gmail.com

The evolving landscape of digital communication calls for specialized datasets, particularly for under-resourced languages. This study introduces the Azerbaijani SMS Collection, a novel multi-class dataset for classifying Azerbaijani SMS messages as legitimate (ham), promotional (spam), or malicious phishing (smishing). To demonstrate the dataset's utility, a range of machine learning and deep learning models was evaluated on it. The traditional machine learning methods were Logistic Regression, Linear SVM, Passive Aggressive Classifier, Multinomial Naive Bayes, Decision Tree, Random Forest, and K-Nearest Neighbors; the deep learning architectures were Convolutional Neural Networks and Recurrent Neural Networks. Convolutional Neural Networks achieved the strongest overall performance, with an accuracy of 0.9393 and an F1-score of 0.8909, while the Passive Aggressive Classifier was the best-performing traditional algorithm, with an accuracy of 0.9338 and an F1-score of 0.8821. This research provides a foundation for future work on Azerbaijani text classification and contributes to efforts to improve SMS security and combat digital fraud in the region (pp. 32-39).

Keywords: Azerbaijani SMS, SMS dataset, Dataset creation, Multi-class classification, Smishing, Azerbaijani language, Data collection
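
The abstract reports both accuracy and F1-score for a three-class task. As a rough illustration of how the two metrics can diverge, the following minimal sketch computes accuracy and a macro-averaged F1 over ham, spam, and smishing labels. Macro averaging is an assumption here, since the abstract does not state which averaging was used, and the toy labels are invented for illustration rather than drawn from the dataset.

```python
from collections import Counter

# Class names follow the paper's title; the tuple itself is illustrative.
LABELS = ("ham", "spam", "smishing")


def accuracy(y_true, y_pred):
    """Fraction of messages whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up labels: 6 ham, 2 spam, 2 smishing.
y_true = ["ham"] * 6 + ["spam"] * 2 + ["smishing"] * 2
y_pred = ["ham"] * 6 + ["spam", "ham"] + ["smishing", "spam"]

print(round(accuracy(y_true, y_pred), 4))  # 0.8
print(round(macro_f1(y_true, y_pred), 4))  # 0.6966
```

On this toy input, the errors fall entirely on the two minority classes, so macro F1 lands well below accuracy. A gap of that kind is consistent with the reported F1-scores sitting below the accuracies, although the abstract does not give the class distribution.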
