№2, 2021


Fargana J. Abdullayeva, Sabira S. Ojagverdiyeva

The article develops a machine-learning approach to identifying vulgarisms in web content. As the volume of harmful content on web pages grows, protection from it becomes increasingly vital. Exposure to vulgarisms (indecent words, jargon, slang, etc.) on the internet has a negative effect on the psychology of users, especially children and teenagers. To identify vulgar words, word combinations, and expressions in both social media (Twitter, Facebook, etc.) and online media, new automatic text identification methods need to be developed. The paper proposes an approach for detecting vulgarisms using N-gram+TF-IDF features. Numerical vectors are generated by applying the N-gram+TF-IDF feature extraction method to predefined vulgar words. The generated numerical vectors are passed as input to the Naive Bayes algorithm. In experiments conducted on different features, classification based on unigram+TF-IDF features gives better results than the other features tested. The proposed approach to identifying vulgarisms is important for developing the conversation culture and communication skills of children and teenagers. It can help protect children from harmful online content and can be used in child safety centers and education systems (pp. 89-98).

Keywords: vulgarisms, N-grams, TF-IDF, Naive Bayes, Child safety on the Internet.
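
The pipeline described in the abstract (N-gram extraction, TF-IDF weighting, Naive Bayes classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the smoothed IDF formula, the Laplace smoothing constant `alpha`, the helper names, and the toy training texts (which use the placeholder tokens `badword` and `rudeword`) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def ngrams(text, n=1):
    """Split text on whitespace and return its word n-grams (n=1 gives unigrams)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def fit_idf(docs, n=1):
    """Compute a smoothed IDF per n-gram (assumed formula: ln((1+N)/(1+df)) + 1)."""
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc, n)))
    total = len(docs)
    return {t: math.log((1 + total) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(text, idf, n=1):
    """Map a text to a sparse TF-IDF vector; n-grams unseen in training are dropped."""
    counts = Counter(ngrams(text, n))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

class NaiveBayes:
    """Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing."""

    def fit(self, vectors, labels, vocab, alpha=1.0):
        self.classes = sorted(set(labels))
        class_counts = Counter(labels)
        # Sum the TF-IDF weight of every term within each class.
        sums = {c: defaultdict(float) for c in self.classes}
        for vec, y in zip(vectors, labels):
            for t, w in vec.items():
                sums[y][t] += w
        self.log_prior = {c: math.log(class_counts[c] / len(labels))
                          for c in self.classes}
        self.log_like = {}
        for c in self.classes:
            denom = sum(sums[c].values()) + alpha * len(vocab)
            self.log_like[c] = {t: math.log((sums[c][t] + alpha) / denom)
                                for t in vocab}
        return self

    def predict(self, vec):
        # Score each class by log prior plus weighted log likelihoods.
        scores = {c: self.log_prior[c]
                     + sum(w * self.log_like[c][t] for t, w in vec.items())
                  for c in self.classes}
        return max(scores, key=scores.get)

# Toy data with placeholder tokens standing in for real vulgar words.
train_texts = ["you are a badword", "what a rudeword person",
               "have a nice day", "see you at school"]
train_labels = ["vulgar", "vulgar", "clean", "clean"]

idf = fit_idf(train_texts, n=1)
train_vectors = [tfidf(t, idf, n=1) for t in train_texts]
model = NaiveBayes().fit(train_vectors, train_labels, vocab=list(idf))

print(model.predict(tfidf("such a badword", idf)))
print(model.predict(tfidf("nice day at school", idf)))
```

Passing `n=2` or `n=3` to `fit_idf` and `tfidf` switches the features to bigrams or trigrams, which is one way the different feature sets mentioned in the abstract could be compared.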