№1, 2021

COMPARATIVE ANALYSIS OF METHODS OF AUTOMATIC TERM EXTRACTION FROM TEXTS

Afruz M. Gurbanova

This article presents an applied approach to automatic term extraction from a corpus for a particular subject area. Term extraction typically aims to determine the basic vocabulary of a particular field. Unlike traditional manual term extraction, automatic extraction uses computerized tools to simplify this time-consuming task by automating the preliminary identification of term candidates. The growing volume of information that must be processed in many areas (lexicography, terminology, information retrieval, etc.) makes the automatic selection of terms and keywords especially relevant. Automatic Term Extraction is a well-established discipline within Natural Language Processing, and many different approaches and systems have been developed. Various sub-issues of automatic term extraction are presented, namely corpus collection, unity, definition of terms and variants, and system evaluation. Five methods for automatic term extraction are studied and comparatively analyzed. An experiment is conducted on a corpus of articles published in the journals "Problems of Information Technology" and "Problems of the Information Society". A joint expert and formal assessment methodology is proposed, and the results of the comparative assessment of the automatic term extraction methods are presented (pp.55-69).

Keywords: automatic term extraction, Natural Language Processing, corpus collection, linguistic approaches, statistical approaches, termhood.
DOI: 10.25045/jpit.v12.i1.05