Vol. 13 No. 1 (2022): The Bibliographic Control in the Digital Ecosystem
Articles

Annif and Finto AI: Developing and Implementing Automated Subject Indexing

Osma Suominen
National Library of Finland
Juho Inkinen
National Library of Finland
Mona Lehtinen
National Library of Finland

Published 2022-01-13

Keywords

  • Automated subject indexing
  • Artificial intelligence
  • Machine learning

How to Cite

Suominen, Osma, Juho Inkinen, and Mona Lehtinen. 2022. “Annif and Finto AI: Developing and Implementing Automated Subject Indexing”. JLIS.It 13 (1):265-82. https://doi.org/10.4403/jlis.it-12740.

Abstract

Manually indexing documents for subject-based access is a labour-intensive process that can be automated using AI technology. Algorithms for text classification must be trained and tested with examples of indexed documents, which can be obtained from existing bibliographic databases and digital collections.

The National Library of Finland has created Annif, an open source toolkit for automated subject indexing and classification. Annif is multilingual, independent of the indexing vocabulary, and modular. It integrates many text classification algorithms, including Maui, fastText, Omikuji, and a neural network model based on TensorFlow. The best results can often be obtained by combining several algorithms. Many document corpora have been used for training and evaluating Annif. Finding the algorithms and configurations that give the best quality is an ongoing effort.
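
The idea of combining several algorithms can be illustrated with a minimal sketch. It is not Annif's actual code: it assumes each algorithm returns a score between 0 and 1 per subject, and it merges them with a weighted average. The function name `combine_suggestions`, the backend names, and the example scores are all hypothetical.

```python
from collections import defaultdict


def combine_suggestions(per_backend, weights=None, limit=5):
    """Merge per-backend subject scores with a weighted average.

    per_backend maps a backend name to a {subject: score} dict.
    A subject that a backend did not suggest contributes 0 for that backend.
    """
    if weights is None:
        # Assumption: every backend is trusted equally unless told otherwise.
        weights = {name: 1.0 for name in per_backend}
    total_weight = sum(weights[name] for name in per_backend)

    combined = defaultdict(float)
    for name, suggestions in per_backend.items():
        for subject, score in suggestions.items():
            combined[subject] += weights[name] * score / total_weight

    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical scores from two differently behaving backends.
example = {
    "lexical": {"libraries": 0.8, "indexing": 0.4},
    "associative": {"libraries": 0.6, "machine learning": 0.5},
}
top = combine_suggestions(example)
for subject, score in top:
    print(f"{score:.2f}  {subject}")
```

A subject proposed by both backends ends up ranked above subjects proposed by only one, which is the intuition behind combining algorithms with different strengths.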

In May 2020, we launched Finto AI, a service for automated subject indexing based on Annif. It provides a simple Web form for obtaining subject suggestions for text. The functionality is also available as a REST API. Many document repositories, as well as the cataloguing system for electronic publications at the National Library of Finland, use it to integrate semi-automated subject indexing into their metadata workflows. In the future, we plan to extend Annif with more algorithms and new functionality, and to integrate Finto AI with other metadata management workflows.
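
A client for such a REST API might look roughly like the sketch below. The abstract does not describe the API, so the base URL, the project identifier `yso-en`, the `/projects/{id}/suggest` path, the form fields (`text`, `limit`, `threshold`), and the response shape are all assumptions to be checked against the service's own API documentation. The sketch only builds the request and parses a response; it does not contact any server.

```python
import json
import urllib.parse
import urllib.request

# Assumed values, not taken from the article.
BASE_URL = "https://ai.finto.fi/v1"
PROJECT_ID = "yso-en"


def build_suggest_request(text, limit=10, threshold=0.0,
                          base_url=BASE_URL, project_id=PROJECT_ID):
    """Build (but do not send) a form-encoded POST request for suggestions."""
    url = f"{base_url}/projects/{project_id}/suggest"
    body = urllib.parse.urlencode(
        {"text": text, "limit": limit, "threshold": threshold}
    ).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Accept": "application/json"},
    )


def parse_suggestions(payload):
    """Extract (label, score) pairs from an assumed JSON response body."""
    results = json.loads(payload).get("results", [])
    return [(item["label"], item["score"]) for item in results]


request = build_suggest_request("Automated subject indexing in libraries", limit=3)
# To actually call the service, one would pass the request to
# urllib.request.urlopen(request) and feed the response body to parse_suggestions.
```

Keeping request construction separate from sending makes it easy to plug the same call into a repository or cataloguing workflow and to test it without network access.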
