Paper details
Poster Session
Session coordinator: Filip Vojtášek, Albertina icome Praha s.r.o., Czech Republic
Where: 28. 5. 2009, 16.20 - 17.05, Vencovsky Aula
Effects of Start URLs in Focused Web Crawling
Author: Ari Pirkola, University of Tampere, Finland
Co-authors:
Tuomas Talvensaari, University of Tampere, Finland
Abstract:
Web crawling refers to the process of gathering data from the Web. Focused crawlers are programs that selectively download Web documents (pages), restricting the scope of crawling to a pre-defined domain or topic. The downloaded documents can be indexed for a domain-specific search engine or a digital library. In this paper we describe the focused crawling technique, review relevant literature, and report novel experimental results. Crawling is often started with URLs that point to the pages of central universities, research institutions, and other organizations in North America and Europe. In the experiments we investigated, first, how strongly this central region of the Web is connected to three other large geographical regions of the Web: Australia (top-level domain .au), China (.cn), and five South American countries (.ar, .br, .cl, .mx, and .uy). Test topics were selected from the domains of genomics and genetics, which are typical scientific fields. We found that two focused crawling processes, one started from the central region and the other from the region of Australia / China / South America, overlap only to a small extent, identifying mainly different relevant documents. Document relevance was assessed (1) by a human judge and (2) by assigning probability scores to documents using a search engine. Second, we investigated the coverage (number) of relevant documents obtained for different focused crawling processes started with URLs from the four different geographical regions. The results showed that all regions considered in this study are good starting points for focused crawling in the domains of genetics and genomics, since each of them yielded a high total coverage. As genomics and genetics are typical scientific fields, we assume that the obtained results are generalizable to other scientific domains.
We discuss what implications the observed results have for the selection of crawling strategy in scientific focused crawling tasks.
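To make the notions of focused crawling, overlap, and coverage used in the abstract more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It runs a best-first crawl over a small in-memory link graph (no network access) from two different start URLs and compares the relevant documents found. The URLs, the page texts, the keyword-fraction relevance score, the threshold, and the use of Jaccard overlap are all assumptions made for illustration; the abstract states only that relevance was assessed by a human judge and by search engine probability scores.

```python
import heapq


def focused_crawl(seeds, links, text, topic_terms, max_pages=100, threshold=0.1):
    """Best-first crawl over an in-memory link graph.

    seeds: start URLs; links: url -> list of outlinked URLs;
    text: url -> page text; topic_terms: set of lowercase topic words.
    Returns the set of fetched URLs whose relevance reaches the threshold.
    """
    def relevance(url):
        # Hypothetical scoring: fraction of page words that are topic terms.
        words = text.get(url, "").lower().split()
        if not words:
            return 0.0
        return sum(w in topic_terms for w in words) / len(words)

    # heapq is a min-heap, so priorities are negated; seeds get top priority.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = set()
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        fetched += 1
        score = relevance(url)
        if score >= threshold:
            relevant.add(url)
        # Outlinks inherit the parent's score as their crawl priority.
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                heapq.heappush(frontier, (-score, out))
    return relevant


def overlap(a, b):
    """Jaccard overlap between two sets of relevant documents."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy Web graph with made-up URLs: one "central" seed and one .au seed.
links = {
    "http://a.example.edu/": ["http://a.example.edu/genes",
                              "http://shared.example.org/genome"],
    "http://b.example.au/": ["http://b.example.au/dna",
                             "http://shared.example.org/genome"],
}
text = {
    "http://a.example.edu/": "university genetics department",
    "http://a.example.edu/genes": "gene expression and genome research",
    "http://b.example.au/": "institute of genomics",
    "http://b.example.au/dna": "genome sequencing of dna",
    "http://shared.example.org/genome": "genome database",
}
topic = {"genetics", "genomics", "gene", "genome", "dna"}

central = focused_crawl(["http://a.example.edu/"], links, text, topic)
other = focused_crawl(["http://b.example.au/"], links, text, topic)

print("overlap:", overlap(central, other))
print("total coverage:", len(central | other))
```

In this toy graph the two crawls share only one relevant page, so their overlap is small while their combined coverage is larger than that of either crawl alone, which mirrors the kind of comparison the abstract describes.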
About author:
Dr. Ari Pirkola (http://www.uta.fi/~liarpi) received his PhD in 1999 in Information Studies at the University of Tampere, Finland. Since then, he has served as a researcher and teacher in the Department of Information Studies at the University of Tampere. Currently he is working as a Finnish Academy research fellow. His research areas are information retrieval (IR), in particular cross-language and multilingual information retrieval, language technology applications in IR, Web crawling and Web IR, and genomics IR. Pirkola has authored over 60 scholarly publications, most of which are published in leading international conferences and journals. He is a reviewer for several international journals and conferences and a board member of the National Language Technology Graduate School and the journal Informaatiotutkimus.
Other papers in this session:
TECH Subject Gateway - Integrated Access to Electronic Information Sources Not Only for Engineers
Author: Alena Brůžková, National Technical Library, Czech Republic
Co-authors:
Jitka Hladká, Andrea Kučerová, National Technical Library, Czech Republic
Infogram: New Information Education Platform
Author: Eva Dohnálková, Czech University of Life Sciences Prague, Czech Republic
Co-authors:
Hana Landová, Ph.D., Czech University of Life Sciences Prague - Study and Information Centre / Charles University in Prague - Faculty of Arts - Institute of Information Studies and Librarianship, Czech Republic
Evaluation of Scientific Work through Education in University Library "Svetozar Markovic" in Belgrade
Author: Aleksandra Popovic, University of Belgrade - University Library "Svetozar Markovic", Serbia
Co-authors:
Stela Filipi-Matutinovic, University of Belgrade - University Library "Svetozar Markovic", Serbia
Advanced Services for Information Resources Using the SFX Technology
Author: Ondřej Fabián, Tomas Bata University in Zlín, Czech Republic
Co-authors:
Lukáš Budínský, Tomas Bata University in Zlín, Czech Republic
Evaluation of Scientific Performance According to Citation Indexes in Serbia
Author: Stela Filipi-Matutinovic, University of Belgrade - University Library "Svetozar Markovic", Serbia
Co-authors:
Aleksandra Popovic, University of Belgrade - University Library "Svetozar Markovic", Serbia
Realization of the Project "Creating a Network with the Integration of Scientific, Academic and Special Libraries, Including Their Modernization."
Author: Zuzana Halienová, Slovak National Library, Slovak Republic
Electronic Guide to European Union Information Sources: Hand-Book for Librarians and Information Specialists
Author: Jitka Hradilová, Charles University in Prague - Faculty of Arts - Institute of Information Studies and Librarianship, Czech Republic
Co-authors:
Patrick Overy, University of Exeter, United Kingdom
Classic and Modern Ontology Use in Creation of Medical Algorithms Knowledge Bases
Author: Adéla Jarolímková, CESNET, Czech Republic
Co-authors:
Petr Lesný, Kryštof Slabý, Jan Vejvalka, Faculty Hospital Motol, Czech Republic