Paper details

Poster Session

Session coordinator: Filip Vojtášek, Albertina icome Praha s.r.o., Czech Republic

Where: 28. 5. 2009, 16.20 - 17.05, Vencovsky Aula

Effects of Start URLs in Focused Web Crawling

Author: Ari Pirkola, University of Tampere, Finland

Co-authors:

Tuomas Talvensaari, University of Tampere, Finland

Abstract:

Web crawling refers to the process of gathering data from the Web. Focused crawlers are programs that selectively download Web documents (pages), restricting the scope of crawling to a predefined domain or topic. The downloaded documents can be indexed for a domain-specific search engine or a digital library. In this paper we describe the focused crawling technique, review relevant literature, and report novel experimental results. Crawling is often started with URLs that point to the pages of central universities, research institutions, and other organizations in North America and Europe. In the experiments we investigated, first, how strongly this central region of the Web is connected to three other large geographical regions of the Web: Australia (top-level domain .au), China (.cn), and five South American countries (.ar, .br, .cl, .mx, and .uy). Test topics were selected from the domains of genomics and genetics, which are typical scientific fields. We found that two focused crawling processes, one started from the central region and the other from the Australia / China / South America region, overlap only to a small extent, identifying mainly different relevant documents. Document relevance was assessed (1) by a human judge and (2) by assigning probability scores to documents using a search engine. Second, we investigated the coverage (number) of relevant documents obtained for different focused crawling processes started with URLs from the four different geographical regions. The results showed that all regions considered in this study are good starting points for focused crawling in the domains of genetics and genomics, since each of them yielded a high total coverage. As genomics and genetics are typical scientific fields, we assume that the obtained results are generalizable to other scientific domains. We discuss the implications of the observed results for the selection of a crawling strategy in scientific focused crawling tasks.
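
To make the idea of comparing crawls started from different regions concrete, here is a minimal sketch. It is not the authors' crawler: the toy link graph, the keyword-based relevance score, the threshold, and the use of Jaccard overlap are all assumptions made purely for illustration. It runs a best-first focused crawl from two different start URLs and measures how much the sets of relevant pages overlap.

```python
import heapq

# Hypothetical toy Web: URL -> (page text, outgoing links). Not data from the paper.
TOY_WEB = {
    "http://a.edu/": ("genomics research group",
                      ["http://a.edu/genes", "http://a.edu/sports"]),
    "http://a.edu/genes": ("gene sequencing and genetics",
                           ["http://b.edu.au/genome"]),
    "http://a.edu/sports": ("campus football results", ["http://a.edu/tickets"]),
    "http://a.edu/tickets": ("buy tickets", []),
    "http://b.edu.au/": ("genetics department",
                         ["http://b.edu.au/genome", "http://b.edu.au/dna"]),
    "http://b.edu.au/genome": ("genome mapping project", []),
    "http://b.edu.au/dna": ("dna and genetics teaching", []),
}

# Assumed topic vocabulary; a real system would use a trained classifier or a search engine.
TOPIC_TERMS = {"genomics", "genetics", "gene", "genome", "dna"}


def relevance(text):
    """Fraction of words that are topic terms (a crude stand-in for a relevance score)."""
    words = text.lower().split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0


def focused_crawl(start_urls, threshold=0.2, max_pages=100):
    """Best-first crawl that expands links only from pages judged relevant."""
    frontier = [(-1.0, url) for url in start_urls]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(start_urls)
    relevant = set()
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in TOY_WEB:
            continue  # unknown URL in the toy graph: nothing to fetch
        text, links = TOY_WEB[url]
        fetched += 1
        score = relevance(text)
        if score < threshold:
            continue  # off-topic page: do not follow its links
        relevant.add(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return relevant


def overlap(a, b):
    """Jaccard overlap of two sets of URLs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


central = focused_crawl(["http://a.edu/"])
other = focused_crawl(["http://b.edu.au/"])
print(sorted(central))
print(sorted(other))
print(overlap(central, other))
```

In this toy graph the two crawls share only one relevant page, so most of what each finds is missed by the other. The choice of overlap measure and relevance threshold would strongly affect such numbers in a real experiment.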

About author:

Dr. Ari Pirkola (http://www.uta.fi/~liarpi) received his PhD in Information Studies from the University of Tampere, Finland, in 1999. Since then, he has served as a researcher and teacher in the Department of Information Studies at the University of Tampere. He is currently working as a Finnish Academy research fellow. His research areas are information retrieval (IR), in particular cross-language and multilingual information retrieval, language technology applications in IR, Web crawling and Web IR, and genomics IR. Pirkola has authored over 60 scholarly publications, most of which have been published in leading international conferences and journals. He is a reviewer for several international journals and conferences and a board member of the National Language Technology Graduate School and the journal Informaatiotutkimus.


Other papers in this session:

TECH Subject Gateway - Integrated Access to Electronic Information Sources Not Only for Engineers

Author: Alena Brůžková, National Technical Library, Czech Republic

Co-authors:

Jitka Hladká, Andrea Kučerová, National Technical Library, Czech Republic

Infogram: New Information Education Platform

Author: Eva Dohnálková, Czech University of Life Sciences Prague, Czech Republic

Co-authors:

Hana Landová, Ph.D., Czech University of Life Sciences Prague - Study and Information Centre / Charles University in Prague - Faculty of Arts - Institute of Information Studies and Librarianship, Czech Republic

Evaluation of Scientific Work through Education in University Library “Svetozar Markovic” in Belgrade

Author: Aleksandra Popovic, University of Belgrade - University Library "Svetozar Markovic", Serbia

Co-authors:

Stela Filipi-Matutinovic, University of Belgrade - University Library "Svetozar Markovic", Serbia

Advanced Services for Information Resources Using the SFX Technology

Author: Ondřej Fabián, Tomas Bata University in Zlín, Czech Republic

Co-authors:

Lukáš Budínský, Tomas Bata University in Zlín, Czech Republic

Evaluation of Scientific Performance According to Citation Indexes in Serbia

Author: Stela Filipi-Matutinovic, University of Belgrade - University Library "Svetozar Markovic", Serbia

Co-authors:

Aleksandra Popovic, University of Belgrade - University Library "Svetozar Markovic", Serbia

Realization of the Project "Creating a Network with the Integration of Scientific, Academic and Special Libraries, Including Their Modernization."

Author: Zuzana Halienová, Slovak National Library, Slovak Republic

Electronic Guide to European Union Information Sources: Hand-Book for Librarians and Information Specialists

Author: Jitka Hradilová, Charles University in Prague - Faculty of Arts - Institute of Information Studies and Librarianship, Czech Republic

Co-authors:

Patrick Overy, University of Exeter, United Kingdom

Classic and Modern Ontology Use in Creation of Medical Algorithms Knowledge Bases

Author: Adéla Jarolímková, CESNET, Czech Republic

Co-authors:

Petr Lesný, Kryštof Slabý, Jan Vejvalka, Faculty Hospital Motol, Czech Republic
