A NEW FRAMEWORK FOR DOMAIN-SPECIFIC HIDDEN WEB CRAWLING BASED ON DATA EXTRACTION TECHNIQUES
ABSTRACT: The World Wide Web continues to grow at an exponential rate, which makes exploiting all of its useful information a standing challenge. Search engines such as Google crawl and index a large amount of information but ignore valuable data that represent about 80% of the content on the Web. This portion of the Web, called the hidden web (HW), is "hidden" in databases behind search interfaces. In this paper, a framework for a HW crawler is proposed to crawl and extract hidden web pages. Two unique features of our framework are 1) a classification phase for grouping HW and Publicly Indexable Web (PIW) pages into distinct classes, which allows the crawler to perform well in both the domain-specific and random modes of crawling, and 2) the capability of dealing with both single-attribute and multi-attribute databases. Three novel algorithms are proposed in the framework: one for collecting web pages, one for identifying relevant forms, and one for extracting labels. The effectiveness of the proposed algorithms is evaluated through experiments using real web sites, and the preliminary results are very promising. For instance, one of these algorithms proves to be accurate (over 99% precision and 100% recall). Keywords: Hidden Web, crawling, HTML forms, web information extraction, search engines.
1. INTRODUCTION
Accurate information retrieval requires the ability not only to retrieve the static documents that exist on the web, the portion called the Publicly Indexable Web (PIW), but also to retrieve information that is available only as a response to a dynamically issued query through the search interface of a database. Traditional search engines cannot handle such interfaces and ignore the contents of these resources, since they only take advantage of the static link structure of the web to "crawl" and index web pages. Recent studies [17, 20, 21] have noted that a tremendous amount of content on the Web is dynamic, and this dynamism takes a number of different forms.
Based on these studies, Lawrence and Giles [20] estimated that close to 80% of the content on the Web is dynamically generated, and that this number is continuing to increase. However, little of this dynamic content is being crawled and indexed, so it is important to extract content from the hidden web, also known as the "deep web" or "invisible web": the portion of the Web that is hidden behind search forms in large searchable databases and is available only by querying HTML forms. Pages in the hidden Web are not directly available for crawling through hyperlinks; they are dynamically generated in response to queries submitted via search forms [5, 10]. Other studies [11] estimated that the amount of information "hidden" behind such query interfaces outnumbers the documents of the "ordinary", "traditional" web by two orders of magnitude. The hidden Web thus refers to the part of the Web that remains unavailable to standard crawlers because it is hidden behind search interfaces (i.e. it is not indexed by the major Web search engines). The hidden Web contains large amounts of high-quality information buried in dynamically generated sites, and search engines that use traditional crawlers never find this information. For these reasons, research is in progress on how to "get into" the hidden web.
Crawling the hidden Web is a very challenging problem for two fundamental reasons. First is the issue of scale: a recent study [17] estimates that the size of the content available through such searchable online databases is about 400 to 500 times larger than the size of the "static Web." As a result, it does not seem prudent to attempt comprehensive coverage of the hidden Web. Second, access to these databases is provided only through restricted search interfaces intended for use by humans; hence, "training" a crawler to use such a restricted interface to extract relevant content is a non-trivial problem. This paper proposes a framework for a HW crawler that is capable of performing well in both the domain-specific and random modes of crawling and that can deal with single-attribute and multi-attribute databases.
1.1 Modeling HTML Forms:
The task of harvesting information from the hidden Web depends on the way the hidden Web crawler deals with pages containing forms. The data model we use for representing single and multiple HTML forms is discussed below.
An HTML form is a section of a document containing normal content, markup, special elements called controls (checkboxes, radio buttons, menus, etc.), and labels on those controls [8]. An HTML form is embedded in its web page between a pair of "begin" and "end" tags (<form> and </form>).
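To make this model concrete, the following sketch shows one possible in-memory representation of a form together with a parser that collects the forms on a page. It is a minimal illustration assuming Python and the standard html.parser module; the names (FormControl, FormModel, FormExtractor) and their fields are illustrative choices, not part of the original framework, and associating each control with its label is left to the label-extraction algorithm proposed in the framework.

from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional

@dataclass
class FormControl:
    """One control (input, select, textarea) and, when known, its label."""
    name: str
    ctype: str                                        # e.g. "text", "checkbox", "select"
    label: Optional[str] = None                       # filled in by label extraction
    domain: List[str] = field(default_factory=list)   # finite options, if any

@dataclass
class FormModel:
    """A single HTML form: submission information plus its controls."""
    action: str
    method: str                                       # "get" or "post"
    controls: List[FormControl] = field(default_factory=list)

class FormExtractor(HTMLParser):
    """Collects every <form> ... </form> section of a page into FormModel objects."""
    def __init__(self):
        super().__init__()
        self.forms: List[FormModel] = []
        self._current: Optional[FormModel] = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._current = FormModel((a.get("action") or ""),
                                      (a.get("method") or "get").lower())
        elif tag in ("input", "select", "textarea") and self._current is not None:
            self._current.controls.append(
                FormControl((a.get("name") or ""), (a.get("type") or tag)))

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

A page would be processed with extractor = FormExtractor(); extractor.feed(page_html), after which extractor.forms holds one FormModel per form found on the page.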
The data in this repository are matched against the extracted form labels in phase 6. There are many possible mechanisms that could be used to populate the sample data repository:
- First, we can manually initialize the repository with values for the labels that the crawler is most likely to encounter. For example, when configuring our crawler for the 'Title of books' task, we supplied the repository with a list of relevant book titles and associated that list with labels such as "Title", "Subject", "Book Name", etc.
- Second, we can extract words from the Publicly Indexable Web (PIW) using our parser and store them in the sample data repository; these words are then used to teach the crawler how to fill in the forms.
- Finally, form elements with finite domains are a useful source of (label, data) pairs. When the form parser processes a form in phase 4, the crawler can glean such pairs from a finite-domain element and add them to the repository, so that they may be used when visiting a different form.
The Data Extractor Engine is the basic module in this phase; it is responsible for populating the sample data repository with this data.
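The third mechanism can be sketched as follows, under the assumption that each parsed form exposes its finite-domain elements as (label, options) pairs and that the repository is a simple mapping from a label to the set of sample values seen for it; both are simplifications made for illustration rather than the framework's actual storage layout.

from typing import Dict, Iterable, Set, Tuple

# Repository: a label (e.g. "Format") mapped to the sample values seen for it.
SampleDataRepository = Dict[str, Set[str]]

def glean_finite_domains(forms: Iterable[Iterable[Tuple[str, Iterable[str]]]],
                         repo: SampleDataRepository) -> SampleDataRepository:
    """Add (label, value) pairs gleaned from finite-domain elements
    (select menus, radio groups, checkboxes) to the repository so they
    can be reused when visiting a different form."""
    for form in forms:
        for label, options in form:
            for value in options:
                value = value.strip()
                if value:                       # skip empty placeholder options
                    repo.setdefault(label.lower(), set()).add(value)
    return repo

# Example: a book-search form with a finite "Format" menu (hypothetical values).
repo: SampleDataRepository = {}
glean_finite_domains([[("Format", ["", "Paperback", "Hardcover", "Audio CD"])]], repo)
print(repo)   # -> {'format': {'Paperback', 'Hardcover', 'Audio CD'}} (set order may vary)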
4.1.6 Phase 6: Automatic query generation
The main function of this phase is to automatically generate the queries that will be used to fill in single-attribute (S-A) and multi-attribute (M-A) forms. The responsibility of this phase is therefore divided in two directions: one for filling in single-attribute forms and the other for multi-attribute forms. The basic module used in this phase is the Label Matcher, which is responsible for assigning the (M-A) form labels extracted in phase 4 to the data in the sample data repository generated in phase 5, by matching the form labels to the columns of the data table, as shown in Figure 9. The basic idea is that a query word submitted through a form element will probably reappear in the corresponding field of the returned data objects, since web sites usually try their best to return the most relevant data to their users. Thus we try to find mappings between form labels and repository attributes by looking for matches among them; if no matches are found, the form is disregarded. To match form labels to data in the repository, an approximate string matching algorithm must be employed. There is a large body of work on the design and analysis of such
string matching algorithms [4, 23]. We plan to adapt these algorithms to improve the label-matching results, as discussed in our future work. For (S-A) forms, the query generation method is different: we generate three types of query resources, Query Non-Sense (Qns), Query Dictionary (Qdic), and Query Extracted Words (Qexw), where Qns is a set of meaningless "nonsense" keywords, Qdic is a set of keywords selected at random from a dictionary, and Qexw is a set of the most relevant words extracted from each class in phase 5. For every (S-A) form in a specific class, keywords are then taken from the three different query resources. We save the server responses to the Qns, Qdic, and Qexw keywords in three distinct sets in phase 8 and assess each resource's relevance by comparing these sets. Recent work [3, 15] addresses the problem of automatically generating queries for single-attribute forms (keyword interfaces) and examining the results.
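A minimal version of the Label Matcher can be sketched with the approximate string matching available in Python's standard difflib module; the similarity threshold of 0.6 and the dictionary-based repository layout are illustrative assumptions, not values taken from the paper.

from difflib import SequenceMatcher
from typing import Dict, List, Optional, Set

def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity between two labels (0..1)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_labels(form_labels: List[str],
                 repository: Dict[str, Set[str]],
                 threshold: float = 0.6) -> Optional[Dict[str, str]]:
    """Map each (M-A) form label to its most similar repository attribute.
    Returns None if no label can be matched, in which case the form is
    disregarded, as described above."""
    mapping: Dict[str, str] = {}
    for label in form_labels:
        best_attr, best_score = None, 0.0
        for attr in repository:
            score = similarity(label, attr)
            if score > best_score:
                best_attr, best_score = attr, score
        if best_attr is not None and best_score >= threshold:
            mapping[label] = best_attr
    return mapping or None

# Example: "Title" and "Author" are mapped to repository attributes; "Price" has none.
repo = {"title": {"Harry Potter", "Oliver Twist"}, "author": {"Julia E", "S. Smith"}}
print(match_labels(["Title", "Author", "Price"], repo))   # {'Title': 'title', 'Author': 'author'}

Pure edit-distance similarity cannot recognize synonyms such as "Book Name" versus "Title"; handling such cases is where the more elaborate matching algorithms of [4, 23] come in.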
[Figure 9 illustrates the Label Matcher mapping form labels (Title, Author, Format) to the columns of the sample data table, e.g. Title = {Harry Potter, Oliver Twist, ...}, Author = {Julia E, S. Smith, ...}, Format = {paperback, ...}, while data without labels remains unassigned.]
Figure 9: Label Matcher job and data representation
4.1.7 Phase 7: Form processing
This phase has two main functions: 1) filling in the (S-A and M-A) forms indexed in phase 4 with the queries generated in phases 5 and 6, and 2) submitting the filled-in forms. The form-filling task consists of finding a mapping between form fields and repository attributes. For every
generated query, a further visit is scheduled for the crawler. The parameters (the set of field names and values) are stored, and a new item is added to the queue of URLs that must be visited by the crawler. In this phase the crawler executes its second goal, which is to submit the scheduled queries. To accomplish this it needs an extra capability: sending parameters in HTTP requests using both the GET and POST methods. We are currently investigating possible approaches for modifying the form submission process to include both GET and POST, rather than limiting submission to GET as most recent crawlers do [14, 1].
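As an illustration of the submission step, the sketch below sends a scheduled query with either GET or POST, as declared by the form's markup. It assumes the third-party requests library, whereas the actual implementation may rely on different components, and the URL and field names in the example are hypothetical.

from typing import Dict
import requests

def submit_form(action_url: str, method: str, fields: Dict[str, str],
                timeout: float = 30.0) -> requests.Response:
    """Submit a filled-in form using the method declared in its markup.
    GET sends the (field, value) pairs in the query string; POST sends
    them in the request body."""
    if method.lower() == "post":
        return requests.post(action_url, data=fields, timeout=timeout)
    return requests.get(action_url, params=fields, timeout=timeout)

# Example: one scheduled query taken from the URL queue (values are illustrative).
# response = submit_form("http://example.com/search", "get",
#                        {"title": "Harry Potter", "format": "Paperback"})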
4.1.8 Phase 8: Response page analysis
The aim of this phase is to analyze response pages (the server's response to the crawler's query). The main module in this phase is the response analyzer, which is responsible for automatically distinguishing between a response page that contains search results and one that contains an error message reporting that no matches were found for the submitted query. The idea is to use this information to tune the crawler's value assignment strategy.
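One simple way to realize the response analyzer is to look for common "no matches" phrases in the returned page, as in the sketch below; the phrase list and the crude tag stripping are illustrative assumptions rather than the framework's actual heuristics.

import re

# Phrases that commonly signal an empty result page (illustrative list).
NO_MATCH_PATTERNS = [
    r"no\s+(results|matches|records|items)\s+(were\s+)?found",
    r"your\s+search\s+returned\s+no",
    r"0\s+results",
]

def is_error_page(html: str) -> bool:
    """Return True if the response page looks like a 'no matches' page,
    False if it appears to contain search results."""
    text = re.sub(r"<[^>]+>", " ", html).lower()     # crude tag stripping
    return any(re.search(p, text) for p in NO_MATCH_PATTERNS)

# Example: the outcome is fed back to tune the crawler's value assignment strategy.
print(is_error_page("<html><body>No results were found for 'qzxv'.</body></html>"))  # True
print(is_error_page("<html><body><ul><li>Harry Potter</li></ul></body></html>"))      # False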
5. EXPERIMENTAL RESULTS AND ANALYSIS
A number of experiments were performed to validate the effectiveness of the overall proposed framework. This was done by implementing the proposed algorithms within the most important phases (collecting web pages and finding relevant forms). We note that improving the performance of each phase separately increases the overall performance of the framework. In our experiments we employed the following components: Tidy (a library for HTML cleaning), and HtmlAgilityPack and MSHTML (HTML parsers); the communication channel with the Internet was about 256 Mbps. Pages took 1.9 seconds on average to parse.
5.1 Collecting web pages
In order to support the tests, the crawler was started up to collect all types of pages and was kept running until thousands of these pages had been collected and stored, so that pages containing forms could be analyzed as described in section 4.1.1.
When starting up the crawler, as shown in Figure 10, the proposed algorithm is validated by observing the crawler and parser tasks. We examined five categories of web sites (book shopping, job advertisements, car advertisements, software, and e-commerce) and collected 17 web sites for each category from https://www.wendangku.net/doc/183436328.html; we used those 85 web sites as seeds (entry points) for the crawler. For each entry point, we performed a breadth-first crawl of the Web. The crawler visited 9833 pages (an average of 1966 pages per category and 117.3 pages per site), and 398 of these pages contained forms.
Figure 10: Start window of the crawler (testing phase 1)
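The collection procedure can be sketched as a standard breadth-first crawl from the seed entry points, as in the following Python sketch that uses only the standard library; the page limit, the regular expressions for links and forms, and the error handling are simplifying assumptions that stand in for the framework's actual parser.

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)
FORM_RE = re.compile(r"<form[\s>]", re.IGNORECASE)

def bfs_collect(seeds, max_pages=100):
    """Breadth-first crawl from the seed entry points; return the URLs of
    visited pages that contain at least one <form>."""
    queue = deque(seeds)
    seen = set(seeds)
    pages_with_forms = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                      # unreachable page: skip it
        fetched += 1
        if FORM_RE.search(html):
            pages_with_forms.append(url)
        for href in LINK_RE.findall(html):
            nxt = urljoin(url, href)
            if nxt.startswith("http") and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return pages_with_forms

In this setting, bfs_collect would be invoked with the seed entry points, and the pages it returns feed the form analysis described in section 4.1.1.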
Table 2 shows a sample of detailed information about the 85 Web sites, including the average number of terms per page (TR/P), forms per page (F/P), and tags per page (TG/P).
Table 2: Web data set summary (sample of 7 sites per category)

Category      Site   TR/P   F/P   TG/P
Book            1     421    13    865
Book            2     526     8    796
Book            3      85     4   1036
Book            4     463    15    935
Book            5     320     6    179
Book            6     789     5    359
Book            7     345     3    895
Job             1     501     2    923
Job             2     471     3    885
Job             3     329     6    561
Job             4     609     8   1020
Job             5     406     7    365
Job             6     481    11    473
Job             7     756     5    321
Car             1     115     2    465
Car             2     490     8    471
Car             3     865     4    358
Car             4     361    15   1009
Car             5     881     6    827
Car             6      98     9    901
Car             7     701     7    384
Software        1     394     7    296
Software        2     583     5    741
Software        3     397     9    394
Software        4     422     3   1101
Software        5     741    11    905
Software        6     424     2    227
Software        7     801     9    710
E-Commerce      1     886     3   1106
E-Commerce      2     701     6    465
E-Commerce      3     881     6    827
E-Commerce      4     865     2    981
E-Commerce      5     779     4    684
E-Commerce      6     386    16    994
E-Commerce      7     361    15   1009
To evaluate our extracted information, we rely on two standard measures common in the information retrieval field: precision and recall [3]. In our context, let x = the number of pages that have been crawled successfully, y = the number of pages retrieved (correctly and not correctly), and z = the total number of pages that should have been retrieved; precision is then x/y and recall is x/z.