A NEW FRAMEWORK FOR DOMAIN-SPECIFIC HIDDEN WEB CRAWLING BASED ON DATA EXTRACTION TECHNIQUES
ABSTRACT: The World Wide Web continues to grow at an exponential rate, which makes exploiting all of its useful information a standing challenge. Search engines such as Google crawl and index a large amount of information but ignore valuable data that represent about 80% of the content on the Web. This portion of the Web is called the hidden web (HW); its pages are "hidden" in databases behind search interfaces. In this paper, a framework for a HW crawler is proposed to crawl and extract hidden web pages. Two unique features of our framework are 1) a classification phase that groups HW and Publicly Indexable Web (PIW) pages into distinct classes, so that the crawler performs well in both the domain-specific and the random mode of crawling, and 2) the capability of dealing with single-attribute and multi-attribute databases. Three novel algorithms are proposed in the framework: one for collecting web pages, one for identifying relevant forms, and one for extracting labels. The effectiveness of the proposed algorithms is evaluated through experiments using real web sites. The preliminary results are very promising; for instance, one of these algorithms proves to be accurate (over 99% precision and 100% recall).
Keywords: Hidden Web, Crawling, HTML Forms, web information extraction, search engines.
1. INTRODUCTION
Accurate information retrieval requires the ability not only to retrieve static documents that exist on the web (this portion is called the Publicly Indexable Web, PIW), but also to retrieve information that is available as a response to a dynamically issued query to the search interface of a database. Traditional search engines cannot handle such interfaces and ignore the contents of these resources, since they only take advantage of the static link structure of the web to "crawl" and index web pages. Recent studies [17, 20, 21] have noted that a tremendous amount of content on the Web is dynamic, and this dynamism takes a number of different forms. Based on these studies, Lawrence and Giles [20] estimated that close to 80% of the content on the Web is dynamically generated, and that this number is continuing to increase. However, little of this dynamic content is being crawled and indexed, so it is important to extract content from the hidden web, also known as the "deep web" or "invisible web": the portion of the Web that is hidden behind search forms in large searchable databases and available only through querying HTML forms. Pages in the hidden Web are not directly available for crawling through hyperlinks; they are dynamically generated in response to queries submitted via search forms [5, 10]. Other studies [11] estimated that the amount of information "hidden" behind such query interfaces outnumbers the documents of the "ordinary", "traditional" web by two orders of magnitude. The hidden web refers to the part of the Web that remains unavailable to standard crawlers because it is hidden behind search interfaces (i.e. not indexed by the major Web search engines). The hidden Web contains large amounts of high-quality information buried in dynamically generated sites, and search engines that use traditional crawlers never find this information. For these reasons, research is in progress on how to "get into" the hidden web. Crawling the hidden Web is a very challenging problem for two fundamental reasons. First is the issue of scale: a recent study [17] estimates that the size of the content available through such searchable online databases is about 400 to 500 times larger than the size of the "static Web." As a result, it does not seem prudent to attempt comprehensive coverage of the hidden Web. Second, access to these databases is provided only through restricted search interfaces intended for use by humans. Hence, "training" a crawler to use this restricted interface to extract relevant content is a non-trivial problem. This paper proposes a framework for a HW crawler that is capable of performing well in both the domain-specific and the random mode of crawling. This crawler can deal with single-attribute and multi-attribute databases.
1.1 Modeling HTML Forms:
The task of harvesting information from the hidden Web depends on the way the hidden web crawler deals with pages containing forms. The data model we use for representing single and multiple HTML forms is discussed here. An HTML form is a section of a document containing normal content, markup, special elements called controls (checkboxes, radio buttons, menus, etc.), and labels on those controls [8]. An HTML form is embedded in its web page by a pair of begin and end FORM tags. Each form contains a set of form fields and the URL of a server-side program (e.g. a CGI program) that processes the form fields' input values and returns a set of result pages. There are two essential attributes of the FORM element that specify how the values submitted with the form are processed: the value of the action attribute corresponds to the URL of a form-processing agent (server-side program), and the HTTP method used to submit the form is defined by the method attribute. We consider any HTML form as a tuple F = (Name, U, method, Ck), where Name is the form name, U is the absolute Web location of the agent, method is GET or POST, and Ck is a set of controls. Each control is a tuple Ck = (Name, type, MaxLength, ValueClass, default), where Name is the control name, type ∈ {text, checkbox, select, radio, menu, hidden, submit}, MaxLength defines the maximum number of characters for text or password fields, ValueClass defines the space of possible control values, and default is one instance from ValueClass or empty. The crawler must analyze the form and extract relevant information automatically. This is not an easy task, and surely the most difficult step is to extract the fields' labels, because there is generally no formal relationship between a field and its label in the HTML code. For example, the label for a text field can be placed above it, separated by a BR tag, it can be beside it, or it can be inserted inside table cells. As shown in figure 1(a), there is a label on the left side of a field and another one on the right side, and sometimes above it. All these pieces of data must be extracted in order to get past HTML forms and fetch the result pages. Users generally fill a form by modifying its controls (form fields) by entering text, selecting menu items, etc., before submitting the form to an agent for processing (e.g. a Web server or a mail server). As shown in Figure 1(a), a search page1 contains a form for searching new or used books. The form consists of six text, six select, and three checkbox input fields, one submit button, and one reset button; the user must provide an entry in at least one input field, and the more fields are filled, the more focused the results will be.
1 https://www.wendangku.net/doc/183436328.html,/search.html; Powell's Books is one of the world's great bookstores.

The submission of this form generates a web page containing the results of the query, as shown in figure 1(b). A crawler must perform a similar filling process by selecting suitable values from the domain of each finite form element or by generating queries for infinite form elements (e.g. text fields); our goal is to automate this process.
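As a concrete reading of the form model F = (Name, U, method, Ck), a minimal sketch is given below. It is illustrative only: the Python representation, the attribute names and the example book-search form are our own and are not taken from the paper.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Control:
    # Ck = (Name, type, MaxLength, ValueClass, default)
    name: str
    type: str                                   # text, checkbox, select, radio, menu, hidden or submit
    max_length: Optional[int] = None            # only meaningful for text/password fields
    value_class: List[str] = field(default_factory=list)   # space of possible values (finite domains)
    default: Optional[str] = None               # one instance from value_class, or empty

@dataclass
class Form:
    # F = (Name, U, method, Ck)
    name: str
    action_url: str                             # absolute Web location of the server-side agent
    method: str                                 # "GET" or "POST"
    controls: List[Control] = field(default_factory=list)

# Hypothetical book-search form expressed in this model
book_form = Form(
    name="booksearch",
    action_url="http://example.com/cgi-bin/search",
    method="GET",
    controls=[
        Control(name="title", type="text", max_length=100),
        Control(name="format", type="select",
                value_class=["Hardcover", "Paperback"], default="Paperback"),
        Control(name="submit", type="submit", default="Search"),
    ],
)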
Figure 1(a): A labeled form interface of https://www.wendangku.net/doc/183436328.html (text form fields, with labels to the left and to the right, and a submit button)
Figure 1(b): Result page of https://www.wendangku.net/doc/183436328.html (showing the result matches)

2. RELATED WORK
The notion of a Hidden (or Deep, or Invisible) Web has been a subject of interest for a number of years, and a number of researchers have discussed the problems of crawling the contents of hidden Web databases. The problem of automatically discovering and interacting with hidden Web search forms is examined in [22, 2, 6, 1]. Raghavan and Garcia-Molina [22] propose HiWE, a task-specific hidden-Web crawler; the main focus of this work is to learn Hidden-Web query interfaces. Lin and Chen's solution [14] aims to build a catalogue of small search engines located in sites and to choose which ones are most likely to answer the query; however, forms with more than one text field are not treated. Wang and Lochovsky [13] describe a system called DeLa, which reconstructs (part of) a "hidden" back-end web database, and it uses the HiWE crawler. There are other approaches that focus on data extraction. Lage et al. [12] claim to automatically generate agents to collect hidden web pages by filling HTML forms. Liddle et al. [19] perform a study on how valuable information can be obtained behind web forms, but do not include a crawler to fetch them. Barbosa and Freire [15] experimentally evaluate methods for building multi-keyword queries that can return a large fraction of a document collection. Ntoulas et al. [3] differ from the previous studies in that they provide a theoretical framework for analyzing the process of generating queries for a database, the core problem of hidden Web crawling. There also exists a large body of work studying how to identify the most relevant database given a user query [16, catg1, catg2]; this body of work is often referred to as meta-searching or the database selection problem over the Hidden Web. Based on these recent hidden web crawlers, we propose a framework which combines some of the ideas presented in them in order to enhance crawling performance.

3. CRAWLING HIDDEN WEB PROBLEM
The problem is that current-day crawlers retrieve only the Publicly Indexable Web (PIW), ignoring large amounts of high-quality information 'hidden' behind search forms; this hidden Web is about 500 times as large as the PIW. The solution is to build a hidden Web crawler that can crawl and extract content from hidden databases, enabling indexing, analysis, and mining of hidden Web content; the content extracted by such crawlers can also be used to categorize and classify the hidden databases. Figure 2 shows the idea of the hidden web crawler. Figure 2(a) illustrates the sequence of steps that take place when a user uses a search form to submit queries to a hidden database. Figure 2(b) illustrates the same interaction, with the crawler now playing the role of the human-browser combination. Our hidden web crawler's goal is to automate the process of viewing, filling in and submitting the HW forms and analyzing the response pages.
Figure 2 (a): User form interaction
Figure 2 (b): Crawler form interaction

4. PROPOSED FRAMEWORK
As shown in figure 3, the proposed framework for hidden web crawling is divided into eight phases. Each phase has its own function and its own algorithms, which allows our framework to provide a wide variety of features. Two unique features that unify our crawler are 1) the classification phase for grouping HW and PIW pages into distinct classes, so that the crawler performs well in both the domain-specific and the random mode of crawling, and 2) the capability of dealing with both single-attribute and multi-attribute databases, unlike most hidden web crawlers, which ignore one of them. To support the entire scope of hidden Web crawling, our framework is designed as a suite of algorithms and heuristics distributed over the eight proposed phases.
4.1 System Layout
Similar to the most recent hidden web crawlers, our framework consists of eight phases, each with its own functions and algorithms. The many modules in this architecture are reflected in these eight phases; the function of each phase and its corresponding algorithm are outlined below.
4.1.1 Phase 1: Collecting web pages
This phase controls the entire crawling process and consists of two basic functional modules:
- Crawler: at start-up the crawler is initialized with a seed set of URLs. It decides which link to visit next, checks whether a URL has already been visited (so that the same links are not crawled again), opens network connections to retrieve URLs from the web, and then hands the downloaded pages over to the HTML parser.
- HTML parser: it takes the HTML pages that have been crawled and indexed and parses them looking for two main tags. First, any HREF tags that refer to links pointing to other sites are extracted and added to the list of URLs to be visited by the crawler later.

Figure 3: The proposed Hidden web crawler Framework

Second, any FORM tags that indicate the presence of HTML forms in the page; the pages containing these forms are stored in a distinct dataset, "FormUrl", for subsequent indexing. Thus, for each HTML page retrieved from the Web, a logical tree representation of the structure of the page based on the Document Object Model (DOM) [7] is constructed. The tree is then analyzed to find the nodes corresponding to FORM elements. If one or more such elements exist, a pruned tree is constructed for each FORM element; for this construction we use only the subtree below the FORM element and the nodes on the path from the FORM to the root. The proposed algorithm for this phase is shown in figure 4, which presents the main tasks of the two modules.
[Figure 4 is a flowchart of the phase-1 algorithm. Crawler module: get the next URL from the URL seed (wait 10 seconds if no URL is available), skip URLs that have already been visited, validate the URL (dumping invalid ones), and open a connection to each valid URL. HTML parser module: check each retrieved page for the two main tags; HREF tags yield new URLs that are added to the seed, and FORM tags cause the page to be stored in FormURL and handed to phase 2, after which the URL is marked as visited.]
Figure 4: The proposed algorithm for phase 1 (collecting web pages)
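To make the two modules concrete, the following is a minimal Python sketch of phase 1. It is only an approximation of the algorithm in figure 4: it uses the requests and BeautifulSoup libraries as stand-ins for the Tidy/HtmlAgilityPack/MSHTML components mentioned in section 5, and it omits the URL-validation details and the 10-second wait.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def collect_pages(seed_urls, max_pages=1000):
    """Phase 1 sketch: crawl from a URL seed, collect links and pages with forms."""
    frontier = deque(seed_urls)        # URL seed, visited in breadth-first order
    visited = set()                    # so the same link is not crawled twice
    form_urls = []                     # the "FormUrl" dataset: pages containing FORM tags

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                   # dump invalid or unreachable URLs

        soup = BeautifulSoup(page.text, "html.parser")

        # HREF tags: extract links and schedule them for a later visit
        for a in soup.find_all("a", href=True):
            frontier.append(urljoin(url, a["href"]))

        # FORM tags: store the page for phase 2 (relevant-form detection)
        if soup.find("form") is not None:
            form_urls.append(url)

    return visited, form_urls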

4.1.2 Phase 2: Finding relevant forms
Of the URLs with forms indexed in phase one, only a few actually contain relevant forms. The main aim of this phase is to locate any form capable of returning data-rich pages. The basic module in this phase is the Form Analyzer, which is responsible for identifying pages that contain forms capable of returning data-rich pages (hidden web forms) and for eliminating pages containing registration, purchase, or login forms. In practice, different types of pages are found; we distinguish five classes of pages containing forms, as depicted in figure 5:
- Type A represents all crawled pages (with and without forms).
- Type B, the largest subclass, contains the pages with any kind of form; these pages are distinguished from all crawled pages by the algorithm described in phase 1.
- Type C is a proper subclass of type B and contains the pages with forms that have at least one input control of text type (text forms); this type is recognized by the algorithm shown below in figure 6. Registration and login forms fall in this class, and such forms are irrelevant to hidden web crawling.
- Type D is a proper subclass of type C and contains the pages with queryable forms; after our algorithm discards all login and registration forms, this class contains the forms that are both textual and queryable.
- Type E, the ultimate target of the discovery process, is a subclass of type D and contains the pages with hidden web forms.
The important fact here is that any queryable form is a textual form but not vice versa, and any HW form is a queryable and textual form but not vice versa. Unlike text and queryable forms, which can be recognized in this phase by the algorithm shown below, the recognition of HW forms requires further form analysis and processing; thus, hidden web forms are recognized in phase 7, which is discussed later.

[Figure 5 is a flowchart of the detection stages: starting from phase 1, form detection narrows type A pages down to type B pages; text-form detection narrows type B to type C; queryable-form detection narrows type C to type D; and hidden web form recognition (phase 7) finally yields type E pages.]
Figure 5: Stages of detecting hidden web forms
4.1.3 Phase 3: Classification phase
The main goal of this phase is to group the Hidden Web (HW) and Publicly Indexable Web (PIW) pages into distinct classes, making our crawler more specific. In order to increase the specificity of the crawler, these pages are classified and then stored in specialized databases according to their category, so that when a user submits a query to our crawler, it can effectively determine which searchable databases are most likely to contain the relevant information the user is looking for. This phase therefore gives the crawler the capability to perform well in both the domain-specific and the random mode of crawling. The basic module in this phase, which performs the phase task, is the Classifier engine: its main function is to categorize the web pages collected in phase 1 (URLs visited by the crawler) and phase 2 (URLs with relevant forms) into a set of pre-defined categories.

[Figure 6 is a flowchart of the phase-2 algorithm. For each URL with a form coming from phase 1, the name, agent URL and method of every form Fj are stored in FormSubDB. Every control Ck of Fj is then examined: if its type is text or textarea, the form is added to TxtFormDB together with its default value and MaxLength, the form is parsed, and its labels are extracted (phase 4); controls of other types (label, checkbox, radio, select, etc.) only contribute their words and labels to WordsDB, and forms without any text control are rejected. Every form in TxtFormDB is then checked for queryability: a form is rejected if any control has type password, has a label equal to password, or is a text field with MaxLength <= 6; otherwise the form is added to QFormDB and passed to phase 7, where HW forms are recognized.]
Figure 6: The proposed algorithm for phase 2 (relevant form detection)
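The following is a small Python sketch, using BeautifulSoup, of the textual and queryable checks described above. It is an interpretation of figure 6 rather than the paper's exact algorithm; the password heuristics and the MaxLength threshold of 6 are taken from that description.

from bs4 import BeautifulSoup

def is_textual(form):
    """Type C check: the form has at least one text-type input or a textarea."""
    if form.find("textarea") is not None:
        return True
    return any(inp.get("type", "text").lower() == "text" for inp in form.find_all("input"))

def is_queryable(form):
    """Type D check: discard login- or registration-style forms."""
    for inp in form.find_all("input"):
        input_type = inp.get("type", "text").lower()
        label_hint = (inp.get("name", "") + " " + inp.get("id", "")).lower()
        if input_type == "password" or "password" in label_hint:
            return False                               # login / registration form
        maxlength = inp.get("maxlength")
        if input_type == "text" and maxlength and maxlength.isdigit() and int(maxlength) <= 6:
            return False                               # text box too short to hold a real query
    return True

def relevant_forms(html):
    """Return the forms in a page that are both textual and queryable."""
    soup = BeautifulSoup(html, "html.parser")
    return [form for form in soup.find_all("form") if is_textual(form) and is_queryable(form)]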
The technique for classifying pages over a set of categories starts by training a rule-based document classifier over those categories. Given a set of pre-classified training documents [16], this tool returns a classifier that might consist of rules such as:
Computers IF operating system
Computers IF graphics windows
Hobbies IF baseball

The first rule indicates that if a document contains the term operating system it should be classified in the "Computers" category. A document should also be classified into that category if it has the words graphics and windows. Similarly, if a document has the word baseball, it is a "Hobbies" document. The outcome of categorizing the PIW pages helps in the form-filling and label-matching operations: when the documents are classified and indexed in separate databases according to their classes, form analysis and processing become easier.
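A minimal sketch of how such keyword rules could be applied is given below; it simplifies the trained classifier of [16] to plain word matching, and the rule set shown is just the example above.

RULES = [
    # (category, set of terms that must all occur in the document)
    ("Computers", {"operating", "system"}),
    ("Computers", {"graphics", "windows"}),
    ("Hobbies",   {"baseball"}),
]

def classify(document: str) -> set:
    """Return every category whose rule terms all occur in the document."""
    words = set(document.lower().split())
    return {category for category, terms in RULES if terms <= words}

print(classify("the windows graphics subsystem of this operating system"))   # {'Computers'}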
4.1.4 Phase 4: Form Analyzing
This phase has two main functions: parsing hidden web forms and extracting the labels of the HTML forms. It consists of two functional modules:
1. Form parser: it is responsible for parsing hidden web forms to check whether they are single-attribute or multi-attribute forms, so that our crawler can deal with all kinds of forms and is not limited to either single-attribute or multi-attribute forms. A textual database is a site that mainly contains plain-text documents, such as the PubMed database; since these documents do not usually have a well-defined structure, most textual databases provide a simple search interface where users type a list of keywords in a single search box, as shown in Figure 7(a).
Figure 7(a): A single-attribute search interface
In contrast, a structured database often contains multi-attribute relational data (e.g., a book on the Amazon Web site may have the fields Author='J.K. Rowling', Title='Harry Potter' and ISBN='0590353403') and supports multi-attribute search interfaces, as shown in Figure 7(b).

Figure 7(b): A multi-attribute search interface (fields for Author, Title and ISBN, with a "Search now" button)
After the parser identifies the type of search interface, it indexes the forms in two distinct data structures, one for the single-attribute forms and the other for the multi-attribute forms, and hands the indexed forms over to the second module.
2. Label extractor module: it is responsible for extracting labels from the indexed forms. Although the HTML specification includes a label tag for declaring a label, it is rarely used, and therefore there is no explicit mechanism identifying which label is related to each field. Thus, we propose an algorithm that acquires field labels by determining which labels are visually adjacent to a form field. Two types of label extractor are used, one for single-attribute forms and one for multi-attribute forms, each with its own algorithm. In both cases we do not use expensive, complex parsing techniques, as in [22]; rather, a simple event-driven parser is used, and in the multi-attribute case three buffers are used, one for the text on the left-hand side, one for the text on the right-hand side and one for the text above the field, as shown in figure 8. The extracted labels are indexed in two separate data structures, one containing the labels of the single-attribute forms and the other those of the multi-attribute forms. Extracting the labels of the HTML forms (single- and multi-attribute) is a very important task, because these labels are the only clues we have to "learn" how to fill in the forms automatically.

[Figure 8 is a flowchart of the label extraction algorithm. Classified HW pages from phase 3 are passed to the form parser, which checks the form attributes and routes single-attribute (S-A) forms to S-A label extraction and multi-attribute (M-A) forms to M-A label extraction. For an S-A form, the single label adjacent to the lone textbox is taken. For an M-A form, the words of the form are scanned in document order and buffered into the left (Bl), right (Br) and above (Ba) buffers; on reaching an input, select or textarea tag, the buffered text is stored as the field's label and the buffers are cleared, with checkbox and radio controls taking their label from the input value. When a BR or TD tag or the end of the form is reached, the buffered text is flushed to the label file, and the process repeats for each remaining form and then for the next URL. The extracted labels feed query generation for S-A forms and the label matcher for M-A forms in phase 6.]
Figure 8: The proposed algorithm for the label extraction
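A simplified sketch of the adjacency heuristic for multi-attribute forms is shown below. It approximates the buffering scheme of figure 8 with a single "text seen since the last field" buffer and uses BeautifulSoup; the function name and the example form are illustrative, not taken from the paper.

from bs4 import BeautifulSoup, NavigableString, Tag

FIELD_TAGS = {"input", "select", "textarea"}

def extract_labels(form_html):
    """Pair each form field with the nearest preceding visible text (adjacency heuristic)."""
    soup = BeautifulSoup(form_html, "html.parser")
    form = soup.find("form")
    labels, buffer = {}, []            # buffer collects text seen since the last field

    for node in form.descendants:
        if isinstance(node, NavigableString):
            text = node.strip()
            if text:
                buffer.append(text)
        elif isinstance(node, Tag) and node.name in FIELD_TAGS:
            if node.get("type", "").lower() in {"submit", "reset", "hidden"}:
                continue
            name = node.get("name", node.get("id", ""))
            # the text accumulated to the left of / above the field is taken as its label
            labels[name] = " ".join(buffer) or None
            buffer = []
    return labels

form_html = """
<form action="/search">
  Author: <input type="text" name="author">
  Title:  <input type="text" name="title">
  <input type="submit" value="Search now">
</form>
"""
print(extract_labels(form_html))   # {'author': 'Author:', 'title': 'Title:'}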
4.1.5 Phase 5: Extracting Data from PIW
The main goal of this phase is to create a sample data repository containing organized data objects, in which the data instances are arranged into a table where rows represent data instances and columns represent attributes, so that each repository attribute can be mapped to a form field in phase 6 using the extracted form labels. There are several possible mechanisms for populating this sample data repository:
- First, we can manually initialize the repository with values for the labels that the crawler is most likely to encounter. For example, when configuring our crawler for the 'Title of books' task, we supplied the repository with a list of relevant book titles and associated that list with labels such as "Title", "Subject", "Book Name", etc.
- Second, we can extract the words in the Publicly Indexable Web (PIW) using our parser and store them in the sample data repository; these words are then used to "teach" the crawler to fill in the forms.
- Finally, we can use form elements with finite domains, as they are a useful source of (label, data) pairs. When parsing a form in phase 4, the crawler can glean such pairs from a finite-domain element and add them to the repository, so that they may be used when visiting a different form (a sketch of this is given after this list).
The Data Extractor Engine is the basic module in this phase and is responsible for populating the sample data repository with the data.
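The following is a minimal sketch of the third mechanism, gleaning (label, value) pairs from finite-domain elements such as select menus; the helper name, the label-mapping argument and the example form are our own assumptions.

from bs4 import BeautifulSoup

def glean_pairs(form_html, label_for_field):
    """Collect (label, value) pairs from finite-domain elements such as <select> menus.
    label_for_field maps a field name to its extracted label (the phase-4 output)."""
    soup = BeautifulSoup(form_html, "html.parser")
    repository = {}                                    # label -> list of known values
    for select in soup.find_all("select"):
        field_name = select.get("name", "")
        label = label_for_field.get(field_name, field_name)
        values = [opt.get("value") or opt.get_text(strip=True)
                  for opt in select.find_all("option")]
        repository.setdefault(label, []).extend(v for v in values if v)
    return repository

form_html = """
<form>Format:
  <select name="format">
    <option>Hardcover</option>
    <option>Paperback</option>
  </select>
</form>
"""
print(glean_pairs(form_html, {"format": "Format"}))   # {'Format': ['Hardcover', 'Paperback']}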
4.1.6 Phase 6: Automatic Query generation
The main function of this phase is to automatically generate the queries that will be used to fill in the single-attribute (S-A) and multi-attribute (M-A) forms. The responsibility of this phase is therefore divided into two directions, one for filling in single-attribute forms and the other for multi-attribute forms. The basic module used in this phase is the Label Matcher, which is responsible for assigning the (M-A) form labels extracted in phase 4 to the data in the sample data repository generated in phase 5, by matching the form labels to the columns of the table, as shown in figure 9. The basic idea is that a query word submitted through the form elements will probably reappear in the corresponding fields of the returned data objects, since web sites usually try their best to provide the most relevant data back to the users. Thus, we try to find mappings between form labels and repository attributes by looking for matches among them, and if no matches are found, the form is disregarded. To match form labels to data in the repository, an approximate string matching algorithm must be employed; there is a large body of work on the design and analysis of such string matching algorithms [4, 23], and we plan to adapt these algorithms to enhance label matching results, as discussed in our future work. For the (S-A) forms, the query generation method is different: we generate three types of query resources, Query Non Sense (Qns), Query Dictionary (Qdic), and Query Extracted Word (Qexw), where Qns is a set of "nonsense" keywords that have no meaning, Qdic is a set of keywords randomly selected from the dictionary, and Qexw is a set of the most relevant words extracted from each class in phase 5. For every (S-A) form in a specific class, some keywords are taken from each of the three querying resources. We save the server responses to the Qns, Qdic and Qexw keywords in three distinct sets in phase 8 and assess the relevance of each resource by comparing these sets. Recently, works [3, 15] have been published that address the problem of automatically generating queries to single-attribute forms (keyword interfaces) and examining the results.
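As an illustration of the three query resources, here is a small Python sketch; the function names, parameters and word lists are hypothetical and only show one plausible way to assemble Qns, Qdic and Qexw.

import random
import string

def nonsense_keywords(n=5, length=8):
    """Qns: random strings that should match nothing (used to sample the server's error page)."""
    return ["".join(random.choices(string.ascii_lowercase, k=length)) for _ in range(n)]

def dictionary_keywords(dictionary_words, n=5):
    """Qdic: keywords drawn at random from an ordinary dictionary word list."""
    return random.sample(dictionary_words, n)

def extracted_keywords(class_term_frequencies, n=5):
    """Qexw: the most frequent terms extracted from PIW pages of the target class (phase 5)."""
    ranked = sorted(class_term_frequencies.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]

# Example with illustrative inputs
print(nonsense_keywords(2))
print(dictionary_keywords(["river", "novel", "engine", "harvest", "paper"], 2))
print(extracted_keywords({"book": 40, "author": 25, "isbn": 10}, 2))   # ['book', 'author']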
[Figure 9 illustrates the label matcher's job: the form labels (Title, Author, Format) extracted from an M-A form are matched against unlabeled data instances in the sample repository (e.g. "Harry Potter" / "Julia E", "Oliver twist" / "S. Smith", "paperback"), producing a data table with labeled columns Title, Author and Format whose rows are the data instances.]
Figure 9: Label matcher job and data representation
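A minimal sketch of the label-matching step is shown below; it uses Python's difflib as a stand-in for the approximate string matching algorithms of [4, 23], and the threshold value and example labels are illustrative assumptions.

from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_labels(form_labels, repository_attributes, threshold=0.6):
    """Map each form label to its closest repository attribute; unmatched labels are dropped."""
    mapping = {}
    for label in form_labels:
        best = max(repository_attributes, key=lambda attr: similarity(label, attr))
        if similarity(label, best) >= threshold:
            mapping[label] = best
    return mapping

form_labels = ["Book Title", "Author name", "ISBN"]
repository_attributes = ["title", "author", "format"]
print(match_labels(form_labels, repository_attributes))
# e.g. {'Book Title': 'title', 'Author name': 'author'}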
4.1.7 Phase 7: Form processing
This phase has two main functions: 1) filling in the (S-A and M-A) forms indexed in phase 4 with the queries generated in phases 5 and 6, and 2) submitting the forms that have been filled. The form-filling task consists of finding a mapping between form fields and repository attributes. For every generated query, a further visit is scheduled for the crawler: the parameters (the set of field names and values) are stored and a new item is added to the queue of URLs that must be visited by the crawler. In this phase the crawler carries out its second goal, which is to submit the scheduled queries. To accomplish this it needs an extra feature: the capacity to send parameters in HTTP requests using both the GET and POST methods. We are currently investigating possible approaches for modifying the form submission process to include both GET and POST methods, rather than being limited to the GET method as most recent crawlers are [14, 1].
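A hedged sketch of such a submission step, using the requests library (an assumption; the paper does not prescribe a library), could look as follows.

import requests

def submit_form(action_url, method, field_values):
    """Submit a filled form with either HTTP method and return the response page."""
    if method.upper() == "POST":
        response = requests.post(action_url, data=field_values, timeout=10)
    else:                                        # HTML forms default to GET
        response = requests.get(action_url, params=field_values, timeout=10)
    return response.text

# Hypothetical usage with a field/value mapping produced by the label matcher:
# page = submit_form("http://example.com/cgi-bin/search", "GET",
#                    {"title": "Harry Potter", "format": "Paperback"})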
4.1.8 Phase 8: Response page analysis
The aim of this phase is to analyze the response pages (the server responses to the crawler's queries). The main module in this phase is the response analyzer: it is responsible for automatically distinguishing between a response page that contains search results and one that contains an error message reporting that no matches were found for the submitted query. The idea is to use this information to tune the crawler's value assignment strategy.
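One simple way to realize this distinction, sketched below under our own assumptions, is to look for typical "no results" phrases and to compare each response against the page returned for a deliberately nonsense (Qns) query.

from difflib import SequenceMatcher

NO_MATCH_PHRASES = ("no matches", "no results", "0 results", "nothing found")

def looks_like_error(response_text, error_template=None, similarity_threshold=0.9):
    """Heuristic: treat a response page as an error page if it contains a typical
    'no results' phrase, or if it is nearly identical to the page returned for a
    deliberately nonsense (Qns) query."""
    lowered = response_text.lower()
    if any(phrase in lowered for phrase in NO_MATCH_PHRASES):
        return True
    if error_template is not None:
        ratio = SequenceMatcher(None, response_text, error_template).ratio()
        if ratio >= similarity_threshold:
            return True
    return False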
5. EXPERIMENTAL RESULTS AND ANALYSIS
A number of experiments were performed to validate the effectiveness of the overall proposed framework. This was done by implementing the proposed algorithms of the most important phases (collecting web pages and finding relevant forms); we note that improving the performance of each phase separately increases the overall performance of the framework. In our experiments we employed the following components: Tidy1 (a library for HTML cleaning), HtmlAgilityPack2 and MSHTML (HTML parsers). The communication channel with the internet was about 256 MBPS, and pages took on average 1.9 seconds to parse.
5.1 Collecting web pages
In order to support the tests, the crawler was started up to collect all types of pages and kept running until it had gathered thousands of pages, storing them for an analysis of pages with forms as described in section 4.1.1. Figure 10 shows the crawler at start-up; the proposed algorithm is validated by tracing the crawler and parser tasks. We examined five categories of web sites (book shopping, job advertisements, car advertisements, software and e-commerce) and collected 17 web sites for each category from https://www.wendangku.net/doc/183436328.html3; we used those 85 web sites as seeds (entry points) for the crawler. For each entry point we used a breadth-first crawl of the Web. 9833 pages were visited by the crawler (an average of 1966 pages per category and 117.3 pages per site), and 398 of these pages contained forms.
Figure 10: Start window of the crawler (testing phase1)
Table 2 shows a sample of detailed information about the 85 Web sites, including the average number of terms per page (TR/P), forms per page (F/P) and tags per page (TG/P).
1 https://www.wendangku.net/doc/183436328.html/projects/tidy
2 https://www.wendangku.net/doc/183436328.html/smourier/archive/2003/06/04/8265.aspx
3 https://www.wendangku.net/doc/183436328.html

Table 2: Web data set summary

Category     Site   TR/P   F/P   TG/P
Book         1      421    13    865
Book         2      526    8     796
Book         3      85     4     1036
Book         4      463    15    935
Book         5      320    6     179
Book         6      789    5     359
Book         7      345    3     895
Job          1      501    2     923
Job          2      471    3     885
Job          3      329    6     561
Job          4      609    8     1020
Job          5      406    7     365
Job          6      481    11    473
Job          7      756    5     321
Car          1      115    2     465
Car          2      490    8     471
Car          3      865    4     358
Car          4      361    15    1009
Car          5      881    6     827
Car          6      98     9     901
Car          7      701    7     384
Software     1      394    7     296
Software     2      583    5     741
Software     3      397    9     394
Software     4      422    3     1101
Software     5      741    11    905
Software     6      424    2     227
Software     7      801    9     710
E-Commerce   1      886    3     1106
E-Commerce   2      701    6     465
E-Commerce   3      881    6     827
E-Commerce   4      865    2     981
E-Commerce   5      779    4     684
E-Commerce   6      386    16    994
E-Commerce   7      361    15    1009
To evaluate the extracted information we rely on two standard measures common in the information retrieval field: precision and recall [3]. In our context, let x = the number of pages that have been crawled successfully, y = the number of pages retrieved (correctly and not correctly), and z = the total
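Assuming z denotes the total number of pages that should have been retrieved (an assumption; the definition is not spelled out above), the standard formulations of these measures are:

\[
\text{Precision} = \frac{x}{y}, \qquad \text{Recall} = \frac{x}{z}
\]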