WTMS: A System for Collecting and Analyzing Topic-Specific Web Information
Sougata Mukherjea,
C&C Research Laboratories,
NEC USA Inc.,
San Jose, Ca, USA
Abstract:
With the explosive growth of the World-Wide Web, it is becoming increasingly difficult for users to collect and analyze Web pages that are relevant to a particular topic. To address this problem we are developing WTMS, a system for Web Topic Management. In this paper we explain how the WTMS crawler efficiently collects Web pages for a topic. We also introduce the user interface of the system that integrates several techniques for analyzing the collection. Moreover, we present the various views of the interface that allow navigation through the information space. We highlight several examples to show how the system enables the user to gain useful insights about the collection.
Keywords: World-Wide Web, Topic Management, Focussed Crawling, Information Visualization, Graph Algorithms, Hubs, Authorities
1. Introduction
The World-Wide Web is undoubtedly the best source for getting information on any topic. Therefore, more and more people use the Web for topic management [1], the task of gathering, evaluating and organizing information resources on the Web. Users may investigate topics for both professional and personal interests.
Generally, popular portals or search engines like Yahoo and Alta Vista are used for gathering information on the WWW. However, with the explosive growth of the Web, topic management is becoming an increasingly difficult task. On the one hand, a large number of documents is retrieved for most queries, and the results are presented as pages of scrolled lists; going through these pages to retrieve the relevant information is tedious. On the other hand, the Web has over 350 million pages and continues to grow rapidly at a million pages per day [4]. Such growth and flux pose basic limits of scale for today's generic search engines. Thus, much relevant information may not have been gathered and some information may not be up-to-date.
Because of these problems, there is now growing awareness that for serious Web users, focussed portholes are more useful than generic portals [9]. Therefore, systems that allow the user to collect and organize the information related to a particular topic and that allow easy navigation through this information space are becoming essential. Such a Web Topic Management System should have several features to be really useful:
- Focussed Crawler: A crawler that collects resources from the WWW that relate to a user-specified topic is an essential requirement of the system. An effective crawler should have high precision (all the retrieved Web pages should belong to the topic) and high recall (most of the relevant information available on the topic should be gathered). Moreover, the crawler should be efficient; the relevant information should be collected in the least amount of time possible.
- Viewing information at various levels of abstraction: A search engine generally shows the relevant Web pages as the result of a user's query. However, authors sometimes organize the information as a collection of pages; in these cases presenting the collection may be more useful for the reader. In fact, if a Web site has many relevant pages, presenting the Web site itself may be better. Furthermore, if there are many Web sites for a particular topic, grouping similar Web sites may be more convenient. Therefore, an effective Topic Management system should be able to present the information at various levels of abstraction depending on the user's focus.
- Integrate querying and browsing: The two major ways to access information on the Web are querying and browsing. Querying is appropriate when the user has a well-defined understanding of what information is needed and how to formulate a query. However, in many cases the user is not certain of exactly what information is desired and needs to learn more about the content of the information space. In these cases browsing is an ideal navigational strategy. Browsing can also be combined with querying when the result of the query is too large for the user to comprehend (by letting the user browse through the results) or too small (by showing the user other related information). An effective Topic Management system should allow the user to smoothly integrate querying and browsing to retrieve the relevant information.
- Beyond keyword-based querying: For advanced analysis of a topic, the keyword-based queries provided by search engines may not be adequate. More sophisticated querying techniques based on the contents as well as the structure of the information space are essential. For example, the user should be able to group together related information and filter out unimportant information.
- Allow easy sharing of the gathered information: Once the collection for a particular topic has been gathered and organized, different users with an interest in the topic should be able to share the information.
We are building WTMS, a Web Topic Management System to allow the collection and analysis of information on the Web related to a particular topic. This paper discusses the various features of the system. The next section cites related work. Section 3 explains how the focussed crawler of WTMS allows the collection of information from the WWW relevant to a topic. Section 4 presents the various views of the system that allow the user to navigate through the information space. Several graph algorithm based techniques to analyze the collection are introduced in section 5. Finally, section 6 is the conclusion.
2. Related Work
2.1. Visualizing the World-Wide Web
Several systems for visualizing WWW sites have been developed. Examples include the Navigational View Builder [16], the Harmony Internet Browser [2] and Narcissus [11]. Visualization techniques for World-Wide Web search results are also being developed. For example, the WebQuery system [7] visualizes the results of a search query along with all pages that link to or are linked to by any page in the original result set. Another example is WebBook [6], which allows the results of a search to be organized and manipulated in various ways in a 3D space. In this paper our emphasis is on developing views that allow navigation through Web pages about a particular topic and help the user gain useful insights about the collection.
2.2. World-Wide Web Topic Management
In recent times there has been much interest in collecting Web pages related to a particular topic. The shark-search algorithm for collecting topic-specific pages is presented in [12]. Another focussed crawler is presented in [9]. The crawler used in WTMS is similar to these systems. However, it uses several heuristics to improve performance.
Another interesting problem is determining the important pages in a collection. Kleinberg defines the HITS algorithm to identify authority and hub pages in a collection [13]. Authority pages are authorities on a topic and hubs point to many pages relevant to the topic. Therefore, pages with many in-links, especially from hubs, are considered to be good authorities. On the other hand, pages with many out-links, especially from authorities, are considered to be good hubs. The algorithm has been refined in CLEVER [8] and Topic Distillation [5]. We feel that this algorithm is very important for a Web Topic Management system. However, we also believe that determining the hub and authority sites for a topic is more useful than determining the hub and authority pages.
Mapuccino (formerly WebCutter) [15], [12], [3] and TopicShop [20], [1] are two systems that have been developed for WWW Topic Management. Both systems use a crawler for collecting Web pages related to a topic and use various types of visualization to allow the user to navigate through the resultant information space. While Mapuccino presents the information as a collection of Web pages, TopicShop presents the information as a collection of Web sites. As emphasized in the introduction, we believe that it is more effective to present the information at various levels of abstraction depending on the user's focus. Moreover, a topic management system should allow the user to use several techniques to analyze the information space.
3. Crawling
The architecture of the Web Topic Management System is discussed in [18]. The system uses a focussed crawler to collect Web pages related to a user-specified topic. In this section we will discuss the basic strategy for focussed crawling and how we can improve performance by using some heuristics.
3.1. Basic Focussed Crawling Technique
For collecting WWW pages related to a particular topic the user has to first specify some seed urls relevant to the topic. Alternatively, the user can specify the keywords for the topic and the crawler issues a query with the specified keywords to a popular Web search engine and uses the results as the seed urls. The WTMS crawler downloads the seed urls and creates a representative document vector (RDV) based on the frequently occurring keywords in these urls.
The crawler then downloads the pages that are referenced from the seed urls and calculates their similarity to the RDV using the vector space model [19]. If the similarity is above a threshold, the pages are indexed by a text search engine and the links from the pages are added to a queue. The crawler continues to follow the out-links until the queue is empty or a user-specified limit is reached.
The crawler also determines the pages pointing to the seed urls. (A query link:u to search engines like AltaVista and Google returns all pages pointing to url u). These pages are downloaded and if their similarity to the RDV is greater than a threshold, they are indexed and the urls pointing to these pages are added to the queue. The crawler continues to follow the in-links until the queue is empty or a user-specified limit is reached.
After crawling, the collection consists of the seed urls as well as all pages similar to these seed urls that have paths to or from the seeds. We believe that this collection is a good source of information available on the WWW for the user-specified topic. It should be noted that the crawler has a stop url list to avoid downloading some popular pages (like Yahoo's and Netscape's Home pages) as part of the collection.
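As an illustration, the out-link phase of this basic strategy could be sketched as follows. This is not the actual WTMS implementation; the function names (fetch_page, extract_links, build_rdv, cosine_similarity, index_page) and the crawl limit are illustrative placeholders, while the 0.25 similarity threshold is the one quoted in section 3.3.

```python
from collections import deque

SIM_THRESHOLD = 0.25      # indexing threshold used by WTMS (see section 3.3)
MAX_PAGES = 10000         # hypothetical user-specified crawl limit

def focused_crawl(seed_urls, fetch_page, extract_links, build_rdv,
                  cosine_similarity, index_page, stop_urls):
    """Follow out-links from the seeds, indexing pages similar to the RDV."""
    seeds = [fetch_page(u) for u in seed_urls]
    rdv = build_rdv(seeds)                            # representative document vector
    queue = deque(link for page in seeds for link in extract_links(page))
    visited, indexed = set(seed_urls), 0
    while queue and indexed < MAX_PAGES:
        url = queue.popleft()
        if url in visited or url in stop_urls:
            continue
        visited.add(url)
        page = fetch_page(url)
        if page is None:
            continue
        if cosine_similarity(page, rdv) >= SIM_THRESHOLD:
            index_page(page)                          # relevant page: index it
            indexed += 1
            queue.extend(extract_links(page))         # and follow its out-links
        # otherwise the page is discarded and its links are not followed
    return indexed
```

The in-link phase proceeds symmetrically, with a link:u query to a search engine supplying the referring pages instead of extract_links.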
This focussed crawling strategy is similar to the techniques described in [12] and [9]. Most focussed crawlers start with some seed urls and follow the out-links (and sometimes also the in-links). Pages that are relevant to the topic of interest are indexed and the links from these pages are also followed. The relevancy can be determined by various techniques like the vector space model (as in [12]) or a classifier (as in [9]).
3.2. Using Heuristics to Improve Performance
The main bottleneck of a crawler is the time spent downloading Web pages. Besides network congestion, a crawler needs to follow the convention of issuing at most one download request to a site every 30 seconds. Therefore, downloading many pages from a single site may take a long time. For a focussed crawler, only Web pages relevant to the topic are important, so many of the downloaded pages may have to be discarded. In fact, using our basic focussed crawler for topics as diverse as World Cup Soccer, Information Visualization and Titanic, we found that less than 50% of the downloaded pages were relevant. If we can determine that a page is irrelevant without examining its contents, we can avoid downloading the page, thereby improving performance. We use two heuristics for this purpose.
3.2.1. Nearness of the current page to the linked page
If a web site has information related to several topics, a page in the Web site for one of the topics may have links to pages relating to the other topics or to the main page of the site for ease of navigation. For example, the page http://www.discovery.com/area/science/titanic/weblinks.html, a page relevant to Titanic, has a link to http://www.discovery.com/online.html, the main page of Discovery online, which is not related to the topic. However, since most web sites are well organized, pages that are dissimilar do not occur near each other in the directory hierarchy.
Therefore, when we examine the pages linked to or from a Web page, we need to download a linked page only if it is near the current page. The determination of nearness is a trade-off between the number of irrelevant downloads and the number of relevant pages that are not downloaded. If we use a strict criterion for the nearness between two pages, the number of downloads will be lower, but we may miss some relevant pages. On the other hand, a lenient criterion will retrieve all the relevant pages, but at the cost of increasing the number of downloads.
Figure 1: Determining the nearness of two pages in the directory hierarchy
Figure 1 shows how nearness is determined by the WTMS crawler. Suppose a page in the directory A/B is the current page. Then pages in the same web site are considered to be near (and therefore downloaded) if and only if they belong to the directories shown in the figure. Thus pages in the parent directory (A) as well as any children directories (C, D) are considered to be near. Sections of sibling directories (E, F, G) are also downloaded. After crawling several web sites, we found that this definition of nearness gives the best results. It should be noted that if a page has a link to or from a page in another Web site, we have to download that page (unless it is in the stop url list). Also note that if a url contains any of the topic keywords, the page will always be downloaded; so all pages from http://www.titanicmovie.com are downloaded for the Titanic collection.
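A possible implementation of this nearness test is sketched below. The treatment of sibling directories (any directory sharing the current directory's parent) is our reading of Figure 1 rather than a literal transcription of the WTMS code; links to other Web sites and urls containing topic keywords are handled separately, as noted above.

```python
from urllib.parse import urlparse
import posixpath

def directory_of(url):
    """Directory part of a url path, e.g. '/A/B' for 'http://x/A/B/page.html'."""
    return posixpath.dirname(urlparse(url).path).rstrip('/') or '/'

def is_near(current_url, linked_url):
    """Nearness test for pages on the same site (cf. Figure 1)."""
    if urlparse(current_url).netloc != urlparse(linked_url).netloc:
        return False                            # other sites are handled separately
    cur = directory_of(current_url)             # e.g. '/A/B'
    linked = directory_of(linked_url)
    parent = posixpath.dirname(cur) or '/'      # e.g. '/A'
    if linked in (cur, parent):
        return True                             # same or parent directory
    if linked.startswith(cur + '/'):
        return True                             # child directory of the current one
    if posixpath.dirname(linked) == parent:
        return True                             # sibling directory under the same parent
    return False
```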
3.2.2. Irrelevant Directories
Because web sites are generally well organized, most pages in the same directory have similar themes. Thus all pages in http://www.murthy.com/txlaw/ talk about tax laws. One of these pages, http://www.murthy.com/txlaw/txwomsoc.html, was retrieved by a query to Google with the keywords "world cup soccer" since it talks about visa issuance for the Women's World Cup Soccer tournament. However, none of the other pages in the directory are relevant to the collection on World Cup Soccer. The basic crawler nevertheless downloaded all of these pages, only to discard them after determining their similarity to the RDV.
To avoid this problem, during crawling we keep a count of the number of pages that have been indexed and the number that have been ignored for each directory. If more than 25 pages of a directory have been downloaded and 90% of those pages were ignored, we do not download any more pages from that directory.
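The bookkeeping for this heuristic can be sketched as follows; the 25-page minimum and the 90% rejection ratio are the thresholds quoted above, while the class and method names are illustrative.

```python
from collections import defaultdict
from urllib.parse import urlparse
import posixpath

MIN_DOWNLOADS = 25     # only judge a directory after this many downloads from it
REJECT_RATIO = 0.90    # skip the directory once 90% of its downloads were discarded

class DirectoryFilter:
    """Per-directory bookkeeping for the irrelevant-directory heuristic."""

    def __init__(self):
        self.downloaded = defaultdict(int)
        self.rejected = defaultdict(int)

    def _directory(self, url):
        parsed = urlparse(url)
        return parsed.netloc + posixpath.dirname(parsed.path)

    def record(self, url, was_indexed):
        """Call after each download with the outcome of the similarity test."""
        d = self._directory(url)
        self.downloaded[d] += 1
        if not was_indexed:
            self.rejected[d] += 1

    def should_skip(self, url):
        """True once the directory has proven to be mostly irrelevant."""
        d = self._directory(url)
        n = self.downloaded[d]
        return n > MIN_DOWNLOADS and self.rejected[d] / n >= REJECT_RATIO
```

The crawler would call record() after every similarity test and consult should_skip() before queueing further urls from the same directory.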
3.3. Evaluation
Collections | Information Visualization | World Cup Soccer | Titanic
--- | --- | --- | ---
% Download | 73.54 | 66.27 | 77.4
% Nearness | 21.9 | 27.59 | 15.7
% Rejected Directories | 4.56 | 6.14 | 6.9
% Relevant Pages Missed | 4.26 | 4.04 | 2.56
Average Score Pages Missed | 0.261 | 0.279 | 0.254
Table 1: Effectiveness of the heuristics in improving crawling performance
Table 1 shows the comparison between the basic crawler and the enhanced crawlers for 3 collections, Information Visualization, World Cup Soccer and Titanic. The following statistics are shown:
- % Download = 100 * (Number of pages downloaded by the enhanced crawler) / (Number of pages downloaded by the basic crawler). For all 3 collections there is a significant decrease in the number of urls that were downloaded. Besides reducing the network overhead, the time spent waiting between successive requests to the same server is also decreased.
- % Nearness is the percentage of pages not downloaded because the linked page was not near the current page. A significant number of pages were not downloaded using this heuristic.
- % Rejected Directories is the percentage of pages not downloaded because a significant number of pages from the directory the page belonged to had already been rejected after downloading. Unfortunately, only a small number of pages were ignored using this criterion. Extending the criterion to identify irrelevant sites instead of irrelevant directories may be more useful.
- % Relevant Pages Missed = 100 * (Number of relevant pages indexed by the basic crawler but ignored by the enhanced crawler) / (Total number of pages indexed by the basic crawler). This statistic shows the fraction of relevant pages that were missed by the enhanced crawler. It indicates that most of the pages not downloaded by the enhanced crawler were not relevant to the topic and were not indexed by the basic crawler either.
- Average Score Pages Missed is the average similarity score of the relevant pages indexed by the basic crawler but ignored by the enhanced crawler. Even if the enhanced crawler misses some relevant pages, for the crawler to be acceptable, none of these pages should be very important to the collection. To measure the importance of the missed pages, we calculate their average similarity scores. In our crawler, all pages whose similarity to the representative document vector is greater than 0.25 are indexed (we use a vector space model where the similarity of a page is a number between 0 and 1). Since the average score of the missed pages is close to 0.25, these pages were not among the most important pages of the collection.
Thus, our enhancements were able to significantly reduce the download time without missing many relevant pages of the collection.
4. Interfaces for Topic Management
After the information about a particular topic has been crawled and indexed, a WTMS server is initialized. Any user can access the WTMS server from a Web browser. The WTMS user interface is opened in the client browser as a Java (Swing) applet. Since most users won't require all the collected information, the applet initially loads only a minimal amount of information. Based on user interactions, the applet can request further information from the WTMS server. XML is used for information exchange between the WTMS server and clients.
The WTMS interface provides various types of views to allow navigation through the gathered information space. This section gives a brief description of the WTMS interface applet.
4.1. Tabular Overview
In WTMS the collected information is organized into an abstraction hierarchy as discussed in [18]. The Web pages relevant to a topic downloaded by the crawler are considered to be the smallest unit of information in the system. The pages can be grouped into their respective physical domains; for example, all pages in www.cnn.com can be grouped together. Many large corporations use several Web servers based on functionality or geographic location. For example, www.nec.com and www.nec.co.jp are the Web sites of NEC Corporation in the US and Japan respectively. Similarly, shopping.yahoo.com is the shopping component of the Yahoo www.yahoo.com portal. Therefore, we group together related physical domains into logical Web sites by analyzing the urls. (In other cases we may need to break a domain into smaller logical Web sites. For example, for corporations like Geocities and Tripod that provide free home pages, the Web sites www.geocities.com and members.tripod.com are actually collections of home pages with minimal relationship to one another. However, this technique has not yet been incorporated into WTMS.)
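The paper does not spell out the exact url analysis used to merge physical domains; the sketch below illustrates one simple possibility, grouping hosts by an organization-level label so that www.nec.com and www.nec.co.jp collapse into one logical site and shopping.yahoo.com joins www.yahoo.com. The suffix list and function names are assumptions, not WTMS internals.

```python
from urllib.parse import urlparse

# Labels treated as generic when looking for the organization name (illustrative list).
GENERIC_LABELS = {'www', 'com', 'org', 'net', 'edu', 'gov', 'co', 'ac', 'jp', 'uk', 'us'}

def logical_site_key(url):
    """Map a url to an organization-level key, e.g. www.nec.com and
    www.nec.co.jp -> 'nec', shopping.yahoo.com -> 'yahoo'."""
    host = urlparse(url).netloc.lower().split(':')[0]
    labels = [label for label in host.split('.') if label not in GENERIC_LABELS]
    return labels[-1] if labels else host       # rightmost non-generic label

def group_into_logical_sites(urls):
    """Group page urls into logical Web sites keyed by organization."""
    sites = {}
    for url in urls:
        sites.setdefault(logical_site_key(url), []).append(url)
    return sites
```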
Most HTML authors organize the information into a series of HTML pages for ease of navigation. Moreover, readers generally want to know what is available on a given site, not on a single page. Therefore, logical Web sites are the basic unit of information in WTMS. Thus, for calculating the hub and authority scores, we apply the algorithm introduced in [13] to logical sites instead of individual Web pages.
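For reference, a minimal sketch of the iterative hub/authority computation of [13], applied to the site graph, is shown below. The iteration count and data layout are our own choices; the input is assumed to contain only inter-site links, since intra-site links are ignored for these scores (see below).

```python
import math

def hub_authority_scores(out_links, iterations=50):
    """Iterative hub/authority computation in the style of [13], applied to
    logical sites; out_links maps a site to the set of sites it links to
    (intra-site links are assumed to have been removed already)."""
    sites = set(out_links) | {v for targets in out_links.values() for v in targets}
    hub = {s: 1.0 for s in sites}
    auth = {s: 1.0 for s in sites}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the sites pointing to it.
        auth = {s: 0.0 for s in sites}
        for u, targets in out_links.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub score: sum of the authority scores of the sites it points to.
        hub = {u: sum(auth[v] for v in out_links.get(u, ())) for u in sites}
        # Normalize so that all scores stay between 0 and 1.
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for s in scores:
                scores[s] /= norm
    return hub, auth
```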
Figure 2: A table showing the logical sites relevant to a collection on World Cup Soccer sorted by the authority scores
Figure 2 shows the initial view of the WTMS interface for a collection on World Cup Soccer. It shows a table containing various information about the logical Web sites that were discovered for the topic. Besides the url and the number of pages, it shows the hub and authority scores. (These scores have values between 0 and 1). The table also shows the title of the main page of the sites. The main page is determined by the connectivity and the depth of the page in the site hierarchy as discussed in [17]. Note that generally while calculating the hub and authority scores, intra-site links are ignored [5]. Assuming that all Web pages within a site are by the same author, this removes the author's judgment while determining the global importance of a page within the overall collection. However, while determining the local importance of a page within a site, the author's judgment is also important. So the intra-site links are not ignored while determining the main page of a site.
The table gives a good overview of the collection. By clicking on the labels at the top, the user can sort the table by that particular statistic. For example, in Figure 2, it is sorted by the authority scores. Some authorities for information on World Cup Soccer are shown at the top.
4.2. Visualizing Information at Various Levels of Abstraction
A logical site has several physical domains. The domains consist of directories, and each directory has several Web pages and sub-directories. WTMS allows the users to view the information about a collection at various levels of abstraction depending on their focus. We discuss some of these visualizations in this subsection. Note that WTMS also allows the user to group related sites as discussed in section 5.
4.2.1. Site Level Views
(a) The logical web site is selected
(b) A directory within the site is selected
Figure 3: Visualizing the details of a logical site
Figure 3 shows the details of the site seawifs.gsfc.nasa.gov, the highest authority for the collection on Titanic. In Figure 3(a) the logical site itself is selected. The constituents of the site as well as the sites having links to and from the selected site are shown as glyphs (graphical elements). The figure shows that the logical site has several physical domains like rsd.gsfc.nasa.gov and seawifs.gsfc.nasa.gov. Notice that if a physical domain has just one directory or page (for example, daac.gsfc.nasa.gov), then the glyph for that directory or page is shown. The brightness of a glyph is proportional to the last modification time of the freshest page in the group. Thus, bright red glyphs indicate groups which contain very fresh pages (for example, www.dimensional.com) and black glyphs indicate old pages (for example, www.nationalgeographic.com).
The view also shows the links to and from the currently selected site. If another site has a link to the selected site, it has an arrow pointing out. Similarly, if a site has a link from the selected site, it has an arrow pointing in. Thus, Figure 3(a) shows that www.marinemeuseum.org has both links to and from seawifs.gsfc.nasa.gov. The arrows to and from the children of the selected site give an indication of their connectivity. For example, the figure shows that the domain rsd.gsfc.nasa.gov has only out-links while seawifs.gsfc.nasa.gov has both in-links and out-links. The thickness of an arrow is proportional to the number of links, while its color indicates whether the links are inter-site or intra-site. Green indicates intra-site links; thus all links from pages in the domain rsd.gsfc.nasa.gov are to other pages within the same logical site. Blue indicates inter-site links; thus all links from the other sites are blue. A combination of inter-site and intra-site links is indicated by cyan; thus seawifs.gsfc.nasa.gov has both inter-site and intra-site in-links and out-links.
The user can click on a glyph with the right mouse button to see more details. For example, in Figure 3(b) the user has navigated further down the hierarchy and selected the directory seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML. The relevant web pages in the directory are shown. The directory has a page with a large number of in-links and out-links, titanic.html. Note that the glyph is a circle for a Web page and a rectangle for a group. Further, the label of the currently selected node is highlighted and the rectangles representing the currently selected path are not filled.
4.2.2. Page Level Views
Figure 4: Information about the Web page www.cc.gatech.edu/gvu
Clicking on a glyph with the left mouse button shows information about the corresponding page or site. Thus, Figure 4 shows information about the Georgia Tech Graphics, Visualization & Usability Center Home Page, a page in the Information Visualization collection. This dialog box allows the user to visit the page, see the links to and from the page, or see the related pages. For example, Figure 5 shows the url and title as well as the structural and semantic similarity values of pages related to the above page. The table is sorted by the semantic similarity of a page to the selected page, which is determined by the vector space model [19]. The structural similarity of a page is determined by its links with respect to the selected page and is calculated as follows:
- For each direct link between a page B and the selected page A (in either direction), the structural similarity score of B is increased by 1.0.
- For each indirect link between pages B and A, the structural similarity score is increased by 0.5. There can be 3 types of indirect links:
- Transitive: B links to C and C links to A or vice versa.
- Social Filtering: Both A and B point to C.
- Co-citation: C points to both A and B.
- The scores are normalized to determine the structural similarity. (A sketch of this computation is given after Figure 5.)
Figure 5: Related web pages for www.cc.gatech.edu/gvu
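A sketch of this scoring, using the 1.0 and 0.5 weights listed above, is given below. The normalization step divides by the maximum score, which is an assumption since the paper does not specify the exact normalization.

```python
def structural_similarity(selected, out_links):
    """Structural similarity of every page to the 'selected' page A, using the
    1.0/0.5 weights listed above; out_links maps a page to the pages it links to."""
    in_links = {}
    for u, targets in out_links.items():
        for v in targets:
            in_links.setdefault(v, set()).add(u)
    a_out = out_links.get(selected, set())
    a_in = in_links.get(selected, set())

    scores = {}
    def bump(page, amount):
        if page != selected:
            scores[page] = scores.get(page, 0.0) + amount

    for b in a_in:                        # direct link B -> A
        bump(b, 1.0)
    for b in a_out:                       # direct link A -> B
        bump(b, 1.0)
    for c in a_in:                        # transitive: B -> C -> A
        for b in in_links.get(c, ()):
            bump(b, 0.5)
    for c in a_out:                       # transitive: A -> C -> B
        for b in out_links.get(c, ()):
            bump(b, 0.5)
    for c in a_out:                       # social filtering: A -> C and B -> C
        for b in in_links.get(c, ()):
            bump(b, 0.5)
    for c in a_in:                        # co-citation: C -> A and C -> B
        for b in out_links.get(c, ()):
            bump(b, 0.5)

    top = max(scores.values(), default=1.0)
    return {b: s / top for b, s in scores.items()}    # normalize to [0, 1]
```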
5. Analyzing the Site Graph
The WTMS interface allows the user to filter the information space based on various criteria. For example, the user can see only pages modified before or after a particular date. Moreover, in the Related Pages table (Figure 5), the user can see only pages that are co-cited with or directly linked to the selected page. Another useful technique is to specify keywords to determine relevant Web pages or sites. For keyword queries, as well as for determining the pages similar to a selected page, the WTMS interface sends a query to the WTMS server. The results from the server are used in the visualizations.
To make a Web Topic Management system more useful, it should provide analysis beyond traditional keyword or attribute based querying. At present WTMS provides various graph-algorithm-based techniques to analyze the topic-specific collection. The algorithms are applied on the site graph, which is a directed graph with the logical sites as the nodes. If a page in site a has a link to a page in site b, then an edge is created from node a to node b in the site graph. It should be emphasized that our analysis techniques are not computationally complex and are thus applicable in real time to collections with a large number of Web sites.
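Building the site graph from the crawled pages is straightforward; a possible sketch (reusing the hypothetical logical_site_key mapping of section 4.1 and dropping self-loops) is shown below.

```python
def build_site_graph(page_links, site_of):
    """Directed site graph: page_links maps a page url to the urls it links to,
    site_of maps a page url to its logical site (e.g. the hypothetical
    logical_site_key of section 4.1). Self-loops are dropped."""
    site_graph = {}
    for page, targets in page_links.items():
        a = site_of(page)
        site_graph.setdefault(a, set())
        for target in targets:
            b = site_of(target)
            if b != a:
                site_graph[a].add(b)
                site_graph.setdefault(b, set())
    return site_graph
```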
5.1. Connected Components
One useful technique is to consider the site graph as an undirected graph and determine the connected components [10]. Nodes in the same connected component are connected by paths of one or more links and can thus be considered to be related.
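A sketch of this step, which symmetrizes the site graph and runs a breadth-first search, might look as follows.

```python
from collections import deque

def connected_components(site_graph):
    """Connected components of the site graph viewed as an undirected graph."""
    undirected = {s: set() for s in site_graph}
    for u, targets in site_graph.items():
        for v in targets:
            undirected[u].add(v)
            undirected.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in undirected:
        if start in seen:
            continue
        component, frontier = [], deque([start])
        seen.add(start)
        while frontier:                       # breadth-first search from 'start'
            u = frontier.popleft()
            component.append(u)
            for v in undirected[u]:
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
        components.append(component)
    return components
```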
(a) World Cup Soccer
(b) Titanic
Figure 6: The Connected Components of collections
Figure 6 shows the connected components that were discovered for the collections on World Cup Soccer and Titanic. If a connected component contains only one site, the site itself is shown (for example, www.iranian.com in Figure 6(a)). Each component is represented by a glyph whose size is proportional to the number of Web pages the group contains. The glyphs are ordered by the maximum authority score among the sites in the group. Thus the group containing seawifs.gsfc.nasa.gov is at the top for the Titanic collection. The label of a connected component is determined by the site with the highest authority score it contains. The user can click on a glyph to see the logical sites and web pages of the corresponding component. For example, in Figure 6(a), the user is viewing the details of a connected component containing the site www.ee.umd.edu.
Figure 6 shows that for both collections only a few connected components were formed (even though the collections had thousands of pages). Thus connected components are a useful mechanism for grouping the logical sites into meaningful clusters. The major information about the topic can be found in the top few components. The isolated sites found later in the view generally contain information different from the main theme of the collection. For example, Figure 6(a) shows that the Electrical Engineering Department of the University of Maryland is in the collection on World Cup Soccer. On examining the main page of the site (for this collection), www.ee.umd.edu/~dstewart/pinball/PAPA6/, we found that the page talks about the 1998 World Pinball Championship. The search engine retrieved the page since one of the participating teams was named World Cup Soccer! Similarly, as discussed in section 3.2.2, the site www.murthy.com, which talks about tax laws, is in the collection since one of its pages talks about visa issuance for the Women's World Cup Soccer tournament. Further, for the Titanic collection shown in Figure 6(b), we found an isolated site for a casino named Titanic. Thus, it is evident that the connected components are effective in isolating pages that are on the fringes of the collection.
5.2. Strongly Connected Components
Figure 7: A strongly connected component in the Information Visualization collection
Considering the site graph as a directed graph, we can also determine the strongly connected components [10]. Every pair of nodes in a strongly connected component is mutually reachable through directed paths. Figure 7 shows a strongly connected component for the Information Visualization collection. All the sites shown in the figure are reachable from each other. For a strongly connected component with a large number of nodes and links, the graph is too complex for the user to understand. Therefore WTMS allows various ways to filter the graph:
Figure 8: Showing the paths from www-graphics.stanford.edu in the strongly connected component
- The user can see the paths between a selected node and all the other nodes in the graph. For example, Figure 8 shows the paths from the site www-graphics.stanford.edu, one of the best hubs on the topic.
- We also allow the integration of keyword queries with structure-based analysis. For example, Figure 9 shows the same strongly connected component after a query with the keywords "stock market". Only sites relevant to visualization of stock markets are shown. Notice that the site computer.org is isolated since the sites connecting it to the other sites were filtered out by the query.
Figure 9: The strongly connected component after a query with "stock market"
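The strongly connected components themselves can be computed with any standard algorithm; the sketch below uses Kosaraju's two-pass depth-first search, which is one possible choice rather than necessarily the one used in WTMS.

```python
def strongly_connected_components(site_graph):
    """Strongly connected components of the directed site graph
    (Kosaraju's two-pass depth-first search)."""
    nodes = set(site_graph) | {v for targets in site_graph.values() for v in targets}
    forward = {n: set(site_graph.get(n, ())) for n in nodes}
    reverse = {n: set() for n in nodes}
    for u, targets in forward.items():
        for v in targets:
            reverse[v].add(u)

    def dfs(start, adj, seen, sink):
        """Iterative depth-first search; sink(u) is called when u is finished."""
        stack = [(start, iter(adj[start]))]
        seen.add(start)
        while stack:
            u, neighbours = stack[-1]
            advanced = False
            for v in neighbours:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj[v])))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                sink(u)

    order, seen = [], set()
    for n in nodes:                       # first pass: record finishing order
        if n not in seen:
            dfs(n, forward, seen, order.append)

    components, seen = [], set()
    for n in reversed(order):             # second pass on the reversed graph
        if n not in seen:
            component = []
            dfs(n, reverse, seen, component.append)
            components.append(component)
    return components
```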
(a) For the Information Visualization collection, 20% of the sites reference each other
(b) For the World Cup Soccer collection, 3.5% of the sites reference each other
Figure 10: Information about the strongly connected components for a collection
The dialog box shown in Figure 10 gives an indication of the strongly connected components of a collection. The figure shows that for an academic topic like Information Visualization more sites are grouped into strongly connected components (20%) than for a topic of more general interest like World Cup Soccer (only 3.5%). On analyzing the collections using the WTMS interface, we determined that this is because the Information Visualization collection mainly consists of Web sites of research communities, which sometimes cross-reference each other. On the other hand, the World Cup Soccer collection consists of several commercial Web sites on the topic as well as personal pages of people interested in the topic. The personal pages may point to some commercial sites but not vice versa, and of course there is hardly any cyclical referencing among competing commercial sites. Thus, the number of sites that belong to a strongly connected component gives an indication of whether the collection is competitive or collaborative. In a collaborative collection, unlike a competitive one, there will be several Web sites referencing each other; these sites will therefore be grouped into strongly connected components.
5.3. Optimum Hub/Authority Covers
For most topics, the collections will consist of hundreds of logical sites. Sometimes the user may want to filter the information space based on various criteria. Obviously, the hubs and authorities are some of the most important sites for the collection. So one option is to show the top n or n% hubs and authorities as the important sites, where n is any integer. However, instead of choosing an arbitrary integer, in some situations other techniques might be more appropriate. In this section we will define two techniques for filtering the information space.
We define a hub cover for a site graph with V sites and E links as a subset V_h of V such that every site that is the destination of some link in E has a link from at least one site in V_h. An optimum hub cover is a hub cover of the smallest size for the given graph. That is, the optimum hub cover of a collection is the smallest set of sites that together have links to all the sites of the collection. Filtering the collection by showing only the optimum hub cover is useful because, starting from these sites, the user can reach all the sites of the collection.
On the other hand, an authority cover for a site graph with V sites and E links is a subset V_a of V such that every site that is the source of some link in E has a link to at least one site in V_a. The optimum authority cover is the authority cover of the smallest size for the given graph. That is, the optimum authority cover of a collection is the smallest set of sites that together have links from all the sites of the collection. Obviously, filtering the collection by the optimum authority cover is also useful.
Determining the optimum hub and authority covers for a site graph is similar to the vertex cover problem [10]. The vertex cover problem for an undirected graph G=(V,E) is to determine the smallest possible subset V' of V such that if (u,v) is in E, then at least one of u or v is in V'. Unfortunately, the vertex cover problem is NP-complete [10].
In WTMS we determine the approximate optimum hub and authority covers. The algorithm to determine the approximate optimum hub cover is as follows:
- Let G=(V,E) be the site graph
- V_h = {}
- Let V_s be the nodes of the graph sorted by their increasing in-degrees
- Remove from V_s all nodes with no incoming links
- While V_s is not empty:
  - Let v be the next node of V_s
  - Let u be the node with the highest hub score among the nodes that have links to v
  - Add u to V_h
  - Remove from V_s, u as well as all nodes that have links from u
- Return V_h as the approximate optimum hub cover
The algorithm examines the nodes of the graph sorted by their increasing in-degrees. For nodes having just one in-link, the source of the link has to be added to the hub cover. For nodes having more than one in-link, the algorithm adds to the hub cover the link source with the highest hub score. Whenever a node is added to the hub cover, all nodes that have links from that node can be ignored.
Notice that even though we remove the sites with no in-links before the while loop of the above algorithm, these sites can still be in a hub cover. However, sites with neither in-links nor out-links will be ignored by the algorithm. Since these sites are on the fringes of the collection (as discussed in Section 5.1), they should not be considered important to the collection.
The approximate optimum authority cover can be determined by a similar algorithm. We can also determine the hub and authority covers for the Web pages by applying the algorithms on the original graph of the collection.
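A sketch of the greedy approximation listed above is given below; site_graph maps each site to the sites it links to and hub_scores holds the precomputed hub scores. The approximate authority cover follows by swapping the roles of in-links and out-links.

```python
def approximate_hub_cover(site_graph, hub_scores):
    """Greedy approximation of the optimum hub cover sketched above."""
    in_links = {}
    for u, targets in site_graph.items():
        for v in targets:
            in_links.setdefault(v, set()).add(u)

    # Nodes with at least one in-link, sorted by increasing in-degree.
    remaining = sorted(in_links, key=lambda v: len(in_links[v]))
    hub_cover, covered = set(), set()
    for v in remaining:
        if v in covered:
            continue                                  # already removed from V_s
        # Among the sites linking to v, pick the one with the best hub score.
        u = max(in_links[v], key=lambda s: hub_scores.get(s, 0.0))
        hub_cover.add(u)
        covered.add(u)                                # remove u from V_s ...
        covered.update(site_graph.get(u, ()))         # ... and everything u links to
    return hub_cover
```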
Collections | Information Visualization | World Cup Soccer | Titanic
--- | --- | --- | ---
Number of Sites | 305 | 140 | 528
Size of Approximate Optimal Hub Cover | 35 | 20 | 52
Size of Approximate Optimal Authority Cover | 28 | 18 | 68
Table 2: Statistics for determining approximate optimum hub and authority covers
Table 2 shows the sizes of the approximate optimal hub and authority covers that were discovered for the collections. In all cases we could discover a small number of sites, significantly fewer than the total number of sites, which have links to/from all the sites of the collection.
(a) Hub Cover
(b) Authority Cover
(c) Nodes directly reachable from the main hub www.anancyweb.com
Figure 11: The Hub and Authority Covers for the World Cup Soccer collection
Figure 11(a) shows the approximate optimal hub cover for the World Cup Soccer collection, sorted by the hub scores. Similarly, Figure 11(b) shows the approximate optimal authority cover sorted by the authority scores. These views are useful because starting from these sites the user can visit all the sites of the collection. Clicking on one of the sites in the Hub cover view shows the sites that can be directly reached from the selected site. Thus, Figure 11(c) shows the sites with links from the main hub for the World Cup Soccer collection www.anancyweb.com. Similarly, clicking on a site in the Authority cover view shows the sites that have links to the selected site.
6. Conclusion
In this paper, we have presented WTMS, a Web Topic Management system. WTMS uses a crawler to gather information related to a particular topic from the WWW. The enhancements to the crawler have resulted in improved performance, reducing the number of pages that need to be downloaded by more than 20% while missing only a few insignificant pages.
The WTMS interface provides various visualizations to navigate through the information space at different levels of abstraction. For example, the user can see the details of a logical site or a Web page. The system allows the user to integrate searching and browsing. Besides traditional keyword-based search, structural analysis techniques are provided which allow the user to gain several useful insights about the collection. This kind of analysis is not easy using traditional search engines or the previous systems for topic management. We have also introduced the concept of optimum hub and authority covers as a technique to filter the information space. The example scenarios presented in the paper indicate the usefulness of the system for Web Topic Management.
Future work is planned along various directions:
- Incorporating Other Analysis Techniques: We need to integrate other analysis techniques to enable users to understand the information space from different perspectives. For example, we may determine the cores or bipartite subgraphs in the site graph to understand the communities within the collection as discussed in [14]. Techniques that examine the contents of the collection are also required.
- Allowing Collaboration: We plan to allow distributed users to collaborate over the internet while navigating through the same information space. For example, colleagues can collaborate while performing literature survey on a particular topic. Our client-server architecture will facilitate such a collaborative environment.
- Usability Studies: Although the initial reactions to the system have been positive, more formal user studies are required to determine the effectiveness of the WTMS system for analyzing information about a particular topic. Such feedback will help us improve the system.
We believe that as the WWW grows bigger, systems like the WTMS will become essential for retrieving useful information.
References
1. B. Amento, W. Hill, L. Terveen, D. Hix, and P. Ju. An Empirical Evaluation of User Interfaces for Topic Management of Web Sites. In Proceedings of the ACM SIGCHI '99 Conference on Human Factors in Computing Systems, pages 552-559, Pittsburgh, PA, May 1999.
2. K. Andrews. Visualizing Cyberspace: Information Visualization in the Harmony Internet Browser. In Proceedings of the 1995 Information Visualization Symposium, pages 97-104, Atlanta, GA, 1995.
3. I. Ben-Shaul, M. Hersovici, M. Jacovi, Y. Maarek, D. Pelleg, M. Shtalheim, V. Soroka, and S. Ur. Adding Support for Dynamic and Focussed Search with Fetuccino. In Proceedings of the Eighth International World-Wide Web Conference, pages 575-588, Toronto, Canada, May 1999.
4. K. Bharat and A. Broder. A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines. Computer Networks and ISDN Systems. Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7), April 1998.
5. K. Bharat and M. Henzinger. Improved Algorithms for Topic Distillation in a Hyperlinked Environment. In Proceedings of the ACM SIGIR '98 Conference on Research and Development in Information Retrieval, pages 104-111, Melbourne, Australia, August 1998.
6. S. Card, G. Robertson, and W. York. The WebBook and the Web Forager: An Information Workspace for the World-Wide Web. In Proceedings of the ACM SIGCHI '96 Conference on Human Factors in Computing Systems, pages 112-117, Vancouver, Canada, April 1996.
7. J. Carriere and R. Kazman. Searching and Visualizing the Web through Connectivity. In Proceedings of the Sixth International World-Wide Web Conference, pages 701-711, Santa Clara, CA, April 1997.
8. S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Computer Networks and ISDN Systems. Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7), April 1998.
9. S. Chakrabarti, M. van den Berg, and B. Dom. Focussed Crawling: a New Approach to Topic-specific Web Resource Discovery. In Proceedings of the Eighth International World-Wide Web Conference, pages 545-562, Toronto, Canada, May 1999.
10. T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT Press, 1992.
11. R. Hendley, N. Drew, A. Wood, and R. Beale. Narcissus: Visualizing Information. In Proceedings of the 1995 Information Visualization Symposium, pages 90-96, Atlanta, GA, 1995.
12. M. Hersovici, M. Jacovi, Y. Maarek, D. Pelleg, M. Shtalheim, and S. Ur. The Shark-Search Algorithm - an Application: Tailored Web Site Mapping. Computer Networks and ISDN Systems. Special Issue on the Seventh International World-Wide Web Conference, Brisbane, Australia, 30(1-7), April 1998.
13. J. Kleinberg. Authoritative Sources in a Hyperlinked Environment. In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
14. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for Emerging Cyber-communities. In Proceedings of the Eighth International World-Wide Web Conference, pages 403-415, Toronto, Canada, May 1999.
15. Y. Maarek and I. Shaul. WebCutter: A System for Dynamic and Tailorable Site Mapping. In Proceedings of the Sixth International World-Wide Web Conference, pages 713-722, Santa Clara, CA, April 1997.
16. S. Mukherjea and J. Foley. Visualizing the World-Wide Web with the Navigational View Builder. Computer Networks and ISDN Systems. Special Issue on the Third International World-Wide Web Conference, Darmstadt, Germany, 27(6):1075-1087, April 1995.
17. S. Mukherjea and Y. Hara. Focus+Context Views of World-Wide Web Nodes. In Proceedings of the Eighth ACM Conference on Hypertext, pages 187-196, Southampton, England, April 1997.
18. S. Mukherjea. Organizing Topic-Specific Web Information. To appear in Proceedings of the Eleventh ACM Conference on Hypertext, San Antonio, TX, May 2000.
19. G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
20. L. Terveen and W. Hill. Finding and Visualizing Inter-site Clan Graphs. In Proceedings of the ACM SIGCHI '98 Conference on Human Factors in Computing Systems, pages 448-455, Los Angeles, CA, April 1998.
Vitae
Sougata Mukherjea received his Bachelors in Computer Science & Engineering from Jadavpur University, Calcutta, India in 1988 and his MS in Computer Science from Northeastern University, Boston, MA in 1991. He then obtained his PhD in Computer Science from Georgia Institute of Technology, Atlanta, GA in 1995. Until 1999 he was working at NEC's C&C Research Laboratories in San Jose, CA. Presently he is working for Icarian, a start-up company in Sunnyvale, CA. His main research interests are in Web technologies like crawling and link analysis, Information Retrieval and Information Visualization. He can be reached by e-mail at sougata@icarian.com or sougata_m@yahoo.com.