Change Password

Please enter the password.
Please enter the password. Between 8-64 characters. Not identical to your email address. Contain at least 3 of uppercase, lowercase, numbers, and special characters (such as @*&#).
Please enter the password.
Submit

Change Nickname

Current Nickname:
Submit

v4.2
Search
中文EN
v4.2

    HITS

    Overview

    HITS (Hyperlink-Induced Topic Search), like PageRank, is an iterative algorithm based on the link structure to rank web pages in a hyperlinked network, such as the World Wide Web (WWW). HITS makes use of the mutual reinforcing relationship between authorities and hubs to evaluate the web pages, it was developed by L.M. Kleinberg in 1999.

    Related material of the algorithm:

    Basic Concept

    Authority and Hub

    Hyperlink represents a latent human judgment: the creator of page p, by including a link to page q, has in some measure conferred authority on q. Hence, it is precise to define the pages with large in-degree as authorities.

    In the search engine, for a query like 'graph database', relevant authorities can be found by a text-based search, i.e., to search pages that contain the query string, then rank the search results according to the number of its backlinks.

    However, text-based relevant authorities are not always really relevant. There are two observations:

    1. A large number of links are created primarily for navigational purposes (e.g., 'Return to home page'), others represent paid advertisements.
    2. Some authoritative pages for the query topic may not contain the query string (e.g., google.com does not contain query string 'search engine').

    The idea of HITS algorithm is, some relevant authorities, although not be in the results of text-based search, it is quite likely to be pointed to by at least one page in the set of text-relevant authorities. The pages that have links to multiple relevant authoritative pages are referred to as hubs. As illustrated in the graph below, red nodes are good authorities, green nodes are good hubs.

    To review, authoritative pages relevant to the query should not only have large in-degree, there should also be considerable hub pages pointing to them.

    Focused Subgraph

    In the search engine, for a broad-topic query particularly, there are large volumes of text-based relevant pages that could be returned, which will entail considerable computational cost. Thus, HITS algorithm only collects the top t (typically set to about 200) authoritative pages into a root set.

    Next, expand the root set to include (a) the pages that pages in root set point to, and (b) the pages pointing to pages in root set. Since there could be quite a lot pages of the latter kind, the algorithm supports restriction on the number of pages of the latter kind to be allowed to add. The grown root set is called as base set. Base set is kept relatively small, but rich in relevant authorities.

    In the induced subgraph of pages in base set, there are two types of links: Link between pages with different domain names, and link between pages with the same domain name. Since the latter links very often exist purely to navigate inside a site, they convey much less information than the former links about the authority of the pages they point to. Thus, delete all the latter links from the subgraph; this results in the focused subgraph.

    Authority and Hub Weights

    In search engine application, the algorithm is applied on the focused subgraph. In general applications, the algorithm is applied on the whole graph.

    HITS algorithm associates each page (node) with a non-negative authority weight x and a non-negative hub weight y. Pages with larger x and y weights are viewed as better authorities and hubs respectively.

    In a directed graph G = (V, E), the x and y weights for all nodes are set to 1 initially. Then the algorithm iterates, in each round: for each node p ∈ V, (1) update its x and y weights as follows:

    Here is an example:

    (2) Normalize both weights to meet the invariant below:

    Iterate until the convergence below a tolerance is met, or the number of iterations is reached. In the experiments of the original author, it was found that the convergence is quite rapid, 20 iterations is normally sufficient.

    Special Case

    Isolated Node, Disconnected Graph

    Authority weight and hub Weight of isolated node are both 0.

    Multiple authorities are supposed to be found in the same connected component.

    Self-loop Edge

    In HITS algorithm, self-loop edges are ignored.

    Directed Edge

    The direction of edge is essential in the analysis of the link structure in HITS algorithm. Authority weights of nodes with no inbound edge are 0, hub weights of nodes with no outbound edge are 0.

    Command and Configuration

    • Command: algo(hits_centrality)
    • Configurations for the parameter params():
    Name
    Type
    Default
    Specification
    Description
    max_loop_num int 20 >=1 Maximum rounds of iterations
    tolerance float 0.001 (0,1) Tolerance of the convergence; the algorithm ends when the change of authority weights of all nodes and the change of hub weights of all nodes are both smaller than the value of tolerance
    limit int -1 >=-1 Number of results to return; return all results if sets to -1

    Example

    Task Writeback

    1. File Writeback

    Configuration Data in Each Row
    filename _id,authority,hub

    Example: Run the HITS algorithm, maximum iteration round is 20, tolerance is 0.001, write the algorithm results back file

    algo(hits_centrality).params({}).write({
      file: {
        filename: "rank"
      }
    })
    

    2. Property Writeback

    Configuration Writeback Content Type Data Type
    authority authority Node property float
    hub hub Node property float

    Example: Run the HITS algorithm, maximum iteration round is 20, tolerance is 0.0001, write the algorithm results back to node properties

    algo(hits_centrality).params({
      max_loop_num: 20,
      tolerance: 0.0001
    }).write({
      db: {
        authority: "aa",
        hub: "bb"
      }
    })
    

    3. Statistics Writeback

    This algorithm has no statistics.

    Direct Return

    Alias Ordinal
    Type
    Description Column Name
    0 []perNode Node and its authority and hub weight _uuid, authority, hub

    Example: Run the HITS algorithm, maximum iteration round is 20, tolerance is 0.0001, return the results with the descending authority weights

    algo(hits_centrality).params({
      max_loop_num: 20,
      tolerance: 0.0001
    }) as rank
    return rank order by rank.authority desc
    

    Streaming Return

    Alias Ordinal
    Type
    Description Column Name
    0 []perNode Node and its authority and hub weight _uuid, authority, hub

    Example: Run the HITS algorithm, maximum iteration round is 20, tolerance is 0.0001, return ID and hub weight of nodes

    algo(hits_centrality).params({
      max_loop_num: 20,
      tolerance: 0.0001
    }).stream() as rank
    find().nodes({_uuid == rank._uuid}) as nodes
    return table(nodes._id, rank.hub)
    

    Real-time Statistics

    This algorithm has no statistics.

    Please complete the following information to download this book
    *
    公司名称不能为空
    *
    公司邮箱必须填写
    *
    你的名字必须填写
    *
    你的电话必须填写
    *
    你的电话必须填写