Overview
HITS (Hyperlink-Induced Topic Search), like PageRank, is an iterative algorithm based on the link structure to rank web pages in a hyperlinked network, such as the World Wide Web (WWW). HITS makes use of the mutual reinforcing relationship between authorities and hubs to evaluate the web pages, it was developed by L.M. Kleinberg in 1999.
Related material of the algorithm:
- L.M. Kleinberg, Authoritative Sources in a Hyperlinked Environment (1999)
Basic Concept
Authority and Hub
Hyperlink represents a latent human judgment: the creator of page p, by including a link to page q, has in some measure conferred authority on q. Hence, it is precise to define the pages with large in-degree as authorities.
In the search engine, for a query like 'graph database', relevant authorities can be found by a text-based search, i.e., to search pages that contain the query string, then rank the search results according to the number of its backlinks.
However, text-based relevant authorities are not always really relevant. There are two observations:
- A large number of links are created primarily for navigational purposes (e.g., 'Return to home page'), others represent paid advertisements.
- Some authoritative pages for the query topic may not contain the query string (e.g., google.com does not contain query string 'search engine').
The idea of HITS algorithm is, some relevant authorities, although not be in the results of text-based search, it is quite likely to be pointed to by at least one page in the set of text-relevant authorities. The pages that have links to multiple relevant authoritative pages are referred to as hubs. As illustrated in the graph below, red nodes are good authorities, green nodes are good hubs.

To review, authoritative pages relevant to the query should not only have large in-degree, there should also be considerable hub pages pointing to them.
Focused Subgraph
In the search engine, for a broad-topic query particularly, there are large volumes of text-based relevant pages that could be returned, which will entail considerable computational cost. Thus, HITS algorithm only collects the top t (typically set to about 200) authoritative pages into a root set.
Next, expand the root set to include (a) the pages that pages in root set point to, and (b) the pages pointing to pages in root set. Since there could be quite a lot pages of the latter kind, the algorithm supports restriction on the number of pages of the latter kind to be allowed to add. The grown root set is called as base set. Base set is kept relatively small, but rich in relevant authorities.

In the induced subgraph of pages in base set, there are two types of links: Link between pages with different domain names, and link between pages with the same domain name. Since the latter links very often exist purely to navigate inside a site, they convey much less information than the former links about the authority of the pages they point to. Thus, delete all the latter links from the subgraph; this results in the focused subgraph.
Authority and Hub Weights
In search engine application, the algorithm is applied on the focused subgraph. In general applications, the algorithm is applied on the whole graph.
HITS algorithm associates each page (node) with a non-negative authority weight x and a non-negative hub weight y. Pages with larger x and y weights are viewed as better authorities and hubs respectively.
In a directed graph G = (V, E), the x and y weights for all nodes are set to 1 initially. Then the algorithm iterates, in each round: for each node p ∈ V, (1) update its x and y weights as follows:


(2) Normalize both weights to meet the invariant below:

Iterate until the convergence below a tolerance is met, or the number of iterations is reached. In the experiments of the original author, it was found that the convergence is quite rapid, 20 iterations is normally sufficient.
Special Case
Isolated Node, Disconnected Graph
Authority weight and hub Weight of isolated node are both 0.
Multiple authorities are supposed to be found in the same connected component.
Self-loop Edge
In HITS algorithm, self-loop edges are ignored.
Directed Edge
The direction of edge is essential in the analysis of the link structure in HITS algorithm. Authority weights of nodes with no inbound edge are 0, hub weights of nodes with no outbound edge are 0.
Command and Configuration
- Command:
algo(hits_centrality)
- Configurations for the parameter
params()
:
Name |
Type |
Default |
Specification |
Description |
---|---|---|---|---|
max_loop_num | int | 20 | >=1 | Maximum rounds of iterations |
tolerance | float | 0.001 | (0,1) | Tolerance of the convergence; the algorithm ends when the change of authority weights of all nodes and the change of hub weights of all nodes are both smaller than the value of tolerance |
limit | int | -1 | >=-1 | Number of results to return; return all results if sets to -1 |
Example
Task Writeback
1. File Writeback
Configuration | Data in Each Row |
---|---|
filename | _id ,authority ,hub |
Example: Run the HITS algorithm, maximum iteration round is 20, tolerance is 0.001, write the algorithm results back file
algo(hits_centrality).params({}).write({
file: {
filename: "rank"
}
})
2. Property Writeback
Configuration | Writeback Content | Type | Data Type |
---|---|---|---|
authority | authority |
Node property | float |
hub | hub |
Node property | float |
Example: Run the HITS algorithm, maximum iteration round is 20, tolerance is 0.0001, write the algorithm results back to node properties
algo(hits_centrality).params({
max_loop_num: 20,
tolerance: 0.0001
}).write({
db: {
authority: "aa",
hub: "bb"
}
})
3. Statistics Writeback
This algorithm has no statistics.
Direct Return
Alias Ordinal |
Type |
Description | Column Name |
---|---|---|---|
0 | []perNode | Node and its authority and hub weight | _uuid , authority , hub |
Example: Run the HITS algorithm, maximum iteration round is 20, tolerance is 0.0001, return the results with the descending authority weights
algo(hits_centrality).params({
max_loop_num: 20,
tolerance: 0.0001
}) as rank
return rank order by rank.authority desc
Streaming Return
Alias Ordinal |
Type |
Description | Column Name |
---|---|---|---|
0 | []perNode | Node and its authority and hub weight | _uuid , authority , hub |
Example: Run the HITS algorithm, maximum iteration round is 20, tolerance is 0.0001, return ID and hub weight of nodes
algo(hits_centrality).params({
max_loop_num: 20,
tolerance: 0.0001
}).stream() as rank
find().nodes({_uuid == rank._uuid}) as nodes
return table(nodes._id, rank.hub)
Real-time Statistics
This algorithm has no statistics.