This vignette shows how to cluster
the vertices of a weighted graph using the random walk method developed
by Harel and Koren [1] with the Rwclust
package using an
example.
The random walk algorithm is based on the concepts of connected components and cut edges. Let G = (V, E) be a graph where V is the set of vertices and E is the set of edges. A connected component is a subset of vertices that are mutually reachable. A graph can have one connected component constituting the entire graph, or several connected components. If a graph has more than one connected component, the graph is said to be disconnected. In this context, the connected components correspond to clusters.
A set of cut edges is a set E′ ⊂ E such that G − E is disconnected. Essentially, E′ contains a set of edges whose removal results in the creation of separate clusters of vertices.
The random walk algorithm finds a set of cut edges but “sharpening” the difference between the weights of edges which connect vertices in within a cluster and the weights of edges that run between clusters. All edges with weights below a certain user-defined threshold are deleted and the resulting connected components become the clusters.
The high-level steps in the algorithm are:
We will use an example graph taken [1]. The data is contained in a
dataframe. We use igraph
to create a graph object a display
it along with the edge weights.
data(example1, package="Rwclust")
head(example1)
#> from to weight
#> 1 1 2 1
#> 2 1 3 1
#> 3 1 4 1
#> 4 2 3 1
#> 5 3 4 1
#> 6 2 5 1
labels <- c(1,1,1,1,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4)
G <- igraph::graph_from_edgelist(as.matrix(example1[, c(1,2)]), directed=FALSE)
G <- igraph::set_edge_attr(G, "weight", value=example1$weight)
plot(G, edge.label=E(G)$weight, vertex.color=labels, layout=layout_with_fr)
Before edges are deleted and the connected components calculated, the user must select a cutoff. To do this we plot a histogram of the edge weights. Note that there appear to be several edges with very small edge weights. 25 seems to be an appropriate cutoff.
The next step is to remove the edges that are below the threshold and compute the connected components.
# delete edges with weights below the threshold
edges_to_keep <- which(weights(result) > 25)
example1_c <- example1[edges_to_keep, ]
example1_c$weight <- weights(result)[edges_to_keep]
G_c <- igraph::graph_from_edgelist(as.matrix(example1_c[, c(1,2)]), directed=FALSE)
# compute the connected components
clusters <- igraph::components(G_c)$membership
plot(G_c, vertex.color=clusters)
Harel, David, and Yehuda Koren. “On clustering using random walks.” International Conference on Foundations of Software Technology and Theoretical Computer Science. Springer, Berlin, Heidelberg, 2001.