Clustering approach and MCMC practicalities of stochastic block models

Clement Lee

2019-06-13 (Thu)

1. Introduction

Background

A Bigger Picture

Three main types of models for networks

  1. Generative models
    • Preferential attachment model (Barabasi and Albert, 1999)
    • Small world model (Watts and Strogatz, 1998)
  2. Exponential random graph models
  3. Latent models
    • Stochastic block models (SBMs)
    • Latent feature models (Miller, Griffiths and Jordan, 2009; Morup, Schmidt and Hansen, 2011)
    • Latent class analysis models (Ng and Murphy, 2018)
    • Latent space models (Hoff, Raftery and Handcock, 2002; Handcock, Raftery and Tantrum, 2007)

Stochastic Block Models

Notation

Structural Equivalence

Bayesian Inference

MCMC Algorithms

2. Soft Clustering

Comparing Models

Mixed Membership SBM

Airoldi, Blei, Fienberg and Xing (2008)

Likelihood & Inference

Comparing with Clustering Non-relational Data

Latent Dirichlet Allocation (for example)

Mixed membership SBM

Going Back to the Basics?

Some Modifications

To the model

To the MCMC algorithm

3. Back to Hard Clustering

Poisson Approximation

Previously (Bernoulli SBM)

Karrer and Newman (2011); Peixoto (2018)

Comparing the Priors

Bernoulli SBM

Poisson SBM

Integrating out \(\boldsymbol{C}\)

Inference

Prior of \(\boldsymbol{Z}\)

Summary of SBMs

|                      | Bernoulli | Mixed membership | Poisson |
|----------------------|-----------|------------------|---------|
| Clustering           | Hard      | Soft             | Hard    |
| Quantity of interest | \(\boldsymbol{Z}\) | \(\boldsymbol{\Theta}\) | \(\boldsymbol{Z}\) |
| Marginalisation?     | \(\boldsymbol{C}\) | \(\boldsymbol{C}\) & \(\boldsymbol{D}\), or \(\boldsymbol{Z}\) | \(\boldsymbol{C}\) & \(\mu\) |
| Remarks              |           | Neither marginalisation particularly useful; quadratic computational cost | Exponential priors for \(\boldsymbol{C}\) with dep. on \(\boldsymbol{Z}\); can extend to nested version |
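As a concrete illustration of the marginalisation in the Bernoulli column (a sketch under assumptions not stated on the slide: independent Beta\((a, b)\) priors on the entries of \(\boldsymbol{C}\), one per block pair, and an undirected network without self-loops), integrating \(\boldsymbol{C}\) out of the likelihood gives a closed form:

\[ f(\boldsymbol{Y}|\boldsymbol{Z}) = \prod_{k \le l} \frac{B(a + e_{kl},\; b + n_{kl} - e_{kl})}{B(a, b)}, \]

where \(e_{kl}\) is the number of observed edges between groups \(k\) and \(l\), \(n_{kl}\) is the number of possible such edges, and \(B(\cdot,\cdot)\) is the Beta function.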

4. Practicalities

Component-wise Moves

Metropolis

Component-wise Moves

Gibbs

Bigger Moves

Gibbs + M3 (McDaid, Murphy, Friel and Hurley, 2013)

M3 & Label Switching

Propose to jointly reassign the nodes of two groups at random (see the sketch after this list)

  1. All nodes of the two groups are placed in a randomly reordered list
  2. Each node is assigned to one of the two groups according to some assignment probability
  3. Nobile and Fearnside (2007): choose the ratio of the two assignment probabilities as the ratio of the posterior probabilities resulting from the assignments of the previously traversed nodes
    • A heuristic, but one that should lead to “good” choices
  4. The same list is traversed again to calculate the reverse proposal probability
  5. Check the agreement between the current and proposed \(\boldsymbol{Z}\); switch the two labels if the agreement is below 50%
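The steps above can be turned into a short sketch. This is not the implementation of McDaid et al. (2013): `log_post` is an assumed user-supplied function returning the (collapsed) log posterior of an assignment up to a constant, nodes not yet traversed simply keep their current labels rather than being removed and re-added as in the exact allocation sampler, and the posterior is re-evaluated from scratch at each step purely for readability (a practical implementation would update it incrementally).

```python
import numpy as np

def m3_move(z, g1, g2, log_post, rng):
    """One M3-style move (sketch): jointly repropose the members of groups g1 and g2.

    z        -- current hard assignment, 1-D integer array of group labels
    g1, g2   -- labels of the two groups whose nodes are pooled and reallocated
    log_post -- assumed user-supplied function: assignment -> log pi(Z | Y) + const
    rng      -- numpy.random.Generator
    """
    pool = np.flatnonzero((z == g1) | (z == g2))
    order = rng.permutation(pool)                     # step 1: random traversal order

    def sequential_pass(start, target=None):
        """Traverse `order`, assigning each node to g1 or g2 in turn.

        With target=None, sample a fresh assignment and return it together with its
        log proposal probability (forward pass).  With a target assignment, force
        those choices and return their log probability (reverse pass, step 4).
        """
        state = start.copy()
        log_q = 0.0
        for i in order:
            lp = np.empty(2)
            for j, g in enumerate((g1, g2)):          # steps 2-3: probabilities in the
                state[i] = g                          # ratio of the posteriors implied
                lp[j] = log_post(state)               # by the assignments made so far
            p = np.exp(lp - lp.max())
            p /= p.sum()
            j = rng.choice(2, p=p) if target is None else int(target[i] == g2)
            state[i] = (g1, g2)[j]
            log_q += np.log(p[j])
        return state, log_q

    z_prop, log_q_fwd = sequential_pass(z)                  # forward proposal
    _, log_q_rev = sequential_pass(z_prop, target=z)        # step 4: reverse probability

    # Metropolis-Hastings acceptance of the joint reallocation
    log_alpha = log_post(z_prop) - log_post(z) + log_q_rev - log_q_fwd
    if np.log(rng.uniform()) < log_alpha:
        # step 5: if fewer than half of the pooled nodes keep their current label,
        # swap g1 <-> g2 (the posterior is invariant to relabelling the groups)
        if np.mean(z_prop[pool] == z[pool]) < 0.5:
            swapped = z_prop.copy()
            swapped[z_prop == g1] = g2
            swapped[z_prop == g2] = g1
            z_prop = swapped
        return z_prop
    return z
```

Reusing the same random order for the reverse pass is what makes the forward and reverse proposal probabilities comparable in the Metropolis–Hastings ratio.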

Informed Proposals

Zanella (2019)

Multiple Local Modes

Metropolis-coupled MCMC [(MC)\(^3\)] / Parallel Tempering

Model Selection / K

Modelled

Estimated

Criterion

Model Selection / K

Criterion: marginal likelihood

\[ f(\boldsymbol{Y})=\frac{f(\boldsymbol{Y}|\boldsymbol{Z})\,\pi(\boldsymbol{Z})}{\pi(\boldsymbol{Z}|\boldsymbol{Y})} \]

\[ \log f(\boldsymbol{Y})=\underbrace{\log\left[f(\boldsymbol{Y}|\boldsymbol{Z})\,\pi(\boldsymbol{Z})\right]}_{\text{log-ICL}} - \log\pi(\boldsymbol{Z}|\boldsymbol{Y}) \]
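To make the criterion concrete, here is a minimal sketch, not taken from the talk, that evaluates the identity at a single high-posterior hard assignment \(\boldsymbol{Z}^*\) and plugs in a crude empirical estimate of \(\pi(\boldsymbol{Z}^*|\boldsymbol{Y})\) from the MCMC output. `log_lik`, `log_prior` and the list of (consistently relabelled) posterior samples are assumed inputs, and the frequency estimate is only sensible when the chain revisits the same assignment reasonably often.

```python
import numpy as np
from collections import Counter

def log_marginal_likelihood(samples, log_lik, log_prior):
    """Sketch: log f(Y) = log-ICL(Z*) - log pi(Z* | Y), evaluated at one assignment Z*.

    samples   -- posterior draws of Z from the MCMC output, as hashable tuples of
                 labels, already relabelled consistently to undo label switching
    log_lik   -- assumed user-supplied function: Z -> log f(Y | Z) (collapsed likelihood)
    log_prior -- assumed user-supplied function: Z -> log pi(Z)
    """
    counts = Counter(samples)
    z_star, n_star = counts.most_common(1)[0]          # most visited assignment Z*
    log_icl = log_lik(z_star) + log_prior(z_star)      # log-ICL term of the identity
    log_post = np.log(n_star) - np.log(len(samples))   # empirical estimate of log pi(Z*|Y)
    return log_icl - log_post
```

The identity holds for any \(\boldsymbol{Z}\); choosing the most visited assignment simply makes the empirical estimate of \(\pi(\boldsymbol{Z}^*|\boldsymbol{Y})\) as stable as possible.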

Summary

Next steps

Questions? Comments?