Research achievements and plans

Clement Lee

2019-02-01 (Tue)

Overview

  • Postdoc researcher at Newcastle University
  • Open Lab, School of Computing

Research focus

  • Applied Statistics
  • Bayesian Modelling
  • Data Visualisation
  • Network Analysis

Outline

  • PhD + 1st Postdoc
  • Current: Statistical Modelling
  • Current: Data Visualisation
  • Summary & Plans

PhD + 1st Postdoc

House Price Index

  • Price Paid Data from Land Registry

  • ~0.6 million transactions
    • Coordinates of the house
    • Time of transaction
    • No other covariates available

House Price Index

Repeat Sales model in spatial econometrics

  • To infer the overall trend of house prices
  • Attributes of specific houses not required
  • Granularity: one index for the whole county
  • No time series model considered
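
As a rough illustration of the repeat sales idea above, the sketch below regresses the log price ratio of each pair of sales of the same house on period indicators (the classical least-squares formulation). The toy prices, the function names `solve_linear` and `repeat_sales_index`, and the choice of period 0 as the baseline are assumptions made for this example, not details from the talk.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def repeat_sales_index(pairs, n_periods):
    """pairs: (first_period, first_price, second_period, second_price).

    Fits log(p2 / p1) = beta[t2] - beta[t1] + noise by least squares,
    with beta[0] fixed at 0, and returns the log index per period.
    """
    p = n_periods - 1
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for t1, p1, t2, p2 in pairs:
        y = math.log(p2 / p1)
        row = [0.0] * p
        if t1 > 0:
            row[t1 - 1] -= 1.0  # -1 in the period of the first sale
        if t2 > 0:
            row[t2 - 1] += 1.0  # +1 in the period of the second sale
        for i in range(p):
            xty[i] += row[i] * y
            for j in range(p):
                xtx[i][j] += row[i] * row[j]
    return [0.0] + solve_linear(xtx, xty)


# Hypothetical toy data: three houses, each sold twice over three periods.
toy_pairs = [
    (0, 200000.0, 1, 220000.0),
    (1, 150000.0, 2, 165000.0),
    (0, 300000.0, 2, 363000.0),
]
log_index = repeat_sales_index(toy_pairs, n_periods=3)
index_levels = [100.0 * math.exp(b) for b in log_index]
```

Only the two prices and two sale times of each house enter the fit, which is why no house-specific attributes are needed.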

House Price Index

A spatio-temporal model

  • Temporally
    • An autoregressive model for the index at one location
  • Spatially
    • A Gaussian process for the indices at each time point
  • Computational aspects
    • Price index at each location & time point
    • Gibbs sampler for all latent variables (and parameters)
    • Allows predictions of future indices
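
One way to write down a model with this structure, as a sketch only: the notation \(z_t(s)\), \(\phi\), \(\sigma^2\), \(\rho\) and the first-order autoregressive form are assumptions made for illustration and may differ from the model actually fitted.

\[
z_t(s) = \phi\, z_{t-1}(s) + \eta_t(s), \qquad
\eta_t(\cdot) \sim \mathcal{GP}\bigl(0,\ \sigma^2 \rho(\lVert s - s' \rVert)\bigr),
\]

where \(z_t(s)\) is the log price index at location \(s\) and time \(t\), \(\phi\) is the autoregressive coefficient, and \(\rho\) is a spatial correlation function with \(\rho(0) = 1\). Over a finite set of locations, \(\eta_t\) is multivariate normal, so the latent indices have Gaussian full conditionals given the parameters, which suits a Gibbs sampler. Under this sketch, a one-step-ahead prediction at a single location follows from \(z_{T+1}(s) \mid z_T(s) \sim N\bigl(\phi\, z_T(s),\ \sigma^2\bigr)\).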

The independence sampler*

  • For updating augmented data in MCMC algorithms

  • Models with a large number of latent variables

  • Update 1, \(k\), or all of them in each iteration?

  • Optimal when (\(k~\times\) acceptance rate) is maximised

* C. Lee and P. Neal (2018). Optimal Scaling of the Independence Sampler: Theory and Practice, Bernoulli 24 (3), 1636-1652.

The independence sampler

Optimal \(k\) is such that acceptance rate \(\approx\) 23.4%
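
The trade-off between \(k\) and the acceptance rate can be explored with a small simulation. The sketch below is a toy setting chosen for illustration, not the setting of the paper: the target is a product of \(n\) independent N(0, 1) components, the proposal for each updated component is N(0, 1.5²), and the names `log_weight` and `acceptance_rate` are hypothetical.

```python
import math
import random


def log_weight(x, prop_sd):
    """log target minus log proposal density (up to a constant)
    for target N(0, 1) and proposal N(0, prop_sd^2)."""
    return -0.5 * x * x + 0.5 * (x / prop_sd) ** 2


def acceptance_rate(n, k, prop_sd=1.5, n_iter=2000, seed=1):
    """Run an independence sampler that proposes new values for k of the
    n components per iteration and return the observed acceptance rate."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # start from the target
    accepted = 0
    for _ in range(n_iter):
        idx = rng.sample(range(n), k)
        prop = [rng.gauss(0.0, prop_sd) for _ in idx]
        # Independence sampler ratio: product of w(y_j) / w(x_j) over updated j
        log_ratio = sum(
            log_weight(y, prop_sd) - log_weight(x[j], prop_sd)
            for j, y in zip(idx, prop)
        )
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            for j, y in zip(idx, prop):
                x[j] = y
            accepted += 1
    return accepted / n_iter


rates = {k: acceptance_rate(n=50, k=k) for k in (1, 5, 10, 25, 50)}
efficiency = {k: k * rate for k, rate in rates.items()}  # k x acceptance rate
```

Comparing `efficiency` across values of `k` shows the tension: updating more components per iteration moves further when accepted, but is accepted less often.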

Current: Statistical Modelling

Network epidemic

App Movement

  • Start a movement, and share on social networks
  • Potential users support and share
  • App generated automatically from templates
  • Launch the app

Network epidemic

Sharing of movement \(\approx\) spreading an “epidemic”

Network epidemic*

  • Existing work is based on the Bernoulli random graph
    • Constant & independent probability of a connection between any two individuals
  • Generate a network by preferential attachment rules

  • Spread the epidemic by a Susceptible-Infected model

* C. Lee, A. Garbett and D. J. Wilkinson (2018). A network epidemic model for online commissioning data, Statistics and Computing 28 (4), 891-904.
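
The two ingredients above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the growth rule (each new node links to `m` existing nodes with probability proportional to degree), the discrete time steps, and the per-step infection probability `p` are choices made for the example and are not claimed to match the model in the paper.

```python
import random


def preferential_attachment(n, m, rng):
    """Grow an undirected graph on n nodes: each new node links to m distinct
    existing nodes, chosen with probability proportional to current degree."""
    neighbours = {i: set() for i in range(n)}
    targets = []  # each node appears once per unit of degree
    for i in range(m + 1):  # start from a complete graph on m + 1 nodes
        for j in range(i + 1, m + 1):
            neighbours[i].add(j)
            neighbours[j].add(i)
            targets += [i, j]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))
        for old in chosen:
            neighbours[new].add(old)
            neighbours[old].add(new)
            targets += [new, old]
    return neighbours


def si_epidemic(neighbours, seed_node, p, n_steps, rng):
    """Susceptible-Infected spread: at each step, every infected node infects
    each susceptible neighbour independently with probability p."""
    infected = {seed_node}
    history = [len(infected)]
    for _ in range(n_steps):
        newly = set()
        for i in sorted(infected):
            for j in sorted(neighbours[i]):
                if j not in infected and rng.random() < p:
                    newly.add(j)
        infected |= newly
        history.append(len(infected))
    return infected, history


rng = random.Random(2018)
graph = preferential_attachment(n=200, m=2, rng=rng)
infected, history = si_epidemic(graph, seed_node=0, p=0.1, n_steps=20, rng=rng)
n_edges = sum(len(v) for v in graph.values()) // 2
```

In a Susceptible-Infected model nobody recovers, so `history` can only grow or stay flat.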

Network epidemic

Computational issues

  • A connection between two individuals is observed only if an infection occurred between them

  • Missing connections are treated as latent variables and inferred

  • Simulation study reveals identifiability issues

Retweets*

  • To understand temporal behaviour on Twitter

  • To model short-term growth of retweets

  • Data: ~4 hours of tweets & retweets with #gots7 on 2017-07-16 (Game of Thrones Season 7 premiere)

* C. Lee and D. J. Wilkinson (2018). A hierarchical model of non-homogeneous Poisson processes for Twitter retweets, ArXiv e-prints.

Retweets

Retweet growth over time

Retweets

Retweet count vs Follower count

Retweets

A hierarchical Bayesian model

  • The retweets of the \(i\)th original tweet are modelled by a non-homogeneous Poisson process

  • The process intensity \(h_i(t)\) is a product of:
    • \(t^{-\lambda}\exp(-\theta t)\)
    • \(\exp(\text{linear predictor of follower count}_i)\)
  • Estimation: latent variables & universal parameters
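
As a worked consequence of this intensity, the expected number of retweets by time \(t\) has a closed form. Here \(\eta_i\) is shorthand (introduced for this sketch) for the linear predictor of the follower count of tweet \(i\), and the conditions \(\lambda < 1\), \(\theta > 0\) are assumed so that the integral is finite:

\[
h_i(t) = t^{-\lambda} \exp(-\theta t)\, \exp(\eta_i), \qquad
H_i(t) = \int_0^t h_i(u)\, du = \exp(\eta_i)\, \theta^{\lambda - 1}\, \gamma(1 - \lambda,\ \theta t),
\]

where \(\gamma(\cdot,\cdot)\) is the lower incomplete gamma function. Under a Poisson process the retweet count by time \(t\) is Poisson with mean \(H_i(t)\), and as \(t \to \infty\) this mean tends to \(\exp(\eta_i)\, \theta^{\lambda - 1}\, \Gamma(1 - \lambda)\), so the exponential term caps the long-run growth.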

Retweets

Actual data vs Simulated data

Network of social network analysis papers

  • To review > 100 papers on social network analysis (SNA)

  • To cluster them according to how they cite each other

  • To allow papers to belong to multiple groups, as many of them are interdisciplinary

Network of SNA papers*

  • Stochastic block model for clustering nodes in a network

  • Mixed membership version - soft clustering

  • A citation network is a directed acyclic graph (DAG)

  • The data & the application are equally important

* C. Lee and D. J. Wilkinson (2018). A social network analysis of articles on social network analysis, ArXiv e-prints.
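
To make the mixed membership idea concrete, the sketch below simulates a small citation network. It is a generic illustration under stated assumptions, not the model from the paper: papers are indexed in publication order, a paper may only cite earlier papers (which keeps the graph acyclic), memberships are Dirichlet draws, and the function name `simulate_mmsbm_dag` and the block matrix values are hypothetical.

```python
import random


def simulate_mmsbm_dag(n, block, alpha, rng):
    """Simulate a directed acyclic citation network with mixed memberships.

    block[z][w] is the probability that a paper acting as a member of
    group z cites an earlier paper acting as a member of group w.
    """
    k = len(block)
    groups = list(range(k))
    memberships = []
    for _ in range(n):
        # Dirichlet(alpha, ..., alpha) draw via normalised gamma variates
        g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(g)
        memberships.append([v / total for v in g])
    edges = []
    for i in range(n):
        for j in range(i):  # only earlier papers can be cited
            z = rng.choices(groups, weights=memberships[i])[0]
            w = rng.choices(groups, weights=memberships[j])[0]
            if rng.random() < block[z][w]:
                edges.append((i, j))  # paper i cites paper j
    return memberships, edges


rng = random.Random(42)
block = [[0.30, 0.02],
         [0.02, 0.30]]  # papers mostly cite within their own group
memberships, edges = simulate_mmsbm_dag(n=60, block=block, alpha=0.3, rng=rng)
```

Each paper carries a vector of group weights rather than a single label, which is what lets an interdisciplinary paper sit in more than one group.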

Network of SNA papers

Current: Data Visualisation

Home detection*

  • To understand large-scale human behaviour from mobile phone data

  • Calls of ~18 million Orange users in 2007 in France
    • Who called whom
    • Cell towers involved
    • Time of call

* M. Vanhoof, C. Lee and Z. Smoreda (To appear). Performance and sensitivities of home detection on mobile phone data, Big Data Meets Survey Science, Monograph of BigSurv18 Conference.

Home detection

Assigning home location by calling patterns
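
One simple rule of this kind is sketched below. It is an assumed heuristic for illustration only, not necessarily a criterion evaluated in the study: each user's home is taken to be the cell tower with the most calls during an assumed night window (19:00 to 09:00), falling back to the busiest tower overall when there are no night calls. The record layout `(user, tower, hour)` is also hypothetical.

```python
from collections import Counter, defaultdict

NIGHT_START, NIGHT_END = 19, 9  # assumed night window: 19:00 to 09:00


def detect_home(calls):
    """calls: iterable of (user, tower, hour) records.
    Returns {user: tower} using the most-used night-time tower."""
    night_counts = defaultdict(Counter)
    all_counts = defaultdict(Counter)
    for user, tower, hour in calls:
        all_counts[user][tower] += 1
        if hour >= NIGHT_START or hour < NIGHT_END:
            night_counts[user][tower] += 1
    homes = {}
    for user, counts in all_counts.items():
        # Fall back to all calls if the user made no night-time calls
        chosen = night_counts[user] if night_counts[user] else counts
        homes[user] = chosen.most_common(1)[0][0]
    return homes


# Hypothetical toy records
toy_calls = [
    ("a", "T1", 22), ("a", "T1", 23),
    ("a", "T2", 14), ("a", "T2", 15), ("a", "T2", 16),
    ("b", "T3", 12),
]
homes = detect_home(toy_calls)
```

User "a" makes more calls overall from tower T2, but the night-time rule assigns T1, which shows how sensitive the result can be to the chosen criterion.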

Home detection

Validating with census data

Home detection

Monthly variation

Network of computer science conference papers

  • The digital library of the Association for Computing Machinery (ACM): http://dl.acm.org

  • CHI Conference on Human Factors in Computing Systems
    • 6239 papers, 1981-2018
    • Metadata such as abstract & publication year
    • References/citations
  • What are the emerging/hot topics?

Network of CHI papers

Summary & Plans

Research focus

  • Applied Statistics
  • Bayesian Modelling
  • Data Visualisation
  • Network Analysis

                      Data Visualisation/Empirical   Modelling
Temporal                                             Retweets
Spatial + Temporal    Home detection                 House price index
Network                                              Network of SNA papers
Network + Temporal    Network of CHI papers          Network epidemic
                      Future work ->

Future work

  • Extend the current model for citation network by incorporating:
    • Topic/text modelling
    • Dynamic modelling
  • The same ultimate goal: model-based clustering

  • Apply to the network of \(> 6000\) CHI papers

Challenges

  • Modelling
    • Evolution of groups
    • Creation of new topics
    • Optimal number of topics
  • Computational
    • Complexity: at least \(O(n^2)\) for \(n\) papers
    • Mixed membership version much more expensive
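
A quick count gives a sense of scale, assuming the quadratic cost comes from considering every pair of papers and using the 6239 CHI papers mentioned earlier:

```python
n_papers = 6239  # CHI papers, 1981-2018

# Every ordered pair of distinct papers (a general directed network)
ordered_pairs = n_papers * (n_papers - 1)

# In a citation DAG only later-to-earlier pairs are possible
dag_pairs = n_papers * (n_papers - 1) // 2

print(f"{ordered_pairs:,} ordered pairs, {dag_pairs:,} pairs in a DAG")
```

Even the halved count is close to 20 million pairs per sweep, before any extra cost from pair-specific latent variables in a mixed membership model.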

Thank you!