Data Visualisations

Clement Lee
CSC8111, Newcastle University

2018-12-12 (Wed)

Flow of Statistical Analysis

  1. Data scraping/compilation

  2. Exploratory visualisation

  3. Modelling & analysis

  4. Repeat 2 & 3 (if necessary)

  5. Result visualisation

Outline

  1. Temporal data

  2. Spatial data

  3. Network data

  4. Network + temporal data

  5. Principles & tips

Temporal data

Background & Data

  • To understand temporal behaviour on Twitter

  • Game of Thrones Season 7 premiere on 2017-07-16

  • Data: ~4 hours of tweets & retweets with #gots7
    before the premiere

Exploratory visualisation

Count of tweets in each minute
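
A minimal sketch of how such per-minute counts could be computed, in Python with the standard library only. The timestamps are made up for illustration and the helper name `tweets_per_minute` is hypothetical; this is not the code behind the original plot.

```python
from collections import Counter
from datetime import datetime

# Hypothetical tweet timestamps; the real #gots7 data are not reproduced here.
timestamps = [
    "2017-07-16 21:00:05",
    "2017-07-16 21:00:48",
    "2017-07-16 21:01:10",
    "2017-07-16 21:03:59",
]

def tweets_per_minute(stamps):
    """Count tweets in each calendar minute by dropping the seconds."""
    minutes = (
        datetime.strptime(s, "%Y-%m-%d %H:%M:%S").replace(second=0)
        for s in stamps
    )
    return Counter(minutes)

counts = tweets_per_minute(timestamps)
for minute in sorted(counts):
    print(minute.strftime("%H:%M"), counts[minute])
```

Minutes with no tweets are simply absent from the result, so they would need to be filled with zeros before drawing a continuous time series.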

Exploratory visualisation

Retweet growth over absolute time

Exploratory visualisation

Retweet growth over relative time
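
One possible reading of "relative time" is the time elapsed since the original tweet was posted. Under that assumption, the sketch below turns retweet timestamps into (elapsed seconds, cumulative retweet count) pairs. The data and the function name `retweet_growth` are hypothetical.

```python
from datetime import datetime

# Hypothetical original tweet and retweet times, for illustration only.
original_time = datetime(2017, 7, 16, 21, 0, 0)
retweet_times = [
    datetime(2017, 7, 16, 21, 10, 0),
    datetime(2017, 7, 16, 21, 0, 30),
    datetime(2017, 7, 16, 21, 2, 0),
]

def retweet_growth(t0, times):
    """Return (seconds since t0, cumulative retweet count) pairs, in time order."""
    return [
        ((t - t0).total_seconds(), count)
        for count, t in enumerate(sorted(times), start=1)
    ]

growth = retweet_growth(original_time, retweet_times)
for seconds, count in growth:
    print(seconds, count)
```

Putting every tweet on the same elapsed-time axis makes growth curves comparable across tweets posted at different moments.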

Exploratory visualisation

Retweet count vs Follower count

Questions

  • Can we describe retweet growth with an elegant statistical model?

  • Does follower count play a direct role?
    If so, can we incorporate its effect?

  • How do we justify our model? That is, does it predict well?

Result visualisation

Actual data vs Simulated data

Spatial data

Background & Data

  • Joint work with Maarten Vanhoof

  • To understand large-scale human behaviour from mobile phone data

  • Data: ~18 million Orange users in 2007 in France

  • All calls over 5 months
    • Who called whom
    • Cell towers involved
    • Time of call

Background & Data

Home detection

  • Assign home location by calling patterns

  • Example: the cell tower from which calls were made on the highest number of days

  • Aggregate users by cell tower
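
The example rule above can be sketched as follows, in Python with the standard library only. The call records, the record layout and the function name `detect_homes` are assumptions made for illustration; the tie-breaking rule is arbitrary and not taken from the original study.

```python
from collections import Counter, defaultdict

# Hypothetical call records: (user id, cell tower id, date of call).
calls = [
    ("u1", "A", "2007-05-01"),
    ("u1", "A", "2007-05-01"),
    ("u1", "A", "2007-05-02"),
    ("u1", "B", "2007-05-03"),
    ("u2", "B", "2007-05-01"),
    ("u2", "B", "2007-05-02"),
    ("u2", "C", "2007-05-02"),
    ("u2", "C", "2007-05-02"),
    ("u2", "C", "2007-05-02"),
]

def detect_homes(records):
    """Assign each user the tower with calls on the most distinct days."""
    days = defaultdict(set)  # (user, tower) -> set of days with a call
    for user, tower, day in records:
        days[(user, tower)].add(day)

    best = {}  # user -> (number of distinct days, tower)
    for (user, tower), active in sorted(days.items()):
        # Assumption: on a tie, the first tower in sorted order is kept.
        if user not in best or len(active) > best[user][0]:
            best[user] = (len(active), tower)
    return {user: tower for user, (_, tower) in best.items()}

homes = detect_homes(calls)

# Aggregate users by their detected home tower.
users_per_tower = Counter(homes.values())
print(homes)
print(users_per_tower)
```

Note that user u2 makes more calls from tower C, but on fewer distinct days, so tower B is chosen. Counting days rather than calls is what distinguishes this rule from a simple "most calls" criterion.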

Exploratory visualisation

Questions

  • How good are our home detection algorithms?

  • Can we validate them with some kind of ground truth e.g. census data?

  • Is it possible to draw insights into mobility behaviour that census data cannot?

Result visualisation

Network data

Background & Data

  • To review > 100 papers on social network analysis

  • To cluster them according to their topics/themes

  • Data: Citation network of these papers

Exploratory visualisation

Network plot

Questions

  • Topic modelling: word frequencies are analysed to identify topics/themes

  • Can we cluster them according to how they cite each other instead?

  • As a lot of work is interdisciplinary, can papers belong to multiple groups, i.e. be soft clustered?

Result visualisation

Network + temporal data

Background & Data

  • To enrich the analyses on citation networks

  • Data: ACM CHI Conference on Human Factors in Computing Systems
    • 6239 papers, 1981-2018
    • Metadata such as abstract & publication year
    • References/citations!

Exploratory visualisation

Questions

  • Can we cluster the papers, taking publication year into account?

  • What are the emerging/hot topics?

  • Can new topics be discovered?

Result visualisation

Principles & tips

The Grammar of Graphics*

  • Add one layer on top of the other
    • Type: shape for point, linetype for line
    • Size
    • Colour (type or fill)
    • Opacity (alpha)
  • Other aspects
    • Axes & legends
    • Faceting
    • Coordinate systems

* Wilkinson, Leland (2005). The Grammar of Graphics, 2nd Edition, Springer.

Tidy data*

  • Also called long format

  • Each observation is a row

  • Each variable is a column

  • No data in column names

* Wickham, Hadley (2014). Tidy data, The Journal of Statistical Software, 59.

Untidy data

  • Also called wide format

  • Examples
    • Repeated measurements for same subject
    • Pivot table in Excel

From untidy data …

Daily closing prices of major European stock indices in 1991

##    day     DAX    SMI    CAC   FTSE
## 1    1 1628.75 1678.1 1772.8 2443.6
## 2    2 1613.63 1688.5 1750.5 2460.2
## 3    3 1606.51 1678.6 1718.0 2448.2
## 4    4 1621.04 1684.1 1708.1 2470.4
## 5    5 1618.16 1686.6 1723.1 2484.7
## 6    6 1610.61 1671.6 1714.3 2466.8
## 7    7 1630.75 1682.9 1734.5 2487.9
## 8    8 1640.17 1703.6 1757.4 2508.4
## 9    9 1635.47 1697.5 1754.0 2510.5
## 10  10 1645.89 1716.3 1754.3 2497.4
## 11  11 1647.84 1723.8 1759.8 2532.5
## 12  12 1638.35 1730.5 1755.5 2556.8
## 13  13 1629.93 1727.4 1758.1 2561.0
## 14  14 1621.49 1733.3 1757.5 2547.3
## 15  15 1624.74 1734.0 1763.5 2541.5

… to tidy data

##    day index   price
## 1    1   CAC 1772.80
## 2    1   DAX 1628.75
## 3    1  FTSE 2443.60
## 4    1   SMI 1678.10
## 5    2   CAC 1750.50
## 6    2   DAX 1613.63
## 7    2  FTSE 2460.20
## 8    2   SMI 1688.50
## 9    3   CAC 1718.00
## 10   3   DAX 1606.51
## 11   3  FTSE 2448.20
## 12   3   SMI 1678.60
## 13   4   CAC 1708.10
## 14   4   DAX 1621.04
## 15   4  FTSE 2470.40
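
The conversion from the wide table to the long table can be sketched in plain Python as below. Only the first two days of the table above are copied in, and the function name `to_long` and its parameters are hypothetical. The output shown in the slides comes from R, so this is an illustration of the idea rather than the code that produced it.

```python
# First two rows of the wide (untidy) table above.
wide = [
    {"day": 1, "DAX": 1628.75, "SMI": 1678.1, "CAC": 1772.8, "FTSE": 2443.6},
    {"day": 2, "DAX": 1613.63, "SMI": 1688.5, "CAC": 1750.5, "FTSE": 2460.2},
]

def to_long(rows, id_col="day", name_col="index", value_col="price"):
    """Turn each non-id column into its own (id, name, value) row."""
    long_rows = []
    for row in rows:
        # Sort the column names so the output order matches the tidy table.
        for column in sorted(c for c in row if c != id_col):
            long_rows.append(
                {id_col: row[id_col], name_col: column, value_col: row[column]}
            )
    return long_rows

tidy = to_long(wide)
for r in tidy:
    print(r["day"], r["index"], r["price"])
```

The index names move out of the column headers and into a column of their own, which satisfies the "no data in column names" rule. Data-frame libraries offer the same reshaping as a single call, for example `melt` in pandas or `pivot_longer` in tidyr.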

But why bother converting?

  • If you apply the grammar of graphics to tidy data,
    you can almost always visualise what you want

Even this: an XKCD-style plot*

* Torres-Manzanera, Emilio (in press). xkcd: An R package for plotting XKCD graphs, The Journal of Statistical Software.

Summary

  • Think about the most important features / aspects
    • Don’t use the same plot for all data
    • Consider interactive visualisations if possible
  • Good visualisation depends less on the scale of the data
    • Sophisticated ML algorithms may be needed
    • But principles (such as tidy data) remain the same
  • The challenge lies in combining all aspects
    • Temporal
    • Spatial
    • Network

Some useful resources

Thank you for listening!