Data Visualisations

Clement Lee
CSC8111, Newcastle University

2018-12-12 (Wed)

Flow of Statistical Analysis

1. Data scraping/compilation
2. Exploratory visualisation
3. Modelling & analysis
4. Repeat 2 & 3 (if necessary)
5. Result visualisation

Outline

Temporal data
Spatial data
Network data
Network + temporal data
Principles & tips

Temporal data

Background & Data

To understand temporal behaviour on Twitter
Game of Thrones Season 7 premiere on 2017-07-16
Data: ~4 hours of tweets & retweets with #gots7 before the premiere

Exploratory visualisation

Count of tweets of each minute
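
A per-minute count like this can be obtained by truncating each timestamp to the minute and tallying. The following is a minimal sketch, not the code behind the slide: the function name `tweets_per_minute`, the ISO 8601 timestamp format, and the sample timestamps are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def tweets_per_minute(timestamps):
    """Count tweets in each calendar minute.

    `timestamps` is an iterable of ISO 8601 strings (an assumed format).
    Returns a dict mapping each minute to its tweet count, in time order.
    """
    counts = Counter()
    for ts in timestamps:
        # Drop seconds so that all tweets in the same minute share one key.
        minute = datetime.fromisoformat(ts).replace(second=0, microsecond=0)
        counts[minute] += 1
    return dict(sorted(counts.items()))


# Made-up timestamps for illustration only, not the #gots7 data.
example = [
    "2017-07-16T20:00:05",
    "2017-07-16T20:00:40",
    "2017-07-16T20:01:10",
    "2017-07-16T20:03:59",
]
per_minute = tweets_per_minute(example)
for minute, n in per_minute.items():
    print(minute.strftime("%H:%M"), n)
```

Minutes with no tweets are simply absent from the result, so they would need to be filled with zeros before drawing a continuous time series.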

Exploratory visualisation

Retweet growth over absolute time

Exploratory visualisation

Retweet growth over relative time
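
Relative time here is taken to mean time elapsed since the original tweet, so that growth curves of different tweets share a common origin. A minimal sketch under that assumption follows; the function name `retweet_growth`, the ISO 8601 format, and the sample times are hypothetical.

```python
from datetime import datetime


def retweet_growth(original_time, retweet_times):
    """Return (seconds since the original tweet, cumulative retweet count) pairs.

    Both arguments use ISO 8601 strings (an assumed format).
    """
    t0 = datetime.fromisoformat(original_time)
    # Elapsed time of each retweet, sorted so the running count is monotone.
    elapsed = sorted(
        (datetime.fromisoformat(t) - t0).total_seconds() for t in retweet_times
    )
    return [(seconds, count) for count, seconds in enumerate(elapsed, start=1)]


# Made-up times for illustration only.
growth = retweet_growth(
    "2017-07-16T20:00:00",
    ["2017-07-16T20:00:30", "2017-07-16T20:02:00", "2017-07-16T20:01:00"],
)
for seconds, count in growth:
    print(seconds, count)
```

Plotting the count against elapsed seconds for each original tweet gives one growth curve per tweet, all starting at zero.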

Exploratory visualisation

Retweet count vs Follower count

Questions

Can we describe retweet growth with an elegant statistical model?
Does follower count play a direct role?
If so, can we incorporate its effect?
How do we justify our model? That is, does it predict well?

Result visualisation

Actual data vs Simulated data

Spatial data

Background & Data

Joint work with Maarten Vanhoof
To understand large-scale human behaviour from mobile phone data
Data: ~18 million Orange users in 2007 in France
All calls over 5 months
- Who called whom
- Cell towers involved
- Time of call

Background & Data

Home detection

Assign home location by calling patterns
Example: Cell tower from which calls were made for the highest number of days
Aggregate users by cell tower
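
The example rule above (the tower with calls on the highest number of distinct days) and the aggregation step can be sketched as follows. This is an illustration only, not the algorithm used in the study: the record layout `(user, tower, date)`, the function name `detect_home`, the tie-breaking rule, and the sample records are all assumptions.

```python
from collections import Counter, defaultdict


def detect_home(call_records):
    """Assign each user the tower with calls on the most distinct days.

    `call_records` is an iterable of (user, tower, date) tuples (assumed layout).
    Ties go to the tower whose id sorts first (an arbitrary assumption).
    """
    days = defaultdict(set)  # (user, tower) -> set of dates with a call
    for user, tower, date in call_records:
        days[(user, tower)].add(date)

    home = {}  # user -> (tower, number of distinct days)
    for (user, tower), dates in sorted(days.items()):
        if user not in home or len(dates) > home[user][1]:
            home[user] = (tower, len(dates))
    return {user: tower for user, (tower, _) in home.items()}


# Made-up records for illustration only.
records = [
    ("u1", "A", "2007-05-01"),
    ("u1", "A", "2007-05-01"),
    ("u1", "A", "2007-05-01"),
    ("u1", "B", "2007-05-01"),
    ("u1", "B", "2007-05-02"),
    ("u2", "A", "2007-05-03"),
]
homes = detect_home(records)
users_per_tower = Counter(homes.values())  # aggregate users by home tower
print(homes)
print(dict(users_per_tower))
```

In the sample, user `u1` makes more calls from tower `A` but on fewer days than from tower `B`, so the distinct-days rule assigns `B` as home.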

Exploratory visualisation

Questions

How good are our home detection algorithms?
Can we validate them with some kind of ground truth e.g. census data?
Is it possible to draw insights into mobility behaviour that census data cannot?

Result visualisation

Network data

Background & Data

To review > 100 papers on social network analysis
To cluster them according to their topics/themes
Data: Citation network of these papers
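
A citation network is a directed graph: each paper is a node and each citation is an edge from the citing paper to the cited one. A minimal sketch of one possible representation follows; the function name `build_citation_network`, the edge-list input, and the paper ids are hypothetical.

```python
from collections import defaultdict


def build_citation_network(citations):
    """Build a directed citation network from (citing, cited) pairs.

    Returns two dicts: the set of papers each paper cites, and the number
    of times each paper is cited within the collection.
    """
    cites = defaultdict(set)
    for citing, cited in citations:
        cites[citing].add(cited)      # duplicate pairs collapse in the set
        cites.setdefault(cited, set())  # cited-only papers are nodes too

    times_cited = {paper: 0 for paper in cites}
    for cited_papers in cites.values():
        for cited in cited_papers:
            times_cited[cited] += 1
    return dict(cites), times_cited


# Made-up edges for illustration only.
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P1", "P2")]
network, times_cited = build_citation_network(edges)
print(network)
print(times_cited)
```

The adjacency sets are what a network plot or a clustering method would take as input; the citation counts are a simple first summary.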

Exploratory visualisation

Network plot

Questions

Topic modelling: word frequencies are analysed to identify topics/themes
Can we cluster them according to how they cite each other instead?
As a lot of work is interdisciplinary, can papers belong to multiple groups, i.e. be soft clustered?

Result visualisation

Network + temporal data

Background & Data

To enrich the analyses on citation networks
Data: ACM CHI Conference on Human Factors in Computing Systems
- 6239 papers, 1981-2018
- Metadata such as abstract & publication year
- References/citations!

Exploratory visualisation

Questions

Can we cluster the papers, taking publication year into account?
What are the emerging/hot topics?
Can new topics be discovered?

Result visualisation

Principles & tips

The Grammar of Graphics*

Add one layer on top of the other
- Type: shape for point, linetype for line
- Size
- Colour (type or fill)
- Opacity (alpha)
Other aspects
- Axes & legends
- Facetting
- Coordinate systems

* Wilkinson, Leland (2005). The Grammar of Graphics, 2nd Edition, Springer.

Tidy data*

Also called long format
Each observation is a row
Each variable is a column
No data in column names

* Wickham, Hadley (2014). Tidy data, The Journal of Statistical Software, 59.

Untidy data

Also called wide format
Examples
- Repeated measurements for same subject
- Pivot table in Excel

From untidy data …

Daily closing prices of major European stock indices in 1991

##    day     DAX    SMI    CAC   FTSE
## 1    1 1628.75 1678.1 1772.8 2443.6
## 2    2 1613.63 1688.5 1750.5 2460.2
## 3    3 1606.51 1678.6 1718.0 2448.2
## 4    4 1621.04 1684.1 1708.1 2470.4
## 5    5 1618.16 1686.6 1723.1 2484.7
## 6    6 1610.61 1671.6 1714.3 2466.8
## 7    7 1630.75 1682.9 1734.5 2487.9
## 8    8 1640.17 1703.6 1757.4 2508.4
## 9    9 1635.47 1697.5 1754.0 2510.5
## 10  10 1645.89 1716.3 1754.3 2497.4
## 11  11 1647.84 1723.8 1759.8 2532.5
## 12  12 1638.35 1730.5 1755.5 2556.8
## 13  13 1629.93 1727.4 1758.1 2561.0
## 14  14 1621.49 1733.3 1757.5 2547.3
## 15  15 1624.74 1734.0 1763.5 2541.5

… to tidy data

##    day index   price
## 1    1   CAC 1772.80
## 2    1   DAX 1628.75
## 3    1  FTSE 2443.60
## 4    1   SMI 1678.10
## 5    2   CAC 1750.50
## 6    2   DAX 1613.63
## 7    2  FTSE 2460.20
## 8    2   SMI 1688.50
## 9    3   CAC 1718.00
## 10   3   DAX 1606.51
## 11   3  FTSE 2448.20
## 12   3   SMI 1678.60
## 13   4   CAC 1708.10
## 14   4   DAX 1621.04
## 15   4  FTSE 2470.40
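
The conversion from the wide table to the long one amounts to turning each (row, index column) pair into a row of its own. The output shown on these slides is R console output; the sketch below is a plain-Python illustration of the same reshaping, not the code that produced it. The function name `to_long` and its parameters are hypothetical, and only the first two days of the table are used.

```python
def to_long(rows, id_col, var_name, value_name):
    """Reshape wide rows (one column per variable) into long rows.

    Every column other than `id_col` becomes one row holding the column
    name under `var_name` and the cell value under `value_name`.
    """
    long_rows = []
    for row in rows:
        for column in sorted(key for key in row if key != id_col):
            long_rows.append(
                {id_col: row[id_col], var_name: column, value_name: row[column]}
            )
    return long_rows


# First two days of the wide table shown earlier.
wide = [
    {"day": 1, "DAX": 1628.75, "SMI": 1678.1, "CAC": 1772.8, "FTSE": 2443.6},
    {"day": 2, "DAX": 1613.63, "SMI": 1688.5, "CAC": 1750.5, "FTSE": 2460.2},
]
tidy = to_long(wide, id_col="day", var_name="index", value_name="price")
for row in tidy:
    print(row["day"], row["index"], row["price"])
```

Each index name moves out of the column headers and into the `index` column, which is what "no data in column names" asks for.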

But why bother converting?

If you apply the grammar of graphics to tidy data, you can almost always visualise what you want

Even this

* Torres-Manzanera, Emilio (in press). xkcd: An R package for plotting XKCD graphs, The Journal of Statistical Software.

Summary

Think about the most important features / aspects
- Don’t use the same plot for all data
- Consider interactive visualisations if possible
Less dependence on scale of data
- Sophisticated ML algorithms may be needed
- But principles (such as tidy data) remain the same
The challenge lies in combining all aspects
- Temporal
- Spatial
- Network

Some useful resources

Thank you for listening!
