1. Data scraping/compilation
2. Exploratory visualisation
3. Modelling & analysis
4. Repeat 2 & 3 (if necessary)
5. Result visualisation
Temporal data
Spatial data
Network data
Network + temporal data
Principles & tips
To understand temporal behaviour on Twitter
Game of Thrones Season 7 premiere on 2017-07-16
Data: ~4 hours of tweets & retweets with #gots7 before the premiere
Count of tweets of each minute
Retweet growth over absolute time
Retweet growth over relative time
Retweet count vs Follower count
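A minimal sketch of how the per-minute counts and the relative-time view of retweet growth might be computed. The record layout (tweet id, id of the original tweet or None, ISO timestamp) and the sample values are hypothetical, not the actual #gots7 data.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: (tweet_id, original_tweet_id or None, ISO timestamp).
# The values are made up for illustration only.
tweets = [
    ("a", None, "2017-07-16T21:00:05"),
    ("b", "a", "2017-07-16T21:00:40"),
    ("c", "a", "2017-07-16T21:02:10"),
    ("d", None, "2017-07-16T21:02:30"),
    ("e", "d", "2017-07-16T21:05:30"),
]


def minute_counts(records):
    """Count tweets and retweets per minute (absolute time)."""
    counts = Counter()
    for _tweet_id, _original_id, stamp in records:
        t = datetime.fromisoformat(stamp)
        # Truncate to the start of the minute so all events in it share a key.
        counts[t.replace(second=0, microsecond=0)] += 1
    return counts


def relative_retweet_times(records):
    """For each original tweet, seconds elapsed until each of its retweets."""
    posted = {
        tweet_id: datetime.fromisoformat(stamp)
        for tweet_id, original_id, stamp in records
        if original_id is None
    }
    elapsed = {}
    for _tweet_id, original_id, stamp in records:
        # Skip originals and retweets whose original is not in the sample.
        if original_id is not None and original_id in posted:
            delta = datetime.fromisoformat(stamp) - posted[original_id]
            elapsed.setdefault(original_id, []).append(delta.total_seconds())
    return {key: sorted(values) for key, values in elapsed.items()}


per_minute = minute_counts(tweets)
growth = relative_retweet_times(tweets)
```

Plotting the cumulative count of each sorted list in `growth` against the elapsed seconds gives retweet growth over relative time; using the raw timestamps instead gives the absolute-time view.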
Can we describe retweet growth with an elegant statistical model?
Does follower count play a direct role?
If so, can we incorporate its effect?
How do we justify our model? That is, does it predict well?
Actual data vs Simulated data
Joint work with Maarten Vanhoof
To understand large-scale human behaviour from mobile phone data
Data: ~18 million Orange users in 2007 in France
Home detection
Assign home location by calling patterns
Example: Cell tower from which calls were made for the highest number of days
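The example rule above can be sketched as follows. The record layout (user, tower, date), the sample values, and the tie-breaking choice are assumptions made for illustration; they are not taken from the Orange data set or from the authors' implementation.

```python
from collections import defaultdict

# Hypothetical call records: (user_id, tower_id, date of the call).
calls = [
    ("u1", "T1", "2007-05-01"),
    ("u1", "T1", "2007-05-01"),
    ("u1", "T1", "2007-05-02"),
    ("u1", "T2", "2007-05-02"),
    ("u1", "T2", "2007-05-02"),
    ("u1", "T2", "2007-05-02"),
    ("u2", "T3", "2007-05-01"),
]


def detect_home(records):
    """Assign each user the tower used on the highest number of distinct days."""
    days = defaultdict(set)
    for user, tower, day in records:
        # A set keeps each day once, so repeated calls on a day do not inflate it.
        days[(user, tower)].add(day)

    best = {}
    # Sorted iteration makes ties deterministic: the first tower in sort order
    # wins (an arbitrary assumption, not part of the rule on the slide).
    for (user, tower), active_days in sorted(days.items()):
        if user not in best or len(active_days) > best[user][1]:
            best[user] = (tower, len(active_days))
    return {user: tower for user, (tower, _count) in best.items()}


homes = detect_home(calls)
```

Here `u1` makes more calls from `T2` but is active at `T1` on more distinct days, so `T1` is chosen: the rule counts days, not calls.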
How good are our home detection algorithms?
Can we validate them with some kind of ground truth e.g. census data?
Is it possible to draw insights into mobility behaviour that census data cannot?
To review > 100 papers on social network analysis
To cluster them according to their topics/themes
Data: Citation network of these papers
Network plot
Topic modelling: word frequencies are analysed to identify topics/themes
Can we cluster them according to how they cite each other instead?
As a lot of work is interdisciplinary, can papers belong to multiple groups, i.e. be soft clustered?
To enrich the analyses on citation networks
Can we cluster the papers, taking publication year into account?
What are the emerging/hot topics?
Can new topics be discovered?
* Wilkinson, Leland (2005). The Grammar of Graphics, 2nd Edition, Springer.
Also called long format
Each observation is a row
Each variable is a column
No data in column names
* Wickham, Hadley (2014). Tidy data, The Journal of Statistical Software, 59.
Also called wide format
Daily closing prices of major European stock indices in 1991
##    day     DAX    SMI    CAC   FTSE
## 1    1 1628.75 1678.1 1772.8 2443.6
## 2    2 1613.63 1688.5 1750.5 2460.2
## 3    3 1606.51 1678.6 1718.0 2448.2
## 4    4 1621.04 1684.1 1708.1 2470.4
## 5    5 1618.16 1686.6 1723.1 2484.7
## 6    6 1610.61 1671.6 1714.3 2466.8
## 7    7 1630.75 1682.9 1734.5 2487.9
## 8    8 1640.17 1703.6 1757.4 2508.4
## 9    9 1635.47 1697.5 1754.0 2510.5
## 10  10 1645.89 1716.3 1754.3 2497.4
## 11  11 1647.84 1723.8 1759.8 2532.5
## 12  12 1638.35 1730.5 1755.5 2556.8
## 13  13 1629.93 1727.4 1758.1 2561.0
## 14  14 1621.49 1733.3 1757.5 2547.3
## 15  15 1624.74 1734.0 1763.5 2541.5
##    day index   price
## 1    1   CAC 1772.80
## 2    1   DAX 1628.75
## 3    1  FTSE 2443.60
## 4    1   SMI 1678.10
## 5    2   CAC 1750.50
## 6    2   DAX 1613.63
## 7    2  FTSE 2460.20
## 8    2   SMI 1688.50
## 9    3   CAC 1718.00
## 10   3   DAX 1606.51
## 11   3  FTSE 2448.20
## 12   3   SMI 1678.60
## 13   4   CAC 1708.10
## 14   4   DAX 1621.04
## 15   4  FTSE 2470.40
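A minimal sketch of the wide-to-long reshaping that relates the two tables above. The printed tables look like R output; plain Python is used here only to keep the example dependency-free, and the helper name `wide_to_long` is hypothetical. Only the first two days are included.

```python
# First two rows of the wide table: one column per stock index.
wide = [
    {"day": 1, "DAX": 1628.75, "SMI": 1678.1, "CAC": 1772.8, "FTSE": 2443.6},
    {"day": 2, "DAX": 1613.63, "SMI": 1688.5, "CAC": 1750.5, "FTSE": 2460.2},
]


def wide_to_long(rows, id_col, key_name, value_name):
    """Turn each non-id column into its own row (one observation per row)."""
    long_rows = []
    for row in rows:
        # Sorting the column names reproduces the CAC, DAX, FTSE, SMI order
        # seen in the long table.
        for col in sorted(c for c in row if c != id_col):
            long_rows.append(
                {id_col: row[id_col], key_name: col, value_name: row[col]}
            )
    return long_rows


long_rows = wide_to_long(wide, "day", "index", "price")
for r in long_rows:
    print(r["day"], r["index"], r["price"])
```

The index names, which were column headers in the wide format, become values of the `index` variable, so no data remains in the column names.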
* Torres-Manzanera, Emilio (in press). xkcd: An R package for plotting XKCD graphs, The Journal of Statistical Software.
Thank you for listening!