Presentation at Socialcom2014: Gauging Heterogeneity in Online Consumer Behaviour Data: A Proximity Graph Approach
- 1. Gauging Heterogeneity in Online Consumer Behaviour Data: A Proximity Graph Approach
Natalie de Vries, Ahmed Shamsul Arefin, Pablo Moscato
The Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM)
School of Electrical Engineering and Computer Science
Faculty of Engineering and Built Environment
The University of Newcastle, Australia
- 2. Agenda
• Introduction and objectives
• Dataset characteristics
• Outline of the study
• Methodology
• Results
• Significance of the work and future research
directions
• Questions
- 3. Introduction
• Increase in online behaviours towards brands
• Increasing importance of social media in marketing strategies
• High levels of heterogeneity amongst consumers
• Need for clustering consumers or objects into similar groups
- 5. Objectives of this Study
• Create an understanding of the natural groupings in a
consumer cohort based on their online consumer behaviours
towards a particular brand
• Find a suitable distance measure for analysing a specific
dataset in a specific context
• Explore the use of meta-features for finding a more accurate
partitioning of respondents
• Uncover the best way to cluster consumers, e.g. using raw
data or a form of meta-features, and using either intra- or
inter-construct relationships
- 6. Methodology: Dataset collection and preparation
Survey Constructs
Construct | Source | Code | Number of Items
Usage Intensity | (Jahn and Kunz 2012) | UI | 3
Functional Value | | FUV | 4
Hedonic Value | | HED | 4
Social Interaction Value | | SOC | 4
Customer Engagement | | CE | 5
Customer Loyalty | | LO | 6
Brand Involvement | (Carlson and O'Cass 2012) | INV | 6
Co-Creation Value | (O'Cass and Ngo 2011) | CCV | 6
SNS-Specific Loyalty Behaviours | (O'Cass and Carlson 2012) | ON | 3
Self-Brand-Congruency | (Hohenstein, Sirgy et al. 2007) | SBC | 5
Respondents’ chosen brands’ categories
Category No. | Explanation | Percentage of sample
1 | Fashion Brands | 31.54%
2 | Community, Charities, Personality and Sports Fan Pages | 23.99%
3 | Other Services | 19.68%
4 | Other Consumer Goods | 8.09%
5 | Hospitality (Restaurants, Cafes, Bars) | 7.28%
6 | Consumer Electronics | 7.01%
7 | Automotive | 2.43%
- 8. Methodology: Difference Meta-features
The difference in values between two measured features may be
able to distinguish between two given categories, even when
neither feature can do so alone (De Paula et al., 2011).
Difference meta-features have previously been applied
successfully to Alzheimer’s Disease biomarker detection
(De Paula et al., 2011; Arefin et al., 2012), both in PLoS ONE.
[Diagram: workflow stages. Data collection and pre-processing; meta-features: pair-wise differences; meta-features: pair-wise products; intra- and inter-construct relationships; distance computation. Stage group label: data preparation]
[Figure: two panels plotting f1, f2 and the meta-feature (Meta-f) for samples from Class A and Class B]
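The pair-wise difference idea on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the item names (`UI1`, `HED2`), the rule that an item's construct is its name without trailing digits, and the helper names `construct_of` and `difference_meta_features` are all assumptions made for the example.

```python
from itertools import combinations

def construct_of(item):
    """Hypothetical naming convention: 'UI1' -> 'UI', 'FUV3' -> 'FUV'."""
    return item.rstrip("0123456789")

def difference_meta_features(row, relation="all"):
    """Return {'a-b': row[a] - row[b]} for pairs of survey items.

    relation: 'intra' keeps pairs inside one construct,
              'inter' keeps pairs spanning two constructs,
              'all' keeps every pair.
    """
    out = {}
    for a, b in combinations(sorted(row), 2):
        same = construct_of(a) == construct_of(b)
        if relation == "intra" and not same:
            continue
        if relation == "inter" and same:
            continue
        out[f"{a}-{b}"] = row[a] - row[b]
    return out

# Hypothetical respondent with four item scores (made-up values)
respondent = {"UI1": 5, "UI2": 3, "HED1": 6, "HED2": 2}
intra = difference_meta_features(respondent, "intra")
inter = difference_meta_features(respondent, "inter")
print(intra)  # pairs within UI and within HED
print(inter)  # pairs that mix UI and HED items
```

Each respondent's row of meta-features can then stand in for the raw item scores when distances between respondents are computed.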
- 9. Methodology: Product Meta-features
The product of values between two measured features may be
able to distinguish between two given categories, even when
neither feature can do so alone.
This study is the first to trial the application of this idea.
Left, the values of f1 (blue) and f2 (red) do not distinguish
the classes well, but their product (meta-feature in green) does.
[Figure: two panels plotting f1, f2 and the product meta-feature (Meta-f) for samples from Class A and Class B]
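A toy illustration of the claim on this slide, using made-up numbers rather than the study's data: the ranges of f1 and f2 overlap between the two classes, while the product of the two features separates them cleanly. The helper names `product_meta_feature` and `ranges_overlap` are hypothetical.

```python
def product_meta_feature(f1, f2):
    """Element-wise product of two feature columns."""
    return [a * b for a, b in zip(f1, f2)]

def ranges_overlap(xs, ys):
    """True if the value ranges [min, max] of the two groups intersect."""
    return min(xs) <= max(ys) and min(ys) <= max(xs)

# Made-up samples: neither f1 nor f2 alone tells the classes apart
class_a = {"f1": [1, 2, 4, 8], "f2": [8, 4, 2, 1]}
class_b = {"f1": [2, 4, 8], "f2": [8, 4, 2]}

prod_a = product_meta_feature(class_a["f1"], class_a["f2"])
prod_b = product_meta_feature(class_b["f1"], class_b["f2"])

print(ranges_overlap(class_a["f1"], class_b["f1"]))  # True: f1 overlaps
print(ranges_overlap(class_a["f2"], class_b["f2"]))  # True: f2 overlaps
print(ranges_overlap(prod_a, prod_b))                # False: product separates
```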
- 10. Methodology: Distance Computation and Dataset Variations
• Distance matrices computed for all 7 datasets
• Various distance and correlation metrics used on each
of the dataset variations
Combinations explored:
• Distance metrics: Pearson, Spearman, Robust, Euclidean, Cosine
• Various datasets: original, difference meta-features, product
meta-features
• Interactions: intra-construct item relationships, inter-construct
item relationships
• Values of k for kNN cliques: k=3, k=4, k=5, k=6
= 7 datasets and 140 graphs
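The distance computation step can be sketched as follows, using only the Python standard library. Two assumptions are made here that the slides do not state: correlations are turned into distances as 1 - r, and the "Robust" metric is left out because the slides do not define it. The count of 140 graphs is consistent with 7 datasets × 5 metrics × 4 values of k.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(x):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Assumption: correlation/similarity s is converted to a distance as 1 - s
METRICS = {
    "pearson": lambda x, y: 1 - pearson(x, y),
    "spearman": lambda x, y: 1 - spearman(x, y),
    "cosine": lambda x, y: 1 - cosine(x, y),
    "euclidean": lambda x, y: math.dist(x, y),
}

def distance_matrix(rows, metric):
    """Full symmetric matrix of distances between respondents (rows)."""
    d = METRICS[metric]
    n = len(rows)
    return [[0.0 if i == j else d(rows[i], rows[j]) for j in range(n)] for i in range(n)]

# Made-up respondents, one list of feature values each
rows = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
matrices = {name: distance_matrix(rows, name) for name in METRICS}

n_graphs = 7 * 5 * 4  # datasets x metrics x values of k, as on the slide
```

Under the 1 - r convention, perfectly correlated respondents sit at distance 0 and perfectly anti-correlated ones at distance 2, so the resulting matrix can be fed directly into a graph-based clustering step.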
- 11. Methodology: MST-kNN and kNN Cliques
Steps illustrated on the slide:
• Complete graph
• Minimum Spanning Tree
• Select and remove edges that are not k-Nearest Neighbors
• Final forest (a forest is a set of trees) = clusters
Previous applications of the MST-kNN method
• U.S. Stock market time series data (Inostroza-Ponta, Berretta, & Moscato, 2011)
• Yeast gene expression data (Inostroza-Ponta, Mendes, Berretta, & Moscato, 2007)
• Alzheimer’s disease data, in the order of 1 million data elements (Arefin, Mathieson, Johnstone, Berretta, & Moscato, 2012)
• Prostate cancer data (Capp et al., 2009)
These examples show that the methodology proposed here has been shown to scale to larger
datasets
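The steps on this slide (complete graph, minimum spanning tree, removal of edges that are not k-nearest neighbours, forest as clusters) can be sketched as a single pass in plain Python. This is a simplified illustration under stated assumptions, not the published MST-kNN implementation: k is fixed by the caller, an MST edge is kept when either endpoint lists the other among its k nearest neighbours, and the merge with kNN cliques mentioned in the results is not included. The function names are hypothetical.

```python
def mst_edges(dist):
    """Prim's algorithm on the complete graph defined by a distance matrix."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree
        u, v = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.append((u, v))
        in_tree.add(v)
    return edges

def knn(dist, i, k):
    """Indices of the k nearest neighbours of node i."""
    others = sorted((j for j in range(len(dist)) if j != i), key=lambda j: dist[i][j])
    return set(others[:k])

def mst_knn_clusters(dist, k):
    n = len(dist)
    neighbours = [knn(dist, i, k) for i in range(n)]
    # Assumption: keep an MST edge if either endpoint is a kNN of the other
    kept = [(u, v) for u, v in mst_edges(dist)
            if v in neighbours[u] or u in neighbours[v]]

    # Connected components of the remaining forest (union-find)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in kept:
        parent[find(u)] = find(v)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Made-up 1-D points forming two well-separated groups
points = [0, 1, 3, 10, 11, 13]
dist = [[abs(a - b) for b in points] for a in points]
print(mst_knn_clusters(dist, k=1))
```

With a small k the long MST edge bridging the two groups is dropped and two clusters remain; with k large enough that every node is a neighbour of every other, all MST edges survive and a single cluster is returned.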
- 15. Results: Clustering and Significance Values
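Data | Rows selected | Distance Metric | MST-kNN merged with the kNN cliques of size | p-value (Wilcoxon’s Test) | p-value (Kruskal-Wallis)
Original | All | Robust | 5NN | 0.021187 | 0.042364
Original | All | Spearman | 6NN | 0.025987 | 0.051962
Original | All | Robust | 6NN | 0.028565 | 0.057117
Original | All | Pearson | 3NN | 0.030232 | 0.060451
Original | All | Spearman | 3NN | 0.040661 | 0.081306
Original | All | Euclidean | 6NN | 0.041232 | 0.082448
Difference Metafeatures | ‘Intra’ constructs | Robust | 3NN | 0.016551 | 0.033095
Difference Metafeatures | ‘Intra’ constructs | Robust | 6NN | 0.017177 | 0.03434
Difference Metafeatures | ‘Intra’ constructs | Pearson | 3NN | 0.018628 | 0.0372481
Difference Metafeatures | ‘Intra’ constructs | Pearson | 6NN | 0.019066 | 0.038124
Difference Metafeatures | ‘Intra’ constructs | Pearson | 5NN | 0.019656 | 0.039303
Difference Metafeatures | All | Pearson | 3NN | 0.020594 | 0.041180
Product Metafeatures | ‘Inter’ constructs | Spearman | 3NN | 0.016949 | 0.033891
Product Metafeatures | ‘Inter’ constructs | Pearson | 4NN | 0.01757 | 0.035132
Product Metafeatures | All | Pearson | 4NN | 0.017721 | 0.035433
Product Metafeatures | ‘Inter’ constructs | Pearson | 6NN | 0.01781 | 0.035611
Product Metafeatures | ‘Inter’ constructs | Pearson | 3NN | 0.017816 | 0.035624
Product Metafeatures | ‘Inter’ constructs | Robust | 4NN | 0.017998 | 0.035988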
- 16. Results: Analysis of clusters
Clusters’ demographic information
Cluster | No. of respondents | Avg. Age | Age range | % Males / Females
1 | 103 | 20.5 | 17-32 | 39.8 / 60.2
2 | 92 | 21.3 | 18-36 | 39.1 / 60.9
3 | 31 | 23.4 | 19-49 | 51.6 / 48.4
4 | 71 | 21.0 | 18-44 | 40.8 / 59.2
5 | 4 | 22.3 | 20-24 | 75 / 25
6 | 18 | 21.1 | 18-26 | 33.3 / 66.7
7 | 10 | 22.5 | 18-29 | 20 / 80
8 | 5 | 21 | 20-24 | 80 / 20
9 | 20 | 23 | 19-44 | 45 / 55
10 | 12 | 22 | 18-45 | 41.7 / 58.3
11 | 5 | 26.4 | 20-46 | 0 / 100
[Figure: Heterogeneous spread of respondents’ chosen brand categories]
This figure presents the frequencies of the respondents’ chosen
brand categories for two of the largest clusters.
The difference in degrees of heterogeneity between clusters can
be seen in these figures.
Furthermore, these two clusters highlight the differences in brand
preferences that exist amongst respondents within each cluster of
similar consumers.
- 17. Contribution and Significance
• Methodological guide for investigating several distance
measures, meta-features and relationships between theoretical
construct items to find the ‘best’ clustering results
• Expanded on the MST-kNN clustering method for increased
potential to find statistically significant clusters of categories
of consumers and their chosen brands
• The clustering methodology used in this study highlights the
high levels of heterogeneity found in consumers’ online
behaviours towards brands
- 18. Future Research Directions
• Apply the novel process outlined in this study to various other
domains and contexts
• Combine a study using survey data as well as ‘live’ behaviour data
from social networking sites (real-time interactions)
• Further exploration of meta-features in both survey data and ‘real’
online behaviour clustering studies; ‘differences’ meta-features in
this study yielded better results
• This study guides the development of future feature selection
models to identify groups of consumers according to higher-order
characteristics.
- 19. Thank you
Questions?
We would like to thank Dr. Jamie Carlson and Mr. Benjamin Lucas for their advice and proofreading.
Dr. Jamie Carlson supervised Ms. de Vries’ thesis project and the initial collection and analysis of this data.
Thanks to Mario Inostroza-Ponta for the use of his MST-kNN images.
- 20. References (from paper)
• [1] I. P. Cvijikj and F. Michahelles, "Online engagement factors on Facebook brand pages," Social Network Analysis and Mining, vol. 3, pp. 843-861, 2013.
• [2] B. Jahn and W. Kunz, "How to transform consumers into fans of your brand," Journal of Service Management, vol. 23, pp. 344-361, 2012.
• [3] T. S. Chung and M. Wedel, "Adaptive personalization of mobile information services," in Handbook of Service Marketing Research, R. T. Rust and M.-H. Huang, Eds. Cheltenham: Edward Elgar Publishing Limited, 2014.
• [4] N. J. de Vries, J. Carlson, and P. Moscato, "A Data-Driven Approach to Reverse Engineering Customer Engagement Models: Towards
Functional Constructs," PLoS ONE, vol. 9, p. e102768, 2014.
• [5] B. Jahn and W. Kunz, "How to Transform Consumers into Fans of your Brand," Journal of Service Management, vol. 23, pp. 344-361, 2012.
• [6] J. Carlson and A. O'Cass, "Optimizing the Online Channel in Professional Sport to Create Trusting and Loyal Consumers: The Role of the Professional Sports Team Brand and Service Quality," Journal of Sport Management, vol. 26, p. 463, 2012.
• [7] N. Hohenstein, M. J. Sirgy, A. Herrmann, and M. Heitmann, "Self-Congruity: Antecedents and Consequences," in 34th La Londe International Research Conference in Marketing Communications and Consumer Behaviour, Aix-en-Provence, France: University Paul Cezanne, 2007, pp. 118-130.
• [8] A. O'Cass and L. Ngo, "Examining the Firm’s Value Creation Process: A Managerial Perspective of the Firm’s Value Offering Strategy and
Performance," British Journal of Management, vol. 22, pp. 646-671, 2011.
• [9] A. O'Cass and J. Carlson, "An Empirical Assessment of Consumers' Evaluations of Web Site Service Quality: Conceptualizing and Testing a
Formative Model," Journal of Services Marketing, vol. 26, pp. 419-434, 2012.
• [10] L. D. Peters, "Theory Testing in Social Research," The Marketing Review, vol. 3, pp. 65-82, 2002.
• [11] M. R. de Paula, M. G. Ravetti, R. Berretta, and P. Moscato, "Differences in Abundances of Cell-Signalling Proteins in Blood Reveal Novel
Biomarkers for Early Detection Of Clinical Alzheimer’s Disease," PLoS ONE, vol. 6, pp. 1-14, 2011.
• [12] A. S. Arefin, L. Mathieson, D. Johnstone, R. Berretta, and P. Moscato, "Unveiling Clusters of RNA Transcript Pairs Associated with Markers of Alzheimer's Disease Progression," PLoS ONE, vol. 7, p. e45535, 2012.
• [13] M. Inostroza-Ponta, R. Berretta, A. Mendes, and P. Moscato, "An automatic graph layout procedure to visualize correlated data," in
Artificial Intelligence in Theory and Practice, ed: Springer, 2006, pp. 179-188.
• [14] A. S. Arefin, L. Mathieson, D. Johnstone, R. Berretta, and P. Moscato, "Unveiling clusters of RNA transcript pairs associated with markers of
Alzheimer’s disease progression," PLoS ONE, vol. 7, p. e45535, 2012.
• [15] A. Capp, M. Inostroza-Ponta, D. Bill, P. Moscato, C. Lai, D. Christie, et al., "Is there more than one proctitis syndrome? A revisitation using data from the TROG 96.01 trial," Radiotherapy and Oncology, vol. 90, pp. 400-407, 2009.
• [16] M. Inostroza-Ponta, A. Mendes, R. Berretta, and P. Moscato, "An integrated QAP-based approach to visualize patterns of gene expression
similarity," in Progress in Artificial Life, ed: Springer, 2007, pp. 156-167.
• [17] M. Inostroza-Ponta, R. Berretta, and P. Moscato, "QAPgrid: A two level QAP-based approach for large-scale data analysis and visualization," PLoS ONE, vol. 6, p. e14468, 2011.
• [18] A. S. Arefin, M. Inostroza-Ponta, L. Mathieson, R. Berretta, and P. Moscato, "Clustering nodes in large-scale biological networks using
external memory algorithms," in Algorithms and Architectures for Parallel Processing, ed: Springer, 2011, pp. 375-386.
• [19] A. S. Arefin, C. Riveros, R. Berretta, and P. Moscato, "Gpu-fs-knn: A software tool for fast and scalable knn computation using GPUs," PLoS
ONE, vol. 7, p. e44000, 2012.
- 21. References cont.
• [20] A. S. Arefin, C. Riveros, R. Berretta, and P. Moscato, "kNN-Borůvka-GPU: A Fast and Scalable MST Construction from kNN Graphs on GPU," in Computational Science and Its Applications–ICCSA 2012, ed: Springer, 2012, pp. 71-86.
• [21] A. S. Arefin, C. Riveros, R. Berretta, and P. Moscato, "kNN-MST-Agglomerative: A fast and scalable graph-based data clustering approach on GPU," in Computer Science & Education (ICCSE), 2012 7th International Conference on, 2012, pp. 585-590.
• [22] E. J. Chesler and M. A. Langston, Combinatorial genetic regulatory network analysis tools for high throughput transcriptomic data: Springer, 2006.
• [23] M. Hollander, D. A. Wolfe, and E. Chicken, Nonparametric statistical methods vol. 751: John Wiley & Sons, 2013.
• [24] C. E. Shannon, "The mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379-423 & 623-656, 1948.
• [25] A. W. Kruglanski, "The Human Subject in the Psychology Experiment: Fact and Artifact," in Advances in Experimental Social Psychology vol. 8, L. Berkowitz, Ed. New York: Academic Press, 1975, pp. 101-147.
• [26] H. Krasnova, S. Spiekermann, K. Koroleva, and T. Hildebrand, "Online Social Networks: Why we Disclose," Journal of Information Technology, vol. 25, pp. 109-125, 2010.
• [27] S. C. Chu and Y. Kim, "Determinants of Consumer Engagement in electronic Word of Mouth (eWoM) in Social Networking Sites," International Journal of Advertising, vol. 30, pp. 47-75, 2011.
• [28] J. M. Pinho and A. M. Soares, "Examining the Technology Acceptance Model in the Adoption of Social Networks," Journal of Research in Interactive Marketing, vol. 5, pp. 116-129, 2011.
Additional references from presentation:
• Arefin AS, Mathieson L, Johnstone D, Berretta R, Moscato P (2012) Unveiling Clusters of RNA Transcript Pairs Associated with Markers of Alzheimer’s Disease Progression. PLoS ONE 7(9): e45535. doi: 10.1371/journal.pone.0045535
• Rocha de Paula M, Gómez Ravetti M, Berretta R, Moscato P (2011) Differences in Abundances of Cell-Signalling Proteins in Blood Reveal Novel Biomarkers for Early Detection Of Clinical Alzheimer's Disease. PLoS ONE 6(3): e17481. doi: 10.1371/journal.pone.0017481