Jiang Zhu and Sean Wang

Dec 5th, 2011




•  Monitor and track user behavior on smartphones using various
 on-device sensors
•  Convert sensory traces and other context information to Personal
 Behavior Features
•  Build Risk Analysis Trees with these features and use them to
 calculate Certainty Scores
•  Trigger various Authentication Schemes when certain applications
 are launched




[Figure: bar chart, "Mobile Device Loss or Theft"; vertical axis from 0% to 60%]

•  "The 329 organizations polled had collectively lost more than 86,000
 devices … with average cost of lost data at $49,246 per device,
 worth $2.1 billion or $6.4 million per organization."
•  "The Billion Dollar Lost-Laptop Study," conducted by Intel Corporation
 and the Ponemon Institute, analyzed the scope and circumstances of
 missing laptop PCs.

Strategy One Survey conducted among a U.S. sample of 3017 adults aged 18 years and older on
   September 21-28, 2010, with an oversample in the top 20 cities (based on population).
•  Application Password
   •  A major source of security vulnerabilities
   •  Easy to guess, reused, forgotten, shared
•  Different applications may have different sensitivities
•  Usability
   •  Authentication is either too frequent or sometimes too loose
Application Access Control
•  MobiSens app collects sensor data
   •  Motion sensors
   •  GPS and WiFi scanning
   •  In-use applications and their traffic patterns

•  SenSec module builds user behavior models
   •  Unsupervised activity segmentation, with the resulting sequence modeled by a
   language model
   •  Builds a Risk Analysis Tree (a decision tree, DT) to detect anomalies
   •  Combines the above to estimate risk online as a certainty score

•  SenSec broadcasts the certainty score to other applications

•  Application Access Control Module receives the score through a broadcast receiver
•  A feature vector calculated from a step window represents the
 behavior state within that time window
   •  surrounding environment: GPS location, WiFi signal
   •  activity: motions, applications in use
   •  communication: network traffic

•  Use a Decision Tree to detect anomalies in behavior
   •  Each node represents a feature dimension
   •  Leaves can be one of the following
    •  Owner Detection: owner [0,1], 0: Anomaly, 1: Normal
    •  User Identification: user id [0,1,…, N], the user's identifier, e.g. IMEI

•  Multiple trees can be built, each on a subset of the feature space, and combined by
   •  Weighted average
   •  Voting
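The two ways of combining several trees named above can be sketched as follows. This is only an illustration: the function names, the example tree outputs, and the weights are hypothetical and do not come from the SenSec implementation.

```python
from collections import Counter
from typing import Sequence


def weighted_average(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-tree owner scores (0 = anomaly, 1 = normal) into one value."""
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


def majority_vote(labels: Sequence[int]) -> int:
    """Return the label predicted by most trees; ties go to the smallest label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)


# Three hypothetical trees, each trained on a different feature subset.
tree_outputs = [1, 1, 0]
tree_weights = [0.5, 0.3, 0.2]  # assumed weights, e.g. per-tree validation accuracy
print(weighted_average(tree_outputs, tree_weights))
print(majority_vote(tree_outputs))
```

Weighted averaging yields a graded value that could serve as a certainty score, while voting yields a hard label, which fits the user identification case.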
•  Convert feature vector series to label streams (dimension reduction)

•  Use n-grams to model the label stream of each sensory
 dimension, capturing both the current state and transitions
•  Step window with an assigned length

[Figure: example label streams, one row per sensory dimension]

   A:  A1  A2  A1  A4
   G:  G2  G5  G2  G2
   W:  W2  W1  W2
   P:  P1  P3  P6  P1

   Fused sequence:  A2 G2G5 W1 P1P3 A1A4 G2 W1W2 P1
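A minimal sketch of how n-grams might be extracted and counted from one label stream. The helper name `ngrams` is hypothetical; the labels are the G-row values from the figure.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(labels: List[str], n: int) -> Iterable[Tuple[str, ...]]:
    """Yield every run of n consecutive labels."""
    for i in range(len(labels) - n + 1):
        yield tuple(labels[i:i + n])


# One label per step window for a single sensory dimension.
stream = ["G2", "G5", "G2", "G2"]

unigram_counts = Counter(ngrams(stream, 1))  # captures the current state
bigram_counts = Counter(ngrams(stream, 2))   # captures transitions between states
print(unigram_counts)
print(bigram_counts)
```

Unigrams record how often each state occurs, and higher-order n-grams record which states tend to follow which.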
•  User behavior at time t depends only on the last n-1 behaviors

•  A sequence of behaviors can be predicted from n consecutive
 behaviors (e.g. locations) in the past

•  Maximum Likelihood Estimation (MLE) from training data by counting n-gram occurrences

•  MLE assigns zero probability to unseen n-grams
   Incorporate a smoothing function (Katz)
    Discount probability for observed n-grams
    Reserve probability for unseen n-grams
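A standard form of the maximum likelihood estimate for an n-gram model is given here as a sketch. The notation is assumed rather than taken from the slides: $b_t$ is the behavior label at time $t$, and $C(\cdot)$ is the number of times a label sequence occurs in the training data.

```latex
P(b_t \mid b_{t-n+1}, \dots, b_{t-1})
  = \frac{C(b_{t-n+1}, \dots, b_{t-1}, b_t)}{C(b_{t-n+1}, \dots, b_{t-1})}
```

An n-gram that never occurs in training has a zero numerator, which is why the estimate needs smoothing before it can score new sequences.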
•  Feed the sequence of past behaviors in a stepping window of size
 N to the n-gram model for testing
•  For a testing sequence of behavior labels, estimate the average log
 probability that the sequence was generated by the n-gram model
•  If this likelihood drops below a threshold, flag an anomaly alert
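The scoring step can be sketched with a bigram model. Note the substitution: this sketch uses add-one smoothing as a simpler stand-in for the Katz smoothing named in the slides, and the class name, training labels, and threshold are all hypothetical.

```python
import math
from collections import Counter
from typing import List


class BigramModel:
    """Bigram model with add-one smoothing (a simpler stand-in for Katz)."""

    def __init__(self, training: List[str]) -> None:
        self.vocab = set(training)
        self.context_counts = Counter(training[:-1])
        self.bigram_counts = Counter(zip(training, training[1:]))

    def log_prob(self, prev: str, cur: str) -> float:
        # One extra vocabulary slot reserves mass for labels unseen in training.
        v = len(self.vocab) + 1
        num = self.bigram_counts[(prev, cur)] + 1
        den = self.context_counts[prev] + v
        return math.log(num / den)

    def average_log_prob(self, window: List[str]) -> float:
        """Average log probability of the transitions inside one window."""
        pairs = list(zip(window, window[1:]))
        if not pairs:
            raise ValueError("window needs at least two labels")
        return sum(self.log_prob(p, c) for p, c in pairs) / len(pairs)


def is_anomaly(model: BigramModel, window: List[str], threshold: float) -> bool:
    """Flag the window when its average log probability is below the threshold."""
    return model.average_log_prob(window) < threshold


# Hypothetical training stream and test windows.
model = BigramModel(["A1", "A2", "A1", "A4"] * 10)
usual = ["A1", "A2", "A1", "A4"]
unusual = ["A4", "A4", "A2", "A2"]
THRESHOLD = -1.5  # assumed value; in practice it would be tuned on held-out data
print(is_anomaly(model, usual, THRESHOLD), is_anomaly(model, unusual, THRESHOLD))
```

Averaging over the window keeps scores comparable across windows of different lengths, so a single threshold can be applied.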
[Figure: SenSec processing pipeline in three stages.
 Sensing: MobiSens Trace, Extract Features.
 Preprocessing: Behavior Text Generation, Fusion.
 Anomaly Detection: N-gram Model and Decision Trees, compared against a Threshold to decide Anomaly Y/N.]
Dataset
   Number of users                              50
   Device                                       Android phones
   Location                                     Bay Area
   Average period                               30 days
   Number of data types                         7
   Finest sampling interval (motion sensors)    200 ms

•  Total data set size: 4GB
•  Remove 2 heavy users
•  Remove users with very limited data duration
•  Remove users that don't have application and traffic data due to an older
 MobiSens version
•  25 users with comparable dataset size
•  Data duration: 4 hours to 2.5 days
•  Motion Sensors (100)
   •  Used to summarize the acceleration stream
   •  Calculated separately for each dimension [x,y,z,m]

•  GPS: location label via density-based clustering (1)

•  WiFi: (SSID, RSSI) pairs ranked by signal strength (6)

•  Applications: Bitmap of well-known applications (60 + 1)

•  Application Traffic Pattern: Tx/Rx traffic vectors (120 + 2)

•  Step Window Size: 5 seconds
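Summarizing the acceleration stream per dimension might look like the sketch below. Two things are assumptions: that `m` is the magnitude of the (x, y, z) vector, and the choice of statistics (mean, standard deviation, minimum, maximum), since the slides do not list which summaries make up the motion features. At the 200 ms sampling interval, a 5 second step window holds 25 samples.

```python
import math
import statistics
from typing import Dict, List, Tuple

Sample = Tuple[float, float, float]  # one accelerometer reading (x, y, z)


def summarize_window(samples: List[Sample]) -> Dict[str, float]:
    """Compute summary statistics for x, y, z and magnitude m over one window."""
    if not samples:
        raise ValueError("window must contain at least one sample")
    xs, ys, zs = zip(*samples)
    # Assumption: m is the Euclidean magnitude of each reading.
    ms = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    features: Dict[str, float] = {}
    for name, values in (("x", xs), ("y", ys), ("z", zs), ("m", ms)):
        features[f"{name}_mean"] = statistics.fmean(values)
        features[f"{name}_std"] = statistics.pstdev(values)
        features[f"{name}_min"] = min(values)
        features[f"{name}_max"] = max(values)
    return features


# A phone lying flat and still: 25 samples cover a 5 s window at 200 ms.
window = [(0.0, 0.0, 9.8)] * 25
features = summarize_window(window)
print(len(features), features["m_mean"], features["m_std"])
```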
•  User Identification Test and Owner Detection Test on a randomly
 selected partial data set (4 users) with a 1:1 training/test split
   •  ~99% accuracy
   •  number of leaves: 56, size of tree: 111

•  Using only non-motion attributes yields lower accuracy (96%)
   •  Significant tree size reduction: number of leaves: 3, size of tree: 5
   •  The cross entropy may be large enough that a few features easily
   distinguish these users

•  Using only motion attributes can still distinguish different users
   •  ~98% accuracy
   •  very large tree: number of leaves: 267, size of tree: 533
   •  may cause performance issues on mobile platforms
•  Apply a cross-entropy filter to remove users that could be identified
 easily using a small set of features
•  12 users with 210k data instances

•  User identification: train the RAT model on 66% of the instances and test
 on the rest

   Feature set      Accuracy    Size Factor
   All              84.8%       221
   Non-Motion       83.5%       35
   Motion-Only      79.3%       7649
•  Experiments detected anomalous usage with ~80% accuracy using
 only days of training data
•  Extended data set for feature construction
   TCP, UDP traffic; sound; ambient lighting; battery status, etc.

•  Data and Modeling
   Gain more insights into the data, features and factorized relationships among
   various sensors
   Try other classification methods and compare results: LR, SVM, Random
   Forest, etc.

•  Enhanced security of SenSec components
   Integration with Android security framework and other applications

•  Privacy challenges
   Data collection, model training, privacy policy, etc.

•  Energy efficiency


Thank you.
•  Data Collection
   •  Running app list
   •  Per-app traffic pattern

•  IPC Interface
   •  Certainty Score
   Broadcast mechanism

[Figure: MobiSens system architecture. Panels: (a) MobiSens Mobile Application, (b) Tier 1, (c) Tier 2. Recoverable component labels include Sensing, Sensor Widgets, Data Aggregator, Activity Segmentation, Device Controller, Model Pushing, Data Upload, Data Exchange API, Message Pool, Web Service, Storage System, Data Preprocessor, Behavior Modeling Algorithms, and Activity Summarization Interface.]

•  Offline-Model Push via Data Exchange API
   •  Risk Analysis Tree can be trained using global data on the MobiSens Server
   and pushed back to the mobile device
•  MobiSens Server
   •  Offline Clustering
    •  K-means package from Weka Data Mining Toolkit
    •  Using aggregated data from all users
   •  Offline RAT training
    •  Decision Tree package from Weka Data Mining Toolkit
    •  Construct training data set and design evaluation strategy

•  MobiSens Client
   •  Retrieve RAT model from MobiSens Server
   •  On-device n-gram label sequence construction (n=1,2,3; window size =5s)
   •  RAT inference using Weka Toolkit on device
   •  Status bar notification based on certainty value




•  Reactive API to Team Access
   API call from Team Access to SenSec to retrieve the current Certainty Score
   given the context

   getCertaintyScore(SenSecContextType ctx, count)


•  Proactive API to Team Access and other equivalent modules
   Broadcast Receiver on Certainty Score

   certaintyScore{
       CertaintyScoreType scores[];
       WindowSizeType window_size;
       SenSecContextType ctx;
   }
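How the reactive and proactive paths could fit together is sketched below in Python rather than as Android code. Everything here is hypothetical: the class and method names, the use of plain callbacks in place of an Android broadcast receiver, and the reading of `count` as the number of most recent scores to return.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CertaintyScore:
    """Mirrors the certaintyScore structure: scores, window size, context."""
    scores: List[float]
    window_size: float  # seconds
    ctx: str


class SenSecService:
    def __init__(self, window_size: float = 5.0) -> None:
        self._window_size = window_size
        self._history: Dict[str, List[float]] = {}
        self._receivers: List[Callable[[CertaintyScore], None]] = []

    def register_receiver(self, receiver: Callable[[CertaintyScore], None]) -> None:
        """Subscribe a callback, standing in for a broadcast receiver."""
        self._receivers.append(receiver)

    def get_certainty_score(self, ctx: str, count: int) -> CertaintyScore:
        """Reactive path: return up to `count` most recent scores for a context."""
        recent = self._history.get(ctx, [])[-count:] if count > 0 else []
        return CertaintyScore(list(recent), self._window_size, ctx)

    def publish(self, ctx: str, score: float) -> None:
        """Proactive path: record a new score and push it to every receiver."""
        self._history.setdefault(ctx, []).append(score)
        message = CertaintyScore([score], self._window_size, ctx)
        for receiver in self._receivers:
            receiver(message)


service = SenSecService()
received: List[CertaintyScore] = []
service.register_receiver(received.append)
service.publish("email", 0.9)
service.publish("email", 0.4)
latest = service.get_certainty_score("email", count=1)
print(latest.scores, len(received))
```

The reactive call suits an application that checks the score only at launch, while the broadcast path lets an access control module react as soon as the score drops.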


SenSec: Mobile Application Security through Passive Sensing
