Profiling a Person With Search Log Data
Jim Jansen
College of Information Sciences and Technology, The Pennsylvania State University
[email_address]
Interested in how much descriptive information we can generate about a person by leveraging search log data.

What Did We Find Out? We can tell quite a lot!
The State of Web Search
The Power of Search and the Web
- Search is the top online activity.
- Search drives over 7 billion monthly queries in the U.S.
- Online activity has a huge impact on people’s daily lives:
  - 70 minutes less with family
  - 30 minutes less TV
  - 8.5 minutes less sleep
Sources: comScore, U.S., Feb. ’06; Stanford Institute for the Quantitative Study of Society, Nov. ’05

Analysis of Search Marketplace: holding fairly stable over the last year or so, albeit with some Bing flux.
Search Logs
- Search logs contain the trace data recorded when a person visits the search engine, submits a query, views results, etc.
- On one hand, logs have been criticized for not being rich enough (i.e., they capture only behaviors, not the ‘why’ factors).
- On the other hand, logs have been criticized for recording too much about us (i.e., logging a lot of personal information about a person).
- How much can we learn about a person from the data stored in search logs?
- Specifically, how rich a searcher profile can we build of what a person is doing and why they are doing it, and how well can we predict what they are going to do next?

An illustrative example
How much can we tell from a single query?
- ASIS&T is an acronym for the American Society for Information Science and Technology.
- There is a good probability that this user is an academic, a researcher, a librarian, or a student in one of these disciplines.
- Leveraging demographic information:
  - 57 percent female / 43 percent male probability
  - 66.2 percent chance the user works in the information science field
  - 55.6 percent probability the user has a master’s degree

How much can we tell from a single query? (cont’d)
- Leveraging demographic information (cont’d):
  - 32.3 percent probability the user has a doctorate
  - 53 percent likelihood the user works in academia
- Using IP, we can locate the geographical area.
- Based on time, we could infer that this person is:
  - searching for the conference’s schedule for travel (if the query is submitted prior to the meeting), or
  - looking for presentations or papers from the meeting (if the query is submitted after the conference).
- Theoretically, we can tell a lot! However, with billions of queries per month, we can’t do the analysis by hand as in this example. To develop user profiles, we need automated methods.
- Research question: How complete a profile can one develop for a Web search engine user from search log data? [(a) what the user is doing, (b) what the user is interested in, and (c) what the user intends to do]

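The time-based inference from the single-query example can be sketched in a few lines of Python. This is only an illustration: the function name, the intent labels, and the dates in the usage note are hypothetical, and the "during the meeting" case is an added assumption, since the slide only covers queries submitted before or after the conference.

```python
from datetime import date


def infer_conference_intent(query_date: date, meeting_start: date, meeting_end: date) -> str:
    """Guess why someone searched for a conference, based only on timing."""
    if query_date < meeting_start:
        # Query submitted prior to the meeting: likely planning travel.
        return "schedule/travel"
    if query_date > meeting_end:
        # Query submitted after the conference: likely looking for outputs.
        return "presentations/papers"
    # Assumption (not on the slide): timing alone is ambiguous during the meeting.
    return "undetermined"
```

For example, with a hypothetical meeting running from 6 to 11 November, a query on 1 October would be labeled "schedule/travel" and one on 1 December "presentations/papers".
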
Specific aspects with automated methods …
Aspect: what it tells us (method)
- Location: where the user is at (IP look-up script)
- Geographical interest: where the user is going (query term usage)
- Topical interest: what the user is interested in (tools like Open Calais)
- Topical complexity: how motivated the user is (n-grams pattern analysis)
- Content desires: informational, navigational, or transactional (binary tree, k-means clustering)
- Commercial intent: eCommerce related (tools like MSN adLabs)
- Purchase intent: getting ready to buy (session analysis)
- Potential to click on a link: will the user click on a link (time series analysis)
- Gender: demographic targeting/personalization (tools like MSN adLabs)
- User identification: specific user targeting (need a whole lot of data)

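As one illustration of the methods listed above, the following is a minimal sketch of word n-gram counting over query strings, a basic building block for n-gram pattern analysis. It is not the pipeline used in the study; the function names and the sample queries in the usage note are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def query_ngrams(query: str, n: int = 2) -> List[Tuple[str, ...]]:
    """Split a query on whitespace and return its word n-grams, lowercased."""
    terms = query.lower().split()
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]


def ngram_counts(queries: Iterable[str], n: int = 2) -> Counter:
    """Count how often each n-gram occurs across a collection of queries."""
    counts: Counter = Counter()
    for query in queries:
        counts.update(query_ngrams(query, n))
    return counts
```

Calling `ngram_counts(["cheap flights", "cheap flights boston"])` counts the bigram `("cheap", "flights")` twice; frequent n-grams like this could then feed whatever pattern analysis follows.
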
A comment about user identification
Given a group of folks who use a search engine, …
- … we can tell a lot about a person within a group of people with search logs (i.e., behaviors).
- … identifying a particular individual is much more difficult with just search logs (probably takes ~12–18 months of data).

User Profiling Framework
- Classify user aspects into two levels: internal and external.
  - Internal aspects refer to attributes of the users themselves.
  - External aspects relate to the behavior or interest of the users.
- Interaction between internal and external aspects:
  - We can infer external aspects from internal aspects.
  - External aspects reflect internal aspects.

Thank you! (open for questions and further discussion)
Jim Jansen
College of Information Sciences and Technology, The Pennsylvania State University
[email_address]

Search logs have some common fields, such as time, queries, results, etc. We can enrich the log with additional fields.
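
A minimal sketch of what such a record and its enrichment might look like follows. The slide names only time, queries, and results as common fields; the class name, the `enrich` helper, and the three derived fields (`query_length`, `hour_of_day`, `result_count`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class LogRecord:
    """Common search-log fields named on the slide."""
    time: datetime
    query: str
    results: List[str]


def enrich(record: LogRecord) -> Dict[str, Any]:
    """Return the record as a dict with a few derived (hypothetical) fields added."""
    return {
        "time": record.time,
        "query": record.query,
        "results": record.results,
        "query_length": len(record.query.split()),  # number of query terms
        "hour_of_day": record.time.hour,             # coarse time-of-day signal
        "result_count": len(record.results),         # results shown for the query
    }
```

Derived fields like these are cheap to compute once per record and can then be reused by any downstream profiling step.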
