Ensuring Property Portal Listing Data Security
Speakers
Rami Essaid
CEO & Co-founder
Distil in Real Estate
What Is Web Scraping?
Web Scraping
Also known as screen scraping, web scraping is the act of
copying large amounts of data from a website – either
manually or with an automated program.
Legitimate Scraping
Scraping can sometimes be benevolent and totally
acceptable. For example, the search engine bots that index
your website
Malicious Scraping
A systematic theft of intellectual property accessible on a
website, including pricing, content, images, and proprietary
data
Web scraping of listing data results in competitive disadvantage, damage to
brand reputation and loss of revenue and customers
Presenting inaccurate listing data - once listing data has been scraped, you
lose control over the presentation of the property
Damaged SEO page rank due to stolen data duplicated on other websites
Skewed analytics data leads to misinformed business decisions
Slowdowns and downtime due to excessive scraping
How Web Scraping Impacts Real Estate Portals
Cost of scraping / acquiring data has gone down
The ability to scrape has become easier and more accessible
Virtual servers and bandwidth are cheap
Sophistication of Botnet as a Service has increased
The Growth of the Scraping Problem
Survey Respondents
100 real estate executives representing
over 600,000 realtors
14 real estate portal operators running
400,000 real estate websites
2015 Real Estate Web Scraping Survey
We asked real estate portals “how much scraping is acceptable for your business
model and operational budget?”
How much Scraping Traffic is Too Much for your Business?
43% responded “Less than 1%”
28% responded “Less than 15%”
Up to 25% of website traffic on real estate portals is from scrapers
How Much Bot Traffic do Real Estate Portals Actually See?
Real Estate Sites: Bad Bots 25%, Good Bots 35%, Humans 40%
Source: “Distil Networks, The 2015 Bad Bot Landscape Report”
71% of those surveyed said their business model could NOT handle this level of scraping
Actual Real Estate Portal Traffic
~80% are relying on the wrong tools to detect the problem
Advanced bots and distributed scraping may not be apparent in log analysis
Only 21% are relying on
commercial tools
Why Aren’t Portals Aware of The Scope of the problem?
Legacy Tools are Ineffective on the Modern Bots
Real estate portals are largely
implementing the wrong tools
Top implemented anti-scraping
solutions are
● IP Blocking
● Rate limiting (based on IPs)
● WAFs
Why can’t these tools keep up?
Source: “Distil Networks, 2015 Study of Scraping Real Estate Websites and MLS Data Security”
IP Blocking is always one step behind attackers
Attackers rotate IP addresses from huge pools of IPs
IP addresses can easily be spoofed
Anonymous proxies help mask user origins
IP Based Solutions are too Reactive
Attacks are often distributed among many IP Addresses
Scraping happens at a very slow pace but from many sources
Low and Slow Attacks Evade Rate Limiting
1 IP scraping 1,000 pages = 500 IPs scraping 2 pages each
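The arithmetic on this slide can be illustrated with a minimal sketch of a naive per-IP rate limiter. The threshold (LIMIT), the example IP addresses, and the function name are assumptions made for illustration; they do not come from the survey or from any particular product.

```python
from collections import Counter

def blocked_requests(requests, limit_per_ip):
    """Count how many requests a naive per-IP limit would reject.

    `requests` is an iterable of source IP strings, one per page request,
    all assumed to fall inside a single rate-limit window.
    """
    seen = Counter()
    blocked = 0
    for ip in requests:
        seen[ip] += 1
        if seen[ip] > limit_per_ip:
            blocked += 1
    return blocked

LIMIT = 100  # hypothetical threshold: maximum pages per IP per window

# One IP requesting 1,000 pages.
single_source = ["203.0.113.1"] * 1000

# 500 IPs requesting 2 pages each (the same 1,000 pages in total).
distributed = [
    f"10.0.{i // 250}.{i % 250 + 1}" for i in range(500) for _ in range(2)
]

single_blocked = blocked_requests(single_source, LIMIT)
distributed_blocked = blocked_requests(distributed, LIMIT)
print(single_blocked, distributed_blocked)
```

Under these assumptions the single-source scraper loses most of its requests to the limiter, while the distributed scraper retrieves all 1,000 pages because no individual IP ever approaches the threshold.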
Bad guys have more tools to leverage when building bots
Web Browsers are Becoming More Complex
The Evolution of the Web
Browser versions and their Technologies
Source: http://www.evolutionoftheweb.com
Advanced bots use browser capabilities to evade detection and mimic human
behavior
Bots are Increasingly Able to Mimic Humans
Bad Bot Sophistication levels, 2014
What should bot defenses look like?
Tools Must Leverage Many Techniques to Detect Advanced Bots
Identifying advanced bots and browser
automation requires specialized techniques
Commercial, purpose-built solutions tend to
have more automation checks
Approaches to Detecting Bots, by Tier
IP blocking is not effective when dealing with modern threats
Device fingerprinting provides distinct advantages like
○ Tracking attackers across IP addresses
○ Detecting bots through anonymous proxy networks
○ Reducing false positives associated with
humans anonymizing themselves
Use Device Fingerprinting Instead of IP Blocking
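A rough sketch of the idea on this slide, tracking a client across IP addresses by a fingerprint rather than by the address itself. The attribute set, the MIN_IPS threshold, and all names below are hypothetical; a production fingerprint would draw on many more signals, and this is not a description of how any commercial product computes one.

```python
import hashlib
from collections import defaultdict

# Hypothetical attribute set chosen for illustration only.
FINGERPRINT_FIELDS = ("user_agent", "accept_language", "screen", "timezone")
MIN_IPS = 3  # hypothetical: flag a fingerprint seen from this many IPs

def fingerprint(request):
    """Hash a few client attributes into a stable identifier."""
    raw = "|".join(str(request.get(field, "")) for field in FINGERPRINT_FIELDS)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

def ips_per_fingerprint(requests):
    """Group the source IPs of all requests by client fingerprint."""
    groups = defaultdict(set)
    for request in requests:
        groups[fingerprint(request)].add(request["ip"])
    return groups

bot = {"user_agent": "ExampleBrowser/1.0", "accept_language": "en-US",
       "screen": "1024x768", "timezone": "UTC"}
human = {"user_agent": "OtherBrowser/2.0", "accept_language": "nl-NL",
         "screen": "1920x1080", "timezone": "Europe/Amsterdam"}

# The same client rotating through five IPs, plus one ordinary visitor.
traffic = [dict(bot, ip=f"198.51.100.{n}") for n in range(1, 6)]
traffic.append(dict(human, ip="192.0.2.10"))

groups = ips_per_fingerprint(traffic)
suspects = [fp for fp, ips in groups.items() if len(ips) >= MIN_IPS]
print(suspects)
```

Keying on the fingerprint keeps the rotating client visible as a single actor even though an IP-based counter would see five unrelated visitors. Many legitimate users can share the same coarse attributes, so a real system would need richer signals before enforcing anything.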
Community-sourced attack data aggregation provides a more accurate data
source for enforcement
Machine learning and self-configuration greatly
reduce security maintenance overhead
Community Sourced Intelligence Improves Accuracy
Mobile users now outnumber desktop
users
Mobile clients are now being used to
launch attacks
Mobile sites tend to be easier to
scrape
○ Less superfluous content
○ Highly structured and easy to
navigate layouts
Mobile Growth Brings With it Mobile Threats
Source: Comscore, The US Mobile App report
Precautions should be implemented to extend security strategies to cover mobile
websites
Mobile clients need to be subjected to the same scrutiny as other users
Mobile Should not be Overlooked
The World’s Most Accurate Bot Detection System
Inline Fingerprinting
Fingerprints stick to the bot even if it attempts to
reconnect from random IP addresses or hide behind an
anonymous proxy.
Known Violators Database
Real-time updates from the world’s largest Known
Violators Database, which is based on the collective
intelligence of all Distil-protected sites.
Browser Validation
The first solution to disallow browser spoofing: it validates
each incoming request as self-reported and
detects all known browser automation tools.
Behavioral Modeling and Machine Learning
Machine-learning algorithms pinpoint behavioral
anomalies specific to your site’s unique traffic patterns.
Challenges and Distil Results
Challenge: Homegrown ‘IP blocking’ solution costly to maintain
Distil result: Automated bot defense eliminated the need for manual tuning and maintenance
Challenge: Had to overprovision infrastructure to account for random spikes in bot traffic
Distil result: Eliminated attacks from 90+ countries representing over 99.9% of bad bots
Challenge: Web scraping bots broke through their defenses
Distil result: Stopped thousands of threats from imposter Googlebots, making single page requests from 1000+ IP addresses/month
Onthehouse Saves Infrastructure Costs by Blocking Bad Bots
Australia’s only free property research portal, covering 98%
of Australian properties
“Distil was quick to setup and ensures that we block the bots that
are dangerous to our organization.”
- Arun Thenabadu, CTO of Onthehouse
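The imposter Googlebots mentioned in the results above are clients that claim a Googlebot user agent without coming from Google. One widely documented check, which is not necessarily how Distil identified them, is a reverse DNS lookup on the requesting IP followed by a forward lookup on the returned hostname. The sketch below assumes IPv4 and assumes the hostname suffixes listed in GOOGLE_SUFFIXES; the function name and the injectable lookup parameters are hypothetical, added so the logic can be exercised without network access.

```python
import socket

# Assumed hostname suffixes for genuine Google crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_genuine_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Verify a claimed Googlebot IP with reverse DNS, then forward DNS."""
    # Default to real DNS lookups; callers may inject stand-ins for testing.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the original IP; otherwise the
        # reverse record alone could have been set by the attacker.
        return ip in forward_lookup(hostname)
    except OSError:
        # Lookup failures are treated as "not verified".
        return False

# Stand-in DNS data for illustration; these records are made up.
fake_rdns = {
    "66.249.66.1": "crawl-66-249-66-1.googlebot.com",
    "203.0.113.9": "host.example.net",
    "203.0.113.10": "fake.googlebot.com",
}
fake_fdns = {
    "crawl-66-249-66-1.googlebot.com": ["66.249.66.1"],
    "fake.googlebot.com": ["198.51.100.7"],
}

def rdns(addr):
    return fake_rdns[addr]

def fdns(host):
    return fake_fdns[host]

print(is_genuine_googlebot("66.249.66.1", rdns, fdns))
print(is_genuine_googlebot("203.0.113.10", rdns, fdns))
```

The forward lookup matters because the owner of an IP block controls its reverse DNS record, so a matching hostname alone is not proof of origin.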
www.distilnetworks.com/trial/
Offer Ends: November 6th
Two Months of Free Service + Traffic Analysis
www.distilnetworks.com
QUESTIONS….COMMENTS?
INFO@DISTILNETWORKS.COM
OR CALL US ON 1.866.423.0606
Abstract
Session Time: 20 minutes
Industry: Real Estate (Global - show is in Amsterdam)
Title: Ensuring Property Portal Listing Data Security
Subtitle: Don’t Bother with Litigation, Just Protect Your Listing Data Before the Theft Occurs
Abstract:
Securing your property portal listing data is harder than ever. Why? Web scraping is cheap and easy. Bots simply steal whatever content they’ve been
programmed to fetch – listing text, photos, and other data that should only be available to paid subscribers and legitimate consumers.
Attend this session to learn how to avoid expensive litigation by protecting your content before the theft occurs. Review the latest research on how non-human
traffic has evolved over the past few years and best practices to protect both copyrighted and non-copyrightable content.
Hear the results from research conducted with property portal executives on the current state of anti-scraping efforts.
Key takeaways include:
Insights into the latest research about “scraping” property portal websites
How web scraping works and what you can do to shore up your defenses
How to create a secure listing “supply chain” with your upstream and downstream partners
How to protect your brand image, reputation and SEO rankings
Bots Aren’t solving CAPTCHAs
○ Realtor.org offers free tools to track data - Reactive = expensive
Checklist for Syndication has many references to data scraping – legal guidance
NoScrape – aborted project - no update since 2010?
Problem is not going away
Industry Help? ...is Way behind on Bad Bots
Ads for Scraping Programs
on Realtor.com!
○ Realtor.com blog to “deter scraping” relies on
obsolete IP address blocking and expensive IP
litigation
“REALTOR.com® logging, tracking and monitoring
patterns that indicate data is being stolen for these
illegitimate purposes. Once an offender is identified, their
IP address is blocked from accessing the site.”
(Oct 10, 2014)

Editor's Notes

  1. This is here to show that we know what we’re talking about. This is a sample of some of our real estate customers.
  2. Slide owner: Rami
  3. Describe the problems real estate portals face as a result of scraping bots. This is not just limited to web scraping but also SEO damage, skewed data and website slowdown.
  4. This problem is here to stay. In fact, it is growing because of cheap and plentiful resources and ready-made tools to perform the attacks.
  5. To get a better understanding of how web scraping affects the real estate industry, we put together a survey. The respondents included real estate portal operators and real estate executives.
  6. To understand “how much is too much” we asked how much scraping traffic a real estate portal could sustain with its business model and operational budget, and found that the answer was not much. 43% answered that less than a single percent of traffic should be from scrapers. An additional 28 percent answered less than 15%.
  7. The next step was to compare that with what is actually happening on real estate portal websites. According to our Bad Bot Landscape Report, the average real estate site has between 12 and 25% bad bot traffic. This is more bad bot traffic than 71% of the portals surveyed said they could handle.
  8. Here’s a look at a popular real estate portal using our service. This data is from last week and you can see only 12.5% of traffic is actually humans. The vast majority of traffic is bots. On this particular site, about 75% of traffic comes from good bots, and another 12% comes from bad bots.
  9. If the web scraping problem is so rampant, why weren’t more of the respondents aware of its scope? To answer that we asked what technologies people were using to detect bots on their website. Again we can see some interesting trends. Specifically, most real estate portals are relying on log analysis to find bots. This is a good practice but many bots, especially the more advanced bots which we’ll learn more about in this presentation, do not make themselves readily apparent in logs. Only 17% of the surveyed portals were using a commercial tool.
  10. We also wanted to know what was being done to STOP these bots, and again we found that the majority of portals were relying on obsolete tools. The three major tools were IP blocking, rate limiting, and WAFs. What is it about these tools that is insufficient? Why can’t they keep up with bots?
  11. IP blocking was the #1 most cited method of defense for real estate portals. Unfortunately, it’s too reactive to be useful. IP addresses can be faked, users can mask their IPs, and attackers can rotate IP addresses. All of these mean that you’re one step behind the bots, playing an endless game of whack-a-mole.
  12. What about rate limiting? The problem here is that the defense strategy relies on the premise that all of the requests are coming from a single trackable source which can be limited. What happens if a hacker distributes the attack over a huge pool of IPs, each doing very slow activity, to obtain the same result? It flies right under the radar of rate limiting. This is referred to as a low-and-slow attack.
  13. OK, but what about WAFs? WAFs are a proven tool that helps protect against many web app security attacks. That’s true. WAFs are great at stopping the OWASP Top Ten attacks but they aren’t great for stopping bots. One of the main reasons has to do with what a WAF is designed to do. It looks for web app attacks but lets legitimate humans proceed to the website unimpeded. As web browsers become more complex, attackers have more and more tools with which to build better bots. This chart shows the release of different technologies in various browsers over time. As you can see, in recent years, more capabilities have been released at faster rates.
  14. One of the major things attackers are doing with all of these new tools is creating sophisticated bots capable of evading detection and posing as humans. These bots are able to bypass WAFs by posing as legitimate human visitors.
  15. How do we protect against this? Traditional security isn’t cutting it because it was not designed to deal with this problem. Most WAFs are designed specifically to protect against threats like the OWASP Top 10 and do so with a rules-based approach. Advanced bots, on the other hand, fly under the radar of these tools because they appear to be human and are not performing attacks that trigger web app attack rulesets. Identifying these bots requires using a variety of approaches that become more advanced as the bots become more sophisticated.
  16. Rami: As you can see from the data about mobile bad bot traffic, you’re going to want to protect your mobile site. Craig: Why the increase in bad bot mobile traffic? One reason is that mobile sites are easier to scrape. The same characteristics that make a mobile optimized site easy to quickly navigate for humans also makes them prime targets for bad bots. Mobile sites tend to be easier to scrape because they provide more structured access to website data.
  17. Slide owner: Rami