| Web Scraping and Automation With Outsystems
No API? No Problem! Let the Robot Do Your Work!
Web Scraping and Automation With Outsystems
Miguel Antunes
OutSystems MVP - Tech Lead | Do iT Lean
miguel.antunes@doitlean.com
/antunes-miguel
we ♥ APIs, but… we don’t always have them
Pulling data straight out of HTML – otherwise known as web scraping.
Any content that can be viewed on a webpage can be scraped.
but… Why Should You Scrape?
No Rate-Limiting
Anonymous Access
The Data’s Already in Your Face
Let’s Get to Scraping
No matter what language you’re into, there’s a great scraping library for your project:
● BeautifulSoup or Scrapy (Python)
● Upton, Wombat or Nokogiri (Ruby)
● Scraperjs or X-ray (Node)
● Scrape (Go)
● Jaunt (Java)
+ Text and HTML Processing
Leonardo Fernandes
Head of Delivery OutSystems, MVP | Phoenix Services
Extract information from plain text data with regular expressions, or from HTML with CSS selectors.
Manipulate HTML documents with ease, and sanitize user input against HTML injection.
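To make those three ideas concrete outside OutSystems, here is a minimal sketch in plain Python (standard library only). It is not the API of the Text and HTML Processing component. The function names, the "EUR" price pattern and the "name" class are assumptions made up for illustration, and matching on a class only approximates a CSS selector such as `.name`.

```python
import re
from html import escape
from html.parser import HTMLParser

# Tags that never have a closing tag; ignored so nesting depth stays correct.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0       # > 0 while we are inside a matching element
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1              # nested tag inside a match
        elif self.css_class in classes:
            self.depth = 1               # a new match starts here
            self.matches.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


def extract_prices(text):
    """Regular expression on plain text: amounts written like '12.50 EUR'."""
    return re.findall(r"(\d+(?:\.\d{2})?) EUR", text)


def select_by_class(html, css_class):
    """Rough stand-in for the CSS selector '.<css_class>'."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    return [match.strip() for match in parser.matches]


def sanitize(user_input):
    """Escape <, >, & and quotes so user input cannot inject HTML."""
    return escape(user_input)
```

For example, `select_by_class(page_html, "name")` would return the text of every element whose class list contains `name`. A dedicated library such as BeautifulSoup supports real CSS selectors and is usually the better choice for anything beyond a toy page.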
The Plan
● Pinpoint your target: a simple HTML website
● Design your scraping scheme
● Run & let the magic operate
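Pinpointing the target usually comes down to working out which URL and which query parameters actually return the data. The sketch below, in plain Python with the standard library, shows one way to trim a copied URL down to the parameters that matter and to build new search URLs from it. The `example.com` address and the `q`, `page`, `utm_source` and `session` parameters are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_params(url, wanted):
    """Return `url` with only the query parameters named in `wanted`."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in wanted]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def search_url(base, term, page=1):
    """Build a search URL; urlencode adds the '&' separators and escaping."""
    return f"{base}?{urlencode({'q': term, 'page': page})}"
```

A call such as `keep_params(copied_url, {"q", "page"})` drops tracking and session parameters, which makes it easier to see the pattern you need to reproduce for every search term.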
Hands-on!
What about Enterprise usage?
You may ask...
Frankort & Koning needs
● Check whether products/producers are certified
● Do that multiple times per day, multiple times per product
Global GAP problems
● No API available
● All the checks need to be done manually
How does it work…
You want to know which farm produced your product?
● On the packaging of several products, you can find a 13-digit GLOBALG.A.P. Number (GGN). This number identifies the producer or producer group that has farmed your product.
● As a consumer, you can use it to verify whether the product is from a certified producer or not in the GLOBALG.A.P. Database.
● Retailers also use this number for business-to-business traceability to ensure that products, especially fresh fruit and vegetables, come from a certified origin and that the production is safe and sustainable.
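Before any lookup can be automated, the GGNs have to be collected from somewhere, for instance from label text or a delivery note. The sketch below, in plain Python, pulls 13-digit candidates out of free text. It only checks the format (exactly 13 digits) and cannot tell whether a number was really issued. The function name and the sample numbers in the usage note are made up.

```python
import re

# Exactly 13 digits, not embedded in a longer run of digits.
GGN_PATTERN = re.compile(r"(?<!\d)\d{13}(?!\d)")


def find_ggns(text):
    """Return the distinct 13-digit candidates in order of first appearance."""
    found = []
    for candidate in GGN_PATTERN.findall(text):
        if candidate not in found:
            found.append(candidate)
    return found
```

Calling `find_ggns("GGN 4012345678901, lot 7")` would return the single candidate `4012345678901`, which can then be checked against the database.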
OutSystems + Selenium + Chrome
● Automate user interactions
● Extract HTML
● Parse HTML as before
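The OutSystems side of this integration is not shown in the slides, so the following is only a sketch of the same three steps written against Selenium's Python bindings instead. The function name, the search URL and the form field name are hypothetical. The `driver` argument is expected to behave like a Selenium WebDriver, and the code assumes the result page is fully loaded when `submit()` returns. A real site that loads results asynchronously would need an explicit wait.

```python
def lookup_html(driver, search_url, field_name, ggn):
    """Type a GGN into a search form and return the resulting page HTML.

    `driver` must offer the WebDriver methods used here:
    get(), find_element() and the page_source attribute.
    """
    driver.get(search_url)                          # 1. open the search page
    box = driver.find_element("name", field_name)   # locate the input by its name attribute
    box.clear()
    box.send_keys(ggn)                              # 2. type as a user would
    box.submit()                                    #    and submit the enclosing form
    return driver.page_source                       # 3. HTML to parse as before


# With Selenium installed and Chrome available, usage might look like this
# (URL and field name are placeholders, not the real certification site):
#
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   html = lookup_html(driver, "https://example.com/search", "ggn", "4012345678901")
#   driver.quit()
```

Passing the driver in as a parameter keeps the interaction logic separate from browser start-up, so the same function can be exercised with a stand-in object during testing.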
Let’s see it in action...
700+ Producers
160+ Products
900+ Certificates
~15h Manually*
~2h Automatically
*estimating that each certificate would take 1 minute to check manually
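The manual figure follows directly from the slide's own estimate. Here is a small worked calculation in Python, using the rounded numbers from the slide (900 certificates, 1 minute each, about 2 hours automated). The function name is just for illustration.

```python
def manual_hours(certificates, minutes_per_check):
    """Hours needed to check every certificate by hand."""
    return certificates * minutes_per_check / 60


hours_by_hand = manual_hours(900, 1)       # 900 minutes = 15 hours
hours_automated = 2                        # approximate figure from the slide
speedup = hours_by_hand / hours_automated  # roughly 7.5 times faster
```

Since the slide gives "900+" and "~2h", these are lower-bound, rounded values, not exact measurements.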
Thank You!
miguel.antunes@doitlean.com
/antunes-miguel

Editor's Notes

  1. Thank you all for being here, and let me also thank OutSystems for letting me be here on stage to talk about a topic that I really like: Web Scraping and Automation with OutSystems.
  2. APIs are like the One Ring of programming, enabling you to pull info and perform actions from different services. I doubt Slack would be nearly as popular as it is without all those cool API integrations. Now, considering how popular APIs are these days, it’s frustrating to run into a service or site without one. But it’s actually quite common. Netflix shut down its API years ago. My bank doesn’t have one. Most news sources don’t either. Bottom line, many apps and much data aren’t designed for programmatic access. But don’t let that discourage you from building your next big thing. If you need to collect data or perform an action on the web without access to an API, there are a couple of ways you can hack it.
  3. If a website provides a way for a visitor’s browser to download content and render that content in a structured way, then almost by definition, that content can be accessed programmatically. In this presentation, I’ll show you how. Over the past few years, I’ve scraped dozens of websites, from cinema blogs and modelling agencies to recipe sites and undocumented JSON endpoints that I found by inspecting network traffic in my browser, you name it. There are some tricks that site owners will use to thwart this type of access (which we’ll dive into later), but they almost all have simple workarounds.
  4. Let me share with you some good reasons why web scraping is a good thing. The first one, of course, is when we don’t have an API.
  5. Site owners generally care way more about maintaining their public-facing visitor website than they do about their structured data feeds. We’ve seen it very publicly with Twitter clamping down on their developer ecosystem, and I’ve seen it multiple times in my projects where APIs change or feeds move without warning. Sometimes it’s deliberate, but most of the time these sorts of problems happen because no one at the organization really cares about or maintains the structured data. If it goes offline or gets horribly mangled, no one really notices. On the other hand, if the website goes down or is having issues, that’s more of an in-your-face, drop-everything-until-this-is-fixed kind of problem, and it gets dealt with quickly.
  6. Another thing to think about is that the concept of rate-limiting is virtually non-existent for public websites. Aside from the occasional captchas on sign-up pages, most businesses generally don’t build a lot of defenses against automated access. I’ve scraped a single site for over 4 hours at a time and not seen any issues. Unless you’re making a high number of concurrent requests, you probably won’t be viewed as a DDoS attack; you’ll just show up as a super-avid visitor in the logs, in case anyone’s looking.
  7. There are also fewer ways for the website’s administrators to track your behavior, which can be useful if you want to gather data more privately. With APIs, you often have to register to get a key and then send along that key with every request. But with simple HTTP requests, you’re basically anonymous besides your IP address and cookies, which can be easily spoofed.
  8. Web scraping is also universally available, as I mentioned earlier. You don’t have to wait for a site to come up with an API or even contact anyone at the organization to ask for it. Just spend some time browsing the site until you find the data you need and figure out some basic access patterns – which we’ll talk about next.
  9. OutSystems has some libraries too; they are in the Forge, ready to be downloaded and used. I personally like Text and HTML Processing.
  10. It was created by Leonardo, kudos to him.
  11. It is probably a horrible idea to try parsing the HTML of the page as a long string, right? (Although there are times I’ve needed to fall back on that.) A good library will read in the HTML that you pull in using some HTTP request and turn it into an object that you can iterate over to your heart’s content, similar to a JSON object. And this component does just that. The key to web scraping is figuring out how to identify the exact elements you’re looking for. This could be by looking for element types (divs, list items), particular ids or classes, or by doing regex / XPath searches.
  12. So the first thing you’re going to need to do is fetch the data. You’ll need to start by finding your “endpoints”: the URL or URLs that return the data you need. If you know you need your information organized in a certain way, or only need a specific subset of it, you can browse through the site using its navigation. Pay attention to the URLs and how they change as you click between sections and drill down into sub-sections. The other option for getting started is to go straight to the site’s search functionality. Try typing in a few different terms and, again, pay attention to the URL and how it changes depending on what you search for. You’ll probably see a GET parameter like q= that always changes based on your search term. Try removing other unnecessary GET parameters from the URL until you’re left with only the ones you need to load your data. Make sure that there’s always a beginning ? to start the query string and a & between each key/value pair.
  13. I’m going to share with you a real scenario where we used web scraping to overcome the problem of not having an API to interact with a third-party system.
  14. Frankort & Koning is an international organisation that trades in fruit and vegetables. Let me try to simplify their business process. They buy from the producers and sell to the markets. And since we’re talking about fruit and vegetables, they’re really concerned about the freshness of the products they trade. This is an area with a lot of regulations; for example, they can only sell products that come from certified producers.
  15. Global GAP is the organization that certifies the producers. Frankort & Koning needed a way to know whether each product arriving at the warehouse was certified for that specific producer. The problem is that Global GAP doesn’t have an API to cross-check that information; everything is done manually.
  16. And how can we do that cross-check? There’s a 13-digit code on the packaging of the products, like the traditional barcode that we are used to seeing. With this code we can check the Global GAP database for the product certification.
  17. This is how we can check the certification for a GGN in the Global GAP database. After we enter the producer’s GGN, it shows us what this producer is certified to produce and sell; in this case, cucumbers. Now imagine this… Frankort & Koning receives thousands of products daily; checking all of them manually would make someone jump off a bridge for sure!
  18. What we did was combine OutSystems with Selenium, and Selenium with Google Chrome. This way we can automate user actions, extract the HTML that results from that interaction, and parse it as we did before.