Using Hadoop for Malware, 
Network, Forensics and Log 
analysis 
● Michael Boman 
● michael@michaelboman.org 
● http://blog.michaelboman.org 
● @mboman
Background 
● 44CON 2012 – Malware analysis as a 
hobby 
● DEEPSEC 2012 – Malware analysis on a 
shoe-string budget 
● DEEPSEC 2013 - Malware Datamining and 
Attribution
VirusShare Malware Collection 
[Chart: growth of the VirusShare collection, Total Size (GB), from 2012-01-01 to 2014-07-21, rising to about 5.8 TByte]
VirusShare Latest Releases
What is Hadoop? 
● Distributed processing of large data sets 
(“Big Data”) 
● Runs on off-the-shelf hardware 
● Runs from a single node to thousands of 
machines 
● High failure tolerance 
– “Hardware is crappy and will fail”
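To make the processing model concrete, here is a minimal Hadoop Streaming sketch (my own illustration, not material from the talk): a single Python script that acts as mapper or reducer and counts log lines per source host. Hadoop splits the input across nodes, runs the mapper on every split, shuffles and sorts by key, and runs the reducer over the grouped output. File paths and the field layout are assumptions.

#!/usr/bin/env python
# Minimal Hadoop Streaming sketch (illustrative only): count lines per host.
# Example invocation (paths and jar location are placeholders):
#   hadoop jar hadoop-streaming.jar \
#     -input /logs/raw -output /logs/counts \
#     -mapper "linecount.py map" -reducer "linecount.py reduce" -file linecount.py
import sys

def mapper():
    # Emit "host<TAB>1" for every record; assume the host is the first field.
    for line in sys.stdin:
        fields = line.split()
        if fields:
            print("%s\t1" % fields[0])

def reducer():
    # Input arrives grouped and sorted by key, so counts accumulate per host.
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[-1] == "map" else reducer()
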
Hadoop components 
● Base: operating system (Red Hat, Ubuntu, Windows etc.) and the Java Virtual Machine 
● Data Storage: HDFS (distributed storage) 
● Execution: MapReduce (distributed processing), YARN (distributed scheduling) 
● Data Integration: Sqoop, Flume, Chukwa 
● Data Access / Interaction / Visualization / Development: Pig, Hive, HBase, Cassandra, HCatalog, Lucene, Hama, Crunch 
● Data Serialization: Avro, Thrift 
● Data Intelligence: Drill, Mahout 
● Management, Monitoring, Orchestration: Ambari, Zookeeper, Oozie
How to obtain your Hadoop 
infrastructure (examples) 
● Pre-packaged “distributions” 
– Cloudera 
– Hortonworks 
● Rent 
– Amazon Web Services 
● Roll your own 
– Compile from source
Malware Analysis - BinaryPig 
● Creates large archives of individual 
samples on HDFS as key/value sets 
(samples are small, HDFS likes them big) 
● Static analysis is done in batch 
● Results are stored in ElasticSearch for 
easy access/further analysis
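As a hedged sketch of that last step: once a batch job has produced per-sample results, each result document can be pushed to Elasticsearch over its REST API. The host, index name and document fields below are assumptions for illustration, not BinaryPig's actual output schema.

# Hypothetical sketch: index one sample's static-analysis result in Elasticsearch.
# Host, index name and document layout are assumptions, not BinaryPig's schema.
import requests

result = {
    "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",  # placeholder
    "filesize": 0,
    "yara_matches": [],
    "strings_sample": [],
}

resp = requests.post(
    "http://localhost:9200/binarypig/_doc",  # Elasticsearch 7+ style endpoint (assumed)
    json=result,
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["_id"])
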
Malware Analysis - BinaryPig 
● Extracting resource information 
● AV-(re)scanning 
● Scanning samples with new/updated Yara 
signatures
How does it work? 
ZIP archive / local directory → BinaryPig → sequence file
How does it work? 
Sequence files stored 
in HDFS
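Getting the packed samples onto the cluster is an ordinary HDFS upload; a small sketch using the standard hdfs dfs commands (paths are placeholders):

# Upload a locally built sequence file to HDFS; all paths are placeholders.
import subprocess

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/analyst/samples"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", "samples.seq", "/user/analyst/samples/"], check=True)
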
How does it work? 
Pig scripts for: 
Hashes 
ClamAV 
Yara 
Strings
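To give a feel for what such a script computes per sample, here is a small stand-alone Python sketch (my own illustration, not one of BinaryPig's Pig scripts) that produces hashes and printable strings for one file; ClamAV or Yara scanning would typically be added by shelling out to clamdscan/yara or using their Python bindings.

# Stand-alone illustration of per-sample static analysis (hashes + strings);
# not BinaryPig's actual Pig/Python code.
import hashlib
import re
import sys

def analyse(path):
    data = open(path, "rb").read()
    report = {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        # ASCII strings of 6+ printable characters, like the strings(1) tool
        "strings": [s.decode("ascii") for s in re.findall(rb"[ -~]{6,}", data)][:20],
    }
    # ClamAV / Yara results would be added here, e.g. via subprocess calls
    # to clamdscan/yara or the pyclamd / yara-python bindings.
    return report

if __name__ == "__main__":
    print(analyse(sys.argv[1]))
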
Network Analysis - PacketPig 
● PCAP in HDFS 
● Detecting anomalies and intrusion 
signatures 
● Learn time frame and identity of attacker 
● Triage incidents 
● “Show me packet captures I’ve never seen 
before.”
How does it work? 
PCAPs are created locally 
and uploaded to HDFS
How does it work? 
PCAP uploaded to HDFS
How does it work? 
Pig scripts for: 
Snort signatures 
p0f 
User-Agent extraction (sketched below) 
Whatever you want
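As an example of the kind of field extraction such a script performs, here is a hedged Python sketch (using the third-party dpkt library, not PacketPig's own loaders) that pulls HTTP User-Agent headers out of a PCAP file:

# Illustrative User-Agent extraction from a PCAP with dpkt (pip install dpkt);
# not PacketPig's own loader/UDFs.
import sys
import dpkt

def user_agents(pcap_path):
    with open(pcap_path, "rb") as f:
        for ts, buf in dpkt.pcap.Reader(f):
            eth = dpkt.ethernet.Ethernet(buf)
            ip = eth.data
            if not isinstance(ip, dpkt.ip.IP) or not isinstance(ip.data, dpkt.tcp.TCP):
                continue
            tcp = ip.data
            if tcp.dport != 80 or not tcp.data:
                continue
            try:
                req = dpkt.http.Request(tcp.data)
            except (dpkt.dpkt.NeedData, dpkt.dpkt.UnpackError):
                continue  # not the start of an HTTP request
            ua = req.headers.get("user-agent")
            if ua:
                yield ts, ua

if __name__ == "__main__":
    for ts, ua in user_agents(sys.argv[1]):
        print(ts, ua)
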
Computer Forensics - Sleuth Kit 
Hadoop Framework 
● Uses both HDFS and HBase to store file 
information 
● Ingest 
● Analysis 
● Reporting
How does it work? 
fsrip dumps information about 
the disk image and about the 
files it contains
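The same kind of file metadata can be pulled from an image with The Sleuth Kit's Python bindings; a hedged sketch using pytsk3 (standing in for fsrip, whose actual JSON output format differs; the image path is a placeholder):

# Illustrative file-metadata walk of a raw disk image with pytsk3
# (The Sleuth Kit bindings); fsrip's real output format differs.
import pytsk3

img = pytsk3.Img_Info("image.dd")   # raw image path is a placeholder
fs = pytsk3.FS_Info(img)            # open the file system inside the image

for entry in fs.open_dir(path="/"):
    name = entry.info.name.name.decode("utf-8", "replace")
    meta = entry.info.meta
    if meta is None:
        continue
    print({"name": name, "size": meta.size, "mtime": meta.mtime, "inode": meta.addr})
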
How does it work? 
RAW disk image file is 
uploaded to HDFS
How does it work? 
Populates the HBase entries table with 
information about the files on the 
disk image
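A hedged sketch of what populating such a table could look like from Python via the happybase client; the table name, column family and row-key scheme are assumptions, not the framework's actual schema.

# Illustrative HBase write via happybase/Thrift; table name, column family,
# row-key layout and values are placeholders, not the real TSK-Hadoop schema.
import happybase

connection = happybase.Connection("localhost")  # assumes an HBase Thrift server
table = connection.table("entries")

table.put(
    b"image001/WINDOWS/system32/cmd.exe",
    {
        b"file:size": b"389120",
        b"file:mtime": b"1405900800",
        b"file:md5": b"d41d8cd98f00b204e9800998ecf8427e",  # placeholder hash
    },
)
connection.close()
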
How does it work? 
Extract raw file data 
Keyword search 
Extract text 
Tokenize 
Cluster similar objects 
Compare with other images (tokenize/compare sketched below)
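A minimal sketch of the tokenize/compare steps, assuming text has already been extracted from two files; in the real framework these steps run at scale as MapReduce jobs over every file in the image.

# Minimal illustration of keyword search, tokenization and a Jaccard
# similarity comparison between two extracted-text documents.
import re

def tokenize(text):
    return set(re.findall(r"[a-z0-9]{3,}", text.lower()))

def jaccard(a, b):
    return len(a & b) / float(len(a | b)) if (a or b) else 0.0

doc1 = tokenize("Extracted text from a file in image A ...")
doc2 = tokenize("Extracted text from a similar file in image B ...")

print("keyword hit:", "extracted" in doc1)       # simple keyword search
print("similarity: %.2f" % jaccard(doc1, doc2))  # basis for clustering/comparison
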
How does it work? 
Build a report from 
previous steps
Log Analysis 
● Flume agents push local logs to HDFS. 
● Pig scripts process data on schedule. 
Results from Pig are stored in HDFS / 
HBase. 
● HBase will have the data processed by Pig 
ready for reporting or further analysis. 
● Data interaction/extraction using REST 
services.
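As a hedged example of the last bullet, processed results can be read back over HBase's REST gateway (Stargate); the gateway host/port, table name and row key below are assumptions.

# Illustrative read of a processed-log row via the HBase REST gateway;
# gateway host/port, table name and row key are assumptions.
import base64
import requests

resp = requests.get(
    "http://hbase-rest:8080/weblogs_daily/2014-09-11",  # /<table>/<row-key>
    headers={"Accept": "application/json"},
    timeout=10,
)
resp.raise_for_status()

# The REST gateway returns row keys, column names and values base64-encoded.
for row in resp.json()["Row"]:
    for cell in row["Cell"]:
        column = base64.b64decode(cell["column"]).decode()
        value = base64.b64decode(cell["$"]).decode()
        print(column, "=", value)
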
How does it work? 
Flume agents push 
local logs to HDFS
How does it work? 
Pig scripts extract data 
and put it into HBase
How does it work? 
Pig scripts can perform 
additional analysis on 
HBase data
How do I do it? 
● Store malware samples locally 
● Upload samples to analyze to S3 
● Run EMR on samples on S3 
● Download the results from S3 to local storage (see sketch below)
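A hedged sketch of the S3 legs of that workflow using boto3; bucket names and keys are placeholders, and the EMR job runs between the two steps (spot/EMR details are sketched under "Saving money").

# Upload samples to S3 and fetch the EMR results back; bucket and keys
# are placeholders.
import boto3

s3 = boto3.client("s3")

# 1. Upload a batch of samples for analysis.
s3.upload_file("samples.seq", "my-malware-bucket", "incoming/samples.seq")

# 2. ...run the EMR job over s3://my-malware-bucket/incoming/ here...

# 3. Download the results the job wrote back to S3.
s3.download_file("my-malware-bucket", "results/part-r-00000", "results.txt")
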
Saving money 
● Samples stored locally and backed up on 
Amazon Glacier. 
● Use reduced redundancy storage on S3 
– 99.99% instead of 99.999999999% 
● Spot-bid on EC2 instances for EMR 
– ~$0.011 instead of $0.052 
● My AWS cost is expected to be about 
$20/month
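Both savings can be expressed directly in boto3 calls; a sketch with assumed bucket, roles, instance types and prices (the bid price mirrors the figure above and is purely illustrative).

# Illustrative cost-saving knobs: reduced-redundancy S3 storage and
# spot-priced EMR core nodes. Bucket, roles, release label, instance types
# and prices are assumptions; check current values before use.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "samples.seq", "my-malware-bucket", "incoming/samples.seq",
    ExtraArgs={"StorageClass": "REDUCED_REDUNDANCY"},  # 99.99% durability tier
)

emr = boto3.client("emr")
emr.run_job_flow(
    Name="malware-static-analysis",
    ReleaseLabel="emr-5.36.0",
    Instances={
        "InstanceGroups": [
            {"Name": "master", "Market": "ON_DEMAND", "InstanceRole": "MASTER",
             "InstanceType": "m4.large", "InstanceCount": 1},
            {"Name": "core", "Market": "SPOT", "BidPrice": "0.011",  # spot bid
             "InstanceRole": "CORE", "InstanceType": "m4.large", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
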
Conclusions 
● Malware Analysis 
● Network Analysis 
● Computer Forensics 
● Log Analysis
Questions? 
● michael@michaelboman.org 
● @mboman 
● http://blog.michaelboman.org


Editor's Notes

  1. Hadoop Distributed File System (HDFS): HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
     MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts: the “Map” function divides a query into multiple parts and processes data at the node level, and the “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.
     Hive: Hive is a Hadoop-based data-warehousing-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
     Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
     HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. eBay and Facebook use HBase heavily.
     Flume: Flume is a framework for populating Hadoop with data. Agents are placed throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.
     Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.
     Ambari: Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters. Its development is led by engineers from Hortonworks, which includes Ambari in its Hortonworks Data Platform.
     Avro: Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
     Mahout: Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the MapReduce model.
     Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.
     HCatalog: HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.
     BigTop: BigTop is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop's sub-projects and related components, with the goal of improving the Hadoop platform as a whole.