First steps at parsing and analyzing
web server log files at scale
Elias Dabbas
@eliasdabbas
Raw log file
Harvard Dataverse, ecommerce site (zanbil.ir)


3.3 GB, ~1.3M lines
Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs", https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1
Parse and convert to DataFrame/Table
• Loading and parsing the whole file into memory probably won’t work (or scale)

• Log files are usually not big, they’re huge

• Sequentially parse chunks of lines, save to another efficient format (parquet), combine
Log File Analysis
• File ingestion gets even faster after combining the chunks into a single
optimized file, which is also more convenient to store than many small files
• Convert to more efficient data types

• Faster writing and reading time
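One way to do the dtype conversion, sketched under the assumption that the columns come out of the regex parsing as plain strings (the function name and exact choices here are illustrative, not the talk's code):

```python
import pandas as pd


def optimize_dtypes(df):
    """Convert string columns from regex parsing to compact, fast dtypes."""
    df = df.copy()
    df['status'] = pd.to_numeric(df['status']).astype('int16')
    # '-' means no response body in the common/combined log formats
    df['size'] = pd.to_numeric(df['size'].replace('-', '0')).astype('int64')
    # e.g. 22/Jan/2019:03:56:14 +0330
    df['datetime'] = pd.to_datetime(df['datetime'],
                                    format='%d/%b/%Y:%H:%M:%S %z')
    # low-cardinality columns compress well as categories
    for col in ['client', 'method']:
        df[col] = df[col].astype('category')
    return df
```

Categorical and fixed-width integer columns are both smaller on disk and faster to serialize, which is where the read/write speedup on the previous slide comes from.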
Log File Analysis
• Magic provided by:

• Pandas

• Apache Arrow Project

• Apache Parquet Project
Model Name: MacBook Pro
Model Identifier: MacBookPro16,4
Processor Name: 8-Core Intel Core i9
Processor Speed: 2.4 GHz
Number of Processors: 1
Total Number of Cores: 8
L2 Cache (per Core): 256 KB
L3 Cache: 16 MB
Hyper-Threading Technology: Enabled
Memory: 32 GB
logs_to_df function
Assumes common (or combined) log format
Can be extended to other formats
import re
import pandas as pd

# Column names match the named groups in combined_regex below
columns = ['client', 'userid', 'datetime', 'method', 'request',
           'status', 'size', 'referrer', 'useragent']

def logs_to_df(logfile, output_dir, errors_file):
    with open(logfile) as source_file:
        linenumber = 0
        parsed_lines = []
        for line in source_file:
            try:
                log_line = re.findall(combined_regex, line)[0]
                parsed_lines.append(log_line)
            except Exception as e:
                # keep unparseable lines for later inspection
                with open(errors_file, 'at') as errfile:
                    print((line, str(e)), file=errfile)
                continue
            linenumber += 1
            if linenumber % 250_000 == 0:
                # flush every 250k parsed lines to their own parquet file
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
        else:
            # for-else: runs when the loop finishes; write the remaining lines
            df = pd.DataFrame(parsed_lines, columns=columns)
            df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
            parsed_lines.clear()
combined_regex = (
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)'
)
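A quick sanity check of the regex against a sample line (the line below is illustrative, in the combined log format, not taken from the dataset; the regex is repeated so the snippet is self-contained):

```python
import re

combined_regex = (
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)'
)

sample = ('1.2.3.4 - - [22/Jan/2019:03:56:14 +0330] '
          '"GET /image/60844/productModel HTTP/1.1" 200 5667 '
          '"https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0"')

# findall returns one tuple per match, with one element per group:
# (client, userid, datetime, method, request, status, size,
#  referrer, useragent)
groups = re.findall(combined_regex, sample)[0]
```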
Regular Expressions Cookbook
by Jan Goyvaerts, Steven Levithan
Thank you
