This document discusses parsing and analyzing large web server log files at scale. Log files are usually too large to load into memory at once, so it proposes sequentially parsing chunks of lines and saving each chunk to an efficient file format such as Parquet, then combining the chunk files. This gives faster writing, reading, and ingestion times than the raw log file format. Python libraries (pandas, Apache Arrow, and Apache Parquet) are used to convert and store the log data efficiently. A logs_to_df function is also defined that parses common/combined log formats line by line and saves chunks as Parquet files for scalable analysis of large log datasets.
Log File Analysis
1. First steps at parsing and analyzing
web server log files at scale
Elias Dabbas
@eliasdabbas
2. Raw log file
Harvard Dataverse, ecommerce site (zanbil.ir)
3.3GB ~1.3M lines
Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs",
https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1
3. Parse and convert to DataFrame/Table
• Loading and parsing the whole file into memory probably won’t work (or scale)
• Log files are usually not big, they’re huge
• Sequentially parse chunks of lines, save each chunk to an efficient format (Parquet), then combine; a minimal sketch follows below
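A minimal sketch of the chunked idea, assuming the 250,000-line chunk size used on slide 11; the file path and the parse_chunk helper are hypothetical stand-ins for the regex parsing and Parquet writing shown later:

def parse_chunk(lines):
    """Stand-in: parse these lines and save them (see logs_to_df on slide 11)."""

with open('access.log') as f:      # file iteration is lazy: one line at a time
    chunk = []
    for i, line in enumerate(f, start=1):
        chunk.append(line)
        if i % 250_000 == 0:       # memory stays bounded by the chunk size
            parse_chunk(chunk)
            chunk.clear()
    if chunk:                      # leftover lines after the last full chunk
        parse_chunk(chunk)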
5. • File ingestion gets even faster after saving the DataFrame to a single
optimized file; a single file is also more convenient to store
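One way to do that combining step with pandas, assuming the chunk files were written to a parsed_logs directory (both paths here are assumptions):

import pandas as pd
from pathlib import Path

# Combine the per-chunk Parquet files into one optimized file.
chunk_files = sorted(Path('parsed_logs').glob('file_*.parquet'))
logs_df = pd.concat((pd.read_parquet(f) for f in chunk_files), ignore_index=True)
logs_df.to_parquet('access_logs.parquet', index=False)

# Later sessions then ingest a single compressed, columnar file:
logs_df = pd.read_parquet('access_logs.parquet')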
8. • Convert to more efficient data types
• Faster writing and reading times
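A sketch of such conversions, assuming the column names produced by logs_to_df on slide 11; the datetime format string is an assumption based on the common log format:

import pandas as pd

logs_df = pd.read_parquet('access_logs.parquet')

# The regex captures everything as strings; convert to tighter types.
logs_df['status'] = logs_df['status'].astype('category')
logs_df['method'] = logs_df['method'].astype('category')
logs_df['size'] = pd.to_numeric(logs_df['size'].replace('-', '0')).astype('int64')
logs_df['datetime'] = pd.to_datetime(logs_df['datetime'],
                                     format='%d/%b/%Y:%H:%M:%S %z')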
10. • Magic provided by:
• Pandas
• Apache Arrow Project
• Apache Parquet Project
Model Name: MacBook Pro
Model Identifier: MacBookPro16,4
Processor Name: 8-Core Intel Core i9
Processor Speed: 2.4 GHz
Number of Processors: 1
Total Number of Cores: 8
L2 Cache (per Core): 256 KB
L3 Cache: 16 MB
Hyper-Threading Technology: Enabled
Memory: 32 GB
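As a sketch of how these pieces fit together, Apache Arrow (via pyarrow) can treat the whole directory of chunk files as one logical dataset; the directory name is an assumption:

import pyarrow.dataset as ds

# Read every chunk file in 'parsed_logs' as a single Parquet dataset.
dataset = ds.dataset('parsed_logs', format='parquet')
table = dataset.to_table()     # materialize as an Arrow Table
logs_df = table.to_pandas()    # hand off to pandas for analysis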
11. logs_to_df function
Assumes common (or combined) log format
Can be extended to other formats
import re

import pandas as pd

columns = ['client', 'userid', 'datetime', 'method', 'request',
           'status', 'size', 'referrer', 'useragent']

def logs_to_df(logfile, output_dir, errors_file):
    with open(logfile) as source_file:
        linenumber = 0
        parsed_lines = []
        for line in source_file:
            try:
                log_line = re.findall(combined_regex, line)[0]
                parsed_lines.append(log_line)
            except Exception as e:
                # keep unparseable lines (and the error) for later inspection
                with open(errors_file, 'at') as errfile:
                    print((line, str(e)), file=errfile)
                continue
            linenumber += 1
            if linenumber % 250_000 == 0:
                # save the current chunk and release its memory
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
        else:
            # for/else: runs once the file is exhausted; save the final partial
            # chunk (skipped if the last full chunk was just written)
            if parsed_lines:
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
combined_regex = (
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"'
)
Regular Expressions Cookbook
by Jan Goyvaerts, Steven Levithan
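A hypothetical invocation of logs_to_df; the file and directory names are assumptions, and the output directory must exist before any chunks are written:

import os

os.makedirs('parsed_logs', exist_ok=True)   # chunk files land here
logs_to_df('access.log', output_dir='parsed_logs', errors_file='parse_errors.txt')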