This document discusses parsing and analyzing large web server log files at scale. Log files are usually too large to load into memory at once, so it proposes sequentially parsing chunks of lines and saving each chunk to an efficient file format such as Parquet, then combining the chunk files. This gives faster writing, reading, and ingestion times than the raw log file format. Python libraries (pandas, Apache Arrow, and Apache Parquet) are used to convert and store the log data efficiently. A logs_to_df function is also defined that parses common/combined log formats line by line and saves chunks as Parquet files for scalable analysis of large log datasets.
Log File Analysis
1. First steps at parsing and analyzing
web server log files at scale
Elias Dabbas
@eliasdabbas
2. Raw log file
Harvard Dataverse, ecommerce site (zanbil.ir)
3.3GB ~1.3M lines
Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs",
https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1
3. Parse and convert to DataFrame/Table
• Loading and parsing the whole file into memory probably won’t work (or scale)
• Log files are usually not big, they’re huge
• Sequentially parse chunks of lines, save each chunk to an efficient format (Parquet), then combine; a minimal sketch follows below
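A minimal sketch of the chunked idea, assuming the 250,000-line chunk size used on slide 11; the file path and the parse_chunk helper are hypothetical stand-ins for the regex parsing and Parquet writing shown later:

def parse_chunk(lines):
    """Stand-in: parse these lines and save them (see logs_to_df on slide 11)."""

with open('access.log') as f:      # file iteration is lazy: one line at a time
    chunk = []
    for i, line in enumerate(f, start=1):
        chunk.append(line)
        if i % 250_000 == 0:       # memory stays bounded by the chunk size
            parse_chunk(chunk)
            chunk.clear()
    if chunk:                      # leftover lines after the last full chunk
        parse_chunk(chunk)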
5. • File ingestion gets even faster after saving the DataFrame to a single
optimized file; a single file is also more convenient to store
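One way to do that combining step with pandas, assuming the chunk files were written to a parsed_logs directory (both paths here are assumptions):

import pandas as pd
from pathlib import Path

# Combine the per-chunk Parquet files into one optimized file.
chunk_files = sorted(Path('parsed_logs').glob('file_*.parquet'))
logs_df = pd.concat((pd.read_parquet(f) for f in chunk_files), ignore_index=True)
logs_df.to_parquet('access_logs.parquet', index=False)

# Later sessions then ingest a single compressed, columnar file:
logs_df = pd.read_parquet('access_logs.parquet')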
8. • Convert to more efficient data types
• Faster writing and reading times
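A sketch of such conversions, assuming the column names produced by logs_to_df on slide 11; the datetime format string is an assumption based on the common log format:

import pandas as pd

logs_df = pd.read_parquet('access_logs.parquet')

# The regex captures everything as strings; convert to tighter types.
logs_df['status'] = logs_df['status'].astype('category')
logs_df['method'] = logs_df['method'].astype('category')
logs_df['size'] = pd.to_numeric(logs_df['size'].replace('-', '0')).astype('int64')
logs_df['datetime'] = pd.to_datetime(logs_df['datetime'],
                                     format='%d/%b/%Y:%H:%M:%S %z')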
10. • Magic provided by:
• Pandas
• Apache Arrow Project
• Apache Parquet Project
Model Name: MacBook Pro
Model Identifier: MacBookPro16,4
Processor Name: 8-Core Intel Core i9
Processor Speed: 2.4 GHz
Number of Processors: 1
Total Number of Cores: 8
L2 Cache (per Core): 256 KB
L3 Cache: 16 MB
Hyper-Threading Technology: Enabled
Memory: 32 GB
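As a sketch of how these pieces fit together, Apache Arrow (via pyarrow) can treat the whole directory of chunk files as one logical dataset; the directory name is an assumption:

import pyarrow.dataset as ds

# Read every chunk file in 'parsed_logs' as a single Parquet dataset.
dataset = ds.dataset('parsed_logs', format='parquet')
table = dataset.to_table()     # materialize as an Arrow Table
logs_df = table.to_pandas()    # hand off to pandas for analysis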
11. logs_to_df function
Assumes common (or combined) log format
Can be extended to other formats
import re

import pandas as pd

columns = ['client', 'userid', 'datetime', 'method', 'request',
           'status', 'size', 'referrer', 'useragent']

def logs_to_df(logfile, output_dir, errors_file):
    with open(logfile) as source_file:
        linenumber = 0
        parsed_lines = []
        for line in source_file:
            try:
                log_line = re.findall(combined_regex, line)[0]
                parsed_lines.append(log_line)
            except Exception as e:
                # keep unparseable lines (and the error) for later inspection
                with open(errors_file, 'at') as errfile:
                    print((line, str(e)), file=errfile)
                continue
            linenumber += 1
            if linenumber % 250_000 == 0:
                # save the current chunk and release its memory
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
        else:
            # for/else: runs once the file is exhausted; save the final partial
            # chunk (skipped if the last full chunk was just written)
            if parsed_lines:
                df = pd.DataFrame(parsed_lines, columns=columns)
                df.to_parquet(f'{output_dir}/file_{linenumber}.parquet')
                parsed_lines.clear()
combined_regex = (
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"'
)
Regular Expressions Cookbook
by Jan Goyvaerts, Steven Levithan
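A hypothetical invocation of logs_to_df; the file and directory names are assumptions, and the output directory must exist before any chunks are written:

import os

os.makedirs('parsed_logs', exist_ok=True)   # chunk files land here
logs_to_df('access.log', output_dir='parsed_logs', errors_file='parse_errors.txt')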