I have large traffic files that I'm trying to analyze statistically to check whether users click on links on specific sites.
It is important to say that my packets are sorted by flows (IP1 <=> IP2).
My first idea was to look in the packets' content and search for hrefs and links, save them all in some kind of data structure with their timestamps, and then iterate over the packets again to search for requests made shortly after the links appeared.
Something like the following pseudocode:
    for each packet in each flow:
        search for "href" or "http://" or "https://"
        save the links with their timestamp
    for each packet in each flow:
        if it's an HTTP request and its URL matches some URL in the list
        and the time is close enough, record it
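To make the two passes concrete, here is a rough Python sketch. It assumes the packets have already been parsed into simple records; the `Packet` class, its field names, and the 30-second `window` are all hypothetical choices for illustration, and only absolute URLs are matched (relative hrefs are ignored).

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical, already-parsed packet record; the field names are assumptions.
@dataclass
class Packet:
    timestamp: float                   # capture time in seconds
    payload: str                       # decoded HTTP payload, may be empty
    request_url: Optional[str] = None  # full URL, set only for HTTP requests

# Absolute URLs only; stops at whitespace, quotes and angle brackets.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')

def collect_links(flows):
    """First pass: map each URL found in a payload to the times it appeared."""
    seen = {}
    for flow in flows:
        for pkt in flow:
            for url in URL_RE.findall(pkt.payload):
                seen.setdefault(url, []).append(pkt.timestamp)
    return seen

def find_clicks(flows, seen, window=30.0):
    """Second pass: requests for a URL that appeared at most `window` seconds earlier."""
    clicks = []
    for flow in flows:
        for pkt in flow:
            if pkt.request_url is None:
                continue
            for shown_at in seen.get(pkt.request_url, []):
                if 0 < pkt.timestamp - shown_at <= window:
                    clicks.append((pkt.request_url, shown_at, pkt.timestamp))
                    break  # one match per request is enough
    return clicks
```

Each flow here is just a list of `Packet` objects, so `find_clicks(flows, collect_links(flows))` returns tuples of (URL, time the link appeared, time of the request).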
The problem with this code is that some (important) links are dynamically generated while the page is loading, and cannot be found using the above method.
Another idea was to check the Referer field in the HTTP header and look for requests that were referred by the relevant sites. This method generates a lot of false positives because of frames and embedded objects.
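A minimal sketch of that Referer check, assuming each request has already been parsed into a dict with a `url` and a `headers` mapping whose keys are normalized to the spelling `Referer` (both the dict layout and the function name are hypothetical):

```python
from urllib.parse import urlsplit

def referred_by(requests, sites):
    """Return the requests whose Referer host is one of the hosts in `sites`."""
    hits = []
    for req in requests:
        referer = req.get("headers", {}).get("Referer", "")
        host = urlsplit(referer).hostname  # None when the header is missing
        if host in sites:
            hits.append(req)
    return hits
```

As written, this returns every request carrying a matching Referer, including images, scripts and frames loaded by the page itself, which is exactly where the false positives come from.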
It is important to mention that these are not my servers, and my intention is to make a tool for statistical analysis of user behavior (thus, I can't add some kind of click tracker to the sites).
Does anyone have an idea what I can do to determine from the network traffic whether users clicked on links?
Any help will be appreciated!
Thank you