I'm having trouble structuring the output of a Netezza query. I'm working with web data and need to be able to count the distinct URL's that are receiving traffic. The issue is that most of the rows contain URL parameters that need to be stripped before counting.
This is what my URL's look like:
- www.site.com/page?utm_source=xxx&utm_campaign=yyy
- www.site.com/another?cmp_code=zzz&cmp_vendor=aaa
- www.site.com/page?cmp_code=nnn&cmp_vendor=bbb
- www.site.com/another
- www.site.com/something
What I'd like to return is something like this:
- www.site.com/page
- www.site.com/another
- www.site.com/page
- www.site.com/another
- www.site.com/something
My query is below. It runs without error, but its not producing the results that I'd like.
SELECT UID, TIMESTAMP, SUBSTR(FULL_URL, 1, INSTR(FULL_URL, '?', -1) AS MY_URL
FROM MY.TABLE
LIMIT 100;
This is the result
- www.site.com/page
- www.site.com/another
- www.site.com/page
- NULL
- NULL
You can see here that if my URL contains a question mark, I'm getting the exact result that I want. If it does not contain URL parameters then it returns nothing at all.
I've also tried a regex approach, but this is currently giving me a Count field incorrect
error.
SELECT UID, TIMESTAMP, REGEXP_REPACE(FULL_URL, (\.\+)(\/?)(\.\*), $1) AS MY_URL
FROM MY.TABLE
LIMIT 100;
I'm still new to Netezza and I'm not seeing the logic or operators that I need to accomplish this. Can anyone point me in the right direction?
regexp_extract(full_url,'^[^?]+')