Here are a few URLs:

http://sub.example.com/?feed=atom&hello=world
http://www.sub.example.com/?feed=atom&hello=world
http://sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom#123

As you can see, they all lead to the exact same page, but the URL format is different. Here are two other basic examples:

http://example.com/hello/
http://example.com/hello

Both are the same.

I want to convert the URL into one standard format so that when I store the URL in the database, I can easily check whether the URL string already exists in the database.

Because a URL can be formatted in so many ways, this can be puzzling.

What's the definitive approach to converting a URL into one standard format? Maybe the parse_url() route...?

Edit

As outlined in the comments, there is no definitive solution to this, but the aim is to get as close as possible with what we have without "retrieving" the page. Please read comments before posting an answer to this bounty.

11
  • 2
    This is actually a super interesting question. +1
    – Gary Woods
    Commented Aug 4, 2018 at 13:10
  • 1
    Not sure there could be a definitive approach unless you own the site that serves those URLs. There is no way to know for sure, or prove, that all of those URLs are the same without retrieving each of them, creating a checksum, and comparing the checksum values.
    – Dave
    Commented Aug 4, 2018 at 17:03
  • The aim is to convert the URL in one standard format where for example, it will always be http://sub.example.com/?feed=atom&hello=world Commented Aug 5, 2018 at 10:28
  • 4
    These are DIFFERENT urls. www.sub.* and sub.* in theory could point to different pages. Best you can do is sort the query string. Likewise, trailing slashes also mean different urls. Commented Aug 5, 2018 at 10:47
  • 2
    The correct solution is to open the URL and see if it returns a 301 redirect; then store the redirected url. Or scan the page for <link rel=canonical> tag. Both techniques are used by websites to indicate "preferred" variant of same URL. Commented Aug 5, 2018 at 10:56

9 Answers

1

After you parse_url:

  1. Remove the www prefix from the domain name
  2. If the path is not empty - remove the trailing slash from it
  3. Sort query parameters alphabetically by their name - if there are any

Combine these parts in order to get a canonical URL.
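
The steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: normalizeUrl() is a hypothetical helper name, and it assumes (as the question implies) that the www and non-www hosts and the trailing-slash variants really serve the same page. The root path is kept as "/" so the result matches the format asked for in the comments.

```php
<?php
// Hypothetical helper: applies the three steps listed above.
// Assumption: www/non-www and trailing-slash variants are equivalent.
function normalizeUrl(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // not an absolute URL we can normalize
    }

    $scheme = strtolower($parts['scheme'] ?? 'http');

    // 1. Remove the www prefix (host names are case-insensitive).
    $host = preg_replace('/^www\./', '', strtolower($parts['host']));
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';

    // 2. Remove the trailing slash from a non-empty path; keep "/" for the root.
    $path = rtrim($parts['path'] ?? '', '/');
    if ($path === '') {
        $path = '/';
    }

    // 3. Sort query parameters alphabetically by name, if there are any.
    $query = '';
    if (isset($parts['query']) && $parts['query'] !== '') {
        parse_str($parts['query'], $params);
        ksort($params);
        $query = '?' . http_build_query($params);
    }

    // The fragment (#...) is not sent to the server, so it is dropped here.
    return $scheme . '://' . $host . $port . $path . $query;
}
```

Note that parse_str() rewrites dots and spaces in parameter names to underscores and http_build_query() re-encodes values, so the output may differ in encoding from the input even when the parameters are the same.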

4
  • 1
    1) what if www.example.com and example.com are different (ii) what if trailing slash is required e.g. when the url is a directory? Commented Aug 5, 2018 at 15:50
  • The OP implied that for his URLs www and non-www means the same and that the trailing slash is ignored/removed by the server.
    – IVO GELOV
    Commented Aug 5, 2018 at 15:54
  • you must emphasize that you cannot imply anything. Even Google has problems with duplicate URLs. Commented Aug 5, 2018 at 15:57
  • Therefore there is no definite approach for solving this problem. Question closed.
    – IVO GELOV
    Commented Aug 5, 2018 at 16:01
1

I had the same issue with a reports-configuration-save feature. In our system, users can design their own sales reports (like JQL in Jira); for that, we use GET params as conditions and the fragment identifier (after #) as the layout setup, like this:

http://example.com/report.php?since=20180101&until=20180806#sort=amount&color=blue

For our system, the order of the GET params, or of the params after #, is irrelevant: you reach the same report configuration whether "until" is set before "since" or after it, so for us these are the same request.

Given this, subdomains are outside the discussion, because you must handle them either with rewrite techniques (such as mod_rewrite with a 301 in Apache) or by maintaining a pool of domain exceptions at the software level. Also, different domains can point to different websites, so you must decide whether this is a good idea; for subdomains, "www" is very easy to figure out, but other cases will take you time.

The server side can help you get the vars in the query section. For example, in PHP you can use parse_str() on $_SERVER['QUERY_STRING'] to get an array; you then need to sort it (asort(), or ksort() to order by parameter name) and finally compare the arrays to check whether they are the same request (for example with array_diff()).

Unfortunately, the server side alone is not an option, since it has no way to read the content after the hash (#), and we still have not considered other problems, such as an included script name, protocols, or ports:
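
As a rough illustration of that comparison, here is a minimal sketch. sameQuery() is a hypothetical helper name, and it sorts by parameter name with ksort(); in a request handler the first argument would typically be $_SERVER['QUERY_STRING'].

```php
<?php
// Hypothetical helper: true when two query strings carry the same
// parameters and values, regardless of the order they appear in.
function sameQuery(string $a, string $b): bool
{
    parse_str($a, $paramsA);
    parse_str($b, $paramsB);

    // Sort both arrays by parameter name so that order no longer matters.
    ksort($paramsA);
    ksort($paramsB);

    return $paramsA === $paramsB;
}
```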

http://www.sub.example.com/index.php?hello=world&feed=atom
https://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com:8081/?hello=world&feed=atom

In my personal experience, the closest solution is JavaScript: handle the URL, parse the query section as an array, compare, and do the same with the fragment identifier. If you need the result on the server side, every page load must be followed by an Ajax request sending this data to the server.

Apologies in advance for length of my answer, but it is what I had to go through in order to solve the same problems you have. Greetings!

Get protocol, domain, and port from URL

How can I get query string values in JavaScript?

How do I get the fragment identifier (value after hash #) from a URL?

1

adding the preferred <link rel="canonical" ... > tag into the HTML headers is the only reliable solution, in order to reference unique content to a single SEF URL. see Google's documentation concerning Consolidate duplicate URLs, which possibly answers the whole question more authoritatively and reliably than I ever could.

the idea of being able to know the canonical URL, or to resolve a bunch of external URLs, without parsing those servers' .htaccess rewrite rules or the HTML headers, does not appear to be applicable (simply because one can maintain a table of URL aliases, which does not permit guessing how an HTTP request might have been rewritten).

this question might belong to https://webmasters.stackexchange.com/search?q=cannonical.

1

Since the question is marked „PHP“ I assume you are in the backend.

There are enough answers on how you can compare URLs (protocol, host, port, path, list of request params), where the path is case sensitive and the protocol and host are not. Changing the order of request parameters is, strictly speaking, also changing the URL.

My impression is that you want to differentiate by the RESOURCE which the server is serving (http://www.sub.example.com/ serves the same resource as http://sub.example.com/ or .../hello serves the same resource as .../hello/)

Which resource is served, you should perfectly know on the backend level, since you (the backend) know what you are serving. Find the perfect ID for the resource and use it.

PS: the URL is not a good identifier for that. But if you must use it, just use a sanitized version (sanitization for your purpose => sanitize to your preferred host, strip or add slashes at the end of paths, drop things like /../ from the path (a security issue anyway), bring the request params into a certain order, whatever is right for your purpose).
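
The path part of such a sanitization might look like the sketch below. sanitizePath() is a hypothetical helper name, and stripping (rather than adding) the trailing slash is just one possible policy; adapt it to whatever your own backend serves.

```php
<?php
// Hypothetical helper: resolves "." and ".." segments, collapses repeated
// slashes, and strips the trailing slash from a URL path.
function sanitizePath(string $path): string
{
    $clean = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;           // skip empty segments and "."
        }
        if ($segment === '..') {
            array_pop($clean);  // go up one level, never above the root
            continue;
        }
        $clean[] = $segment;
    }

    return '/' . implode('/', $clean);
}
```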

Best regards, iPirat

1

This is a case of duplicate URLs, and you can avoid this kind of duplicate by using a URL factory that redirects every improper URL to the proper one.

And the same thing is explained in this article:

https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38

Any other URLs leading to the same page are 301 redirected to the proper version of the URLs.

This is the best practice of Search Engine Optimization(SEO). Here I'm going to give you a couple of examples.

You can consider the URLs of this website, for example the wrong links of this page are

https://stackoverflow.com/questions/51685850
https://stackoverflow.com/questions/51685850/convert-url-into-one-s
https://stackoverflow.com/questions/51685850/

If you go to the above wrong URLs of this page, you'll be redirected to the proper URL which is

https://stackoverflow.com/questions/51685850/convert-url-into-one-standard-format

And if you change the title of this question, all other URLs are 301 redirected to the proper URL. The idea here is the 301 redirection which tells the search engines to replace the old URL with the new one otherwise the search engines find different URLs providing the same content.

The real deal here is the id of the question, 51685850. This id is used to create the proper URL with the information from the database. With the URL factory that is created in the article in the link provided, you do not even need to store URLs in the database.
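
To make the idea concrete, here is a minimal sketch of such a factory. It is not the code from the linked article: questionUrl() and redirectTarget() are hypothetical names, the slug rules are an assumption, and the title would normally be looked up in the database by the ID.

```php
<?php
// Hypothetical factory: builds the one proper path from the ID and title.
function questionUrl(int $id, string $title): string
{
    // Build the slug: lowercase, runs of non-alphanumerics become one hyphen.
    $slug = trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($title)), '-');

    return '/questions/' . $id . '/' . $slug;
}

// Returns the proper path when the requested one differs from it,
// or null when the request already uses the proper path.
function redirectTarget(string $requestPath, int $id, string $title): ?string
{
    $proper = questionUrl($id, $title);

    return $requestPath === $proper ? null : $proper;
}
```

When redirectTarget() returns a path, the caller would send it with something like header('Location: ' . $target, true, 301) and exit.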

You can read more on duplicate content here:

https://moz.com/learn/seo/duplicate-content

The same rules are applied to tinywebhut.com as well, the wrong URLs are

https://www.tinywebhut.com/remove-duplicate-38
https://www.tinywebhut.com/some-text-38
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38/

In the above URLs the ID is appended to the end of the URL which is 38 and if you go to any of these URLs, you'll be 301 redirected to the proper version of the URLs which is

https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38

I didn't make any functions to explain this here because it is already done in this article:

https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38

You can achieve the goal with a couple of really simple functions and you can apply the same idea to remove other duplicate URLs such as /about.php, /about, /about.php/, /about/ and so on. And to achieve this you just need a little more code to your existing functions.

One alternative is adding a canonical tag: even if more than one URL leads to the same page, you just need to add the canonical tag with a link to the proper URL.

<link rel="canonical" href="https://stackoverflow.com/questions/51685850/convert-url-into-one-standard-format" />

This way you are telling the search engines that the multiple URLs should be considered as one and the search engines add the link used in the canonical tag in their search results. You can read more on canonicalization here:

https://moz.com/learn/seo/canonicalization

But still, the best way to get rid of duplicate content is the 301 redirect. If you have a 301 redirect like the one I described at the beginning, all problems are solved without surprises.

1

My original answer assumes that the pages are all owned by the OP, as per the line "As you can see, they all lead to the exact same page but the URL format is different...". I am adapting the answer to handle multiple options and adding a list of assumptions you can and cannot make about URLs.

As others have pointed out there is no definitive easy answer to this if you do not know that the page(s) are the same. However, if you follow these assumptions, you should be safe standardizing some things:

CAN ASSUME

  • Query strings with the same values point to the same location regardless of order. Example: https://example.com/?fruit=apple&color=red is the same as https://example.com/?color=red&fruit=apple

  • 301 redirects to a specific source can be followed. If you receive a 301 redirect response, follow the redirect and use that URL. You can safely assume that if a URL actually does point to the same page, and page rank is optimized, then you can follow it.

  • If there is a single <link rel="canonical"> tag in the HTML, that too can be used to cover the canonical link (see below for why).

CANNOT ASSUME

If you own the site

First, redirect all traffic to the URL format you want: Do you want www.example.com or example.com or sub.example.com? Do you want a trailing slash or not? Set up this redirect first, either using server rules or PHP. This is also highly beneficial for search page rank (if that matters to you).

An example of this would be something like this:

// Redirect unless the request already uses HTTPS, the preferred host,
// and a path without a trailing slash.
$path = rtrim($_SERVER['PHP_SELF'], '/');
if (empty($_SERVER['HTTPS']) || 'example.com' !== $_SERVER['HTTP_HOST'] || $path !== $_SERVER['PHP_SELF']) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: https://example.com' . $path);
    exit;
}

Finally, to manage any remaining SEO concerns, you can add this HTML tag:

`<link rel="canonical" href="<?php echo $url; ?>">`

Whether you own the site or not, you can standardize query order

Even if you don't control the site, you can assume that query order does not matter. To standardize this, take your query and rebuild the parameters, appending it to your normalized URL.

function getSortedQuery() 
{
    $url = [];
    parse_str($_SERVER['QUERY_STRING'], $url);
    ksort($url);
    return http_build_query($url);
}

$url = $_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF'].'?'.getSortedQuery();

Another option is to grab the contents of the page and see if there is a <link rel="canonical"> string, and use that string to log your data. This is a bit more costly as it requires a full page load.
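
A minimal sketch of the canonical lookup, assuming the page HTML has already been downloaded (fetching the page and following redirects are left out). findCanonical() is a hypothetical helper name; it uses PHP's DOM extension and returns the first canonical link it finds.

```php
<?php
// Hypothetical helper: returns the href of the first <link rel="canonical">
// found in an HTML string, or null when there is none.
function findCanonical(string $html): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel may hold several space-separated tokens in any letter case.
        $rel = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        if (in_array('canonical', $rel, true) && $link->getAttribute('href') !== '') {
            return $link->getAttribute('href');
        }
    }

    return null;
}
```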

To repeat, do make sure you grab 301 redirects as they are not suggestions, but directives, as to the end result URL.

One final suggestion

I might recommend using two columns, one being "canonical_url" and another being "effective_url". Sometimes a URL works and then later becomes a 301 redirect. This is just my take but I would like to know these things.

2
  • 1
    Unless I'm mistaken this is not the OP's site. It is a site that may be accessed using different URLs and he wants to just store one of them. If it is in fact his site then your answer is spot on but I don't think that's the case here.
    – Dave
    Commented Aug 10, 2018 at 9:58
  • Then the quote "As you can see, they all lead to the exact same page but the URL format is different..." is incorrect and misleading.
    – smcjones
    Commented Aug 10, 2018 at 16:16
1

All of the answers have great information. Assuming you are using an Apache-like server, for the URL bit, I would use .htaccess (or, preferably, if you can change it - the equivalent server Apache config file) to do the rewrites. For a simple example:

RewriteEngine on
RewriteBase /

RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule (.*) http://example.com/$1 [R=Permanent]

In this example, the "R=Permanent" DOES do a redirect. This is usually not a big issue as, a) it tells the browser to remember the redirect, and b) your internal links are presumably relative, so protocol (http or https) and the server (example.com or whatever) are preserved. So generally the redirect will be once per session or less - time well spent, IMO, to avoid doing all this in PHP.

I guess you could use it to rewrite the order of the query bits as well, though when the query bits are significant, I tend to (not recommending you do, just sayin') add them to my path (eg rewrite ".../blah/atom" to ".../blah.php?feed=atom"). At any rate, there are loads of rewrite tricks available, and I recommend you read about them in Apache mod_rewrite.

If you do go this route, be sure to carefully think through what you want to happen - once you start mucking with the URLs, you are usually stuck with your decisions for a long while.

2
  • Unless I'm mistaken this is not the OP's site. It is a site that may be accessed using different URLs and he wants to just store one of them. If it is in fact his site then your answer is spot on but I don't think that's the case here.
    – Dave
    Commented Aug 12, 2018 at 10:40
  • Well, if he is doing log processing or the like, then your point is well taken, and agree this would be off target. I'll pull the answer later if that seems to be the case.
    – wordragon
    Commented Aug 12, 2018 at 13:21
0

As several have pointed out, while the URLs you show may currently point to the same content, there is no way to tell if they will in the future. A change in either protocol or hostname can get you different sets of content, even example.com vs. www.example.com, even if served up by the same machine at the same IP. Not common, but it can happen...

So if I wanted to maintain a list of URLs, I would store the protocol, hostname, directory path, filename if present (aka "whatever came after the last slash before a question mark"), and a set of key/value pairs for the GET arguments, sorted by key.
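
Splitting a URL into those pieces could look like the sketch below. urlParts() is a hypothetical helper name and the array keys are illustrative, not an established schema; it deliberately leaves www and the trailing slash untouched, in line with the caution above.

```php
<?php
// Hypothetical helper: splits an absolute URL into the parts listed above.
function urlParts(string $url): ?array
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // only absolute URLs are handled in this sketch
    }

    $path  = $parts['path'] ?? '/';
    $slash = strrpos($path, '/'); // position of the last slash in the path

    parse_str($parts['query'] ?? '', $args);
    ksort($args); // sort the GET arguments by key

    return [
        'protocol'  => strtolower($parts['scheme']),
        'hostname'  => strtolower($parts['host']),
        'directory' => substr($path, 0, $slash + 1),
        'filename'  => substr($path, $slash + 1), // '' when there is none
        'arguments' => $args,
    ];
}
```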

And then don't forget that you can go to https://www.google.com and not have anything BUT the protocol and hostname...

-2

Avoid passing the parameters in the url. Pass your parameters to the web page using JSON.

1
  • Seriously? Do you have a concrete reason? Commented Aug 14, 2018 at 19:59
