Get domain name from given url

Question

Given a URL, I want to extract the domain name (it should not include the 'www' part). The URL can contain http/https. Here is the Java code that I wrote. Though it seems to work fine, is there a better approach, or are there edge cases where it could fail?

public static String getDomainName(String url) throws MalformedURLException{
    if(!url.startsWith("http") && !url.startsWith("https")){
         url = "http://" + url;
    }        
    URL netUrl = new URL(url);
    String host = netUrl.getHost();
    if(host.startsWith("www")){
        host = host.substring("www".length()+1);
    }
    return host;
}

Input: http://google.com/blah

Output: google.com

Try http://74.125.226.70 and let me know how that works out :) — Marvin Pinto, Commented Mar 7, 2012 at 19:39
And how would you get the domain name from that? Assuming that's what you're after.. — Marvin Pinto, Commented Mar 7, 2012 at 19:42
For example http://www.de/ or http://www.com/ will not give the desired results. — Michael Konietzka, Commented Mar 7, 2012 at 19:50

Mike Samuel · Accepted Answer · 2012-03-07 20:41:37Z

If you want to parse a URL, use java.net.URI. java.net.URL has a bunch of problems -- its equals method does a DNS lookup which means code using it can be vulnerable to denial of service attacks when used with untrusted inputs.

"Mr. Gosling -- why did you make url equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.

public static String getDomainName(String url) throws URISyntaxException {
    URI uri = new URI(url);
    String domain = uri.getHost();
    return domain.startsWith("www.") ? domain.substring(4) : domain;
}

should do what you want.

"Though It seems to work fine, is there any better approach or are there some edge cases, that could fail."

Your code as written fails for the valid URLs:

httpfoo/bar -- relative URL with a path component that starts with http.
HTTP://example.com/ -- protocol is case-insensitive.
//example.com/ -- protocol relative URL with a host
www/foo -- a relative URL with a path component that starts with www
wwwexample.com -- domain name that does not start with "www." (with the dot) but does start with "www".

Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that's built into the core libraries.
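To make the cases above concrete, here is a minimal sketch that feeds them to java.net.URI and prints what getHost() reports. The class name UriHostDemo and the helper hostOf are made up for this illustration, and the last input has an http:// prefix added (my assumption) so that wwwexample.com is actually parsed as a host rather than as a relative path.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

final class UriHostDemo {

    // Hypothetical helper: the host java.net.URI finds, or empty when there is
    // none (relative reference) or the input is rejected outright.
    static Optional<String> hostOf(String input) {
        try {
            return Optional.ofNullable(new URI(input).getHost());
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String[] inputs = {
            "httpfoo/bar",            // relative path, no host
            "HTTP://example.com/",    // upper-case scheme, host example.com
            "//example.com/",         // protocol-relative, host example.com
            "www/foo",                // relative path, no host
            "http://wwwexample.com/"  // host wwwexample.com, nothing to strip
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + hostOf(input).orElse("(no host)"));
        }
    }
}
```

Returning an Optional here is only one way to surface the null that getHost() gives back for relative references; callers can then decide whether a missing host is an error.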

If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:

Appendix B. Parsing a URI Reference with a Regular Expression

As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.

The following line is the regular expression for breaking-down a well-formed URI reference into its components.
  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis).
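If you do go the Appendix B route, the expression drops into java.util.regex almost unchanged. The following is a sketch only: the class AppendixBDemo and the method authorityOf are names invented here, and the only change to the RFC expression is doubling the backslash for the Java string literal. Group 4 is the authority, which can still carry userinfo and a port, so it is not yet a bare host name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AppendixBDemo {

    // RFC 3986 Appendix B expression; "\?" becomes "\\?" in a Java literal.
    private static final Pattern URI_REFERENCE = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Returns the authority component (group 4), or null if the reference
    // has no "//authority" part. May include userinfo and port.
    static String authorityOf(String reference) {
        Matcher m = URI_REFERENCE.matcher(reference);
        return m.matches() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        System.out.println(authorityOf("http://www.ics.uci.edu/pub/ietf/uri/#Related")); // www.ics.uci.edu
        System.out.println(authorityOf("http://bob.com:8080/service/read?name=robert")); // bob.com:8080
        System.out.println(authorityOf("www.google.com"));                               // null
    }
}
```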

@Jitendra, I recommend you don't work on fixing them. The Java libraries people have already done the work for you. — Mike Samuel, Commented Mar 7, 2012 at 19:55
Also for URI netUrl = new URI("www.google.com"); netUrl.getHost() returns NULL. I think I still need to check for http:// or https:// — RandomQuestion, Commented Mar 7, 2012 at 19:55
@Jitendra, www.google.com is a relative URL with a path component that is www.google.com. For example, if resolved against http://example.com/, you would get http://example.com/www.google.com. — Mike Samuel, Commented Mar 7, 2012 at 19:58
URI host will be null if it contains special characters, for example: "öob.se" — inc, Commented Feb 28, 2018 at 7:58
If the domain name contains an underscore ( _ ), then uri.getHost() returns null. — user2128672, Commented Mar 30, 2020 at 13:45
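The relative-URL behaviour discussed in the comments above can be checked with a short sketch. The class name RelativeReferenceDemo and the helper resolveAgainst are hypothetical; the URLs are the ones from the comments.

```java
import java.net.URI;

final class RelativeReferenceDemo {

    // Hypothetical helper: resolves a reference against a base, as a browser would.
    static String resolveAgainst(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    public static void main(String[] args) {
        URI relative = URI.create("www.google.com");
        System.out.println(relative.getHost()); // null: there is no authority
        System.out.println(relative.getPath()); // www.google.com: it is all path

        // Resolving against a base shows where the text ends up.
        System.out.println(resolveAgainst("http://example.com/", "www.google.com"));
        // http://example.com/www.google.com
    }
}
```

So if the inputs are known to be host names typed without a scheme, prepending one before parsing is a decision about the input format, not something the URI class can guess.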

Mike Samuel · Accepted Answer · 2014-10-21 13:40:19Z

import java.net.*;
import java.io.*;

public class ParseURL {
  public static void main(String[] args) throws Exception {

    URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                       + "/index.html?name=networking#DOWNLOADING");

    System.out.println("protocol = " + aURL.getProtocol()); //http
    System.out.println("authority = " + aURL.getAuthority()); //example.com:80
    System.out.println("host = " + aURL.getHost()); //example.com
    System.out.println("port = " + aURL.getPort()); //80
    System.out.println("path = " + aURL.getPath()); //  /docs/books/tutorial/index.html
    System.out.println("query = " + aURL.getQuery()); //name=networking
    System.out.println("filename = " + aURL.getFile()); // /docs/books/tutorial/index.html?name=networking
    System.out.println("ref = " + aURL.getRef()); //DOWNLOADING
  }
}

Thanks! We can use Java URL class instead of Android Uri for unit-testing. Now I see a difference between authority and host. — CoolMind, Commented Dec 23, 2022 at 14:11

Community · Accepted Answer · 2017-05-23 11:54:59Z

Here is a short and simple line using InternetDomainName.topPrivateDomain() in Guava: InternetDomainName.from(new URL(url).getHost()).topPrivateDomain().toString()

Given http://www.google.com/blah, that will give you google.com. Or, given http://www.google.co.mx, it will give you google.co.mx.

As Sa Qada commented in another answer on this post, this question has been asked earlier: Extract main domain name from a given url. The best answer to that question is from Satya, who suggests Guava's InternetDomainName.topPrivateDomain()

public boolean isTopPrivateDomain()

Indicates whether this domain name is composed of exactly one subdomain component followed by a public suffix. For example, returns true for google.com and foo.co.uk, but not for www.google.com or co.uk.

Warning: A true result from this method does not imply that the domain is at the highest level which is addressable as a host, as many public suffixes are also addressable hosts. For example, the domain bar.uk.com has a public suffix of uk.com, so it would return true from this method. But uk.com is itself an addressable host.

This method can be used to determine whether a domain is probably the highest level for which cookies may be set, though even that depends on individual browsers' implementations of cookie controls. See RFC 2109 for details.

Putting that together with URL.getHost(), which the original post already contains, gives you:

import com.google.common.net.InternetDomainName;

import java.net.URL;

public class DomainNameMain {

  public static void main(final String... args) throws Exception {
    final String urlString = "http://www.google.com/blah";
    final URL url = new URL(urlString);
    final String host = url.getHost();
    final InternetDomainName name = InternetDomainName.from(host).topPrivateDomain();
    System.out.println(urlString);
    System.out.println(host);
    System.out.println(name);
  }
}

no preview on this? — gumuruh, Commented Apr 25, 2023 at 18:10
@gumuruh, do you mean a Javascript snippet? This is Java. — Kirby, Commented Apr 25, 2023 at 20:14

Adil Hussain · Accepted Answer · 2023-12-20 12:26:34Z

I wrote a method (see below) which extracts a url's domain name and which uses simple String matching. What it actually does is extract the bit between the first "://" (or index 0 if there's no "://" contained) and the first subsequent "/" (or index String.length() if there's no subsequent "/"). The remaining, preceding "www(_)*." bit is chopped off. I'm sure there will be cases where this is not good enough but it should be good enough for most cases.

Mike Samuel's answer says that the java.net.URI class can do this and is preferred to the java.net.URL class but I encountered problems with the java.net.URI class. Notably, the java.net.URI.getHost() method returns a null value if the url does not include the scheme, i.e. the "http(s)" bit.

/**
 * Extracts the domain name from {@code url}
 * by means of String manipulation
 * rather than using the {@link URI} or {@link URL} class.
 *
 * @param url is non-null.
 * @return the domain name within {@code url}.
 */
public String getUrlDomainName(String url) {
  String domainName = url;

  int index = domainName.indexOf("://");

  if (index != -1) {
    // keep everything after the "://"
    domainName = domainName.substring(index + 3);
  }

  index = domainName.indexOf('/');

  if (index != -1) {
    // keep everything before the '/'
    domainName = domainName.substring(0, index);
  }

  // check for and remove a preceding 'www'
  // followed by any sequence of characters (non-greedy)
  // followed by a '.'
  // from the beginning of the string
  domainName = domainName.replaceFirst("^www.*?\\.", "");

  return domainName;
}

I think this might not be correct for http://bob.com:8080/service/read?name=robert — Lee Meador, Commented Mar 21, 2019 at 15:15
Thanks for pointing out Lee. Note that I did qualify my answer with "I'm sure there'll be cases where this won't be good enough...". My answer will need some slight modifying for your particular case. — Adil Hussain, Commented Mar 22, 2019 at 17:27

Lee Meador · Accepted Answer · 2019-03-21 15:37:28Z

All the above are good. This one seems really simple to me and easy to understand. Excuse the quotes. I wrote it for Groovy inside a class called DataCenter.

static String extractDomainName(String url) {
    int start = url.indexOf('://')
    if (start < 0) {
        start = 0
    } else {
        start += 3
    }
    int end = url.indexOf('/', start)
    if (end < 0) {
        end = url.length()
    }
    String domainName = url.substring(start, end)

    int port = domainName.indexOf(':')
    if (port >= 0) {
        domainName = domainName.substring(0, port)
    }
    domainName
}

And here are some junit4 tests:

@Test
void shouldFindDomainName() {
    assert DataCenter.extractDomainName('http://example.com/path/') == 'example.com'
    assert DataCenter.extractDomainName('http://subpart.example.com/path/') == 'subpart.example.com'
    assert DataCenter.extractDomainName('http://example.com') == 'example.com'
    assert DataCenter.extractDomainName('http://example.com:18445/path/') == 'example.com'
    assert DataCenter.extractDomainName('example.com/path/') == 'example.com'
    assert DataCenter.extractDomainName('example.com') == 'example.com'
}

horvoje · Accepted Answer · 2020-01-11 16:41:32Z

In my case I only needed the main domain and not the subdomain (no "www" or whatever the subdomain is):

public static String getUrlDomain(String url) throws URISyntaxException {
    URI uri = new URI(url);
    String domain = uri.getHost();
    String[] domainArray = domain.split("\\.");
    if (domainArray.length == 1) {
        return domainArray[0];
    }
    return domainArray[domainArray.length - 2] + "." + domainArray[domainArray.length - 1];
}

With this method, the URL "https://rest.webtoapp.io/llSlider?lg=en&t=8" yields the domain "webtoapp.io".

migueloop · Accepted Answer · 2014-11-04 11:13:07Z

I added a small pre-processing step before creating the URI object:

if (url.startsWith("http:/")) {
    if (!url.contains("http://")) {
        url = url.replaceAll("http:/", "http://");
    }
} else {
    url = "http://" + url;
}
URI uri = new URI(url);
String domain = uri.getHost();
return domain.startsWith("www.") ? domain.substring(4) : domain;

Benjamin Zach · Accepted Answer · 2020-10-26 15:13:05Z

val host = url.split("/")[2]

Answered Oct 26, 2020 at 7:20 by Dean Spencer; edited Oct 26, 2020 at 15:13 by Benjamin Zach.

Warna Agung · Accepted Answer · 2015-07-06 15:56:02Z

Try this one, using java.net.URL:

JOptionPane.showMessageDialog(null, getDomainName(new URL("https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains")));

public String getDomainName(URL url) {
    String[] strhost = url.getHost().split(Pattern.quote("."));
    String[] strTLD = {"com", "org", "net", "int", "edu", "gov", "mil", "arpa"};
    int n = strhost.length;

    // A single-label host (e.g. "localhost") has nothing to trim.
    if (n < 2) {
        return url.getHost();
    }
    // Generic TLD, or only two labels: keep the last two labels.
    if (Arrays.asList(strTLD).contains(strhost[n - 1]) || n == 2) {
        return strhost[n - 2] + "." + strhost[n - 1];
    }
    // Otherwise assume a two-part suffix (e.g. "co.uk") and keep the last three labels.
    return strhost[n - 3] + "." + strhost[n - 2] + "." + strhost[n - 1];
}

Community · Accepted Answer · 2017-05-23 12:18:25Z

There is a similar question: Extract main domain name from a given url. If you take a look at this answer, you will see that it is very easy. You just need to use java.net.URL and the String split utility.

Answered Dec 25, 2015 at 21:26 by Ayaz Alifov; edited May 23, 2017 at 12:18 by Community Bot.

Shivam Yadav · Accepted Answer · 2020-01-03 07:19:09Z

One approach that worked for all of my cases is to combine the Guava library with a regex fallback.

public static String getDomainNameWithGuava(String url) throws MalformedURLException, 
  URISyntaxException {
    String host =new URL(url).getHost();
    String domainName="";
    try{
        domainName = InternetDomainName.from(host).topPrivateDomain().toString();
    }catch (IllegalStateException | IllegalArgumentException e){
        domainName= getDomain(url,true);
    }
    return domainName;
}

getDomain() here stands for any regex-based fallback method of your choice.

horvoje · Accepted Answer · 2020-01-13 10:29:01Z

private static final String hostExtractorRegexString = "(?:https?://)?(?:www\\.)?(.+\\.)(com|au\\.uk|co\\.in|be|in|uk|org\\.in|org|net|edu|gov|mil)";
private static final Pattern hostExtractorRegexPattern = Pattern.compile(hostExtractorRegexString);

public static String getDomainName(String url){
    if (url == null) return null;
    url = url.trim();
    Matcher m = hostExtractorRegexPattern.matcher(url);
    if(m.find() && m.groupCount() == 2) {
        return m.group(1) + m.group(2);
    }
    return null;
}

Explanation: The regex has 4 groups. The first two are non-capturing groups and the next two are capturing groups.

The first non-capturing group is "http://" or "https://" or ""

The second non-capturing group is "www." or ""

The first capturing group is anything after the non-capturing groups and before the top-level domain

The second capturing group is the top-level domain

The concatenation of the two capturing groups gives us the domain/host name.

PS : Note that you can add any number of supported domains to the regex.
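To see the two capturing groups in action, here is a small sketch that runs the same expression and returns the groups separately. The class CaptureGroupDemo and the method groupsOf are names made up for this example; the pattern string is copied unchanged from the answer. Note that Matcher.groupCount() counts only capturing groups, which is why it is 2 for this pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CaptureGroupDemo {

    // Same expression as in the answer, copied unchanged.
    private static final Pattern HOST = Pattern.compile(
        "(?:https?://)?(?:www\\.)?(.+\\.)(com|au\\.uk|co\\.in|be|in|uk|org\\.in|org|net|edu|gov|mil)");

    // Returns {group 1, group 2}, or an empty array when nothing matches.
    static String[] groupsOf(String url) {
        Matcher m = HOST.matcher(url);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = groupsOf("https://www.example.com");
        System.out.println(g[0]); // example.
        System.out.println(g[1]); // com

        String[] h = groupsOf("http://example.co.in/");
        System.out.println(h[0] + h[1]); // example.co.in
    }
}
```

Because the first capturing group is greedy, it runs up to the last dot that is followed by one of the listed suffixes anywhere in the input, so it is worth testing this against URLs that have dots in the path.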

spaceMonkey · Accepted Answer · 2015-04-22 07:36:25Z

If the input URL comes from user input, this method returns the most appropriate host name. If none is found, it returns the input URL (lowercased).

private String getHostName(String urlInput) {
        urlInput = urlInput.toLowerCase();
        String hostName=urlInput;
        if(!urlInput.equals("")){
            if(urlInput.startsWith("http") || urlInput.startsWith("https")){
                try{
                    URL netUrl = new URL(urlInput);
                    String host= netUrl.getHost();
                    if(host.startsWith("www")){
                        hostName = host.substring("www".length()+1);
                    }else{
                        hostName=host;
                    }
                }catch (MalformedURLException e){
                    hostName=urlInput;
                }
            }else if(urlInput.startsWith("www")){
                hostName=urlInput.substring("www".length()+1);
            }
            return  hostName;
        }else{
            return  "";
        }
    }

nhcodes · Accepted Answer · 2020-05-22 18:34:16Z

To get the actual domain name, without the subdomain, I use:

private String getDomainName(String url) throws URISyntaxException {
    String hostName = new URI(url).getHost();
    if (!hostName.contains(".")) {
        return hostName;
    }
    String[] host = hostName.split("\\.");
    return host[host.length - 2];
}

Note that this won't work with second-level domains (like .co.uk).

Abdennour TOUMI · Accepted Answer · 2021-01-19 07:56:42Z

// groovy
def hostname = { String url -> url[(url.indexOf('://') + 3)..-1].split('/')[0] }

hostname('http://hello.world.com/something') // return 'hello.world.com'
hostname('docker://quay.io/skopeo/stable') // return 'quay.io'

seunggabi · Accepted Answer · 2022-06-19 15:35:25Z

import java.net.URL

const val WWW = "www."

// Extension function: the URL's host with a leading "www." removed.
fun URL.domain(): String {
    val domain: String = this.host
    return if (domain.startsWith(WWW)) {
        domain.substring(WWW.length)
    } else {
        domain
    }
}
