Extract and remove anchor tags which are not allowed

Question

I am trying to write a script that does the following:

Read content from a file or database
Extract all anchor tags from the content
Scan all links and keep those that are allowed (e.g. links to social networks, search engines, or authority domains), and remove the rest while preserving their content (the anchor text).

Example content:

<a rel="nofollow" href="http://www.test.com/tyest">test1</a>
<a href="http://google.com">google</a>
<a title="This is just a check" href="http://www.check.com">check</a>
<a rel="nofollow" href="http://www.ip.com">http://www.ip.com</a>

Allowed Domains:

google.com
msn.com
ip.com

Desired Output:

test1
<a href="http://google.com">google</a>
check
<a rel="nofollow" href="http://www.ip.com">http://www.ip.com</a>

Limitations:

Anchor tags will not follow any specific rules: they may or may not contain the rel, title, and description attributes, and in any order.
The anchor text itself can be a link as well, e.g. http://google.com, and it should be preserved even if the link is not allowed.

I did my homework and tried writing a simple, bare-bones script as a starting point, using different regexes along with the help available online, but didn't succeed. Here is my code:

// sample input
$comment = '<p><a rel="nofollow" href="http://www.1google.com/tyest">test with no http</a></p>
                <p><a rel="nofollow" href="http://google.com">just a domain name</a></p>
                <p><a rel="nofollow" href="http://www.g1gle.com">check</a></p>
                <p><a rel="nofollow" href="http://www.ip.com">http://www.ip.com</a></p>
                <p><a rel="nofollow" href="http://osamashabrez.com">http://testx.osamashabrez.com</a></p>
                <p><a rel="nofollow" href="http://www.subchalega.com">http://www.subchalega.com</a></p>
                <p><a rel="nofollow" href="http://www.letscheck.com">http://www.letscheck.com</a></p>
                <p><a rel="nofollow" href="http://www.google.com/osama/here/">http://www.google.com</a></p>
                <p><a rel="nofollow" description="testing" title="google" href="http://www.google.com/last/">laaaaaaaa</a></p><h1>Header one</h1>
                <p><a rel="nofollow" href="http://domain1.com">http://testx.osamashabrez.com</a></p>';

// add http to the domain name if not already present
function addhttp($url) {
    if (!preg_match('~^(?:f|ht)tps?://~i', $url)) {
        $url = 'http://' . $url;
    }
    return $url;
}

// remove deep links so the URL can be matched against the allowed URLs
function removeDeepLinks($url) {
    $pos = strrpos ( $url, '.com' );
    if ( $pos !== false )
        return substr( $url, 6, $pos-2 );
    return $url;
}
// allowed domains fetched from the db
$domains = "http://osamashabrez.com\rhttp://google.com\rwordpress.org\rabc.com";
$domains = preg_split( "~\r~", $domains, -1, PREG_SPLIT_NO_EMPTY );
// adding http if not already present
// (this will eventually be done once, when the data is inserted)
foreach ( $domains as $key => $domain ) { $domains[$key] = addhttp($domain); }
// remove this and sky will fall on your head :D
sort( $domains );
print_r ( $domains );
// regex to extract the href="xyz.com" link, as we cannot use any other option
// due to the uncertainty of the data passed to this script
$regex = '/(href=".*?")/is';
if ($c=preg_match_all ($regex, $comment, $matches)) {
    $matches = $matches[1];
    foreach ( $matches as $key => $url ) {
        // remove deep links for matching
        $matches[$key] = removeDeepLinks($url);
    }
    print_r($matches);
    foreach( $matches as $key => $url ) {
        // if domain is not allowed
        if ( !array_search( $url, $domains ) ) {
            // find position of URL
            $pos_url     = strrpos( $comment, $url );
            // find the starting position of the anchor tag
            $pos_a_start = strrpos(substr($comment, 0, $pos_url), '<a ');
            // find the end
            $pos_a_end   = strpos($comment, '</a>',$pos_url);
            // extract the whole anchor tag
            $anchor_tag  = substr($comment, $pos_a_start, $pos_a_end - $pos_a_start + 4);
            echo "URL:\t" .$url . "\r";
            echo "Anchor Tag:\t{$anchor_tag}\r";
            echo "POS START :: END:\t{$pos_a_start}::{$pos_a_end}\r";


            // something weird happens here: commenting out this line works, but then only
            // the opening tags are removed from the text
            // the code works with some data inputs and does not work with a few others
            $comment = substr($comment, 0, $pos_a_end) . substr($comment, $pos_a_end+4);
            // removing opening tags
            $opening_tag = substr( $anchor_tag, 0, strpos($anchor_tag, '>') +1 );
            $comment = str_replace($opening_tag, '', $comment);
        }
    }
}
echo $comment;

The above code works with some inputs and breaks on others. I'd like some help: a working code example or a review of the code I provided. Please also mention if there is a better way to get this done. Any help will be highly appreciated.

Thanks

The DomDocument class is your friend on this one. It'll make this task much easier than using regex. — drew010, Commented Jul 7, 2012 at 20:44
@Juhana I know HTML cannot be parsed with regex, and for that reason I used the regex only to extract the complete href="xxx" string; the rest I implemented myself. — Osama Shabrez, Commented Jul 7, 2012 at 21:03
@drew010 I did search a lot of programming forums and Google, but none of them mentioned this class. I'll study it and post the results today, as it's already 2 AM here in Pakistan. A working code example would be a great help. — Osama Shabrez, Commented Jul 7, 2012 at 21:03
@OsamaShabrez It will parse the (X)/HTML document into a structure where you can query all of the <a> tags and iterate over them in a foreach and examine any attributes (href) you want. See also DomXPath::query() which will allow you to get all <a> tags. Also check out this tutorial — drew010, Commented Jul 7, 2012 at 21:11
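
As a rough illustration of what the comment above describes, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath to collect the host of every link. The sample markup and the variable names ($html, $hosts) are assumptions made for this example, not part of the original question.

```php
<?php
// Sample markup (an assumption for this sketch), modelled on the question's input
$html = '<p><a rel="nofollow" href="http://www.ip.com">http://www.ip.com</a></p>'
      . '<p><a title="google" href="http://www.google.com/last/">laaaaaaaa</a></p>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Query every <a> element that has an href attribute, regardless of
// which other attributes it carries or in what order they appear
$xpath = new DOMXPath($doc);
$hosts = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $hosts[] = parse_url($a->getAttribute('href'), PHP_URL_HOST);
}
print_r($hosts);
```

Because the parser handles attribute order and quoting, the loop only has to look at the href value itself.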

hakre · Accepted Answer · 2012-07-08 07:45:25Z

A DOM parser is better suited for this task.

There are many options to choose from. Here is an example using QueryPath:

// $html holds the markup; $allowedHosts is an array of permitted host names
$qp = qp($html);
foreach ($qp->find('a') as $link) {
    $href = $link->attr('href');
    // Get the host domain
    $host = parse_url($href, PHP_URL_HOST);
    // Check our allowed hosts
    if (!in_array($host, $allowedHosts)) {
        // Replace the link's HTML with just its text
        $link->html($link->text());
    }
}
// Echo our result
echo $qp->top()->html();

(Not tested, but should work with a few modifications.)
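
One modification that is probably needed: parse_url() returns the full host (for example www.ip.com), so a plain in_array() check against ip.com would not match. Below is a minimal sketch of a more forgiving check. The helper name isAllowedHost() is hypothetical and not part of QueryPath or PHP, and treating every subdomain of an allowed domain as allowed is an assumption about the requirements.

```php
<?php
// Hypothetical helper: true when $href points at an allowed domain
// or at any subdomain of one (an assumption about the desired rules).
function isAllowedHost(string $href, array $allowedDomains): bool
{
    $host = parse_url($href, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // relative, malformed, or host-less URL
    }
    $host = strtolower($host);
    foreach ($allowedDomains as $domain) {
        $domain = strtolower($domain);
        // Exact match, or the host ends with ".<domain>"
        if ($host === $domain
            || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }
    return false;
}

$allowed = ['google.com', 'msn.com', 'ip.com'];
var_dump(isAllowedHost('http://www.ip.com', $allowed));            // bool(true)
var_dump(isAllowedHost('http://www.1google.com/tyest', $allowed)); // bool(false)
```

Requiring the leading dot for the suffix match is what keeps look-alike hosts such as www.1google.com from passing as google.com.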

edited Jul 8, 2012 at 7:45 by hakre
answered Jul 7, 2012 at 22:34 by Petah

I tried your loop and it works pretty well, except that it does not let me change the anchor tag itself. I am unable to use the loop with values passed by reference, i.e. foreach ($qp->find('a') as &$link), and the API documentation does not offer any help with this issue. Any help would be highly appreciated.
– Osama Shabrez
Commented Jul 11, 2012 at 15:35
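
For what it's worth, removing the anchor element itself does not require iterating by reference, because DOM nodes are objects. The following is a minimal sketch using PHP's built-in DOMDocument rather than QueryPath. The function name stripDisallowedLinks() is hypothetical, and the matching rule (strip a leading "www." and compare exactly) is a simplifying assumption. Links without a host, such as relative links, are also unwrapped here, and non-ASCII input would need an explicit encoding hint that this sketch omits.

```php
<?php
// Hypothetical helper: unwrap every <a> whose host is not whitelisted,
// keeping its contents (the anchor text) in place.
function stripDisallowedLinks(string $html, array $allowedDomains): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    // Wrap the fragment in a <div> so it can be located again afterwards
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list first, because the tree changes while looping
    $links = iterator_to_array($doc->getElementsByTagName('a'), false);
    foreach ($links as $link) {
        $host = strtolower((string) parse_url($link->getAttribute('href'), PHP_URL_HOST));
        $host = preg_replace('/^www\./', '', $host);
        if (in_array($host, $allowedDomains, true)) {
            continue; // allowed: leave the anchor untouched
        }
        // Move the children out in front of the anchor, then drop the empty anchor
        while ($link->firstChild) {
            $link->parentNode->insertBefore($link->firstChild, $link);
        }
        $link->parentNode->removeChild($link);
    }

    // Serialize only the contents of the wrapper <div>
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$content = '<a rel="nofollow" href="http://www.test.com/tyest">test1</a>
<a href="http://google.com">google</a>
<a title="This is just a check" href="http://www.check.com">check</a>
<a rel="nofollow" href="http://www.ip.com">http://www.ip.com</a>';

echo stripDisallowedLinks($content, ['google.com', 'msn.com', 'ip.com']);
```

With the example content from the question, this should leave the google.com and ip.com anchors intact and reduce the other two to their text, although the exact whitespace in the output depends on the parser.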
