I am trying to write a script that does the following:
- Read content from a file or database
- Extract all anchor tags from the content
- Scan all links and keep those that are allowed (e.g. links to social networks, search engines, or authority domains); remove the rest while preserving their content (the anchor text).
Example content:
<p><a rel="nofollow" href="http://www.test.com/tyest">test1</a></p>
<p><a href="http://google.com">google</a></p>
<p><a title="This is just a check" href="http://www.check.com">check</a></p>
<p><a rel="nofollow" href="http://www.ip.com">http://www.ip.com</a></p>
Allowed Domains:
google.com
msn.com
ip.com
Desired Output:
<p>test1</p>
<p><a href="http://google.com">google</a></p>
<p>check</p>
<p><a rel="nofollow" href="http://www.ip.com">http://www.ip.com</a></p>
Limitations:
- Anchor tags will not follow any specific format: they may or may not contain the rel, title, and description attributes, and these can appear in any order.
- The anchor text itself can be a URL as well, e.g. http://google.com; that text should be preserved even if the link is not allowed.
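To make the matching rule concrete, here is a minimal sketch of how a link's host could be compared against the allowed list. It assumes that subdomains count as their parent domain (so www.ip.com matches ip.com, as in the desired output above) and that the allowed list is stored in lowercase; the helper name isAllowedHost is hypothetical.

```php
<?php
// Hypothetical helper: true when the URL's host is an allowed domain
// or a subdomain of one (assumption: www.ip.com counts as ip.com).
function isAllowedHost(string $url, array $allowed): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // relative or malformed URL: treat as not allowed
    }
    $host = strtolower($host);
    foreach ($allowed as $domain) {
        // exact match, or the host ends with ".<domain>"
        if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }
    return false;
}

$allowed = ['google.com', 'msn.com', 'ip.com'];
var_dump(isAllowedHost('http://www.ip.com', $allowed));         // bool(true)
var_dump(isAllowedHost('http://www.test.com/tyest', $allowed)); // bool(false)
```

Comparing whole host names this way also keeps look-alikes such as www.1google.com from matching google.com, which a plain substring search would allow.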
I did my homework and tried writing a simple bare-bones script as a starting point, using various regexes along with the help available online, but didn't succeed. Here is my code:
// sample input
$comment = '<p><a rel="nofollow" href="http://www.1google.com/tyest">test with no http</a></p>
<p><a rel="nofollow" href="http://google.com">just a domain name</a></p>
<p><a rel="nofollow" href="http://www.g1gle.com">check</a></p>
<p><a rel="nofollow" href="http://www.ip.com">http://www.ip.com</a></p>
<p><a rel="nofollow" href="http://osamashabrez.com">http://testx.osamashabrez.com</a></p>
<p><a rel="nofollow" href="http://www.subchalega.com">http://www.subchalega.com</a></p>
<p><a rel="nofollow" href="http://www.letscheck.com">http://www.letscheck.com</a></p>
<p><a rel="nofollow" href="http://www.google.com/osama/here/">http://www.google.com</a></p>
<p><a rel="nofollow" description="testing" title="google" href="http://www.google.com/last/">laaaaaaaa</a></p><h1>Header one</h1>
<p><a rel="nofollow" href="http://domain1.com">http://testx.osamashabrez.com</a></p>';
// add http to the domain name if not already present
function addhttp($url) {
    if (!preg_match('~^(?:f|ht)tps?://~i', $url)) {
        $url = 'http://' . $url;
    }
    return $url;
}
// remove deep links so the URL can be matched against the allowed URLs
function removeDeepLinks($url) {
    $pos = strrpos ( $url, '.com' );
    if ( $pos !== false )
        return substr( $url, 6, $pos-2 );
    return $url;
}
// allowed domains fetched from the db
$domains = "http://osamashabrez.com\rhttp://google.com\rwordpress.org\rabc.com";
$domains = preg_split( "~\r~", $domains, -1, PREG_SPLIT_NO_EMPTY );
// add http if not already present
// (this will be done once, when the data is inserted)
foreach ( $domains as $key => $domain ) { $domains[$key] = addhttp($domain); }
// remove this and sky will fall on your head :D
sort( $domains );
print_r ( $domains );
// regex to extract the href="xyz.com" link, as we cannot use any other option
// due to the uncertainty of the data passed to this script
$regex = '/(href=".*?")/is';
if ($c=preg_match_all ($regex, $comment, $matches)) {
    $matches = $matches[1];
    foreach ( $matches as $key => $url ) {
        // remove deep links for matching
        $matches[$key] = removeDeepLinks($url);
    }
    print_r($matches);
    foreach( $matches as $key => $url ) {
        // if domain is not allowed
        if ( !array_search( $url, $domains ) ) {
            // find the position of the URL
            $pos_url = strrpos( $comment, $url );
            // find the starting position of the anchor tag
            $pos_a_start = strrpos(substr($comment, 0, $pos_url), '<a ');
            // find the end
            $pos_a_end = strpos($comment, '</a>',$pos_url);
            // extract the whole anchor tag
            $anchor_tag = substr($comment, $pos_a_start, $pos_a_end - $pos_a_start + 4);
            echo "URL:\t" .$url . "\r";
            echo "Anchor Tag:\t{$anchor_tag}\r";
            echo "POS START :: END:\t{$pos_a_start}::{$pos_a_end}\r";
            // something weird happens here: commenting out this line works, but then
            // only the opening tags are removed from the text
            // the code does work with some data inputs and does not work with a few others
            $comment = substr($comment, 0, $pos_a_end) . substr($comment, $pos_a_end+4);
            // remove the opening tag
            $opening_tag = substr( $anchor_tag, 0, strpos($anchor_tag, '>') +1 );
            $comment = str_replace($opening_tag, '', $comment);
        }
    }
}
echo $comment;
The above code works with some data inputs and breaks on others. I would like some help: either a working code example or a review of the code I have provided. Please also mention if there is a better way to get this done. Any help will be highly appreciated.
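For reference, one alternative to regexes for this kind of task is PHP's DOM extension. Below is a rough sketch of that idea, not a finished solution. It assumes the dom extension is available, that the input is plain ASCII HTML, and that subdomains count as their parent domain; the function name stripDisallowedLinks and the wrapper div are made up for illustration.

```php
<?php
// Hypothetical function: unwrap every <a> whose host is not in $allowed,
// keeping the anchor text. Assumes ASCII input (no encoding handling).
function stripDisallowedLinks(string $html, array $allowed): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings on sloppy markup
    // Wrap the fragment in a div so it has a single root to read back from.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    $wrap = $doc->getElementsByTagName('div')->item(0);

    // Copy the live node list to an array before modifying the tree.
    foreach (iterator_to_array($wrap->getElementsByTagName('a')) as $a) {
        $host = strtolower((string) parse_url($a->getAttribute('href'), PHP_URL_HOST));
        $ok = false;
        foreach ($allowed as $domain) {
            // exact match, or the host ends with ".<domain>"
            if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
                $ok = true;
                break;
            }
        }
        if ($ok) {
            continue;
        }
        // Move the anchor's children in front of it, then drop the empty tag.
        while ($a->firstChild) {
            $a->parentNode->insertBefore($a->firstChild, $a);
        }
        $a->parentNode->removeChild($a);
    }

    // Serialize only the wrapper's children, not the html/body shell.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$allowed = ['google.com', 'msn.com', 'ip.com'];
echo stripDisallowedLinks('<p><a rel="nofollow" href="http://www.test.com/tyest">test1</a></p>', $allowed);
```

Because the parser handles attributes itself, the order or presence of rel, title, or description should not matter with this approach.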
Thanks