PHP convert HTML to Text


Convert HTML to text in PHP for keyword searching, indexing, or downloading a page as plain text.


Date : 2006-04-10
It is often helpful to read just the text from a page, for example when indexing content across multiple pages for anything from spell checking to keyword searching. The function below is simple: it uses PHP's preg_replace function to strip away unwanted tags and markup, and converts some of the more common HTML entities into their text equivalents. Feel free to ask if you have any questions about this code:

Code:
<?php
function HTML2TEXT($Document) {
$Rules = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                '@([\r\n])[\s]+@',                // Strip out white space after a line break
                '@&(quot|#34);@i',                // Replace HTML entities
                '@&(amp|#38);@i',                 //   Ampersand &
                '@&(lt|#60);@i',                  //   Less Than <
                '@&(gt|#62);@i',                  //   Greater Than >
                '@&(nbsp|#160);@i',               //   Non Breaking Space
                '@&(iexcl|#161);@i',              //   Inverted Exclamation point
                '@&(cent|#162);@i',               //   Cent
                '@&(pound|#163);@i',              //   Pound
                '@&(copy|#169);@i',               //   Copyright
                '@&(reg|#174);@i');               //   Registered
$Replace = array ('',
                  '',
                  '\1',                           // Keep the line break itself
                  '"',
                  '&',
                  '<',
                  '>',
                  ' ',
                  chr(161),                       // chr() returns single bytes (ISO-8859-1)
                  chr(162),
                  chr(163),
                  chr(169),
                  chr(174));
  $Text = preg_replace($Rules, $Replace, $Document);
  // Remaining numeric entities: the /e modifier was removed in PHP 7,
  // so a callback does the chr() conversion instead
  return preg_replace_callback('@&#(\d+);@',
                               function ($Match) { return chr((int) $Match[1]); },
                               $Text);
}
   $ch = curl_init("http://www.bestcodingpractices.com/");
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return the body instead of printing it
   $html = curl_exec($ch);
   $info = curl_getinfo($ch);
   curl_close($ch);
   
   if ($info['http_code']==200) {
     echo "URL is valid<hr>";
     echo HTML2TEXT($html);
   } else {
     echo "URL is not valid<hr>";
   }
?>


The test section at the bottom uses the cURL library to pull content from our homepage, then converts it to text before displaying it.
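
To try the idea without a network request, the conversion can also be run on a fixed string. The sketch below is not the HTML2TEXT function above: it swaps in PHP's built-in strip_tags and html_entity_decode functions, and it assumes the page is UTF-8. The function name html_to_text_builtin and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: a sketch that uses PHP built-ins instead of the regex table.
function html_to_text_builtin(string $html): string
{
    // Remove script and style blocks together with their contents first,
    // because strip_tags() removes only the tags and keeps the inner text.
    $html = preg_replace('@<(script|style)\b[^>]*>.*?</\1>@si', '', $html);

    // Drop the remaining tags.
    $text = strip_tags($html);

    // Decode named and numeric entities (&amp;, &copy;, &#169;, ...) as UTF-8.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace into single spaces and trim the ends.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

// Sample markup, invented for this example.
$sample = '<p>Fish &amp; chips &copy; 2006</p><script>alert("x");</script>';
echo html_to_text_builtin($sample), "\n"; // prints "Fish & chips © 2006"
```

Unlike the chr() calls in HTML2TEXT, which produce single bytes, html_entity_decode here produces UTF-8 characters, so the result can be echoed directly on a UTF-8 page.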
