Search Engine Creation 101

PHP Best Practices

A basic PHP search engine.


Date : 2006-02-21
Search Engine Creation 101

A while back I was working on rebuilding a customers web site from code that hadnt seen the light of day for at least five years. Part of this work entailed fixing the existing search engine so that it would more precisely display results about particular portions of the site.

After some trial and error it became apparent to me that there was going to be no way to effectively patch the existing search system to achieve what the client was after and so I was forced to build a new engine from scratch. The time frame for launch of this system being a very small window meant I didnt have the luxury of adding every bell and whistle I would have liked but this also had the effect of making the engine very lean and very clean.

To begin with I needed a way to store the information contained in the searchable pages in a way that I could easily parse in order to get some valid results. With that in mind I created several tables in a MySQL database that I entitled 'articles', as it was a group of articles that were going to be searchable.

My preferred method of creating databases is to use the MySQL Admin interface to define a schema - its a seamless tool that can be found here and is very useful for all kinds of tasks that an administrator may need to perform on a database. Still, not everyone likes or has access to a gui for creating schemas in MySQL so the code for the three tables that I set up is as follows:

First create the database:

Code:
`articles`.CREATE DATABASE `articles` /*!40100 DEFAULT CHARACTER SET latin1 */;


Next create the tables in the database:

Code:
CREATE TABLE articles.page (
   page_id int(10) unsigned NOT NULL auto_increment,
   page_url varchar(200) NOT NULL default '',
   PRIMARY KEY (page_id)
) ENGINE=MyISAM;

CREATE TABLE articles.word (
   word_id int(10) unsigned NOT NULL auto_increment,
   word_word varchar(50) NOT NULL default '',
   PRIMARY KEY (word_id)
) ENGINE=MyISAM;

CREATE TABLE articles.occurrence (
   occurrence_id int(10) unsigned NOT NULL auto_increment,
   word_id int(10) unsigned NOT NULL default '0',
   page_id int(10) unsigned NOT NULL default '0',
   PRIMARY KEY (occurrence_id)
) ENGINE=MyISAM;


So now we have three tables in a database called 'articles'. Later we will use this database to store data that we can easily search using standard MySQL queries to return specific data about the pages we will be linking to.

The next thing we need to do is add some data to the tables so that we'll have something to search for. In my case the data we needed to search was in several html files across various directories within the web sites structure. Initially I used the following script to add one page at a time and parse it into the database:

Code:
<?php
if( $_POST['MyURL'] )
{
  mysql_pconnect("myhost","mylogin","mypass")
    or die("ERROR: Could not connect to database!");

  mysql_select_db("articles");

  /* Grab the URL from POST */

  $url = addslashes( $_POST['MyURL'] );

  if( !$url )
  {
     die( "You need to define a URL to process." );
  }
  else if( substr($url,0,7) != "http://" )
  {
     $url = "http://$url";
  }

  /* Check to see if we already have this page */
  $result = mysql_query("SELECT page_id FROM page WHERE page_url = "$url"");
  $row = mysql_fetch_array($result);

  if( $row['page_id'] )
  {
     /* If yes, use the old page_id: */
     $page_id = $row['page_id'];
  }
  else
  {
     /* If not, create one: */

     $page = $url;

     // tags
     $start = '<title>';
     $end = '</title>';
   
     // open the file
     $fp = fopen( $page, 'r' );

     $cont = "";

     // read the contents into a buffer
     while( !feof( $fp ) ) {
         $buf = trim( fgets( $fp, 4096 ) );
         $cont .= $buf;
     }
   
     // get tag contents
     preg_match( "/$start(.*)$end/s", $cont, $match );

     $contents = $match[ 1 ];

     mysql_query("INSERT INTO page (page_url) VALUES ("$url")");
     mysql_query("UPDATE page SET page_title="$contents" where page_url="$url"");
     $current_row = mysql_query("SELECT page_id FROM page WHERE page_url = "$url"");
     $cur_result = mysql_fetch_array($current_row);
     $page_id = $cur_result['page_id'];

  }

  /* Parse the text into the db: */
  if( !($fd = fopen($url,"r")) )
     die( "Could not open URL!" );

  while( $buf = fgets($fd,1024) )
  {
     $buf = trim($buf);

     /* Remove all HTML-tags: */
     $buf = strip_tags($buf);
     $buf = ereg_replace('/&w;/', '', $buf);

     /* Use regexp to get words: */
     preg_match_all("/(b[w+]+b)/",$buf,$words);

     /* Loop through the data and insert them it into the db: */
     for( $i = 0; $words[$i]; $i++ )
     {
        for( $j = 0; $words[$i][$j]; $j++ )
        {
           /* Check to see if we already have a record for this in the word table: */
           $cur_word = addslashes( strtolower($words[$i][$j]) );

           $result = mysql_query("SELECT word_id FROM word
                                  WHERE word_word = '$cur_word'");
           $row = mysql_fetch_array($result);
           if( $row['word_id'] )
           {
              /* If its there use the old word_id: */
              $word_id = $row['word_id'];
           }
           else
           {
              /* If its not then create it: */
              mysql_query("INSERT INTO word (word_word) VALUES ("$cur_word")");
              $word_id = mysql_insert_id();
           }

           /* Register the word: */
           mysql_query("INSERT INTO occurrence (word_id,page_id)
                        VALUES ($word_id,$page_id)");
           echo "Indexing: $cur_word<br>";
        }
     }
  }
  fclose($fd);
}

else {
        echo "<TABLE width=100% ALIGN=CENTER><tr><td width=100%><br><br><br>";
        echo "<form method='post'> <center>URL: <input type='text' size='40' name='MyURL'>n";
        echo "<input type='submit' value='AddURL'></form>n<br><br><hr width=75% size=1>";
      }
?>


With a little extra work this script could be modified to search numerous locations automatically but for our purposes this makes it a trivial matter of adding at least a couple pages to our database.

It is good to note that once your database is fully populated with the data that you will be searching that adding an index to the tables can speed up your searches exponentially, therefore it is a good idea to modify the index of your tables with the following:

Code:
CREATE INDEX word_word_ix ON word (word_word);


Indexes allow MySQL to search through data at an incredible rate even given hundreds of thousands of rows.

Now that we have our database populated we need a way to retrieve results from it. I personally like to give people their results without any extraneous data so I used the following script to display the engine and return results:

Code:
<?
if( $_POST['keyword'] )
{
  echo "<TABLE width=100% ALIGN=CENTER><tr><td width=100%><br><center>Search:</center><br><br>";
  echo "<form method='post'> <center>Keyword: <input type='text' size='20' name='keyword'>n";
  echo "<input type='submit' value='Search'></form>n<br><br><hr width=75% size=1>";

  /* Connect to the database */
  mysql_pconnect("myhost","mylogin","mypass")
       or die("ERROR: Could not connect to database!");
  mysql_select_db("articles");

  $keyword = addslashes( $_POST['keyword'] );

  /* The search query: */
  $result = mysql_query(" SELECT p.page_title as title, p.page_url AS url, COUNT(*) AS occurrences FROM page p, word w, occurrence o WHERE p.page_id = o.page_id AND w.word_id = o.word_id AND w.word_word = "$keyword" GROUP BY p.page_id ORDER BY occurrences" );

  /* return results: */
  echo "<h2>Search results for '".$_POST['keyword']."':</h2>n<br><br>";
  for( $i = 1; $row = mysql_fetch_array($result); $i++ )
  {
    echo "$i. <a href='".$row['url']."'>".$row['title']."</a><br><br>n";
  }

}
else
{
  /* If no keyword show the search page */
  echo "<TABLE width=100% ALIGN=CENTER><tr><td width=100%><br><center>Search</center><br><br>";
  echo "<form method='post'> <center>Keyword: <input type='text' size='20' name='keyword'>n";
  echo "<input type='submit' value='Search'></form>n<br><br><hr width=75% size=1>";
}
?>


Note that this particular method allows the use of a single keyword to search the database. With a little modification the search engine could easily be set up to use multiple search terms.

So there you have it. A very basic search engine from start to finish. Though this could be modified to be much more robust the down and dirty details are all here.

Happy searching!

Comments :

HeavyAl 2006-04-07 #6

I think I've seen a similar article somewhere else, but I like the way this is broken down into its primary pieces. Easy to digest the particulars.

iftecan2000 2006-07-27 #31

Hi great Article. But I beleive that MySql Text Search would return the result-set much faster than a plain old select statement.

rafe 2006-08-16 #46

Good point. Like I mentioned near the end of the article, there are many ways to make this more robust and text search is definately one of them. When I originally wrote this code I was going for quick and dirty as I was pressed for time. If you have a solution that involves the text search capability of MySQL then I'd love to see it posted here.

  • Search For Articles