Implementing a search engine with php and mySQL

Author: Peter Mescalchin
Last update: May 1st 2006

I can be emailed via my contact form.

Table of contents:

Introduction

In this article we will outline a technique for providing relatively simple but effective full text search functionality using PHP and mySQL.

Whilst geared towards PHP and mySQL, the underlying methods here could easily be re-implemented with basically any web application/database technology that the reader is comfortable with. I would expect this article and its example be used as a base for theory and development - this is in no way intended to be a fully-fledged plug-in search engine framework.

In addition, be sure to refer to the links at the end of this document for further reading and information regarding the topics discussed.

Any comments, questions, errors or improvements with regards to this document and the related examples can be directed to my contact form.

Return to top

Requirements

The search engine has been designed with the following requirements in mind:

Return to top

Installing the example

This example requires a working installation of PHP4 or 5 & mySQL (4.x.x or greater) in order to run.

Return to top

Implementation - database

Search index data is managed in five tables. Their purpose and relationship is as follows:

SearchItem

Each SearchItem record holds a reference to a content item indexed by the search engine. The summary information is used to generate search result listings running across multiple content database tables. For this example we store summary data for Title and ItemDateTime, but this could be extended for individual implementation such as images, content summaries, etc.

SearchItemTable

Since we are searching across multiple database tables, SearchItemTable is used to associate a SearchItem record to the database table it represents.

SearchWord

The SearchWord table is a dictionary of every word indexed by the search engine. As content items are indexed and new words encountered, they will be added to this table.

SearchWordStem

Each word defined in SearchWord is associated to a stem word in this table. Calculating stem words can be a rather complex task, so by adding stem words directly to the database during index updates, we avoid the overhead of real time calculations when users search via the front-end website.

SearchItemWordCount

Essentially the core of the search index is to keep count of individual stem words (SearchWordStem) in content items. This data determines what items qualify for a search result and in what order they should be. Due to the large row numbers this table will increase to as more and more content is added, well placed indexing is vital to ensure good performance.

Database diagram

Return to top

Implementation - application code

Adding & removing content items

The purpose of example itemadd.php is to manage the task of adding and removing items to our search index tables by calling methods in our searchindex class. In a real world situation, the functions performed here would either be hooked into web application CRUD functionality or run as a scheduled task/cron.

Lines 13 - 39 - handle add/delete item actions fired from form post actions. The 'additem' case, inserts a new content row into either 'table01' or 'table02' and then calls searchindex method indexitem(). The 'remove' case, deletes an item from a content table then calls the removeitem() method to update the search index.

if (isset($_POST['additem'])) {
    // add new item into the database

    // insert content into content table
    mysql_query('INSERT INTO ' . $_POST['additem'] . ' (title,itemdatetime,content) ' .
                'VALUES (\'' . addslashes($_POST['title']) . '\',NOW(),\'' . addslashes($_POST['content']) . '\')',$dbcnx);

    // get the record identity of the row just added
    $recordid = mysql_insert_id($dbcnx);

    // index new content in search index system
    $indexer = new searchindex;
    $indexer->settable($_POST['additem']);
    $indexer->setrecordid($recordid);
    $indexer->indexitem('title','itemdatetime','content');

} elseif (isset($_GET['remove'])) {
    // remove item from the database
    mysql_query('DELETE FROM ' . $_GET['remove'] . ' WHERE (id = ' . $_GET['id'] . ')',$dbcnx);

    // remove item from search index
    $indexer = new searchindex;
    $indexer->settable($_GET['remove']);
    $indexer->setrecordid($_GET['id']);

    $indexer->removeitem();
}

Lines 42 - 55 - generate an HTML form for submitting new content items.

Lines 58 - 80 - display current content items in the database, with options to delete items.

Updating the search index

The searchindex class (defined in searchindex.php) performs the job of adding, updating and removing items from our search engine datatables.

Lines 18 - 37 - define methods settable() and setrecordid() which need to be called before updating or removing a content item from the search index.

function settable($inputtable) {
    global $dbcnx;


    $dbresult = mysql_query('SELECT id FROM searchitemtable WHERE (name = \'' . $inputtable . '\')',$dbcnx);

    if ($tablerec = mysql_fetch_assoc($dbresult)) {
        $this->tablename = $inputtable;
        $this->tableid = $tablerec['id'];
    } else {
        die('Invalid searchitemtable name: ' . $inputtablename);
    }

    mysql_free_result($dbresult);
}

function setrecordid($inputid) {

    $this->recordid = $inputid;
}

Lines 43 - 56 - retrieve content item from the database table and place data into local variables.

Lines 59 - 70 - firstly, constructs a search index string, with str_repeat() used to make words in the title twice as strong as words in the body. A regular expression is then used to split the index string into its individual words (line 64). Noise words are then removed using array_diff() and the $this-stopwords array. Finally the count of individual words is calculated with the use of the array_count_values() function.

// construct the text to be indexed lowercased - title is twice as powerful as the content so add it twice with str_repeat()
$indextext = strtolower(str_repeat($itemtitle . ' ',2) . $itemcontent) . '!';

// split content into words
// need to add the '!' at the end for the regexp - otherwise the last word in content text could be lost if it ends without punctuation (fullstop/etc.)
preg_match_all('/([a-zA-Z][a-zA-Z-]+[a-zA-Z-])[^a-zA-Z]/',$indextext . '!',$matches);

// remove stop/noise words
$indexwords = array_diff($matches[1],$this->stopwords);

// calculate the count of each word (grouping)
$indexwords = array_count_values($indexwords);

Lines 73 - 90 - create or update a SearchItem record for the content item.

Lines 94 - 132 - each foreach loop handles the creation of SearchWord and SearchWordStem records, using the porterstemmer class to generate stem words. As stem words are generated they get cached in $this->stemwordcache to avoid repeat calls to $porterstem->stem() (line 107). The $indexwordcount[] array is constructed during this process (lines 130 - 137) to hold each retrieved SearchWordStem found and it's frequency.

$porterstem = new porterstemmer;
$indexwordcount = array();

foreach ($indexwords as $word => $wordcount) {
    // only words greater than two characters are considered
    if (strlen($word) <= 2) break;

    // porter stem word
    if (isset($this->stemwordcache[$word])) {
        // fetch stem word from array cache to save CPU
        $wordstemid = $this->stemwordcache[$word];
    } else {
        // calculate stem word
        $wordstem = $porterstem->stem($word);

        // get [searchwordstem] record id from database - create if not found
        $wordstemid = $this->sqlvalue('SELECT id FROM searchwordstem WHERE (word = \'' . addslashes($wordstem) . '\')');

        if ($wordstemid == -1) {
            // insert a new [searchwordstem] record
            mysql_query('INSERT INTO searchwordstem (word) VALUES (\'' . addslashes($wordstem) . '\')',$dbcnx);
            $wordstemid = mysql_insert_id($dbcnx);
        }

        // check if the word exists in the [searchword] table - if not create it
        $wordid = $this->sqlvalue('SELECT id FROM searchword WHERE (word = \'' . addslashes($word) . '\')');

        if ($wordid == -1) {
            // insert a new [searchword] record - associated to its [searchwordstem] record
            mysql_query('INSERT INTO searchword (searchwordstemid,word) VALUES (' . $wordstemid . ',\'' . addslashes($word) . '\')',$dbcnx);
        }

        // add stem word details to cache array
        $this->stemwordcache[$word] = $wordstemid;
    }

    // now add/update the count of this $wordstemid in the $indexwordcount array for this content item
    $indexwordcount[$wordstemid] = (isset($indexwordcount[$wordstemid])) ? $indexwordcount[$wordstemid] + $wordcount : $wordcount;
}

Lines 135 - 148 - create SearchItemWordCount records for the content item being indexed, based on data in the array $indexwordcount[].

mysql_query('DELETE FROM searchitemwordcount WHERE (searchitemid = ' . $searchitemid . ')',$dbcnx);

// if [searchitemwordcount] records were just deleted, then we need to call the searchindexcleanup() function
$docleanup = (mysql_affected_rows($dbcnx) > 0) ? TRUE : FALSE;

// batch insert all [searchitemwordcount] records in one hit
if ($indexwordcount) {
    $insertdata = array();
    foreach ($indexwordcount as $stemid => $stemcount) {
        $insertdata[] = '(' . $searchitemid . ',' . $stemid . ',' . $stemcount . ')';
    }
    mysql_query('INSERT INTO searchitemwordcount (searchitemid,searchwordstemid,wordcount) ' .
                'VALUES ' . implode(',',$insertdata),$dbcnx);
}

Line 151 - if required a call is made to $this->searchindexcleanup() which deletes SearchWord & SearchWordStem records that are no longer referenced in SearchItemWordCount.

Line 156 - 172 - the removeitem() method allows content items to be removed from the search index.

function removeitem() {
    global $dbcnx;


    $searchitemid = $this->sqlvalue('SELECT id FROM searchitem ' .
                                    'WHERE (tableid = ' . $this->tableid . ') ' .
                                    'AND (recordid = ' . $this->recordid . ')');

    if ($searchitemid > 0) {
        mysql_query('DELETE FROM searchitemwordcount WHERE (searchitemid = ' . $searchitemid . ')',$dbcnx);

        mysql_query('DELETE FROM searchitem WHERE (id = ' . $searchitemid . ')',$dbcnx);

        // clean up search index table data
        $this->searchindexcleanup();
    }
}

Querying the search index

The script itemsearch.php is an example of using the generated search index data to relay search query results back to the user.

Lines 23 - 25 - get the user's search criteria from a form postback into $searchwords[] and commence the search result query process.

if ($searchtext != '') {
    // get search text stem word ids into an array
    $searchwords = str_word_count(strtolower($_POST['searchtext']),1);

Lines 33 - 45 - queries individual words in the search string against the SearchWord table, returning SearchWordStem IDs which are then stored in the $stemwordids[] array.

$stemwordids = array();
if ($sql > '') {
    $dbresult = mysql_query('SELECT DISTINCT searchwordstemid ' .
                            'FROM searchword ' .
                            'WHERE (word IN(' . $sql . '))',$dbcnx);

    while ($stemrec = mysql_fetch_assoc($dbresult)) {
        // add stem word id to $stemwordids array
        $stemwordids[] = intval($stemrec['searchwordstemid']);
    }

    mysql_free_result($dbresult);
}

Lines 52 - 63 - SQL search index query. Here we query the SearchItemWordCount table looking for the stem word ids matched above ($stemwordids[]). These rows are then INNER JOINed to SearchItem. The GROUP BY and HAVING conditions ensure that our search behaves as an AND search - only SearchItems containing all stem word ids given are returned. Ordered descending by SUM(searchitemwordcount.wordcount), search results are ranked from most relevant to least based on search word count.

$sql =    'SELECT searchitem.tableid,searchitem.recordid,searchitem.title,' .
            'UNIX_TIMESTAMP(searchitem.searchitemdate) AS searchitemdate,' .
            'COUNT(searchitemwordcount.searchitemid) AS hitcount,' .
            'SUM(searchitemwordcount.wordcount) AS rank ' .
        'FROM searchitemwordcount ' .
            'INNER JOIN searchitem ON (searchitemwordcount.searchitemid = searchitem.id) ' .
        'WHERE (searchitemwordcount.searchwordstemid IN (' . implode(',',$stemwordids) . ')) ' .
        'GROUP BY searchitem.tableid,searchitem.recordid,searchitem.title,searchitem.searchitemdate ' .
        'HAVING (hitcount = ' . sizeof($stemwordids) . ') ' .
        'ORDER BY rank DESC,searchitem.title';

$dbresult = mysql_query($sql,$dbcnx);

Lines 65 - 70 - iterate through the database results and render HTML. We simply output the item Title, SearchItemDate, TableID and RecordID. From this information we can successfully construct a url through to the target content item.

while ($resultrow = mysql_fetch_assoc($dbresult)) {
    echo(   '<div><a href="#">' . htmlspecialchars($resultrow['title']) . '</a></div>' .
            '<div>' . date('jS F Y',$resultrow['searchitemdate']) . '</div>' .
            '<div>Table ID: ' . $resultrow['tableid'] . '</div>' .
            '<div>Record ID: ' . $resultrow['recordid'] . '</div><br />');
}

Return to top

Further reading