Do you want to extend your searching options beyond the basics? Would automated "Related Items" functionality have clients beating a path to your door? Here's one way to go about it.
The Problem
Keyword search is one of the most frequently requested features of modern Web development and we all have our own preferred solutions. Approaches based on Verity, Lucene, full-text indexes, or plain old SQL are all popular, but, while they provide basic search functionality, their limitations are well documented. What if you want a bit more than just matching occurrences of a couple of keywords? What if you want to relate entire content items on your site to each other, or to items on other Web sites? Just how do you power that "Related Items..." section that your client has seen on Amazon and just has to have?
There are two fundamental approaches to tackling this:
1. Human intervention: You could build a system that allows site editors to manually link different content items together or to set up a classification scheme to link items; or you could take the Amazon approach, recording user interactions and using collaborative filtering techniques.
2. A more sophisticated approach to searching: Techniques such as latent semantic indexing (LSI) and Bayesian analysis are effective but involve costly technologies and time-consuming integration.
Let's assume we don't want to increase the editorial overhead on our client, that our site will not have the level of user interaction required for collaborative filtering, and that we certainly don't have the budget to buy Autonomy. This article will take a few introductory steps into the LSI jungle and explain the basics of building and using a keyword vector space.
The Theory
The vector space is the underpinning of LSI; it is the mathematical representation of all the content items in your repository, based upon the concept of the term space. The term space is built by taking each individual keyword in the content repository and assigning it a geometrical axis. Individual items can be plotted as vectors in this high-dimensional term space, based on the keywords they contain and their frequency.
The vectors of items that share multiple keywords will be close together in the vector space, whereas items that share few keywords will be farther apart. Once this mathematical model is set up, calculating their degree of similarity is simply a matter of finding the angle between vectors.
Confused? Let's look at a basic example. We have two articles on our Web site. The first article contains the words "web" twice and "blog" once. Of course, proper Web site content will contain much more than two keywords, but it is easier for now to visualize the idea with just two words.
Our term space will therefore consist of two orthogonal axes - one representing the keyword "web" and the other "blog". We can plot a vector for the article onto this term space - two units along the "web" axis and one along the "blog" axis (see Figure 1).
Now that the vector space has been set up, additional articles can be plotted as vectors in the same space. The second article contains the word "blog" twice and "web" once, so its vector will appear alongside the first vector (see Figure 2).
The angle between these vectors is the measure of the similarity of the two articles. The angle between two vectors A and B can be found using the cosine measure - a rearrangement of the scalar (or dot) product equation:
If two vectors are identical, the angle between them will be 0º, the cosine of which is 1. Two perpendicular vectors (no shared keywords) will have an angle of 90º and a cosine of 0. So the above result multiplied by 100 will return the match measurement; the two articles show 80% similarity based on the vector space analysis.
That's the theory behind the vector space approach. The following sections will walk through a basic example of its implementation.
The Solution
The example consists of two templates: default.cfm (see Listing 1) sets up some sample content and calls methods in vector.cfc (see Listing 2) which build the term space, build the item vectors, and calculate the cosine measures. The template displays one full content item and lists its matches by relevance.
Step 1: Set Up Sample Data
We start by setting up some sample content items to fire at the vector space engine. The sample content in Listing 1 is stored as an array of structures and consists of five short items about two distinct subjects, pasted from Google News (UK).
Step 2: Set Up the Term Space
The term space is effectively a list of the unique keywords in content. Each keyword will be represented by an axis in the term space. The content array (arItems) is sent to the prepareTermSpace public method in vector.cfc (see Listing 2). This method loops through the array and concatenates all the titles and bodies of the items into one long string. This aggregate string is then sent to the getUniqueKeywords private method (see Listing 2). This method's job is to return an array containing all keywords, but it goes through a number of rationalizing steps before producing the final list.
1. First, all HTML and punctuation is removed by calling another private method, removePunctuation (see Listing 2), which runs through a series of regular expressions to strip the string down to raw text.
2. The next step is to get rid of all the "stop" words in the string. Stop words are common articles, pronouns, and so forth that do not convey meaning on their own and are of little use in the search engine context. The removeStopWords private method (see Listing 2) loops through an example stop-word list and removes any matches. The compilation of stop-word lists is an art; depending on your content, there may be additional words you can add to the list. For instance, if your site is about ColdFusion, you can probably add the words "ColdFusion,","cfm," and "Macromedia" to your stop list. The more omnipresent keywords filtered out at this stage, the less processing needed to calculate the item vectors.
3. The next step we should be thinking about is called stemming. The list of keywords will contain the same words in different forms: singular and plural, different verb tenses, and so forth. Stemming is a technique that removes suffixes from words, leaving a common root. A popular algorithm used for this purpose is the Porter Stemming Algorithm, which is explained in great detail at www.tartarus.org/ ~martin/PorterStemmer. For brevity, we will turn a blind eye to stemming in this example, but we are missing out on a further reduction in the term space dimensions and the associated performance and usability benefits.
4. The list of keywords is sorted alphabetically (after converting the list to an array, for speed) and duplicates are removed. The getUniqueKeywords method can return either a basic keyword list or a list with a record of keyword frequency. The term space does not need to record frequency, so we need only the straight list for now. We will use the frequency option later, when we call the same method while calculating item vectors.
The final result is an array (arTermSpace) listing the unique keywords contained in the sample content. Each keyword represents an axis in the term space.
Step 3: Build the Item Vectors
With the term space set up, the next step is to calculate the vectors for each item. We do this in default.cfm (see Listing 1) by calling the buildItemVectors public method in vector.cfc:
arItemVectors = objVector.buildItemVectors(arItems=arItems,
arTermSpace=arTermSpace, intTitleWeightFactor=3);
This method loops through the content array and performs the following steps for each item:
1. A concatenated string containing the title and body is sent to the getUniqueKeywords method, which this time returns a list of the keyword frequencies. A title weighting factor (in this case, intTitleWeightFactor=3) is introduced by repeating the title string multiple times in the concatenated string.
2. Armed with the list of keyword frequencies (arItemKeywords), the item vector is built by looping through each keyword in the term space and recording the number of times that the keyword appears in the item. Term space keywords that do not appear in the item have a zero value recorded in their vector. Each item vector is stored in arItemVectors[x].arVector, with the corresponding item ID in arItemVectors[x].intItemID.
It is worth noting at this point that this will probably not be the optimum way to store the term space and vectors. If your application stores content in a database, you can build/increment the term space and item vectors when a new item is added and store them in the database.
Step 4: Return List of Item-Relevance Matches
With the term space and item vectors calculated, the preparation work is over with. We can now get on with calculating some cosine measures and returning relevance rankings.
In default.cfm (see Listing 1), we build an arguments collection to send to the getItemMatches public method (see Listing 2). The arguments collection consists of the vector of the current item being viewed (arCurrentItemVector), the vectors of the other items in the content collection (arItemVectors), the maximum number of matches to be returned (iMaxRows), and a lower-bound value for relevance (iThreshold). The example is set to return all example items - you will need to tweak the arguments based on your own content. A large value for iThreshold will return fewer, but more relevant, matches.
The getItemMatches method (see Listing 2) loops through the other item vectors and calculates the cosine measure of each one compared with the current item by calling the calculateCosineMeasure private method. If a cosine measure value is greater than or equal to the lower-bound value, it is added to a result array (arItemMatches). Once the full array is built, it is sorted using the arrayOfStructsSort UDF by Nathan Dintenfass, available from cflib.org at www.cflib.org/udf.cfm?ID=359. The array is then truncated to contain iMaxRows items.
Step 5: Displaying the Results
The template outputs the title and body of the current item and a list of item matches (by looping through arItemMatches). The match titles are looked up from the main arItems content array, the cosine measure is recorded, and links allow the user to reload the default.cfm page to focus on a different item.
Conclusion and Further Development
So there we have it - a vector space engine that provides "related items" functionality without the need for additional editorial overhead or collaborative filtering based on complicated user interaction.
These techniques are only the first steps into latent semantic indexing. Further processing would apply some factors to the item vectors to reduce the effect of terms that appear in a large number of items and normalize item lengths. The "semantic" part of the technique comes when the entire model is put through a singular value decomposition algorithm (see mathworld.wolfram.com/SingularValueDecomposition.html). This reduces the number of axes in the term space and compresses adjacent vectors together. In effect, keywords are superimposed, meaning that semantically similar words such as "car" and "automobile" become mathematically identical, making search results much more useful. A good introduction to the world of LSI can be found at: javelina.cet.middlebury.edu/lsa/out/cover_page.htm.