Login


Easy Full-Text Search Queries

By Jonathan Wood on 6/22/2011 (Updated on 3/25/2012)
Language: C#
Technology: .NETSQL
Platform: Windows
License: CPOL
Views: 27,990
General Programming » SQL & Other Databases » Full-Text Search » Easy Full-Text Search Queries

Test Project Screenshot

Download Source Code Download Source Code

Introduction

SQL Server's Full-Text Search offers powerful and efficient ways to search your data. However, the query syntax can be rather cryptic.

It's bad enough for SQL programmers, but what if you want to allow your users to enter queries? Unless all your users are database experts, they won't know or understand syntax like FORMSOF(INFLECTIONAL, dogs) AND FORMSOF(THESAURUS, cats).

Some time ago, I ran into the excellent article A Google-like Full Text Search by Michael Coles. This article presented code that converted a Google-like query syntax into Full-Text Search syntax.

I really liked the idea. However, I decided against using the code presented in the article for two reasons. First, it relied on the Irony language implementation kit. While I'm sure this is a very fine bit of code, it's a large project capable of implementing custom languages and just seemed like overkill for my needs. And second, the code presented was kind of picky about the syntax and would halt with a syntax error if it encountered something it didn't like. When processing a user's search query, I would prefer to interpret it as best I could without throwing back cryptic error messages. Afterall, when was the last time you got a query error while searching Google?

So I decided to implement my own code that was more forgiving and didn't rely on large third-party libraries.

A Simple Search Syntax

My code takes a simple search query and translates it into a Full-Text Search query for SQL Server. The syntax for the input query is similar to that supported by search engines like Google. I also added a few enhancements, some lifted from Coles' article.

If the input query contains syntax errors, such as missing the closing quote, my code will simply translate the query as best it can and produce a valid output.

Table 1 shows a number of examples of the translations my code will perform. Note that parentheses can be used to control operator precedence, as demonstrated in the last example.

Table 1: Sample Conversions from Simple Search Syntax to Full-Text Search Syntax

Input Output
abc FORMSOF(INFLECTIONAL, abc)
~abc FORMSOF(THESAURUS, abc)
+abc "abc"
"abc" "abc"
-abc NOT FORMSOF(INFLECTIONAL, abc)
abc def (FORMSOF(INFLECTIONAL, abc) AND FORMSOF(INFLECTIONAL, def))
abc or def (FORMSOF(INFLECTIONAL, abc) OR FORMSOF(INFLECTIONAL, def))
<abc def> (FORMSOF(INFLECTIONAL, abc) NEAR FORMSOF(INFLECTIONAL, def))
abc and (def or ghi) (FORMSOF(INFLECTIONAL, abc) AND (FORMSOF(INFLECTIONAL, def) OR FORMSOF(INFLECTIONAL, ghi)))

The EasyFts Class

Listing 1 shows my EasyFts class. Note that it makes use of my handy TextParser class to simplify parsing the input text.

The ToFtsQuery method takes a simple search query and converts in into a Full-Text search query.

The code builds a very simple expression tree consisting of terminal nodes (search terms) and internal nodes (conjunctions). Whenever a search term is added as a terminal node, an internal node is added to control how the search term is joined to the rest of the query. This makes it easy to implement operator precedence with parentheses. It also makes it easy to form the final result once the expression tree has been built.

Listing 1: The EasyFts Class

public class EasyFts
{
    /// <summary>
    /// Query term forms.
    /// </summary>
    protected enum TermForms
    {
        Inflectional,
        Thesaurus,
        Literal,
    }

    /// <summary>
    /// Term conjunction types.
    /// </summary>
    protected enum ConjunctionTypes
    {
        And,
        Or,
        Near,
    }

    /// <summary>
    /// Common interface for expression nodes
    /// </summary>
    protected interface INode
    {
        /// <summary>
        /// Indicates this term (or both child terms) should be excluded from
        /// the results
        /// </summary>
        bool Exclude { get; set; }

        /// <summary>
        /// Indicates this term is enclosed in parentheses
        /// </summary>
        bool Grouped { get; set; }
    }

    /// <summary>
    /// Terminal (leaf) expression node class.
    /// </summary>
    private class TerminalNode : INode
    {
        // Interface members
        public bool Exclude { get; set; }
        public bool Grouped { get; set; }

        // Class members
        public string Term { get; set; }
        public TermForms TermForm { get; set; }

        // Convert node to string
        public override string ToString()
        {
            string fmt = String.Empty;
            if (TermForm == TermForms.Inflectional)
                fmt = "{0}FORMSOF(INFLECTIONAL, {1})";
            else if (TermForm == TermForms.Thesaurus)
                fmt = "{0}FORMSOF(THESAURUS, {1})";
            else if (TermForm == TermForms.Literal)
                fmt = "{0}\"{1}\"";
            return String.Format(fmt,
                Exclude ? "NOT " : String.Empty,
                Term);
        }
    }

    /// <summary>
    /// Internal (non-leaf) expression node class
    /// </summary>
    private class InternalNode : INode
    {
        // Interface members
        public bool Exclude { get; set; }
        public bool Grouped { get; set; }

        // Class members
        public INode Child1 { get; set; }
        public INode Child2 { get; set; }
        public ConjunctionTypes Conjunction { get; set; }

        // Convert node to string
        public override string ToString()
        {
            return String.Format(Grouped ? "({0} {1} {2})" : "{0} {1} {2}",
                Child1.ToString(),
                Conjunction.ToString().ToUpper(),
                Child2.ToString());
        }
    }

    // Characters not allowed in unquoted search terms
    protected const string Punctuation = "~\"`!@#$%^&*()-+=[]{}\\|;:,.<>?/";

    /// <summary>
    /// Collection of stop words. These words will not
    /// be included in the resulting query unless quoted.
    /// </summary>
    public HashSet<string> StopWords { get; set; }

    // Class constructor
    public EasyFts()
    {
        StopWords = new HashSet<string>(StringComparer.CurrentCultureIgnoreCase);
    }

    /// <summary>
    /// Converts an "easy" search term to a full-text search term.
    /// </summary>
    /// <param name="query">Search term to convert</param>
    /// <returns>A valid full-text search query</returns>
    public string ToFtsQuery(string query)
    {
        INode node = FixUpExpressionTree(ParseNode(query, ConjunctionTypes.And), true);
        return (node != null) ? node.ToString() : String.Empty;
    }

    /// <summary>
    /// Parses a query segment and converts it to an expression
    /// tree.
    /// </summary>
    /// <param name="query">Query segment to convert</param>
    /// <param name="defaultConjunction">Implicit conjunction type</param>
    /// <returns>Root node of expression tree</returns>
    private INode ParseNode(string query, ConjunctionTypes defaultConjunction)
    {
        TermForms termForm = TermForms.Inflectional;
        bool termExclude = false;
        ConjunctionTypes conjunction = defaultConjunction;
        bool resetState = true;
        INode root = null;
        INode node;
        string term;

        TextParser parser = new TextParser(query);
        while (!parser.EndOfText)
        {
            if (resetState)
            {
                // Reset modifiers
                termForm = TermForms.Inflectional;
                termExclude = false;
                conjunction = defaultConjunction;
                resetState = false;
            }

            parser.MovePastWhitespace();
            if (!parser.EndOfText &&
                !Punctuation.Contains(parser.Peek()))
            {
                // Extract query term
                int start = parser.Position;
                parser.MoveAhead();
                while (!parser.EndOfText &&
                    !Punctuation.Contains(parser.Peek()) &&
                    !Char.IsWhiteSpace(parser.Peek()))
                    parser.MoveAhead();

                // Allow trailing wildcard
                if (parser.Peek() == '*')
                {
                    parser.MoveAhead();
                    termForm = TermForms.Literal;
                }

                // Interpret token
                term = parser.Extract(start, parser.Position);
                if (String.Compare(term, "AND", true) == 0)
                    conjunction = ConjunctionTypes.And;
                else if (String.Compare(term, "OR", true) == 0)
                    conjunction = ConjunctionTypes.Or;
                else if (String.Compare(term, "NEAR", true) == 0)
                    conjunction = ConjunctionTypes.Near;
                else if (String.Compare(term, "NOT", true) == 0)
                    termExclude = true;
                else
                {
                    root = AddNode(root, term, termForm, termExclude, conjunction);
                    resetState = true;
                }
                continue;
            }
            else if (parser.Peek() == '"')
            {
                // Match next term exactly
                termForm = TermForms.Literal;
                // Extract quoted term
                term = ExtractQuote(parser);
                root = AddNode(root, term.Trim(), termForm, termExclude, conjunction);
                resetState = true;
            }
            else if (parser.Peek() == '(')
            {
                // Parse parentheses block
                term = ExtractBlock(parser, '(', ')');
                node = ParseNode(term, defaultConjunction);
                root = AddNode(root, node, conjunction, true);
                resetState = true;
            }
            else if (parser.Peek() == '<')
            {
                // Parse angle brackets block
                term = ExtractBlock(parser, '<', '>');
                node = ParseNode(term, ConjunctionTypes.Near);
                root = AddNode(root, node, conjunction);
                resetState = true;
            }
            else if (parser.Peek() == '-')
            {
                // Match when next term is not present
                termExclude = true;
            }
            else if (parser.Peek() == '+')
            {
                // Match next term exactly
                termForm = TermForms.Literal;
            }
            else if (parser.Peek() == '~')
            {
                // Match synonyms of next term
                termForm = TermForms.Thesaurus;
            }
            // Advance to next character
            parser.MoveAhead();
        }
        return root;
    }

    /// <summary>
    /// Fixes any portions of the expression tree that would produce an invalid SQL Server full-text
    /// query.
    /// </summary>
    /// <remarks>
    /// While our expression tree may be properly constructed, it may represent a query that
    /// is not supported by SQL Server. This method traverses the expression tree and corrects
    /// problem expressions as described below.
    /// 
    ///     NOT term1 AND term2         Subexpressions swapped.
    ///     NOT term1                   Expression discarded.
    ///     NOT term1 AND NOT term2     Expression discarded if node is grouped (parenthesized)
    ///                                 or is the root node; otherwise, the parent node may
    ///                                 contain another subexpression that will make this one
    ///                                 valid.
    ///     term1 OR NOT term2          Expression discarded.
    ///     term1 NEAR NOT term2        NEAR conjunction changed to AND.*
    ///
    /// * This method converts all NEAR conjunctions to AND when either subexpression is not
    /// an InternalNode with the form TermForms.Literal.
    /// </remarks>
    /// <param name="node">Node to fix up</param>
    /// <param name="isRoot">True if node is the tree's root node</param>
    protected INode FixUpExpressionTree(INode node, bool isRoot = false)
    {
        // Test for empty expression tree
        if (node == null) return null;

        // Special handling for internal nodes
        if (node is InternalNode)
        {
            // Fix up child nodes
            var internalNode = node as InternalNode;
            internalNode.Child1 = FixUpExpressionTree(internalNode.Child1);
            internalNode.Child2 = FixUpExpressionTree(internalNode.Child2);

            // Correct subexpressions incompatible with conjunction type
            if (internalNode.Conjunction == ConjunctionTypes.Near)
            {
                // If either subexpression is incompatible with NEAR conjunction then change to AND
                if (IsInvalidWithNear(internalNode.Child1) || IsInvalidWithNear(internalNode.Child2))
                    internalNode.Conjunction = ConjunctionTypes.And;
            }
            else if (internalNode.Conjunction == ConjunctionTypes.Or)
            {
                // Eliminate subexpressions not valid with OR conjunction
                if (IsInvalidWithOr(internalNode.Child1))
                    internalNode.Child1 = null;
                if (IsInvalidWithOr(internalNode.Child2))
                    internalNode.Child1 = null;
            }

            // Handle eliminated child expressions
            if (internalNode.Child1 == null && internalNode.Child2 == null)
            {
                // Eliminate parent node if both child nodes were eliminated
                return null;
            }
            else if (internalNode.Child1 == null)
            {
                // Child1 eliminated so return only Child2
                node = internalNode.Child2;
            }
            else if (internalNode.Child2 == null)
            {
                // Child2 eliminated so return only Child1
                node = internalNode.Child1;
            }
            else
            {
                // Determine if entire expression is an exclude expression
                internalNode.Exclude = (internalNode.Child1.Exclude && internalNode.Child2.Exclude);
                // If only first child expression is an exclude expression,
                // then simply swap child expressions
                if (!internalNode.Exclude && internalNode.Child1.Exclude)
                {
                    var temp = internalNode.Child1;
                    internalNode.Child1 = internalNode.Child2;
                    internalNode.Child2 = temp;
                }
            }
        }
        // Eliminate expression group if it contains only exclude expressions
        return ((node.Grouped || isRoot) && node.Exclude) ? null : node;
    }

    /// <summary>
    /// Determines if the specified node is invalid on either side of a NEAR conjuction.
    /// </summary>
    /// <param name="node">Node to test</param>
    protected bool IsInvalidWithNear(INode node)
    {
        // NEAR is only valid with TerminalNodes with form TermForms.Literal
        return !(node is TerminalNode) || ((TerminalNode)node).TermForm != TermForms.Literal;
    }

    /// <summary>
    /// Determines if the specified node is invalid on either side of an OR conjunction.
    /// </summary>
    /// <param name="node">Node to test</param>
    protected bool IsInvalidWithOr(INode node)
    {
        // OR is only valid with non-null, non-excluded (NOT) subexpressions
        return node == null || node.Exclude == true;
    }

    /// <summary>
    /// Creates an expression node and adds it to the
    /// give tree.
    /// </summary>
    /// <param name="root">Root node of expression tree</param>
    /// <param name="term">Term for this node</param>
    /// <param name="termForm">Indicates form of this term</param>
    /// <param name="termExclude">Indicates if this is an excluded term</param>
    /// <param name="conjunction">Conjunction used to join with other nodes</param>
    /// <returns>The new root node</returns>
    protected INode AddNode(INode root, string term, TermForms termForm, bool termExclude, ConjunctionTypes conjunction)
    {
        if (term.Length > 0 && !IsStopWord(term))
        {
            INode node = new TerminalNode
            {
                Term = term,
                TermForm = termForm,
                Exclude = termExclude
            };
            root = AddNode(root, node, conjunction);
        }
        return root;
    }

    /// <summary>
    /// Adds an expression node to the given tree.
    /// </summary>
    /// <param name="root">Root node of expression tree</param>
    /// <param name="node">Node to add</param>
    /// <param name="conjunction">Conjunction used to join with other nodes</param>
    /// <returns>The new root node</returns>
    protected INode AddNode(INode root, INode node, ConjunctionTypes conjunction, bool group = false)
    {
        if (node != null)
        {
            node.Grouped = group;
            if (root != null)
                root = new InternalNode
                {
                    Child1 = root,
                    Child2 = node,
                    Conjunction = conjunction
                };
            else
                root = node;
        }
        return root;
    }

    /// <summary>
    /// Extracts a block of text delimited by the specified open and close
    /// characters. It is assumed the parser is positioned at an
    /// occurrence of the open character. The open and closing characters
    /// are not included in the returned string. On return, the parser is
    /// positioned at the closing character or at the end of the text if
    /// the closing character was not found.
    /// </summary>
    /// <param name="parser">TextParser object</param>
    /// <param name="openChar">Start-of-block delimiter</param>
    /// <param name="closeChar">End-of-block delimiter</param>
    /// <returns>The extracted text</returns>
    private string ExtractBlock(TextParser parser, char openChar, char closeChar)
    {
        // Track delimiter depth
        int depth = 1;

        // Extract characters between delimiters
        parser.MoveAhead();
        int start = parser.Position;
        while (!parser.EndOfText)
        {
            if (parser.Peek() == openChar)
            {
                // Increase block depth
                depth++;
            }
            else if (parser.Peek() == closeChar)
            {
                // Decrease block depth
                depth--;
                // Test for end of block
                if (depth == 0)
                    break;
            }
            else if (parser.Peek() == '"')
            {
                // Don't count delimiters within quoted text
                ExtractQuote(parser);
            }
            // Move to next character
            parser.MoveAhead();
        }
        return parser.Extract(start, parser.Position);
    }

    /// <summary>
    /// Extracts a block of text delimited by double quotes. It is
    /// assumed the parser is positioned at the first quote. The
    /// quotes are not included in the returned string. On return,
    /// the parser is positioned at the closing quote or at the end of
    /// the text if the closing quote was not found.
    /// </summary>
    /// <param name="parser">TextParser object</param>
    /// <returns>The extracted text</returns>
    private string ExtractQuote(TextParser parser)
    {
        // Extract contents of quote
        parser.MoveAhead();
        int start = parser.Position;
        while (!parser.EndOfText && parser.Peek() != '"')
            parser.MoveAhead();
        return parser.Extract(start, parser.Position);
    }

    /// <summary>
    /// Determines if the given word has been identified as
    /// a stop word.
    /// </summary>
    /// <param name="word">Word to check</param>
    protected bool IsStopWord(string word)
    {
        return StopWords.Contains(word);
    }
}

Preventing Unsupported Queries

In my original code, I found that it was possible to generate queries that are unsupported by SQL Server. For example, my code will happily parse the query "not abc" but SQL Server will throw an exception because NOT used before the only search term leaves nothing to actually search for. Also, SQL Server does not support expressions like "abc near not def". It's not possible to be near something that doesn't exist.

So I updated the code by adding the FixUpExpressionTree() method. After the expression tree is created, the code calls the new method to remove any elements that are invalid to SQL Server. See the FixUpExpressionTree() summary section in the code for additional details about what this method does.

Problem with Stop Words

One thing that has been a thorn in my side is SQL Server 2008's handling of stop words. Stop words are words such as a, and, and the that are not included in the Full-Text index. Having stop words makes sense because these words are very common and don't really add to the context of the search. Since these words are not in the index, SQL Server cannot include these words in the search.

The problem is that SQL Server considers a stop word to have no results. So, for example, if you searched for black and white, SQL Server would never return a single result even if every row in the database contained all three words! This is because and is a stop word a stop words have no matches.

Since this behavior is completely counter-intuitive, my EasyFts class allows you to specify a list of stop words via the StopWords property. Unlike SQL Server, my code will simply not include these words in the search and so their presence will not block all results.

I know SQL Server has a number of options for modifying or turning off stop words, but doesn't it make sense for SQL Server to allow an option such as the one my class provides that will prevent these words from blocking all results? I've posted a request for this feature on Microsoft Connect. If you'd like to see this feature added, please vote on my request.

Update

Since writing this article, it has been brought to my attention that SQL Server provides an option for preventing the issue described above.

The transform noise words option can be used to enable SQL Server to return matches even when the query contains a stop word (noise word). Set this option to 1 to enable noise word transformation.

sp_configure 'show advanced options', 1
RECONFIGURE
GO
sp_configure 'transform noise words', 1
RECONFIGURE
GO

If you are able to configure your SQL Server, you might want to consider removing those parts of my code that deal with stop words and use the above setting instead. You can refer to transform noise words Server Configuration Option for additional information. Thanks to Don Draper for pointing this out.

Using the Code

To use the EasyFts class, simply declare an instance of the class and call the ToFtsQuery() method. If you'd like the class to avoid passing along certain stop words to SQL Server, simply add those stop words to the StopWords collection property.

Note that the text returned is just a portion of the actual SQL Server query. You should incorporate the text returned, preferrably as a SQL Parameter, into your full query, which might look something like this:

SELECT * FROM Articles
WHERE CONTAINS(Articles.*, @FtsQuery)

The attached download includes the full source code for this article and demonstrates using the EasyFts class.

Conclusion

In my view, Full-Text Search is a great addition to SQL Server. It provides an easy and efficient way to search for words in your database. But some of the advantages are lost if the required syntax is too cryptic for your users.

Being able to convert from a more friendly syntax that your users will understand is necessary to gain full advantage offered by Full-Text Search.

Update History

3/25/2012: Added the FixUpExpressionTree() method to eliminate expressions unsupported by SQL Server; Added comments about SQL Server's transform noise words option; Smaller changes made to the code.

End-User License

Use of this article and any related source code or other files is governed by the terms and conditions of The Code Project Open License.

Author Information

Jonathan Wood

I'm a software/website developer working out of the greater Salt Lake City area in Utah. I've developed many websites including Black Belt Coder, Insider Articles, and others.