Parsing Text with Extension Methods

By Jonathan Wood on 3/7/2011
Language: C#
Technology: .NET
Platform: Windows
License: CPOL

Introduction to Extension Methods

Extension methods were introduced in C# 3.0. They let you add methods to an existing type without modifying its source code or deriving from it.

Alternatively, you could derive from an existing class and add new methods to the derived class. However, extension methods are more flexible. A derived class only helps code that uses the derived type, so existing code that uses the base class would have to be changed to use your class instead. Some classes, including string, are sealed and cannot be inherited from at all. With an extension method, the method becomes callable on the existing type and no changes are needed to existing code.

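Note that LINQ uses extension methods to add query methods such as Where() and Select() to the IEnumerable<T> interface (and a few, such as Cast<T>() and OfType<T>(), to the non-generic IEnumerable interface). Because the methods target the interface, every collection that implements it gains them without any change to the collection classes themselves.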
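As a small illustration of the LINQ point, the sketch below runs the same query twice: once with extension-method syntax and once as ordinary static calls on System.Linq.Enumerable. The class and method names (LinqExtensionDemo, CountEvensExtensionSyntax, CountEvensStaticSyntax) are made up for this example.

```csharp
using System;
using System.Linq;

static class LinqExtensionDemo
{
    // Counts the even values using extension-method syntax.
    public static int CountEvensExtensionSyntax(int[] values)
    {
        return values.Where(n => n % 2 == 0).Count();
    }

    // The same query written as plain static calls on Enumerable.
    // This is the form the extension syntax is shorthand for.
    public static int CountEvensStaticSyntax(int[] values)
    {
        return Enumerable.Count(Enumerable.Where(values, n => n % 2 == 0));
    }
}
```

Both methods should return the same count for any input, since the extension syntax is only a more readable way to write the static call.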

So let's cut to the chase with a simple example. Listing 1 shows a sample extension method for the String class.

Listing 1: An Extension Method for the String Class

static class StringExtension
{
    public static bool IsNumeric(this string s)
    {
        foreach (char c in s)
        {
            if (!Char.IsDigit(c))
                return false;
        }
        return true;
    }
}

This code makes IsNumeric() callable on any string. The first thing to note is that both the class and the method are static, which is a requirement for extension methods. The next thing to note is that the first parameter of IsNumeric() is marked with the this keyword. That modifier is what makes the method an extension method, and the parameter's type (string) determines which type it extends.

You can now call this method on any string instance as though IsNumeric() were declared by the string class, provided the namespace containing StringExtension is in scope. It even shows up in IntelliSense. In general, an extension method is also available on types derived from the extended type, although that does not arise here because string is sealed.

Listing 2: Calling an Extension Method

string s = "12345";
bool b = s.IsNumeric();

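In Listing 2, IsNumeric() returns true because all characters in the string are digits. Note that it also returns true for an empty string, because the loop never encounters a non-digit character.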
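An extension call is compiled as a call to the underlying static method, with the receiver passed as the first argument. One consequence is that the receiver may be null without the call itself failing, so the method has to decide how to handle that. The sketch below shows one possible variant. IsNumericSafe is a hypothetical name, and treating null and empty strings as "not numeric" is an assumption made for illustration, not the behavior of Listing 1.

```csharp
using System;

static class StringExtensionSketch
{
    // Hypothetical variant of IsNumeric() that tolerates null and empty input.
    public static bool IsNumericSafe(this string s)
    {
        // Assumption: null and empty strings count as "not numeric".
        if (string.IsNullOrEmpty(s))
            return false;

        foreach (char c in s)
        {
            if (!char.IsDigit(c))
                return false;
        }
        return true;
    }
}

static class StringExtensionSketchDemo
{
    public static void Run()
    {
        string s = "12345";
        string none = null;

        // Extension syntax
        Console.WriteLine(s.IsNumericSafe());

        // Equivalent static call
        Console.WriteLine(StringExtensionSketch.IsNumericSafe(s));

        // No exception here: the null receiver is just an argument
        Console.WriteLine(none.IsNumericSafe());
    }
}
```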

This can be a handy feature, especially if you have code that already uses a class you didn't write and you'd like a nice convenient way to call a method that operates on that class.

Parsing Text

Not long ago, I wrote an article that presented a text parsing helper class. So I thought it might be interesting to reimplement that same code as extension methods. Listing 3 shows my TextParserExtensions class.

The advantage to using extension methods will be that the parsing methods will now appear as methods of the string class, which makes them easy to find.

However, I should point out a potential disadvantage. Extension methods are static and cannot add fields to the type they extend, so they have no place to store state. My original class kept track of things like the current position within the string being parsed; extension methods can't do that. Instead, the caller must track the position, pass it to each method as an argument, and store the updated position that the method returns.
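To make the trade-off concrete, here is a minimal sketch contrasting the two shapes. TextCursor and PeekAt are hypothetical names invented for this illustration; TextCursor is not the helper class from the earlier article, only a stand-in for the general idea of a parser object that owns its position.

```csharp
using System;

// Stateful shape: the object remembers the position between calls.
class TextCursor
{
    private readonly string _text;

    public int Position { get; private set; }

    public TextCursor(string text)
    {
        _text = text;
    }

    // Returns the character at the current position, or '\0' at the end.
    public char Peek()
    {
        return Position < _text.Length ? _text[Position] : '\0';
    }

    // Advances one character, stopping at the end of the text.
    public void MoveAhead()
    {
        if (Position < _text.Length)
            Position++;
    }
}

// Stateless shape: an extension method cannot add a field to string,
// so the caller supplies the position on every call.
static class StatelessSketch
{
    public static char PeekAt(this string text, int pos)
    {
        return pos < text.Length ? text[pos] : '\0';
    }
}
```

With the first shape, a caller writes cursor.MoveAhead() followed by cursor.Peek(). With the second, the caller keeps its own int variable and writes text.PeekAt(pos).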

Listing 3: The TextParserExtensions Class

using System;
using System.Linq;   // Needed for chars.Contains() in ParseSkip()

/// <summary>
/// String class method extensions for parsing text.
/// </summary>
static class TextParserExtensions
{
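    // Returned by ParsePeek() when the position is past the end of the text.
    // Declared const so callers cannot reassign it.
    public const char NullChar = (char)0;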

    /// <summary>
    /// Indicates if the current position is at the end of the text.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <returns>True if the current position is at the end of the
    /// text, false otherwise</returns>
    public static bool ParseEndOfText(this string text, int pos)
    {
        return pos >= text.Length;
    }

    /// <summary>
    /// Returns the number of characters remaining after the current
    /// text position.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <returns>The number of characters remaining</returns>
    public static int ParseRemaining(this string text, int pos)
    {
        return text.Length - pos;
    }

    /// <summary>
    /// Parses a token from this string.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <param name="chars">Characters that delimit tokens</param>
    /// <param name="result">Returns the parsed token, if any</param>
    /// <returns>The new text position</returns>
    public static int ParseToken(this string text, int pos, char[] chars, out string result)
    {
        // Find token start
        pos = ParseSkip(text, pos, chars);
        int start = pos;

        // Find token end
        pos = ParseMoveTo(text, pos, chars);

        // Extract token
        result = ParseExtract(text, start, pos);

        // Return new position
        return pos;
    }

    /// <summary>
    /// Parses a line of text from this string, and sets the current text
    /// position to the start of the next non-empty line.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <param name="result">The extracted line</param>
    /// <returns>New text position</returns>
    public static int ParseLine(this string text, int pos, out string result)
    {
        // Get start position
        int start = pos;

        // Get end position
        pos = ParseMoveToEndOfLine(text, pos);

        // Extract line
        result = ParseExtract(text, start, pos);

        // Skip past any newline characters
        return ParseSkip(text, pos, Environment.NewLine.ToCharArray());
    }

    /// <summary>
    /// Extracts a substring from the specified range of the current text
    /// </summary>
    /// <param name="start">Starting text position</param>
    /// <param name="end">Ending text position</param>
    /// <returns>The specified substring</returns>
    public static string ParseExtract(this string text, int start, int end)
    {
        return text.Substring(start, end - start);
    }

    /// <summary>
    /// Returns the character at the current text position.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <returns>The specified character</returns>
    public static char ParsePeek(this string text, int pos)
    {
        return ParsePeek(text, pos, 0);
    }

    /// <summary>
    /// Returns the character at the specified text position.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <param name="ahead">Number of characters ahead of the current
    /// text position of the character to return</param>
    /// <returns>The specified character</returns>
    public static char ParsePeek(this string text, int pos, int ahead)
    {
        int i = (pos + ahead);
        if (i < text.Length)
            return text[i];
        return NullChar;
    }

    /// <summary>
    /// Moves the current position ahead one character.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <returns>The new text position</returns>
    public static int ParseMoveAhead(this string text, int pos)
    {
        return ParseMoveAhead(text, pos, 1);
    }

    /// <summary>
    /// Moves the current position ahead the specified number of characters.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <param name="ahead">The number of characters to move ahead</param>
    /// <returns>The new text position</returns>
    public static int ParseMoveAhead(this string text, int pos, int ahead)
    {
        return Math.Min(pos + ahead, text.Length);
    }

    /// <summary>
    /// Moves to the next occurrence of the specified string.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <param name="s">String to find</param>
    /// <param name="ignoreCase">Indicates if case-insensitive comparisons
    /// are used</param>
    /// <returns>The new text position</returns>
    public static int ParseMoveTo(this string text, int pos, string s, bool ignoreCase)
    {
        pos = text.IndexOf(s, pos, ignoreCase ?
            StringComparison.OrdinalIgnoreCase : StringComparison.Ordinal);
        if (pos < 0)
            pos = text.Length;
        return pos;
    }

    /// <summary>
    /// Moves to the next occurrence of the specified character.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <param name="c">Character to find</param>
    /// <returns>The new text position</returns>
    public static int ParseMoveTo(this string text, int pos, char c)
    {
        pos = text.IndexOf(c, pos);
        if (pos < 0)
            pos = text.Length;
        return pos;
    }

    /// <summary>
    /// Moves to the next occurrence of any one of the specified
    /// characters.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <param name="chars">Array of characters to find</param>
    /// <returns>The new text position</returns>
    public static int ParseMoveTo(this string text, int pos, char[] chars)
    {
        pos = text.IndexOfAny(chars, pos);
        if (pos < 0)
            pos = text.Length;
        return pos;
    }

    /// <summary>
    /// Moves the current position to the first character that is part of a newline.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <returns>The new text position</returns>
    public static int ParseMoveToEndOfLine(this string text, int pos)
    {
        char c = ParsePeek(text, pos);
        while (Environment.NewLine.IndexOf(c) < 0 && !ParseEndOfText(text, pos))
        {
            pos = ParseMoveAhead(text, pos);
            c = ParsePeek(text, pos);
        }
        return pos;
    }

    /// <summary>
    /// Moves to the next occurrence of any character that is not equal
    /// to the specified character.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <param name="ch">Character to skip</param>
    /// <returns>The new text position</returns>
    public static int ParseSkip(this string text, int pos, char ch)
    {
        while (ParsePeek(text, pos) == ch)
            pos = ParseMoveAhead(text, pos);
        return pos;
    }

    /// <summary>
    /// Moves to the next occurrence of any character that is not one
    /// of the specified characters.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <param name="chars">Array of characters to skip</param>
    /// <returns>The new text position</returns>
    public static int ParseSkip(this string text, int pos, char[] chars)
    {
        while (chars.Contains(ParsePeek(text, pos)))
            pos = ParseMoveAhead(text, pos);
        return pos;
    }

    /// <summary>
    /// Moves the current position to the next character that is not whitespace.
    /// </summary>
    /// <param name="pos">Current text position</param>
    /// <returns>The new text position</returns>
    public static int ParseMovePastWhitespace(this string text, int pos)
    {
        while (Char.IsWhiteSpace(ParsePeek(text, pos)))
            pos = ParseMoveAhead(text, pos);
        return pos;
    }
}

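To test the code in Listing 3, Listing 4 declares a string and then calls ParseToken() in a loop to extract each word. The loop ends when ParseToken() returns an empty token, which happens once only delimiters (or nothing) remain.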

Listing 4: Code to Test the ParseToken() Extension Method

string s = " This is\ta test string! ";
char[] whitespace = { ' ', '\t', '\r', '\n' };
string token;
int pos;

pos = s.ParseToken(0, whitespace, out token);
while (token.Length > 0)
{
    Console.WriteLine(token);
    pos = s.ParseToken(pos, whitespace, out token);
}
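One possible way to hide the position bookkeeping from callers is to wrap the loop in an iterator method. The sketch below is an assumption about how that might look and is not part of the original code: EnumerateTokens is a hypothetical name, and it is written directly against String.IndexOfAny and Array.IndexOf so that it does not depend on Listing 3.

```csharp
using System;
using System.Collections.Generic;

static class TokenEnumerationSketch
{
    // Hypothetical helper: yields each run of non-delimiter characters.
    public static IEnumerable<string> EnumerateTokens(this string text, char[] delimiters)
    {
        int pos = 0;
        while (pos < text.Length)
        {
            // Skip any leading delimiters
            while (pos < text.Length && Array.IndexOf(delimiters, text[pos]) >= 0)
                pos++;
            if (pos >= text.Length)
                yield break;

            // Find the end of the token
            int end = text.IndexOfAny(delimiters, pos);
            if (end < 0)
                end = text.Length;

            yield return text.Substring(pos, end - pos);
            pos = end;
        }
    }
}
```

A caller could then write foreach (string token in s.EnumerateTokens(whitespace)) and never see the position variable. With the string and delimiters from Listing 4, this yields This, is, a, test, and string! in that order. The local variable pos lives inside the iterator, which is one way to get per-parse state without a separate parser class.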

Conclusion

As you can see, extension methods are very simple to implement and use. They are an added convenience that, in some cases, can make your code easier to read and maintain.

The code presented in this article is a bit of an experiment. Mostly, I just wanted to see how I liked this code implemented as methods of the string class as opposed to a stand-alone parsing class. I haven't quite decided yet. If you are doing this type of parsing, perhaps you already have a preference.

End-User License

Use of this article and any related source code or other files is governed by the terms and conditions of The Code Project Open License.

Author Information

Jonathan Wood

I'm a software/website developer working out of the greater Salt Lake City area in Utah. I've developed many websites including Black Belt Coder, Insider Articles, and others.