A sscanf() Replacement for .NET

By Jonathan Wood on 2/25/2011
Language: C#
Technology: .NET
Platform: Windows
License: CPOL
Views: 49,386


Download Source Code and Test Project

Introduction

C and C++ developers have used the scanf() family of functions (scanf(), sscanf(), fscanf(), etc.) as a quick and easy way to parse well-structured input. The basic idea is to be able to specify the format of an input string in a way that allows the function to extract fields from that string.

For example, if your input string is "X123 Y456", you could specify a format string of "X%d Y%d" to extract the two numeric values, 123 and 456. The "%d" tells the function to extract a decimal value at the current location. The parser reads the decimal value until a non-digit character is encountered. The values are then assigned to variables and returned to the caller. In this example, the X and Y are character literals. These are characters in the input string expected to match the same characters in the format string.

Processing stops either when the end of the format string is reached, or when characters in the input string cannot be processed according to the format string.
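
To make the matching rules concrete, here is a minimal sketch of the idea. It is not the ScanFormatted class presented later; the MiniScan name and its Scan method are hypothetical, and it supports only literal characters, whitespace, and %d without a sign, width, or modifiers.

```csharp
using System;
using System.Collections.Generic;

public static class MiniScan
{
    // Illustration only: handles literals, whitespace and unsigned %d fields
    public static List<int> Scan(string input, string format)
    {
        var results = new List<int>();
        int i = 0;  // position in input
        int f = 0;  // position in format

        while (f < format.Length && i < input.Length)
        {
            if (format[f] == '%' && f + 1 < format.Length && format[f + 1] == 'd')
            {
                // Read digits until a non-digit character is encountered
                int start = i;
                while (i < input.Length && input[i] >= '0' && input[i] <= '9')
                    i++;
                if (i == start)
                    break;  // no digits here, so the field cannot be parsed
                results.Add(int.Parse(input.Substring(start, i - start)));
                f += 2;
            }
            else if (Char.IsWhiteSpace(format[f]))
            {
                // Whitespace in the format skips any whitespace in the input
                while (i < input.Length && Char.IsWhiteSpace(input[i]))
                    i++;
                f++;
            }
            else if (format[f] == input[i])
            {
                // Literal character matches
                i++;
                f++;
            }
            else break;  // mismatch: stop processing
        }
        return results;
    }
}
```

Calling MiniScan.Scan("X123 Y456", "X%d Y%d") yields 123 and 456. With the input "X12 Z9", processing stops at the mismatched Z and only 12 is returned.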

To be sure, there are limits to this approach. For example, you wouldn't use scanf() to parse source code. It works best with well-structured input that can readily be divided into fields. .NET developers who like the regular expression classes can use those classes for parsing well-structured text. However, for cases where scanf() works, or for developers who are accustomed to using scanf(), it provides a simple and convenient way to parse many types of text.
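
For comparison, the same "X123 Y456" input can be handled with the regular expression classes. This is only a sketch; the RegexComparison class and ParseXY method are hypothetical names, and the pattern is one reasonable translation of the "X%d Y%d" format for unsigned values.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexComparison
{
    // Returns the two numbers from input such as "X123 Y456",
    // or null if the input does not match the pattern
    public static int[] ParseXY(string input)
    {
        Match m = Regex.Match(input, @"^X([0-9]+)\s*Y([0-9]+)");
        if (!m.Success)
            return null;
        return new int[]
        {
            int.Parse(m.Groups[1].Value),
            int.Parse(m.Groups[2].Value)
        };
    }
}
```

One difference worth noting: the regular expression either matches as a whole or fails, while a scanf()-style parser returns whatever fields it managed to read before the first mismatch.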

The scanf() Format String

The scanf() format string provides a flexible way to describe the fields in the input string. Although there are standards, different C compilers seemed to have slightly different rules about the meaning of some parts of the format string. The following definition is for format strings used by the class I'll present in this article.

Characters Description
Whitespace Any whitespace character in the format string causes the position to advance to the next non-whitespace character in the input string. Whitespace characters include spaces, tabs and new lines.
Non-Whitespace except percent (%) Any character that is not a whitespace character or part of a format specifier (which begins with a % character) advances past the same matching character in the input string.
Format specifier A sequence that begins with a percent sign (%) to signify a format specifier, or field, that will be parsed and stored in a variable. A format specifier has the following form.

%[*][width][modifiers]type

Items within square brackets ([]) are optional. The following table describes elements within the format specifier.

Element Meaning
* Indicates that this field is parsed normally but not stored in a variable.
width Specifies the maximum number of characters to be read for this field.
modifiers If supplied, modifies the size of the data type where the field is stored. If not supplied, the default size is used. Supported modifiers are listed below.
hh For integer fields, the result is stored in an 8-bit variable. Ignored for floating point fields.
h For integer fields, the result is stored in a 16-bit variable. Ignored for floating point fields.
l For integer fields, the result is stored in a 64-bit variable. Floating point fields are stored in a double.
ll Same effect as the l modifier.
type Specifies the field type as described in the following table.

Type Meaning
c Reads a single character. If a width > 1 is specified, an array of characters is read.
d, i Reads a decimal integer. Number may begin with 0 (octal), 0x (hexadecimal) or a + or - sign.
a, A, e, E, f, F, g, G Reads a floating point value. Number may begin with a + or - sign, and may be written using exponential notation.
o Reads an unsigned octal integer.
s Reads a string of characters up to the end of the input string, the next whitespace character, or until the number of characters specified for the width has been read.
u Reads an unsigned decimal integer. Number may begin with 0 (octal), 0x (hexadecimal) or a + sign.
x, X Reads an unsigned hexadecimal integer.
[ Reads a string of characters that are included within square brackets. For example, "[abc]" will read all characters that are either a, b, or c. Use "[^abc]" to read all characters that are not a, b, or c. If the first character after "[" or after "[^" is "]", the closing square bracket is considered to be one of the characters rather than the end of the scanset.
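
As a quick way to see how the pieces of a format specifier fit together, the sketch below splits a single specifier into its elements with a regular expression. It is an illustration under my own assumptions, not the way the ScanFormatted class parses specifiers; the SpecifierDemo class, the Describe method, and the pattern itself are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpecifierDemo
{
    // Mirrors the form %[*][width][modifiers]type described above.
    // A scanset type is written as [...] or [^...], and a ']' placed
    // first inside the brackets is treated as a member of the set.
    private static readonly Regex Pattern = new Regex(
        @"^%(?<skip>\*)?(?<width>[0-9]+)?(?<mod>hh|h|ll|l)?" +
        @"(?<type>[cdieEfgGosuxX]|\[\^?\]?[^\]]*\])$");

    // Returns a short description of the specifier, or "invalid"
    public static string Describe(string spec)
    {
        Match m = Pattern.Match(spec);
        if (!m.Success)
            return "invalid";
        return String.Format("skip={0} width={1} mod={2} type={3}",
            m.Groups["skip"].Success,
            m.Groups["width"].Success ? m.Groups["width"].Value : "none",
            m.Groups["mod"].Success ? m.Groups["mod"].Value : "none",
            m.Groups["type"].Value);
    }
}
```

For example, Describe("%*5hd") reports a skipped field with a width of 5, the h modifier, and type d, while Describe("%[^]abc]") reports a scanset type of [^]abc].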

The ScanFormatted Class

I thought that writing a scanf() replacement for .NET would be a cool little project, so I created the ScanFormatted class (shown in Listing 1).

It is not entirely compatible with the C standard. For starters, there are actually several slight variations on the scanf() implementations available today. Also, some changes were necessary in order to run on the .NET platform. For example, pointer types are not supported.

Another thing that changed is the way parsed variables are returned. scanf() takes a variable number of arguments that get populated with the parsed fields. .NET allows a variable number of arguments using a params array. Unfortunately, a params parameter cannot be declared ref or out, and there is no way to apply ref or out to the individual arguments in the array, so the method cannot write values back into the caller's variables. So I added a Results collection to the class instead. This collection is of type List<object>. After parsing a string, this collection contains all the values that were parsed.
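
The following sketch shows why a params array cannot stand in for scanf()'s output arguments, and what returning a collection looks like instead. The ParamsDemo class and its members are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;

public static class ParamsDemo
{
    // Overwrites the elements of the params array. The array holds boxed
    // copies of the arguments, so the caller's variables are not changed.
    public static void FillParams(params object[] args)
    {
        for (int i = 0; i < args.Length; i++)
            args[i] = 42;
    }

    // Returns the values in a collection instead, in the spirit of Results
    public static List<object> FillList(int count)
    {
        var results = new List<object>();
        for (int i = 0; i < count; i++)
            results.Add(42);
        return results;
    }

    // Returns true if the caller's variables are untouched by FillParams
    public static bool CallerValuesUnchanged()
    {
        int a = 0, b = 0;
        FillParams(a, b);
        return a == 0 && b == 0;
    }
}
```

CallerValuesUnchanged() returns true: the assignments inside FillParams only affect the temporary array. The values have to come back some other way, which is the role of the Results collection.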

While it may not be exactly like scanf(), I did make an effort to come reasonably close to the C standard. For simple format strings, the parser should work almost identically. For some more complex format strings, some changes may be necessary.

The ScanFormatted class makes use of my TextParser helper class. You will need it to use the code below. All source code and a test program are included in the attached download.
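
The TextParser class itself is not reproduced in this article. For readers who want to follow the listing without the download, here is a minimal stand-in inferred only from how the listing below calls it. It is not the actual TextParser. The behavior shown here (Peek() returning '\0' past the end of the text, MoveAhead() accepting a negative count, MoveTo() stopping at the end of the text when the character is not found, and Extract() taking start and end positions) is an assumption on my part.

```csharp
using System;

// Hypothetical stand-in for the TextParser helper (assumed behavior)
public class TextParser
{
    private readonly string _text;
    private int _pos;

    public TextParser(string text)
    {
        _text = text ?? String.Empty;
        _pos = 0;
    }

    // Current position within the text
    public int Position { get { return _pos; } }

    // True when the position is at or past the end of the text
    public bool EndOfText { get { return _pos >= _text.Length; } }

    // Returns the current character, or '\0' if out of range
    public char Peek() { return Peek(0); }

    // Returns the character at the given offset, or '\0' if out of range
    public char Peek(int ahead)
    {
        int i = _pos + ahead;
        return (i >= 0 && i < _text.Length) ? _text[i] : '\0';
    }

    // Moves ahead one character
    public void MoveAhead() { MoveAhead(1); }

    // Moves by count characters (may be negative), clamped to the text
    public void MoveAhead(int count)
    {
        _pos = Math.Max(0, Math.Min(_pos + count, _text.Length));
    }

    // Moves to the next non-whitespace character
    public void MovePastWhitespace()
    {
        while (_pos < _text.Length && Char.IsWhiteSpace(_text[_pos]))
            _pos++;
    }

    // Moves to the next occurrence of c, or to the end of the text
    public void MoveTo(char c)
    {
        int i = _text.IndexOf(c, _pos);
        _pos = (i < 0) ? _text.Length : i;
    }

    // Returns the text from start up to, but not including, end
    public string Extract(int start, int end)
    {
        return _text.Substring(start, end - start);
    }
}
```

The negative MoveAhead() count matters because the field parsers first read as many characters as they can and then back up when the field width was exceeded.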

Listing 1: The ScanFormatted Class

using System;
using System.Collections.Generic;
using System.Linq;

/// <summary>
/// Class that provides functionality of the standard C library sscanf()
/// function.
/// </summary>
public class ScanFormatted
{
    // Format type specifiers
    protected enum Types
    {
        Character,
        Decimal,
        Float,
        Hexadecimal,
        Octal,
        ScanSet,
        String,
        Unsigned
    }

    // Format modifiers
    protected enum Modifiers
    {
        None,
        ShortShort,
        Short,
        Long,
        LongLong
    }

    // Delegate to parse a type
    protected delegate bool ParseValue(TextParser input, FormatSpecifier spec);

    // Class to associate format type with type parser
    protected class TypeParser
    {
        public Types Type { get; set; }
        public ParseValue Parser { get; set; }
    }

    // Class to hold format specifier information
    protected class FormatSpecifier
    {
        public Types Type { get; set; }
        public Modifiers Modifier { get; set; }
        public int Width { get; set; }
        public bool NoResult { get; set; }
        public string ScanSet { get; set; }
        public bool ScanSetExclude { get; set; }
    }

    // Lookup table to find parser by parser type
    protected TypeParser[] _typeParsers;

    // Holds results after calling Parse()
    public List<object> Results;

    // Constructor
    public ScanFormatted()
    {
        // Populate parser type lookup table
        _typeParsers = new TypeParser[] {
            new TypeParser() { Type = Types.Character, Parser = ParseCharacter },
            new TypeParser() { Type = Types.Decimal, Parser = ParseDecimal },
            new TypeParser() { Type = Types.Float, Parser = ParseFloat },
            new TypeParser() { Type = Types.Hexadecimal, Parser = ParseHexadecimal },
            new TypeParser() { Type = Types.Octal, Parser = ParseOctal },
            new TypeParser() { Type = Types.ScanSet, Parser = ParseScanSet },
            new TypeParser() { Type = Types.String, Parser = ParseString },
            new TypeParser() { Type = Types.Unsigned, Parser = ParseDecimal }
        };
        // Allocate results collection
        Results = new List<object>();
    }

    /// <summary>
    /// Parses the input string according to the rules in the
    /// format string. Similar to the standard C library's
    /// sscanf() function. Parsed fields are placed in the
    /// class' Results member.
    /// </summary>
    /// <param name="input">String to parse</param>
    /// <param name="format">Specifies rules for parsing input</param>
    public int Parse(string input, string format)
    {
        TextParser inp = new TextParser(input);
        TextParser fmt = new TextParser(format);
        FormatSpecifier spec = new FormatSpecifier();
        int count = 0;

        // Clear any previous results
        Results.Clear();

        // Process input string as indicated in format string
        while (!fmt.EndOfText && !inp.EndOfText)
        {
            if (ParseFormatSpecifier(fmt, spec))
            {
                // Found a format specifier
                TypeParser parser = _typeParsers.First(tp => tp.Type == spec.Type);
                if (parser.Parser(inp, spec))
                    count++;
                else
                    break;
            }
            else if (Char.IsWhiteSpace(fmt.Peek()))
            {
                // Whitespace
                inp.MovePastWhitespace();
                fmt.MoveAhead();
            }
            else if (fmt.Peek() == inp.Peek())
            {
                // Matching character
                inp.MoveAhead();
                fmt.MoveAhead();
            }
            else break;    // Break at mismatch
        }

        // Return number of fields successfully parsed
        return count;
    }

    /// <summary>
    /// Attempts to parse a field format specifier from the format string.
    /// </summary>
    protected bool ParseFormatSpecifier(TextParser format, FormatSpecifier spec)
    {
        // Return if not a field format specifier
        if (format.Peek() != '%')
            return false;
        format.MoveAhead();

        // Return if "%%" (treat as '%' literal)
        if (format.Peek() == '%')
            return false;

        // Test for asterisk, which indicates result is not stored
        if (format.Peek() == '*')
        {
            spec.NoResult = true;
            format.MoveAhead();
        }
        else spec.NoResult = false;

        // Parse width
        int start = format.Position;
        while (Char.IsDigit(format.Peek()))
            format.MoveAhead();
        if (format.Position > start)
            spec.Width = int.Parse(format.Extract(start, format.Position));
        else
            spec.Width = 0;

        // Parse modifier
        if (format.Peek() == 'h')
        {
            format.MoveAhead();
            if (format.Peek() == 'h')
            {
                format.MoveAhead();
                spec.Modifier = Modifiers.ShortShort;
            }
            else spec.Modifier = Modifiers.Short;
        }
        else if (Char.ToLower(format.Peek()) == 'l')
        {
            format.MoveAhead();
            if (format.Peek() == 'l')
            {
                format.MoveAhead();
                spec.Modifier = Modifiers.LongLong;
            }
            else spec.Modifier = Modifiers.Long;
        }
        else spec.Modifier = Modifiers.None;

        // Parse type
        switch (format.Peek())
        {
            case 'c':
                spec.Type = Types.Character;
                break;
            case 'd':
            case 'i':
                spec.Type = Types.Decimal;
                break;
            case 'a':
            case 'A':
            case 'e':
            case 'E':
            case 'f':
            case 'F':
            case 'g':
            case 'G':
                spec.Type = Types.Float;
                break;
            case 'o':
                spec.Type = Types.Octal;
                break;
            case 's':
                spec.Type = Types.String;
                break;
            case 'u':
                spec.Type = Types.Unsigned;
                break;
            case 'x':
            case 'X':
                spec.Type = Types.Hexadecimal;
                break;
            case '[':
                spec.Type = Types.ScanSet;
                format.MoveAhead();
                // Parse scan set characters
                if (format.Peek() == '^')
                {
                    spec.ScanSetExclude = true;
                    format.MoveAhead();
                }
                else spec.ScanSetExclude = false;
                start = format.Position;
                // Treat immediate ']' as literal
                if (format.Peek() == ']')
                    format.MoveAhead();
                format.MoveTo(']');
                if (format.EndOfText)
                    throw new Exception("Type specifier expected character : ']'");
                spec.ScanSet = format.Extract(start, format.Position);
                break;
            default:
                string msg = String.Format("Unknown format type specified : '{0}'", format.Peek());
                throw new Exception(msg);
        }
        format.MoveAhead();
        return true;
    }

    /// <summary>
    /// Parse a character field
    /// </summary>
    private bool ParseCharacter(TextParser input, FormatSpecifier spec)
    {
        // Parse character(s)
        int start = input.Position;
        int count = (spec.Width > 1) ? spec.Width : 1;
        while (!input.EndOfText && count-- > 0)
            input.MoveAhead();

        // Extract token
        if (count <= 0 && input.Position > start)
        {
            if (!spec.NoResult)
            {
                string token = input.Extract(start, input.Position);
                if (token.Length > 1)
                    Results.Add(token.ToCharArray());
                else
                    Results.Add(token[0]);
            }
            return true;
        }
        return false;
    }

    /// <summary>
    /// Parse integer field
    /// </summary>
    private bool ParseDecimal(TextParser input, FormatSpecifier spec)
    {
        int radix = 10;

        // Skip any whitespace
        input.MovePastWhitespace();

        // Parse leading sign
        int start = input.Position;
        if (input.Peek() == '+' || input.Peek() == '-')
        {
            input.MoveAhead();
        }
        else if (input.Peek() == '0')
        {
            if (Char.ToLower(input.Peek(1)) == 'x')
            {
                radix = 16;
                input.MoveAhead(2);
            }
            else
            {
                radix = 8;
                input.MoveAhead();
            }
        }

        // Parse digits
        while (IsValidDigit(input.Peek(), radix))
            input.MoveAhead();

        // Don't exceed field width
        if (spec.Width > 0)
        {
            int count = input.Position - start;
            if (spec.Width < count)
                input.MoveAhead(spec.Width - count);
        }

        // Extract token
        if (input.Position > start)
        {
            if (!spec.NoResult)
            {
                if (spec.Type == Types.Decimal)
                    AddSigned(input.Extract(start, input.Position), spec.Modifier, radix);
                else
                    AddUnsigned(input.Extract(start, input.Position), spec.Modifier, radix);
            }
            return true;
        }
        return false;
    }

    /// <summary>
    /// Parse a floating-point field
    /// </summary>
    private bool ParseFloat(TextParser input, FormatSpecifier spec)
    {
        // Skip any whitespace
        input.MovePastWhitespace();

        // Parse leading sign
        int start = input.Position;
        if (input.Peek() == '+' || input.Peek() == '-')
            input.MoveAhead();

        // Parse digits
        bool hasPoint = false;
        while (Char.IsDigit(input.Peek()) || input.Peek() == '.')
        {
            if (input.Peek() == '.')
            {
                if (hasPoint)
                    break;
                hasPoint = true;
            }
            input.MoveAhead();
        }

        // Parse exponential notation
        if (Char.ToLower(input.Peek()) == 'e')
        {
            input.MoveAhead();
            if (input.Peek() == '+' || input.Peek() == '-')
                input.MoveAhead();
            while (Char.IsDigit(input.Peek()))
                input.MoveAhead();
        }

        // Don't exceed field width
        if (spec.Width > 0)
        {
            int count = input.Position - start;
            if (spec.Width < count)
                input.MoveAhead(spec.Width - count);
        }

        // Because we parse the exponential notation before we apply
        // any field-width constraint, it becomes awkward to verify
        // we have a valid floating point token. To prevent an
        // exception, we use TryParse() here instead of Parse().
        double result;

        // Extract token
        if (input.Position > start &&
            double.TryParse(input.Extract(start, input.Position), out result))
        {
            if (!spec.NoResult)
            {
                if (spec.Modifier == Modifiers.Long ||
                    spec.Modifier == Modifiers.LongLong)
                    Results.Add(result);
                else
                    Results.Add((float)result);
            }
            return true;
        }
        return false;
    }

    /// <summary>
    /// Parse hexadecimal field
    /// </summary>
    protected bool ParseHexadecimal(TextParser input, FormatSpecifier spec)
    {
        // Skip any whitespace
        input.MovePastWhitespace();

        // Parse 0x prefix
        int start = input.Position;
        if (input.Peek() == '0' && Char.ToLower(input.Peek(1)) == 'x')
            input.MoveAhead(2);

        // Parse digits
        while (IsValidDigit(input.Peek(), 16))
            input.MoveAhead();

        // Don't exceed field width
        if (spec.Width > 0)
        {
            int count = input.Position - start;
            if (spec.Width < count)
                input.MoveAhead(spec.Width - count);
        }

        // Extract token
        if (input.Position > start)
        {
            if (!spec.NoResult)
                AddUnsigned(input.Extract(start, input.Position), spec.Modifier, 16);
            return true;
        }
        return false;
    }

    /// <summary>
    /// Parse an octal field
    /// </summary>
    private bool ParseOctal(TextParser input, FormatSpecifier spec)
    {
        // Skip any whitespace
        input.MovePastWhitespace();

        // Parse digits
        int start = input.Position;
        while (IsValidDigit(input.Peek(), 8))
            input.MoveAhead();

        // Don't exceed field width
        if (spec.Width > 0)
        {
            int count = input.Position - start;
            if (spec.Width < count)
                input.MoveAhead(spec.Width - count);
        }

        // Extract token
        if (input.Position > start)
        {
            if (!spec.NoResult)
                AddUnsigned(input.Extract(start, input.Position), spec.Modifier, 8);
            return true;
        }
        return false;
    }

    /// <summary>
    /// Parse a scan-set field
    /// </summary>
    protected bool ParseScanSet(TextParser input, FormatSpecifier spec)
    {
        // Parse characters
        int start = input.Position;
        if (!spec.ScanSetExclude)
        {
            while (spec.ScanSet.Contains(input.Peek()))
                input.MoveAhead();
        }
        else
        {
            while (!input.EndOfText && !spec.ScanSet.Contains(input.Peek()))
                input.MoveAhead();
        }

        // Don't exceed field width
        if (spec.Width > 0)
        {
            int count = input.Position - start;
            if (spec.Width < count)
                input.MoveAhead(spec.Width - count);
        }

        // Extract token
        if (input.Position > start)
        {
            if (!spec.NoResult)
                Results.Add(input.Extract(start, input.Position));
            return true;
        }
        return false;
    }

    /// <summary>
    /// Parse a string field
    /// </summary>
    private bool ParseString(TextParser input, FormatSpecifier spec)
    {
        // Skip any whitespace
        input.MovePastWhitespace();

        // Parse string characters
        int start = input.Position;
        while (!input.EndOfText && !Char.IsWhiteSpace(input.Peek()))
            input.MoveAhead();

        // Don't exceed field width
        if (spec.Width > 0)
        {
            int count = input.Position - start;
            if (spec.Width < count)
                input.MoveAhead(spec.Width - count);
        }

        // Extract token
        if (input.Position > start)
        {
            if (!spec.NoResult)
                Results.Add(input.Extract(start, input.Position));
            return true;
        }
        return false;
    }

    // Determines if the given digit is valid for the given radix
    private bool IsValidDigit(char c, int radix)
    {
        int i = "0123456789abcdef".IndexOf(Char.ToLower(c));
        if (i >= 0 && i < radix)
            return true;
        return false;
    }

    // Parse signed token and add to results
    private void AddSigned(string token, Modifiers mod, int radix)
    {
        object obj;
        if (mod == Modifiers.ShortShort)
            obj = Convert.ToSByte(token, radix);
        else if (mod == Modifiers.Short)
            obj = Convert.ToInt16(token, radix);
        else if (mod == Modifiers.Long ||
            mod == Modifiers.LongLong)
            obj = Convert.ToInt64(token, radix);
        else
            obj = Convert.ToInt32(token, radix);
        Results.Add(obj);
    }

    // Parse unsigned token and add to results
    private void AddUnsigned(string token, Modifiers mod, int radix)
    {
        object obj;
        if (mod == Modifiers.ShortShort)
            obj = Convert.ToByte(token, radix);
        else if (mod == Modifiers.Short)
            obj = Convert.ToUInt16(token, radix);
        else if (mod == Modifiers.Long ||
            mod == Modifiers.LongLong)
            obj = Convert.ToUInt64(token, radix);
        else
            obj = Convert.ToUInt32(token, radix);
        Results.Add(obj);
    }
}
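
The AddSigned() and AddUnsigned() helpers above rely on the Convert methods that take a base argument, and the parsed values end up boxed in a List<object>. The sketch below shows both points in isolation. The ConvertDemo class is hypothetical, and the tokens are examples of what the parsers might extract.

```csharp
using System;
using System.Collections.Generic;

public static class ConvertDemo
{
    // Converts three sample tokens the same way the helpers do
    public static List<object> Build()
    {
        var results = new List<object>();
        results.Add(Convert.ToInt32("-123", 10));   // signed decimal
        results.Add(Convert.ToInt32("017", 8));     // octal digits
        results.Add(Convert.ToUInt32("0x1F", 16));  // "0x" prefix is accepted for base 16
        return results;
    }

    // Adds up the values without knowing each item's exact integer type
    public static long Sum(List<object> results)
    {
        long total = 0;
        foreach (object value in results)
            total += Convert.ToInt64(value);  // works for any boxed integer type
        return total;
    }
}
```

Because each item is boxed as its exact type, a direct cast must name that type: (int)results[0] works for a value stored as an int, but (long)results[0] would throw an InvalidCastException. Convert.ToInt64() avoids that when the exact type is not important.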

Conclusion

I hope some people will find this code useful. I've seen quite a few posts on the web either asking if there was a scanf() replacement for .NET, or just asking how to parse text that could easily be parsed by this class.

If nothing else, it was an interesting little exercise for me to write.

End-User License

Use of this article and any related source code or other files is governed by the terms and conditions of The Code Project Open License.

Author Information

Jonathan Wood

I'm a software/website developer working out of the greater Salt Lake City area in Utah. I've developed many websites including Black Belt Coder, Insider Articles, and others.