Converting Text to HTML

By Jonathan Wood on 3/31/2011
Language: C#, HTML
Technology: .NET
Platform: Windows
License: CPOL

Screenshot of Demo Project

Download Source Code and Demo Project

Introduction

When writing text for a Web page, that text must be formatted according to the rules of HTML or XHTML. Characters that have special meaning in HTML (such as <, > and &) must be properly encoded and, since most whitespace is ignored, special tags are required to denote line breaks and paragraphs.

However, sometimes an application needs to display text from a file or database. This text may be entered by the user or from another source that has not been formatted for HTML. In these cases, the text must be converted.

Fortunately, the .NET library provides the HttpUtility.HtmlEncode() method (in the System.Web namespace) to encode special characters so that they appear as expected in a browser. However, this method does nothing about line breaks and paragraphs.
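To illustrate that limitation, here is a minimal sketch. The HtmlEncodeDemo class and its Encode() method are hypothetical names used only for this illustration; the only library call is HttpUtility.HtmlEncode().

```csharp
using System;
using System.Web;

public static class HtmlEncodeDemo
{
    // Hypothetical wrapper: HTML-encodes the text and nothing else.
    public static string Encode(string text)
    {
        // Special characters such as <, > and & become entities,
        // but "\r\n" sequences pass through unchanged.
        return HttpUtility.HtmlEncode(text);
    }
}
```

Calling HtmlEncodeDemo.Encode("a < b & c\r\nsecond line") returns "a &lt; b &amp; c" followed by the original, untouched line break and "second line". Because browsers collapse that line break into ordinary whitespace, both lines would run together on the page.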

When a Web application needs to display unformatted text that contains multiple lines and/or paragraphs on a Web page, a little more work is required.

Writing a Text-to-HTML Converter

Listing 1 shows my ToHtml() extension method, which adds a ToHtml() method to the string type.

The ToHtml() method converts blocks of text separated by two or more newlines into paragraphs (using <p></p> tags). It converts single newlines within a paragraph into line breaks (using <br /> tags). And it calls HttpUtility.HtmlEncode() to HTML-encode special characters.
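To make the paragraph and line-break rules concrete, here is a minimal sketch that uses String.Split() rather than the IndexOf() loop found in Listing 1. The ParagraphSketch class and ToParagraphs() method are hypothetical names used only for illustration. The sketch assumes the input uses \r\n line endings and it ignores the link syntax entirely.

```csharp
using System;
using System.Linq;
using System.Web;

public static class ParagraphSketch
{
    // Hypothetical helper, not part of Listing 1.
    public static string ToParagraphs(string text)
    {
        // A blank line ("\r\n\r\n") separates paragraphs.
        string[] paragraphs = text.Split(
            new[] { "\r\n\r\n" }, StringSplitOptions.RemoveEmptyEntries);

        var html = paragraphs
            .Select(p => p.Trim())
            .Where(p => p.Length > 0)
            // Encode first, then turn remaining single newlines into <br />.
            .Select(p => "<p>" + HttpUtility.HtmlEncode(p).Replace("\r\n", "<br />") + "</p>");

        return string.Join("\r\n", html);
    }
}
```

With the input "One\r\nTwo\r\n\r\nThree", this sketch returns two paragraphs: <p>One<br />Two</p> and <p>Three</p>. The ToHtml() method follows the same rules but writes its output through a StringBuilder.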

In addition, this method supports a special syntax for specifying links. Because regular <a> tags would be encoded by this method, a special syntax is required to allow users to specify a hyperlink.

This syntax uses double square brackets ([[ and ]]). So, for example, [[http://www.blackbeltcoder.com]] produces a hyperlink with http://www.blackbeltcoder.com as both the anchor text and the target URL.

You can also specify two text values in the form [[Black Belt Coder][http://www.blackbeltcoder.com]]. This produces a hyperlink with Black Belt Coder as the anchor text and http://www.blackbeltcoder.com as the target URL.
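The two forms can be told apart by looking for "][" in the text between the outer brackets. The following sketch shows that single step in isolation. LinkSyntaxSketch and ParseLink() are hypothetical names, and the helper assumes the caller has already stripped the surrounding [[ and ]].

```csharp
using System;

public static class LinkSyntaxSketch
{
    // Hypothetical helper: splits the text found between "[[" and "]]"
    // into an anchor label and a target URL.
    public static void ParseLink(string inner, out string label, out string url)
    {
        int i = inner.IndexOf("][", StringComparison.Ordinal);
        if (i >= 0)
        {
            // Two-part form: Text][URL
            label = inner.Substring(0, i);
            url = inner.Substring(i + 2);
        }
        else
        {
            // One-part form: the URL doubles as the anchor text
            label = inner;
            url = inner;
        }
    }
}
```

For the inner text "Black Belt Coder][http://www.blackbeltcoder.com", this yields the label "Black Belt Coder" and the URL "http://www.blackbeltcoder.com". For a bare URL, both values are the URL itself.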

If you are simply taking unformatted text and displaying it on a Web page, then this special link syntax won't come into play (the double square brackets are unlikely to occur naturally). But if you or your users want the ability to submit Web content in plain text, this provides an easy syntax to specify hyperlinks.

Listing 1: The ToHtml() Extension Method

using System;
using System.Text;
using System.Web;

public static class StringMethodExtensions
{
    private static string _paraBreak = "\r\n\r\n";
    private static string _link = "<a href=\"{0}\">{1}</a>";
    private static string _linkNoFollow = "<a href=\"{0}\" rel=\"nofollow\">{1}</a>";

    /// <summary>
    /// Returns a copy of this string converted to HTML markup.
    /// </summary>
    public static string ToHtml(this string s)
    {
        return ToHtml(s, false);
    }

    /// <summary>
    /// Returns a copy of this string converted to HTML markup.
    /// </summary>
    /// <param name="nofollow">If true, links are given "nofollow"
    /// attribute</param>
    public static string ToHtml(this string s, bool nofollow)
    {
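        // Nothing to convert for null or empty input
        if (string.IsNullOrEmpty(s))
            return string.Empty;

        // Normalize line endings so "\n" and "\r" breaks are treated like "\r\n"
        s = s.Replace("\r\n", "\n").Replace('\r', '\n').Replace("\n", "\r\n");

        StringBuilder sb = new StringBuilder();

        int pos = 0;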
        while (pos < s.Length)
        {
            // Extract next paragraph
            int start = pos;
            pos = s.IndexOf(_paraBreak, start);
            if (pos < 0)
                pos = s.Length;
            string para = s.Substring(start, pos - start).Trim();

            // Encode non-empty paragraph
            if (para.Length > 0)
                EncodeParagraph(para, sb, nofollow);

            // Skip over paragraph break
            pos += _paraBreak.Length;
        }
        // Return result
        return sb.ToString();
    }

    /// <summary>
    /// Encodes a single paragraph to HTML.
    /// </summary>
    /// <param name="s">Text to encode</param>
    /// <param name="sb">StringBuilder to write results</param>
    /// <param name="nofollow">If true, links are given "nofollow"
    /// attribute</param>
    private static void EncodeParagraph(string s, StringBuilder sb, bool nofollow)
    {
        // Start new paragraph
        sb.AppendLine("<p>");

        // HTML encode text
        s = HttpUtility.HtmlEncode(s);

        // Convert single newlines to <br /> (match "\r\n" explicitly so the
        // result does not depend on the platform's Environment.NewLine)
        s = s.Replace("\r\n", "<br />\r\n");

        // Encode any hyperlinks
        EncodeLinks(s, sb, nofollow);

        // Close paragraph
        sb.AppendLine("\r\n</p>");
    }

    /// <summary>
    /// Encodes [[URL]] and [[Text][URL]] links to HTML.
    /// </summary>
    /// <param name="s">Text to encode</param>
    /// <param name="sb">StringBuilder to write results</param>
    /// <param name="nofollow">If true, links are given "nofollow"
    /// attribute</param>
    private static void EncodeLinks(string s, StringBuilder sb, bool nofollow)
    {
        // Parse and encode any hyperlinks
        int pos = 0;
        while (pos < s.Length)
        {
            // Look for next link
            int start = pos;
            pos = s.IndexOf("[[", pos);
            if (pos < 0)
                pos = s.Length;
            // Copy text before link
            sb.Append(s.Substring(start, pos - start));
            if (pos < s.Length)
            {
                string label, link;

                start = pos + 2;
                pos = s.IndexOf("]]", start);
                if (pos < 0)
                {
                    // No closing "]]": emit the rest of the text literally
                    sb.Append(s.Substring(start - 2));
                    break;
                }
                label = s.Substring(start, pos - start);
                int i = label.IndexOf("][");
                if (i >= 0)
                {
                    link = label.Substring(i + 2);
                    label = label.Substring(0, i);
                }
                else
                {
                    link = label;
                }
                // Append link
                sb.AppendFormat(nofollow ? _linkNoFollow : _link, link, label);

                // Skip over closing "]]"
                pos += 2;
            }
        }
    }
}

Using the Code

Using the code is very simple. As long as the StringMethodExtensions class is in scope (that is, its namespace is imported where you call it), strings will automatically have the new ToHtml() method.

The code in Listing 2 creates a short, two-paragraph string, converts it to HTML, and assigns the result to a Literal control.

Listing 2: Using the StringMethodExtensions Class

string s = "Here is paragraph 1.\r\n\r\nHere's paragraph 2.";
litText.Text = s.ToHtml();

This generates the HTML shown in Listing 3.

Listing 3: The HTML Produced by the Code in Listing 2

<p>
Here is paragraph 1.
</p>
<p>
Here&#39;s paragraph 2.
</p>

The attached source code download includes a slightly more extensive example.

Conclusion

This code is based on a routine that I use all the time. For example, I might use it where users can enter text in a <textarea> box.

I also use it where my software allows me to enter quick comments. Rather than having to write correct HTML code, I can just type plain text and my code generates the HTML for me. And in cases where I want to generate a simple hyperlink, this code supports that as well.

Of course, this is a simple routine and there are much more sophisticated ways of converting text to HTML. For example, Markdown is a sort of hybrid syntax between plain text and HTML, and gives you much more control over HTML formatting while still entering primarily plain text. However, that approach would be overkill for many applications.

For applications that must convert text to HTML and don't need a lot of bells and whistles, the code I've presented in this article should do quite nicely.

End-User License

Use of this article and any related source code or other files is governed by the terms and conditions of The Code Project Open License.

Author Information

Jonathan Wood

I'm a software/website developer working out of the greater Salt Lake City area in Utah. I've developed many websites including Black Belt Coder, Insider Articles, and others.