Channel: Martijn's C# Programming Blog » Beginner

Safely cleaning HTML with strip_tags in C#


Removing unwanted tags with StripTags/strip_tags

One of my favorites in the PHP library is the strip_tags function. Not only does it neatly remove HTML from an input, it also lets you specify which tags should stay. This is great if you allow your visitors to use some basic HTML tags in their comments. This post explores two issues: using C# to remove unwanted tags, and cleaning up unwanted attributes that might be hidden in the allowed tags.

I wanted to clean unwanted HTML tags out of comments posted to a website. Users are allowed to use <b>, <i> and even <a href=""></a> in their posts, but anything else must be stripped before it is posted to the site. I found several regular expressions for C# that strip HTML, but these wipe out all of the HTML and offer no way to keep selected tags.

Below is the end result of some hacking, and of course much love-hate with the regular expression library.

string StripTags(string Input, string[] AllowedTags)

The StripTags method takes an input string and an array of allowed tags. It returns the input as a string, minus all unwanted tags.

string test1 = StripTags("<p>George</p><b>W</b><i>Bush</i>", new string[]{"i","b"});
string test2 = StripTags("<p>George <img src='someimage.png' onmouseover='someFunction()'>W <i>Bush</i></p>", new string[]{"p"});
string test3 = StripTags("<a href='http://www.dijksterhuis.org'>Martijn <b>Dijksterhuis</b></a>", new string[]{"a"});

Using the above example code returns the following:

George<b>W</b><i>Bush</i>
<p>George W Bush</p>
<a href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>
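The same keep-or-drop decision can also be written as a single Regex.Replace pass. The following is only a minimal sketch, not the code from this post: the class name StripTagsSketch, the method KeepOnly and the pattern itself are assumptions made for illustration. It captures the tag name in a named group and keeps the tag only when that name is on the allowed list.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class StripTagsSketch
{
    // Matches an opening or closing tag and captures its name in the "name" group.
    private static readonly Regex TagExp = new Regex(@"</?(?<name>[^\s/>]+)[^>]*>");

    // Hypothetical helper: returns the input with every tag removed except the allowed ones.
    public static string KeepOnly(string input, params string[] allowedTags)
    {
        var allowed = new HashSet<string>(allowedTags, StringComparer.OrdinalIgnoreCase);
        // Keep the tag text when its name is allowed, otherwise replace it with nothing.
        return TagExp.Replace(input, m => allowed.Contains(m.Groups["name"].Value) ? m.Value : "");
    }

    static void Main()
    {
        // Prints: George<b>W</b><i>Bush</i>
        Console.WriteLine(KeepOnly("<p>George</p><b>W</b><i>Bush</i>", "i", "b"));
    }
}
```

Because the whole tag name is compared, allowing "b" does not accidentally allow "blockquote". Like StripTags, this sketch leaves the attributes of the allowed tags untouched.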


string StripTagsAndAttributes(string Input, string[] AllowedTags)

The above StripTags function shares a weakness with the original PHP strip_tags function: a malicious user can still insert attributes into any of the allowed tags. Think “style=” and “id=”, or an event handler such as “onclick=”. We would be somewhat safer if we cleaned these as well. The StripTagsAndAttributes method does just that.

It first runs the input through StripTags, and for the remaining tags it strips out all but a restricted set of attributes.

string test4 = "<a class=\"classof69\" onClick='crosssite.boom()' href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>";
Console.WriteLine(StripTagsAndAttributes(test4, new string[]{"a"}));

That “onClick” attribute looks mighty unsafe. Running the string through StripTagsAndAttributes as shown returns:

<a class="classof69" href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>

This function probably needs some tuning if you want to allow more attributes, or restrict things even further.
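One way to make that tuning easier is to pass the allowed attributes in as a list and rebuild each opening tag from only those attributes. This is a rough sketch under stated assumptions, not the method used in this post: AttributeFilterSketch, KeepAttributes and both patterns are hypothetical, attribute values are assumed not to contain ">", and attributes without a value are dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class AttributeFilterSketch
{
    // An opening tag: the tag name followed by the raw attribute text.
    private static readonly Regex OpenTagExp =
        new Regex(@"<(?<name>[a-zA-Z][a-zA-Z0-9]*)(?<attrs>[^>]*)>");

    // One attribute: key = "value", 'value' or an unquoted value.
    private static readonly Regex AttrExp =
        new Regex(@"(?<key>[a-zA-Z][\w-]*)\s*=\s*(?<val>""[^""]*""|'[^']*'|[^\s""'>]+)");

    // Hypothetical helper: rebuilds every opening tag with only the allowed attributes.
    public static string KeepAttributes(string input, params string[] allowedAttributes)
    {
        var allowed = new HashSet<string>(allowedAttributes, StringComparer.OrdinalIgnoreCase);
        return OpenTagExp.Replace(input, tag =>
        {
            var rebuilt = new StringBuilder("<" + tag.Groups["name"].Value);
            foreach (Match attr in AttrExp.Matches(tag.Groups["attrs"].Value))
            {
                // Copy the attribute over only when its name is on the allowed list.
                if (allowed.Contains(attr.Groups["key"].Value))
                    rebuilt.Append(' ').Append(attr.Groups["key"].Value)
                           .Append('=').Append(attr.Groups["val"].Value);
            }
            return rebuilt.Append('>').ToString();
        });
    }

    static void Main()
    {
        string test4 = "<a class=\"classof69\" onClick='crosssite.boom()' href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>";
        // Prints: <a class="classof69" href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>
        Console.WriteLine(KeepAttributes(test4, "href", "class"));
    }
}
```

Run it on the output of StripTags, so that only allowed tags remain. Closing tags are left alone because they do not match the opening-tag pattern. Note that a self-closing tag such as <br/> comes back as <br>, since the sketch rebuilds the tag from its name and attributes only.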

A word of caution

Regular expressions are voodoo, very cool, but still voodoo. The above functions work for the tests I have applied to them, but your mileage may vary! If you have a special situation that doesn’t work, leave a note below and maybe we can work out the problems.
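One situation the tests above do not cover: because href is kept as-is, a link such as href='javascript:crosssite.boom()' passes through StripTagsAndAttributes unchanged. A possible extra check, sketched here with a hypothetical IsSafeHref helper that is not part of the source code below, is to accept only absolute http and https links.

```csharp
using System;

static class HrefCheckSketch
{
    // Hypothetical helper (an assumption, not in the original source):
    // accept only absolute links that use the http or https scheme.
    public static bool IsSafeHref(string href)
    {
        Uri uri;
        return href != null
            && Uri.TryCreate(href.Trim(), UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }

    static void Main()
    {
        Console.WriteLine(IsSafeHref("http://www.dijksterhuis.org"));   // True
        Console.WriteLine(IsSafeHref("javascript:crosssite.boom()"));   // False
    }
}
```

This rejects relative links as well, which may be too strict for some sites; treat the rule as a starting point rather than a complete policy.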

Credits

The StripTags function is of course inspired by the PHP version, and a JavaScript implementation thereof by Kevin van Sonderveld. The attribute stripping routine is based on the regular expressions by mdw252 in one of the strip_tags manual page comments.

Source code

The complete source code for the StripTags function and StripTagsAndAttributes function with my test code can be found below:


using System;
using System.Text.RegularExpressions;

namespace StripHTML
{
	class MainClass
	{

		private static string ReplaceFirst(string haystack, string needle, string replacement)
		{
			int pos = haystack.IndexOf(needle);
			if (pos < 0) return haystack;
			return haystack.Substring(0, pos) + replacement + haystack.Substring(pos + needle.Length);
		}

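		private static string ReplaceAll(string haystack, string needle, string replacement)
		{
			// Nothing to do, and an empty needle would never advance
			if (string.IsNullOrEmpty(needle) || needle == replacement) return haystack;

			int pos = 0;
			// Use >= 0 so that a match at the very start of the string is replaced too
			while ((pos = haystack.IndexOf(needle, pos)) >= 0)
			{
				haystack = haystack.Substring(0, pos) + replacement + haystack.Substring(pos + needle.Length);
				// Continue after the inserted text, so a replacement that contains the needle cannot loop forever
				pos += replacement.Length;
			}
			return haystack;
		}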

		public static string StripTags(string Input, string[] AllowedTags)
		{
			Regex StripHTMLExp = new Regex(@"(<\/?[^>]+>)");
		    string Output = Input;

			foreach(Match Tag in StripHTMLExp.Matches(Input))
			{
				string HTMLTag = Tag.Value.ToLower();
				bool IsAllowed = false;

				foreach(string AllowedTag in AllowedTags)
				{
					int offset = -1;

					// Determine if it is an allowed tag:
					// "<tag>", "<tag " and "</tag>" (the closing ">" keeps "b" from also allowing "</blockquote>")
					if (offset!=0) offset = HTMLTag.IndexOf('<'+AllowedTag+'>');
					if (offset!=0) offset = HTMLTag.IndexOf('<'+AllowedTag+' ');
					if (offset!=0) offset = HTMLTag.IndexOf("</"+AllowedTag+'>');

					// If it matched any of the above the tag is allowed
					if (offset==0)
					{
					 	IsAllowed = true;
						break;
					}
				}

				// Remove tags that are not allowed
				if (!IsAllowed) Output = ReplaceFirst(Output,Tag.Value,"");
			}

			return Output;
		}

		public static string StripTagsAndAttributes(string Input, string[] AllowedTags)
		{
			/* Remove all unwanted tags first */
			string Output = StripTags(Input,AllowedTags);

			/* Lambda functions */
			MatchEvaluator HrefMatch = m => m.Groups[1].Value + "href..;,;.." + m.Groups[2].Value;
			MatchEvaluator ClassMatch = m => m.Groups[1].Value + "class..;,;.." + m.Groups[2].Value;
			MatchEvaluator UnsafeMatch = m => m.Groups[1].Value + m.Groups[4].Value;

			/* Allow the "href" attribute */
			Output = new Regex("(<a.*)href=(.*>)").Replace(Output,HrefMatch);

			/* Allow the "class" attribute */
			Output = new Regex("(<a.*)class=(.*>)").Replace(Output,ClassMatch);

			/* Remove unsafe attributes in any of the remaining tags */
			Output = new Regex(@"(<.*) .*=(\'|\""|\w)[\w|.|(|)]*(\'|\""|\w)(.*>)").Replace(Output,UnsafeMatch);

			/* Return the allowed tags to their proper form */
			Output = ReplaceAll(Output,"..;,;..", "=");

			return Output;
		}

		public static void Main(string[] args)
		{
			string test1 = StripTags("<p>George</p><b>W</b><i>Bush</i>", new string[]{"i","b"});
			string test2 = StripTags("<p>George <img src='someimage.png' onmouseover='someFunction()'>W <i>Bush</i></p>", new string[]{"p"});
			string test3 = StripTags("<a href='http://www.dijksterhuis.org'>Martijn <b>Dijksterhuis</b></a>", new string[]{"a"});

			Console.WriteLine(test1);
			Console.WriteLine(test2);
			Console.WriteLine(test3);

			string test4 = "<a class=\"classof69\" onClick='crosssite.boom()' href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>";
			Console.WriteLine(StripTagsAndAttributes(test4, new string[]{"a"}));
		}
	}
}

Image credit: Jesper Rønn-Jensen’s

This is a post from Martijn's C# Coding Blog.

