Safely cleaning HTML with strip_tags in C#

Removing unwanted tags with StripTags/strip_tags

One of my favorite functions in the PHP library is strip_tags. Not only does it neatly remove HTML from an input, it also lets you specify which tags should stay. This is great if you allow your visitors to apply some basic HTML tags to their comments. This post explores two issues: using C# to remove unwanted tags, and cleaning up unwanted attributes that might be hidden in the allowed tags.

I wanted to clean unwanted HTML tags from comments posted to a website. The users are allowed to use a few basic formatting tags (such as <b> or <i>) and even <a href=""></a> in their posts, but anything else must be stripped before it is posted to the site. I found several regular expressions for C# that strip HTML, but these wipe all the HTML and leave none of the tags behind.
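For reference, the strip-everything approach usually boils down to a single replace call. Here is a minimal sketch of it; the class and method names are placeholders of mine, not taken from any of the snippets I found.

```csharp
using System.Text.RegularExpressions;

public static class StripAllSketch
{
    // Removes anything that looks like a tag. There is no way to
    // tell it to keep some tags, which is the limitation discussed above.
    public static string StripAllTags(string input)
    {
        return Regex.Replace(input, @"<[^>]+>", "");
    }
}
```

Calling StripAllSketch.StripAllTags("<p>Hello <b>world</b></p>") gives back just the text "Hello world", with the allowed and the unwanted tags gone alike.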

Below is the end result of some hacking and, of course, much love-hate with the regular expression library.

string StripTags(string Input, string[] AllowedTags)

The StripTags method takes an input string and an array of allowed tags. It returns the input as a string, minus all unwanted tags.

string test1 = StripTags("<p>George</p><b>W</b><i>Bush</i>", new string[]{"i","b"});
string test2 = StripTags("<p>George <img src='someimage.png' onmouseover='someFunction()'>W <i>Bush</i></p>", new string[]{"p"});
string test3 = StripTags("<a href='http://www.dijksterhuis.org'>Martijn <b>Dijksterhuis</b></a>", new string[]{"a"});

Using the above example code returns the following:

George<b>W</b><i>Bush</i>
<p>George W Bush</p>
<a href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>

string StripTagsAndAttributes(string Input, string[] AllowedTags)

The above StripTags function shares a weakness with the original PHP strip_tags function: a malicious user can still insert attributes into any of the allowed tags. Think “style=” and “id=”, or event handlers such as “onmouseover=”. We would be somewhat safer if we cleaned these as well. The StripTagsAndAttributes method does just that.

It first runs the input through StripTags and then, for the remaining tags, strips out all but a restricted set of attributes.

string test4 = "<a class=\"classof69\" onClick='crosssite.boom()' href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>";
Console.WriteLine(StripTagsAndAttributes(test4, new string[]{"a"}));

That “onClick” attribute looks mighty unsafe. Running the above string through StripTagsAndAttributes, as in the example, returns:

<a class="classof69" href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>

This function probably needs some tuning if you want to allow more attributes or restrict things even further.
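One way such tuning could look is to pass in an allow-list of attributes per tag instead of hard-coding “href” and “class”. The sketch below is an alternative I am assuming for illustration, not the method from the listing further down: AttributeFilterSketch, KeepAttributes and both regular expressions are hypothetical names and patterns, and it expects input that has already been through StripTags.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class AttributeFilterSketch
{
    // An opening tag: the tag name, then everything up to the closing '>'.
    private static readonly Regex OpenTag =
        new Regex(@"<([a-zA-Z][a-zA-Z0-9]*)\b([^>]*)>");

    // One attribute: a name, optionally followed by = and a quoted or bare value.
    private static readonly Regex Attribute =
        new Regex(@"([^\s=""'<>/]+)(?:\s*=\s*(""[^""]*""|'[^']*'|[^\s""'>]+))?");

    // Rebuilds every opening tag, keeping only the attributes listed for that tag.
    // Keys of 'allowed' are expected to be lowercase tag names.
    public static string KeepAttributes(string input, IDictionary<string, string[]> allowed)
    {
        return OpenTag.Replace(input, m =>
        {
            string tag = m.Groups[1].Value;
            string[] keep;
            if (!allowed.TryGetValue(tag.ToLowerInvariant(), out keep))
                keep = new string[0]; // unknown tag: drop all of its attributes

            StringBuilder sb = new StringBuilder("<" + tag);
            foreach (Match a in Attribute.Matches(m.Groups[2].Value))
            {
                string name = a.Groups[1].Value.ToLowerInvariant();
                if (Array.IndexOf(keep, name) >= 0)
                    sb.Append(' ').Append(a.Value);
            }
            return sb.Append('>').ToString();
        });
    }
}
```

With a dictionary that maps "a" to { "class", "href" }, the test4 string above comes back with only its class and href attributes. The sketch has limits of its own: it does not look inside attribute values (so a “javascript:” href passes through untouched), it drops the slash of self-closing tags, and a '>' inside a quoted value will confuse it.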

A word of caution

Regular expressions are voodoo: very cool, but still voodoo. The above functions work for the tests I have applied to them, but your mileage may vary! If you have a special situation that doesn’t work, leave a note below and maybe we can work out the problems.
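One class of input worth testing is a tag whose name merely starts with an allowed name, for example <body> when only <b> is allowed. A sketch of a stricter check captures the tag name and compares it exactly. ExactTagSketch and StripTagsExact are names I made up for this illustration, and the pattern is an assumption of mine rather than the one used in the listing below.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ExactTagSketch
{
    // First alternative: an opening or closing tag, with its name captured in group 1.
    // Second alternative: anything else in angle brackets (comments, doctype, ...).
    private static readonly Regex AnyTag =
        new Regex(@"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>|<[^>]+>");

    public static string StripTagsExact(string input, IEnumerable<string> allowedTags)
    {
        // Case-insensitive set, so "<P>" is accepted when "p" is allowed.
        var allowed = new HashSet<string>(allowedTags, StringComparer.OrdinalIgnoreCase);

        // Keep a tag only if its whole name is in the set; remove everything else.
        return AnyTag.Replace(input, m =>
            m.Groups[1].Success && allowed.Contains(m.Groups[1].Value) ? m.Value : "");
    }
}
```

For example, StripTagsExact("<body><b>hi</b></body>", new[] { "b" }) keeps the <b> pair and removes both halves of the <body> pair.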

Credits

The StripTags function is of course inspired by the PHP version, and by a JavaScript implementation thereof by Kevin van Zonneveld. The attribute stripping routine is based on the regular expressions by mdw252 in one of the strip_tags manual page comments.

Source code

The complete source code for the StripTags function and StripTagsAndAttributes function with my test code can be found below:


using System;
using System.Text.RegularExpressions;

namespace StripHTML
{
	class MainClass
	{

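		// Replaces only the first occurrence of needle in haystack
		private static string ReplaceFirst(string haystack, string needle, string replacement)
		{
			int pos = haystack.IndexOf(needle);
			if (pos < 0) return haystack;
			return haystack.Substring(0,pos) + replacement + haystack.Substring(pos+needle.Length);
		}

		// Replaces every occurrence of needle in haystack
		private static string ReplaceAll(string haystack, string needle, string replacement)
		{
			int pos;
			// Avoid a possible infinite loop: an empty needle, or a replacement
			// that contains the needle, would never terminate
			if (needle.Length == 0 || replacement.Contains(needle)) return haystack;
			// IndexOf returns 0 for a match at the very start, so test for >= 0
			while ((pos = haystack.IndexOf(needle)) >= 0)
				haystack = haystack.Substring(0,pos) + replacement + haystack.Substring(pos+needle.Length);
			return haystack;
		}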

		public static string StripTags(string Input, string[] AllowedTags)
		{
			Regex StripHTMLExp = new Regex(@"(<\/?[^>]+>)");
		    string Output = Input;

			foreach(Match Tag in StripHTMLExp.Matches(Input))
			{
				string HTMLTag = Tag.Value.ToLower();
				bool IsAllowed = false;

				foreach(string AllowedTag in AllowedTags)
				{
					int offset = -1;

					// Determine if it is an allowed tag:
					// "<tag>", "<tag " and "</tag>"
					if (offset!=0) offset = HTMLTag.IndexOf('<'+AllowedTag+'>');
					if (offset!=0) offset = HTMLTag.IndexOf('<'+AllowedTag+' ');
					// Match the complete closing tag so that allowing "b" does not also allow "</body>"
					if (offset!=0) offset = HTMLTag.IndexOf("</"+AllowedTag+'>');

					// If it matched any of the above the tag is allowed
					if (offset==0)
					{
					 	IsAllowed = true;
						break;
					}
				}

				// Remove tags that are not allowed
				if (!IsAllowed) Output = ReplaceFirst(Output,Tag.Value,"");
			}

			return Output;
		}

		public static string StripTagsAndAttributes(string Input, string[] AllowedTags)
		{
			/* Remove all unwanted tags first */
			string Output = StripTags(Input,AllowedTags);

			/* Lambda functions */
			MatchEvaluator HrefMatch = m => m.Groups[1].Value + "href..;,;.." + m.Groups[2].Value;
			MatchEvaluator ClassMatch = m => m.Groups[1].Value + "class..;,;.." + m.Groups[2].Value;
			MatchEvaluator UnsafeMatch = m => m.Groups[1].Value + m.Groups[4].Value;

			/* Allow the "href" attribute */
			Output = new Regex("(<a.*)href=(.*>)").Replace(Output,HrefMatch);

			/* Allow the "class" attribute */
			Output = new Regex("(<a.*)class=(.*>)").Replace(Output,ClassMatch);

			/* Remove unsafe attributes in any of the remaining tags */
			Output = new Regex(@"(<.*) .*=(\'|\""|\w)[\w|.|(|)]*(\'|\""|\w)(.*>)").Replace(Output,UnsafeMatch);

			/* Return the allowed tags to their proper form */
			Output = ReplaceAll(Output,"..;,;..", "=");

			return Output;
		}

		public static void Main(string[] args)
		{
			string test1 = StripTags("<p>George</p><b>W</b><i>Bush</i>", new string[]{"i","b"});
			string test2 = StripTags("<p>George <img src='someimage.png' onmouseover='someFunction()'>W <i>Bush</i></p>", new string[]{"p"});
			string test3 = StripTags("<a href='http://www.dijksterhuis.org'>Martijn <b>Dijksterhuis</b></a>", new string[]{"a"});

			Console.WriteLine(test1);
			Console.WriteLine(test2);
			Console.WriteLine(test3);

			string test4 = "<a class=\"classof69\" onClick='crosssite.boom()' href='http://www.dijksterhuis.org'>Martijn Dijksterhuis</a>";
			Console.WriteLine(StripTagsAndAttributes(test4, new string[]{"a"}));
		}
	}
}

Image credit: Jesper Rønn-Jensen’s

This is a post from Martijn's C# Coding Blog.
