Help with a REGEX

By Oli on Sunday, 10th December 2006.

I've got a little bit of a problem... I'm in the middle of programming the forums and I'm just trying to implement some BBCode so people can post things without being able to trash the pages with HTML.

Seeing as I can't find a prefab solution that does BBCode under ASP.NET, I'm making my own with regular expressions... The problem is: I want this to be simple, but it also has to work reliably.

Say you have some BBCode like this:

[b]This should be bold.[/b]

Naturally, I want to search for an opening and a closing tag. Simple enough. But what if someone doesn't close a tag? I need to search for opening tags from the front and for their closing tags from the back... Getting a little more complicated.

I'd also (for performance's sake) rather not have 150 REGEXes, one per tag. I'd rather have a single generic pattern, where * stands for any tag name:

[*]content[/*]

So I match anything that looks like a tag and then I can iterate through the tags to see what I'm allowing and what to replace it with. For the more complex tags, such as URLs, I'll have a separate REGEX.
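A minimal sketch of that idea, under some assumptions: the class name `BbCodeSketch`, the `Allowed` lookup table and its tag-to-element mapping, and the lazy `body` group are all made up for illustration, and the input is assumed to have been HTML-escaped already. It uses one generic pattern and decides per match whether the tag is allowed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BbCodeSketch
{
    // Hypothetical whitelist: BBCode tag name -> HTML element name.
    private static readonly Dictionary<string, string> Allowed =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "b", "strong" },
            { "i", "em" },
            { "u", "u" }
        };

    // One generic pattern: an opening tag, a lazily matched body,
    // and a closing tag with the same name (via the \k<tag> backreference).
    private static readonly Regex Pair =
        new Regex(@"\[(?<tag>\w+)\](?<body>.*?)\[/\k<tag>\]", RegexOptions.Singleline);

    public static string Convert(string input)
    {
        return Pair.Replace(input, m =>
        {
            string html;
            // Tags that are not on the whitelist are left exactly as typed.
            if (!Allowed.TryGetValue(m.Groups["tag"].Value, out html))
                return m.Value;
            return "<" + html + ">" + m.Groups["body"].Value + "</" + html + ">";
        });
    }
}
```

With this sketch, `BbCodeSketch.Convert("[b]This should be bold.[/b]")` gives `<strong>This should be bold.</strong>`, while an unknown pair such as `[x]hi[/x]` passes through untouched. A single pass does not look inside the body of a matched pair, so nested tags need more work.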

Here's the work in progress:

\[(?<tag>\w+)\].*\[/\k<tag>\]

It matches the outermost tags, but only one set of them... So if you try to parse [b][u]RAWR[/u][/b], it only gets b.
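One possible way around the single-match limitation, sketched here as an assumption rather than a finished answer, is to keep re-running the replacement until the text stops changing, so each pass peels off one layer of nesting. The class name `NestedTagSketch` and the restriction to `b`, `i` and `u` are illustrative choices; the restriction matters because the tag name is copied straight into the HTML output.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedTagSketch
{
    // Only b, i and u are accepted, since the tag name is reused as the HTML element.
    // The body is lazy so the nearest matching closing tag wins.
    private static readonly Regex Pair =
        new Regex(@"\[(?<tag>[biu])\](?<body>.*?)\[/\k<tag>\]", RegexOptions.Singleline);

    public static string Convert(string input)
    {
        string previous;
        string current = input;
        do
        {
            previous = current;
            // Each pass converts the pairs it can see; inner pairs surface on the next pass.
            current = Pair.Replace(previous, "<${tag}>${body}</${tag}>");
        } while (current != previous);
        return current;
    }
}
```

Here `NestedTagSketch.Convert("[b][u]RAWR[/u][/b]")` yields `<b><u>RAWR</u></b>` after two passes, and an opening tag with no closing tag is left as plain text. The loop ends because every replacement removes one bracketed pair and never creates a new one.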

Written by Oli on Sunday, 10 December 2006. Tagged with regex, betas.

#1 /* 2 years, 1 month ago */
I've found that the only way I could really cover myself in this user-posted-content situation to avoid ugly pages and security issues was to strip out all markup tags. For that I use:

return Regex.Replace(content, @"<(.|\n)*?>", string.Empty);

If you want to allow a couple like bold or italic, you could scan for those first and, when you match "<b>" or "<i>", replace them with some weird placeholder string, strip out the remaining tags, and then swap the placeholders back.
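A rough sketch of that protect, strip, restore sequence for bold and italic. The class name `StripTagsSketch` and the choice of the control characters U+0001 and U+0002 as the "weird string" are assumptions made for illustration; the stripping pattern in the middle is the one quoted above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripTagsSketch
{
    public static string Clean(string content)
    {
        // 0. Drop any placeholder characters already present in the input.
        string text = content.Replace("\u0001", string.Empty).Replace("\u0002", string.Empty);

        // 1. Swap the allowed tags (<b>, </b>, <i>, </i>) for placeholders.
        text = Regex.Replace(text, @"<(/?)(b|i)>", "\u0001$1$2\u0002", RegexOptions.IgnoreCase);

        // 2. Strip every remaining tag.
        text = Regex.Replace(text, @"<(.|\n)*?>", string.Empty);

        // 3. Turn the placeholders back into the allowed tags.
        return Regex.Replace(text, "\u0001(/?)(b|i)\u0002", "<$1$2>", RegexOptions.IgnoreCase);
    }
}
```

For example, `StripTagsSketch.Clean("<b>bold</b> <script>alert(1)</script> <i>it</i>")` returns `<b>bold</b> alert(1) <i>it</i>`. Only the bare forms are protected in step 1, so a tag carrying attributes, such as `<b onclick="...">`, is stripped along with everything else.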

If you try to get into matching up the tags, or covering the case when tags aren't closed, and so on, you'll just go crazy. In more ways than one, stripping everything (but maybe a couple) is the safest way to go.
#2 — Author comment /* 2 years, 1 month ago */
Yeah, the way I stop HTML now is just replacing < and > with their HTML entities.

>> If you try to get into matching up the tags, or covering the case when tags aren't closed, and so on, you'll just go crazy.

Way ahead of you =) I'm classifiably insane from these REGEXes.

I have just come up with this:
\[(?<tag>\w+)\](?=[\s\S]*(\[/\k<tag>\]))

Now that matches things well, but I've got a feeling it's going to give me pain when I try to replace the subgroups. I could be wrong.
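To see what a lookahead pattern of this shape actually picks up, here is a small sketch that lists the opening tags it matches. It assumes the named group is called `tag`, as in the earlier work-in-progress pattern; the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadSketch
{
    // An opening tag, followed (anywhere later in the text) by its closing tag.
    // The lookahead checks for the closing tag without consuming it.
    private static readonly Regex ClosedOpener =
        new Regex(@"\[(?<tag>\w+)\](?=[\s\S]*(\[/\k<tag>\]))");

    // Returns the names of the opening tags that have a closing tag later on.
    public static List<string> ClosedTags(string input)
    {
        var names = new List<string>();
        foreach (Match m in ClosedOpener.Matches(input))
            names.Add(m.Groups["tag"].Value);
        return names;
    }
}
```

On `[b][u]RAWR[/u][/b]` this reports both `b` and `u`, and on `[b]bold [i]never closed[/b]` it reports only `b`. Because the lookahead is zero-width, each match covers just the opening tag, so a `Regex.Replace` with this pattern rewrites openers only and the closing tags would need a pass of their own.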

I'm also seriously considering dumping the BBCode and going with plain [x]HTML, but as you say, if I do things that way, I'll need to convert the allowed tags (the <> anyway) to an intermediary symbol like [tag] so they don't get messed up when I clean up the non-tags.

With HTML, I could also have a rich edit box, which I'm quite fond of.

Yeah. Going mad.
#3 — Author comment /* 2 years, 1 month ago */
And on top of that madness I've got some large-looking concurrency issues starting to rear their ugly heads.

For some reason the memory-resident version of the site's data keeps being purged so it has to go back to the DB every hit and people's sessions get lost. It's a bit of a nightmare... Well... At least it's stopped doing it right now.
#4 — Author comment /* 2 years, 1 month ago */
Yeah OK.

I've had a basic HTML-cleansing function in place for some time... It's used on the signature that people can specify when they register, but regardless, I can pass that an array of elements that I wish to allow and it will check the input and convert out any tags which aren't those. I can also allow certain attributes through, but I don't think that's necessary at the moment. I might let

© Oli Warner 2001-2007