Regular Expression HTML Matching

I’ve been doing some HTML parsing & cleaning lately, which involves a lot of regular expressions. It turns out that .* doesn’t match across newlines, so if you want to grab an HTML page’s title value, the following regular expression:

<title>(.*?)</title>

works for this HTML

<title>this is my page title</title>

but not for this HTML

<title>this is my
page title</title>
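
To reproduce the difference, a quick check along these lines can be used. This is only a minimal sketch: the class name DotNewlineDemo, the helper TitleMatches, and the sample strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // Returns true when the title pattern from above finds a match in the given HTML.
    public static bool TitleMatches(string html)
    {
        return Regex.IsMatch(html, @"<title>(.*?)</title>");
    }

    public static void Main()
    {
        // Title on a single line.
        Console.WriteLine(TitleMatches("<title>this is my page title</title>"));
        // Title split by a line feed.
        Console.WriteLine(TitleMatches("<title>this is my\npage title</title>"));
    }
}
```

The first call should report a match and the second should not, because the period stops at the line feed.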

I usually use regexlib.com for patterns, tips, & testing, & it didn’t say anything about the period (.) not matching newlines. It supposedly matches any character, but by default in .NET it matches everything except a line feed (\n). So, after banging my head against a wall for a while, I finally found a good tip at OsterMiller.org that suggested this pattern

<title>((.|[\r\n])*?)</title>

and voila, it worked! So I was able to write my GetTagValue function like so

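public static string GetTagValue(string html, string tag)
{
    // Requires: using System.Text.RegularExpressions;
    // Opening tag (attributes allowed), lazily captured content including newlines, closing tag.
    string pattern = @"<\s*" + tag + @"[^>]*>((.|[\r\n])*?)</\s*" + tag + @"[^>]*>";
    Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);

    // Regex.Match never returns null; a failed match is reported through Success.
    if (!m.Success)
        return String.Empty;

    return m.Groups[1].Value;
}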
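
As an alternative to the (.|[\r\n]) alternation, .NET also has the RegexOptions.Singleline flag, which makes the period match line feeds as well. The following is a minimal sketch of that variant, not the function from this post: the names TagValueSketch and GetTagValueSingleline are hypothetical, and the Regex.Escape call is an extra precaution I am assuming is wanted for tag names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagValueSketch
{
    // Hypothetical variant of GetTagValue: with RegexOptions.Singleline the period
    // also matches "\n", so a plain (.*?) can span several lines.
    public static string GetTagValueSingleline(string html, string tag)
    {
        // Escape the tag name so regex metacharacters in it are treated literally.
        string name = Regex.Escape(tag);
        string pattern = @"<\s*" + name + @"[^>]*>(.*?)</\s*" + name + @"[^>]*>";

        Match m = Regex.Match(html, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // An unsuccessful match yields an empty string, as in the original function.
        return m.Success ? m.Groups[1].Value : String.Empty;
    }

    public static void Main()
    {
        string html = "<html><head><title>this is my\npage title</title></head></html>";
        Console.WriteLine(GetTagValueSingleline(html, "title"));
    }
}
```

Either approach should return the two-line title; the flag just keeps the pattern shorter.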
