Regular Expressions Analysis Tool

ScratchPad:
/<BR \/>/gi
/<(A).*<\/\1>/ig
/<(IMG|A)/i
/<(IMG|A)[\w\s\.*]/ig
/<(IMG|A).*<\/(A|IMG)>/i

Match both <A> and <IMG> tags:
/<(IMG|A)>[\w\s\.]+<\/\1>/i
"<A[^>]*HREF=(\\S*)[^>]*>[^<]*</A>"

-----------------------------------------------------------------------------
Regular Expressions - Special Characters:
\077 - Octal character.
\a - Alarm (bell).
\b - A word boundary, outside [] only.
\B - Not a word boundary.
\c - Control character.
\d - Match a digit character.
\D - Match a nondigit character.
\E - End case modification.
\e - Escape.
\f - Form feed.
\l - Lowercase next character.
\L - Lowercase until \E found.
\n - Newline.
\Q - Quote (disable) pattern metacharacters until \E found.
\r - Return.
\S - Match a non-white-space character.
\s - Match a white-space character.
\t - Tab.
\u - Uppercase next character.
\U - Uppercase until \E found.
\w - Match a word character (alphanumeric and "_").
\W - Match a non-word character.
\x1 - Match a hex character.
\| - Vertical bar.
\[ - An open square bracket.
\) - A closing parenthesis.
\* - An asterisk.
\^ - A caret symbol.
\/ - A slash.
\\ - A backslash.

-----------------------------------------------------------------------------
Regular Expressions - Metacharacters:
These characters must be escaped to match literally: \ | ( ) [ { ^ $ * + ? .

Regular Expressions - Quantifiers:
Used to specify that a pattern must match a specific number of times.
* - 0 or more of the preceding item.
+ - 1 or more of the preceding item.
? - 0 or 1 of the preceding item.
{n} - Match exactly n times.
{n,} - Match at least n times.
{n,m} - Match at least n times, but not more than m times.

To match as few times as possible (non-greedy), follow the quantifier with a ? (question mark).
Example: *?, +?, ??, {n}?, {n,}?, {n,m}?

Example:
$_ = "That is some text, isn't it?";
s/.*is/That's/;     # greedy: $_ is now "That'sn't it?"
$_ = "That is some text, isn't it?";
s/.*?is/That's/;    # non-greedy: $_ is now "That's some text, isn't it?"

Regular Expressions - Assertions:
Used to match certain conditions in a string.
^ - Match the beginning of a line or string.
$ - Match the end of a line or string.
. - Any single character except a newline.
\b - Match a word boundary.
\B - Match a non-word boundary.
\A - Match only at the beginning of the string.
\Z - Match only at the end of the string (or before a newline at the end).
\z - Match only at the end of the string.
\G - Match only where the previous //g left off (works only with /g).

Regular Expressions - Extensions:
These are useful when you want to make sure there's a certain string ahead of or behind a match but don't want to include that string in the match.
(?=EXPR) - Matches if EXPR would match next.
(?!EXPR) - Matches if EXPR would not match next.
(?<=EXPR) - Matches if EXPR matches just before this position.
(?<!EXPR) - Matches if EXPR does not match just before this position.

The above extensions are useful with special variables:
$& - Holds the last match.
$` - Holds the string before the last match.
$' - Holds the string after the last match.

Example:
while ($text =~ /\w+(?=\s)/g) { print "$&\n"; }

Regular Expressions - Modifiers:
i - Ignore alphabetic case.
x - Ignore white-space in pattern and allow comments.
g - Globally perform all possible operations.
gc - Don't reset search position after a failed match.
s - Let the . character match newlines.
m - Let ^ and $ match next to embedded \n characters.
o - Compile the pattern only once.

Example: /[a-z]/ig - Case insensitive global search

-----------------------------------------------------------------------------
Regular Expressions - Character Classes:
Characters can be grouped into a character class, and that class will match any one character inside it. A character class is enclosed in square brackets, [ and ]. You can also specify a range of characters with the - "dash" symbol.
[aeiou] - matches any single vowel.
[abc] - each occurrence of 'a' or 'b' or 'c'.
[^ab] - all characters except 'a' and 'b'.
[a-z] - all lowercase alpha characters.
[A-Z] - all uppercase alpha characters.
[a-zA-Z] - all alpha characters.
/[a-z]/gi - same as above.
[0-9] - all numeric characters.

-----------------------------------------------------------------------------
Alternative Match Patterns:
Check if the user typed either exit, quit, or stop (inside an input loop):
if (/exit|quit|stop/) { next; }
if (/^(exit|quit|stop)$/) { exit; }
The ^ and $ metacharacters make sure the word is the entire input and not part of a sentence. It is good practice to put alternatives in parentheses to make it clear where they start and end.

-----------------------------------------------------------------------------
Regular Expressions - Backreferences:
To refer back to a previous match in an expression, use a backslash and the number of the capture group: \1 \2 \3 etc. In the following regular expression, the \1 is used to refer to the HTML tag name found previously. Notice the escaped forward slash which designates the closing of the HTML tag.
/<(IMG|A)>[\w\s\.]+<\/\1>/i

You can also designate a match for later reference by enclosing its subpattern in parentheses (called the grouping or bracketing operator). Then the subpattern may be referred to outside the pattern with the pattern number prefaced with $ (for example: $1, $2, $3, etc).
if ($var =~ /(\d)/) { print "$1"; }

-----------------------------------------------------------------------------
Regular Expressions - Macros:
/<[^>]*>/g - strip HTML tags.
/\/\/.*/g - strip JavaScript line comments.
/\/\*.*?\*\//gs - strip JavaScript block comments.
\n\r - strip blank lines.
\r\n - strip all line returns.
s/^\s+// - strip leading whitespace.
s/\s+$// - strip trailing whitespace.

-----------------------------------------------------------------------------
Notes:
t.e     # t followed by anything followed by e
        # This will match the
        #                 tre
        #                 tle
        # but not te
        #         tale
^f      # f at the beginning of a line
^ftp    # ftp at the beginning of a line
e$      # e at the end of a line
tle$    # tle at the end of a line
und*    # un followed by zero or more d characters
        # This will match un
        #                 und
        #                 undd
        #                 unddd (etc)
.*      # Any string without a newline.
        # This is because the .
        # matches anything except a newline
        # and the * means zero or more of these.
^$      # A line with nothing in it.
[qjk]   # Either q or j or k
[^qjk]  # Neither q nor j nor k
[a-z]   # Anything from a to z inclusive
[^a-z]  # Any character that is not a lower case letter
[a-zA-Z]    # Any letter
[a-z]+      # Any non-zero sequence of lower case letters
jelly|cream # Either jelly or cream
(eg|le)gs   # Either eggs or legs
(da)+       # Either da or dada or dadada or...

Note the difference between the ^ in the following lines:
/[^A-Za-z\s]+/
/^one_line$/
In the first example, the ^ character is used inside a character class, where it means "NOT" (match anything except the listed characters). In the second example, the ^ character refers to the beginning of a line, because it is not inside a character class. The $ correspondingly refers to the end of a line.

PERL specifics:
$text =~ m/EXPR/;       # Match operator: if ($text =~ m/hello/)
$text =~ s/EXPR/EXPR/;  # Lets you replace one string with another.
Modifiers:
e - Right-hand side is code to evaluate.
ee - Right-hand side is a string to eval and run as code.
Example: $text =~ s/(\w+)/uc($1)/ge;
The function is uc().

$text =~ tr/EXPR/EXPR/; # aka y//: Translate
Modifiers:
c - Complements the search list.
d - Deletes found characters that have no replacement.
s - Squeezes runs of the same replaced character into one.
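
The s///e and tr/// modifiers listed above are easiest to tell apart side by side. The following is a minimal sketch, not a definitive reference: the sample string and the variable names ($text, $upper, $squeezed, $no_vowels, $letters, $count) are made up for illustration.

```perl
use strict;
use warnings;

# Sample input, assumed for illustration only.
my $text = "hello   regex world";

# s///ge: the right-hand side is evaluated as code for every match.
(my $upper = $text) =~ s/(\w+)/uc($1)/ge;
print "$upper\n";       # HELLO   REGEX WORLD

# tr with s: squeeze each run of spaces down to a single space.
(my $squeezed = $text) =~ tr/ //s;
print "$squeezed\n";    # hello regex world

# tr with d: delete the listed characters (no replacement given).
(my $no_vowels = $text) =~ tr/aeiou//d;
print "$no_vowels\n";   # hll   rgx wrld

# tr with c and d: delete everything that is NOT in the search list.
(my $letters = $text) =~ tr/a-z//cd;
print "$letters\n";     # helloregexworld

# tr with an empty replacement list and no modifiers changes nothing
# and returns the number of matching characters.
my $count = ($text =~ tr/o//);
print "$count\n";       # 2
```

Each copy is made with the (my $copy = $text) =~ ... idiom so that $text itself stays unchanged between the steps.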