IDM PowerTips
Perl Regular Expressions in UltraEdit: Digging Deeper
In our previous Perl regex power tip, we covered some of the fundamentals of Perl regex Find and Replace in UltraEdit. If you’re looking to expand your knowledge and harness the power of this popular and robust regular expression engine, take a deeper dive in this power tip.
Make dot “.” span multiple lines
If you want your wildcard “.” to span multiple lines and not stop at line breaks, then all you need to do is enable “single-line mode” by adding “(?s)” to the beginning of your regex string. For example, the following:
(?s)perl.*regex
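…will match everything from the first “Perl” through the last “regex” in the following text, including the line breaks in between (assuming a case-insensitive search):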
of Perl regex Find and Replace in UltraEdit. If you're looking to
expand your knowledge and harness the power of this popular and robust
regular expression engine, take a deeper dive in this power tip.
If you want your wildcard "." to span multiple lines and not stop at
line breaks, then all you need to do is enable "single-line mode" by
adding "(?s)" to the beginning of your regex string.
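If you'd like to check this behavior outside of UltraEdit, here is a minimal sketch. It uses Python's `re` module as a stand-in, which is an assumption on our part: it is not the engine UltraEdit uses, but it handles `(?s)` the same way. The `(?i)` flag stands in for a case-insensitive search, and the sample text is a shortened copy of the text above.

```python
import re

# Shortened copy of the sample text above (assumption: four lines are enough to show the effect).
text = (
    "of Perl regex Find and Replace in UltraEdit. If you're looking to\n"
    "expand your knowledge and harness the power of this popular and robust\n"
    "regular expression engine, take a deeper dive in this power tip.\n"
    "adding \"(?s)\" to the beginning of your regex string.\n"
)

# Without (?s), "." stops at line breaks, so the match stays on the first line.
single_line_off = re.search(r"(?i)perl.*regex", text)

# With (?s), "." also matches line breaks, so the greedy ".*" runs on to the last "regex".
single_line_on = re.search(r"(?is)perl.*regex", text)

print(repr(single_line_off.group()))
print("line breaks inside the (?s) match:", single_line_on.group().count("\n"))
```

Because `.*` is greedy, the single-line-mode match extends to the last “regex” in the text, not the nearest one.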
Using lookbehinds and lookaheads
Perl regular expressions include powerful features called “lookbehinds” and “lookaheads”, which allow you to check the text immediately before and immediately after a matched string, respectively, without including that text as part of your match. The syntax for lookbehinds and lookaheads is as follows:
(?<=a)b | Positive lookbehind*; matches "b" when it is preceded by "a" ("a" is not included in the match) |
(?<!a)b | Negative lookbehind*; matches "b" when it is not preceded by "a" |
b(?=a) | Positive lookahead; matches "b" when it is followed by "a" ("a" is not included in the match) |
b(?!a) | Negative lookahead; matches "b" when it is not followed by "a" |
* Important note: Because regular expressions do not go backwards for pattern-matching, lookbehinds must be fixed-width, meaning that you must specify the exact length of what to look behind for. For example, the following would be a valid lookbehind:
(?<=foo|bar)hello (this is ok)
…because both “foo” and “bar” are exactly 3 characters in length, so the Perl regex engine knows to check the 3 characters preceding the matched string. However, the following would not be a valid lookbehind:
(?<=f.*)bar (this is invalid)
…because of the asterisk operator, which matches any number of repetitions of the preceding character, so the regex engine does not know how far back in the data to traverse to begin the lookbehind check. For this reason, variable-length quantifiers such as “*” and “+” cannot be used inside lookbehinds, and alternations are only safe when every alternative has the same length, as in the “foo|bar” example above.
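The four lookaround forms and the fixed-width rule can be tried out with a short sketch. This again uses Python's `re` module as a stand-in (an assumption, not UltraEdit's engine); it happens to apply a similar fixed-width restriction to lookbehinds. The sample strings are made up for illustration.

```python
import re

def starts(pattern, text):
    """Return the start offsets (0-based) of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# Lookbehinds, tested on "ab cb" ("b" sits at offsets 1 and 4).
pos_behind = starts(r"(?<=a)b", "ab cb")   # "b" preceded by "a"
neg_behind = starts(r"(?<!a)b", "ab cb")   # "b" not preceded by "a"

# Lookaheads, tested on "ba bc" ("b" sits at offsets 0 and 3).
pos_ahead = starts(r"b(?=a)", "ba bc")     # "b" followed by "a"
neg_ahead = starts(r"b(?!a)", "ba bc")     # "b" not followed by "a"

# Both alternatives are 3 characters long, so this lookbehind is accepted.
fixed_width = re.findall(r"(?<=foo|bar)hello", "foohello barhello bazhello")

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=f.*)bar")
    variable_width_accepted = True
except re.error:
    variable_width_accepted = False

print(pos_behind, neg_behind, pos_ahead, neg_ahead)
print(fixed_width, variable_width_accepted)
```

In every case the matched text is only the “b” or “hello”; the text checked by the lookaround is never part of the match.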
To find all lines containing two strings regardless of which comes first, for example “cat” and “dog”, use the following expression:
^(?=.*cat)(?=.*dog).*$
In the above expression, the positive lookaheads (?=.*cat) and (?=.*dog) are essentially telling the regex engine to match the start of a line (via ^) and “look ahead” to ensure cat and dog exist somewhere before the end of the line. Then, .*$ matches everything up to the end of the line.
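Here is a small sketch of that expression at work, using Python's `re` module as a stand-in for UltraEdit's engine (an assumption). The `re.MULTILINE` flag makes `^` and `$` match at every line rather than only at the ends of the whole text, and the sample lines are made up.

```python
import re

sample = (
    "cat chases dog\n"
    "dog only\n"
    "the dog saw a cat\n"
    "cat only\n"
)

# Two lookaheads at the start of the line: each scans ahead independently,
# so the order of "cat" and "dog" within the line does not matter.
both = re.findall(r"^(?=.*cat)(?=.*dog).*$", sample, flags=re.MULTILINE)

for line in both:
    print(line)
```

Only the first and third lines are returned, since they are the only ones containing both words.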
To find all lines containing one string, for example “cat”, but NOT another, for example “dog”, use the following expression:
^(?!.*dog).*cat.*$
In the above expression, the negative lookahead (?!.*dog) ensures that no line containing dog will be matched. Then the following .*cat.*$ will ensure that only full lines containing cat somewhere within are matched.
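The same kind of check works for the “one but not the other” expression. This is a sketch in Python's `re` module (an assumption, standing in for UltraEdit's engine), with made-up sample lines and `re.MULTILINE` so that `^` and `$` apply per line.

```python
import re

sample = (
    "cat and dog together\n"
    "just a dog\n"
    "just a cat\n"
    "neither animal\n"
)

# The negative lookahead rejects any line with "dog" anywhere in it;
# ".*cat.*$" then requires "cat" and consumes the whole line.
cat_without_dog = re.findall(r"^(?!.*dog).*cat.*$", sample, flags=re.MULTILINE)

for line in cat_without_dog:
    print(line)
```

Note that this matches “cat” as a substring, so a line containing a word like “concatenate” would also qualify.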
You can search for a particular pattern at a specific column position by using the following syntax:
(?<=.{39})f\w+bar
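The above example will search for text that begins with “f” and ends with “bar”, starting at column 40 or later: the lookbehind only requires that 39 characters precede the match on the same line. To restrict the match to exactly column 40, anchor the lookbehind to the start of the line by writing (?<=^.{39}) instead.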
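To see which columns the lookbehind actually allows, here is a sketch using Python's `re` module as a stand-in (an assumption, not UltraEdit's engine). The helper `start_column` and the sample lines are hypothetical. The second pattern adds a `^` anchor inside the lookbehind so that only column 40 qualifies.

```python
import re

unanchored = re.compile(r"(?<=.{39})f\w+bar")
anchored = re.compile(r"(?<=^.{39})f\w+bar", re.MULTILINE)

# Sample lines padded with spaces so the word starts at a known column.
at_col_40 = " " * 39 + "foobar"
at_col_46 = " " * 45 + "fizzbar"
at_col_11 = " " * 10 + "foobar"

def start_column(pattern, line):
    """Return the 1-based column where the first match starts, or None."""
    m = pattern.search(line)
    return m.start() + 1 if m else None

for line in (at_col_40, at_col_46, at_col_11):
    print(start_column(unanchored, line), start_column(anchored, line))
```

In this sketch the unanchored form accepts any match with at least 39 characters before it on the line, while the anchored form accepts only a match that begins at column 40.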
Search for a character by its hex value, or even a range of hex values, by using syntax similar to the following, which searches for non-printable ASCII characters:
[\x00-\x08\x0B\x0C\x0E\x0F\x10-\x1F\x7F]
This technique is used in the “Zap Gremlins” script on our user-submitted script downloads page.
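As a rough illustration of the character class above (not a reproduction of the script itself), the following sketch strips those characters from a string. It uses Python's `re` module as a stand-in (an assumption), and the sample string is made up. The class leaves out tab (`\x09`), line feed (`\x0A`) and carriage return (`\x0D`), so ordinary whitespace and line endings survive.

```python
import re

# Same character class as above: control characters except tab, LF and CR, plus DEL.
control_chars = re.compile(r"[\x00-\x08\x0B\x0C\x0E\x0F\x10-\x1F\x7F]")

# Made-up sample containing NUL, BEL and DEL alongside a tab and a CR/LF line ending.
sample = "tab\tkept" + "\x00" + "nul" + "\x07" + "bell" + "\x7f" + "del\r\n"

cleaned = control_chars.sub("", sample)

print(repr(cleaned))
```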
You can also use similar syntax to search for Unicode characters by their code point. For example, the following will search for the Unicode Greek capital Delta Δ (Unicode code point: 0394):
\x{0394}
You can see all Unicode characters and their associated code points here.
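If you test code point searches outside of UltraEdit, be aware that the escape syntax varies between engines. The sketch below uses Python's `re` module as a stand-in (an assumption); it does not accept the `\x{0394}` form and uses `\u0394` for the same code point. The sample sentence is made up.

```python
import re

text = "The change in x is written Δx."

# Python's re spells the code point U+0394 as \u0394 rather than \x{0394}.
delta = re.compile(r"\u0394")

match = delta.search(text)

print(match.group(), hex(ord(match.group())))
```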