June 5, 2009

ColdFusion Regular Expression Optimization

Last night I bumped on article talking about regular expression performance tuning. After reading it and since we extensivly use regex to parse article & community content, I decided to see can we do something to bit boost performance on that side. So, here we go.

Startup facts and implications:

a) „ColdFusion is built on top of Java. ColdFusion data types (String, Query, Struct, Array, etc.) are really just custom Java classes that are built on top of things like strings, record sets, hash tables, collections, etc. „
=>
We can use java.lang.String methods on ColdFusion strings. This is more a fact then something we are going to use and you’ll see later why ;)

b) Regular expression string replacement based on java compiled patterns is much faster then using built-in ColdFusion function (REReplace, REReplaceNoCase)

c) Alternation really can slow down your programs. Regular expressions like "(X|Y|Z)" have a reputation for being slow, so watch out for them.
- Expression "ab(cd|ef)" is faster then "(abcd|abef)"

d) Regex engine will look/search through entire text to see if there is anything that will match your criteria. Whenever possible, check string for existens of "fixed" part pattern before you start looking for it via regex.

- Expression ".*(abcd|efgh|ijkl).*" is slower than using three calls to String.indexOf(), one for each alternative in the regular expression.


Test
Okay. Now test. I took and old cffunction (getClearText) that removes BBCode from text and refactored (getClearTextTest) to see how it is going to perform after I apply rules from above. At first try I used string.indexOf(„pattern_keyword“) (see fact a)) to test does text contains at least one occurence of pattern in it. Problem was that indexOf is case-sensitive and BBCode is used by users-free-will, therefore case can vary. I've also tried to toUpper/LowerCase() text, but doing it on large texts can be veeery slow. My test showed that uppercasing text of 4K characters can take up to 200ms. Hence, scenario textToPrase.toUpperCase().indexOf(parrent_keyword.toUpperCase()) was out of the question. Than I tried CF function findNoCase and it perfomed suprisingly well. Honestly I would like to know how it’s working.

How I test it?
Very simple. I fetched from database 50 articles.
select id,textcol from articles
order by article_id desc limit 50;

Through standard query loop I parsed textcol using both methods: getClearText and getClearTextTest to see how much time they will need to do work (measured by cftimer). I’ve also instantiated 2 different javaRegEx classes to be sure that one method is not using precompiled pattern of other.

Result:
getClearTextTest took (avg) 20% less time to complete task.
Case-sensitivity plays important role. Case-sensitive search methods (indexOf,find) are faster of thier case-insensitive counterparts (toUpperCase().indexOf,findNoCase).

Trivia
I've tried to impove pattern-check part of code, by replacing findNoCase method with dk.brics.automaton.RegExp pattern Matcher. It has reputation of blazing fast RegEx Java library. In short, in my case it didn't help at all. Pattern compilation time takes too much time.

Moral:
Don’t look unless you’re not sure it’s there. Peak first! :)

Recommanded readings:
Optimizing regular expressions in Java
Ben Nadel blog (he blogged alot on similar subjects)

PS
Code sample is comming as soon as I upload it...somewhere :)

No comments: