lazy stars within regular expressions

By | May 10

Today i stumbled across something really funny which is called “lazy star” and is used as a term within regular expressions. I did work on an expression and could not get it to work till i found the lazy “guy” which was the solution for the problem.

The problem briefly:
Let’s assume the following string (some pseudo html) is given…

  1. <script>
  2. ....
  3. </script>
  4. something in between!
  5. <script>
  6. ...
  7. </script>

and we want to strip all script tags out of the string including their contents. So that it looks like the following afterwards:

  1. something in between!

Yeah you are right we need a smart regular expression here which will strip those unwanted tags. My first attempt resulted in the following pattern which looks reasonable to me (i will use javascript as the language here but this can be applied to any language supporting regular expressions):

javascript
< view plain text >
  1. /<script[sS]*</script>/gi

“grab everything which starts with <script followed by any character (/s/S) any times (*) and finally ending with </script>” This whole thing works fine but it removes my whole content and the reason is because it grabs the first <script>-tag and strips everything till the last </script>-tag. Therfore unfortunately the “something in between!” string also vanished. Then I ended up playing around with expressions (what i reallly love - i cannot imagine doing something better) so that it grabs everything within the script tag but not if a new script tag already started. It took me at least 2 hours to manage this! damn i was really frustrated :)
But all of a sudden I found the MAGIC question mark (?) which makes stars to lazy stars. And because of this terminology my day was already saved (no coffee, no drugs, no pills anymore,…) and as a “nice” side effect the regular expression finally worked how it should.. here is the working pattern with the lazy star:

javascript
< view plain text >
  1. /<script[sS]*?</script>/gi

can you feel the lazyness? :) I did especially for myself. So as an explanation to this I refer to an article which describes this behavoir best:

<TAGb[^>]*>(.*?)</TAG> matches the opening and closing pair of a specific HTML tag. Anything between the tags is captured into the first backreference. The question mark in the regex makes the star lazy, to make sure it stops before the first closing tag rather than before the last, like a greedy star would do. This regex will not properly match tags nested inside themselves, like in <TAG>one<TAG>two</TAG>one</TAG>.

I have the feeling that I understand the usage of the question mark here somehow but I am not 100% confident about this cause the theory says the following about quantification ():

? The question mark indicates there is 0 or 1 of the previous expression. For example, “colou?r” matches both color and colour.
* The asterisk indicates there are 0, 1 or any number of the previous expression. For example, “go*gle” matches ggle, gogle, google, gooogle, etc.
+ The plus sign indicates that there is at least 1 of the previous expression. For example, “go+gle” matches gogle, google, gooogle, etc. (but not ggle).

Thats kinda strange to me because it does only mention how often that preceding expression is allowed to occur and not how the parsing is done. Maybe someone has a better explanation for me…

6 comments on “lazy stars within regular expressions

  1. I only know the term “greedy” instead of “lazy star”. Which is a common regexp option. for all german readers: it’s quite good explained on the selfhtml website:

    The question mark in general just means: The term before the question mark is optional = zero or one time. i.e.:

    “hor?se” matches “hose” or “horse”

    If you place it after a * or + it means: search non-greedy. i.e.:
    “hou*e” matches “houe”, “house”, “houuue” or “househouse” (zero or more)
    “hou+e” matches “house”, “houuue” or “househouse” (min. one or more)

    the ? after + or * just means: don’t be greedy. i.e.:
    “hou*?e” matches “houe”, “house”, “houuue” BUT NOT “househouse” (zero or more)
    “hou+?e” matches “house”, “houuue” BUT NOT “househouse” (min. one or more)

    michal, you should definitely buy a good regexp book :-)

    p.s. i’d recommend using the pattern ]*>(.*?). i can’t remember why, but the [^>] is important and all tag replacement patterns seem to use it.

  2. good explanation! the recommendation to use [^>] cannot be used in my case cause the script section could very likely contain a >. just think of some if conditions…
    book would be nice but i think i rather struggle around from time to time with it ;)

  3. just checked out some source code of mine and found (]*>) which works finde. just take a look and found this nice one (]*>)(.*)() at which uses backreferences to find the appropriate closing tag.

  4. Fabian mentions the term “greedy” but actually in what you are describing here we see the “lazy star” also sometimes referred to as NON-greedy pattern matching. That is, normally a regExp will want to match “as much as it can” (greedy, by default) in a pattern, and by the use of this non-greedy indicator, you will match “as little as possible.”

    And furthermore it should be noted that yes, the question mark may ALSO be found in a regexp to mean “optional” i.e., for something preceeding it to occur “{0,1}” times, but the sytnax is different and that’s not what will solve the problem of this original post. Excellent article, Michal!

  5. Why not try regexbuddy, it’s a brilliant regex debugger. You’ll never look back.

    Drak

  6. Great article and a useful discovery, the disappointment is the explanation about this lazy-star doesn’t exist in most regular expression articles or documentations. Nesting of tag is not a case here by its nature. So the code will work with any script blocks. This is also a way to get around the textContent issue in Firefox which I’m looking for a solution (when the tag, like , has script block inside)

    We usually know about matching “as much as possible”, now with lazy-star, “as less as possible”