From: bob at horvath.com (Bob Horvath)
Date: Tue, 11 May 1999 02:41:51 -0500
Subject: help with simple regular expression grouping with re
References: <000401be9b6e$cf1c5180$569e2299@tim>
Message-ID: <3737DF3F.4FE9E3AF@horvath.com>
Content-Length: 3672
X-UID: 1975

Tim Peters wrote:

> [Tim]
> > | import re
> > | pattern = re.compile(r"""
> > |     "         # match an open quote
> > |     (         # start a group so re.findall returns only this part
> > |         [^"]*?    # match shortest run of non-quote characters
> > |     )         # close the group
> > |     "         # and match the close quote
> > | """, re.VERBOSE)
> > |
> > | answer = re.findall(pattern, your_example)
> > | for field in answer:
> > |     print field
>
> [Dan Schmidt]
> > This works for a tricky reason, which people should be aware of.
>
> *All* regexps work for a tricky reason -- or, at least, the ones that
> actually do work <wink>.
>
> > I had just written the following response to your code:
> >
> > Not that it's important, but technically, what you did was overkill.
> > Because *? is non-greedy, it won't match any quote characters,
> > because it will be happy to hand off the quote to the next element
> > of the regexp, which does match it.
> >
> > So "(.*?)" and "([^"]*)" both solve the problem; you don't need to
> > disallow quotes _and_ match non-greedily.
> >
> > And then I decided to test it, just to make sure (replacing '[^"]'
> > with '.'), and... it failed. Because '.' doesn't match newlines by
> > default. When I added re.DOTALL to the options at the end, it worked
> > fine.
> >
> > Your example works because the character class [^"] (everything
> > but a double quote) happens to include newlines too. (Actually, I
> > think you took the newlines out of the input string before you tested
> > it, so maybe you were just lucky).
>
> I tested it both ways, reported on one, and have no idea which way is
> correct: every time CSV parsing comes up, the questioner is unable to
> define what (exactly) the rules are, and the appearance of line breaks in
> the original example could simply be an artifact of a transport or mailer
> breaking a long line. In the face of the unknown, seemed better to be
> permissive.

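The newline behaviour described in the quoted text is easy to check. What follows is only a minimal sketch in present-day Python 3 syntax; the `sample` string is invented for illustration and is not the data from the original question.

```python
import re

# Made-up input: the second quoted field contains a line break.
sample = '"alpha","beta\ngamma","delta"'

lazy_dot = re.compile(r'"(.*?)"')                 # '.' stops at newlines
lazy_dot_all = re.compile(r'"(.*?)"', re.DOTALL)  # '.' may match newlines too
negated = re.compile(r'"([^"]*)"')                # [^"] already includes '\n'

plain_fields = lazy_dot.findall(sample)       # ['alpha', ','] (mis-parsed)
dotall_fields = lazy_dot_all.findall(sample)  # all three fields
negated_fields = negated.findall(sample)      # all three fields, no flag needed

print(plain_fields)
print(dotall_fields)
print(negated_fields)
```

Without re.DOTALL the lazy pattern cannot get past the line break, so it resynchronizes on the wrong quote and returns a stray ','. The negated character class needs no flag at all.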
Being the original poster....

My problem has CSV that does not cross word boundaries, and does not contain
quotes within the fields (I had to check), but probably could some day. I'll
have to try it and see what it does.

The line crossing will never happen though.

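A quick sketch of that experiment, in present-day Python 3 syntax. The input line and the doubled-quote escape convention (a quote inside a field written as "") are assumptions for illustration only; the real data may use a different convention, or none at all.

```python
import re

# Hypothetical line: the middle field holds embedded quotes written as "".
line = '"abc","say ""hi""","x"'

# The simplified pattern from the thread: no quotes allowed inside a field.
simple = re.compile(r'"([^"]*)"')
naive_fields = simple.findall(line)   # the middle field gets split apart

# One possible modification, ASSUMING the doubled-quote convention:
# a field is any run of non-quote characters or "" pairs.
doubled = re.compile(r'"((?:[^"]|"")*)"')
fields = [f.replace('""', '"') for f in doubled.findall(line)]

print(naive_fields)   # ['abc', 'say ', 'hi', '', 'x']
print(fields)         # ['abc', 'say "hi"', 'x']
```

So the simple pattern does not fail loudly on embedded quotes; it quietly returns the wrong number of fields, which is worth checking for if the data ever changes.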
>
>
> > So my new claim is that the following is the 'best' regexp, for my
> > personal definition of best (internal comments deleted):
> >
> > pattern = re.compile(r'"(.*?)"', re.VERBOSE | re.DOTALL)
>
> The original was indeed overkill, but for another reason <wink>: it's also
> the case that whenever CSV parsing comes up, a later msg in the thread goes
> "oh! I forgot -- it can have *embedded* quotes too". Writing it [^"] is
> anticipating a step in how the regexp will need to be changed anyway to
> accommodate whichever escape convention they think they've
> reverse-engineered <0.1 wink>.
>
> Even without that prognostication, though, a greedy "([^"]*)" is (as Aahz
> said) likely to run faster than a non-greedy "(.*?)". [^"]* is also more
> robust, in that it unconditionally forbids matching a double quote in the
> guts; what .*? matches depends on context, and will happily chew up double
> quotes too if the context requires it for the *context* to match. In this
> particular regexp as a whole that won't happen, but under *modification*
> context-sensitive submatches are notoriously prone to surprises.
>
> In any case, I certainly didn't need to do both [^"] and *? in the original!
> My "best" would consist of removing the question mark <wink>.
>
> otoh-if-embedded-quotes-are-really-illegal-string.split-with-a-little-
> post-processing-would-be-best-of-all-ly y'rs - tim
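To make the robustness point concrete, here is a small sketch in present-day Python 3 syntax. The trailing comma added to both patterns and the sample strings are invented; they stand in for the kind of later modification the quoted text warns about. The last lines show one possible reading of the split suggestion (using the `str.split` method rather than the old `string` module function), and it is only valid if fields never contain quote characters.

```python
import re

# Made-up input: two quoted strings, only the second followed by a comma.
text = '"a" "b",'

# Both patterns now also require a comma after the closing quote.
lazy = re.compile(r'"(.*?)",')
negated = re.compile(r'"([^"]*)",')

lazy_result = lazy.findall(text)        # .*? swallows quotes to reach '",'
negated_result = negated.findall(text)  # [^"]* never crosses a quote

print(lazy_result)     # ['a" "b']
print(negated_result)  # ['b']

# Splitting on the quote character: the odd-numbered pieces are the fields,
# the even-numbered pieces are the separators between them.
record = '"abc","d,e","x"'
split_fields = record.split('"')[1::2]
print(split_fields)    # ['abc', 'd,e', 'x']
```

The lazy version matches from the first quote because that is the only way the surrounding context can succeed, which is exactly the context-dependence described above.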