PtokaX forum

Archive => Archived 5.0 boards => Help with scripts => Topic started by: Hellkeepa on 10 June, 2005, 22:50:43

Title: Question about RegExp implementation in Lua
Post by: Hellkeepa on 10 June, 2005, 22:50:43
HELLo!

The question I have is quite simple, but I seem to be unable to find a single resource which talks about it. Thus I ask here, hopefully someone in here knows.

How can I make a "non-capturing group" in Lua RegExp? In other words I want to do something like this:
regexp = "(?:\"(\a+)\")"
That above will capture a quoted string, and only a quoted string, but without capturing the quotes themselves.

If anyone could help me answer this I'd appreciate it highly.

Edit, added:
Upon further testing I found that it's also impossible to make sub-groups conditional; in other words, this fails to match as soon as the last question mark is added:
regexp = "test%s(%S+)(%s(%S+))?"
This leads me to believe that sub-groups are totally void, except for capturing, in Lua. Can someone please confirm, or rather disprove, this?

Happy codin'!
Title:
Post by: plop on 11 June, 2005, 02:13:32
Quote: Originally posted by Hellkeepa
HELLo!

The question I have is quite simple, but I seem to be unable to find a single resource which talks about it. Thus I ask here, hopefully someone in here knows.

How can I make a "non-capturing group" in Lua RegExp? In other words I want to do something like this:
regexp = "(?:\"(\a+)\")"
That above will capture a quoted string, and only a quoted string, but without capturing the quotes themselves.

If anyone could help me answer this I'd appreciate it highly.
if i get it right you want to capture a string between  " and ".
then this should work.
"%""(.*)%""

Quote: Originally posted by Hellkeepa
Edit, added:
Upon further testing I found that it's also impossible to make sub-groups conditional; in other words, this fails to match as soon as the last question mark is added:
regexp = "test%s(%S+)(%s(%S+))?"
This leads me to believe that sub-groups are totally void, except for capturing, in Lua. Can someone please confirm, or rather disprove, this?

Happy codin'!
change the last %S+ into a %S*
but this never returns nil, it's an empty string when not found.
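
A minimal sketch of that suggestion, assuming the optional part is a single trailing word. The sample strings are made up for illustration, and only string.find is used so it should behave the same on Lua 5.0 and later:

```lua
-- Lua patterns cannot make a whole group optional, but single items can be:
-- "%s?" makes the separator optional and "(%S*)" is allowed to match nothing.
local pattern = "test%s(%S+)%s?(%S*)"

local _, _, first, second = string.find("test foo bar", pattern)
print(first, second)

_, _, first, second = string.find("test foo", pattern)
-- The second capture still exists but is empty, so compare against ""
-- instead of checking for nil.
print(first, second == "")
```

The price of this trick is that "missing" and "present but empty" can no longer be told apart.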

for more info click here  (http://www.plop.nl/howto/pattern_matching.php)

plop
Title:
Post by: bastya_elvtars on 11 June, 2005, 09:50:13
It also works with \", we tested that one with TTB.
Title:
Post by: Hellkeepa on 11 June, 2005, 11:58:16
HELLo!

Unfortunately that was of little to no help to me, plop, but I appreciate that you're trying to help me out.
Let me explain a bit further, so you can better understand what I'm trying to do. ;)

I know that I could have just skipped the sub-group on that simple string, but I made the example as simple as possible as to not confuse people. Perhaps a more accurate example would have been this:
regexp = "%s(?:(?:\"([%S ]+?)\")|(%S+))%s";
As you can probably see, this RegExp would match either a quoted string with spaces (and only if both quotes were present), or an unquoted string with no spaces. As far as I've figured out, this is currently impossible to do in Lua. The workaround is to check whether the string contains quotes, and have two different RegExps available: one that requires quotes around all "space-enabled" content, and one that ignores the use of quotes.
Not only does this add to the work that needs to be done, but it's also not very dynamic, which is the point of using RegExp in the first place.

Oh, and just for your info: Your example wouldn't work, as the quotation marks need to be backslash-escaped in a double-quoted string; you'd get an error after the first %-sign.

Your second example, while seemingly helpful, wasn't, I'm afraid. This is probably also due to the lack of complexity in my example, so let me give you a more detailed one of that too:
regexp = "(%S+)(?:%s\"([%S ]+)\"(?:%s\"([%S ]+)\")?)?"
As you can see from this example, the last two groups are optional, so if the last one is missing nothing is returned. The only way I've found to get this, seemingly, working is by making the spaces and quotation marks optional as well, like this:
regexp = "\"([%S ]+)\"%s*\"?(%S*)\"?%s*\"?(%S*)\"?"
But then we enter a whole minefield of ambiguity: What should we do if no spaces are provided? What if a quotation mark is forgotten in the middle, or they've been applied haphazardly?

Hopefully you understand my predicament a bit better now, and remember that the Regular Expressions I work on are a bit more advanced than the examples I give here: I only post basic examples to avoid unnecessary complexity and confusion. ;)

Happy codin'!
Title:
Post by: plop on 11 June, 2005, 16:06:22
the problem from the error on %" isn't because of that but the " which are surrounding the regexp.
with ' around them it comes alive.
the % is a somewhat official escape but indeed the \ also works.

> s,e,some = string.find("hoi \"dat\" ook", '%"(.*)%"' ) print (some)
dat
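
One caveat with "(.*)" between quotes: it is greedy, so with more than one quoted part on a line it runs to the last quote. A small sketch with a made-up sample line, comparing it to the lazy "-" modifier and a negated character class:

```lua
local line = 'say "one" and "two"'

-- Greedy: ".*" runs on to the LAST quote on the line.
local _, _, greedy = string.find(line, '"(.*)"')

-- Lazy: ".-" stops at the first closing quote it can.
local _, _, lazy = string.find(line, '"(.-)"')

-- Negated class: same result here, and it can never cross a quote.
local _, _, strict = string.find(line, '"([^"]*)"')

print(greedy)
print(lazy)
print(strict)
```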

can you give some example strings with the results you would like 2 have from them?
then i can play around a bit, makes it a bit easier.
currently the most confusion comes from your choice of words. lol

plop
Title:
Post by: Hellkeepa on 11 June, 2005, 17:52:29
HELLo!

Yes, I know single-quotes would work in that regard, thus I specified "a double-quoted string". ;) However, this is digressing from the problem at hand.

The problem I have is that I have a string, with two optional parameters, that looks like this:
string_quotes = "!fadd 1 \"Testing this\" \"bla.zip\" \"Optional 1\" \"Optional 2\"";
string_noquot = "!fadd 1 testing_this bla.zip Optional_1 Optional_2";
And I have these two RegExps, so far:
regexp_quotes = "!fadd%s(%d+)%s\"([%S ]+)\"%s\"([%S ]+)\"%s\"([%S ]*)\"%s\"([%S ]*)\"";
regexp_noquot = "!fadd%s(%d+)%s(%S+)%s(%S+)%s(%S*)%s(%S*)";

The first question I posted was in order to make quotes optional, but required in pairs, so that they can be used in mixed mode too. Like this example:
string_mixed = "!fadd 1 \"testing this\" bla.zip opt_1 \"opt param 2\"";
This would require a RegExp with non-capturing sub-groups, in order to make the quotes required in pairs. Otherwise the quotes would be optional on an individual basis, and yield results like this example:
string_test_1 = "!fadd 1 \"Testing this\" \"bla.zip\" \"blah\"";
string_test_2 = "!fadd 1 \"Testing this\" bla.zip blah \"bleh 2\""

regexp_test = "!fadd%s(%d+)%s\"?([%S ]+)\"?%s\"?([%S ]+)\"?%s\"?([%S ]*)\"?%s\"?([%S ]*)\"?";

result_1 = 1 Testing this" bla.zip" blah"
result_2 = 1 Testing this" bla.zip blah bleh 2"

As you see, that's horribly wrong. Granted, the first test-string "only" adds some quotation marks, but is able to split up the parameters correctly.
This is, however, not the case when you add more complexity to the string, in the form of spaces. That is clearly visible in the second test-string, as params 2 & 3 are interpreted as one and param 5 gets split up into 2.

The correct result should have been this:
regexp_mixed = "!fadd%s(%d+)%s(?:(?:\"([%S ]+)\")|(%S+))%s(?:(?:\"([%S ]+)\")|(%S+))"..
"%s(?:(?:\"([%S ]+)\")|(%S+))%s(?:(?:\"([%S ]+)\")|(%S+))"

result_mixed = 1 Testing this bla.zip blah bleh 2

The second problem lies in making the last two parameters optional. Right now it kind of works to make the last one optional via this method:
string_test = "!fadd 1 \"Testing this\" \"bla.zip\" \"bleh 2\"";
regexp_test = "!fadd%s(%d+)%s\"([%S ]+)\"%s\"([%S ]+)\"%s\"([%S ]*)\"%s?\"?([%S ]*)\"?";
However, when the last parameter is added using that RegExp I get this result:
result_full = 1 Testing this" "bla.zip blah bleh 2
Here we can, yet again, observe that params 2 & 3 become one, and the result is only 4 parameters captured compared to the 5 passed.

So, the solution to this would be to make an optional non-capturing sub-group, like this:
regex_orig = "!fadd%s(%d+)%s\"([%S ]+)\"%s\"([%S ]+)\"%s\"([%S ]*)\"%s\"([%S ]*)\"";
regexp_full = "!fadd%s(%d+)%s\"([%S ]+)\"%s\"([%S ]+)\"%s\"([%S ]*)\"(?:%s\"([%S ]*)\")?";
Observe the change between the two regular expressions; I pasted the "original" just above the "correct" one for comparison.
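
Since Lua patterns have no optional groups, one way to approximate the "(?: ... )?" above is to try the longer pattern first and fall back to the shorter one. This is only a sketch under a few assumptions: parse_fadd and its variable names are made up for illustration, the patterns are anchored with ^ and $, and [^"] is swapped in for [%S ] so that a capture can never swallow a quote.

```lua
-- Five-argument form and four-argument form of the same command.
local full  = '^!fadd%s(%d+)%s"([^"]+)"%s"([^"]+)"%s"([^"]*)"%s"([^"]*)"$'
local short = '^!fadd%s(%d+)%s"([^"]+)"%s"([^"]+)"%s"([^"]*)"$'

-- Hypothetical helper: returns the captures, or nil if neither form matches.
local function parse_fadd(line)
  local _, _, num, p1, p2, p3, p4 = string.find(line, full)
  if not num then
    -- The long form failed, so retry without the last parameter.
    _, _, num, p1, p2, p3 = string.find(line, short)
  end
  return num, p1, p2, p3, p4
end

print(parse_fadd('!fadd 1 "Testing this" "bla.zip" "blah" "bleh 2"'))
print(parse_fadd('!fadd 1 "Testing this" "bla.zip" "bleh 2"'))
```

Unlike the "%s?" and "\"?" approach, a missing last parameter comes back as nil here, so it can be told apart from an empty one. The cost is one extra pattern per optional parameter.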

So, here's the list of examples, followed by the results I would like from them (all from one RegExp, preferably):
Test_1 = "!fadd 1 testing simple.tst";
Test_2 = "!fadd 2 \"More complex\" test.bin";
Test_3 = "!fadd 3 \"even more\" \"complex testing.bin.tst\" \"of regular expressions\"";
Test_4 = "!fadd 10 \"most complex\" test_full.bin testing \"Should be the most complex structure found! ^_^\"";
And the results:
Res_1: 1	testing	simple.tst
Res_2: 2	More complex	test.bin
Res_3: 3	even more	complex testing.bin.tst	of regular expressions
Res_4: 10	most complex	test_full.bin	testing	Should be the most complex structure found! ^_^
Note that the tabs in the results list indicate a new captured group, just like how "print()" works with multiple variables.

More clear now? ?(

Edit: Split up the "regexp_mixed" into two lines, due to forum formatting issues.

Happy codin'!
Title:
Post by: NotRabidWombat on 11 June, 2005, 17:53:18
Heh! "HELLo"

Lua regex does not support the concept of groups (only captures). It also does not support alternation.

http://www.lua.org/manual/5.0/manual.html
Search for the second occurrence of "string.gsub"; patterns are defined underneath it.

What you are trying to do will indeed be incredibly complicated, if not impossible, with a single lua regex.
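
With more than one pattern it gets manageable, though. Alternation can be approximated by trying one anchored pattern after another at the current position. A rough sketch, assuming a parameter is either "quoted text" or a run of non-space characters; split_params is a made-up name, and the command word itself comes back as the first entry:

```lua
-- Hypothetical helper: split a command line into parameters.
local function split_params(line)
  local params, count = {}, 0
  local pos = 1
  local len = string.len(line)
  while pos <= len do
    -- Skip separating whitespace ("^" anchors the match at pos).
    local _, ws_end = string.find(line, "^%s*", pos)
    pos = ws_end + 1
    if pos > len then break end

    -- Alternative 1: a complete quoted parameter.
    local _, stop, value = string.find(line, '^"([^"]*)"', pos)
    if not stop then
      -- Alternative 2: an unquoted parameter (also catches a stray quote).
      _, stop, value = string.find(line, "^(%S+)", pos)
    end
    count = count + 1
    params[count] = value
    pos = stop + 1
  end
  return params, count
end

local p, n = split_params('!fadd 2 "More complex" test.bin')
for i = 1, n do print(i, p[i]) end
```

The loop needs no iteration cap, because every pass consumes at least one character. It is lenient about what follows a closing quote; a stricter version would also check for a space or the end of the line there.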

If the project you are working on is open source, you can link the Lua binding Lrexlib (http://luaforge.net/projects/lrexlib) to Lua, adding support for more powerful regex. You can also link it if Lua is compiled as a separate dll/lib.

If you'd like to quickly test it out, Lua Cheia (http://luacheia.lua-users.org/) has it already included in their package.

If this development is for a closed source app that does not offer lua as a dll/lib (like PtokaX), you're SOL ;-)

-NotRabidWombat
Title:
Post by: Hellkeepa on 11 June, 2005, 18:12:34
HELLo!

*Sigh*
That was what I was afraid of, when I failed to find a single resource mentioning sub-groups in Lua. ;(
Thanks for letting me know though, and for the tips about how to fix this. :) Already got the Lua manual linked, and I've read through it (and swore a bit at it too :P)

Unfortunately I'm only "working" as a 3rd party scripter for an alternative, linux-based, HUB-soft (Don't know how they rule advertising here, so I won't say more than that. ;)). However, that particular HUB-soft is developed as Open Source, so I could probably put forth a request for the addition of that library into the project. ;)
In the meantime I guess I'll just have to try and make a work-around using "string.gsub()". Problem is just how to make the quotes optional in pairs...
Oh well, I'll probably figure it out. ;)

I'll keep you posted about my progress on this problem, if anyone's interested?

Happy codin'!
Title:
Post by: Hellkeepa on 17 June, 2005, 03:09:48
HELLo!

Well, no-one posted their interest, but maybe this can help someone else in the future.

This function will emulate the Regular Expression I was looking for, one with alternating, non-capturing sub-groups for quoted/unquoted strings. Enjoy. ;)

function ParseData(data)
	-- Define local variables, for later use.
	local ret, pos, _;
	ret = {};

	-- Results are stored in ret[0] through ret[10].
	for run = 0, 10, 1 do
		-- Check if parameter is quoted.
		if (string.sub(data, 1, 1) == "\"") then
			-- Parameter quoted; if no end-quote is found, pos stays at string-length.
			pos = string.len(data);

			-- Run through string to find end of quoted text.
			for run2 = 2, string.len(data), 1 do
				-- Check if end of quoted string is found.
				if (string.sub(data, run2, run2) == "\"" and string.sub(data, run2 + 1, run2 + 1) == " ") then
					-- Is found, set position and break from loop.
					pos = run2;
					break;
				end;
			end;

			-- Grab the text between the quotes.
			ret[run] = string.sub(data, 2, pos - 1);

			-- Remove the parameter from "data".
			_, _, data = string.find(string.sub(data, string.len(ret[run]) + 2), "\"%s(.*)");
		else
			-- Parameter is not quoted, grab the text until the first space.
			_, _, ret[run] = string.find(data, "(%S+)");

			-- Nothing but whitespace left, stop parsing.
			if (not ret[run]) then
				break;
			end;

			-- Remove the parameter from "data".
			_, _, data = string.find(string.sub(data, string.len(ret[run]) + 1), "%s(.*)");
		end;

		-- Check length of data, to see if there's more to be parsed.
		if (not data or string.len(data) <= 0) then
			-- No more to check, break loop.
			break;
		end;
	end;

	-- Return the data parsed.
	return ret;
end;
You'll notice that I've capped it at 11 results (indices 0 to 10), but this is merely to prevent it from ending up in an endless loop (though _that_ is highly unlikely). This limit can be changed if anyone needs more results returned.

Happy codin'!
Title:
Post by: TTB on 17 June, 2005, 17:09:54
Hm, this is very interesting!

What do you need this code for? I wanna know that!
Title:
Post by: Hellkeepa on 11 July, 2005, 12:17:23
HELLo!

Hehe, glad you find it interesting. Sorry about the long delay in responding, but I've been away for a while now.

Anyway, this is needed for when you have a function/command with optional parameters, and some (or all) of them can be quoted to allow for spaces to be used.

What I needed this for is a release bot, where the "path" and "comment" parameters could be left out. The quoted-string support is there because, while I use spaces to separate the different parameters, spaces are also often used in file names and paths, and especially in the comment. ;) I made the quotes optional so that one doesn't have to use them if the parameter in question contains no spaces.

Hope that this was a sufficient explanation, and that you'll find it useful as well.

Happy codin'!