Question about RegExp implementation in Lua
 

Started by Hellkeepa, 10 June, 2005, 22:50:43

Hellkeepa

HELLo!

The question I have is quite simple, but I seem to be unable to find a single resource which talks about it. Thus I ask here, hopefully someone in here knows.

How can I make a "non-capturing group" in Lua RegExp? In other words I want to do something like this:
regexp = "(?:\"(\a+)\")"
That above will capture a quoted string, and only a quoted string, but without capturing the quotes themselves.

If anyone could help me answer this I'd appreciate it highly.

Edit, added:
Upon further testing I found that it's also impossible to make sub-groups conditional. In other words, this fails to match as soon as the last question mark is added:
regexp = "test%s(%S+)(%s(%S+))?"

This leads me to believe that sub-groups are totally void, except for capturing, in Lua. Can someone please confirm, or rather disprove, this?

Happy codin'!

plop

Quote: Originally posted by Hellkeepa
HELLo!

The question I have is quite simple, but I seem to be unable to find a single resource which talks about it. Thus I ask here, hopefully someone in here knows.

How can I make a "non-capturing group" in Lua RegExp? In other words I want to do something like this:
regexp = "(?:\"(\a+)\")"
That above will capture a quoted string, and only a quoted string, but without capturing the quotes themselves.

If anyone could help me answer this I'd appreciate it highly.
if i get it right you want to capture a string between  " and ".
then this should work.
"%""(.*)%""

Quote: Originally posted by Hellkeepa
Edit, added:
Upon further testing I found that it's also impossible to make sub-groups conditional. In other words, this fails to match as soon as the last question mark is added:
regexp = "test%s(%S+)(%s(%S+))?"

This leads me to believe that sub-groups are totally void, except for capturing, in Lua. Can someone please confirm, or rather disprove, this?

Happy codin'!
change the last %S+ into a %S*
but this never returns nil, it's an empty string when not found.
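
A rough sketch of what this suggestion amounts to (not code from the thread): the optional sub-group is flattened into `%s?(%S*)`, which is only one reading of the advice above, and the helper name `parse` is made up for the example. It uses `string.find` with captures, as elsewhere in this topic.

```lua
-- Hypothetical helper: one required word after "test", one optional word after that.
local function parse(line)
	-- "%s?" makes the separating space optional, "(%S*)" may match nothing at all.
	local _, _, first, second = string.find(line, "^test%s(%S+)%s?(%S*)")
	return first, second
end

print(parse("test one two"))  --> one	two
print(parse("test one"))      -- second capture is the empty string, not nil
```

Since the second capture is never nil, a caller would have to compare it against "" to see whether the optional part was present.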

for more info click here

plop
http://www.plop.nl lua scripts/howto's.
http://www.thegoldenangel.net
http://www.vikingshub.com
http://www.lua.org

>>----> he who fights hatred with hatred, drives the spreading of hatred <----<<

bastya_elvtars

It also works with \", we tested that one with TTB.
Everything could have been anything else and it would have just as much meaning.

Hellkeepa

HELLo!

Unfortunately that was of little to no help to me, plop, but I appreciate that you're trying to help me out.
Let me explain a bit further, so you can better understand what I'm trying to do. ;)

I know that I could have just skipped the sub-group on that simple string, but I made the example as simple as possible as to not confuse people. Perhaps a more accurate example would have been this:
regexp = "%s(?:(?:\"([%S ]+?)\")|(%S+))%s";
As you probably see, this RegExp would match either a quoted string with spaces (and only if both quotes were present), or an unquoted string with no spaces. Currently this is impossible, as far as I've figured out, to do in Lua. The workaround is to check if the string contains quotes, and have two different RegExps available: one that requires quotes around all "space-enabled" content, and one that ignores the use of quotes.
Not only does this add to the work needed, but it's also not very dynamic, which is the point of using RegExp in the first place.

Oh, and just for your info: Your example wouldn't work, as the quotation marks need to be backslash-escaped in a double-quoted string; you'd get an error after the first %-sign.

Your second example, while seemingly helpful, wasn't, I'm afraid. This is probably also due to the lack of complexity in my example, so let me give you a more detailed one of that too:
regexp = "(%S+)(?:%s\"([%S ]+)\"(?:%s\"([%S ]+)\")?)?"
As you can see from this example, the last two groups are optional, so if the last one is missing nothing is returned. The only way I've found to get this, seemingly, working is by making the spaces and quotation marks optional as well, like this:
regexp = "\"([%S ]+)\"%s*\"?(%S*)\"?%s*\"?(%S*)\"?"
But then we enter a whole minefield of ambiguity: What should we do if no spaces are provided? What if a quotation mark is forgotten in the middle, or they've been applied haphazardly?

Hopefully you understand my predicament a bit better now, and remember that the Regular Expressions I work on are a bit more advanced than the examples I give here: I only post basic examples to avoid unnecessary complexity and confusion. ;)

Happy codin'!

plop

the problem from the error on %" isn't because of that but the " which are surrounding the regexp.
with ' around them it comes alive.
the % is a somewhat official escape but indeed the \ also works.

> s,e,some = string.find("hoi \"dat\" ook", '%"(.*)%"' ) print (some)
dat

can you give some example strings with the results you would like to have from them.
then i can play around a bit, makes it a bit easier.
currently the most confusion comes from your choice of words. lol


Hellkeepa

HELLo!

Yes, I know single-quotes would work in that regard, thus I specified "a double-quoted string". ;) However, this is digressing from the problem at hand.

The problem I have is that I have a string, with two optional parameters, that looks like this:
string_quotes = "!fadd 1 \"Testing this\" \"bla.zip\" \"Optional 1\" \"Optional 2\"";
string_noquot = "!fadd 1 testing_this bla.zip Optional_1 Optional_2";
And I have these two RegExps, so far:
regexp_quotes = "!fadd%s(%d+)%s\"([%S ]+)\"%s\"([%S ]+)\"%s\"([%S ]*)\"%s\"([%S ]*)\"";
regexp_noquot = "!fadd%s(%d+)%s(%S+)%s(%S+)%s(%S*)%s(%S*)";

The first question I posted was in order to make quotes optional, but required in pairs, so that they can be used in mixed mode too. Like this example:
string_mixed = "!fadd 1 \"testing this\" bla.zip opt_1 \"opt param 2\"";
This would require a RegExp with non-capturing sub groups, in order to make the quotes required in pairs. Otherwise the quotes would be optional on an individual basis, and yield results like this example:
string_test_1 = "!fadd 1 \"Testing this\" \"bla.zip\" \"blah\"";
string_test_2 = "!fadd 1 \"Testing this\" bla.zip blah \"bleh 2\""

regexp_test = "!fadd%s(%d+)%s\"?([%S ]+)\"?%s\"?([%S ]+)\"?%s\"?([%S ]*)\"?%s\"?([%S ]*)\"?";

result_1 = 1	Testing	this"	bla.zip"	blah"
result_2 = 1	Testing this" bla.zip	blah	bleh	2"

As you see, that's horribly wrong. Granted, the first test-string "only" adds some quotation marks, and is still able to split up the parameters correctly.
This is, however, not the case when you add more complexity to the string, in the form of spaces. Something which is clearly visible in the second test-string, as params 2 & 3 get interpreted as one and param 5 gets split up into 2.

The correct result should have been this:
regexp_mixed = "!fadd%s(%d+)%s(?:(?:\"([%S ])+\")|(%S+)%s(?:(?:\"([%S ])+\")|(%S+)"..
	"%s(?:(?:\"([%S ])+\")|(%S+)%s(?:(?:\"([%S ])+\")|(%S+)%s(?:(?:\"([%S ])+\")|(%S+)"

result_mixed = 1	Testing this	bla.zip	blah	bleh 2

The second problem lies in making the last two parameters optional. Right now it kind of works to make the last one optional via this method:
string_test = "!fadd 1 \"Testing this\" \"bla.zip\" \"bleh 2\"";
regexp_test = "!fadd%s(%d+)%s\"([%S ]+)\"%s\"([%S ]+)\"%s\"([%S ]*)\"%s?\"?([%S ]*)\"?";
However, when the last parameter is added, using that RegExp I get this result:
result_full = 1	Testing this" "bla.zip	blah	bleh 2
Here we can, yet again, observe that params 2 & 3 become one, and the result is only 4 parameters captured compared to the 5 passed.

So, the solution to this would be to make an optional non-capturing sub-group, like this:
regex_orig = "!fadd%s(%d+)%s\"([%S ]+)\"%s\"([%S ]+)\"%s\"([%S ]*)\"%s\"([%S ]*)\"";
regexp_full = "!fadd%s(%d+)%s\"([%S ]+)\"%s\"([%S ]+)\"%s\"([%S ]*)\"(?:%s\"([%S ]*)\")?";
Observe the change between the two regular expressions; I pasted the "original" just above the "correct" one for comparison.

So, here's the list of examples, followed by the results I would like from them (all from one RegExp, preferably):
Test_1 = "!fadd 1 testing simple.tst";
Test_2 = "!fadd 2 \"More complex\" test.bin";
Test_3 = "!fadd 3 \"even more\" \"complex testing.bin.tst\" \"of regular expressions\"";
Test_4 = "!fadd 10 \"most complex\" test_full.bin testing \"Should be the most complex structure found! ^_^\"";
And the results:
Res_1: 1	testing	simple.tst
Res_2: 2	More complex	test.bin
Res_3: 3	even more	complex testing.bin.tst	of regular expressions
Res_4: 10	most complex	test_full.bin	testing	Should be the most complex structure found! ^_^
Note that the tabs in the results list indicate a new captured group, just like how "print()" works with multiple variables.

More clear now? ?(

Edit: Split up the "regexp_mixed" into two lines, due to forum formatting issues.

Happy codin'!

NotRabidWombat

Heh! "HELLo"

Lua regex does not support the concept of groups (only captures). It also does not support alternation.

http://www.lua.org/manual/5.0/manual.html
Search for the second: "string.gsub". Patterns are defined underneath.

What you are trying to do will indeed be incredibly complicated, if not impossible, with a single lua regex.
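
A minimal sketch of how alternation is usually emulated with plain Lua patterns: try one pattern, and fall back to a second one if it fails. The helper name `first_arg` is hypothetical, and `[^"]*` is used for the quoted text instead of the `[%S ]+` from the examples above, so that the match stops at the first closing quote.

```lua
-- Hypothetical helper: returns the first argument of `s`, quoted or unquoted.
local function first_arg(s)
	-- Alternative 1: a double-quoted argument. The quotes are matched but not captured.
	local _, _, quoted = string.find(s, '^"([^"]*)"')
	if quoted then
		return quoted
	end
	-- Alternative 2: a bare word without any whitespace.
	local _, _, bare = string.find(s, "^(%S+)")
	return bare
end

print(first_arg('"Testing this" bla.zip'))  --> Testing this
print(first_arg('bla.zip blah'))            --> bla.zip
```

This covers one argument only; handling a whole command line means repeating the two attempts from the position where the previous match ended.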

If the project you are working on is open source, you can link the Lua binding Lrexlib (http://luaforge.net/projects/lrexlib) to Lua, adding support for more powerful regex. You can also link it if Lua is compiled as a separate dll/lib.

If you'd like to quickly test it out, Lua Cheia (http://luacheia.lua-users.org/) has it already included in their package.

If this development is for a closed source app that does not offer lua as a dll/lib (like PtokaX), you're SOL ;-)

-NotRabidWombat


I like childish behavior. Maybe this post will be deleted next.

Hellkeepa

HELLo!

*Sigh*
That was what I was afraid of, when I failed to find a single resource mentioning sub-groups in Lua. ;(
Thanks for letting me know though, and for the tips about how to fix this. :) Already got the Lua manual linked, and I've read through it (and swore a bit at it too :P)

Unfortunately I'm only "working" as a 3rd party scripter for an alternative, linux-based, HUB-soft (Don't know how they rule advertising here, so I won't say more than that. ;)). However, that particular HUB-soft is developed as Open Source, so I could probably put forth a request for the addition of that library into the project. ;)
In the meantime I guess I'll just have to try and make a work-around using "string.gsub()". Problem is just how to make the quotes optional in pairs...
Oh well, I'll probably figure it out. ;)

I'll keep you posted about my progress on this problem, if anyone's interested?

Happy codin'!

Hellkeepa

HELLo!

Well, no-one posted their interest, but maybe this can help someone else in the future.

This function will emulate the Regular Expression I was looking for, one with alternating, non-capturing sub-groups for quoted/unquoted strings. Enjoy. ;)

function ParseData(data)
	-- Define local variables, for later use.
	local ret, pos, _;
	ret = {};

	for run = 0, 10, 1 do
		-- Check if parameter is quoted.
		if (string.sub(data, 1, 1) == "\"") then
			-- Parameter quoted, run through string to find end of quoted text.
			-- Default to the string length, in case no end-quote is found.
			pos = string.len(data);
			for run2 = 2, string.len(data), 1 do
				-- Check if end of quoted string is found.
				if (string.sub(data, run2, run2) == "\"" and string.sub(data, run2 + 1, run2 + 1) == " ") then
					-- Is found, set position and break from loop.
					pos = run2;
					break;
				end;
			end;

			-- Grab the text between the quotes.
			ret[run] = string.sub(data, 2, pos - 1);

			-- Remove the parameter from "data".
			_, _, data = string.find(string.sub(data, string.len(ret[run]) + 2),"\"%s(.*)");
		else
			-- Parameter is not quoted, grab the text until the first space.
			_, _, ret[run] = string.find(data, "(%S+)");

			-- Nothing but whitespace left, stop parsing.
			if (not ret[run]) then
				break;
			end;

			-- Remove the parameter from "data".
			_, _, data = string.find(string.sub(data, string.len(ret[run]) + 1),"%s(.*)");
		end;

		-- Check length of data, to see if there's more to be parsed.
		if (not data or string.len(data) <= 0) then
			-- No more to check, break loop.
			break;
		end;
	end;

	-- Return the data parsed.
	return ret;
end;
You'll notice that I've capped it to 10 results max, but this is merely to prevent it from ending up in a loop (though _that_ is highly unlikely). This limit can be changed, if anyone would need more than 10 results returned.
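
For comparison, a shorter sketch of the same idea that walks through the string with the `init` argument of `string.find` instead of cutting `data` down each round. This is an alternative under stated assumptions, not the function above: `split_args` is a made-up name, the result table starts at index 1, a quoted parameter ends at the next quote (whether or not a space follows it), and there is no cap on the number of results.

```lua
-- Hypothetical alternative: split a command line into quoted/unquoted parameters.
local function split_args(data)
	local args, pos, len = {}, 1, string.len(data)
	while pos <= len do
		-- Skip any whitespace in front of the next parameter.
		local _, ws_end = string.find(data, "^%s*", pos)
		pos = ws_end + 1
		if pos > len then break end
		-- Try a quoted parameter first, then fall back to a bare word.
		local s, e, arg = string.find(data, '^"([^"]*)"', pos)
		if not s then
			s, e, arg = string.find(data, "^(%S+)", pos)
		end
		table.insert(args, arg)
		pos = e + 1
	end
	return args
end

local res = split_args('!fadd 2 "More complex" test.bin')
print(res[1], res[2], res[3], res[4])  --> !fadd	2	More complex	test.bin
```

Note that the command itself ends up as the first entry, and that a parameter with an unbalanced quote is simply read as a bare word up to the next space.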

Happy codin'!

TTB

Hm, this is very interesting!

What do you need this code for??? I wanna know that!
TTB

-> Admin @ Surfnet hubs <-

Hellkeepa

HELLo!

Hehe, glad you find it interesting. Sorry about the long delay in responding, but I've been away for a while now.

Anyway, this is needed for when you have a function/command with optional parameters, and some (or all) of them can be quoted to allow for spaces to be used.

What I needed this for is a release bot, where the "path" and "comment" parameters could be left out. The quoted-string support is there because, while I use space to separate the different parameters, space is also often used in file-names and paths, and especially so in the comment. ;) I made the quotes optional so that one wouldn't have to use them if one doesn't have any spaces in the parameter in question.

Hope that this was a sufficient explanation, and that you'll find it useful as well.

Happy codin'!
