Discussion:
Adoption of perl/javascript-style regexp syntax
unknown
2009-06-01 09:29:11 UTC
I've just come across re and I like it :)

The only issue I have with it is that I have to specify regexps as
strings. This leads to ugly-as-hell constructs like these:

{ok, Re} = re:compile("(?<!\\\\)#")

It actually tries to find two backslashes there... Or just one? I
don't know :) What if Erlang could allow this:

Re = /(?<!\\)#/

?

Benefits:
- Less error-prone
- Expressions written this way can be parsed and compiled by the
compiler (boost in performance, syntax checked at compile-time)
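To see what that literal actually denotes, here is a small sketch in Python (chosen only because its plain string literals escape backslashes the same way as Erlang's; the sample text and calls are my own illustration, not anything from Erlang's re):

```python
import re

# The Erlang literal "(?<!\\\\)#" denotes the 8-character string (?<!\\)#
# Python's plain (non-raw) literals escape backslashes the same way:
pattern = "(?<!\\\\)#"
assert len(pattern) == 8 and pattern.count("\\") == 2

# The engine then reads those two backslashes as ONE literal backslash,
# so this means "match '#' not preceded by a single backslash":
text = "tag # comment \\# escaped"
print(re.findall(pattern, text))  # -> ['#']  (the bare '#' only)
```

So the answer to "two backslashes or just one?" is: the string holds two, and the engine treats them as one literal backslash.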


Any thoughts?
unknown
2009-06-01 13:39:49 UTC
2009/6/1 Dmitrii Dimandt <dmitriid>
Post by unknown
I've just come across re and I like it :)
The only issue I have with it is that I have to specify regexps as strings.
{ok, Re} = re:compile("(?<!\\\\)#")
It actually tries to find two backslashes there... Or just one? I don't
Re = /(?<!\\)#/
?
- Less error-prone
- Expressions written this way can be parsed and compiled by the compiler
(boost in performance, syntax checked at compile-time)
Without getting into discussions of whether we *want* to do this, I can say
that I doubt this will boost performance, as the PCRE package which is the
base of re takes its input regular expression as a (C) string. From what I
can understand its various parts aren't modular. This is a pity.

Robert
unknown
2009-06-01 14:17:01 UTC
Post by unknown
I've just come across re and I like it :)
The only issue I have with it is that I have to specify regexps as
{ok, Re} = re:compile("(?<!\\\\)#")
It actually tries to find two backslashes there... Or just one? I
Re = /(?<!\\)#/
?
- Less error-prone
- Expressions written this way can be parsed and compiled by the
compiler (boost in performance, syntax checked at compile-time)
It's not going to boost performance, as this is just
a preprocessor issue. But having to escape the backslashes
when working with regexps is a pain.

Perhaps a better syntax would be to imitate the
LaTeX \verb command. It allows you to specify the
delimiter, and then consumes all chars until it finds
that delimiter, e.g. \verb!gdl4$%\^\$?$!

Since this exact syntax doesn't work in Erlang, a
slight adjustment is in order. The scanner recognizes
backticks today, but the parser doesn't. So, if we
change the scanner to recognize ` as the Erlang version
of \verb, we can write:


1> re:split("foo\nbar",`!\n!).
[<<"foo">>,<<"bar">>]

where

2> `!\n!.
"\\n"


Diff follows. It was a quick hack, so it needs improvement.

--- /home/uwiger/src/otp/otp_src_R13B/lib/stdlib/src/erl_scan.erl 2009-04-16 05:23:36.000000000 -0400
+++ erl_scan.erl 2009-06-01 09:09:49.000000000 -0400
@@ -559,4 +559,2 @@
tok2(Cs, St, Line, Col, Toks, "^", '^', 1);
-scan1([$`|Cs], St, Line, Col, Toks) ->
- tok2(Cs, St, Line, Col, Toks, "`", '`', 1);
scan1([$~|Cs], St, Line, Col, Toks) ->
@@ -565,2 +563,4 @@
tok2(Cs, St, Line, Col, Toks, "&", '&', 1);
+scan1([$`|Cs], St, Line, Col, Toks) ->
+ scan_verb(Cs, St, Line, Col, Toks, []);
%% End of optimization.
@@ -580,2 +580,27 @@

+scan_verb([], _St, Line, Col, Toks, Acc) ->
+ {more, {[],Col,Toks,Line,Acc,fun scan_verb/6}};
+scan_verb([Delim|Cs0], St, Line, Col, Toks, Acc) when Delim =/= $\n,
+ Delim =/= $\\ ->
+ {Str, Cs, Line1, Col1} = scan_verb_chars(
+ Cs0, St, Line, Col, Toks, {Acc,Delim}),
+ tok3(Cs, St, Line1, Col1, Toks, string, Str, Str, 0).
+
+scan_verb_chars([], _St, Line, Col, Toks, {Acc, Delim}) ->
+ {more, {[], Col, Toks, Line, {Acc,Delim}, fun scan_verb_chars/6}};
+scan_verb_chars([Delim|Cs], _St, Line, Col, Toks, {Acc, Delim}) ->
+ {lists:reverse(Acc), Cs, Line, Col};
+scan_verb_chars([C|Cs], St, Line, Col, Toks, {Acc, Delim}) when C =/= Delim->
+ {Line1,Col1} = case C of
+ $\n ->
+ {Line+1, Col};
+ _ ->
+ {Line, inc_col(Col,1)}
+ end,
+ scan_verb_chars(Cs, St, Line1, Col1, Toks, {[C|Acc], Delim}).
+
+inc_col(no_col,_) -> no_col;
+inc_col(C, N) when is_integer(C) -> C+N.
+
+
scan_atom(Cs0, St, Line, Col, Toks, Ncs0) ->
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd.
http://www.erlang-consulting.com
unknown
2009-06-01 23:51:53 UTC
We have discussed the issue of including other-language syntax in
Erlang several times. If I recall correctly, there are several
proposals for how one might deal with it, including one of mine
which would allow any number of notations without "nesting".

In the meantime, may I respectfully point out something that
seems pretty much kindergarten level to me, but doesn't seem to
be taught effectively:

PROGRAMS ARE DATA.

Just because it's source code, that doesn't mean it had to be
typed in by a human being. For example, suppose we wanted
something like shell here-files. I'll just do the literal case
(no substitution) to make my point.

<input> ::= (<line> | <here-file>)*
<here-file> ::= .{{<newline>
<input>
.}}<newline>
<line> ::= <any text not starting with .{{ or .}}><newline>

A trivial AWK script can recognise this and turn a <here-file> into
a string literal, generating whatever quoting is necessary. All you
have to do is write less than a page of AWK (once), and then tell
your build tools how to turn .erl-hf files into .erl files.

For a not-entirely-unrelated idea, see the UNIX 'xstr(1)' program.
unknown
2009-06-02 07:08:34 UTC
Hi,
Post by unknown
In the meantime, may I respectfully point out something that
seems pretty much kindergarten level to me, but doesn't seem to
PROGRAMS ARE DATA.
........
A trivial AWK script can recognise this and turn a <here-file> into
a string literal, generating whatever quoting is necessary. All you
have to do is write less than a page of AWK (once), and then tell
your build tools how to turn .erl-hf files into .erl files.
As a programmer I like this way of handling this kind of issue
because it works now and it's easy.

As a developer of a source handling tool I can't help but cringe at the
prospect of getting requests to support all kinds of homegrown
syntaxes...

Another problem with external processing of the source files is that
it is at the same level as the preprocessor, which many people would
like to see replaced with one that understands Erlang code.

best regards,
Vlad
unknown
2009-06-02 21:54:00 UTC
Post by unknown
As a programmer I like this way of handling this kind of issues
because it works now and it's easy.
As developer of a source handling tool I can't help but cringe at the
prospect of getting requests to support all kinds of homegrown
syntaxes...
You mean like regular expression syntaxes?
I've lost count of the number of different variations of
regular expression syntax I've seen in UNIX.

The point of the wee tool I mentioned of course was to provide
*non*-syntax.
Post by unknown
Another problem with external processing of the source files is that
it is at the same level as the preprocessor,
Well, no, it understands far less of Erlang syntax than the
Erlang preprocessor does, and operates way before it. But
*any* program that computes source code by *any* means can
be called a "preprocessor". I have a Smalltalk-to-C compiler.
You could call that a preprocessor if you like. I don't think
the word itself helps our understanding very much.
Post by unknown
which many people would
like to see replaced with one that understands Erlang code.
Up to the word "replaced" I was happy. We have parse transforms.
We have Lisp-Flavoured Erlang. If you want preprocessing that
can "intelligently" deal with Erlang source code, LFE is _it_.

But that leaves complicated backslashy regular expressions where
they stood.


There is of course a much better way to deal with regular
expressions in a language like Lisp or Erlang. One of my pet
slogans is "STRINGS ARE WRONG". The way to represent something
like "^[[:alpha:]_][[:alnum:]_]*:[[:space:]]" is
rex:seq([rex:bol(),rex:id(),rex:space()])
where regular expression syntax is replaced by Erlang syntax.
This is so much more powerful than fancy quoting schemes for
strings that it just isn't funny: you can compute any subexpression
at any time you find useful _without_ new syntax, and without any
run-time parsing.
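The `rex` module above is hypothetical, but the idea is easy to sketch; here is a Python version (combinator names invented for illustration). The expression is ordinary data, composed and escaped programmatically, and the linear notation is generated only at the boundary:

```python
import re

# Combinators build trees; no regexp text is parsed at run time.
def lit(s):    return ("lit", s)      # literal text, escaped on render
def cset(cs):  return ("cset", cs)    # character set
def star(e):   return ("star", e)
def seq(*es):  return ("seq", es)

def render(e):
    """Generate PCRE-style linear notation from a tree."""
    tag = e[0]
    if tag == "lit":  return re.escape(e[1])
    if tag == "cset": return "[" + e[1] + "]"
    if tag == "star": return "(?:" + render(e[1]) + ")*"
    if tag == "seq":  return "".join(render(x) for x in e[1])

# roughly "[[:alpha:]_][[:alnum:]_]*:" -- and any subexpression can be
# computed at any time, without quoting problems:
identifier = seq(cset("A-Za-z_"), star(cset("A-Za-z0-9_")), lit(":"))
pat = re.compile(render(identifier))
print(bool(pat.match("foo_1:")), bool(pat.match("1bad:")))  # True False
```

Backslash escaping happens in exactly one place (`render`), not in every string literal the programmer writes.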
Post by unknown
best regards,
Vlad
unknown
2009-06-03 09:35:32 UTC
Hi,
Post by unknown
Post by unknown
As a programmer I like this way of handling this kind of issues
because it works now and it's easy.
As developer of a source handling tool I can't help but cringe at the
prospect of getting requests to support all kinds of homegrown
syntaxes...
You mean like regular expression syntaxes?
I've lost count of the number of different variations of
regular expression syntax I've seen in UNIX.
The point of the wee tool I mentioned of course was to provide
*non*-syntax.
No, what I mean is syntaxes for allowing people to mark some strings
as regular expressions so that a tool can process them and add
backslashes or whatever. A source file containing such a marker would
no longer be an Erlang source file, and it can't be handled by Erlang
tools anymore.
Post by unknown
Post by unknown
Another problem with external processing of the source files is that
it is at the same level as the preprocessor,
Well, no, it understands far less of Erlang syntax than the
Erlang preprocessor does, and operates way before it.
Even worse, then. I was being nice.
Post by unknown
But *any* program that computes source code by *any* means can
be called a "preprocessor". I have a Smalltalk-to-C compiler.
You could call that a preprocessor if you like. I don't think
the word itself helps our understanding very much.
It can be called that, but nobody did so and I'm not sure what that
has to do with the current issue.
Post by unknown
We have Lisp-Flavoured Erlang. If you want preprocessing that
can "intelligently" deal with Erlang source code, LFE is _it_.
LFE can intelligently preprocess LFE source code which is quite
different than Erlang source code. How does it help me handle a
vanilla Erlang module in erlide or emacs?
Post by unknown
There is of course a much better way to deal with regular
expressions in a language like Lisp or Erlang. One of my pet
slogans is "STRINGS ARE WRONG".
I suppose that you mean something like "embedded strings in a language
are wrong when representing anything else than plain text". And I
couldn't agree more, they are evil - strings that represent for
example a regexp should be a different data type than a text message
string.
Post by unknown
The way to represent something
like "^[[:alpha:]_][[:alnum:]_]*:[[:space:]]" is
rex:seq([rex:bol(),rex:id(),rex:space()])
where regular expression syntax is replaced by Erlang syntax.
This is so much more powerful than fancy quoting schemes for
strings that it just isn't funny: you can compute any subexpression
at any time you find useful _without_ new syntax, and without any
run-time parsing.
[I am sure you already know all of the following, Richard, but from
your answer above you might have forgotten it in the spur of the moment]

The same could be said about writing Erlang or C or Java parse trees
directly instead of letting the parser build them for us from a
string. Yet we don't do that because the textual representation has
some advantages: it's easier to read, it is higher level, it's easier
to modify and we're not bound to a specific internal representation.

The whole point with a parser is that the resulting AST is equivalent
to the input string. If the textual representation has restrictions on
what it can express, then it is so because the designer deemed it best
so (or it's a bug, but we can ignore that here). Bypassing that and
going directly to the parse tree might open a whole new can of worms.
For embedded languages that are more complicated than regexps or xml,
it might also be practically impossible to get it right manually.

Regexps are (as you say) a structured datatype. Nobody disagrees. But
we have a widespread, standard and compact way to represent them. Why
wouldn't we want to use that instead of Erlang terms? Given a compiler
that understands this, the following examples will generate exactly
the same code:
identifier() -> {seq,{cset,letters()},{star,{cset,continuers()}}}.
identifier() -> "{letters}{continuers}*".
I know which one I find easier to read and understand.

Regarding your security concerns about cross-scripting, I don't think
they are 100% relevant in this discussion. Those problems appear when
one takes a string from the external world and "pastes" it mindlessly
inside a program that is then executed. We are talking here about
being able to let a string (the erlang source file) be tokenized and
parsed by several scanners and parsers. There is no part in this
string that is injected from the outside so that the programmer's
intentions can be abused.


All in all, regular expressions are just a particular case of embedded
language. If there is to be any change to the Erlang syntax, I
wouldn't want it tailored to a specific language. For example, I want
to be able to embed Erlang code inside Erlang, which would allow
macros like LFE has and other goodies.

best regards,
Vlad
unknown
2009-06-03 10:48:07 UTC
Greetings,

I think that the reason not to use the widespread, standard and compact
way to represent regular expressions is that it is really difficult to
keep track of the escapes (\) and to compose them (sub-expressions).


bengt
________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org
unknown
2009-06-04 02:09:47 UTC
Post by unknown
Post by unknown
There is of course a much better way to deal with regular
expressions in a language like Lisp or Erlang. One of my pet
slogans is "STRINGS ARE WRONG".
I suppose that you mean something like "embedded strings in a language
are wrong when representing anything else than plain text". And I
couldn't agree more, they are evil - strings that represent for
example a regexp should be a different data type than a text message
string.
If we agree about that, everything else is less important.
Post by unknown
Post by unknown
The way to represent something
like "^[[:alpha:]_][[:alnum:]_]*:[[:space:]]" is
rex:seq([rex:bol(),rex:id(),rex:space()])
where regular expression syntax is replaced by Erlang syntax.
This is so much more powerful than fancy quoting schemes for
strings that it just isn't funny: you can compute any subexpression
at any time you find useful _without_ new syntax, and without any
run-time parsing.
[I am sure you already know all of the following, Richard, but from
your answer above you might have forgot it in the spur of the moment]
The same could be said about writing Erlang or C or Java parse trees
directly instead of letting the parser build them for us from a
string.
If you want to build them dynamically, or in another language,
yes. Absolutely.
Post by unknown
Yet we don't do that because the textual representation has
some advantages: it's easier to read, it is higher level, it's easier
to modify and we're not bound to a specific internal representation.
It may be easier to READ, but it is far harder to WRITE correctly.
As for modifying, no, it is NOT easy to read. And strings *are*
a specific internal representation.
Post by unknown
Regexps are (as you say) a structured datatype. Nobody disagrees. But
we have a widespread, standard and compact way to represent them.
Wrong. We have *many* ways to represent them. We have shell
syntax, understood by fnmatch() and glob(). We have two POSIX
syntaxes. We have AWK syntax, which though POSIX, isn't quite
identical to either of the others. Oh, and lex/flex/jflex et al.,
which are somewhat different again. We have HyTime syntax. We
have Perl. We have PCRE where the "C" is pretty good but not
perfect. We have Java regexp syntax, which is subtly different
again. It simply is not even close to true that we have *A*
standard way to do it.

And this is another reason why trees are better.
Because we can express a regular expression in a way that is
independent of the target linear notation. (Not independent
of the capabilities of the target _engine_ -- few 'regular
expression' engines support recursion, as misbegotten Perl does --
but independent of the fine details of the _notation_.)

To take just one example, given the pattern a\10b, what character
does the \10 represent? Is it backspace, or newline? If we
generate linear notation only when needed to communicate with
some other system, it is no longer *our* problem.
Post by unknown
Why
wouldn't we want to use that instead of Erlang terms?
Because there simply is no one "that" for us to use.
Post by unknown
Given a compiler
that understands this, the following examples will generate exactly
identifier() -> {seq,{cset,letters()},{star,{cset,continuers()}}}.
identifier() -> "{letters}{continuers}*".
I know which one I find easier to read and understand.
Me too: the first one. Because the second one is a literal string.
It contains the _text_ l,e,t,t,e,r,s, but not in any reasonable sense
the _identifier_ letters. I can create the first one AT RUN TIME.
When does "{letters}{continuers}*" work and when does
"{le"++"tter"++"s}{c"++"ontinuer"++"s}*" not work? The second
approach creates such monstrous problems. The first one eliminates
them.
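The problem with the string version can be made concrete in Python (a sketch with an arbitrary fragment): splicing text into a pattern string silently changes its meaning, whereas composing data and escaping at render time does not:

```python
import re

frag = "a.b"                              # meant to be matched literally
bad  = re.compile("x" + frag)             # '.' silently becomes a wildcard
good = re.compile("x" + re.escape(frag))  # composed with proper quoting

print(bool(bad.match("xaXb")))   # True  -- the spliced pattern matches too much
print(bool(good.match("xaXb")))  # False
print(bool(good.match("xa.b")))  # True
```

With a tree representation the escaping step is built into the composition, so this class of bug cannot arise.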

It is also simpler to write and test a compiler that deals correctly
with the first than one that deals with the second.
Post by unknown
Regarding your security concerns about cross-scripting, I don't think
they are 100% relevant in this discussion. Those problems appear when
one takes a string from the external world and "pastes" it mindlessly
inside a program that is then executed.
Yup. Exactly what we are talking about here.

Remember, I'm _also_ talking about receiving a string at run time
and including it in a regular expression which is then included
in something else. I don't understand why anyone is satisfied
with compile-time-only semi-solutions.
Post by unknown
We are talking here about
being able to let a string (the erlang source file) be tokenized and
parsed by several scanners and parsers. There is no part in this
string that is injected from the outside so that the programmer's
intentions can be abused.
Oh? And who said that all Erlang source files were constructed
by hand?
Post by unknown
All in all, regular expressions are just a particular case of embedded
language.
Yes. And as my string/JavaScript/XML/string example points out,
a particularly simple case.
Post by unknown
If there is to be any change to the Erlang syntax, I
wouldn't want it tailored to a specific language.
And as the same thing points out, a technique that deals with
just ONE level of language embedding doesn't solve the problem
generally enough.

We need one conceptually simple approach that can be used to nest
and dynamically create instances of any number of languages.
Trees are much much better at that job than strings.
Post by unknown
For example, I want
to be able to embed Erlang code inside Erlang, which would allow
macros like LFE has and other goodies.
I am familiar with 'cc and xoc and I've seen something similar for
Java, not to mention Template Haskell. But when I generate C
code from inside C, I use trees and love them.
unknown
2009-06-04 06:57:44 UTC
Hi,
Post by unknown
Post by unknown
I suppose that you mean something like "embedded strings in a language
are wrong when representing anything else than plain text". And I
couldn't agree more, they are evil - strings that represent for
example a regexp should be a different data type than a text message
string.
If we agree about that, everything else is less important.
Very good, then we'll just have to sort out the devil that's in the
details :-) I think most of the controversy in this thread is caused
by the fact that each and everyone of us have our own baggage of
presuppositions, making us not really talking about the same things.
Post by unknown
Post by unknown
Yet we don't do that because the textual representation has
some advantages: it's easier to read, it is higher level, it's easier
to modify and we're not bound to a specific internal representation.
It may be easier to READ, but it is far harder to WRITE correctly.
As for modifying, no, it is NOT easy to read. And strings *are*
a specific internal representation.
I see strings as an external representation; I don't know of any
regexp engine that doesn't compile them into something else.
Post by unknown
Post by unknown
Regexps are (as you say) a structured datatype. Nobody disagrees. But
we have a widespread, standard and compact way to represent them.
Wrong. We have *many* ways to represent them. We have shell
syntax, understood by fnmatch() and glob(). We have two POSIX
<snip>
The compact way I was referring to was as a string. The syntax of the
string's content is another issue.
Post by unknown
And this is another reason why trees are better.
Because we can express a regular expression in a way that is
independent of the target linear notation.
Only if we use the same tree representation. If each of us were to
write implementations of this library, we would get incompatible ones
(different names, maybe even different basic elements). If we use the
same library, then we could just as well agree on using POSIX string
syntax.

For me, the linear notation is not a "target" notation, it is a
"source" notation.
Post by unknown
Post by unknown
Given a compiler
that understands this, the following examples will generate exactly
  identifier() -> {seq,{cset,letters()},{star,{cset,continuers()}}}.
  identifier() -> "{letters}{continuers}*".
I know which one I find easier to read and understand.
Me too: the first one. Because the second one is a literal string.
It contains the _text_ l,e,t,t,e,r,s, but not in any reasonable sense
the _identifier_ letters. I can create the first one AT RUN TIME.
Please note that I said "Given a compiler that understands this",
meaning that the compiler would recognize {letters} as an identifier
(the syntax as a regular string may be confusing, the compiler should
know it's a regexp and not a normal string).
Post by unknown
Remember, I'm _also_ talking about receiving a string at run time
and including it in a regular expression which is then included
in something else. I don't understand why anyone is satisfied
with compile-time-only semi-solutions.
You lost me here, probably you went too fast. How does a tree
representation help you handle runtime strings? If you're receiving a
string at runtime, how do you suggest including it in a tree data
structure? I suppose the string could have structure too (otherwise
it's a trivial issue), wouldn't you still have to parse it? And if you
must have such a parser anyway, why not use it in the source code too?
Post by unknown
And as the same thing points out, a technique that deals with
just ONE level of language embedding doesn't solve the problem
generally enough.
I agree. Regexps are just a special case of a more general problem,
but they are much more widely used than most other embedded languages.
But then we are digressing from the original topic, which was about
regexps (I'm aware that is partly my doing, sorry for that).

I will answer that in a separate message, as it is becoming
slightly off-topic.

regards,
Vlad
unknown
2009-06-02 00:44:33 UTC
Python provides a method of specifying strings they call "raw strings,"
which I find quite interesting. Basically, you prefix your string with r
or R, and any backslashes are treated as literal characters rather than
escape characters:

>>> '\b'
'\x08'
>>> r'\b'
'\\b'

More info in the docs:
http://docs.python.org/3.0/reference/lexical_analysis.html#string-and-bytes-literals

I'm not sure how well it would work in Erlang, but it's certainly useful
in Python for avoiding the headache-inducing backslash acrobatics
necessary when writing the occasional complex regular expression.
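Applied to the pattern from the start of the thread, the difference looks like this (a small Python sketch; the sample text is my own):

```python
import re

plain = "(?<!\\\\)#"   # four backslashes in the source literal
raw   = r"(?<!\\)#"    # raw string: only the regexp-level escaping remains
assert plain == raw    # both denote the same 8-character pattern

# e.g. split on '#' characters that are not backslash-escaped:
print(re.split(raw, "foo#bar\\#baz"))  # -> ['foo', 'bar\\#baz']
```

The raw form is what you would hope to write in Erlang, too.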

Geoff
unknown
2009-06-02 07:59:53 UTC
Post by unknown
Python provides a method of specifying strings they call "raw
strings," which I find quite interesting. Basically, you prefix your
string with r or R, and any backslashes are treated as literal
characters rather than escape characters:
>>> '\b'
'\x08'
>>> r'\b'
'\\b'
http://docs.python.org/3.0/reference/lexical_analysis.html#string-and-bytes-literals
I'm not sure how well it would work in Erlang, but it's certainly
useful in Python for avoiding the headache-inducing backslash
acrobatics necessary when writing the occasional complex regular
expression.
+1

mats
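[The quoted behaviour is easy to verify; a minimal sketch in standard Python, including the OP's pattern for "a # not preceded by a backslash":]

```python
import re

# r'...' disables backslash escape processing: the two spellings are the same string
assert '\\b' == r'\b'
assert '\b' == '\x08'            # without the r prefix, \b is a backspace character

# the OP's pattern, written with one escape layer instead of two
pat = re.compile(r'(?<!\\)#')
assert pat.search('a#b')         # plain '#' matches
assert not pat.search(r'a\#b')   # '#' preceded by a backslash does not
```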
unknown
2009-06-02 08:14:54 UTC
Permalink
Greetings,

If the only problem solved with "raw strings" is regular expressions I
would not recommend it. Instead I would suggest moving away from strings
for regular expressions. For an example see SCSH
(http://www.scsh.net/mail-archive/scsh-users/2003-01/msg00048.html).


bengt
________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org
unknown
2009-06-02 19:45:18 UTC
Permalink
2009/6/2 Bengt Kleberg <bengt.kleberg>
Post by unknown
Greetings,
If the only problem solved with "raw strings" is regular expressions I
would not recommend it. Instead I would suggest moving away from strings
for regular expressions. For an example see SCSH
(http://www.scsh.net/mail-archive/scsh-users/2003-01/msg00048.html).
This is way cool and definitely the way to go if you want readable regexps.
Unfortunately it wouldn't look quite as good in erlang as it does in lisp
without some form of syntactic support. But it is definitely the way to go.

Robert
unknown
2009-06-03 02:40:58 UTC
Permalink
On 3 Jun 2009, at 7:45 am, Robert Virding wrote:
[about regular expressions in SCSH]
Post by unknown
This is way cool and definitely the way to go if you want readable regexps.
Unfortunately it wouldn't look quite as good in erlang as it does in lisp
without some form of syntactic support. But it is definitely the way to go.
Well, we do have Lisp-Flavoured Erlang...
This is _exactly_ what I had in mind when saying we should use
trees. Don't we already have something about this in the archive?
unknown
2009-06-02 22:03:28 UTC
Permalink
Post by unknown
Python provides a method of specifying strings they call "raw
strings," which I find quite interesting. Basically, you prefix your
string with r or R, and any backslashes are treated as literal
'\b'
'\x08'
r'\b'
'\\b'
http://docs.python.org/3.0/reference/lexical_analysis.html#string-and-bytes-literals
I'm not sure how well it would work in Erlang, but it's certainly
useful in Python for avoiding the headache-inducing backslash
acrobatics necessary when writing the occasional complex regular
expression.
+1
-1

This is simply the wrong way to deal with complex regular expressions.
Introducing elaborate mechanisms to hide from the compiler what's
going on, in order to parse things at run time that could have been
done earlier?

What's needed is a DOMAIN-SPECIFIC EMBEDDED LANGUAGE for regular
expressions, and all we need for that is lists, strings, constants,
and function calls. By golly, we've GOT them!
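[Such a DSEL can be sketched in a few lines. The following is a rough Python illustration, not any existing library; the node shapes and function name are invented for the example. The point is that escaping happens mechanically, in exactly one place:]

```python
# hypothetical regexp DSEL: plain data (strings, tuples) in, PCRE text out
def regexp(node):
    if isinstance(node, str):          # literal text; escape the metacharacters
        return ''.join('\\' + c if c in r'\^$.|?*+()[]{}' else c for c in node)
    op, *args = node
    if op == 'seq':                    # concatenation
        return ''.join(regexp(a) for a in args)
    if op == 'alt':                    # alternation
        return '(?:' + '|'.join(regexp(a) for a in args) + ')'
    if op == 'star':                   # zero or more repetitions
        return '(?:' + regexp(args[0]) + ')*'
    if op == 'neg_lookbehind':
        return '(?<!' + regexp(args[0]) + ')'
    raise ValueError('unknown node: %r' % (op,))

# the OP's "(?<!\\\\)#" -- a '#' not preceded by a backslash -- no manual escaping
assert regexp(('seq', ('neg_lookbehind', '\\'), '#')) == r'(?<!\\)#'
```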
unknown
2009-06-02 22:48:30 UTC
Permalink
Post by unknown
-1
This is simply the wrong way to deal with complex regular expressions.
Introducing elaborate mechanisms to hide from the compiler what's
going on, in order to parse things at run time that could have been
done earlier?
What's needed is a DOMAIN-SPECIFIC EMBEDDED LANGUAGE for regular
expressions, and all we need for that is lists, strings, constants,
and function calls. By golly, we've GOT them!
While you've amply made your point that there are very
good alternatives to regexps (which is something at least
erlang old-timers have no problem accepting, since we've
never had a really performant regexp library until quite
recently),

But... elaborate mechanisms to hide from the compiler?

Regexps are not the only strings where escaping can be
an issue. I think most of us have on occasion come across
a problem where the string syntax in erlang creates
unwanted noise, but not at the pain level where it would
be warranted to start inventing a preprocessor step
(which I find much more elaborate than accepting an alternative
way of entering strings - something that many language
environments already provide.)

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com
unknown
2009-06-02 23:25:47 UTC
Permalink
2009/6/3 Ulf Wiger <ulf.wiger>
Post by unknown
Post by unknown
-1
This is simply the wrong way to deal with complex regular expressions.
Introducing elaborate mechanisms to hide from the compiler what's
going on, in order to parse things at run time that could have been
done earlier?
What's needed is a DOMAIN-SPECIFIC EMBEDDED LANGUAGE for regular
expressions, and all we need for that is lists, strings, constants,
and function calls. By golly, we've GOT them!
While you've amply made your point that there are very
good alternatives to regexps (which is something at least
erlang old-timers have no problem accepting, since we've
never had a really performant regexp library until quite
recently),
I think what Richard is getting at here is not regular expressions as such
but representing them as strings. And then wishing to extend the string
syntax so as to reduce the interaction between string \ and regexp \, which
makes the regexp even more difficult to read.

The alternative being to use normal erlang syntax to represent regexps
instead of strings. I personally like the idea. Look at how scsh does it.

Robert
unknown
2009-06-03 02:38:38 UTC
Permalink
Post by unknown
But... elaborate mechanisms to hide from the compiler?
Have you looked at some of the things that have been developed
outside Erlang to handle unholy mixtures of XML, regular expressions,
PHP/Ruby/JavaScript/whatever?

Cross-Site Scripting is a way of attacking systems that have
this kind of language mishmash. Here's a sample URL from a
web page about XSS:

http://www.example.com/search.pl?text=<script>alert(document.cookie)</script>

Chances are that the <script>...</script> bit came from
some source that said "relax, it's just a string, relax,
it's just a string"...

"Hiding from the compiler" is not an intemperate phrase for
what's going on.

(1) There is an intrinsically STRUCTURED data type.
It might be XML or regular expressions or JavasScript or ...

(2) That data is linearised into a string.

(3) That string is interpolated into some other structured data.

(4) Which is itself linearised into a string.

And now you have multiple levels of quoting to worry about
and if you are not very careful, XSS vulnerabilities as well.

The answer is NOT to turn things into strings.
If you have something structured, LEAVE it structured.
Don't parse it at run time.
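[The four-step chain above can be made concrete. A hedged sketch, with Python standing in for the XML/JavaScript layers -- html.escape and json.dumps are real stdlib calls, the scenario is invented -- showing each layer escaping for itself instead of hand-quoted strings:]

```python
import html, json

# structured data in, correctly escaped text out at every layer
user_text = 'he said "hi" & left a <script> tag'

js_expr = 'alert(%s)' % json.dumps(user_text)   # (2) string -> JavaScript expression
attr = html.escape(js_expr, quote=True)         # (3) JS -> XML/HTML attribute value
tag = '<a onclick="%s">link</a>' % attr         # (4) linearised only at the very end

assert '<script>' not in tag                    # the payload cannot break out
assert json.loads(html.unescape(attr)[6:-1]) == user_text   # and round-trips intact
```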
Post by unknown
Regexps are not the only strings where escaping can be
an issue.
Exactly so. But let me rephrase that:
regexps are not the only DATA TYPE where people run into
serious trouble because they insist on treating them as
strings when they really aren't.

We should represent regexps as regexp _trees_.
We should represent XML as XML _trees_.
We should represent CSS data as CSS _trees_.
Linearising is the penultimate step of processing, just before
something is written to a file or socket &c.

Strings are for *storing* and *transmitting* information.
They are really lousy tool for *processing* information.

If you are fortunate enough to have a programming language
like Lisp or Erlang (or even, to some extent, Eiffel) where
you _can_ easily write tree structures _as_ tree structures,
it is really very foolish to try to write them as strings.

Of course, if you are fetching information out of a data base,
then strings are probably what you're going to get, though I
note that XML is part of the current SQL standard and that
even the free version of DB2 from IBM is supposed to speak
XML "natively". And of course Erlang has Mnesia, meaning that
storing logically tree-structured data _as_ trees is the more
attractive option. Heck, even JavaScript has JSON.
Post by unknown
I think most of us have on occasion come across
a problem where the string syntax in erlang creates
unwanted noise, but not at the pain level where it would
be warranted to start inventing a preprocessor step
(which I find much more elaborate than accepting an alternative
way of entering strings - something that many language
environments already provide.)
De gustibus non est disputandum.
I find writing and using a preprocessor far easier than
hacking on the lexical analyser for the language. It took
me just 40 SLOC of C, and now I have the _same_ "bigquote"
processor for all of C, C++, AWK, Erlang, Haskell, SML, and Prolog.
Without hacking on any compilers whatever.

The ultimate point though is that hacking on the language to
make it easier for people to do the WRONG thing does not strike
me as a good use of anyone's time. That pain level is there
for a good reason: if the Erlang string syntax is giving you
that much of a headache, it's because STRINGS ARE WRONG and you
should almost certainly be using trees instead.

I really shouldn't have mentioned the .{{ ... .}} hack, because
the real point there was DON'T CHANGE THE LANGUAGE because in
this case it's not the language that's wrong.
unknown
2009-06-03 06:59:24 UTC
Permalink
Post by unknown
The ultimate point though is that hacking on the language to
make it easier for people to do the WRONG thing does not strike
me as a good use of anyone's time.
A truism. But I don't think the correct definition of "WRONG" is
"whatever Richard O'Keefe dislikes."
Post by unknown
That pain level is there for a good reason: if the Erlang string
syntax is giving you that much of a headache, it's because STRINGS ARE
WRONG and you should almost certainly be using trees instead.
Alas, re wants strings, and there's not much I can do about that.
Post by unknown
I really shouldn't have mentioned the .{{ ... .}} hack, because
the real point there was DON'T CHANGE THE LANGUAGE because in
this case it's not the language that's wrong.
I disagree.

mats
unknown
2009-06-03 09:24:44 UTC
Permalink
2009/6/3 mats cronqvist <masse>
Post by unknown
Post by unknown
That pain level is there for a good reason: if the Erlang string
syntax is giving you that much of a headache, it's because STRINGS ARE
WRONG and you should almost certainly be using trees instead.
Alas, re wants strings, and there's not much I can do about that.
Not quite true. The *underlying implementation* wants the regexp as a string
but this does not mean that the user has to supply it as a string. You could
very well allow the regexp as an AST (as well as?) and then internally
convert it to a string.

Robert
unknown
2009-06-03 09:31:35 UTC
Permalink
2009/6/3 mats cronqvist <masse>
Alas, re wants strings, and there's not much I can do about that.
Not quite true. The *underlying implementation* wants the regexp as a
string but this does not mean that the user has to supply it as a
string. You could very well allow the regexp as an AST (as well as?) and
then internally convert it to a string.
Robert
This would be a nice addition at any rate.
It could well be offered as a library on top of re.

BR,
Ulf W
unknown
2009-06-03 10:01:55 UTC
Permalink
Post by unknown
2009/6/3 mats cronqvist <masse>
Post by unknown
Post by unknown
That pain level is there for a good reason: if the Erlang string
syntax is giving you that much of a headache, it's because STRINGS ARE
WRONG and you should almost certainly be using trees instead.
Alas, re wants strings, and there's not much I can do about that.
Not quite true. The *underlying implementation* wants the regexp as a string
but this does not mean that the user has to supply it as a string. You could
very well allow the regexp as an AST (as well as?) and then internally
convert it to a string.
Scheez. The re module, as it exists in R13, wants strings. Full stop.

You could of course write a different front end to pcre, or a pure
erlang regexp module, or an infinite number of other fine products for
which this isn't true, but that really has no bearing on this
discussion.

mats
unknown
2009-06-04 01:04:39 UTC
Permalink
Post by unknown
Post by unknown
The ultimate point though is that hacking on the language to
make it easier for people to do the WRONG thing does not strike
me as a good use of anyone's time.
A truism. But I don't think the correct definition of "WRONG" is
"whatever Richard O'Keefe dislikes."
Oh come *ON*. Resort to unwarranted ad hominem attacks is an
admission of failure.

It is not that I say strings are wrong because I don't like them,
but that I do not like them because I have learned painfully and
repeatedly that they are wrong, as in difficult to use and highly
error prone.
Post by unknown
Post by unknown
That pain level is there for a good reason: if the Erlang string
syntax is giving you that much of a headache, it's because STRINGS ARE
WRONG and you should almost certainly be using trees instead.
Alas, re wants strings, and there's not much I can do about that.
Of *course* there is.

First off, who said you had to use re?

Second, Erlang/OTP is open source. You and I and all of us have
access to the source code. Building our own better_re that, _as
well as_ strings, _also_ accepts some kind of tree, is hardly
rocket science. If I weren't busy working on compilers for two
other languages, preparing lectures, and marking assignments
I'd do it myself. When I can find some breathing time, I expect
I will.

Third, who says ((we have trees) AND (re gets strings)) are
incompatible? It's not that slashification cannot be done, it's
that it is painful to do by hand. So who says we have to do it
by hand? Again, it's not rocket science to write a function that
takes a tree and linearises it as a string (for re to then
parse, undoing the linearisation). I've done it once in the past,
for Prolog to talk to C.

Let's take a very simple case: the replacement string.
The discussion in 're' is a little vague, and a little puzzling.
Why is Perl's \0 not supported? How do you tell whether \123
is (substring 1)23 or (substring 12)3 or (substring 123)?
Do & and \# sequences count inside binaries?

<replacement>
::= [] empty
| [<replacement> | <replacement>] concatenation
| <character code> that literal character
| <binary> that binary
| {match,all} &
| {match,N} \N

linearise_replacement(R) ->
linearise_replacement(R, []).

linearise_replacement([], E) ->
E;
linearise_replacement([H|T], E) ->
linearise_replacement(H, linearise_replacement(T, E));
linearise_replacement(C, E) when is_integer(C), C >= 0 ->
case C
of $& -> [$\\,C|E]
; $\\ -> [$\\,C|E]
; _ -> [ C|E]
end;
linearise_replacement(B, E) when is_binary(B) ->
binary_to_list(B) ++ E;
linearise_replacement({match,all}, E) ->
[$&|E];
linearise_replacement({match,N}, E)
when is_integer(N), N >= 1, N =< 9 ->
[$\\,N+$0|E].

Now let's take an example from the re: manual.
<quote>
Example:
re:replace("abcd","c","[&]",[{return,list}]).
gives

"ab[c]d"
while

re:replace("abcd","c","[\\&]",[{return,list}]).
gives

"ab[&]d"
</quote>
If we define
replace(Subject, Pattern, Replacement, Options) ->
re:replace(Subject, Pattern,
linearise_replacement(Replacement), Options).
then everything becomes clear and trouble-free:
replace("abcd", "c", "[&]", [{return,list}])
gives
"ab[&]d"
while
replace("abcd", "c", ["[",{match,all},"]"], [{return,list}])
gives
"ab[c]d"
I'll have to upgrade my Erlang release to test this, but the rest
of the afternoon will be spent talking with students, so that will
have to wait. There's already an issue about binaries and Unicode.
That's not relevant to the point, which is that providing a
nice clean _safe_ tree-based interface to something with a
string-based interface is not in fact at all hard. It is something
we can do NOW, any of us, without language changes, because it is
NOT the language that is wrong, it's using strings.

"Strings are the opiate of the masses."
unknown
2009-06-03 06:50:17 UTC
Permalink
Post by unknown
Post by unknown
Python provides a method of specifying strings they call "raw
strings," which I find quite interesting. Basically, you prefix your
string with r or R, and any backslashes are treated as literal
'\b'
'\x08'
r'\b'
'\\b'
http://docs.python.org/3.0/reference/lexical_analysis.html#string-and-bytes-literals
I'm not sure how well it would work in Erlang, but it's certainly
useful in Python for avoiding the headache-inducing backslash
acrobatics necessary when writing the occasional complex regular
expression.
+1
-1
This is simply the wrong way to deal with complex regular expressions.
this discussion is about how to represent strings with many escapes,
not about regexps per se.

mats
unknown
2009-06-03 06:59:06 UTC
Permalink
Greetings,

Are there any reasons to have strings with many escapes, apart from when
doing regular expressions?


bengt
unknown
2009-06-03 08:56:53 UTC
Permalink
Post by unknown
Greetings,
Are there any reasons to have strings with many escapes, apart from when
doing regular expressions?
Yes, any time you have a number of backslashes or quotation
marks in the original test, you will need to insert escapes
(or, as ROK points out, write a program that takes care of
it for you.)

This commonly occurs in e.g. LaTex, HTML, XML, shell commands,
JavaScript, etc. - in just about every text format that is meant
to be processed by another program.

Every time this becomes the main task of your program, I agree
that it makes sense in general to raise the abstraction level
and avoid messing about with "structured text". A very good
example of this is of course generation of Erlang code, where
it is *much* better to generate abstract forms, and if necessary,
produce source code by pretty-printing the forms.

But there are lots of occurrences where this doesn't apply as well.
As I have stated (at least four times already in this thread), I'm
not a fan of inventing a new syntax for a specific problem, either
by hacking the scanner or adding a preprocessor - *especially*
when working in a large project, where most of the work on your
code will be done by others than yourself. And even if it wouldn't
be frowned upon, it is an investment in time and effort that may
well be worse than battling with the string syntax in the few places
where it's warranted.

Also, as Mats alluded to, the re library requires strings. One may
argue about the virtue of this, but the fact remains that for many
string parsing tasks, re is by far the most efficient tool available
to Erlang programmers.

Every new syntax addition should of course be evaluated based on the
expected benefits vs the slippery-slope problem of constantly adding
and never removing stuff. This is a valid argument. Support for
raw strings may not be important enough to warrant a syntax addition.

Telling people to work around the problem can be helpful, but often
isn't. As a general rule, I don't think that programming languages
should go out of their way to make things difficult because one
would like programmers to tackle the problem differently. In some
cases, there will be a tradeoff - e.g. immutability, where
disallowing destructive updates has some distinct drawbacks, but
offer great benefits in return. I don't really see the great
benefit in making life hard on those who want to use regexps...

Most attacks on the problem* will suffer some drawbacks. This is also
true for the "suck it up" approach, obviously. The r"..." approach
suffers from abusing a regular atom, but also from some fairly unclear
escaping rules (you still have to escape ", using \", which means that
you can't end the string with a \"). The `D...D approach suffers from
forcing you to choose a delimiter that doesn't appear in the string -
which may vary from time to time, making it a bit less intuitive
while being quite generic.

* The problem here being how to *conveniently* enter strings
without having to struggle with annotating it with escapes,
whether it be regexps or anything else.

Having said all this, I'm fairly neutral about whether Erlang
adds support for raw strings. It has not been a great pain for
me personally. I'm just a bit picky about having my arguments
misrepresented.

BR,
Ulf W
unknown
2009-06-04 00:16:42 UTC
Permalink
Post by unknown
this discussion is about how to represent strings with many escapes,
not about regexps per se.
You did read the subject?

Actually, it's about people mistakenly THINKING they need strings with
many escapes, when what they really need is to get away from strings.
Regular expressions are one good example, but I've already provided
another: Windows UNC file names. URLS, being similar to UNC names,
are an obvious third.

Imagine that you have
- string
- inside a JavaScript expression
- inside an XML attribute
- where the XML has to appear as data in an Erlang program.
Not an unusual occurrence these days.

r"xxx" or `xxx` just saves you ONE level of escaping, the very
last. It does *NOTHING* to help with the others. No amount of
hacking on the Erlang tokeniser will do anything about JavaScript
syntax or XML syntax. And these fancy Perl-envious alternative
ways of quoting strings only help with _literal_ data, they don't
help with dynamically generated data.

Before arguing about the details of the solution,
shouldn't we make sure we are solving the right problem?

The right problem is how to handle MULTIPLE levels of nested languages
as SOME kind of data in such a way as to make it easier to get right.

The regular expression model shows us an excellent answer to the
_general_ problem.
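[The "multiple levels" point is easy to quantify. An illustrative sketch -- character counts only, no claim about any particular toolchain -- showing one literal backslash doubling at each textual layer:]

```python
import re, json

# one literal backslash, pushed through successive textual layers
target = '\\'                        # 1 char: the text we want to match
as_regex = re.escape(target)         # 2 chars: regex source matching it
as_json = json.dumps(as_regex)       # 6 chars: that regex inside a JSON/JS string
assert (len(target), len(as_regex), len(as_json)) == (1, 2, 6)

# and the regex still works at the bottom of the stack
assert re.search(as_regex, 'a\\b')
```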
unknown
2009-06-04 05:44:01 UTC
Permalink
Post by unknown
Post by unknown
this discussion is about how to represent strings with many escapes,
not about regexps per se.
You did read the subject?
I even went back and read the original mail.
He did say in there:

"The only issue I have with it is that I have to
specify regexps as strings."

So presumably the OP is eagerly awaiting a library which
supports regexps represented as something other than
strings... ROK, you have a small project on your hands. :)
Post by unknown
Actually, it's about people mistakenly THINKING they need
strings with many escapes, when what they really need is
to get away from strings.
There's no getting away from strings in practice,
and some strings have many escapes in them.


Going back to the OP, the string "(?<!\\\\)#" can of course
already be expressed in Erlang without escaping issues:

[40,63,60,33,92,92,41,35]

Forgoing *all* syntactical convenience, we would instead write:

[40|[63|[60|[33|[92|[92|[41|[35|[]]]]]]]]]

...but of course that would be awkward.
(Of course, when generating code, I will have to write it as
'cons' tuples, but when generating code, this is fine.)

Surely we can all agree that syntactic sugar is sometimes
a Good Thing. Too much of it is not, and one man's convenience
is another man's cruft.

Perhaps we can agree that there can be differences in opinion
as to whether there is still room for more syntactic sugar
regarding strings in Erlang?

The claim that anyone who's unhappy with the current convenience
level is simply confused, and in need of the therapeutic pain
caused by rubbing up against double escaping in complex strings,
is subjective. Wouldn't you agree?

BR,
Ulf W
unknown
2009-06-04 07:38:05 UTC
Permalink
Greetings,

I am writing under the assumption that "you" below is directed to all of
us reading erlang-questions.

My opinion is that there is no therapy involved when having to
read/write double (or more) escaping in strings. The alternative (not
using strings, to avoid the problem) is therapeutic.


bengt
Post by unknown
Post by unknown
Post by unknown
this discussion is about how to represent strings with many escapes,
not about regexps per se.
You did read the subject?
I even went back and read the original mail.
"The only issue I have with it is that I have to
specify regexps as strings."
So presumably the OP is eagerly awaiting a library which
supports regexps represented as something other than
strings... ROK, you have a small project on your hands. :)
Post by unknown
Actually, it's about people mistakenly THINKING they need
strings with many escapes, when what they really need is
to get away from strings.
There's no getting away from strings in practice,
and some strings have many escapes in them.
Going back to the OP, the string "(?<!\\\\)#" can of course
[40,63,60,33,92,92,41,35]
[40|[63|[60|[33|[92|[92|[41|[35|[]]]]]]]]]
...but of course that would be awkward.
(Of course, when generating code, I will have to write it as
'cons' tuples, but when generating code, this is fine.)
Surely we can all agree that syntactic sugar is sometimes
a Good Thing. Too much of it is not, and one man's convenience
is another man's cruft.
Perhaps we can agree that there can be differences in opinion
as to whether there is still room for more syntactic sugar
regarding strings in Erlang?
The claim that anyone who's unhappy with the current convenience
level is simply confused, and in need of the therapeutic pain
caused by rubbing up against double escaping in complex strings,
is subjective. Wouldn't you agree?
BR,
Ulf W
________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org
unknown
2009-06-05 03:46:58 UTC
Permalink
Post by unknown
There's no getting away from strings in practice,
and some strings have many escapes in them.
Going back to the OP, the string "(?<!\\\\)#" can of course
[40,63,60,33,92,92,41,35]
One of the most thought-provoking ideas I heard about programming
came from Xerox PARC: "A program is not a listing." Tie that in
with the notion of "resources" from Classic MacOS, now widespread
(in Java, for example).

Accepted that a regular expression is needed.
Accepted that it is needed in a particular module.
But why does it have to be WRITTEN in that module?

One of the reasons for "resources" is so that they can be
revised (localised or otherwise maintained) without having
to modify the program, perhaps even without access to the
source code. And yes, I _have_ edited the resources of a
Classic C program that I didn't have the sources for. (I
found that using Cmd-Q to mean "query" when I expected it
to mean "Quit" was just too confusing; there were quite a
number of issues like that with it.) Many of you will have
had similar experiences.

So if you have complex regular expressions, which because of
their complexity are very likely to need maintenance, why not
get them from a resource bundle?
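[The resource-bundle idea is cheap to try. A minimal sketch, with Python's configparser standing in for whatever bundle format one prefers; the file contents are inlined here for the example, so one escaping layer remains in the literal, but a real file on disk would hold the bare eight characters (?<!\\)# with no escaping at all:]

```python
import configparser, io, re

# patterns live in a resource file, not in the source code
bundle = io.StringIO("[patterns]\nunescaped_hash = (?<!\\\\)#\n")
cfg = configparser.ConfigParser()
cfg.read_file(bundle)

pat = re.compile(cfg["patterns"]["unescaped_hash"])   # file holds (?<!\\)#
assert pat.search("a#b")
assert not pat.search(r"a\#b")
```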

I'll skip over the reduction to absurdity of a position I don't
hold and haven't even hinted at. I'm not against additions to
Erlang syntax. I've proposed quite a few myself. I'm against
*THIS* one because it doesn't actually solve the problem it has
been proposed for, and because there are simpler, cleaner, more
powerful techniques already available

once you get over the idea that non-Erlang stuff
has to be string literals in source code.

I have as little to do with SQL as I can, and even so, I've
seen so much SQL-generation done wrong with strings that whenever
anyone talks about wanting something strange done to strings to
make it easier to embed some other language in string literals
that I immediately think of SQL. And of course the really rather
nasty way that the DOM reveals *comments* of all things in XML,
because of the way people used to hide scripts inside comments
so they wouldn't have to turn < into &lt;.

Please, by all means, LET'S have an addition to Erlang syntax that
makes it easy to deal with the regular expression problem
*AND* the XML problem *AND* the SQL program *AND* ... *AND* all of
them together. But let's NOT waste our time with clunky
incomplete solutions.

There are *lots* of things we can do in the mean time.
- trees
- preprocessors
- resource bundles
- the sky is your limit IF you aren't fixated on string literals.
Post by unknown
Perhaps we can agree that there can be differences in opinion
as to whether there is still room for more syntactic sugar
regarding strings in Erlang?
I am having difficulty interpreting that statement in any way that
doesn't have the answer "NO, it is a simple fact that there is
plenty of room." The question is what to DO with that room, and
in particular, whether to spend any of it on a clunky NON-solution
to a pervasive and important problem.
Post by unknown
The claim that anyone who's unhappy with the current convenience
level is simply confused, and in need of the therapeutic pain
caused by rubbing up against double escaping in complex strings,
is subjective. Wouldn't you agree?
Again, I'm having trouble with that. To start with, as far as I
know, nobody has made that claim. People who are unhappy with the
current convenience level of (doing the wrong thing) are correctly
unhappy. It _is_ inconvenient. Nor are they "simply confused".
They may be ignorant, or unimaginative, or simply seduced by other
languages that have gone down the immediately easy but long term
worthless route. They may even simply be concentrating on the
problem at hand so much that they aren't _looking_ for a general
approach. Mental "set" is a well known issue in problem-solving.
None of these things is "simple confusion". But yes, if your
head is hurting from banging against a brick wall, maybe you should
stop the head banging instead of asking for a softer wall.

It is NOT 'subjective' that simply adding something like Lua's
unquoted string literals "[" "="^n "[" <char>* "]" "="^n "]"
-- though Erlang would have to require n>=1 instead of n>=0 --
would only help with ONE level of notation nesting, so it
won't help with an SQL pattern inside an SQL string inside an
Erlang string.
unknown
2009-06-05 09:23:52 UTC
Permalink
Post by unknown
I'll skip over the reduction to absurdity of a position I
don't hold and haven't even hinted at.
On the other hand, you have put forth a number of
absurdly complex examples of expressions represented
as strings that certainly all people involved in this
particular thread would attack using a structured data
type rather than strings. ;-)

This time, the example included SQL. Yariv Sadan wrote
a very nice library called ErlyDB which made it possible
to express SQL queries as simple erlang terms.

As elegant as it was, it also had limitations. For the
longest time, it only supported MySQL. I would guess that
for many applications, it would have been possible to
instead use ODBC and formulate SQL queries that were
regular enough that they would work against just about
any DBMS.

(Sure, Yariv could e.g. have built ErlyDB on top of ODBC
and used text as an intermediate representation. This
would have made ErlyDB just as compatible as ODBC is,
but I'm sure he had lots of good reasons for not doing
this. As ErlyDB isn't (?) actively maintained anymore,
perhaps parts of it could indeed be converted into a
library on top of OTP's ODBC and made available as a
library for simpler handling of SQL queries? I'm
shooting from the hip here - there may be reasons I'm
not aware of why this would be a lousy idea.)

Regarding the examples with XML and the suggestions
that it is *much* better to generate XML from structured
erlang terms than to write them as strings, surely you are
aware that I wrote the library that is now in OTP for doing
just that. I remember that you were even at the EUC when I
first presented it. My recollection is backed up by the
fact that you even presented right before me (EUC 00).
Granted, this was long ago, but given your almost unlimited
capacity for anecdotal references (which I admire and appreciate
btw), I thought it might not have escaped your memory.

When I last had reason to deal with large volumes of
imported text, which was to be translated into compiled
erlang code, I used Joe's ML9 for storing the text chunks
on disk, thereby avoiding all escaping issues, and built
abstract forms that I pretty-printed as-needed for
debugging purposes.

So, of course, when the problem calls for it, most of us
would do just what you suggest.
Post by unknown
I'm not against additions to
Erlang syntax. I've proposed quite a few myself. I'm against
*THIS* one because it doesn't actually solve the problem it has
been proposed for, and because there are simpler, cleaner, more
powerful techniques already available
once you get over the idea that non-Erlang stuff
has to be string literals in source code.
And this seems to me the source of the disconnect.
Neither Vlad, Mats or I have any hangups of the sort.
I don't believe there has been any evidence on the list
to suggest that we do. On the contrary, I would say.

Granted, Mats might be a suspect since he was one of the
early advocates of including a fast regexp library in
Erlang. OTOH, his work on gtknode is a great example of
how some clever compile-time processing can eliminate
a huge amount of complex and error-prone programming.
Post by unknown
Please, by all means, LET'S have an addition to Erlang
syntax that makes it easy to deal with the regular
expression problem *AND* the XML problem *AND* the
SQL program *AND* ... *AND* all of them together.
But let's NOT waste our time with clunky
incomplete solutions.
Sure, by all means.

But I'm not sure I have been able to glean from your
posts so far what is "clunky" about the slight additional
sugar that I proposed (essentially taking the LaTeX \verb
command and replacing \verb with some suitable token - I
suggested `, which is perhaps a bit too quiet, but "clunky"?)

Incomplete - yes of course. It is not a complete solution by
any reasonable definition of 'complete'. Nor was it ever
advertised as such. It was intended as a lightweight way of
simplifying the use of e.g. the re library (but not being
limited to regexps in any way.)
Post by unknown
Post by unknown
The claim that anyone who's unhappy with the current convenience
level is simply confused, and in need of the therapeutic pain
caused by rubbing up against double escaping in complex strings,
is subjective. Wouldn't you agree?
Again, I'm having trouble with that. To start with, as far as I
know, nobody has made that claim.
I'm glad if that is the case. It may have been my misinterpretation
Post by unknown
That pain level is there
for a good reason: if the Erlang string syntax is giving you
that much of a headache, it's because STRINGS ARE WRONG and you
should almost certainly be using trees instead.
But then you go on and appear to back up that claim.
Post by unknown
People who are unhappy with the current convenience level
of (doing the wrong thing) are correctly unhappy.
Am I not to interpret this as "they deserve to be unhappy"?
Post by unknown
It _is_ inconvenient. Nor are they "simply confused".
They may be ignorant, or unimaginative, [...]
...or neither, but simply accepting the fact that stringy
representations of e.g. regexps are more or less a fact of
life. They may be working hard against a tight deadline
and unwilling to spend a significant portion of that
writing a support layer, or introducing special scripts
or macros that they will then have to convince the customer
to use. There may be all sorts of good reasons why they
would think that a tiny bit of added sugar on top of the
string syntax would be *just the thing* to solve 90% of
their headaches. For the remainder, I'm pretty sure they
would, just like you and I, resort to a more structured
way of addressing the problem, rather than banging their
heads against the wall.

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com
unknown
2009-06-02 12:39:51 UTC
Permalink
Post by unknown
I'm not sure how well it would work in Erlang, but it's certainly
useful in Python for avoiding the headache-inducing backslash
acrobatics necessary when writing the occasional complex regular
expression.
It also has the fairly big advantage of *not* being limited solely to
regular expressions. It's less nice than the /pattern/ syntax of Perl/
Ruby specifically for regular expressions, but that's mostly because
since they're all OO to a point in e.g. Ruby you can write `/
pattern/.match(whatever)` while in Python you have to compile your
string (or use some matching function). That however is irrelevant to
Erlang.

An area where the raw string approach is superior to literal patterns
syntax (on top of Erlang not having pattern/re objects in the first
place) is that it also allows writing e.g. UNC paths (which use
backslashes as separators). I'm pretty sure that's a big factor in C#
having gone for raw strings rather than literal pattern objects.

Finally, extending (via a prefix) the string syntax instead of adding
a completely different syntax better opens up future extension venues
(using other prefixes).
unknown
2009-06-03 03:07:02 UTC
Permalink
Post by unknown
An area where the raw string approach is superior to literal
patterns syntax (on top of Erlang not having pattern/re objects in
the first place) is that it also allows writing e.g. UNC paths
(which use backslashes as separators).
But UNC paths are yet another case of a structured data type.
According to one source I found, UNC paths actually started in
the UNIX world, and indeed, POSIX to this day says that a file
name starting with two forward slashes is special. So something
like
\\Shared1_svr\Shared1\WGroups\Network\Orders.xls
really ought to be
{unc,Server,Volume,Path,FileName}, e.g.,
{unc,"Shared1_svr","Shared1",["WGroups","Network"],"Orders.xls"}
which can be slashified or backslashified at the point where it is
needed, *which need not be the machine it was read on*. For what
it's worth, this should be a non-issue for UNC paths anyway, since
Windows has always accepted forward slashes as well as backslashes.
I've seen some really horrible code making UNC paths and taking them
apart that would have been ever so simple using {unc,_,_,_,_} trees.
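A minimal sketch of the slashify step, assuming the five-element {unc,Server,Volume,Path,File} shape from the example above (unc_to_path is an illustrative name, not an existing function):

```erlang
% Render a structured UNC term as a backslash-separated path string.
unc_to_path({unc, Server, Volume, Path, File}) ->
    "\\\\" ++ string:join([Server, Volume | Path] ++ [File], "\\").
```

Applied to {unc,"Shared1_svr","Shared1",["WGroups","Network"],"Orders.xls"} it reproduces the literal path quoted above, and forward slashes could be substituted at this single point if preferred.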
unknown
2009-06-02 13:02:35 UTC
Permalink
Post by unknown
Python provides a method of specifying strings they call "raw strings,"
which I find quite interesting. Basically, you prefix your string with r
or R, and any backslashes are treated as literal characters rather than
'\b'
'\x08'
r'\b'
'\\b'
The problem is that this uses regular tokens and has a valid
parse scan result today:

4> erl_scan:string("r'\b'.").
{ok,[{atom,1,r},{atom,1,'\b'},{dot,1}],1}

To support it, one would have to make r' a token in its own
right, which *might* actually break existing code (albeit
unlikely) - or complicate the scanner by having it look ahead
in a form of quick parse in order to figure out whether this
is a string or not.

That was one reason why I went for the backtick. It's not
recognized by the parser today.

Another problem, of course, is that while the r'...' syntax
lets you write \ without escaping, it still has some issues
with escaping, which I find a bit unintuitive.

By contrast, the `P...P is pretty simple to understand (you
just have to pick a delimiter that doesn't show up in the
string - it could be `'foo', `&foo&, or whatever. The way I
wrote it, you couldn't pick \ or \n as the delimiter, although
\ would actually work, I guess... (a newline would work too, but
that I find unintuitive.)

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com
unknown
2009-06-02 13:50:49 UTC
Permalink
Post by unknown
Post by unknown
Python provides a method of specifying strings they call "raw
strings," which I find quite interesting. Basically, you prefix your
string with r or R, and any backslashes are treated as literal
'\b'
'\x08'
r'\b'
'\\b'
The problem is that this uses regular tokens and has a valid
4> erl_scan:string("r'\b'.").
{ok,[{atom,1,r},{atom,1,'\b'},{dot,1}],1}
eh?

first off, it should be r"\b".

second, it will scan, but not parse...

1> {ok,Ts,_} = erl_scan:string("r\"\b\".").
{ok,[{atom,1,r},{string,1,"\b"},{dot,1}],1}

2> erl_parse:parse(Ts).
{error,{1,erl_parse,["syntax error before: ","\"\\b\""]}}

or do I misunderstand you?

mats
unknown
2009-06-02 14:02:58 UTC
Permalink
Post by unknown
Post by unknown
The problem is that this uses regular tokens and has a valid
4> erl_scan:string("r'\b'.").
{ok,[{atom,1,r},{atom,1,'\b'},{dot,1}],1}
eh?
first off, it should be r"\b".
Same objection applies.
Post by unknown
second, it will scan, but not parse...
1> {ok,Ts,_} = erl_scan:string("r\"\b\".").
{ok,[{atom,1,r},{string,1,"\b"},{dot,1}],1}
2> erl_parse:parse(Ts).
{error,{1,erl_parse,["syntax error before: ","\"\\b\""]}}
or do I misunderstand you?
First, I was going to hang my head in shame for posting
without thinking hard enough, but what I meant was that it
could be part of a valid parse.

This *will* compile:

-module(m).
-export([f/0]).

-r"\b".

f() ->
b.


Changing the meaning of r"\b", it will no longer compile.

(I did say that breaking existing code was unlikely... ;-)

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com
unknown
2009-06-02 14:58:15 UTC
Permalink
Post by unknown
-module(m).
-export([f/0]).
-r"\b".
f() ->
b.
argh...
Post by unknown
Changing the meaning of r"\b", it will no longer compile.
but... the -r is handled by the preprocessor, not? so the parser
will never see the -r"\b" bit, and it will still compile.
No, the preprocessor (epp) gets the 'attribute' form from the
parser, and only deals with certain attributes. The compiler
deals with some others, and external tools (like dialyzer)
can also rely on attributes.

They are part of the grammar.

attribute -> '-' atom attr_val : build_attribute('$2', '$3').
...

attr_val -> expr : ['$1'].
attr_val -> expr ',' exprs : ['$1' | '$3'].
attr_val -> '(' expr ',' exprs ')' : ['$2' | '$4'].


(From erl_parse.yrl)

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com
unknown
2009-06-02 15:45:55 UTC
Permalink
Post by unknown
Post by unknown
-module(m).
-export([f/0]).
-r"\b".
f() ->
b.
argh...
Post by unknown
Changing the meaning of r"\b", it will no longer compile.
but... the -r is handled by the preprocessor, not? so the parser
will never see the -r"\b" bit, and it will still compile.
No, the preprocessor (epp) gets the 'attribute' form from the
parser, and only deals with certain attributes. The compiler
deals with some others, and external tools (like dialyzer)
can also rely on attributes.
They are part of the grammar.
attribute -> '-' atom attr_val : build_attribute('$2', '$3').
...
attr_val -> expr : ['$1'].
attr_val -> expr ',' exprs : ['$1' | '$3'].
attr_val -> '(' expr ',' exprs ')' : ['$2' | '$4'].
(From erl_parse.yrl)
but still. couldn't the grammar be extended to compile this;

-module(m).

-r"bla".

foo() ->
r"foo".


without changing the meaning of this;

-module(m).

-r"bla".

foo() ->
"foo".


by adding something like

raw_string -> 'r' raw_string : build_raw_string('$2')

to the grammar?

I obviously have no idea what I'm talking about here...

mats
unknown
2009-06-02 17:14:25 UTC
Permalink
Post by unknown
but still. couldn't the grammar be extended to compile this;
-module(m).
-r"bla".
foo() ->
r"foo".
without changing the meaning of this;
-module(m).
-r"bla".
foo() ->
"foo".
No, the attribute would still not compile, since
it would convert to a "raw string", which is not a
legal attribute (and certainly not the attribute that it
was before.)
Post by unknown
by adding something like
raw_string -> 'r' raw_string : build_raw_string('$2')
to the grammar?
The problem with that is that we'd have to make
'r' a reserved word (a terminal) - otherwise, it will be
a dirty hack to the parser. This will mean that all
instances of the atom r in existing source code would
have to be quoted.

BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com
unknown
2009-06-03 06:47:37 UTC
Permalink
Post by unknown
Post by unknown
but still. couldn't the grammar can be extended to compile this;
-module(m).
-r"bla".
foo() ->
r"foo".
without changing the meaning of this;
-module(m).
-r"bla".
foo() ->
"foo".
No, the attribute would still not compile, since
it would convert to a "raw string", which is not a
legal attribute (and certainly not the attribute that it
was before.)
Post by unknown
by adding something like
raw_string -> 'r' raw_string : build_raw_string('$2')
to the grammar?
The problem with that is that we'd have to make
'r' a reserved word (a terminal) - otherwise, it will be
a dirty hack to the parser. This will mean that all
instances of the atom r in existing source code would
have to be quoted.
Well, "dirty hack" or not, it is doable. And on the grand scale of
suckiness it's still better than allowing the -r"bla" madness.

mats
unknown
2009-06-02 12:18:23 UTC
Permalink
As an ex-representative of large software development projects,
I too think that home-grown syntax pre-processing tools are
very hard to control on a large scale. Having done my share
of parse transform modules in this sort of environment, I can
testify that anything that departs from regular erlang syntax
and semantics has a tendency to trigger stress reactions in
people.

One obvious reason is that practically all code written in
these projects will be inherited and maintained by other
people (a rule of thumb is that 80% of the cost of a
program is in the maintenance phase), and a significant
portion of the work reading and trying to understand the
code will be by first- and second-line support people in
other locations, without access to the build environment.
All non-standard syntaxes and programming conventions
increase the cost and tend to reduce the effectiveness
of first- and second-line support.

BR,
Ulf W
Hi,
On Tue, Jun 2, 2009 at 01:51, Richard O'Keefe <ok>
Post by unknown
In the mean time, may I respectfully point out to something that
seems pretty much kindergarten level to me, but doesn't seem to
    PROGRAMS ARE DATA.
........
A trivial AWK script can recognise this and turn a <here-file> into
a string literal, generating whatever quoting is necessary. All you
have to do is write less than a page of AWK (once), and then tell
your build tools how to turn .erl-hf files into .erl files.
As a programmer I like this way of handling this kind of issue
because it works now and it's easy.
As developer of a source handling tool I can't help but cringe at the
prospect of getting requests to support all kinds of homegrown
syntaxes...
Another problem with external processing of the source files is that
it is at the same level as the preprocessor, which many people would
like to see replaced with one that understands Erlang code.
best regards,
Vlad
________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd.
http://www.erlang-consulting.com
unknown
2009-06-03 04:08:47 UTC
Permalink
Post by unknown
- Less error-prone
- Expressions written this way can be parsed and compiled by the compiler
(boost in performance, syntax checked at compile-time)
Any thoughts?
Have you looked at Reia?

http://wiki.reia-lang.org/

As a quick intro: Reia is a Ruby-like scripting language which compiles to
Erlang (and from there into Erlang bytecode) and supports Ruby-like regular
expression syntax (although the actual regex syntax is Perl compatible and
uses Erlang's re module). Reia runs on the Erlang VM and provides excellent
interoperability with Erlang.

You can read a little bit about the regex syntax at:

http://wiki.reia-lang.org/wiki/Data_types#Regular_expressions
--
Tony Arcieri
medioh.com
unknown
2009-06-03 05:03:39 UTC
Permalink
Post by unknown
http://wiki.reia-lang.org/wiki/Data_types#Regular_expressions
How very limiting.

No, not Reia. Just AWK-style regular expression literals.
They weren't good enough for AWK, which allows you to compute
regular expressions as strings and then use them. (I don't
know about other AWKs, but mawk caches the compiled form, so
that using the same string repeatedly as a pattern doesn't
result in repeated recompilation.) Of course, computing
regular expressions as strings is about as straightforward as
representing biological pathways in Excel spreadsheets...
AND IT ISN'T JUST STRING LITERAL SYNTAX THAT MAKES THIS SO.
(That's not Reia's fault. It isn't Erlang's either.)

There are several stages in the compilation of a regular
expression, at least notionally:
linear representation -> AST
AST -> matching engine
It's good that Reia has a clue about regular expression literals
(as AWK did). It would be even better if it _also_ provided an
API for the AST, so that one could say
"I want that regular expression followed by this string
followed by that regular expression."
It would be nice to tag the matches one wants with atoms rather
than invisible integers. And so on.
unknown
2009-06-03 05:09:24 UTC
Permalink
Post by unknown
There are several stages in the compilation of a regular
linear representation -> AST
AST -> matching engine
It's good that Reia has a clue about regular expression literals
(as AWK did). It would be even better if it _also_ provided an
API for the AST, so that one could say
"I want that regular expression followed by this string
followed by that regular expression."
It would be nice to tag the matches one wants with atoms rather
than invisible integers. And so on.
So, you'd like it to be Perl, then?


-kevin
unknown
2009-06-03 06:02:53 UTC
Permalink
Post by unknown
So, you'd like it to be Perl, then?
No, absolutely the complete and total reverse.

Perl is a horrible example of how to do it WRONG.

Perl doesn't do any of the things I have stressed as important.
In fact, it goes about as far as it can in the opposite direction.

The kind of regular expression I want is the POSIX kind: the
kind that can be implemented to run in linear time, not the
Perl kind that takes exponential time.
I just want to be able to compute them *AS* regular expression
values, not as strings. It's like XML: if you want to build
XML, strings are a horrible way to do it.

Unlike Perl, I don't want *any* special syntax for regular expressions.

I've now found a page that describes fairly precisely what I want
for regular expressions in Erlang, except the page describes how to
do it for Scheme.

http://www.scsh.net/docu/post/sre.html

One of the examples from that page is matching
c[ad]+r. In Erlang, something like
{seq,"c",{plus,{alt,"a","d"}},"r"}
would do it. Now this is clumsier than /c[ad]+r/,
but with building blocks like that you can write

matcher(Start,Opt1,Opt2,Finish) ->
{seq,Start,{plus,{alt,Opt1,Opt2}},Finish}.

where the things passed in are *regular expressions*,
including strings as a special case, so there is no
quoting problem.

Take a real example from an AWK script.

/^[a-zA-Z][a-zA-Z0-9.]*[ ]*<-[ ]*function/

I'd like to be able to write this:

opt_space() -> {star,{cset," \t"}}.

letters() -> "a-zA-Z".

continuers() -> "0-9." ++ letters().

identifier() -> {seq,{cset,letters()},{star,{cset,continuers()}}}.

operator(X) -> {seq,opt_space(),X,opt_space()}.

pattern() ->
{seq,bol,identifier(),operator("<-"),"function"}.

It's longer, but in a complete program, I'm likely to have a use
for most of these bits, and I am _certainly_ going to find it
easier to get this right one step at a time.

Do you see any Perl here? I don't.
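To show that the tuple form above composes with the existing re module, here is a rough flattener from such terms to PCRE text. Only the constructors used above are covered, no metacharacter escaping is attempted, and both the function name and the variadic seq/alt handling are my own sketch, not a proposed API:

```erlang
% Flatten a small regex-AST sketch into PCRE text for re:run/2.
to_re(bol)               -> "^";
to_re(S) when is_list(S) -> S;                  % literal string
to_re({cset, Cs})        -> "[" ++ Cs ++ "]";
to_re({star, E})         -> "(?:" ++ to_re(E) ++ ")*";
to_re({plus, E})         -> "(?:" ++ to_re(E) ++ ")+";
to_re(T) when element(1, T) =:= alt ->          % {alt, E1, E2, ...}
    Alts = [to_re(E) || E <- tl(tuple_to_list(T))],
    "(?:" ++ string:join(Alts, "|") ++ ")";
to_re(T) when element(1, T) =:= seq ->          % {seq, E1, E2, ...}
    lists:append([to_re(E) || E <- tl(tuple_to_list(T))]).
```

With this, to_re({seq,"c",{plus,{alt,"a","d"}},"r"}) yields a pattern equivalent to c[ad]+r that re:run/2 accepts directly, so the structured representation costs nothing at match time.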
unknown
2009-06-03 18:53:59 UTC
Permalink
Post by unknown
Post by unknown
So, you'd like it to be Perl, then?
No, absolutely the complete and total reverse.
Perl is a horrible example of how to do it WRONG.
Perl doesn't do any of the things I have stressed as important.
In fact, it goes about as far as it can in the opposite direction.
I shamefully admit to mostly making the comment to get a rise out of you, but the features you list here:
Post by unknown
There are several stages in the compilation of a regular
linear representation -> AST
AST -> matching engine
It's good that Reia has a clue about regular expression literals
(as AWK did). It would be even better if it _also_ provided an
API for the AST, so that one could say
"I want that regular expression followed by this string
followed by that regular expression."
It would be nice to tag the matches one wants with atoms rather
than invisible integers. And so on.
are provided by perl. Regexs in Perl are in no way strings. They are
AST-like data structures, created during the compile phase, which are
fed into the matching engine during the run phase. (Yes, you can
coerce a string into a regex at runtime, but that's not the general
behavior of the built-in syntax.)

You can compose regex objects in just the way you describe:

$re1 = qr/abc/
$re2 = qr/123/
$re3 = qr/${re1}asdf${re2}/

You can have a named capture:

/\w*\.(?<suffix>\w)/
Post by unknown
I just want to be able to compute them *AS* regular expression
values, not as strings. It's like XML: if you want to build
XML, strings are a horrible way to do it.
Well, fortunately regexes in perl are only strings insomuch as any
source code construct is a string.
Post by unknown
Take a real example from an AWK script.
/^[a-zA-Z][a-zA-Z0-9.]*[ ]*<-[ ]*function/
opt_space() -> {star,{cset," \t"}}.
letters() -> "a-zA-Z".
continuers() -> "0-9." ++ letters().
identifier() -> {seq,{cset,letters()},{star,{cset,continuers()}}}.
operator(X) -> {seq,opt_space(),X,opt_space()}.
pattern() ->
{seq,bol,identifier(),operator("<-"),"function"}.
It's longer, but in a complete program, I'm likely to have a use
for most of these bits, and I am _certainly_ going to find it
easier to get this right one step at a time.
Do you see any Perl here? I don't.
$opt_space = qr/[ \t]*/;
$letters = qr/[a-zA-Z]/;
$continuers = qr/[0-9\.]|${letters}/;
$identifier = qr/${letters}${continuers}*/;

sub operator {
$x = shift;
qr/${opt_space}${x}${opt_space}/;
}

$larrow = operator("<-");
$pattern = qr/^${identifier}${larrow}function/;


So, other than your opposition to a concise syntax, I'm not really
seeing your point.



-kevin
unknown
2009-06-04 04:00:33 UTC
Permalink
Post by unknown
$re1 = qr/abc/
$re2 = qr/123/
$re3 = qr/${re1}asdf${re2}/
/\w*\.(?<suffix>\w)/
Anyone out there remember TECO?
Once said to be the only programming language where
a valid program was indistinguishable from line noise.
Now joined by Perl.
Post by unknown
Well, fortunately regexes in perl are only strings insomuch as any
source code construct is a string.
Not true. Statement blocks nest without backslashes.
Expressions nest without backslashes.

Given Perl's string interpolation, arguably Perl's _strings_
aren't "strings", but they DO require special syntax, including
backslashes, to embed expressions in them.
Post by unknown
$opt_space = qr/[ \t]*/;
$letters = qr/[a-zA-Z]/;
$continuers = qr/[0-9\.]|${letters}/;
$identifier = qr/${letters}${continuers}*/;
sub operator {
$x = shift;
qr/${opt_space}${x}${opt_space}/;
}
$larrow = operator("<-");
$pattern = qr/^${identifier}${larrow}function/;
So, other than your opposition to a concise syntax, I'm not really
seeing your point.
I'm not opposed to a concise syntax.
I'm opposed to _complex_ syntax.
I'm opposed to an _unreadable_ syntax.
I'm opposed to an _error-prone_ syntax.
I'm opposed to _nested_ syntaxes with conflicting rules.

Like for example the way that "$" in a Perl regular expression
might mean end-of-string or might mean here-comes-a-variable.

Regular expressions are nice and concise and readable
when and only when they are _simple_.

Regular expressions are _still_ a really great tool when they
are complex, but stringy syntax is then no longer a good way to
write them.
unknown
2009-06-03 06:00:05 UTC
Permalink
Post by unknown
It would be nice to tag the matches one wants with atoms rather
than invisible integers.
I've thought about combining pattern matching with named capture groups,
à la:

/(?<a>f.o)(?<b>b.r)(?<c>b.z)/ = "foobarbaz"

would bind:

a: "foo"
b: "bar"
c: "baz"
Post by unknown
"foobarbaz".match(/(?<a>f.o)(?<b>b.r)(?<c>b.z)/)[:b]
=> "bar"
--
Tony Arcieri
medioh.com
unknown
2009-06-03 18:14:52 UTC
Permalink
Post by unknown
Post by unknown
It would be nice to tag the matches one wants with atoms rather
than invisible integers.
I've thought about combining pattern matching with named capture groups,
/(?<a>f.o)(?<b>b.r)(?<c>b.z)/ = "foobarbaz"
a: "foo"
b: "bar"
c: "baz"
Post by unknown
"foobarbaz".match(/(?<a>f.o)(?<b>b.r)(?<c>b.z)/)[:b]
=> "bar"
I'd also like to note: the main reason I haven't implemented nifty
functionality with named capture groups yet is because the re module
provides no interface to getting a list of the capture group names out of a
compiled regex. I'm not the only one who's been bitten by this problem
either:

http://www.nabble.com/Extracting-capture-group-names-from-regular-expressions-td19848494.html
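For what it's worth, when the names are known in advance, the stock re API can already capture by name; what it cannot do (the complaint above) is enumerate the names from a compiled pattern. A sketch:

```erlang
% re:run/3 accepts atoms in the capture spec for named groups.
{match, ["foo", "bar", "baz"]} =
    re:run("foobarbaz", "(?<a>f.o)(?<b>b.r)(?<c>b.z)",
           [{capture, [a, b, c], list}]).
```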
--
Tony Arcieri
medioh.com
unknown
2009-06-04 04:05:10 UTC
Permalink
LPeg does exactly what you want, though its host language is Lua.
http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
Thanks for that link. My failure to learn Lua is becoming
increasingly irrational, especially now that LuaTeX exists.
unknown
2009-06-04 14:00:29 UTC
Permalink
For what it's worth, I started working on a PEG-based module in Erlang a few
weeks ago, but, in the end I started adding all sorts of niceties that
parsec already has, so I've instead taken to reading the Haskell source code
for parsec and translating it to Erlang. I know there is already an Erlang
module for parsec, but it is incomplete and undocumented, so I am
documenting as I go along (I also like to have tests [I use common test],
which the Erlang parsec did not have). Also, this gave me a good excuse to
learn Haskell.
-----Original Message-----
From: erlang-questions [mailto:erlang-questions] On
Behalf Of Richard O'Keefe
Sent: Wednesday, June 03, 2009 11:05 PM
To: Tony Finch
Cc: Tony Arcieri; Dmitrii Dimandt; erlang-questions
Subject: Re: [erlang-questions] Adoption of perl/javascript-style regexp
syntax
LPeg does exactly what you want, though its host language is Lua.
http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
Thanks for that link. My failure to learn Lua is becoming
increasingly irrational, especially now that LuaTeX exists.
unknown
2009-06-04 14:08:49 UTC
Permalink
Post by unknown
For what it's worth, I started working on a PEG-based module in Erlang a few
weeks ago, but, in the end I started adding all sorts of niceties that
parsec already has, so I've instead taken to reading the Haskell source code
for parsec and translating it to Erlang. I know there is already an Erlang
module for parsec, but it is incomplete and undocumented, so I am
documenting as I go along (I also like to have tests [I use common test],
which the Erlang parsec did not have). Also, this gave me a good excuse to
learn Haskell.
That makes at least three known parsec-like implementations; maybe some
cooperation would be a good idea:
http://seancribbs.com/tech/2009/05/29/building-a-parser-generator-in-erlang-part-2/

/Richard

(who is increasingly glad he postponed writing his own peg library.)
unknown
2009-06-04 14:39:08 UTC
Permalink
Once I get done with translating the Haskell to Erlang, I might create a
parsec-based PEG module. For me, the "secret sauce" of parsec is the
monadic bind operator, whose use in Erlang is unwieldy, but effective.

Here's how unpretty it is in Erlang, however:

Haskell:

do{ tagName <- xmlOpeningTag()
; xmlFragment()
; xmlClosingTag tagName
}

Erlang:

bind(xmlOpeningTag(),
fun(TagName) -> bind(xmlFragment(),
fun(_) -> xmlClosingTag(TagName) end)
end).

The unprettiness goes beyond having to repeat the bind/2 call and match up
end's and right parens; I didn't even know how to indent it to make it
clear. However, there is a lot of functionality abstracted into that bind/2
function which would have to be repeated again and again without it.
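To make that concrete for readers who haven't seen parsec: a toy version of such a bind/2 over parsers of type fun(Input) -> {ok, Value, Rest} | fail (the names and result shape are illustrative, not the actual library's):

```erlang
% Sequence two parsers, feeding the first result to a continuation.
bind(P, F) ->
    fun(Input) ->
            case P(Input) of
                {ok, Value, Rest} -> (F(Value))(Rest);
                fail              -> fail
            end
    end.

% A parser that consumes one expected character.
char(C) ->
    fun([H | T]) when H =:= C -> {ok, H, T};
       (_)                    -> fail
    end.
```

With these, (bind(char($a), fun(_) -> char($b) end))("abc") succeeds with {ok, $b, "c"}, while a failing first parser short-circuits the whole chain; that threading and short-circuiting is exactly the bookkeeping that would otherwise be repeated at every sequencing point.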

I suspect a parse transform could be written that would allow a nice
Haskell-like syntax, but I haven't gotten that far yet.
-----Original Message-----
From: erlang-questions [mailto:erlang-questions] On
Behalf Of Richard Carlsson
Sent: Thursday, June 04, 2009 9:09 AM
To: David Mercer
Cc: 'Richard O'Keefe'; 'Tony Finch'; 'Tony Arcieri'; 'Dmitrii Dimandt';
erlang-questions
Subject: Re: [erlang-questions] Adoption of perl/javascript-style regexp
syntax
Post by unknown
For what it's worth, I started working on a PEG-based module in Erlang a
few
Post by unknown
weeks ago, but, in the end I started adding all sorts of niceties that
parsec already has, so I've instead taken to reading the Haskell source
code
Post by unknown
for parsec and translating it to Erlang. I know there is already an
Erlang
Post by unknown
module for parsec, but it is incomplete and undocumented, so I am
documenting as I go along (I also like to have tests [I use common
test],
Post by unknown
which the Erlang parsec did not have). Also, this gave me a good excuse
to
Post by unknown
learn Haskell.
That makes at least three known parsec-like implementations; maybe some
http://seancribbs.com/tech/2009/05/29/building-a-parser-generator-in-
erlang-part-2/
/Richard
(who is increasingly glad he postponed writing his own peg library.)
unknown
2009-06-05 03:54:15 UTC
Permalink
Post by unknown
Once I get done with translating the Haskell to Erlang, I might create a
parsec-based PEG module.
I've been told that Joe Armstrong once wrote a PEG for Erlang but never
released it.

Joe, any chance of that code seeing the light of day? As someone trying to
build a language with interpolated strings on top of Erlang I could really,
really use it.
--
Tony Arcieri
medioh.com