Discussion:
[erlang-questions] Rant: I hate parsing XML with Erlang
unknown
2007-10-23 10:52:28 UTC
Permalink
I only have so much horizontal real-estate on my screen and I totally
hate how xmlElement and xmlAttribute take so much of it. I also hate
how XML parsing code looks in Erlang.

Am I the only one?

I think Simon Peyton Jones (of Haskell fame) said that syntax is the user
interface of a language. If so, then Erlang has a user interface only
a mother could love, at least when it comes to parsing XML!

Joel

--
http://wagerlabs.com
unknown
2007-10-23 11:57:39 UTC
Permalink
Hi,

Just to clarify: what you hate is XML handling with the XMERL
application that is part of the Erlang/OTP distribution.
We are well aware that XMERL and the XML support in the
distribution need improvement, and improvements will happen.

But I don't think you can draw any conclusions from this about Erlang
as a language and how well suited it is to XML programming.

I would appreciate it if you could be a little more specific about
what you think needs to be improved in XMERL.
The output from xmerl_scan:file/2 is big, but it is not intended for
viewing on screen. We think there is a need for a more compact
output as well, among other things for performance reasons.

/Kenneth Erlang/OTP group at Ericsson
Post by unknown
I only have so much horizontal real-estate on my screen and I totally
hate how xmlElement and xmlAttribute take so much of it. I also hate
how XML parsing code looks in Erlang.
Am I the only one?
I think Simon Peyton-Jones (Haskell) said that the syntax is the user
interface of a language. If so then Erlang has a user interface only
a mother could love, at least when it comes to parsing XML!
Joel
--
http://wagerlabs.com
_______________________________________________
erlang-questions mailing list
erlang-questions
http://www.erlang.org/mailman/listinfo/erlang-questions
unknown
2007-10-23 12:36:24 UTC
Permalink
Kenneth,
Post by unknown
But I don't think you can draw any conclusions regarding Erlang as a
language and how well suited it is for XML programming because of
this.
Take a look at the following [1] and try to visualize an
implementation in Erlang. More thoughts after the example.

The data:

<Export>
  <Product>
    <SKU>403276</SKU>
    <ItemName>Trivet</ItemName>
    <CollectionNo>0</CollectionNo>
    <Pages>0</Pages>
  </Product>
</Export>

The Ruby Hpricot code:

FIELDS = %w[SKU ItemName CollectionNo Pages]

doc = Hpricot.parse(File.read("my.xml"))
(doc/:product).each do |xml_product|
  product = Product.new
  for field in FIELDS
    product[field] = (xml_product/field.intern).first.innerHTML
  end
  product.save
end

This dovetails with the metaprogramming EPP that Vlad has just
submitted. Erlang is mind-numbingly rigid in its syntax. I cannot
emphasize this enough!

The simplest example is the "synchronous message passing" notation
!! (double exclamation point). It could be used to neatly
translate the above into XmlProduct !! {search, Field !! intern} !!
first !! innerHTML. Alas, this is impossible even with a parse
transform, since there is no !! in Erlang.

Any suggestion to use call(...) instead of !! will only prove my
point about the ugly user interface.
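To be concrete, the closest I can get in today's Erlang is a fold over
a list of messages. This is only a sketch: call/2 below is a
hypothetical synchronous-send helper, not anything that exists.

```erlang
%% Sketch only: a left-to-right pipeline written as a fold.
%% call/2 is a HYPOTHETICAL synchronous send; it does not exist.
pipe(Start, Msgs) ->
    lists:foldl(fun(Msg, Acc) -> call(Acc, Msg) end, Start, Msgs).

%% The Ruby-ish chain would then read:
%% pipe(XmlProduct, [{search, intern(Field)}, first, innerHTML])
```

which only proves my point: it is call(...) with extra steps.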

Why does it upset me so much, you may ask? It's very simple! I see
Erlang as a huge, unstoppable HTML and XML transformation engine.
Throw together a bunch of boxes and you can start sucking in web
pages and spitting out RSS feeds, rivers of news, etc.

Erlang enables mashups on a large scale but this type of application
needs a friendly user interface, one that seems to be impossible to
build right now due to syntax limitations!

Thanks, Joel

[1] http://errtheblog.com/post/8

--
http://wagerlabs.com
unknown
2007-10-23 14:47:46 UTC
Permalink
On Tue, 23 Oct 2007, Joel Reymont wrote:

JR> Take a look at the following [1] and try to visualize an
JR> implementation in Erlang. More thoughts after the example.
JR>
JR> The data:
JR>
JR> <Export>
JR> <Product>
JR> <SKU>403276</SKU>
JR> <ItemName>Trivet</ItemName>
JR> <CollectionNo>0</CollectionNo>
JR> <Pages>0</Pages>
JR> </Product>
JR> </Export>
JR>
JR> The Ruby hPricot code:
JR>
JR> FIELDS = %w[SKU ItemName CollectionNo Pages]
JR>
JR> doc = Hpricot.parse(File.read("my.xml"))
JR> (doc/:product).each do |xml_product|
JR> product = Product.new
JR> for field in FIELDS
JR> product[field] = (xml_product/field.intern).first.innerHTML
JR> end
JR> product.save
JR> end

At first glance your Ruby code looks impressively
compact. But the corresponding implementation in
Erlang is about the same size. What's the point of
adding syntactic sugar to make it even more
compact? It is just a matter of taste.

% cat product.erl
-module(product).
-compile(export_all).
-include_lib("xmerl/include/xmerl.hrl").

parse(File) ->
    {#xmlElement{content = Exports}, _} = xmerl_scan:file(File),
    [{Tag, Val} || #xmlElement{content = Products} <- Exports,
                   #xmlElement{content = Fields} <- Products,
                   #xmlText{parents = [{Tag, _} | _], value = Val} <- Fields].

% erl
Erlang (BEAM) emulator version 5.5.5 [source] [async-threads:0] [kernel-poll:false]

Eshell V5.5.5 (abort with ^G)
1> product:parse("my.xml").
[{'SKU',"403276"},{'ItemName',"Trivet"},{'CollectionNo',"0"},{'Pages',"0"}]
2>

/Håkan
unknown
2007-10-23 15:57:05 UTC
Permalink
Post by unknown
JR> Take a look at the following [1] and try to visualize an
JR> implementation in Erlang. More thoughts after the example.
JR>
JR>
JR> <Export>
JR> <Product>
JR> <SKU>403276</SKU>
JR> <ItemName>Trivet</ItemName>
JR> <CollectionNo>0</CollectionNo>
JR> <Pages>0</Pages>
JR> </Product>
JR> </Export>
JR>
JR>
JR> FIELDS = %w[SKU ItemName CollectionNo Pages]
JR>
JR> doc = Hpricot.parse(File.read("my.xml"))
JR> (doc/:product).each do |xml_product|
JR> product = Product.new
JR> for field in FIELDS
JR> product[field] = (xml_product/field.intern).first.innerHTML
JR> end
JR> product.save
JR> end
At a first glance your Ruby code looks impressively
compact. But the corresponding implementation in
Erlang is about the same size. What's the point in
adding some syntactic sugar in order to make it even
more compact? It is just a matter of taste.
% cat product.erl
-module(product).
-compile(export_all).
-include_lib("xmerl/include/xmerl.hrl").
parse(File) ->
{#xmlElement{content = Exports}, _} = xmerl_scan:file(File),
[{Tag, Val} || #xmlElement{content = Products} <- Exports,
#xmlElement{content = Fields} <- Products,
#xmlText{parents = [{Tag, _} | _], value = Val} <- Fields].
% erl
Erlang (BEAM) emulator version 5.5.5 [source] [async-threads:0] [kernel-poll:false]
Eshell V5.5.5 (abort with ^G)
1> product:parse("my.xml").
[{'SKU',"403276"},{'ItemName',"Trivet"},{'CollectionNo',"0"},{'Pages',"0"}]
2>
/Håkan
Well done Håkan ;-)

Here is another solution (not as nice as yours), which however is
rather fun, making use of XPath:
--------------------------------------------
-module(xp).
-export([go/0, go/1]).

-include_lib("xmerl/include/xmerl.hrl").

-define(Val(X),
        (fun() ->
             [#xmlElement{name = N,
                          content = [#xmlText{value = V} | _]}] = X,
             {N, V}
         end)()).

go() ->
    go("/home/tobbe/hej.xml").

go(File) ->
    {Xml, _} = xmerl_scan:file(File),
    [?Val(xmerl_xpath:string("//SKU", Xml)),
     ?Val(xmerl_xpath:string("//ItemName", Xml)),
     ?Val(xmerl_xpath:string("//CollectionNo", Xml)),
     ?Val(xmerl_xpath:string("//Pages", Xml))].
-------------------------------------------------

5> xp:go().
[{'SKU',"403276"},{'ItemName',"Trivet"},{'CollectionNo',"0"},{'Pages',"0"}]


Cheers, Tobbe
unknown
2007-10-24 21:57:58 UTC
Permalink
Post by unknown
JR> Take a look at the following [1] and try to visualize an
JR> implementation in Erlang. More thoughts after the example.
JR>
JR>
JR> <Export>
JR> <Product>
JR> <SKU>403276</SKU>
JR> <ItemName>Trivet</ItemName>
JR> <CollectionNo>0</CollectionNo>
JR> <Pages>0</Pages>
JR> </Product>
JR> </Export>
The first thing I note here is that this is not what I would call
well-designed XML. A good rule of design in general is
"if data are naturally unordered, do not impose any more order on them
than you can help". Here we have four string properties identified by
name. XML has a mechanism that is specifically designed for that job:
attributes. The design rule has an exception "it is sometimes OK to
represent a set or bag as a sequence, provided you ensure that the
results are invariant under permutation." So in *good* XML this
example is
<Export>
 <Product SKU="403276" Item="Trivet" Collection="0" Pages="0"/>
</Export>
There are two more benefits here. One is that long-winded element
names like ItemName and CollectionNo can be replaced by shorter
attribute names like Item and Collection, because attributes are
inherently contextual: they are really Product's Item and Product's
Collection.
The second is that the attribute version is 83 bytes,
while the element version is 144 bytes, both using 1 space per level
of indentation. That reduces space by >1.7 times, and as we all know,
data *space* = I/O *time*. In fact the better XML is even better than
that: when extended to a million products the space saving is better
than 1.95x, and so is the time saving when parsing.

Does anyone remember that I proposed that Erlang could be
extended quite simply with XML expressions and XML patterns?

f() ->
    [ #product{sku=S, item=I, collection=C, pages=P}
    || <'Export'>L</> <- xml:parse_file("my.xml")
     , <'Product' 'SKU'=S 'Item'=I 'Collection'=C 'Pages'=P/> <- L].

Without that extension, it would have to be something like

f() ->
    [ #product{sku=S, item=I, collection=C, pages=P}
    || {'Export',_,L} <- xml:parse_file("my.xml")
     , {'Product',[{'Collection',C},{'Item',I},{'Pages',P},{'SKU',S}],
        []} <- L].

except that this wouldn't work with additional attributes, and the
previous version would.

I also happen to think that good XML style imitates XHTML, SVG,
MathML, and other such standards in preferring lower-case starts, or
even avoiding upper-case starts entirely, rather than imitating
Visual Basic style. Coincidentally that means rather less quoting in
Erlang, so

f() ->
    [ #product{sku=S, item=I, collection=C, pages=P}
    || {export,_,L} <- xml:parse_file("my.xml")
     , {product,[{'SKU',S},{collection,C},{item,I},{pages,P}], []}
       <- L].

This is doable *now* in Erlang, using an XML parser I wrote back in
June 2001. I honestly cannot see this as inferior to the Ruby version.
So now let's revert to the inferior application of XML and see what
that looks like:

f() ->
    [ #product{sku=S, item=I, collection=C, pages=P}
    || {'Export',_,L} <- xml:parse_file("my.xml")
     , {'Product',_,D} <- L
     , {'SKU',_,[S]} <- D
     , {'ItemName',_,[I]} <- D
     , {'CollectionNo',_,[C]} <- D
     , {'Pages',_,[P]} <- D
    ].

That wasn't *too* bad, was it? Shorter than the Ruby version...

Handling namespaces is trickier, but thankfully, Erlang lets us
include *bound* variables in patterns, so we could do something like
this:

f() ->
    Main_NS = "http://www.example.org/silly/main",
    Attr_NS = "http://www.example.org/silly/attr",
    Export = xml:name('Export', Main_NS),
    Product = xml:name('Product', Main_NS),
    SKU = xml:name('SKU', Attr_NS),
    ItemName = xml:name('ItemName', Attr_NS),
    CollectionNo = xml:name('CollectionNo', Attr_NS),
    Pages = xml:name('Pages', Attr_NS),

    [ #product{sku=S, item=I, collection=C, pages=P}
    || {Export,_,L} <- xml:parse_file("my.xml")
     , {Product,_,D} <- L
     , {SKU,_,[S]} <- D
     , {ItemName,_,[I]} <- D
     , {CollectionNo,_,[C]} <- D
     , {Pages,_,[P]} <- D
    ].

It doesn't get much easier than this, anywhere. It so happens that
xml:name/2 doesn't exist. It would be

    name(Name, NS) -> {Name, NS}.

but my old parser doesn't do namespaces. Not much point in changing it
when there are more capable parsers around.

ML: doesn't allow bound variables in patterns, doesn't have list
comprehensions.
Haskell and Clean: don't allow bound variables in patterns, but do
have list comprehensions.
Erlang: allows bound variables in patterns AND has list
comprehensions, making this way of picking XML apart dead easy.
unknown
2007-10-23 19:35:13 UTC
Permalink
Using erlsom, you can write:

-module(product).
-compile(export_all).

parse(File) ->
    {ok, Model} = erlsom:compile_xsd_file("product.xsd"),
    {ok, Result, _} = erlsom:scan_file(File, Model),
    Result.
Then you can do:
1> erlsom:write_xsd_hrl_file("product.xsd", "product.hrl", []).
2> rr("product.hrl").
3> product:parse("export.xml").
#'Export'{anyAttribs = [],
'Product' = #'Product'{anyAttribs = [],
'SKU' = "403276",
'ItemName' = "Trivet",
'CollectionNo' = 0,
'Pages' = 0}}

Very different from the example, but also nice :) And maybe more useful,
depending on what you want to do with it.

You need to provide a schema, of course. I am pasting an example
schema for this XML at the end of this email. Using the schema has the
advantage that the XML document will be validated. Having a schema is
useful as well to document the interface. Even if you don't like XML
schemas (I still have problems with them, even after writing the
parser), having a specification should be a good thing. Isn't this a
bit like ASN.1, actually?

If you don't like the approach with the schema, you can also do this:

-module(product_sax).
-compile(export_all).

parse(File) ->
    {ok, Bin} = file:read_file(File),
    {R, _} = erlsom_sax:parseDocument(binary_to_list(Bin), {s1, []},
                                      fun callback/2),
    lists:reverse(R).

callback({startElement, _, "Product", _, _}, {s1, S}) ->
    {s2, S};
callback({startElement, _, Tag, _, _}, {s2, S}) ->
    {s3, {Tag, S}};
callback({characters, Value}, {s3, {Tag, List}}) ->
    {s2, [{Tag, Value} | List]};
callback({endElement, _, "Product", _}, {_, S}) -> S;
callback(_, S) -> S.
4> product_sax:parse("export.xml").
[{"SKU","403276"},{"ItemName","Trivet"},{"CollectionNo","0"},{"Pages","0"}]

Using a callback and a simple sort of state machine - also nice, and very
efficient.

I have written a new version of the sax parser that can parse a file in
blocks, so that you can use it to parse very big files or streams of data.
At the moment I am doing some final testing, finishing the documentation
etc. Not the kind of work I like, so it is likely to take a while (1 - 2
weeks). The new release also fixes some bugs in the XML Schema related code,
and it has some features that should improve the capabilities to use erlsom
for SOAP.

Regards,
Willem

--------------------------
The schema:

<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>

<xsd:element name='Export'>
<xsd:complexType>
<xsd:sequence>
<xsd:element name='Product' type='Product'/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>

<xsd:complexType name='Product'>
<xsd:sequence>
<xsd:element name='SKU' type='xsd:string'/>
<xsd:element name='ItemName' type='xsd:string'/>
<xsd:element name='CollectionNo' type='xsd:integer'/>
<xsd:element name='Pages' type='xsd:integer'/>
</xsd:sequence>
</xsd:complexType>

</xsd:schema>
Post by unknown
JR> Take a look at the following [1] and try to visualize an
JR> implementation in Erlang. More thoughts after the example.
JR>
JR>
JR> <Export>
JR> <Product>
JR> <SKU>403276</SKU>
JR> <ItemName>Trivet</ItemName>
JR> <CollectionNo>0</CollectionNo>
JR> <Pages>0</Pages>
JR> </Product>
JR> </Export>
JR>
JR>
JR> FIELDS = %w[SKU ItemName CollectionNo Pages]
JR>
JR> doc = Hpricot.parse(File.read("my.xml"))
JR> (doc/:product).each do |xml_product|
JR> product = Product.new
JR> for field in FIELDS
JR> product[field] = (xml_product/field.intern).first.innerHTML
JR> end
JR> product.save
JR> end
At a first glance your Ruby code looks impressively
compact. But the corresponding implementation in
Erlang is about the same size. What's the point in
adding some syntactic sugar in order to make it even
more compact? It is just a matter of taste.
% cat product.erl
-module(product).
-compile(export_all).
-include_lib("xmerl/include/xmerl.hrl").
parse(File) ->
{#xmlElement{content = Exports}, _} = xmerl_scan:file(File),
[{Tag, Val} || #xmlElement{content = Products} <- Exports,
#xmlElement{content = Fields} <- Products,
#xmlText{parents = [{Tag, _} | _], value = Val} <- Fields].
% erl
Erlang (BEAM) emulator version 5.5.5 [source] [async-threads:0] [kernel-poll:false]
Eshell V5.5.5 (abort with ^G)
1> product:parse("my.xml").
[{'SKU',"403276"},{'ItemName',"Trivet"},{'CollectionNo',"0"},{'Pages',"0"}]
2>
/H?kan
unknown
2007-10-24 08:00:51 UTC
Permalink
Post by unknown
At a first glance your Ruby code looks impressively
compact. But the corresponding implementation in
Erlang is about the same size. What's the point in
adding some syntactic sugar in order to make it even
more compact? It is just a matter of taste.
% cat product.erl
-module(product).
-compile(export_all).
-include_lib("xmerl/include/xmerl.hrl").
parse(File) ->
{#xmlElement{content = Exports}, _} = xmerl_scan:file(File),
[{Tag, Val} || #xmlElement{content = Products} <- Exports,
#xmlElement{content = Fields} <- Products,
#xmlText{parents = [{Tag, _} | _], value = Val} <- Fields].
% erl
Erlang (BEAM) emulator version 5.5.5 [source] [async-threads:0] [kernel-poll:false]
Eshell V5.5.5 (abort with ^G)
1> product:parse("my.xml").
[{'SKU',"403276"},{'ItemName',"Trivet"},{'CollectionNo',"0"},{'Pages',"0"}]
2>
There is a function, xmerl_lib:simplify_element(E), which accomplishes the
same thing:

1> {Data,_} = xmerl_scan:string(Str), xmerl_lib:simplify_element(Data).
{'Export',[],
["\n\n ",
{'Product',[],
["\n\n ",
{'SKU',[],["403276"]},
"\n\n ",
{'ItemName',[],["Trivet"]},
"\n\n ",
{'CollectionNo',[],["0"]},
"\n\n ",
{'Pages',[],["0"]},
"\n\n "]},
"\n\n"]}

Unfortunately, it's not documented.
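The whitespace strings in that output are easy to strip with a small
helper. A sketch (not part of xmerl; clean/1 and is_ws/1 are names I
just made up):

```erlang
%% Sketch: drop whitespace-only text nodes from the simple-form
%% {Tag, Attrs, Content} tuples returned by xmerl_lib:simplify_element/1.
clean({Tag, Attrs, Content}) ->
    {Tag, Attrs, [clean(C) || C <- Content, not is_ws(C)]};
clean(Text) ->
    Text.

%% A text node is whitespace-only if every character is blank.
is_ws(Text) when is_list(Text) ->
    Text =/= [] andalso
        lists:all(fun(Ch) -> lists:member(Ch, " \t\n\r") end, Text);
is_ws(_) ->
    false.
```

Applied to the term above, that leaves only the element tuples, e.g.
{'SKU',[],["403276"]} directly under 'Product'.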

BR,
Ulf W
unknown
2007-10-23 12:07:20 UTC
Permalink
Post by unknown
I only have so much horizontal real-estate on my screen and I totally
hate how xmlElement and xmlAttribute take so much of it. I also hate
how XML parsing code looks in Erlang.
Am I the only one?
Nope, you are not. And from the record names it seems you are using
xmerl. I have to admit that I never got my head around it... However,
I have had a very positive experience with the erlsom library, so you
might take a look at it.

Hope this helps,

Ladislav Lenart
unknown
2007-10-23 12:42:44 UTC
Permalink
Post by unknown
And from the record names it seems you are using xmerl.
It's just an easy target to pick on.
Post by unknown
However I have more
than positive experience with erlsom library so might take a look at it.
I did this morning and discarded it immediately, as it wants me to
supply a schema for everything I want to parse. I'm interested in HTML
scraping and mashups, and erlsom does not seem to be useful for this
purpose.

Thanks, Joel

--
http://wagerlabs.com
unknown
2007-10-23 12:56:08 UTC
Permalink
If you can get by with a SAX-based approach, erlsom can work without
a schema.

Atomizer (http://code.google.com/p/atomizer/) uses this approach to
parse Atom feeds.

--Kevin
Post by unknown
Post by unknown
And from the record names it seems you are using xmerl.
It's just an easy target to pick on.
Post by unknown
However I have more
than positive experience with erlsom library so might take a look at it.
I did this morning and discarded it immediately as it wants me to
supply schema for everything I want to parse. I'm interested in HTML
scraping and mashups and erlsom does not seem to be useful for this
purpose.
Thanks, Joel
--
http://wagerlabs.com
unknown
2007-10-23 15:05:32 UTC
Permalink
Post by unknown
If you can get by with a SAX-based approach erlsom can work without
schema.
Atomizer (http://code.google.com/p/atomizer/) uses this approach to
parse Atom feeds.
Just to be clear, you could "easily" write a wrapper around
xmerl_scan that fits more or less exactly into the atom_parser
structure. This was pretty much the idea with xmerl, but the
documentation is terse enough that most people have missed
that.

Also, while there are some wrappers provided with xmerl,
there is no SAX wrapper.

Just to illustrate with an (incomplete) wrapper:

-module(xmerl_sax).

-export([file/2]).

-include_lib("xmerl/include/xmerl.hrl").

file(F, CB) when is_function(CB, 3) ->
    xmerl_scan:file(F, [{event_fun, fun(E, S) ->
                                        event(E, CB, S)
                                    end},
                        {acc_fun, fun(_, Acc, S) ->
                                      {Acc, S}
                                  end}]).

event(#xmerl_event{event = E, data = D}, CB, S) ->
    case D of
        #xmlPI{}      -> S;
        #xmlComment{} -> S;
        #xmlDecl{}    -> S;
        _ ->
            ES = xmerl_scan:event_state(S),
            ES1 = CB(E, data(D), ES),
            xmerl_scan:event_state(ES1, S)
    end.

data(#xmlAttribute{name = N, value = V}) ->
    {attribute, N, V};
data(#xmlElement{name = N, attributes = As, content = C}) ->
    {element, N, [{K, V} || #xmlAttribute{name = K, value = V} <- As], C};
data(document) -> document;
data(#xmlText{value = V}) ->
    {text, V}.


11> xmerl_sax:file("/home/etxuwig/contribs/xmerl-0.18.1/priv/testdata/test3.xml",
                   fun(E, Info, S) ->
                       io:format("E = ~p, Info = ~p~n", [E, Info]), S
                   end).
E = started, Info = document
E = ended, Info = {attribute,encoding,"iso-8859-1"}
E = started, Info = {element,'People',[],[]}
E = started, Info = {text,undefined}
E = ended, Info = {text,"\n "}
E = started, Info = {element,comment,[],[]}
E = started, Info = {text,undefined}
E = ended, Info = {text,"This is a comment"}
E = ended, Info = {element,comment,[],[]}
E = started, Info = {text,undefined}
E = ended, Info = {text,"\n "}
E = ended, Info = {attribute,'Type',"Personal"}
E = started, Info = {element,'Person',[{'Type',"Personal"}],[]}
E = started, Info = {text,undefined}
E = ended, Info = {text,"\n "}
E = ended, Info = {element,'Person',[{'Type',"Personal"}],[]}
E = started, Info = {text,undefined}
E = ended, Info = {text,"\n"}
E = ended, Info = {element,'People',[],[]}
E = ended, Info = document



...then, xmerl spits out an xmlElement record anyway,
which is a bug, IMO.

Another bug is that you can't tell xmerl which accumulator
to use as the return value. This would be easily fixed.

I agree with Joe: it's pretty easy to write a limited
XML parser that handles > 90% of all XML out there and
returns something that's visually appealing.

Writing a few front-ends to xmerl that are leagues more user-friendly
than the generic back-end is not rocket science, but I have no problem
accepting that people don't think they should have to do that.

BR,
Ulf W
unknown
2007-10-23 13:02:45 UTC
Permalink
Hi Joel,
Post by unknown
I'm interested in HTML
scraping and mashups and erlsom does not seem to be useful for this
purpose.
Are you trying to scrape arbitrary HTML? I don't think an XML parser
will help that much in such a case, because HTML is only a distant
cousin of XML...

regards,
Vlad
unknown
2007-10-23 13:08:13 UTC
Permalink
Post by unknown
Do you try to scrape arbitrary HTML? I don't think a XML parser will
help that much in such a case, because HTML is only a distant cousin
of XML...
Completely arbitrary HTML. Any web page out there. The syntax and
approach won't be much different for HTML, assuming you had a robust
parser. My rant is about the syntax.

--
http://wagerlabs.com
unknown
2007-10-23 13:30:24 UTC
Permalink
Take a look at yaws_html.erl. That is quite a nice parser that
doesn't produce the same bloat as xmerl

Sean
Post by unknown
Post by unknown
Do you try to scrape arbitrary HTML? I don't think a XML parser will
help that much in such a case, because HTML is only a distant cousin
of XML...
Completely arbitrary HTML. Any web page out there. The syntax and
approach won't be much for HTML, assuming you had a robust parser. My
rant is about the syntax.
--
http://wagerlabs.com
unknown
2007-10-23 13:46:34 UTC
Permalink
Post by unknown
Take a look at yaws_html.erl. That is quite a nice parser that
doesn't produce the same bloat as xmerl
Are there any examples of using yaws_html as well as the output that
it produces? Would be nice to include in this thread.

Thanks, Joel

--
http://wagerlabs.com
unknown
2007-10-23 13:46:49 UTC
Permalink
Maybe that should be packaged separately? Seems odd that you'd have
to get the webserver just for an HTML parser....

FWIW, I tried writing a very permissive feed parser but lost interest,
partially due to the ugliness of Erlang's XML parsing APIs.

--Kevin
Post by unknown
Take a look at yaws_html.erl. That is quite a nice parser that
doesn't produce the same bloat as xmerl
Sean
Post by unknown
Post by unknown
Do you try to scrape arbitrary HTML? I don't think a XML parser will
help that much in such a case, because HTML is only a distant cousin
of XML...
Completely arbitrary HTML. Any web page out there. The syntax and
approach won't be much for HTML, assuming you had a robust parser. My
rant is about the syntax.
--
http://wagerlabs.com
_______________________________________________
erlang-questions mailing list
erlang-questions
http://www.erlang.org/mailman/listinfo/erlang-questions
_______________________________________________
erlang-questions mailing list
erlang-questions
http://www.erlang.org/mailman/listinfo/erlang-questions
unknown
2007-10-23 14:01:11 UTC
Permalink
Post by unknown
FWIW, I tried writing a very permissive feedparser but lost
interest partially due to the ugliness of Erlang's XML parsing APIs.
Running yaws_html:parse/1 on a sample RSS feed works just fine. I
suspect you can't get any more permissive than that.

--
http://wagerlabs.com
unknown
2007-10-23 14:27:49 UTC
Permalink
Post by unknown
Post by unknown
FWIW, I tried writing a very permissive feedparser but lost
interest partially due to the ugliness of Erlang's XML parsing APIs.
Running yaws_html:parse/1 on a sample RSS feed works just fine. I
suspect you can't get anymore permissive than that.
I tried to use it a couple of years ago and it was of no help to me,
since it actually requires correct HTML, which the sites I tried to
scrape refused to provide (missing end tags and so on).

Anders
unknown
2007-10-23 14:58:43 UTC
Permalink
This indicates that you don't want an XML parser. If the XML (HTML)
is not well formed, then you probably just want a tag parser.

My guess is that if you tokenise the input into a sequence of tags
and then pattern match over the tags, you'll get what you want.

The tokenised file looks like this

[...,
 {sTag, a, [{href, "..."}]},
 {sTag, img, [{src, "..."}]},
 {eTag, img},
 {eTag, a},
 {sTag, p, []},
 {raw, "..."},
 ...
]

Then you write patterns to extract the content
...

this is described here

http://www.trapexit.org/forum/viewtopic.php?p=20670&highlight=&sid=ab39db1f70f1a3a68602f830091ea547
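For illustration, pattern matching over such a token list could look
like this (a sketch; the {sTag, Tag, Attrs} shapes follow the example
above, not any particular library):

```erlang
%% Sketch: collect the href attributes of all <a> start tags
%% from a token list of {sTag, Tag, Attrs} / {eTag, Tag} / {raw, Text}.
hrefs([{sTag, a, Attrs} | Rest]) ->
    case lists:keyfind(href, 1, Attrs) of
        {href, Url} -> [Url | hrefs(Rest)];
        false       -> hrefs(Rest)
    end;
hrefs([_ | Rest]) ->
    hrefs(Rest);
hrefs([]) ->
    [].
```

so hrefs(Tokens) returns every link target in document order, no
matter how broken the rest of the markup is.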

From what has been posted I get the following picture:

1) There are lots of XML libraries around (I have a 6-pack);
   other people have mentioned libraries that I was unaware of.
2) The code for these cannot be found in one place.
3) The documentation for how to use these is non-existent.

The solution is

- move all code to one site
- organise it
- document it

This is a lot of work -

/Joe
Post by unknown
Post by unknown
Post by unknown
FWIW, I tried writing a very permissive feedparser but lost
interest partially due to the ugliness of Erlang's XML parsing APIs.
Running yaws_html:parse/1 on a sample RSS feed works just fine. I
suspect you can't get anymore permissive than that.
I tried to use it a couple of years ago and it was of no help to me since
it actually requires correct HTML. Which the sites I tried to scrape
refused to provide, (missing end tags and so on).
Anders
unknown
2007-10-23 15:47:26 UTC
Permalink
Post by unknown
I tried to use it a couple of years ago and it was of no help to me since
it actually requires correct HTML. Which the sites I tried to scrape
refused to provide, (missing end tags and so on).
It is not necessarily incorrect for an HTML document to have missing
end tags. For some elements the end tag is optional. Trying to parse
an HTML document with an XML parser is not likely to work well,
however. One must either use an SGML parser or make sure to point the
XML parser only at XHTML documents.

Peter

unknown
2007-10-23 14:51:02 UTC
Permalink
Possibly. My understanding was that it still required well-formed
documents to function. A lot of feeds feature varying amounts of
"well-formedness", sadly.

--Kevin
Post by unknown
Post by unknown
FWIW, I tried writing a very permissive feedparser but lost
interest partially due to the ugliness of Erlang's XML parsing APIs.
Running yaws_html:parse/1 on a sample RSS feed works just fine. I
suspect you can't get anymore permissive than that.
--
http://wagerlabs.com
unknown
2007-10-23 15:09:11 UTC
Permalink
I've seen some work on parsing badly formed HTML.

If I remember rightly, you keep a stack of the currently open tags,
plus stacks for things like <font>, <b>, <i> tags etc. So you end up
with several small stacks. Each new open or close tag pushes or pops
things onto these stacks.

When you hit raw data you pattern match over the stacks to figure out
what to do.
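A minimal sketch of that idea, using the same token shapes as the
tag-parser post earlier in the thread (an illustration only, not
tested against real pages):

```erlang
%% Sketch: fold over tokens keeping a stack of open tags;
%% each raw-text chunk is paired with the tags enclosing it.
walk(Tokens) ->
    walk(Tokens, [], []).

walk([{sTag, Tag, _Attrs} | Rest], Stack, Acc) ->
    walk(Rest, [Tag | Stack], Acc);
walk([{eTag, Tag} | Rest], [Tag | Stack], Acc) ->
    walk(Rest, Stack, Acc);
walk([{eTag, _Tag} | Rest], Stack, Acc) ->
    %% close tag that doesn't match the top of the stack:
    %% exactly the badly-formed case; here we simply drop it
    walk(Rest, Stack, Acc);
walk([{raw, Text} | Rest], Stack, Acc) ->
    walk(Rest, Stack, [{Stack, Text} | Acc]);
walk([], _Stack, Acc) ->
    lists:reverse(Acc).
```

Pattern matching on the {Stack, Text} pairs then decides what to do
with each chunk of raw data.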

As an aside, it occurred to me that Mozilla is probably pretty good at
screen scraping (or whatever it's called), so it should be possible to
write a Firefox extension to do this that talks through a socket to
Erlang. <somebody told me this was easy, but they obviously knew more
than I do>

You could then use Erlang as a coordination language controlling
a load of Firefoxes on different machines, telling them to go get
pages and scrape them for data which they send back to Erlang.

If we could use Firefox as a component, then we could avoid
reinventing the wheel (again).

/Joe
Post by unknown
Possibly. My understanding was that it still required well-formed
documents to function. A lot of feeds feature varying amounts of
"well-formedness", sadly.
--Kevin
Post by unknown
Post by unknown
FWIW, I tried writing a very permissive feedparser but lost
interest partially due to the ugliness of Erlang's XML parsing APIs.
Running yaws_html:parse/1 on a sample RSS feed works just fine. I
suspect you can't get anymore permissive than that.
--
http://wagerlabs.com
unknown
2007-10-23 15:23:30 UTC
Permalink
Post by unknown
You could then use Erlang as a coordination language controlling
a load of firefoxes on different machines, telling them to go get pages and
scrape the pages for data which they send back to Erlang.
This is nuts!!! /With all due respect to Joe/

--
http://wagerlabs.com
unknown
2007-10-23 20:01:35 UTC
Permalink
The point is (or was) that Firefox has code to parse virtually any
kind of broken, warped, incomprehensible HTML. Letting Firefox figure
out the "meaning" of deeply crippled HTML and then scanning the result
(the generated DOM) seems a lot easier than figuring out how to parse
crippled HTML yourself. Using other stuff as components to do what
they are good at doesn't seem that crazy to me.

/Joe
Post by unknown
Post by unknown
You could then use Erlang as a coordination language controlling
a load of firefoxes on different machines, telling them to go get pages and
scrape the pages for data which they send back to Erlang.
This is nuts!!! /With all due respect to Joe/
--
http://wagerlabs.com
unknown
2007-10-23 20:56:20 UTC
Permalink
Using IE on Windows or Firefox on Linux is actually the best way to
implement web automation, i.e. a bot logging into a website/webapp,
clicking links and buttons, etc. This way not only badly formed HTML
but virtually any web technology (cookies, JavaScript, AJAX, Flash,
plugins, Java applets) can be supported.
The only problems with this approach:
1. It requires far more resources (i.e. it is more heavyweight than
just HTML parsing).
2. When running multiple Firefox instances on the same node, there
can be security problems.
3. In a server environment it must be possible to run Firefox in
headless mode (i.e. without X).

Zvi
Post by unknown
The point is (or was) that firefox has code to parse virtually any kind of broken,
warped, incomprehensible html - letting firefox figure out the "meaning" of
deeply crippled and totally incomprehensible html and then scanning the result
(the generated DOM) seems a lot easier than figuring out how to parse
crippled HTML yourself - using other stuff as components to do what they are
good at doesn't seem that crazy to me.
/Joe
Post by unknown
Post by unknown
You could then use Erlang as a coordination language controlling
a load of firefoxes on different machines, telling them to go get pages and
scrape the pages for data which they send back to Erlang.
This is nuts!!! /With all due respect to Joe/
--
http://wagerlabs.com
--
View this message in context: http://www.nabble.com/Rant%3A-I-hate-parsing-XML-with-Erlang-tf4676760.html#a13373590
Sent from the Erlang Questions mailing list archive at Nabble.com.
unknown
2007-10-23 21:58:03 UTC
Permalink
Agreed - utilizing firefox or IE will further allow you to handle javascript-
generated DOMs much more easily than having to write a javascript parser
yourself, which will enable handling of a much larger set of pages.

But is this *easy* to do within Erlang?
Post by unknown
The point is (or was) that firefox has code to parse virtually any kind of broken,
warped, incomprehensible html - letting firefox figure out the "meaning" of
deeply crippled and totally incomprehensible html and then scanning the result
(the generated DOM) seems a lot easier than figuring out how to parse
crippled HTML yourself - using other stuff as components to do what they are
good at doesn't seem that crazy to me.
/Joe
Post by unknown
Post by unknown
You could then use Erlang as a coordination language controlling
a load of firefoxes on different machines, telling them to go get pages and
scrape the pages for data which they send back to Erlang.
This is nuts!!! /With all due respect to Joe/
--
http://wagerlabs.com
unknown
2007-10-23 22:14:37 UTC
Permalink
Of course - others on the internet have thought of the same issues and have
blogged about it...

http://emacspeak.blogspot.com/2007/06/firebox-put-fox-in-box.html

Which can run firefox headless + having a REPL with firefox @ the same
time... the rest would just be figuring out the vocabulary to talk to firefox
over a socket...

And apparently you can do all that in emacs -
http://emacspeak.googlecode.com/svn/trunk/lisp/emacspeak-moz.el.

And the engine behind the REPL - http://beta.hyperstruct.net/projects/mozlab.
Post by unknown
Agreed - utilizing firefox or IE will further allow you to handle
javascript-generated DOMs much more easily than having to write a javascript
parser yourself, which will enable handling of a much larger set of pages.
But is this *easy* to do within Erlang?
Post by unknown
The point is (or was) that firefox has code to parse virtally any kind of broken
warped incomprehensable html - letting firefox figure out the "meaning" of
deeply crippled and totally incomprehensible html and then scanning the result
(the generated DOM) seems a lot easier than figuring out how to parse
crippled HTML yourself - using other stuff as components to do what they are
good at doesn't seem that crazy to me.
/Joe
Post by unknown
Post by unknown
You could then use Erlang as a coordination language controlling
a load of firefoxes on different machines, telling them to go get pages and
scrape the pages for data which they send back to Erlang.
This is nuts!!! /With all due respect to Joe/
--
http://wagerlabs.com
unknown
2007-10-23 23:00:56 UTC
Permalink
On MS Windows I use classMechanizeIE.php
(http://www.cgi-interactive-uk.com/com_functions_php_ie.html) and a
small PHP script to grab pages by controlling MS Internet Explorer
through its COM interface. My Erlang program manages the various
jobs and parses the resultant text files created, sending alerts as
needed.

My preference would have been something built into Erlang for the
COM control, but Comet is no longer integral to the distribution
(I do not know if it would have been suitable for the task, anyway).
I had to use Internet Explorer as the browser because the environment
in which I am doing this task will check for valid login when you go
to the page (that is, somehow the server knows if you are logged in
to your workstation and requires you to use MS Internet Explorer to
automatically authenticate when you go to specific pages). Simply
using http:request/4 or lynx or telnet would not authenticate
properly.


~Michael
Post by unknown
Agreed - utilizing firefox or IE will further allow you to handle javascript-
generated DOMs much more easily than having to write a javascript parser
yourself, which will enable handling of a much larger set of pages.
But is this *easy* to do within Erlang?
The point is (or was) that firefox has code to parse virtually any kind of broken,
warped, incomprehensible html - letting firefox figure out the "meaning" of
deeply crippled and totally incomprehensible html and then scanning the result
(the generated DOM) seems a lot easier than figuring out how to parse
crippled HTML yourself - using other stuff as components to do what they are
good at doesn't seem that crazy to me.
/Joe
Post by unknown
Post by unknown
You could then use Erlang as a coordination language controlling
a load of firefoxes on different machines, telling them to go get pages and
scrape the pages for data which they send back to Erlang.
This is nuts!!! /With all due respect to Joe/
--
http://wagerlabs.com
--
Michael McDaniel
Portland, Oregon, USA
http://autosys.us
+1 503 283 5284
unknown
2007-10-23 15:30:25 UTC
Permalink
http://tidy.sourceforge.net/ is the typical library I've seen used to
transform arbitrary HTML into a valid document quickly and without
re-inventing the wheel. Much easier than trying to integrate with
Mozilla.

-bob
Post by unknown
I've seen some work on parsing badly formed HTML.
If I remember rightly you keep a stack of the currently open tags
then stacks for things like <font> <b> <i> tags etc. So you end up with
several small stacks. Each new open or close tag pushes or pops
things onto these stacks.
When you hit raw data you pattern match over the stacks to figure out
what to do.
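The open-tag-stack idea described above could be sketched roughly like this (a minimal, hypothetical sketch using a single stack of open tags and an assumed pre-lexed token list - not anyone's actual code):

```erlang
-module(tag_stack).
-export([events/1]).

%% Tokens are assumed pre-lexed: {open,Tag} | {close,Tag} | {text,Data}.
%% Unclosed tags (e.g. <p> with no </p>) are closed implicitly when an
%% enclosing tag closes; stray close tags are dropped.
events(Tokens) ->
    events(Tokens, [], []).

events([{open,T}|Ts], Stack, Acc) ->
    events(Ts, [T|Stack], [{start_tag,T}|Acc]);
events([{close,T}|Ts], Stack, Acc) ->
    case lists:member(T, Stack) of
        true ->
            %% close everything opened since T, innermost first
            {Implicit, [T|Rest]} =
                lists:splitwith(fun(X) -> X =/= T end, Stack),
            Acc1 = [{end_tag,T} | [{end_tag,X} || X <- Implicit] ++ Acc],
            events(Ts, Rest, Acc1);
        false ->
            events(Ts, Stack, Acc)   % stray close tag: ignore it
    end;
events([{text,D}|Ts], Stack, Acc) ->
    events(Ts, Stack, [{content,D}|Acc]);
events([], Stack, Acc) ->
    %% end of input: implicitly close whatever is still open
    lists:reverse(Acc, [{end_tag,T} || T <- Stack]).
```

With this, the unclosed `<p>` example elsewhere in the thread comes out as a balanced event stream.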
As an aside it occurred to me that mozilla is probably pretty good at
screen scraping (or whatever it's called) - so it should be possible to write
a Firefox extension to do this that talks through a socket to Erlang.
<somebody told me this was easy, but they obviously knew more than I do>
You could then use Erlang as a coordination language controlling
a load of firefoxes on different machines, telling them to go get pages and
scrape the pages for data which they send back to Erlang.
if we could use firefox as a component then we could avoid reinventing
the wheel (again)
/Joe
Post by unknown
Possibly. My understanding was that it still required well-formed
documents to function. A lot of feeds feature varying amounts of
"well-formedness", sadly.
--Kevin
Post by unknown
Post by unknown
FWIW, I tried writing a very permissive feedparser but lost
interest partially due to the ugliness of Erlang's XML parsing APIs.
Running yaws_html:parse/1 on a sample RSS feed works just fine. I
suspect you can't get anymore permissive than that.
--
http://wagerlabs.com
unknown
2007-10-25 01:09:26 UTC
Permalink
Hi Bob,
Post by unknown
http://tidy.sourceforge.net/ is the typical library I've seen used to
transform arbitrary HTML into a valid document quickly and without
re-inventing the wheel. Much easier than trying to integrate with
Mozilla.
Tidy is good, maybe even the only workable solution, depending on your
needs. It tries to convert malformed HTML into wellformed HTML that
you can parse with something that expects wellformed markup. In
practice, it tends to be rather slow. I had a need a few years ago to
parse arbitrary HTML. I didn't care about making it wellformed; I
just needed something that could start at the beginning and raise
SAX-like events when it encountered stuff. For example, the following
malformed document:

<html>
<body>
<p>I am an unclosed tag
</body>
</html>

would result in:

start_tag: html
start_tag: body
start_tag: p
element_content: I am an unclosed tag
end_tag: body
end_tag: html

If this is good enough it might be worth looking at. The code is part
of a C++ application framework I work on in my spare time. You can
get the latest code at:

https://launchpad.net/framework

The relevant bits wrt HTML parsing are here:

http://codebrowse.launchpad.net/~jkakar/framework/57186-release-0.2/files/jkakar%40starla-20070107232638-khamuslq9vz15z6a?file_id=text-20060821192009-a8646b9047718e3d

I've started porting the HTML parsing code to Python, but it's not
ready yet. Maybe using this code would be helpful? The code has been
used in a production environment and works and performs fairly well.

Thanks,
J.

unknown
2007-10-23 22:25:33 UTC
Permalink
Post by unknown
I've seen some work on parsing badly formed HTML.
If I remember rightly you keep a stack of the currently open tags
then stacks for things like <font> <b> <i> tags etc. So you end up with
several small stacks. Each new open or close tag pushes or pops
things onto these stacks.
When you hit raw data you pattern match over the stacks to figure out
what to do.
As an aside it occurred to me that mozilla is probably pretty good at
screen scraping
Not to mention lynx.

--Toby
Post by unknown
(or whatever it's called) - so it should be possible to write
a Firefox extension to do this that talks through a socket to Erlang.
<somebody told me this was easy, but they obviously knew more than I do>
You could then use Erlang as a coordination language controlling
a load of firefoxes on different machines, telling them to go get pages and
scrape the pages for data which they send back to Erlang.
if we could use firefox as a component then we could avoid reinventing
the wheel (again)
/Joe
Post by unknown
Possibly. My understanding was that it still required well-formed
documents to function. A lot of feeds feature varying amounts of
"well-formedness", sadly.
--Kevin
Post by unknown
Post by unknown
FWIW, I tried writing a very permissive feedparser but lost
interest partially due to the ugliness of Erlang's XML parsing APIs.
Running yaws_html:parse/1 on a sample RSS feed works just fine. I
suspect you can't get anymore permissive than that.
--
http://wagerlabs.com
unknown
2007-10-23 14:13:27 UTC
Permalink
I have written several XML parsers in several states of completeness;
at the moment I'm trying to put together yet another XML toolkit (I
might actually release this one).

The problem with "parsing" xml is not so much the parsing but what you want to
do with the parse tree. Do you need validation? How are the DTDs
defined? And so on.

Here are some of the questions that occur to me in the design of an
XML parser.

1) Is the input small or large. Small means it will fit into memory.

    In the case of small input everything fits into memory - I can
    happily parse input streams of 200K lines.

    The only large XML files I've found are tens of gigabytes of
    data - is it important to be able to parse and validate these?
    Or do you just want SAX-like processing?

2) Is the input stream "framed" - ie we have a framing protocol so we
    know we have an entire XML document - or do you want a re-entrant
    parser.

3) Do you need to handle streams of xml terms. This is often problematic
since nobody can agree on the framing protocol - does each term begin
with a new <?xml ...?> header?

4) Do you want validation. In which case how do you find the DTD/schema -
    do you have to comply with OASIS catalogues?
    Do you want DTDs, Schema, RNC (or an Erlang ad hoc grammar)?

    (My solution is to parameterize the parser with a fun F - F(URI)
    is a function that knows how to find the DTD in URI - the OASIS
    catalogue structure is not something that I want to be concerned
    with - most applications seem to totally ignore this)

5) Do you want the parser to try and correct errors and recover, or
    bail out early?

6) Do you want unicode support? or just ASCII.

7) Does the data come from files, sockets, binaries?

8) Do you want strict or lazy parsing. If you don't look at an attribute
    you might like to defer parsing the content until you actually
    need it.

9) Do you want to check all ids and idrefs and correctly handle
    NOTATIONs and so on -
    i.e. all the weird things in the XML spec that 99.9% of
    programmers have never used.

10) In the case of XML without a DTD do you want a heuristic to throw away
    non-significant white space (I often use a simple heuristic: if all the
    PCDATA children of a tag are all white space then throw all this
    white space away)

...

It's very difficult to write a parser that correctly handles *all* of
these and is fast, small etc.

I have made a set of compromises and a toolkit that provides the following.

1) A tag level interface to the system
this has a file type interface.

Pid = open_xml_token_stream(Descriptor)

makes a re-entrant token scanner

get_next_token(Pid) -> Token | eof

A lot of things can be done with this alone, for example a SAX
like processor

2) A simple parser - this takes a token stream and parses it, just
    checking for well-formedness.
3) A validator that runs on the output of 2 - this only
understands DTDs (not schemas,
or rnc)

There are also diverse routines to parse files etc, based on these.

I've also written an XSLT type thing that takes the output of 2)
or 3) and transforms
it.

This is just ASCII.

I've talked to a lot of people about XML - most people (the majority) want
something to parse a small ASCII file containing a single XML data structure.

The data structure is well formed - there is no DTD - and they don't care
about integrity constraints on the attributes - they don't care about
entity expansion, NOTATIONs, CDATA etc.

The kind of Ruby code shown in an earlier posting is easy given a
simple parse tree

My experimental parser turns an XML data structure into a

@type xml() = {node,Line,Tag,attrs(),[xml()]} |
{raw,Ln,AllBlack:bool(), bin()}

It's easy to write a fold function that applies a Fun to each node

fold_over_nodes(Fun, Env, {node, _, _, _, C} = Node) ->
Env1 = Fun(Node, Env),
fold_over_nodes(Fun, Env1, C);
fold_over_nodes(Fun, Env, {raw,_,_,_} = Raw) ->
Fun(Raw, Env);
fold_over_nodes(Fun, Env, [H|T]) ->
Env1 = fold_over_nodes(Fun, Env, H),
fold_over_nodes(Fun, Env1, T);
fold_over_nodes(Fun, Env, []) ->
Env.

(like foldl) - this can be used to extract tags

F = fun({node,_,Stag,_,C}=N, E) ->
            case member(Stag, [a,b,c]) of
                true -> [N|E];
                false -> E
            end
    end,
Tags = fold_over_nodes(Tree, [], F)

For very simple applications I could put together a parser that does
the following:

1) Small files only.
2) no DTDs or grammar checking
3) white space normalisation according to the following.
If a tag has MIXED content (ie tags and PCDATA) and all the
PCDATA bits are all
blank then remove all the PCDATA
4) ASCII
5) simple "obvious" parse tree {tag,Line,Name,[Attr], [Children]}
Attr is a sorted [{Key,Val}] list
6) Simple SAX library, foldtree functions, find object in tree (like Xpath
only in Erlang)
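The "find object in tree" idea from (6), against the simple parse-tree shape from (5), might look something like this (a hypothetical sketch, not part of any released toolkit):

```erlang
%% Follow a path of tag names down a {tag,Line,Name,Attrs,Children}
%% tree; return the first matching subtree, or undefined.
find([], Node) ->
    Node;
find([Name|Path], {tag,_,_,_,Children}) ->
    case [C || {tag,_,N,_,_} = C <- Children, N =:= Name] of
        [First|_] -> find(Path, First);
        []        -> undefined
    end.
```

So find([body,p], Tree) would return the first p element under body, XPath-style but with plain Erlang terms.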

I have a 6-pack of parsers that almost do this - they are all specialised in
different ways for infinite streams and so on ...

I'm not sure if a general purpose toolkit (that allows you to build
the above) or a set
of completely different parsers with different properties is desirable.

/Joe Armstrong
Post by unknown
Post by unknown
Do you try to scrape arbitrary HTML? I don't think a XML parser will
help that much in such a case, because HTML is only a distant cousin
of XML...
Completely arbitrary HTML. Any web page out there. The syntax and
approach won't be much for HTML, assuming you had a robust parser. My
rant is about the syntax.
--
http://wagerlabs.com
unknown
2007-10-23 20:45:59 UTC
Permalink
Post by unknown
3) Do you need to handle streams of xml terms. This is often problematic
since nobody can agree on the framing protocol - does each term begin
with a new <?xml ...?> header?
There are many applications which, using custom and/or non-conformant XML
processing tools, emit XML without the <?xml ...?> processing instruction,
so it's better to have a non-strict mode which accepts XML like this.
Post by unknown
4) Do you want validation. In which case how do you find the DTD/schema - do you
have to comply with OASIS catalogues?
Do you want DTD's, Schema, RNC, (or an erlang ad hock grammar)
(My solution is to parameterize the parser with a fun F - F(URI)
is a function
that knows how to find the DTD in URI - the OASIS catalogue structure
is not something that I want to be concerned with - most
applications seem to
titall ignore this)
From my experience DTD is largely ignored today; I can easily live with an
XML parser ignoring DTD.
The standard XML Schema language is W3C XSD, which is very useful,
especially for XML Data Binding and Web Services.
Post by unknown
5) Do you want the parser to try an correct errors and recover, or
bail out early.
The XML parser should be a strict parser (maybe except only for a leading
<?xml...?> PI).
The HTML and RDF/RSS parsers can be more forgiving, because of the legacy of
millions of bad HTML pages.
For HTML the solution is Tidy; maybe we need an Erlang Tidy port or just call
it via os:cmd.
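As a minimal sketch of the os:cmd route (assuming the `tidy` binary is installed and on $PATH; -q and -asxml are standard Tidy flags, but the error handling here is deliberately naive):

```erlang
%% Shell out to HTML Tidy and get back well-formed XHTML as a string.
%% NB: os:cmd/1 gives no exit status; a proper port program would be
%% more robust (and would not mix stderr noise into the result).
tidy_to_xhtml(HtmlFile) ->
    os:cmd("tidy -q -asxml " ++ HtmlFile).
```

The resulting string can then be fed to any parser that expects well-formed markup.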
Post by unknown
6) Do you want unicode support? or just ASCII.
if you're writing a parser for XML configuration files, then ASCII is fine,
but let's agree it's stupid: it's much easier to store configuration in Plain
Old Erlang Term (POET) format and just file:consult it.

For the webapp and web syndication domains, UTF-8 encoding is a MUST. Most
RSS files are UTF-8.
Post by unknown
7) Does the data come from files, sockets, binaries?
In the most generic case the data comes in chunks of UTF-8 encoded binaries.
Post by unknown
8) Do you want strict or lazy parsing. If you don't look an attribute
you might like to
defer parsing the content until you actually need it.
I think a lazy parser just gives an illusion of fast processing; it still
reads the raw XML into memory.
The MSXML DOM native parser was a lazy parser from the beginning and it
showed super-fast benchmarks in just "parsing" (read "opening") XML files,
but when actual processing of XML nodes was involved it was much slower than
non-lazy Java parsers. Whoever wants to process just parts of an XML
document is better off using SAX than a lazy tree-building parser.
Post by unknown
9) Do you want to check all ids and idrefs and correctly handle
NOTATION's and so on
i.e. all the weird things in the XML spec that 99.9% of
programmers have never used.
no, the only important parts are: namespaces, processing instructions and
CDATA
Post by unknown
10) In the case on XML without a DTD do you want a heuristic to throw away
non-significant white space (I often use a simple heuristic if all the
PCDATA children of a tag are all white space then throw all this
white space away)
good enough for me
Post by unknown
...
It's very difficult to write a parser that correctly handles *all* of
these and is fast, small etc.
I have made a set of compromises and a toolkit that provides the following.
1) A tag level interface to the system
this has a file type interface.
Pid = open_xml_token_stream(Descriptor)
makes a re-entrant token scanner
get_next_token(Pid) -> Token | eof
A lot of things can be done with this alone, for example a SAX
like processor
2) A simple parser this takes a token stream and parses it just checking for
well formed-ness.
3) A validator that runs on the output of 2 - this only
understands DTDs (not schemas,
or rnc)
There are also diverse routines to parse files etc, based on these.
I've also written an XSLT type thing that takes the output of 2)
or 3) and transforms
it.
This is just ASCII.
I've talked to a lot of people about XML - most people (the majority) want
something to parse a small ASCII file containing a single XML data structure.
The data structure is well formed - there is no DTD - and they don't care
about integrity constraints on the attributes- They don't care about
entity expansion,
NOTATIONs CDATA etc.
There are two major types of XML:

1. Document-oriented (also Unstructured)
2. Data-oriented (also Structured)

To support document-oriented XML, you need a generic XML parser with
namespaces, CDATA, PI, UTF-8 and XPath/XQuery support.

For data-oriented XML, you do not need a generic parser, but an XML Data
Binding tool, like an XSD to Erlang compiler (xsd2erl) (like XMLBeans for
Java, or RogueWave LIEF for C++, etc.), i.e. you give it employee.xsd and it's
compiled into an employee.erl module, which you then use in your application
to parse and generate XML instances conforming to the employee.xsd schema.
Post by unknown
The kind of Ruby code shown in an earlier posting is easy given a
simple parse tree
My experimental parser turns an XML data structure in a
@type xml() = {node,Line,Tag,attrs(),[xml()]} |
{raw,Ln,AllBlack:bool(), bin()}
It's easy to write a fold function that applies a Fun to each node
fold_over_nodes(Fun, Env, {node, _, _, _, C} = Node) ->
Env1 = Fun(Node, Env),
fold_over_nodes(Fun, Env1, C);
fold_over_nodes(Fun, Env, {raw,_,_,_} = Raw) ->
Fun(Raw, Env);
fold_over_nodes(Fun, Env, [H|T]) ->
Env1 = fold_over_nodes(Fun, Env, H),
fold_over_nodes(Fun, Env1, T);
fold_over_nodes(Fun, Env, []) ->
Env.
(like foldl) - this can be used to extract tags
F = fun({node,_,Stag,_,C}=N, E) ->
case member(Stag, [a,b,c]) of
true -> [N|E];
false -> E
end,
Tags = fold_over_nodes(Tree, [], F)
I think it should be:
Tags = fold_over_nodes(F, [], Tree).
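Putting that correction together with the fold shown above (and adding the `end` that closes the fun, plus a clause so raw nodes don't crash the extraction), a complete version might read - a sketch against the {node,...}/{raw,...} tree shape from earlier in the thread:

```erlang
-module(xml_fold).
-export([tags/2]).
-import(lists, [member/2]).

%% Joe's fold, unchanged except for silencing the unused-variable
%% warning in the last clause.
fold_over_nodes(Fun, Env, {node,_,_,_,C} = Node) ->
    Env1 = Fun(Node, Env),
    fold_over_nodes(Fun, Env1, C);
fold_over_nodes(Fun, Env, {raw,_,_,_} = Raw) ->
    Fun(Raw, Env);
fold_over_nodes(Fun, Env, [H|T]) ->
    Env1 = fold_over_nodes(Fun, Env, H),
    fold_over_nodes(Fun, Env1, T);
fold_over_nodes(_Fun, Env, []) ->
    Env.

%% Collect every node whose tag is in Wanted, in document order.
tags(Tree, Wanted) ->
    F = fun({node,_,Tag,_,_} = N, E) ->
                case member(Tag, Wanted) of
                    true  -> [N|E];
                    false -> E
                end;
           (_Raw, E) -> E
        end,
    lists:reverse(fold_over_nodes(F, [], Tree)).
```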
Post by unknown
For very simple applications I could put together a parser that does
1) Small files only.
2) no DTDs or grammar checking
3) white space normalisation according to the following.
If a tag has MIXED content (ie tags and PCDATA) and all the
PCDATA bits are all
blank then remove all the PCDATA
4) ASCII
5) simple "obvious" parse tree {tag,Line,Name,[Attr], [Children]}
Attr is a sorted [{Key,Val}] list
6) Simple SAX library, foldtree functions, find object in tree (like Xpath
only in Erlang)
I have a 6-pack of parsers that almost do this - they are all specialised in
different ways for infinite streams and so on ...
I'm not sure if a general purpose toolkit (that allows you to build
the above) or a set
of completely different parsers with different properties is desirable.
/Joe Armstrong
Zvi
--
View this message in context: http://www.nabble.com/Rant%3A-I-hate-parsing-XML-with-Erlang-tf4676760.html#a13373230
Sent from the Erlang Questions mailing list archive at Nabble.com.
unknown
2007-10-24 06:14:21 UTC
Permalink
Post by unknown
1. Document-oriented (also Unstructured)
2. Data-oriented (also Structured)
To support document-oriented XML, you need generic XML parser with
namespaces, CDATA, PI, UTF-8 and XPath/XQuery support.
For data-oriented XML, you do not need a generic parser, but an XML Data Binding
tool, like an XSD to Erlang compiler (xsd2erl) (like XMLBeans for Java, or
RogueWave LIEF for C++, etc.), i.e. you give it employee.xsd and it's
compiled into an employee.erl module, which you then use in your application to
parse and generate XML instances conforming to the employee.xsd schema.
Erlsom is an XML Data Binding Tool written in Erlang.

It doesn't generate code (except for a header file with record definitions),
but the effect is the same: When you parse employee.xml, it uses (a
'compiled' version of) employee.xsd to translate it to a structure of
records that correspond to the types defined in the XSD. Or the other way
around: given such a structure, it can generate the XML.
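The round trip described above might look roughly like this (function names as commonly shown in erlsom's documentation; treat the exact signatures and return shapes as an assumption to check against your erlsom version):

```erlang
%% 'Compile' the schema once, then parse and regenerate instances.
{ok, Model} = erlsom:compile_xsd_file("employee.xsd"),
{ok, Employee, _Rest} = erlsom:scan(XmlString, Model),  % XML -> record
{ok, Xml} = erlsom:write(Employee, Model).              % record -> XML
```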
unknown
2007-10-23 12:25:19 UTC
Permalink
I agree! Except for the "Erlang" part...
xmerl is extremely unpleasant to work with, but isn't the problem more
related to XML than the programming language?
I find it hopeless to translate the XML mess into an accessible data
format.
I long for the good old ASN.1 days when protocols were designed for
computers
instead of humans. Who needs to parse protocols without a computer?
/OLA.
-----Original Message-----
From: erlang-questions-bounces
[mailto:erlang-questions-bounces] On Behalf Of Joel Reymont
Sent: den 23 oktober 2007 12:52
To: erlang-questions Questions
Subject: [erlang-questions] Rant: I hate parsing XML with Erlang
I only have so much horizontal real-estate on my screen and I totally
hate how xmlElement and xmlAttribute take so much of it. I also hate
how XML parsing code looks in Erlang.
Am I the only one?
I think Simon Peyton-Jones (Haskell) said that the syntax is the user
interface of a language. If so then Erlang has a user interface only
a mother could love, at least when it comes to parsing XML!
Joel
--
http://wagerlabs.com
unknown
2007-10-23 14:09:58 UTC
Permalink
Post by unknown
I only have so much horizontal real-estate on my screen and I totally
hate how xmlElement and xmlAttribute take so much of it. I also hate
how XML parsing code looks in Erlang.
The original idea with xmerl (dating back to 2001) was
to map XML to S-expressions. The records came about when
trying to support namespaces, XPath, et al.

There may be some gotchas in what assumptions xmerl_scan
makes about the hook functions, but at least for simple
XML, you can call xmerl_scan with your own hook function:

17>xmerl_scan:file(".../xmerl-0.18.1/priv/testdata/test3.xml",
[{hook_fun,
fun(#xmlElement{name = N,
attributes = As,
content = C},S) ->
As1 = [{K,V} ||
#xmlAttribute{name=K,
value=V} <- As],
{{N,As1,C},S};
(#xmlText{value = V}, S) ->
{V,S}; (X,S) -> {X, S} end}]).
{{'People',[],
["\n ",
{comment,[],["This is a comment"]},
"\n ",
{'Person',[{'Type',"Personal"}],["\n "]},
"\n"]},
[]}

BR,
Ulf W
unknown
2007-10-23 14:27:19 UTC
Permalink
Post by unknown
Post by unknown
Take a look at yaws_html.erl. That is quite a nice parser that
doesn't produce the same bloat as xmerl
Are there any examples of using yaws_html as well as the output that
it produces? Would be nice to include in this thread.
1> yaws_html:parse("ShowLetter.html").
{ehtml,[],
 [{html,[],
   [{head,[],
     [{title,[],"Yahoo! Mail - thomasl_erlang"},
      {script,[],
       "\n<!-- \n\tif(typeof top.frames[\"wmailmain\"] != \"undefined\") window.open(\"http://mail.yahoo.com\", \"_top\");\n// -->\n"},
      {link,[{rel,"stylesheet"},
             {href,"ShowLetter_files/mail_blue_all.css"},
             {type,"text/css"},
             {media,"all"}]},
      {script,[{src,"ShowLetter_files/mailcommonlib.js"}],[]},
      ...
and so on

However, note that yaws_html (1.68 in this case)
apparently isn't robust enough to handle unclosed tags
and perhaps other nastiness. You get parse errors
instead, which might not be what you want for a real
html processor. Good luck.

(For extra credit, write an xmerl -> ehtml converter.)

Best,
Thomas

