XMLSlurper really slow reading/parsing html/xml file

mjfan80
In my project I need to parse an HTML file (well-formed, as XHTML).
The file is not big: some styles, a table with many td cells, and a few other elements.
It is 8 KB.

The same file is parsed by the PDF plugin (so Flying Saucer) to make a PDF, and that is really quick (less than one second, I think).

But when the same file is parsed by XmlSlurper it takes 80 seconds. Yes, 80 seconds.
I tried XmlSlurper, XmlParser and also the Java XMLStreamReader, and each takes between 70 and 80 seconds.
I don't know why it is so slow.

The HTML file is stored locally on the server (so no download time).

This is what I do to find in the HTML file a <table> with the class set to "report" (I will then do something with this table):


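// GPathResult is not imported by default (package name as of Groovy 1.x/2.x).
import groovy.util.slurpersupport.GPathResult

// XmlSlurper returns the GPathResult that trovaTableReport expects
// (XmlParser would return a groovy.util.Node instead).
def docParser = new XmlSlurper().parse(urlFile)
def body = docParser.'body'
def report = trovaTableReport(body)

// Recursively looks for a <table> whose class attribute contains "report".
public GPathResult trovaTableReport(GPathResult nodo) {
    if (nodo == null) return null
    // First check the tables that are direct children of this node.
    def report = nodo.'table'.find { it.@class.text().contains("report") }
    if (report != null && !report.isEmpty()) return report
    // Otherwise descend into every child element.
    nodo.children().each { figlio ->
        def reportInterno = trovaTableReport(figlio)
        if (reportInterno != null && !reportInterno.isEmpty()) report = reportInterno
    }
    return (report != null && !report.isEmpty()) ? report : null
}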


Can someone tell me why it takes so long to parse a simple HTML file?
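
The hand-written recursion above could probably be replaced by GPath's built-in depth-first traversal. The following is only a sketch under assumptions: findReportTable and the sample markup are made up for illustration, and the star import is there so that XmlSlurper also resolves on Groovy versions that keep it in groovy.xml.

```groovy
import groovy.xml.*   // assumption: lets XmlSlurper resolve on Groovy versions that moved it to groovy.xml

// Hypothetical helper: returns the first <table> at any depth whose class
// attribute contains "report", or null when there is none.
def findReportTable(root) {
    root.depthFirst().find { it.name() == 'table' && it.@class.text().contains('report') }
}

// Made-up sample document, much smaller than the file from the post.
def doc = new XmlSlurper().parseText('''<html><body>
  <div><div><table class="grid report"><tr><td>2011/00469/C</td></tr></table></div></div>
</body></html>''')

def table = findReportTable(doc)
println table.tr.td.text()
```

This does not address the slow parse() call itself; it only shortens the search that runs afterwards.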

Re: XMLSlurper really slow reading/parsing html/xml file

Wolfgang Schell
Does your (X)HTML file have a DTD, an XML Schema declaration, or something else that contains external URLs? Maybe the parser reaches out to the Net and tries to locate the DTDs or XML Schemas.

HTH,

Wolfgang
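
One way to test this theory without editing the file is to tell the parser not to fetch the external DTD. This is a sketch, assuming the underlying SAX parser is Xerces-based (as the JDK's default one is) and therefore understands the two Apache feature URIs below; the sample markup is made up and only copies the DOCTYPE from the thread.

```groovy
import groovy.xml.*   // assumption: lets XmlSlurper resolve on Groovy versions that moved it to groovy.xml

// Made-up sample with the same external DTD reference as the file in the thread.
def xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head><title>Elenca Chiamate</title></head><body><p>ok</p></body></html>'''

def slurper = new XmlSlurper()
// Accept the DOCTYPE line itself (some newer Groovy versions reject it by default) ...
slurper.setFeature('http://apache.org/xml/features/disallow-doctype-decl', false)
// ... but do not download the DTD it points to.
slurper.setFeature('http://apache.org/xml/features/nonvalidating/load-external-dtd', false)

def doc = slurper.parseText(xhtml)
println doc.head.title.text()
```

With the DTD skipped, named entities that only the DTD defines (for example &nbsp;) would no longer resolve, so this fits documents that do not use them.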

Re: XMLSlurper really slow reading/parsing html/xml file

mjfan80
I tried deleting the DOCTYPE declaration (which references an external DTD), but nothing changed: it still takes 80 seconds.

Re: XMLSlurper really slow reading/parsing html/xml file

Luis Muniz-2
Is it this line that takes 80s?
def docParser = new XmlParser().parse(urlFile)



Otherwise, maybe you can put some timing println statements in your code; it could be one of the GPath expressions in your loops that takes so long.
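
A minimal sketch of such timing statements, measuring the parse and the GPath search separately. The timed helper and the sample markup are assumptions made for illustration, not code from the thread.

```groovy
import groovy.xml.*   // assumption: lets XmlSlurper resolve on Groovy versions that moved it to groovy.xml

// Hypothetical helper: runs the closure, prints the elapsed time,
// and returns [result, elapsed milliseconds].
def timed(String label, Closure work) {
    long start = System.currentTimeMillis()
    def result = work()
    long elapsed = System.currentTimeMillis() - start
    println "${label}: ${elapsed} ms"
    return [result, elapsed]
}

// Made-up sample; in the thread the input was an XHTML file on disk.
def xhtml = '<html><body><div><table class="report"><tr><td>1</td></tr></table></div></body></html>'

def (doc, parseMs) = timed('parse') { new XmlSlurper().parseText(xhtml) }
def (table, searchMs) = timed('search') {
    doc.depthFirst().find { it.name() == 'table' && it.@class.text().contains('report') }
}
```

If nearly all of the time shows up under 'parse', the GPath expressions can be ruled out.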

On Mon, Apr 11, 2011 at 9:06 AM, mjfan80 <[hidden email]> wrote:
> I tried to delete the doctype declaration (that has a external dtd
> declaration) but nothing, time is still 80 seconds
>
> --
> View this message in context: http://grails.1312388.n4.nabble.com/XMLSlurper-really-slow-reading-parsing-html-xml-file-tp3433305p3441234.html
> Sent from the Grails - user mailing list archive at Nabble.com.

Re: XMLSlurper really slow reading/parsing html/xml file

mjfan80
Yes, it is that line.
I added several println statements to monitor the time, and this is the line
that takes between 70 and 80 seconds,

with either parse(file) or parseText(a string with the file content).


Re: XMLSlurper really slow reading/parsing html/xml file

Luis Muniz-2
In reply to this post by mjfan80
Then I'm afraid the only idea I have is to profile the process, if there really is nothing special in the HTML file.

Another (laborious) approach would be to progressively eliminate complexity from the HTML file and retry the parsing each time, to find out what causes the delay.


Re: XMLSlurper really slow reading/parsing html/xml file

mjfan80
With this HTML file (3 KB) it takes 80 seconds:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        <title>Elenca Chiamate</title>
        <link rel="stylesheet" type="text/css" href="http://192.168.0.162:8080/HelpDeskGwt/CSS/main.css" media="all"/>
        <link rel="stylesheet" type="text/css" href="http://192.168.0.162:8080/HelpDeskGwt/CSS/main_stampa.css" media="print"/>
        <link rel="stylesheet" type="text/css" href="http://192.168.0.162:8080/HelpDeskGwt/CSS/main_screen.css" media="screen"/>
        <link rel="stylesheet" type="text/css" href="http://192.168.0.162:8080/HelpDeskGwt/CSS/landscape.css" media="print"/>
        <meta name="layout" content="pulsantiera_report"/>
<link rel="stylesheet" type="text/css" href="http://192.168.0.162:8080/HelpDeskGwt/CSS/mostraHeader.css" media="print"/>
</head>
<body>
<div class="header once">
<div>
        <div class="titolo_report">Report elencazione chiamate</div>
        <div class="logo"></div>
        <div class="az">
                <Strong>Provincia di Varese - Lotto N° 2 (MANUTENCOOP)</Strong><br/>
                Via dei Tigli 10<br/>
                  GALLARATE  (VA)<br/>
                tel 0331  793610 - fax 0331  791652<br/>
                P.IVA - C.F.:
        </div>
</div>
</div>
<div class="header other">
<div>
        <div class="titolo_report">Report elencazione chiamate</div>
        <div class="logo"></div>
        <div class="az">
                <Strong>Provincia di Varese - Lotto N° 2 (MANUTENCOOP)</Strong><br/>
                Via dei Tigli 10<br/>
                  GALLARATE  (VA)<br/>
                tel 0331  793610 - fax 0331  791652<br/>
                P.IVA - C.F.:
        </div>
</div>
</div>
        <div class="body">
Id ChiamataTipoGrado Diss.StatoComplImpiantoDescr. Imp.UbicazioneNomeCognomeAperturaChiusuraGuastoNoteAssegnazioneNote Ass.
2011/00469/CIDRO-SANITARIAALTORNED_027.AI.I.S. "Gadda-Rosselli"VIA DE ALBERTIS 3GaetanaPellegrino07/02/2011 - 11:08PIAN TERRENO LATO SUD - NEI BAGNI FEMMINILI MANCANO 2 MANIGLIE PASSI RAPIDIORARIO APERTUTA: 07.45-13.45
        </div>
</body>
</html>





I then tried many combinations and found out that the problem is the DOCTYPE declaration. Deleting it (and then running grails clean and clearing the browser cache) resolved the problem.

Thanks
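
If the file itself cannot be edited, the same effect could be approximated by removing the DOCTYPE from the text before it reaches the parser. This is a sketch under assumptions: stripDoctype is a hypothetical helper, and its regular expression only handles a DOCTYPE without an internal subset (no [...] part), which is the form shown in the file above.

```groovy
import groovy.xml.*   // assumption: lets XmlSlurper resolve on Groovy versions that moved it to groovy.xml

// Hypothetical helper: drops the first DOCTYPE declaration from the markup.
// Only covers declarations without an internal subset.
String stripDoctype(String markup) {
    markup.replaceFirst(/(?is)<!DOCTYPE[^>]*>/, '')
}

// Made-up sample using the DOCTYPE from the thread.
def xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><body><div class="body">Report</div></body></html>'''

def cleaned = stripDoctype(xhtml)
def doc = new XmlSlurper().parseText(cleaned)
println doc.body.div.text()
```

Because no DTD is referenced any more, the parser has nothing to download; as before, entities defined only in the DTD would stop resolving.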

Re: XMLSlurper really slow reading/parsing html/xml file

John Thompson
After looking at the API, I wonder if making it not namespace aware would make a difference.

Something like:
def rootNode = new XmlSlurper(false, false).parseText(foo_string)

http://groovy.codehaus.org/api/groovy/util/XmlSlurper.html
JT
jts-blog.com