How to parse html files?

Postby faisal___ » Thu Feb 04, 2010 3:47 pm

Hello dear Android Developers,

I am having difficulty parsing an HTML file in Android. I keep getting a SAX exception thrown at the 'openStream()' point of the code. I suspect that this is due to some unclosed tags, i.e. HTML that is not well-formed XML.

All I want to do is to read the value/string of the tag that looks like this:

    <h2 id="the_element">Value I need</h2>


The error I get from the catch block:
org.xml.sax.SAXParseException:unterminated entity ref (position:ENTITY_REF null@750:29 in java.io.InputStreamReader@43784e08
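
An "unterminated entity ref" error usually points at a bare & in the page: a DOM/SAX parser expects well-formed XML, and ordinary HTML often is not. The following is a minimal sketch of that behaviour using the standard JDK parser rather than the Android one, so the exact error text will differ. The class and method names (StrictXmlCheck, isWellFormedXml) and the sample markup are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class StrictXmlCheck {
    /** Returns true if the markup parses as well-formed XML, false if the parser rejects it. */
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Raised for markup that is not well-formed XML (bare '&', unclosed tags, ...).
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A fragment that happens to be well-formed XML parses fine.
        System.out.println(isWellFormedXml("<h2 id=\"the_element\">Value I need</h2>")); // true
        // Typical HTML: a bare ampersand and an unclosed <br> are both rejected.
        System.out.println(isWellFormedXml("<p>Tom & Jerry<br></p>")); // false
    }
}
```

If a check like this fails for the page in question, the page has to go through a lenient HTML parser first instead of being fed directly to the XML parser.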


Part of my code (unrelated catch blocks omitted):

    try {
        URL myurl = new URL("http://myurladdress.com/");
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = dbf.newDocumentBuilder();
        Document dom = builder.parse(myurl.openStream());
        String value = dom.getElementById("the_element").getNodeValue();

        tlogininfo.setText(value);
    } catch (Exception e) {
        // The exception comes from the DOM parser: Document dom = builder.parse(myurl.openStream());
        tlogininfo.setText("Exception Issue: " + e.toString());
    }
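
Separately from the parse error, two details of the lookup line are worth checking once the document does parse. In the DOM API, getNodeValue() on an element returns null (the text lives in child nodes), and getElementById() may return null in some parser implementations unless the id attribute is declared as an ID. The sketch below, which assumes a standard JDK parser and uses made-up names (DomTextLookup, textOfElementWithId), looks the element up by tag name and attribute instead and reads its text with getTextContent().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class DomTextLookup {
    /** Returns the text of the first tagName element whose id attribute equals id, or null. */
    static String textOfElementWithId(Document dom, String tagName, String id) {
        NodeList candidates = dom.getElementsByTagName(tagName);
        for (int i = 0; i < candidates.getLength(); i++) {
            Element element = (Element) candidates.item(i);
            if (id.equals(element.getAttribute("id"))) {
                return element.getTextContent(); // concatenated text of all descendants
            }
        }
        return null;
    }

    /** Parses a well-formed XML string; wraps checked exceptions for brevity. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document dom = parse("<body><h2 id=\"the_element\">Value I need</h2></body>");
        Element h2 = (Element) dom.getDocumentElement().getFirstChild();
        System.out.println(h2.getNodeValue()); // null: an element has no node value
        System.out.println(textOfElementWithId(dom, "h2", "the_element")); // Value I need
    }
}
```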


I hope that you can help and I greatly appreciate it.
Faisal...
faisal___
Freshman
Posts: 8
Joined: Thu Feb 04, 2010 3:27 pm
Location: Swindon, UK

[Solved]

Postby faisal___ » Fri Feb 05, 2010 3:26 pm

I found a solution involving an external library. Since I see across the net (and within this forum) that others have been trying to do the same, I will post my solution once I have cleaned up my code.

In brief, I used the HtmlCleaner library (from HTMLCleaner - HTML Parser). There are other ways to do it, for example with the native libraries, but they are trickier.

Meanwhile, before I post my findings, a tutorial that may help:
Using HTMLCleaner Tutorial

The Code...

Postby faisal___ » Thu Feb 18, 2010 8:18 pm

Hello guys,

Here is part of the code to parse HTML. I used the HtmlCleaner Java library.

First of all, make sure that you add the htmlcleanerXXX.jar file as a referenced library in your project.

Then make sure that these imports exist in your PROJECT.java file:

    import org.htmlcleaner.*;

    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLConnection;



Within your onCreate(Bundle savedInstanceState) method, i.e. your own code, here is how you parse your HTML:

Initialise the parser and declare the URL variable:

    HtmlCleaner parser = new HtmlCleaner(); // initialise the parser
    URL url; // declaration of the url variable


Within a try block:

    url = new URL("http://www.THESITEYOUWANT.com"); // point to the website you want to retrieve
    URLConnection conn = url.openConnection(); // open a connection to the website

    // TagNode is part of the HtmlCleaner library. It represents the HTML tags as an XML-like node tree.
    // So, retrieve the website as a TagNode tree:
    TagNode node = parser.clean(new InputStreamReader(conn.getInputStream()));

    // Extract the subtree that you require. In this case, the tag whose "id" attribute has the value "value"
    // (the two booleans request a recursive, case-sensitive search):
    TagNode node2 = node.findElementByAttValue("id", "value", true, true);

    // Convert the retrieved data to a string:
    String myValue = node2.getText().toString();

    // Display it (a TextView needs a Context; "this" is the enclosing Activity):
    TextView tv = new TextView(this);
    tv.setText(myValue);
    setContentView(tv);
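
One thing to watch in the snippet above: new InputStreamReader(stream) without a charset argument decodes with the platform default, which may not match the page. A possible refinement, sketched here with only the standard JDK, is to take the charset from the Content-Type header. The helper name (ContentTypeCharset.charsetOf) and the UTF-8 fallback are assumptions for illustration, not part of HtmlCleaner.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {
    /** Extracts the charset parameter from a Content-Type value; falls back to UTF-8 (an assumption). */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    // Drop the "charset=" prefix and any surrounding quotes.
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported name: use the fallback below
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1")); // ISO-8859-1
        System.out.println(charsetOf("text/html"));                     // UTF-8
    }
}
```

With such a helper, the reader could be created as new InputStreamReader(conn.getInputStream(), ContentTypeCharset.charsetOf(conn.getContentType())) before passing it to parser.clean(...).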


Now, along with the try block you will need the following catch blocks (tlogininfo is the TextView from my first post):

    catch (MalformedURLException e) {
        tlogininfo.setText("Malformed URL");
    } catch (HtmlCleanerException e) {
        tlogininfo.setText("TagNode Array Issue: " + e);
    } catch (Exception e) {
        tlogininfo.setText("Other Issue: " + e);
    }
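
The order of these catch blocks matters: the specific exceptions have to come before the general catch (Exception e), otherwise the compiler rejects the later blocks as unreachable. A minimal, JDK-only sketch of the same pattern follows; the class and method names (CatchOrderDemo, classify) are made up, and it only constructs the URL without opening a connection.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class CatchOrderDemo {
    /** Returns a short status message for the given address, mirroring the catch order above. */
    static String classify(String address) {
        try {
            URL url = new URL(address); // throws MalformedURLException for an unknown or missing protocol
            return "OK: " + url.getHost();
        } catch (MalformedURLException e) {
            // Most specific case first.
            return "Malformed URL";
        } catch (Exception e) {
            // Anything else falls through to the general handler.
            return "Other Issue: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("http://example.com/")); // OK: example.com
        System.out.println(classify("no-protocol-here"));    // Malformed URL
    }
}
```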



Hope it helps ...

Re: How to parse html files?

Postby cyroz » Thu Sep 16, 2010 2:50 pm

I'm having problems with the example that you showed us here.

My full code:

    package com.helloparser.org;

    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLConnection;

    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.HtmlCleanerException;
    import org.htmlcleaner.TagNode;

    import android.app.Activity;
    import android.os.Bundle;
    import android.widget.TextView;

    public class HelloParser extends Activity {

        /** Called when the activity is first created. */
        @Override
        public void onCreate(Bundle savedInstanceState) {

            HtmlCleaner parser = new HtmlCleaner(); // initialise the parser
            URL url; // declaration of url parameter

            super.onCreate(savedInstanceState);
            setContentView(R.layout.main);

            try {
                url = new URL("http://www.THESITEYOUWANT.com"); // point to the website you want to retrieve
                URLConnection conn = url.openConnection(); // open a connection to the website

                // TagNode type is part of the HTMLCleaner Library. It retrieves HTML tags in an XML TagNode format
                // So, retrieve the website in an XML TagNode format
                TagNode node = parser.clean(new InputStreamReader(conn.getInputStream()));

                // Extract the subtree that you require. In this case Tag with attribute "id" with value "value"
                TagNode node2 = node.findElementByAttValue("id", "value", true, true);

                // Convert the retrieved data to string:
                String myValue;
                myValue = node2.getText().toString();

                // Display it:
                TextView tv = new TextView();
                tv.setText(myValue);
                setContent(tv);
            } catch (MalformedURLException e) {
                tlogininfo.setText("Malformed URL");
            } catch (HtmlCleanerException e) {
                tlogininfo.setText("TagNode Array Issue: " + e);
            } catch (Exception e) {
                tlogininfo.setText("Other Issue: " + e);
            }
        }
    }



When I run the program, it force closes. I have already tried your example, but there are some errors, such as:

1. tlogininfo.setText: Eclipse gives me the options to create a local variable, create a field, or change it to TagInfo (org.htmlcleaner).
2. TextView tv= new TextView(); I need to add an argument to match a TextView constructor.
3. setContent(tv); needs to be changed to setContentView.

I hope you can help me. Thank you.
cyroz
Once Poster
Posts: 1
Joined: Fri Aug 27, 2010 1:52 pm

Re: How to parse html files?

Postby faisal___ » Sun Mar 20, 2011 12:52 am

Yes, that's true. But the code above is only an example of how to use the HTML parser; it neglects the basics related to setText and TextView.

Re: How to parse html files?

Postby mapiko » Thu Apr 21, 2011 2:47 pm

mapiko
Senior Developer
Posts: 106
Joined: Sat May 15, 2010 5:19 pm
