25.2.10

How to parse ugly HTML in Java

I needed a simple, fast way to parse ugly HTML pages in Java and return JSON output from a servlet.
After a little searching I found a good way to do this.
Libraries needed (in order of use): HttpComponents, JTidy, Saxon and json-lib.

That's a lot of dependencies, but they get the job done.
First of all we fetch the page with HttpClient from HttpComponents.

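// resourcePath is the URL of the page to fetch
DefaultHttpClient client = new DefaultHttpClient();
HttpGet get = new HttpGet(resourcePath);
HttpResponse httpresponse = client.execute(get);
// Raw response body as a stream; JTidy can read it directly
InputStream stream = httpresponse.getEntity().getContent();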

At this point we have the raw source of the requested resource as a stream.

Now it's JTidy's turn: it does the hard work of turning the ugliest HTML on the whole internet into a clean, well-formed, parsable XHTML document.

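Tidy tidy = new Tidy();
// Suppress JTidy's progress messages and markup warnings
tidy.setQuiet(true);
tidy.setShowWarnings(false);
// Passing null as the output stream skips the pretty-printed copy; we only want the DOM
Document doc = tidy.parseDOM(stream, null);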


The Document is an org.w3c.dom.Document, navigable with DOM methods (getElementsByTagName, getElementById...), but I prefer XPath; Saxon is our boy.

XPath xPath = XPathFactory.newInstance().newXPath();
// xPathExpression is the XPath query as a String, e.g. "//a"
XPathExpression expr = xPath.compile(xPathExpression);
NodeList nodelist = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
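
Once we have the NodeList, we usually want to walk it and pull out attributes or text. Here is a minimal sketch of that loop. To keep it runnable with the JDK alone it makes a few assumptions: a tiny, already well-formed page (the SAMPLE string) stands in for JTidy's output, the JDK's built-in XPath engine is used instead of Saxon, and the class and method names (LinkExtractor, parse, hrefs) are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class LinkExtractor {
    // Hypothetical stand-in for the document JTidy would hand us.
    static final String SAMPLE = "<html><body>"
            + "<a href=\"/first\">First</a>"
            + "<a href=\"/second\">Second</a>"
            + "</body></html>";

    // Parses a well-formed XHTML string into an org.w3c.dom.Document.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    // Selects every <a> that has an href and returns "href -> link text" for each one.
    static List<String> hrefs(Document doc) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.compile("//a[@href]")
                .evaluate(doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // The expression only matches elements, so the cast is safe here.
            Element a = (Element) nodes.item(i);
            result.add(a.getAttribute("href") + " -> " + a.getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (String line : hrefs(parse(SAMPLE))) {
            System.out.println(line);
        }
    }
}
```

With a real page, the Document returned by tidy.parseDOM can be passed to hrefs in place of the parsed sample.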

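We can choose different return types (BOOLEAN, STRING, NUMBER, NODE, NODESET); just change the XPathConstants constant and cast the result accordingly. XPathConstants also defines DOM_OBJECT_MODEL, but that is the URI of the DOM object model, not a return type.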
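
To make the "change the constant and cast the result" idea concrete, here is a small sketch that runs four expressions with four different return types. The same assumptions apply as for any JDK-only demo: the SAMPLE page is a made-up, already well-formed stand-in for JTidy's output, the JDK's default XPath implementation is used rather than Saxon, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class XPathReturnTypes {
    // Hypothetical stand-in for a tidied page.
    static final String SAMPLE = "<html><body><h1>Title</h1>"
            + "<p>one</p><p>two</p><p>three</p></body></html>";

    static final XPath XPATH = XPathFactory.newInstance().newXPath();

    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    // STRING: the string value of the first matching node.
    static String title(Document doc) throws Exception {
        return (String) XPATH.evaluate("//h1", doc, XPathConstants.STRING);
    }

    // NUMBER: XPath numbers always come back as Double.
    static double paragraphCount(Document doc) throws Exception {
        return (Double) XPATH.evaluate("count(//p)", doc, XPathConstants.NUMBER);
    }

    // BOOLEAN: true if the node set is non-empty.
    static boolean hasTable(Document doc) throws Exception {
        return (Boolean) XPATH.evaluate("boolean(//table)", doc, XPathConstants.BOOLEAN);
    }

    // NODE: a single node, here the first <p> in document order.
    static Node firstParagraph(Document doc) throws Exception {
        return (Node) XPATH.evaluate("(//p)[1]", doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(title(doc));
        System.out.println(paragraphCount(doc));
        System.out.println(hasTable(doc));
        System.out.println(firstParagraph(doc).getTextContent());
    }
}
```

Casting to the wrong type fails only at runtime with a ClassCastException, so the constant and the cast have to be kept in sync by hand.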

The final step is converting the POJO into a JSON string.
// indentFactor is the number of spaces used for each indentation level
String jsonString = JSONObject.fromObject(myPojo).toString(indentFactor);

That's all.
