Can Java read an HTML file?

As Jean mentioned, using a StringBuilder instead of += would be better. But if you're looking for something simpler, Guava, IOUtils, and Jsoup are all good options.
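A minimal sketch of the StringBuilder approach using only the JDK may help make the comparison concrete. The class name HtmlFileReader and the method readHtml are hypothetical names chosen for illustration, and the file is assumed to be UTF-8 encoded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HtmlFileReader {

    // Reads the whole file into one String, appending each line to a
    // StringBuilder instead of concatenating Strings with += in the loop.
    // Assumption: the file is UTF-8 encoded.
    static String readHtml(Path path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so add one back.
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; replace it with a real file.
        System.out.println(readHtml(Paths.get("/path/to/mypage.html")));
    }
}
```

Note that this sketch normalizes every line ending to \n and always ends the result with a newline, which is usually harmless for HTML.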

Example with Guava:

String content = Files.asCharSource(new File("/path/to/mypage.html"), StandardCharsets.UTF_8).read();

Example with IOUtils:

// A URL needs a protocol; "file://" points it at a local file.
InputStream in = new URL("file:///path/to/mypage.html").openStream();
String content;

try {
   content = IOUtils.toString(in, StandardCharsets.UTF_8);
} finally {
   IOUtils.closeQuietly(in);
}

Example with Jsoup:

String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").toString();

or

String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").outerHtml();

NOTES:

Files.readLines() and Files.toString()

These are now deprecated as of Guava release version 22.0 (May 22, 2017). Files.asCharSource() should be used instead, as seen in the example above. (version 22.0 release diffs)

IOUtils.toString(InputStream) and Charsets.UTF_8

Deprecated as of Apache Commons-IO version 2.5 (May 6, 2016). IOUtils.toString should now be passed the InputStream and the Charset, as seen in the example above. Java 7's StandardCharsets should be used instead of Charsets, as seen in the example above. (deprecated Charsets.UTF_8)

HTML is the core of the web: all the pages you see on the internet are HTML, whether they are dynamically generated by JavaScript, JSP, PHP, ASP, or any other web technology. Your browser parses HTML and renders it for you. But what would you do if you needed to parse an HTML document from a Java program to find some elements, tags, or attributes, or to check whether a particular element exists? If you have been programming in Java for some years, I am sure you have done some XML parsing work using parsers like DOM and SAX, but there is also a good chance that you have not done any HTML parsing work. Ironically, there are few instances when you need to parse HTML documents from a core Java application, which doesn't include Servlets and other Java web technologies.

To make matters worse, there is no HTTP or HTML library in the core JDK either, or at least I am not aware of one. That's why, when it comes to parsing an HTML file, many Java programmers have to turn to Google to find out how to get the value of an HTML tag in Java.

When I needed that, I was sure there would be an open-source library that would do it for me, but I didn't know it would be as wonderful and feature-rich as Jsoup. It not only supports reading and parsing HTML documents but also allows you to extract any element from an HTML file, along with its attributes and CSS classes, in jQuery style, and to modify them.

You can probably do anything with an HTML document using Jsoup. In this article, we will parse an HTML file and find the values of the title and heading tags. We will also see examples of parsing HTML from a file and of downloading it from a URL, by parsing Google's home page in Java.

What is the Jsoup Library?

Jsoup is an open-source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. Jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers like Chrome and Firefox do. Here are some of the useful features of the Jsoup library:

  •     Jsoup can scrape and parse HTML from a URL, file, or string
  •     Jsoup can find and extract data, using DOM traversal or CSS selectors
  •     Jsoup allows you to manipulate the HTML elements, attributes, and text
  •     Jsoup can clean user-submitted content against a safe whitelist, to prevent XSS attacks
  •     Jsoup can also output tidy HTML

Jsoup is designed to deal with the different kinds of HTML found in the real world, from properly validated HTML to incomplete, invalid collections of tags. One of the core strengths of Jsoup is that it's very robust.

HTML Parsing in Java using JSoup

In this Java HTML parsing tutorial, we will see three different examples of parsing and traversing HTML documents in Java using Jsoup. In the first example, we will parse an HTML String that contains all its tags in the form of a String literal in Java.

In the second example, we will download our HTML document from the web, and in the third example, we will load our own sample HTML file, login.html, for parsing. This file is a sample HTML document that contains a title tag and a div in the body that contains an HTML form.

It has input tags to capture the username and password, and submit and reset buttons for further action. It's proper HTML that can be validated, i.e. all tags and attributes are properly closed. Here is what our sample HTML file looks like:

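<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
        <title>Login Page</title>
    </head>
    <body>
        <div>
            <form>
                Username : <input type="text">
                Password : <input type="password">
                <input type="submit">
                <input type="reset">
            </form>
        </div>
    </body>
</html>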

HTML parsing is very simple with Jsoup: all you need to do is call the static method Jsoup.parse() and pass your HTML String to it. Jsoup provides several overloaded parse() methods to read HTML from a String, a File, a base URI, a URL, and an InputStream.
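As a rough illustration of the File overload, the sketch below assumes the Jsoup library is on the classpath. The class name ParseWithCharset, the helper parseFile, and the temporary file are hypothetical and exist only to keep the example self-contained. The second argument to Jsoup.parse(File, String) names the character encoding used to decode the file.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseWithCharset {

    // Parses an HTML file, decoding it with the named character encoding.
    static Document parseFile(File file, String charsetName) throws IOException {
        return Jsoup.parse(file, charsetName);
    }

    public static void main(String[] args) throws IOException {
        // Write a small ISO-8859-1 encoded file so the sketch runs on its own.
        File file = File.createTempFile("login", ".html");
        file.deleteOnExit();
        String html = "<html><head><title>Caf\u00e9 Login</title></head><body></body></html>";
        Files.write(file.toPath(), html.getBytes(StandardCharsets.ISO_8859_1));

        // Passing the matching charset name keeps the accented character intact.
        Document doc = parseFile(file, "ISO-8859-1");
        System.out.println(doc.title());
    }
}
```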

You can also specify the character encoding, so that HTML files that are not in UTF-8 are read correctly. Here is the complete list of HTML parse methods from the Jsoup library.

The parse(String html) method parses the input HTML into a new Document. In Jsoup, Document extends Element, which extends Node. TextNode also extends Node. As long as you pass in a non-null string, you're guaranteed to have a successful, sensible parse, with a Document containing (at least) a head and a body element.

Once you have a Document, you can get the data you want by calling appropriate methods in Document and its parent classes Element and Node.
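The following is a small sketch of parsing a String and reading data back from the resulting Document, assuming the Jsoup library is on the classpath. The class name ParseHtmlString and the sample markup are hypothetical, loosely modeled on the login page described earlier.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseHtmlString {

    // Hypothetical sample markup used only for this illustration.
    static final String HTML =
            "<html><head><title>Login Page</title></head>"
          + "<body><h1>Sign in</h1>"
          + "<form><input type=\"text\"><input type=\"password\"></form>"
          + "</body></html>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // Document.title() returns the text of the <title> element.
        System.out.println("Title: " + doc.title());

        // select() takes a CSS selector; first() returns the first match.
        Element heading = doc.select("h1").first();
        System.out.println("Heading: " + heading.text());

        // Iterate over every input inside the form and read an attribute.
        for (Element input : doc.select("form input")) {
            System.out.println("Input type: " + input.attr("type"));
        }

        // Even a bare text fragment is wrapped in html, head, and body elements.
        Document fragment = Jsoup.parse("Hello");
        System.out.println("Body text: " + fragment.body().text());
    }
}
```

The select() call is inherited from Element, which is why it is available on the Document itself as well as on any element found inside it.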

HelloWorldApp Document doc = Jsoup.

How do I read an HTML file?

HTML: Viewing HTML files. You can view an HTML file that is under preparation by opening it in a browser. You can do this without having to move the file to ~/.www, change permissions, etc. Moreover, most browsers will allow you to make changes as well (on a WYSIWYG basis).
