Follow this Java screen scraper example to aggregate content

With enterprise applications, it's not unusual to aggregate content published on live sites. As such, it's a good idea to develop a level of familiarity with one of the popular Java screen scraper libraries.

In this step-by-step Java screen scraper tutorial, we'll take a look at a framework named JSoup. The JSoup library has been around for a while, and it really hasn't changed much in recent years. It remains a popular and reliable server-side option.

In this example, we will use JSoup to pull some information off the popular GitHub interview questions and answers article published on this site. Extracted information will include the author's name, the display title and a list of all of the embedded links within the page. Here's how to accomplish this.

Step 1: Link to the JSoup library

To work with JSoup, first, create a basic Maven project, and add the JSoup dependency:

<!-- Java screen scraper POM -->
<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.11.3</version>
</dependency>

Step 2: Parse a URL

With the Maven dependency added, the next step is to have JSoup parse the specified URL to generate a JSoup Document object. For this example, we will name the returned Document object pageToScrape.

The code can be placed in a microservice, a servlet, a RESTful web component or even a JavaServer Page, but for this example, we will keep things simple and use a class with a runnable main method.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JavaScreenScraper {
  public static void main(String[] args) throws Exception {
    String url = "https://www.theserverside.com/video/Tough-sample-Git-and-GitHub-interview-questions-and-answers";
    Document pageToScrape = Jsoup.connect(url).get();
  }
}
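
Note that Jsoup.connect() returns a Connection object, so request settings can be chained in before get() is called. Here's a minimal sketch, assuming the defaults need tweaking for the target site; the user agent string and timeout value are purely illustrative:

// Optional: tune the request before fetching the page
Document pageToScrape = Jsoup.connect(url)
    .userAgent("Mozilla/5.0")  // some sites reject requests that lack a browser-like user agent
    .timeout(10000)            // wait up to ten seconds for a response
    .get();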

Step 3: Scrape the page title

With the webpage now held as a Document object, it's possible to perform any number of inspections on the page. One such inspection is to get the text displayed in the browser's title bar or tab when the page renders. You can achieve this with a call to the pageToScrape object's title() method.

System.out.printf("The title is: %s.\n", pageToScrape.title());
[Screenshot: JSoup will screen scrape the page title.]

Step 4: Page scrape with CSS selectors

JSoup provides a few built-in methods, such as title() and body(). These enable developers to work with page elements. For more specific queries on the page, you'll need CSS selectors.
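
For instance, the body() method returns the page's <body> element, and a call to that element's text() method yields all of the visible text on the page. A quick sketch:

// Print all of the visible text inside the page's <body> element
Element pageBody = pageToScrape.body();
System.out.println(pageBody.text());

For anything more targeted, though, a CSS selector is the way to go.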

For example, the CSS selector for the spot where the author's name is displayed is:

#author > div > a

Given this information, you can have JSoup extract this element from the page with the selectFirst method. This returns an object of type Element representing the anchor tag. Invoke the text() method to obtain the text displayed by the anchor tag:

Element authorLink = pageToScrape.selectFirst("#author > div > a");
System.out.printf("The author: %s.\n", authorLink.text());

How to find a CSS selector

Every element on an HTML page can be identified through a CSS selector. However, it can be extremely frustrating to figure out the selector for a given element just by inspecting the HTML source code. Fortunately, Google Chrome provides tools that make it easy to identify an element's CSS selector.

To find an element's CSS selector value, simply right-click on the page element of interest, and choose Inspect. Then, left-click the three dots that appear adjacent to the HTML in the Elements tab. Finally, choose Copy > Copy selector, and the CSS selector of the element is added to your copy-and-paste clipboard.

[Screenshot: how to find the CSS selector of an element to screen scrape.]
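
Keep in mind that the selectors Chrome copies tend to be long and tightly coupled to the page's current structure. JSoup also understands shorter selectors and offers lookup methods of its own, so a looser query is often more resilient. A hedged sketch, assuming the author link sits somewhere inside the element with the id author:

// A looser query: any anchor tag nested under the element with id "author"
Element authorAnchor = pageToScrape.selectFirst("#author a");

// Or fetch the container by id first and drill down from there
Element authorContainer = pageToScrape.getElementById("author");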

Step 5: Multiple screen scraping results

If you're interested in more than one occurrence of an element on a page, you can use the Document object's more generic select() method. Like selectFirst(), it takes a CSS selector as an argument; however, instead of returning a single Element, it returns a collection of them.

For this Java screen scraper example, we will look at all of the anchor tags on the page and print the URL, or href attribute, of each of them. The code is:

Elements links = pageToScrape.select("a[href]");
for (Element link : links) {
    System.out.print("\nLink: " + link.attr("href"));
}

When put together, the JavaScreenScraper class looks as follows:

package com.mcnz.design;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/* Java screen scraper example code */
public class JavaScreenScraper {

  public static void main(String[] args) throws Exception {

    String url = "https://www.theserverside.com/video/Tough-sample-Git-and-GitHub-interview-questions-and-answers";

    Document pageToScrape = Jsoup.connect(url).get();
    System.out.printf("The title is: %s.\n", pageToScrape.title());

    Element authorLink = pageToScrape.selectFirst("#author > div > a");
    System.out.printf("The author: %s.\n", authorLink.text());

    Elements links = pageToScrape.select("a[href]");
    for (Element link : links) {
      System.out.print("\nLink: " + link.attr("href"));
    }
  }
}

The Java screen scraper example code can then be run as a stand-alone Java class to generate the following output:

The title is: Tough sample GitHub interview questions and answers for job candidates.
The author: Cameron McKenzie.
 
Link: https://www.theserverside.com/video/Tips-and-tricks-on-how-to-use-Jenkins-Git-Plugin
Link: https://www.theserverside.com/video/Tackle-these-10-sample-DevOps-interview-questions-and-answers
Link: https://www.theserverside.com/video/A-RESTful-APIs-tutorial-Learn-key-web-service-design-principles

The list of links goes on for over 100 iterations, so the output is truncated at three.
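
If you would rather cap the number of links in code instead of trimming the output by hand, a small guard in the loop does the trick. A minimal sketch:

// Print only the first three links, since the full list runs well past 100
int printed = 0;
for (Element link : links) {
  if (printed++ >= 3) {
    break;
  }
  System.out.print("\nLink: " + link.attr("href"));
}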

As you can see, it is fairly easy to aggregate data from live webpages with a Java screen scraper. A number of JavaScript libraries perform a similar task, but for processing-intensive applications, a server-side technology is preferred. If a Java screen scraper is what you need, give JSoup a try.

The source code for this example can be found on GitHub.
