Implementing the Article Struct: Part 1

As always, before beginning any new project, take some time to ask yourself these questions:

Questions

  1. What is the purpose of my program?
  2. What information do I want to collect?
  3. Where will I get that information?
  4. How will I get that information?
  5. What output do I want from that information?

Answers

  1. The purpose of this program is to scrape articles from Phoronix's homepage.
  2. The information we want to collect is the title, summary, details and links of each article.
  3. The information will come directly from the HTML of the webpage.
  4. The Hyper crate provides tools for downloading the HTML, and the Select crate provides tools for extracting the information from it.
  5. The information will be printed directly to the terminal.

Writing the Article Struct

Now that we know what we are trying to accomplish, we need to create a struct to hold all of the data we need from each article: the title, summary, details and links.

struct Article {
    title:   String,
    link:    String,
    details: String,
    summary: String,
}

Now that we have our struct, we can start implementing functions for it. Let's start by implementing a function for obtaining a Vec of Articles containing just the titles. Comment out all the fields in the struct except for title so that the compiler does not warn about unused fields.

Importing Select's Features

Before we begin using the select crate, it's time to add some use statements for the particular features that we want to use from the crate. Add this beneath the extern crate lines:

use select::document::Document;
use select::predicate::{Class,Name};
use select::node::Node;

The purpose of this is primarily to reduce how many key presses we need to access the above features inside the select crate. It's easier to type Document::from_str() than it is to type select::document::Document::from_str().
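The effect of a use statement can be seen without the select crate at all. The following is a minimal sketch using HashMap from the standard library; the short_path() and full_path() functions are hypothetical names chosen only for this illustration.

```rust
use std::collections::HashMap;

// With the `use` line above, the short name HashMap is enough.
fn short_path() -> HashMap<String, u32> {
    let mut scores = HashMap::new();
    scores.insert(String::from("example"), 1);
    scores
}

// Without it, the full path has to be spelled out every time.
fn full_path() -> std::collections::HashMap<String, u32> {
    let mut scores = std::collections::HashMap::new();
    scores.insert(String::from("example"), 1);
    scores
}

fn main() {
    // Both functions build the same type, so the results can be compared directly.
    assert_eq!(short_path(), full_path());
    println!("{:?}", short_path());
}
```

Both functions return exactly the same type; the use statement only changes how much of the path has to be typed.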

Implementing Article's Methods

There are two methods that we are going to implement for our Article struct: get_articles() and new(). The get_articles() method will create a Vec of Articles by mapping each node to an Article with the new() method and collecting the results as a Vec<Article>.

In this context, a node is a specific HTML tag together with its contents. As this program collects a list of nodes that are individual <article> tags, each <article> node will be passed to the new() method, which collects the data from it and returns it as an Article.

Example HTML Source

<article>                     // Article Node
  <a href="URL">Title</a>     // Collect URL and Title
  <div class="details"></div> // Collect Details
  <p>Summary</p>              // Collect Summary
</article>

Article::get_articles()

impl Article {
    fn get_articles() -> Vec<Article> {
        Document::from_str(open_testing())  // Open the HTML document
            .find(Name("article")).iter()    // Make an Iterator over <article> nodes
            .map(|node| Article::new(&node)) // Map each article to an Article struct
            .collect()                       // Return it as a Vec<Article>
    }
}

In the function above, a new Document is created from the &str containing the HTML. This document allows us to call find() to select every tag in the Document whose name is article. If you look at the HTML source directly, you will notice that each article is contained inside its own <article> tag. Hence, we collect nodes only from those specific tags and map each one through a method we have yet to create: new().
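The same iterate, map and collect pattern can be sketched without the select crate. In this minimal example, plain &str values stand in for <article> nodes; the Headline struct and its from_raw() and from_all() functions are hypothetical names used only for illustration, not part of the program we are building.

```rust
// A hypothetical stand-in for Article that holds only a title.
#[derive(Debug, PartialEq)]
struct Headline {
    title: String,
}

impl Headline {
    // Plays the role of Article::new(): builds one struct from one raw item.
    fn from_raw(raw: &str) -> Headline {
        Headline { title: raw.trim().to_string() }
    }

    // Plays the role of Article::get_articles().
    fn from_all(raw_items: &[&str]) -> Vec<Headline> {
        raw_items.iter()                        // Make an Iterator over the raw items
            .map(|raw| Headline::from_raw(raw)) // Map each item to a Headline
            .collect()                          // Return the results as a Vec<Headline>
    }
}

fn main() {
    let headlines = Headline::from_all(&["  First title ", "Second title"]);
    for headline in &headlines {
        println!("{}", headline.title);
    }
}
```

Because the return type of from_all() is declared as Vec<Headline>, collect() knows which kind of collection to build without any extra annotation.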

Article::new()

Now let's implement Article::new():

impl Article {
    fn get_articles() -> Vec<Article> {
        Document::from_str(open_testing())
            .find(Name("article")).iter()
            .map(|node| Article::new(&node))
            .collect()
    }
    fn new(node: &Node) -> Article {
        let header = node.find(Name("a")).first().unwrap(); // Obtain the header from the first <a>
        Article{ title: header.text() }                     // Map the header's text to the struct
    }
}

Again, by looking at the HTML, we can find that the title is contained within an <a> tag. It is the first <a> tag inside the <article> node, so we obtain only the first item using first(). This returns an Option, which will be None if no <a> tag is found. Here we ignore that possibility and take the value with unwrap(), which will panic if the tag is missing. It is generally better not to ignore errors, though.

Once the information is collected, it is returned as a new Article, with header.text() assigned as the title. The text() function converts a node into a String, discarding the markup and other information we do not need. Without it the code will not compile, because the title field expects a String.
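As a rough sketch of handling a missing value instead of calling unwrap(), the example below uses a plain slice of &str in place of the nodes returned by find(). The first_title() and title_or_placeholder() functions and the "(untitled)" fallback are hypothetical and are not part of the select crate.

```rust
// Hypothetical helper: returns the first title if there is one.
fn first_title(titles: &[&str]) -> Option<String> {
    titles.first().map(|title| title.to_string())
}

// Fall back to a placeholder instead of panicking when nothing is found.
fn title_or_placeholder(titles: &[&str]) -> String {
    match first_title(titles) {
        Some(title) => title,
        None => String::from("(untitled)"),
    }
}

fn main() {
    println!("{}", title_or_placeholder(&["Example title"]));
    println!("{}", title_or_placeholder(&[]));
}
```

Matching on the Option makes the missing case explicit, so an article without a title would produce a placeholder rather than stopping the whole program.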

Testing the New Code

Now we can go back to the main() function and modify it to use our new code:

fn main() {
    let phoronix_articles = Article::get_articles();
    for article in phoronix_articles {
        println!("{}", article.title);
    }
}

Try to compile and run this application with cargo run. Your source code should now look like this:

main.rs

// extern crate hyper;
extern crate select;
use select::document::Document;
use select::predicate::{Class,Name};
use select::node::Node;

fn main() {
    let phoronix_articles = Article::get_articles();
    for article in phoronix_articles {
        println!("{}", article.title);
    }
}

fn open_testing() -> &'static str {
    include_str!("phoronix.html")
}

struct Article {
    title:   String,
//    link:    String,
//    details: String,
//    summary: String,
}

impl Article {
    fn get_articles() -> Vec<Article> {
        Document::from_str(open_testing())
            .find(Name("article")).iter()
            .map(|node| Article::new(&node))
            .collect()
    }
    fn new(node: &Node) -> Article {
        let header = node.find(Name("a")).first().unwrap();
        Article{ title: header.text() }
    }
}