Implementing the Article Struct: Part 1
As always, before beginning any new project, take some time to ask yourself these questions:
Questions
- What is the purpose of my program?
- What information do I want to collect?
- Where will I get that information?
- How will I get that information?
- What output do I want from that information?
Answers
- The purpose of this program is to scrape articles from Phoronix's homepage.
- The information we want to collect is the title, summary, details and links of each article.
- The information will come directly from the HTML of the webpage.
- The Hyper crate provides tools for downloading the HTML, and the Select crate provides tools for extracting the information from the HTML.
- The information will be printed directly to the terminal.
Writing the Article Struct
Now that we know what we are trying to accomplish, we need to create a struct to hold all of the data we need from each article: the title, summary, details and links.
struct Article {
    title: String,
    link: String,
    details: String,
    summary: String,
}
Now that we have our struct, we can start implementing functions for it. Let's start with a function that returns a Vec of Articles containing just the titles. Comment out every field in the struct except title to keep the compiler from generating unused-field warnings.
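As an alternative to commenting fields out, the compiler's dead_code lint can be silenced on the struct itself. The following is a minimal sketch under that assumption and is not part of the original program; the sample_article helper and its placeholder values are hypothetical.

```rust
// Sketch: keep all four fields but silence the "field is never read" warning.
// `#[allow(dead_code)]` is a standard rustc lint attribute; `sample_article`
// is a hypothetical helper used only for illustration.
#[allow(dead_code)]
struct Article {
    title: String,
    link: String,
    details: String,
    summary: String,
}

fn sample_article() -> Article {
    Article {
        title: String::from("Example title"),
        link: String::new(),
        details: String::new(),
        summary: String::new(),
    }
}

fn main() {
    let article = sample_article();
    // Only `title` is read; the other fields would otherwise trigger warnings.
    println!("{}", article.title);
}
```

The trade-off is that every field must then be given a value whenever an Article is constructed, which is why this chapter takes the simpler route of keeping only title for now.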
Importing Select's Features
Before we begin using the select crate, it's time to add some use statements for the particular features that we want from it. Add this beneath the extern crate lines:
use select::document::Document;
use select::predicate::{Class,Name};
use select::node::Node;
The purpose of this is primarily to reduce how many key presses we need to access the above features inside the select crate. It's easier to type Document::from_str() than it is to type select::document::Document::from_str().
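The same effect can be seen with any crate, including the standard library. The sketch below uses std::collections::HashMap as a stand-in for the select types so that it compiles without dependencies; the function names short_path and full_path are made up for illustration.

```rust
// Sketch using the standard library instead of the select crate, so it
// compiles without any dependencies.
use std::collections::HashMap;

// With the `use` line above, the short name is enough.
fn short_path() -> HashMap<String, u32> {
    HashMap::new()
}

// Without relying on the import, the full path must be spelled out each time.
fn full_path() -> std::collections::HashMap<String, u32> {
    std::collections::HashMap::new()
}

fn main() {
    let mut a = short_path();
    let mut b = full_path();
    a.insert(String::from("articles"), 1);
    b.insert(String::from("articles"), 1);
    // Both functions return the very same type, so the maps can be compared.
    assert_eq!(a, b);
}
```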
Implementing Article's Methods
There are two methods that we are going to implement for our Article struct: get_articles() and new(). The get_articles() method will create a Vec of Articles by mapping key data from each node to an Article using the new() method and finally collecting the results as a Vec<Article>.
A node, in this usage, is the contents of a specific HTML tag. As this program collects a list of nodes which are individual <article> tags, each <article> node will be passed on to the new() method, which collects the data from it and returns it as an Article.
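The map-and-collect pattern described above can be tried out without any HTML at all. In this sketch plain strings stand in for nodes; RAW_TITLES is made-up sample data rather than real Phoronix content, and the simplified new() takes a &str instead of a select node.

```rust
// Sketch of the map-and-collect pattern with plain strings standing in for
// HTML nodes. `RAW_TITLES` is made-up sample data, not real Phoronix content.
const RAW_TITLES: [&str; 3] = ["First headline", "Second headline", "Third headline"];

struct Article {
    title: String,
}

impl Article {
    // Build one Article from one "node" (here just a &str).
    fn new(node: &str) -> Article {
        Article { title: node.to_string() }
    }

    // Map every "node" to an Article and collect the results into a Vec.
    fn get_articles() -> Vec<Article> {
        RAW_TITLES.iter()
            .map(|node| Article::new(node))
            .collect()
    }
}

fn main() {
    for article in Article::get_articles() {
        println!("{}", article.title);
    }
}
```

Because the return type is declared as Vec<Article>, collect() knows which collection to build without a further type annotation.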
Example HTML Source
<article>                       // Article Node
    <a href="URL">Title</a>     // Collect URL and Title
    <div class="details"></div> // Collect Details
    <p>Summary</p>              // Collect Summary
</article>
Article::get_articles()
impl Article {
    fn get_articles() -> Vec<Article> {
        Document::from_str(open_testing())       // Open the HTML document
            .find(Name("article")).iter()        // Make an Iterator over <article> nodes
            .map(|node| Article::new(&node))     // Map each article to an Article struct
            .collect()                           // Return it as a Vec<Article>
    }
}
As the function above shows, a new Document is created from the &str of the HTML. This document allows us to perform a find() on all tags in the Document whose name is article. If you look at the HTML source directly, you will notice that each article is contained inside its own <article> tag. Hence, we are only collecting information, or nodes, from those specific tags and mapping them to a method we have yet to create: new().
Article::new()
Now let's implement Article::new():
impl Article {
    fn get_articles() -> Vec<Article> {
        Document::from_str(open_testing())
            .find(Name("article")).iter()
            .map(|node| Article::new(&node))
            .collect()
    }

    fn new(node: &Node) -> Article {
        let header = node.find(Name("a")).first().unwrap(); // Obtain the header from the first <a>
        Article{ title: header.text() }                      // Map the header's text to the struct
    }
}
Again, by looking at the HTML, we can find that the title is contained within an <a> tag. It is the first <a> tag inside the <article> node, so we obtain only the first item using first(). This returns a value that may turn out to be missing, for example when no <a> tag is found. For now we ignore that possibility and extract the value with unwrap(), which panics if there is nothing to extract. It is generally better not to ignore errors, though.
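To illustrate what handling the missing case could look like, the sketch below uses a plain slice as a stand-in for the <a> tags found in one article, because slice::first() in the standard library returns an Option. Whether select's first() behaves identically is an assumption here, and first_title is a hypothetical helper.

```rust
// Sketch: handle a missing first element instead of calling unwrap().
// A slice of strings stands in for the <a> tags found inside one article;
// `first_title` is a hypothetical helper.
fn first_title(links: &[&str]) -> Option<String> {
    // slice::first() returns Option<&T>: None when the slice is empty.
    links.first().map(|text| text.to_string())
}

fn main() {
    let with_link = ["Example title", "Read more"];
    let without_link: [&str; 0] = [];

    // An article with links yields its first link text as the title.
    match first_title(&with_link) {
        Some(title) => println!("{}", title),
        None => println!("(no title found)"),
    }

    // An article without links is reported instead of causing a panic.
    match first_title(&without_link) {
        Some(title) => println!("{}", title),
        None => println!("(no title found)"),
    }
}
```

With this shape, an article that lacks a title can be skipped or reported instead of aborting the whole program.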
Once the information is collected, it is returned as a new Article, with header.text() assigned as the title. The text() method extracts the text of a node as a String and discards the other information we don't need. Because the title field is a String, you will get an error without it.
Testing the New Code
Now we can go back to the main() function and modify it to use our new code:
fn main() {
    let phoronix_articles = Article::get_articles();
    for article in phoronix_articles {
        println!("{}", article.title);
    }
}
Try to compile and run this application with cargo run. Your source code should now look like this:
main.rs
// extern crate hyper;
extern crate select;
use select::document::Document;
use select::predicate::{Class,Name};
use select::node::Node;

fn main() {
    let phoronix_articles = Article::get_articles();
    for article in phoronix_articles {
        println!("{}", article.title);
    }
}

fn open_testing() -> &'static str {
    include_str!("phoronix.html")
}

struct Article {
    title: String,
    // link: String,
    // details: String,
    // summary: String,
}

impl Article {
    fn get_articles() -> Vec<Article> {
        Document::from_str(open_testing())
            .find(Name("article")).iter()
            .map(|node| Article::new(&node))
            .collect()
    }

    fn new(node: &Node) -> Article {
        let header = node.find(Name("a")).first().unwrap();
        Article{ title: header.text() }
    }
}