Secondary Pythonista — Test, Re-write, Test Again (Part 2)

In our last Secondary Pythonista article we covered a lot of good ground. We went from no code at all to a working script that collects the post ids corresponding to the articles we’re looking to compile. And it’s all been in fewer than 30 lines of code. Pretty fantastic.

But we still have a ways to go before we can consider our script even remotely “finished”. Today we’ll start harvesting the content we want to compile and storing it in a way that can easily be retrieved later, all with a view towards the final output.

Using the todays_articles List

At the conclusion of the last article we had a list containing something we’re calling a post id. I figured out this was useful from an examination of the raw HTML using the Chrome Dev Tools. If all this is confusing to you, give our last article a read.

How exactly will this todays_articles list be useful? Well, in Python we’ll loop over its contents and act on the data we find with the help of that post id. Let me try explaining this by showing rather than telling. Take a look at the code below:

The todays_articles loop

We’ve removed the print todays_articles line and in its place we have a for loop. This loop goes through the todays_articles list, assigning each value in turn to the variable post_id. We then search through the HTML for the element with an id corresponding to that post_id.

The rest of the code following that is a fancy way of printing out our data for verification. There are newer ways of formatting strings for presentation, but I’ve gone with a simple, if old-fashioned, way of doing so using the “%” character. Here’s some Python documentation which explains this a little better, including all the different conversion types you can use. Since we’re working strictly with strings we’ll be using the “%s” syntax.
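For reference, here’s roughly what that loop looks like in code. I’m assuming the parsed page from the last article lives in a BeautifulSoup object called soup, and that each entry in todays_articles is an id string along the lines of “post-12345”; adjust the names to match your own script.

    # soup and todays_articles come from the code we wrote in the last article
    for post_id in todays_articles:
        # find the element on the page whose id matches this post id
        article = soup.find(id=post_id)

        # print it for verification using the "%s" conversion type
        print "Here is %s:\n\n%s\n" % (post_id, article)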

If you run this code now you’ll see we’ve successfully isolated the desired element, as well as the elements contained within it.

The console output from the todays_articles loop

Pulling the Needed Data

Let’s move forward now and start pulling some more specific data out. We’ll take them one by one.

The Article Title

First let’s extract the article title. From examining the HTML, we’ve determined that the title is found within our HTML segment in an h1 tag. Let’s search for it using the Beautiful Soup find() method and print the results to verify.

Finding an h1

The console output from the search

Well that’s not too bad. It’s definitely the data we wanted, but it still has all those pesky HTML tags in it. Let’s strip those out like we have done before with the get_text() method.
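Put together, the title step might look something like this (article_title is just my name for it, and article is the element we pulled out inside the loop above):

    # inside the todays_articles loop: the title sits in the h1 tag
    article_title = article.find('h1').get_text()
    print article_title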

Adding the get_text method

Console output to match

Ah! There we have it. Much better, wouldn’t you say? We now have the title — let’s move on from there.

The Author

Next we’ll pull out the author. From the HTML, you’ll see the author’s name somewhat buried inside an a tag. What we’ll use to distinguish it is the rel attribute on the a tag, which has the value of “author”. Beautiful Soup’s find() method allows us to pass in what’s called a keyword argument. It acts as a way to access elements based on their attributes. So we’ll pass in rel=’author’ to get the particular a tag we want.

Like we learned when fetching the article title, we will need to use the get_text() method to pull just the text out of the HTML. You’ll also see a print statement for verification at the end.
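Something along these lines should do it, with article_author as my stand-in name for the result:

    # the rel='author' keyword argument narrows the search to the author link
    article_author = article.find('a', rel='author').get_text()
    print article_author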

Finding the author

Console output to match

The Article Category

Let’s move on to the article’s category. Like the author information we just retrieved, we’ll acquire the category that the article belongs in by means of a rel attribute on an a tag. Same as before, we’ll use the get_text() method and finish with a print statement.

Find the category

Console output to match

OK, things look good, as expected, right? Well, not quite. Perhaps you noticed that the first article in our list, the Dropcam review by Phillip Johns, has two categories associated with it. When you view it on the website you can see that it’s “Hardware / Utilities”. Well, we wouldn’t want to lose that data now, would we?

Right now we’re using the find() method when it looks like we need to be using the find_all() method, giving us more than just one result. Let’s replace find() with find_all(). Then we’ll run our code again and examine the output.

Using find_all

The error message

Oops.

Did you notice that? We have an error. Looks like we can’t use our get_text() method anymore. Now this makes sense, because we know from our use of find_all() earlier in our script that it returns a list. So, we’ll need to do the same thing we did earlier when looping through the articles on the page. We’ll create a new clean_categories list to populate, write a for loop to traverse the article_category list, and use the get_text() method to throw a cleaned-up version of each category into our clean_categories list. Finally we’ll print out clean_categories for verification.
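Here’s a sketch of that cleanup, assuming article_category now holds the list that find_all() returned:

    clean_categories = []

    # get_text() works fine on each individual element in the list
    for category in article_category:
        clean_categories.append(category.get_text())

    print clean_categories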

The clean_categories list

Console output to match

OK, we’re getting there. We still have a list which needs to become a string. And in the case of there being multiple categories, we want to join them in a manner similar to their presentation on the AppStorm homepage. Why don’t we try using the join() method to do this? We’ll join them with a “/”.
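A quick sketch of the join; padding the slash with spaces is just my choice for readability, and reassigning article_category to hold the finished string is likewise my own naming here:

    # collapse the list into a single "Category / Category" string
    article_category = " / ".join(clean_categories)
    print article_category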

The join statement

Console output to match

Nicely done! This is really starting to take shape, isn’t it? Let’s move on to tags next.

The Article Tags

Now, when it comes to the tags, we’ll be grabbing them based on a class. We’ve learned the trick of finding an element based on its attributes, but with classes it’s a bit different. You see, “class” is a reserved word in Python. So Beautiful Soup gives us the ability to target an element based on its class by using the class_ keyword. So we’ll be finding a ul tag with the class of tags. Let’s start there, printing it out for verification.
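In code, that first step might look like this; article_tags is just my placeholder name for the moment:

    # "class" is reserved in Python, so Beautiful Soup gives us class_ instead
    article_tags = article.find('ul', class_='tags')
    print article_tags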

Finding the tag ul

Console output to match

Alright. So we can clearly see that, once again, further processing will be needed to get the plain tag info that we want. We’ll need another for loop to cycle through the li elements within the ul we just selected. We’ll also need to access the text within the a tag inside each li element, not the text of the li element itself. Like we’ve done before, we’ll create a clean_tags list to store this info in. For clarity’s sake, why don’t we rename the variable containing our ul to tag_list. Makes a little more sense — at least it does to me.
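Roughly, that works out to something like this (note the chained find() and find_all(), which we’ll come back to in a moment):

    clean_tags = []

    # chaining find() with find_all() keeps the search inside this one ul
    tag_list = article.find('ul', class_='tags').find_all('li')

    for tag in tag_list:
        # the text we want lives in the a tag inside each li
        clean_tags.append(tag.find('a').get_text())

    print clean_tags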

The new tag for loop

Console output to match

Did you notice we did something interesting when we built our tag_list? We chained together a find() method with a find_all() method. Chaining together these methods limits the field of their search. It’s an efficient way of selecting a particular element or series of elements.

Now that we have our cleaned up list of tags, why don’t we join them together like we did with the categories. Except instead of a “/” we’ll use a “,”.
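Something like this should do it; the space after the comma and the tag_string name are my own choices:

    # collapse the list of tags into one comma-separated string
    tag_string = ", ".join(clean_tags)
    print tag_string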

The cleaned up tag output

Console output to match

Excellent! On to the comment count.

The Comment Count

We’ll be accessing the number of comments for the page by using the same “find by class” method we used when grabbing the tags. Then we’ll finish it off with the get_text() method and a print statement. Seems to be getting pretty standard at this point, doesn’t it? Hope you’re getting the hang of this.
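A sketch of that step; note that the class name “comments” is my guess, so check the page’s HTML for whatever class the theme actually puts on the comment count:

    # the class name here is a guess; inspect the HTML for the real one
    article_comments = article.find(class_='comments').get_text()
    print article_comments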

The comment count

Console output to match

The Article Summary

Last but certainly not least we have the article summary. At this point I’m almost inclined to just have you figure it out on your own. We’ll be targeting the data within a div element with a class of “entry”. get_text() and print are your friends as usual.
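If you want to check your work, here’s one way it could look (article_summary is my name for it):

    # the summary sits inside a div with the class "entry"
    article_summary = article.find('div', class_='entry').get_text()
    print article_summary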

The article summary

Console output to match

Did what you programmed match what we see above? Great. Now, we could leave it right here. But, considering how simple that was for us after doing it so many times before, why don’t we try something else here? That “(more…)” line is at the end of every article summary. On the homepage it’s a link to the main article. Here in our world of plain text, it’s just redundant.

Fortunately we can target that link with the class “more-link”. To remove it though we’ll use a new method called decompose(). It takes the element we’ve targeted and completely removes it from the HTML.
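Reworking the summary step slightly so that decompose() runs before we pull the text out, it might look like this:

    article_summary = article.find('div', class_='entry')

    # decompose() removes the "(more...)" link from the HTML entirely
    article_summary.find('a', class_='more-link').decompose()

    print article_summary.get_text()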

Adding the decompose line

Console output to match

Sweet. One line to fix that pesky “(more…)” line.

Where We’re At

Now, we’ve made a tremendous amount of progress here. We’ve collected together all the information we want. But, it isn’t in the most readable layout. And our code has succumbed to our enthusiasm. It could use some comments to properly explain what’s where for future debugging purposes, and frankly just because it’s good practice. We also have to consider that our script right now is simply outputting data from iPad.AppStorm. We have five more sites to care for. To do that in the most efficient way possible, we’ll need to make our code more modular and start utilizing some custom functions.

In our next article we’ll address all these issues.

