In our last Secondary Pythonista article we covered a lot of good ground. We went from no code at all to a working script that collects the ids corresponding to the articles we’re looking to compile. And all of it in fewer than 30 lines of code. Pretty fantastic.
But we still have a ways to go before we can consider our script even remotely “finished”. Today we’ll start harvesting the data we want to compile, storing it so it can easily be retrieved later, all with a view towards the final output.
Using the todays_articles List
At the conclusion of the last article we had a list containing something we’re calling a post id. I figured out this was useful from an examination of the raw HTML using the Chrome Dev Tools. If all this is confusing to you, give our last article a read.
How exactly will this todays_articles list be useful? Well, in Python we’ll loop over its contents and act on the data we find with the help of that post id. Let me try explaining this by showing rather than telling. Take a look at the code below:
We’ve removed the print todays_articles line and in its place we have a for loop. This loop is going through the todays_articles list and assigning the value it’s using each time to the variable post_id. We’re then searching through the HTML for the element with an id corresponding to the post_id.
The rest of the code following that is a fancy way of printing out our data for verification. There are newer ways of formatting strings for presentation but I’ve gone with a simple, if old-fashioned, way of doing so using the “%” character. Here’s some Python documentation which explains this a little better, including all the different conversion types you can use. Since we’re working strictly with strings we’ll be using the “%s” syntax.
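Here’s a sketch of what that loop might look like. The HTML snippet and post ids below are made-up stand-ins for the real AppStorm page, so treat the exact markup as an assumption:

```python
from bs4 import BeautifulSoup

# Stand-in for the page HTML we fetched in the last article.
html = """
<div id="post-101"><h1>First Article</h1></div>
<div id="post-102"><h1>Second Article</h1></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Stand-in for the list of post ids we collected last time.
todays_articles = ["post-101", "post-102"]

for post_id in todays_articles:
    # Find the element whose id attribute matches this post id.
    article = soup.find(id=post_id)
    # Simple, old-fashioned "%s" formatting for verification output.
    print("Article %s:\n%s\n" % (post_id, article))
```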
If you run this code now you’ll see we’ve successfully isolated the desired element, as well as the elements contained within it.
Pulling the Needed Data
Let’s move forward now and start pulling some more specific data out. We’ll take them one by one.
The Article Title
First let’s extract the article title. From examining the HTML, we’ve determined that the title is found within our HTML segment in an h1 tag. Let’s search it using the Beautiful Soup find() method and print the results to verify.
Well that’s not too bad. It’s definitely the data we wanted, but it still has all those pesky HTML tags in it. Let’s strip those out like we have done before with the get_text() method.
Ah! There we have it. Much better, wouldn’t you say? We now have the title — let’s move on from there.
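Putting those two steps together, the title extraction looks something like this (the HTML snippet here is an invented stand-in for a real article element):

```python
from bs4 import BeautifulSoup

article = BeautifulSoup(
    "<div><h1>Dropcam Review</h1><p>Summary text...</p></div>",
    "html.parser")

# First attempt: the title, still wrapped in its h1 tags.
print(article.find("h1"))

# Strip the tags away with get_text().
article_title = article.find("h1").get_text()
print(article_title)
```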
The Article Author
Next we’ll pull out the author. From the HTML, you’ll see the author’s name somewhat buried inside an a tag. What we’ll use to distinguish it is the rel attribute on that a tag, which has the value of “author”. Beautiful Soup’s find() method allows us to pass in what’s called a keyword argument. It acts as a way to access elements based on their attributes. So we’ll pass in rel=’author’ to get the particular a tag we want.
As we learned when fetching the article title, we’ll need to use the get_text() method to pull just the text out of the HTML. You’ll also see a print statement for verification at the end.
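As a sketch, with an invented snippet standing in for the real byline markup:

```python
from bs4 import BeautifulSoup

article = BeautifulSoup(
    '<p>By <a rel="author" href="#">Phillip Johns</a></p>',
    "html.parser")

# The rel keyword argument matches against the tag's rel attribute.
article_author = article.find("a", rel="author").get_text()
print(article_author)
```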
The Article Category
Let’s move on to the article’s category. Like the author information we just retrieved, we’ll acquire the category that the article belongs in by means of a rel attribute on an a tag. Same as before, using the get_text() method and a print statement to finish.
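Something like the following, where the rel value “category” and the surrounding markup are assumptions for illustration (the real page’s rel value may differ):

```python
from bs4 import BeautifulSoup

article = BeautifulSoup(
    '<p><a rel="category" href="#">Hardware</a> '
    '<a rel="category" href="#">Utilities</a></p>',
    "html.parser")

# find() only ever returns the first match.
article_category = article.find("a", rel="category").get_text()
print(article_category)
```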
OK, things look good, as expected, right? Well, not quite. Perhaps you noticed that the first article in our list, the Dropcam review by Phillip Johns, has two categories associated with it. When you view it on the website you can see that it’s “Hardware \ Utilities”. Well, we wouldn’t want to lose that data now, would we?
Right now we’re using the find() method when it looks like we need to be using the find_all() method, giving us more than just one result. Let’s replace find() with find_all(). Then we’ll run our code again and examine the output.
Did you notice that? We have an error. Looks like we can’t use our get_text() method anymore. Now this makes sense, because we know from our use of find_all() earlier in our script that it returns a list. So, we’ll need to do the same thing we did earlier when looping through the articles on the page. We’ll create a new clean_categories list for us to populate, write a for loop to traverse the article_category list, and, using the get_text() method, append a cleaned-up version of each category to our clean_categories list. Finally, we’ll print out clean_categories for verification.
OK, we’re getting there. We still have a list which needs to become a string. And in the case of there being multiple categories, we want to join them in a manner which is similar to their presentation on the AppStorm homepage. Why don’t we try using the join() method to do this? We’ll join them with a “/”.
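The whole category pipeline, sketched against the same invented markup as before:

```python
from bs4 import BeautifulSoup

article = BeautifulSoup(
    '<p><a rel="category" href="#">Hardware</a> '
    '<a rel="category" href="#">Utilities</a></p>',
    "html.parser")

# find_all() returns every matching element as a list.
article_category = article.find_all("a", rel="category")

clean_categories = []
for category in article_category:
    clean_categories.append(category.get_text())

# Join multiple categories into one string.
article_category = "/".join(clean_categories)
print(article_category)
```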
Nicely done! This is really starting to take shape, isn’t it? Let’s move onto tags next.
The Article Tags
Now, when it comes to the tags, we’ll be grabbing them based on a class. We’ve learned the trick of finding an element based on its attributes, but with classes it’s a bit different. You see “class” is a reserved word in Python. So Beautiful Soup gives us the ability to target an element based on its class by using the class_ keyword. So we’ll be finding a ul tag with the class of tags. Let’s start there, printing it out for verification.
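A sketch of that first step, with invented list markup standing in for the real page:

```python
from bs4 import BeautifulSoup

article = BeautifulSoup(
    '<ul class="tags"><li><a href="#">ipad</a></li>'
    '<li><a href="#">camera</a></li></ul>',
    "html.parser")

# "class" is reserved in Python, so Beautiful Soup uses class_ instead.
tag_list = article.find("ul", class_="tags")
print(tag_list)
```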
Alright. So we can clearly see that once again, further processing will be needed to get the plain tag info that we want. We’ll need another for loop to cycle through the li elements within the ul we just selected. We’ll also need to access the text not of the li element itself, but of the a tag within it. Like we’ve done before, we’ll create a clean_tags list to store this info in. For clarity’s sake, why don’t we rename the variable containing our ul to tag_list? Makes a little more sense, at least it does to me.
Did you notice we did something interesting when we built our tag_list? We chained together a find() method with a find_all() method. Chaining together these methods limits the field of their search. It’s an efficient way of selecting a particular element or series of elements.
Now that we have our cleaned up list of tags, why don’t we join them together like we did with the categories. Except instead of a “/” we’ll use a “,”.
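Putting the whole tag extraction together (again with invented markup standing in for the real page):

```python
from bs4 import BeautifulSoup

article = BeautifulSoup(
    '<ul class="tags"><li><a href="#">ipad</a></li>'
    '<li><a href="#">camera</a></li></ul>',
    "html.parser")

# Chaining find() with find_all() limits the search to li elements
# inside this particular ul.
tag_list = article.find("ul", class_="tags").find_all("li")

clean_tags = []
for tag in tag_list:
    # The text we want is in the a tag inside each li.
    clean_tags.append(tag.find("a").get_text())

article_tags = ", ".join(clean_tags)
print(article_tags)
```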
Excellent! On to the comment count.
The Comment Count
We’ll be accessing the number of comments for the page by using the same “find by class” method we used when grabbing the tags. Then we’ll finish it off with the get_text() method and a print statement. This is all getting pretty standard at this point, isn’t it? Hope you’re getting the hang of this.
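For instance, something like this, where the “comments” class name and markup are assumptions for illustration:

```python
from bs4 import BeautifulSoup

article = BeautifulSoup(
    '<span class="comments"><a href="#">12 Comments</a></span>',
    "html.parser")

# Same "find by class" trick as with the tags.
article_comments = article.find("span", class_="comments").get_text()
print(article_comments)
```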
The Article Summary
Last but certainly not least we have the article summary. At this point I’m almost inclined to just have you figure it out on your own. We’ll be targeting the data within a div element with a class of “entry”. get_text() and print are your friends as usual.
Did what you programmed match what we see above? Great. Now, we could leave it right here. But, considering how simple that was for us after doing it so many times before, why don’t we try something else here. That “(more…)” line is at the end of every article summary. On the homepage it’s a link to the main article. Here in our world of plain text, it’s just redundant.
Fortunately we can target that link with the class “more-link”. To remove it though we’ll use a new method called decompose(). It takes the element we’ve targeted and completely removes it from the HTML.
Sweet. One line to fix that pesky “(more…)” line.
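Sketched out with an invented summary snippet (the “entry” and “more-link” classes come from the page; the rest of the markup is made up):

```python
from bs4 import BeautifulSoup

article = BeautifulSoup(
    '<div class="entry"><p>A quick look at Dropcam. '
    '<a class="more-link" href="#">(more...)</a></p></div>',
    "html.parser")

# decompose() removes the targeted element from the HTML entirely.
article.find("a", class_="more-link").decompose()

article_summary = article.find("div", class_="entry").get_text()
print(article_summary)
```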
Where We’re At
Now, we’ve made a tremendous amount of progress here. We’ve collected all the information we want, but it isn’t in the most readable layout. And our code has succumbed to our enthusiasm: it could use some comments to properly explain what’s where for future debugging purposes, and, frankly, just because it’s good practice. We also have to consider that our script right now only outputs data from iPad.AppStorm, and we have five more sites to care for. To handle those in the most efficient way possible, we’ll need to make our code more modular and start utilizing some custom functions.
In our next article we’ll address all these issues.