2014년 1월 28일 화요일

Python - A benefit of the module "Readability"

Main purpose: a comparison of the two modules "Readability" and "BeautifulSoup"

Source: Chris Reeves

I am going to introduce a new module, "Readability". It is mainly used to pull the article content out of an HTML body. Some people may insist that users can do the same thing with BeautifulSoup, so let's compare the two.

import mechanize
from bs4 import BeautifulSoup

url = "http://www.nytimes.com/2014/01/27/us/politics/2014-elections-likely-to-keep-capitals-split.html?ref=us"

br = mechanize.Browser()
htmltext = br.open(url).read()

soup = BeautifulSoup(htmltext, "html.parser")
for tag in soup.find_all('p', attrs={"itemprop": "articleBody"}):
    print tag.contents[0]

This is the code that reads the article body using BeautifulSoup. As you might know, it brings back only about half of the body text (the part above the big picture in the article). When the scraper driven by BeautifulSoup runs into some JavaScript, it effectively treats the article as finished, so the program stops there. That is why I recommend using "Readability" to pull out the HTML body contents.
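The truncation is easy to reproduce without any third-party library. The sketch below uses only the standard-library html.parser (Python 3) and a made-up HTML snippet (the SAMPLE string is my own example, not taken from the Times page) to show a related pitfall in the code above: printing only tag.contents[0] drops everything after the first inline tag inside the paragraph.

```python
from html.parser import HTMLParser

# Made-up sample: a body paragraph containing an inline <a> tag.
SAMPLE = '<p itemprop="articleBody">Congress <a href="#">remains</a> split.</p>'

class BodyText(HTMLParser):
    """Collects text inside <p itemprop="articleBody"> tags."""
    def __init__(self):
        super().__init__()
        self.in_body_p = False
        self.first_only = None   # mimics tag.contents[0]: first text node only
        self.all_text = []       # mimics tag.get_text(): every text node

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and ('itemprop', 'articleBody') in attrs:
            self.in_body_p = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_body_p = False

    def handle_data(self, data):
        if self.in_body_p:
            if self.first_only is None:
                self.first_only = data
            self.all_text.append(data)

parser = BodyText()
parser.feed(SAMPLE)
print(repr(parser.first_only))         # 'Congress ' -- truncated at the <a> tag
print(repr(''.join(parser.all_text)))  # 'Congress remains split.' -- full text
```

In BeautifulSoup terms, `tag.get_text()` would give you the full paragraph where `tag.contents[0]` gives only the first node.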


import mechanize
from bs4 import BeautifulSoup
from readability.readability import Document

url = "http://www.nytimes.com/2014/01/27/us/politics/2014-elections-likely-to-keep-capitals-split.html?ref=us"

# mechanize is a good way to fetch the HTML here, because it lets us
# ignore robots.txt and present a browser-like User-agent, so the
# request is less likely to be refused as a bot.
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Firefox')]  # claim to be a Firefox browser, not a robot

html = br.open(url).read()

doc = Document(html)
readable_article = doc.summary()
readable_title = doc.short_title()
soup = BeautifulSoup(readable_article, "html.parser")

final_article = soup.text
print final_article
print readable_title

This is the code using Readability, and there are a few things to remember:

br.set_handle_robots(False)  # do not ask robots.txt for permission
br.addheaders = [('User-agent', 'Firefox')]  # claim to be a Firefox browser, not a robot
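For readers on Python 3, the same "pretend to be a browser" trick works with the standard library alone. This is a minimal sketch using urllib.request instead of mechanize (an assumption on my part; the post itself uses mechanize), and it only builds the request without hitting the network:

```python
from urllib.request import Request, urlopen

url = "http://www.nytimes.com/2014/01/27/us/politics/2014-elections-likely-to-keep-capitals-split.html?ref=us"

# Attach a browser-like User-Agent so the request looks like it comes
# from a browser rather than a script.
req = Request(url, headers={'User-Agent': 'Firefox'})
print(req.get_header('User-agent'))  # Firefox

# The actual network call would be:
# html = urlopen(req).read()
```

The resulting html string can then be handed to Document(html) exactly as in the mechanize version.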
