Main purpose: a comparison of two modules, "Readability" and "BeautifulSoup"
Source = Chris Reeves'
I am going to introduce a new module, "Readability". It is mainly used to pull the main content out of an HTML body. Some people may argue that the same can be done with BeautifulSoup, so let's compare the two.
import mechanize
from bs4 import BeautifulSoup

url = "http://www.nytimes.com/2014/01/27/us/politics/2014-elections-likely-to-keep-capitals-split.html?ref=us"

# Fetch the raw HTML of the article
br = mechanize.Browser()
htmltext = br.open(url).read()

# Print the first child of every <p> tagged itemprop="articleBody"
soup = BeautifulSoup(htmltext, "html.parser")
for tag in soup.find_all('p', attrs={"itemprop": "articleBody"}):
    print(tag.contents[0])
This is the code that reads the HTML body contents using BeautifulSoup. As you may notice, it returns only about half of the body (the part above the large picture in the article). BeautifulSoup does not run JavaScript and returns only the tags that match the filter, so anything the page loads by script or marks up differently, for example after an embedded picture, is probably left out, and the output looks as if the article ended early. That is why I recommend using "readability" to pull out the HTML body contents.
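The filtering effect described above can be reproduced on a small scale. The sketch below is only an illustration under stated assumptions: the HTML fragment and the `story-body-text` class are made up, and it uses the standard library's `html.parser` (Python 3) as a stand-in for BeautifulSoup's `find_all('p', attrs=...)` so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: only the first two paragraphs carry the
# itemprop="articleBody" attribute; the one after the picture does not.
SAMPLE_HTML = """
<div>
  <p itemprop="articleBody">First paragraph.</p>
  <p itemprop="articleBody">Second paragraph.</p>
  <figure><img src="photo.jpg"></figure>
  <p class="story-body-text">Third paragraph, marked up differently.</p>
</div>
"""


class ParagraphCollector(HTMLParser):
    """Collect the text of <p> tags, optionally filtered by one attribute."""

    def __init__(self, required_attr=None):
        super().__init__()
        self.required_attr = required_attr  # e.g. ("itemprop", "articleBody")
        self.paragraphs = []
        self._buffer = None  # None means "not inside a wanted <p>"

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        # attrs is a list of (name, value) pairs
        if self.required_attr is None or self.required_attr in attrs:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None


def collect(html, required_attr=None):
    """Return the text of all <p> tags that pass the optional filter."""
    parser = ParagraphCollector(required_attr)
    parser.feed(html)
    parser.close()
    return parser.paragraphs


filtered = collect(SAMPLE_HTML, ("itemprop", "articleBody"))
everything = collect(SAMPLE_HTML)
print(len(filtered), "of", len(everything), "paragraphs matched the filter")
```

Here the filter returns two of the three paragraphs. The one after the picture is skipped only because it lacks the attribute, not because parsing stopped.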
import mechanize
from bs4 import BeautifulSoup
from readability.readability import Document

url = "http://www.nytimes.com/2014/01/27/us/politics/2014-elections-likely-to-keep-capitals-split.html?ref=us"

# mechanize is a convenient way to fetch the HTML because it can avoid
# being turned away as a robot.
br = mechanize.Browser()
br.set_handle_robots(False)  # do not check robots.txt
br.addheaders = [('User-agent', 'Firefox')]  # say that I am a Firefox web browser, not a robot
html = br.open(url).read()

# Let readability extract the main article and its title
doc = Document(html)
readable_article = doc.summary()
readable_title = doc.short_title()

# summary() returns HTML, so strip the tags with BeautifulSoup
soup = BeautifulSoup(readable_article, "html.parser")
final_article = soup.text

print(final_article)
print(readable_title)
This is the code that uses readability. There are a few things to remember:
# Do not let robots.txt block the request
br.set_handle_robots(False)
# Say that I am not a robot, but a Firefox web browser, so the site does not treat the script as a robot
br.addheaders = [('User-agent', 'Firefox')]
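To make these two settings more concrete: mechanize checks a site's robots.txt by default, and `set_handle_robots(False)` turns that check off, while `addheaders` changes the User-agent string sent with each request. The sketch below shows both ideas without touching the network. It is an illustration only: it uses the standard library's `urllib` (Python 3) instead of mechanize, and the robots.txt content and the example.com URLs are invented.

```python
import urllib.request
import urllib.robotparser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# This is the kind of check a robots-aware client performs before fetching.
parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed_public = parser.can_fetch("Firefox", "http://example.com/public/page.html")
allowed_private = parser.can_fetch("Firefox", "http://example.com/private/page.html")

# Build (but do not send) a request that carries a custom User-agent header.
request = urllib.request.Request(
    "http://example.com/public/page.html",
    headers={"User-agent": "Firefox"},
)
sent_agent = request.get_header("User-agent")

print(allowed_public, allowed_private, sent_agent)
```

Skipping the robots check and changing the User-agent only affects how the script presents itself; whether a site permits automated access is a separate question to check before scraping it.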