Main purpose: build our own keyword scraper
Source: Chris Reeves' Python tutorial
In this tutorial, we are going to learn how to build a keyword extractor.
Main.py
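import gethtml
import articletext

# We will extract the keywords from this news article.
url = "http://www.nytimes.com/2014/01/27/sports/committing-to-play-for-a-college-then-starting-9th-grade.html?hp"

# Download the raw HTML of the page.
article = gethtml.getHtmlText(url)
# Strip the HTML down to the article body before counting words.
text = articletext.getArticleText(article)
print(articletext.getkeywords(text))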
gethtml.py
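import mechanize

def getHtmlText(url):
    # Open the page and return its HTML source.
    br = mechanize.Browser()
    htmltext = br.open(url).read()
    return htmltext

def getHtmlFile(url):
    # Open the page and return the file-like response object.
    br = mechanize.Browser()
    htmlfile = br.open(url)
    return htmlfile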
articletext.py
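from bs4 import BeautifulSoup
import gethtml

def getArticleText(webtext):
    # Collect the text of every <p itemprop="articleBody"> paragraph.
    articletext = ""
    soup = BeautifulSoup(webtext, "html.parser")
    for tag in soup.find_all('p', attrs={"itemprop": "articleBody"}):
        # get_text() also works when the paragraph starts with a nested tag;
        # the trailing space keeps words of adjacent paragraphs apart.
        articletext += tag.get_text() + " "
    return articletext

def getkeywords(articletext):
    # common.txt holds one common word per line (for example "the", "a").
    common = open("common.txt").read().split("\n")
    word_dict = dict()
    word_list = articletext.lower().split()
    for word in word_list:
        if word not in common and word.isalnum():
            if word not in word_dict:
                word_dict[word] = 1
            else:
                word_dict[word] += 1
    # Sort by (count, word), highest count first, and keep the top 25.
    top_words = sorted(word_dict.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)[0:25]
    top25 = []
    for w in top_words:
        top25.append(w[0])
    return top25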
The key file is "articletext.py". The three files work together as follows:
- Main.py calls "getHtmlText" in gethtml.py.
- In gethtml.py, getHtmlText reads the HTML of the page and returns it as htmltext. So, the variable 'article' in Main.py now contains the HTML text.
- Next, the "getkeywords" function in articletext.py is called.
- getkeywords takes one parameter, "articletext". Because the function splits this text into a list of words, the parameter must be the plain text of the article, not raw HTML.
- So, we create a new function, getArticleText, which takes the HTML text fetched in Main.py as its parameter and returns only the text of the article.
- getArticleText keeps only the paragraphs that belong to the article body:
for tag in soup.find_all('p', attrs={"itemprop": "articleBody"}):  # You can confirm the attribute by inspecting the page in Chrome
  Only the body text is collected this way, and that text is returned.
- We are going to find the words that occur most frequently. To avoid picking words such as "the" and "a", we downloaded a list of the 500 most used words (saved as common.txt) and skip every word that appears in it.
- Construct the dictionary. With a dictionary, we access data not by index but by key. word_dict["keyword"] = value is the classic way to add an entry. In this code, "word_dict" is the dictionary that holds the frequency of each word in the article.
- Using a lambda function as the sort key, we sort the words in descending order of frequency.
lambda "parameters" : "expression"
lambda x, y : x + y
map(function, list)
list(map(lambda x: x ** 2, range(5)))  # [0, 1, 4, 9, 16]
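The counting steps in the list above can be tried without a network connection. This is only a minimal sketch: the function name `top_keywords`, the `limit` parameter, and the small `STOP_WORDS` set (standing in for common.txt) are assumptions made for illustration, not part of the tutorial code.

```python
# Hypothetical stand-in for common.txt: a tiny stop-word set (assumption).
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is"}


def top_keywords(text, stop_words=STOP_WORDS, limit=25):
    """Return up to `limit` words, most frequent first."""
    counts = {}
    for word in text.lower().split():
        # Skip stop words and tokens with punctuation attached.
        if word in stop_words or not word.isalnum():
            continue
        # dict.get supplies 0 the first time a word is seen.
        counts[word] = counts.get(word, 0) + 1
    # Sort by (count, word) in descending order, as getkeywords does.
    ranked = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    return [word for word, _ in ranked[:limit]]


sample = "The coach said the player is committing to play and the player agreed"
print(top_keywords(sample, limit=3))  # ['player', 'said', 'play']
```

Words with the same count are ordered by the word itself, in reverse alphabetical order here because reverse=True applies to the whole (count, word) key.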
"Sorted" is a function used for preserving the original target. Thus, you can sort any forms (dic, list, tuple) with sorted method.