Tuesday, January 28, 2014

Python Tutorial Part 3 - Keyword Extractor

Main Purpose : Build our own keyword scraper

Source : Chris Reeves' Python tutorial

In this tutorial, we are going to learn how to build a keyword extractor.



Main.py

import gethtml
import articletext

url = "http://www.nytimes.com/2014/01/27/sports/committing-to-play-for-a-college-then-starting-9th-grade.html?hp"
# we will extract the keyword from this news article

article = gethtml.getHtmlText(url)
print(articletext.getkeywords(article))


gethtml.py

import mechanize

def getHtmlText(url):
        br = mechanize.Browser()
        htmltext = br.open(url).read()
        return htmltext

def getHtmlFile(url):
        br = mechanize.Browser()
        htmlfile = br.open(url)
        return htmlfile                                
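The tutorial uses mechanize, which only runs on Python 2. As a side note (not part of the original tutorial), the same fetch can be sketched in Python 3 with the standard library's urllib.request; the function name get_html_text here is my own:

```python
# A Python 3 alternative to mechanize's Browser.open().read()
# (an assumption for modern readers, not the tutorial's code).
import urllib.request

def get_html_text(url):
    # urlopen returns a file-like response; read() gives raw bytes,
    # which we decode into a string for later parsing.
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

# A "data:" URL lets us demonstrate the call without network access.
print(get_html_text("data:text/html;charset=utf-8,<p>hello</p>"))
```

With a real article URL, the returned string is the page's full HTML, just like getHtmlText above.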


articletext.py

from bs4 import BeautifulSoup
import gethtml

def getArticleText(webtext):
        articletext = ""
        soup = BeautifulSoup(webtext, "html.parser")  # name the parser explicitly
        for tag in soup.find_all('p', attrs={"itemprop": "articleBody"}):
                articletext += tag.get_text()  # get_text() handles nested markup, unlike contents[0]
        return articletext

def getkeywords(articletext):
        common = open("common.txt").read().split("\n")
        word_dict = dict()
        word_list = articletext.lower().split()
        for word in word_list:
                if word not in common and word.isalnum():
                        if word not in word_dict:
                                word_dict[word] = 1
                        else:  # must be else, or the first occurrence is counted twice
                                word_dict[word] += 1
        top_words = sorted(word_dict.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)[0:25]
        top25 = []
        for w in top_words:
                top25.append(w[0])
        return top25
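To see the whole pipeline work without fetching a live page, here is a self-contained Python 3 sketch of the same idea (my adaptation, not the tutorial's exact code): the sample HTML, the inline COMMON set standing in for common.txt, and the snake_case names are all illustrative.

```python
# End-to-end sketch: extract article-body paragraphs from HTML,
# then rank the remaining words by frequency.
from bs4 import BeautifulSoup
from collections import Counter

SAMPLE_HTML = """
<html><body>
<p itemprop="articleBody">Recruiting starts early and recruiting never stops.</p>
<p itemprop="articleBody">Coaches watch players and players court coaches.</p>
<p>This paragraph is not part of the article body.</p>
</body></html>
"""

# Stand-in for common.txt: a tiny set of words to ignore.
COMMON = {"the", "a", "and", "is", "of", "not"}

def get_article_text(webtext):
    soup = BeautifulSoup(webtext, "html.parser")
    # Keep only the <p> tags marked as the article body.
    return " ".join(tag.get_text() for tag in
                    soup.find_all("p", attrs={"itemprop": "articleBody"}))

def get_keywords(articletext, limit=25):
    words = [w for w in articletext.lower().split()
             if w.isalnum() and w not in COMMON]
    # collections.Counter does the counting the tutorial does by hand.
    return [word for word, count in Counter(words).most_common(limit)]

print(get_keywords(get_article_text(SAMPLE_HTML)))
```

Words ending in punctuation (like "stops.") fail isalnum() and are dropped, which is the same rough filtering the tutorial's code applies.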

The most important file is "articletext.py". The algorithm works as follows:

  1. Call "getHtmlText" in gethtml.py
  2. In gethtml.py, getHtmlText reads the HTML text and returns htmltext. So, the variable 'article' in Main.py now contains the HTML text.
  3. Call the "getkeywords" function in articletext.py
  4. The getkeywords function in articletext.py takes "articletext" as a parameter. To split the article into a word list, "articletext" must be plain text, not raw HTML.
    1. So, we create a new function getArticleText that takes as a parameter the HTML text we fetched in Main.py.
    2. for tag in soup.find_all('p', attrs={"itemprop":"articleBody"}) # You can check this attribute in Chrome's developer tools
      
      This keeps only the paragraphs of the article body and returns their text.
  5. We are going to find the words that occur with the highest frequency. To avoid picking words such as "the" and "a", we downloaded a list of the 500 most used words (common.txt) and skip any word in that list.
  6. Construct the dictionary. Using a dictionary, we can access data not by its index but by its key. word_dict["keyword"] = "input-value" is the classical way to fill a dictionary. In this code, "word_dict" is the dictionary that holds the frequency of each word in the article.
  7. Using a lambda function, we sort in order of descending frequency.
    lambda "parameter" : "expression"
    lambda x, y : x + y
    map(function, list)
    map(lambda x: x ** 2, range(5)) # [0, 1, 4, 9, 16]
    
    "sorted" returns a new sorted list and preserves the original. Thus, you can sort any iterable (dict, list, tuple) with sorted.
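Steps 6 and 7 can be sketched on a toy word list (the words here are just placeholders):

```python
# Build a frequency dictionary by hand, then sort it by count
# with a lambda key, as the tutorial's getkeywords does.
word_list = ["spam", "egg", "spam", "ham", "spam", "egg"]

word_dict = {}
for word in word_list:
    if word not in word_dict:
        word_dict[word] = 1        # first occurrence
    else:
        word_dict[word] += 1       # every later occurrence

# Sort by (count, word) descending; sorted() returns a NEW list
# and leaves word_dict itself untouched.
top_words = sorted(word_dict.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)

print(top_words)  # [('spam', 3), ('egg', 2), ('ham', 1)]
```

Note the lambda takes the (word, count) pair as a single argument kv; the Python 2-only form lambda (k, v): (v, k), which unpacks the tuple in the signature, was removed in Python 3.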
