Source : Chris Reeves' Python tutorial
Recently, I found a good tutorial that got me interested in Python:
Chris Reeves' Python tutorial. I want to summarize what I have learned from it. The main purpose of the tutorial is to build my own
web spider (web bot, web crawler, or whatever you want to call it).
The first ten tutorials build a program that scrapes the stock prices of companies registered on the NASDAQ list. In this post (part 1), I am going to review the Python code.
Part 1. Stock Price - A single company, say, Apple
This is the code for scraping the last price of Apple (AAPL) from
Yahoo Finance - Apple
1. import urllib
import re
htmlfile = urllib.urlopen("http://finance.yahoo.com/q?s=AAPL&ql=1")
2. htmltext = htmlfile.read()
regex = '<span id="yfs_l84_aapl">(.+?)</span>'
pattern = re.compile(regex)
3. price = re.findall(pattern,htmltext)
print price
The main idea of this code is to read the HTML source of the page and scrape the part of it that contains the last price.
1. Import the modules "urllib" and "re". "urllib" is a module for fetching and reading HTML, and "re" is the regular expression module, which is useful for working with character strings.
2.
htmlfile = urllib.urlopen(url)
htmltext = urllib.urlopen(url).read()
reads the HTML of the website. In this code, we read the HTML as text, not as a file, so that we can search it for a certain keyword with the 're' module. (Reading the HTML as a file is useful with the BeautifulSoup approach, which we will deal with later.)
3. This is a useful link for learning regular expressions:
Regular Expression (it is written in Korean, because I am Korean. You can easily find other well-written tutorials.)
'<span id="yfs_l84_aapl">(.+?)</span>'
'()', in regex, is used for grouping. For example, "Set( Value)?" matches both "Set" and "Set Value". A dot (.) matches any single character, and "+" means "one or more", i.e. not empty. The "?" after "+" makes the match non-greedy, so it stops at the first closing tag. Altogether, (.+?) matches any non-empty string between the opening <span> tag and the closing </span> tag.
pattern = re.compile(regex)
price = re.findall(pattern,htmltext)
"re.compile(regex)" compiles the regex into a pattern object, and "re.findall(pattern,htmltext)" returns a list of every substring of htmltext captured by the group.
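The compile/findall pair above can be tried on a toy snippet. This sketch uses Python 3 syntax (the tutorial's code is Python 2) and a made-up HTML fragment; the span id and the price 101.42 are just illustrative values:

```python
import re

# Toy stand-in for the Yahoo Finance page source; the id and price are made up.
htmltext = '<span id="yfs_l84_aapl">101.42</span>'

regex = '<span id="yfs_l84_aapl">(.+?)</span>'
pattern = re.compile(regex)            # compile the regex into a reusable pattern object
price = re.findall(pattern, htmltext)  # list of every string captured by the group
print(price)                           # ['101.42']
```

findall returns a list, which is why part 1 prints something like ['101.42'] rather than a bare number.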
Part 2. Stock Prices - Multiple companies
This is the code for scraping the price of every company registered on NASDAQ. First, we have to download the list of ticker symbols of the companies listed on NASDAQ. You can easily find it by googling.
import urllib
import re
symbolfile = open("stocklist.txt")
1. symbolslist = symbolfile.read()
2. uppernewsymbolslist = symbolslist.split("\n")
3. newsymbolslist = [x.lower() for x in uppernewsymbolslist]
i = 0
while i < len(newsymbolslist):
    url = "http://finance.yahoo.com/q?s="+newsymbolslist[i]+"&ql=1"
    htmltext = urllib.urlopen(url).read()
    regex = '<span id="yfs_l84_'+newsymbolslist[i]+'">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern,htmltext)
    print "the price of",uppernewsymbolslist[i],"is", price
    i += 1
1.
file = open("directory of file/filename")
file_read_text = file.read()
This is how we have Python read a file.
2.
'split("\n")' splits the text at every newline and saves the pieces into a list, one element per line.
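A minimal sketch of this step, using an in-memory string instead of stocklist.txt (the three tickers are made up):

```python
# Pretend this string is what symbolfile.read() returned from stocklist.txt.
symbolslist = "AAPL\nGOOG\nMSFT"

# split("\n") cuts the text at each newline: one list element per line.
uppernewsymbolslist = symbolslist.split("\n")
print(uppernewsymbolslist)  # ['AAPL', 'GOOG', 'MSFT']
```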
3.
newsymbolslist = [x.lower() for x in uppernewsymbolslist]
It is pretty interesting that this command actually works; this is why Python is such a strong language. It reads almost like how we talk: "lowercase x, for each x that is an element of the list uppernewsymbolslist," and the results are collected into a new list.
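The same list comprehension, sketched on made-up data in Python 3 syntax:

```python
uppernewsymbolslist = ['AAPL', 'GOOG', 'MSFT']  # hypothetical ticker list

# "lowercase each x that is an element of uppernewsymbolslist"
newsymbolslist = [x.lower() for x in uppernewsymbolslist]
print(newsymbolslist)  # ['aapl', 'goog', 'msft']
```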
Part 3. Quicker Method to scrape the price of multiple companies
import urllib
import re
import json
NASfile = open("stocklist.txt")
NASread = NASfile.read()
NASlist = NASread.split("\n")
i = 0
while i < len(NASlist):
    url = "http://www.bloomberg.com/markets/watchlist/recent-ticker/"+NASlist[i]+":US"
    htmltext = urllib.urlopen(url)
1.     data = json.load(htmltext) # json.load() takes a file object; passing it a string raises an error
    # the parsed data is a dictionary (hash map), so every value is looked up through a key
    print data["disp_name"],":", data["last_price"] # works like an array, but indexed by a key, not a number
    i += 1
You can easily find a JSON file by inspecting any website with the "developer tools" in Chrome. In most cases, the JSON file contains the data shown on the website. If you scrape more than 200 companies' stock prices with the method in part 2, it might take more than 10 minutes to complete. Scraping it through "Bloomberg" pages, which contain lots of heavy Flash files, would be even worse. However, if you just scrape the price through JSON files, which are simple text files, the task might take only a few seconds. The code above shows how to do it this way.
1. The biggest benefit of the 'json' module is that it parses the data into a dictionary. The difference from a list is that you look data up not by an index, but by a key.
In this code, we stored the parsed JSON in the variable 'data'. You can read a value from it in the form 'data["key"]'.
It is important to keep in mind that json.load() expects a file object, not a string (for a string, use json.loads()).
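The load/loads distinction can be sketched without hitting Bloomberg at all. The payload below is invented, but shaped like the response the code above expects (the keys disp_name and last_price come from that code):

```python
import json

# Invented JSON text shaped like the Bloomberg response used above.
raw = '{"disp_name": "Apple Inc", "last_price": 101.42}'

data = json.loads(raw)  # loads() parses a string; load() takes a file object
print(data["disp_name"], ":", data["last_price"])  # look values up by key, not index
```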
Part 4. Printing out a result
import urllib
import json
#Open up the stocklist.txt
NASfile = open("stocklist.txt")
NASload = NASfile.read()
NASlist = NASload.split("\n")
for symbol in NASlist:
1.     myfile = open("/home/hansjung/Python/stockprogram/"+symbol+".txt","w+")
    myfile.close()
    htmltext = urllib.urlopen("http://www.bloomberg.com/markets/chart/data/1D/"+symbol+":US")
    data = json.load(htmltext) # when we use json.load(), we don't call .read()
    datapoints = data["data_values"]
2.     myfile = open("/home/hansjung/Python/stockprogram/"+symbol+".txt","a")
    for point in datapoints:
3.         myfile.write(symbol+","+str(point[0])+","+str(point[1])+"\n") # one line per data point
4.     myfile.close()
The basic algorithm for printing to a file (writing results into a file) is the following.
- (Open) Create the file that will hold your result, then (Close) it
- (Open) the file again in append mode
- Write your result, then (Close) the file
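The open/append/close cycle above can be sketched with made-up data and a temporary file (the ticker and the [time, price] pairs are invented):

```python
import os
import tempfile

symbol = "aapl"  # hypothetical ticker
datapoints = [[1415000000, 101.42], [1415000060, 101.5]]  # invented [time, price] pairs

path = os.path.join(tempfile.gettempdir(), symbol + ".txt")

myfile = open(path, "w+")  # (open) create/empty the file
myfile.close()             # (close)

myfile = open(path, "a")   # (open) reopen in append mode
for point in datapoints:
    # write one comma-separated line per data point
    myfile.write(symbol + "," + str(point[0]) + "," + str(point[1]) + "\n")
myfile.close()             # (close)

written = open(path).read()
print(written)
```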
Part 5. Quicker method - Using multi-threading
1. from threading import Thread
import urllib
import re
2. def th(ur):
    base = "http://finance.yahoo.com/q?s="+ur
    regex = '<span id="yfs_l84_'+ur+'">(.+?)</span>'
    pattern = re.compile(regex)
    htmltext = urllib.urlopen(base).read()
    results = re.findall(pattern,htmltext)
    print "the price of",str(ur),"is",str(results[0])
stocklist = open("stocklist.txt").read()
stocklist = stocklist.split("\n")
print stocklist
3. threadlist = []
4. for u in stocklist:
    t = Thread(target=th, args=(u,))
    t.start()
    threadlist.append(t)
5. for b in threadlist:
    b.join()
Multi-threading enables your program to do several tasks at once. Basically, computer programs execute their statements in a fixed sequence. In contrast, with multi-threading, a program can run several pieces of work concurrently. The basic algorithm for this program is the following.
- Define the task to be run concurrently as a function
- Create the list 'threadlist' that will store the thread handles
- Point a Thread at the function from step 1, set its parameter, and start the task
- Store each started thread into the list created in step 2
Honestly, this method can be difficult to use, because almost all programmers are used to sequence-based programming. But if we can use multi-threading properly, our work can be far more efficient.
1. Import the modules. "from A import B" means: from module A, import the name B.
2. Define the task that you want to run in each thread.
3. Create the empty list 'threadlist'. (We need it so that we can wait for every thread at the end.)
4. Create a thread for each symbol, start it, and save the Thread object into threadlist.
5. Call join() on each thread, which blocks until every thread has finished its task.
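A network-free sketch of the same start/append/join pattern, where the scraping function is replaced by a stub that records an invented price into a shared dictionary:

```python
from threading import Thread

results = {}  # shared dict; safe here since each thread writes a distinct key

def th(symbol, price):
    # stub for the scraping task: just record the given (invented) price
    results[symbol] = price

stocklist = ["aapl", "goog", "msft"]  # hypothetical tickers

threadlist = []
for i, u in enumerate(stocklist):
    t = Thread(target=th, args=(u, 100 + i))
    t.start()             # run th(u, ...) concurrently
    threadlist.append(t)  # keep the handle so we can join() it later

for b in threadlist:
    b.join()              # block until every thread has finished

print(results)
```

join() is the reason we keep the Thread objects in a list: without it, the main program could end before the threads have printed their results.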