Tuesday, January 28, 2014

Python Tutorial Part 1 - A Program for Scraping Stock Prices

Main Purpose: Build my own stock price scraper

Source: Chris Reeves' Python tutorial

Recently, I found a good tutorial that got me interested in Python: Chris Reeves' Python tutorial. I want to summarize what I have learned from it. The main purpose of the tutorial is to build my own web spider
(web bot, web crawler, or whatever it is called).

The first ten tutorials deal with building a program that scrapes the stock prices of companies listed on NASDAQ. In this post (Part 1), I am going to review the Python code for study.

Part 1. Stock Price - A single company, say, Apple

This is the code for scraping the last price of Apple (AAPL) from Yahoo Finance.
 
# 1. import the modules
import urllib
import re

# 2. read the HTML of the quote page as text
htmlfile = urllib.urlopen("http://finance.yahoo.com/q?s=AAPL&ql=1")
htmltext = htmlfile.read()

# 3. the last price on the page is wrapped in this span
regex = '<span id="yfs_l84_aapl">(.+?)</span>'
pattern = re.compile(regex)

price = re.findall(pattern, htmltext)

print price

The main idea of this code is to read the HTML of the page and pick out the part of the HTML source that holds the last price.

1. Import the modules "urllib" and "re". "urllib" is a module for fetching and reading HTML, and "re" is the regular-expression module, which is useful for handling character strings.
2.

htmlfile = urllib.urlopen(url)
htmltext = urllib.urlopen(url).read()

reads the HTML of the website. In this code we read the HTML as text, not as a file object, so that we can search it for a certain keyword with the 're' module. (Reading the HTML as a file object is useful for the BeautifulSoup approach, which we will deal with later.)
3. Here is a useful link for learning regular expressions (it is written in Korean, since I am Korean; you can easily find well-written tutorials in other languages too).
'<span id="yfs_l84_aapl">(.+?)</span>'
In regex, parentheses '()' group part of a pattern and capture whatever that part matches; for example, in the pattern Set (Value), the parentheses capture 'Value' as a group when the whole pattern matches "Set Value". A dot '.' matches any single character, '+' means "one or more" (in other words, not empty), and '?' makes the match non-greedy, so it stops at the first closing tag. Altogether, (.+?) captures the shortest non-empty string between the opening and closing span tags, which is the price itself.
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
re.compile(regex) turns the regex into a pattern object, and re.findall(pattern, htmltext) returns a list of every substring of htmltext that the captured group matches.


Part 2. Stock Price - Multiple companies

This is the code for scraping the price of every symbol listed on NASDAQ. First, we have to download the list of ticker symbols of the companies listed on NASDAQ; you can easily find it by googling.

import urllib
import re

# 1. read the symbol file into one string
symbolfile = open("stocklist.txt")
symbolslist = symbolfile.read()
# 2. split the text into a list, one symbol per line
uppernewsymbolslist = symbolslist.split("\n")
# 3. lowercase every symbol (the span id we match against uses lowercase tickers)
newsymbolslist = [x.lower() for x in uppernewsymbolslist]
i = 0

while i < len(newsymbolslist):
        url = "http://finance.yahoo.com/q?s=" + newsymbolslist[i] + "&ql=1"
        htmltext = urllib.urlopen(url).read()
        regex = '<span id="yfs_l84_' + newsymbolslist[i] + '">(.+?)</span>'
        pattern = re.compile(regex)
        price = re.findall(pattern,htmltext)
        print "the price of",uppernewsymbolslist[i],"is", price
        i += 1
1.
file = open("directory of file/filename")
file_read_text = file.read()
This is how we have Python read a file into a string.
2. split("\n") turns that string into a list, one symbol per line.
3.
newsymbolslist = [x.lower() for x in uppernewsymbolslist]
It is pretty interesting that this command actually works; it is one reason Python is such a strong language. It reads almost the way we talk: "lowercase x, for each x in the list uppernewsymbolslist." We need the lowercase symbols because the span id we match against uses lowercase tickers.
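
As a quick sketch of steps 2 and 3 together (the symbols below are made up for the example):

text = "AAPL\nGOOG\nMSFT"  # pretend this string came from stocklist.txt
uppernewsymbolslist = text.split("\n")  # ['AAPL', 'GOOG', 'MSFT']
newsymbolslist = [x.lower() for x in uppernewsymbolslist]
print newsymbolslist  # ['aapl', 'goog', 'msft']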



Part 3. A quicker method to scrape the prices of multiple companies

import urllib
import re
import json

NASfile = open("stocklist.txt")
NASread = NASfile.read()
NASlist = NASread.split("\n")

i = 0
while i < len(NASlist):
        url = "http://www.bloomberg.com/markets/watchlist/recent-ticker/"+NASlist[i]+":US"
        htmltext = urllib.urlopen(url)

        # 1. json.load takes a file-like object; passing plain text instead would cause an error
        data = json.load(htmltext)
        # the parsed JSON works like a hash map, so each value is looked up by its key

        print data["disp_name"], ":", data["last_price"]  # accessed like an array, but by key rather than by index
        i += 1

You can easily find a JSON file by inspecting a website with the developer tools in Chrome. In most cases, such a JSON file contains the data displayed on the website. If you scrape more than 200 companies' stock prices with the method in Part 2, it might take more than 10 minutes to complete, and going through the Bloomberg pages themselves, which are full of heavy Flash content, would be even worse. However,
if you scrape the prices straight from the JSON files, which are just simple text, the whole task might take only a few seconds. The code above shows how to do it this way.

1. The biggest benefit of the 'json' module is that it loads the data into a dictionary-like structure. The difference from a list is that you look values up not by an index but by a key.
In this code, the parsed JSON is stored in the variable 'data', and you read a value with 'data["key"]'.
It is important to keep in mind that json.load expects a file-like object, not a text string.
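
To make the file-versus-text point concrete, here is a minimal sketch; the JSON snippet is made up for the example, and StringIO is only used to fake a file-like object:

import json
import StringIO

raw = '{"disp_name": "Apple Inc", "last_price": 550.5}'  # made-up example data
data = json.loads(raw)  # json.loads parses a text string
print data["disp_name"], ":", data["last_price"]

fakefile = StringIO.StringIO(raw)  # a file-like object wrapping the same text
data2 = json.load(fakefile)  # json.load parses a file-like object, like the urlopen result above
print data2["last_price"]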



Part 4. Printing out a result

import urllib
import json

#Open up the stocklist.txt
NASfile = open("stocklist.txt")
NASload = NASfile.read()
NASlist = NASload.split("\n")

for symbol in NASlist:
        # 1. (open) create an empty output file for this symbol (close)
        myfile = open("/home/hansjung/Python/stockprogram/"+symbol+".txt","w+")
        myfile.close()

        htmltext = urllib.urlopen("http://www.bloomberg.com/markets/chart/data/1D/"+symbol+":US")
        data = json.load(htmltext)  # with json.load we pass the file object directly, no .read()
        datapoints = data["data_values"]

        # 2. (open) reopen the file in append mode
        myfile = open("/home/hansjung/Python/stockprogram/"+symbol+".txt","a")

        # 3. write each data point as one comma-separated line
        for point in datapoints:
                myfile.write(symbol + "," + str(point[0]) + "," + str(point[1]) + "\n")
        # 4. close the file only after all points are written
        myfile.close()

The basic algorithm for printing results to a file (writing them into a file) is as follows.

  1. (Open) Create a file that will hold your result (Close)
  2. (Open) Reopen the file in append mode
  3. Write your result, then (Close)
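
The same open/append/close pattern can also be written with a 'with' block, which closes the file automatically; here is a minimal sketch, with the symbol and data points made up for the example:

datapoints = [[0, 550.1], [60, 550.4]]  # made-up example data: [time, price] pairs
symbol = "aapl"  # made-up example symbol

with open(symbol + ".txt", "w") as myfile:  # the with block closes the file for us
        for point in datapoints:
                myfile.write(symbol + "," + str(point[0]) + "," + str(point[1]) + "\n")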



Part 5. Quicker method - Using Multi-thread

# 1. import the Thread class and the other modules
from threading import Thread
import urllib
import re

# 2. define the task that each thread will run
def th(ur):
        base = "http://finance.yahoo.com/q?s="+ur
        regex = '<span id="yfs_l84_' + ur.lower() + '">(.+?)</span>'
        pattern = re.compile(regex)
        htmltext = urllib.urlopen(base).read()
        results = re.findall(pattern,htmltext)
        print "the price of",str(ur),"is",str(results[0])

stocklist = open("stocklist.txt").read()
stocklist = stocklist.split("\n")
print stocklist

# 3. a list that keeps track of every thread we start
threadlist = []

# 4. create one thread per symbol, start it, and remember it
for u in stocklist:
        t = Thread(target=th, args=(u,))
        t.start()
        threadlist.append(t)

# 5. wait until every thread has finished
for b in threadlist:
        b.join()

Multi-threading lets your program work on several tasks at once. Basically, a computer program executes its statements one after another in a fixed sequence; with multi-threading, it can handle several requests at the same time instead. The basic algorithm for this program is as follows.

  1. Define the task to be run in parallel as a function
  2. Create the list 'threadlist' that keeps track of the threads
  3. Create a thread that points at the function from step 1, pass it a parameter, and start it
  4. Store each started thread in the list built in step 2

Honestly, this method is not easy to use, because almost all programmers are used to sequential programming. But if we can use multi-threading properly, our work can be much more efficient.

1. Import the modules. 'from A import B' means: from module (or package) A, import the name B.
2. Define the task you want to run in its own thread.
3. Create the empty list 'threadlist' (at first I did not see why we need it, but it is what lets step 5 wait for every thread).
4. Create a thread, start it, and append it to threadlist. 5. Join every thread in threadlist, which means waiting until all of them have finished.
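
If you want the threads to store their results instead of just printing them, one option is a shared dictionary guarded by a lock; here is a minimal sketch under that assumption, where fetch_price and its fake result are made up for the illustration:

from threading import Thread, Lock

prices = {}
lock = Lock()

def fetch_price(symbol):
        price = len(symbol)  # stand-in for the real scraping work
        with lock:  # only one thread updates the shared dictionary at a time
                prices[symbol] = price

threads = [Thread(target=fetch_price, args=(s,)) for s in ["AAPL", "GOOG", "MSFT"]]
for t in threads:
        t.start()
for t in threads:
        t.join()
print prices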

1 comment:

  1. Thanks for the post, Hansjung. I just need to know: can we also get any other information from Bloomberg apart from the last price?

    Thanks in advance
