Source : Chris Reeves' Python tutorial
Recently, I found a good tutorial that got me interested in Python:
Chris Reeves' Python tutorial. I want to summarize what I have learned from it. The main purpose of the tutorial is to build my own
web spider (web bot, web crawler, or whatever you want to call it).
The first ten tutorials build a program that scrapes the stock prices of companies registered on the NASDAQ list. In this post (part 1), I am going to review the Python code.
Part 1. Stock Price - A single company, say, Apple
This is the code for scraping the last price of Apple (AAPL) from
Yahoo Finance - Apple
1. import urllib
import re
htmlfile = urllib.urlopen("http://finance.yahoo.com/q?s=AAPL&ql=1")
2. htmltext = htmlfile.read()
regex = '<span id="yfs_l84_aapl">(.+?)</span>'
pattern = re.compile(regex)
3. price = re.findall(pattern,htmltext)
print price
The main idea of this code is to read the HTML source of the page and scrape the part of it that contains the last price.
1. Import the modules "urllib" and "re". "urllib" is a module for fetching and reading HTML, and "re" is the regular expression module, which is useful for working with character strings.
2.
htmlfile = urllib.urlopen(url)
htmltext = urllib.urlopen(url).read()
reads the HTML of the website. In this code, we read the HTML as text, not as a file, so that we can search it for a certain keyword with the 're' module. (Reading the HTML as a file is useful with the BeautifulSoup approach, which we will deal with later.)
3. This is a useful link for learning regular expressions:
Regular Expression (it is written in Korean, because I am Korean. You can easily find other well-written tutorials.)
'<span id="yfs_l84_aapl">(.+?)</span>'
'()', in regex, is used for grouping. For example, "Set( Value)?" matches both "Set" and "Set Value". A dot (.) matches any single character, and "+" means "one or more", i.e. not empty. The "?" after "+" makes the match non-greedy, so it stops at the first closing tag. Altogether, (.+?) matches any non-empty string between the opening <span> tag and the closing </span> tag.
pattern = re.compile(regex)
price = re.findall(pattern,htmltext)
"re.compile(regex)" compiles the regex into a pattern object, and "re.findall(pattern,htmltext)" returns a list of every substring of htmltext captured by the group.
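The compile/findall pair above can be tried on a toy snippet. This sketch uses Python 3 syntax (the tutorial's code is Python 2) and a made-up HTML fragment; the span id and the price 101.42 are just illustrative values:

```python
import re

# Toy stand-in for the Yahoo Finance page source; the id and price are made up.
htmltext = '<span id="yfs_l84_aapl">101.42</span>'

regex = '<span id="yfs_l84_aapl">(.+?)</span>'
pattern = re.compile(regex)            # compile the regex into a reusable pattern object
price = re.findall(pattern, htmltext)  # list of every string captured by the group
print(price)                           # ['101.42']
```

findall returns a list, which is why part 1 prints something like ['101.42'] rather than a bare number.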
Part 2. Stock Prices - Multiple companies
This is the code for scraping the price of every company registered on NASDAQ. First, we have to download the list of ticker symbols of the companies listed on NASDAQ. You can easily find it by googling.
import urllib
import re
symbolfile = open("stocklist.txt")
1. symbolslist = symbolfile.read()
2. uppernewsymbolslist = symbolslist.split("\n")
3. newsymbolslist = [x.lower() for x in uppernewsymbolslist]
i = 0
while i < len(newsymbolslist):
    url = "http://finance.yahoo.com/q?s="+newsymbolslist[i]+"&ql=1"
    htmltext = urllib.urlopen(url).read()
    regex = '<span id="yfs_l84_'+newsymbolslist[i]+'">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern,htmltext)
    print "the price of",uppernewsymbolslist[i],"is", price
    i += 1
1.
file = open("directory of file/filename")
file_read_text = file.read()
This is how we have Python read a file.
2.
'split("\n")' splits the text at every newline and saves the pieces into a list, one element per line.
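A minimal sketch of this step, using an in-memory string instead of stocklist.txt (the three tickers are made up):

```python
# Pretend this string is what symbolfile.read() returned from stocklist.txt.
symbolslist = "AAPL\nGOOG\nMSFT"

# split("\n") cuts the text at each newline: one list element per line.
uppernewsymbolslist = symbolslist.split("\n")
print(uppernewsymbolslist)  # ['AAPL', 'GOOG', 'MSFT']
```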
3.
newsymbolslist = [x.lower() for x in uppernewsymbolslist]
It is pretty interesting that this command actually works; this is why Python is such a strong language. It reads almost like how we talk: "lowercase x, for each x that is an element of the list uppernewsymbolslist," and the results are collected into a new list.
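The same list comprehension, sketched on made-up data in Python 3 syntax:

```python
uppernewsymbolslist = ['AAPL', 'GOOG', 'MSFT']  # hypothetical ticker list

# "lowercase each x that is an element of uppernewsymbolslist"
newsymbolslist = [x.lower() for x in uppernewsymbolslist]
print(newsymbolslist)  # ['aapl', 'goog', 'msft']
```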
Part 3. Quicker Method to scrape the price of multiple companies
import urllib
import re
import json
NASfile = open("stocklist.txt")
NASread = NASfile.read()
NASlist = NASread.split("\n")
i = 0
while i < len(NASlist):
    url = "http://www.bloomberg.com/markets/watchlist/recent-ticker/"+NASlist[i]+":US"
    htmltext = urllib.urlopen(url)
1.     data = json.load(htmltext) # json.load() takes a file object; passing it a string raises an error
    # the parsed data is a dictionary (hash map), so every value is looked up through a key
    print data["disp_name"],":", data["last_price"] # works like an array, but indexed by a key, not a number
    i += 1
You can easily find a JSON file by inspecting any website with the "developer tools" in Chrome. In most cases, the JSON file contains the data shown on the website. If you scrape more than 200 companies' stock prices with the method in part 2, it might take more than 10 minutes to complete. Scraping it through "Bloomberg" pages, which contain lots of heavy Flash files, would be even worse. However, if you just scrape the price through JSON files, which are simple text files, the task might take only a few seconds. The code above shows how to do it this way.
1. The biggest benefit of the 'json' module is that it parses the data into a dictionary. The difference from a list is that you look data up not by an index, but by a key.
In this code, we stored the parsed JSON in the variable 'data'. You can read a value from it in the form 'data["key"]'.
It is important to keep in mind that json.load() expects a file object, not a string (for a string, use json.loads()).
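The load/loads distinction can be sketched without hitting Bloomberg at all. The payload below is invented, but shaped like the response the code above expects (the keys disp_name and last_price come from that code):

```python
import json

# Invented JSON text shaped like the Bloomberg response used above.
raw = '{"disp_name": "Apple Inc", "last_price": 101.42}'

data = json.loads(raw)  # loads() parses a string; load() takes a file object
print(data["disp_name"], ":", data["last_price"])  # look values up by key, not index
```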
Part 4. Printing out a result
import urllib
import json
#Open up the stocklist.txt
NASfile = open("stocklist.txt")
NASload = NASfile.read()
NASlist = NASload.split("\n")
for symbol in NASlist:
1.     myfile = open("/home/hansjung/Python/stockprogram/"+symbol+".txt","w+")
    myfile.close()
    htmltext = urllib.urlopen("http://www.bloomberg.com/markets/chart/data/1D/"+symbol+":US")
    data = json.load(htmltext) # when we use json.load(), we don't call .read()
    datapoints = data["data_values"]
2.     myfile = open("/home/hansjung/Python/stockprogram/"+symbol+".txt","a")
    for point in datapoints:
3.         myfile.write(symbol+","+str(point[0])+","+str(point[1])+"\n") # one line per data point
4.     myfile.close()
The basic algorithm for printing to a file (writing results into a file) is the following.
- (Open) Create the file that will hold your result, then (Close) it
- (Open) the file again in append mode
- Write your result, then (Close) the file
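The open/append/close cycle above can be sketched with made-up data and a temporary file (the ticker and the [time, price] pairs are invented):

```python
import os
import tempfile

symbol = "aapl"  # hypothetical ticker
datapoints = [[1415000000, 101.42], [1415000060, 101.5]]  # invented [time, price] pairs

path = os.path.join(tempfile.gettempdir(), symbol + ".txt")

myfile = open(path, "w+")  # (open) create/empty the file
myfile.close()             # (close)

myfile = open(path, "a")   # (open) reopen in append mode
for point in datapoints:
    # write one comma-separated line per data point
    myfile.write(symbol + "," + str(point[0]) + "," + str(point[1]) + "\n")
myfile.close()             # (close)

written = open(path).read()
print(written)
```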
Part 5. Quicker method - Using multi-threading
1. from threading import Thread
import urllib
import re
2. def th(ur):
    base = "http://finance.yahoo.com/q?s="+ur
    regex = '<span id="yfs_l84_'+ur+'">(.+?)</span>'
    pattern = re.compile(regex)
    htmltext = urllib.urlopen(base).read()
    results = re.findall(pattern,htmltext)
    print "the price of",str(ur),"is",str(results[0])
stocklist = open("stocklist.txt").read()
stocklist = stocklist.split("\n")
print stocklist
3. threadlist = []
4. for u in stocklist:
    t = Thread(target=th, args=(u,))
    t.start()
    threadlist.append(t)
5. for b in threadlist:
    b.join()
Multi-threading enables your program to do several tasks at once. Basically, computer programs execute their statements in a fixed sequence. In contrast, with multi-threading, a program can run several pieces of work concurrently. The basic algorithm for this program is the following.
- Define the task to be run concurrently as a function
- Create the list 'threadlist' that will store the thread handles
- Point a Thread at the function from step 1, set its parameter, and start the task
- Store each started thread into the list created in step 2
Honestly, this method can be difficult to use, because almost all programmers are used to sequence-based programming. But if we can use multi-threading properly, our work can be far more efficient.
1. Import the modules. "from A import B" means: from module A, import the name B.
2. Define the task that you want to run in each thread.
3. Create the empty list 'threadlist'. (We need it so that we can wait for every thread at the end.)
4. Create a thread for each symbol, start it, and save the Thread object into threadlist.
5. Call join() on each thread, which blocks until every thread has finished its task.
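A network-free sketch of the same start/append/join pattern, where the scraping function is replaced by a stub that records an invented price into a shared dictionary:

```python
from threading import Thread

results = {}  # shared dict; safe here since each thread writes a distinct key

def th(symbol, price):
    # stub for the scraping task: just record the given (invented) price
    results[symbol] = price

stocklist = ["aapl", "goog", "msft"]  # hypothetical tickers

threadlist = []
for i, u in enumerate(stocklist):
    t = Thread(target=th, args=(u, 100 + i))
    t.start()             # run th(u, ...) concurrently
    threadlist.append(t)  # keep the handle so we can join() it later

for b in threadlist:
    b.join()              # block until every thread has finished

print(results)
```

join() is the reason we keep the Thread objects in a list: without it, the main program could end before the threads have printed their results.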