Main Purpose: Build our own Web scraper
Source: Chris Reeves' Python tutorial
In this post, we will use some useful modules for dealing with HTML, such as BeautifulSoup and Mechanize. This is also based on Chris Reeves' Python tutorial.
Part 1. Page explorer - Beautiful Soup and Mechanize
import urllib
import urlparse
import mechanize
from bs4 import BeautifulSoup

url = raw_input("Input URL you want to scrape: ")
print urlparse.urlparse(url).hostname

br = mechanize.Browser()                     # (1)
urls = [url]                                 # (2)
visited = [url]
while len(urls) > 0:                         # (2)
    try:                                     # (3)
        br.open(urls[0])                     # (1)
        urls.pop(0)                          # (2)
        for link in br.links():              # (4)
            newurl = urlparse.urljoin(link.base_url, link.url)
            b1 = urlparse.urlparse(newurl).hostname
            b2 = urlparse.urlparse(newurl).path
            newurl = "http://" + b1 + b2
            if newurl not in visited and urlparse.urlparse(url).hostname in newurl:   # (5)
                visited.append(newurl)
                urls.append(newurl)
                print newurl
    except:                                  # (3)
        urls.pop(0)
This is the classic page-explorer (spider) program. Here is the algorithm:
- Open up one stack (urls) and one box (visited) that keeps a historical record of the spider's behavior.
- Start the while loop, which runs as long as the stack is not empty. Starting with the base URL in the stack, we find every sub-URL of the base URL and push what we find onto the stack. Every time the spider visits a page, that page is removed from the stack. Because we pop from the front and append to the back, the list behaves like a queue, so the spider sweeps each level horizontally first (breadth-first).
- Once the program has read a page's HTML, delete that page from the stack with urls.pop(0).
- The expression below is also a classic way to rebuild a clean sub-URL:
"http://" + urlparse.urlparse(url).hostname + urlparse.urlparse(url).path
- If such a URL is not in the visited box and the base hostname appears in the new sub-URL, keep working. A minimal sketch of this stack-and-box workflow is shown below.
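Here is a minimal sketch of the stack-and-box bookkeeping, assuming a made-up set of pages (fake_links and example.com are hypothetical); it only illustrates the traversal order, with no real HTTP requests.

# Hypothetical page graph standing in for real links found on each page
fake_links = {
    "http://example.com": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
}

urls = ["http://example.com"]        # the stack (really a FIFO queue)
visited = ["http://example.com"]     # the box of historical records

while len(urls) > 0:
    page = urls.pop(0)                           # read a page, then drop it from the queue
    for newurl in fake_links.get(page, []):
        if newurl not in visited:                # skip pages we have already seen
            visited.append(newurl)
            urls.append(newurl)
            print newurl                         # pages print level by level (breadth-first)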
Here are some important pieces of code that we should remember.
1.
br = mechanize.Browser()   ## br is told to behave like a web browser
br.open(url)               ## open a website and bring back its HTML as a file-like response
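As a quick, hedged usage example (example.com is just a placeholder URL), the browser object can be used like this:

br = mechanize.Browser()
response = br.open("http://example.com")   # placeholder URL
html = response.read()                     # the page's HTML as a string
print br.title()                           # title of the page the browser is currently on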
2.
urls = [url]            # start the list with the base URL
while len(urls) > 0:    # keep going until the list is empty
    ...
    urls.pop(0)         # after a page has been read, remove it from the list
3. try/except. If an error occurs inside the try block, the command written after except runs instead. If the website we are crawling contains a broken hyperlink, we can treat it as an error and just keep going on to the next task, as the short sketch below shows.
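A minimal sketch of this error handling, assuming a hypothetical list of links in which the second one is broken:

br = mechanize.Browser()
urls = ["http://example.com/good", "http://no-such-host.invalid"]   # second link is broken
while len(urls) > 0:
    try:
        br.open(urls[0])      # raises an exception on a dead or malformed link
        urls.pop(0)           # success: remove the page and keep scraping it
    except:
        urls.pop(0)           # failure: drop the bad link and move on to the next one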
4.
for link in br.links():   ## br.links() walks over every link on the page; 'link' is the key we use to move through it
    link.base_url         ## the base URL: the page the robot is exploring right now
    link.url              ## the target of each link, used to find every sub-URL
    newurl = urlparse.urljoin(link.base_url, link.url)
    b1 = urlparse.urlparse(newurl).hostname
    b2 = urlparse.urlparse(newurl).path
    newurl = "http://" + b1 + b2    ## a cleverer way to build the sub-URL that avoids the JavaScript trap
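To see what urljoin does, here is a short example with a made-up base page and hrefs (example.com and other.com are placeholders):

import urlparse

base = "http://example.com/news/"                      # hypothetical current page
print urlparse.urljoin(base, "story.html")             # http://example.com/news/story.html
print urlparse.urljoin(base, "/about")                 # http://example.com/about
print urlparse.urljoin(base, "http://other.com/x")     # absolute links pass through unchanged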
5.
if newurl not in visited and urlparse.urlparse(url).hostname in newurl:   ## 1) not already in the historical data, 2) stays in the same site tree
    visited.append(newurl)
    urls.append(newurl)
    print newurl
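As a hedged illustration (example.com and the candidate links are made up), the condition keeps only URLs that are unseen and stay on the same host:

import urlparse

url = "http://example.com"                       # hypothetical base URL
visited = ["http://example.com"]
candidates = ["http://example.com/about",        # same host, not seen yet -> kept
              "http://othersite.com/page",       # different host          -> skipped
              "http://example.com"]              # already visited         -> skipped
for newurl in candidates:
    if newurl not in visited and urlparse.urlparse(url).hostname in newurl:
        visited.append(newurl)
        print "keep:", newurl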
Part 2. Use BeautifulSoup
# Another way to pull apart the elements of a URL (not just the raw "/..." form)
import urllib
from bs4 import BeautifulSoup
import urlparse

url = "http://nytimes.com"
htmltext = urllib.urlopen(url)
soup = BeautifulSoup(htmltext)

for tag in soup.find_all('a', href=True):          # (1)
    raw = tag['href']
    b1 = urlparse.urlparse(tag['href']).hostname   # (2)
    b2 = urlparse.urlparse(tag['href']).path
    print str(b1) + str(b2)                        # *

# * Spider trap:
# We don't reuse the raw href approach from the last tutorial, because the web
# server sometimes hands each user a session ID inside the URL, and the spider
# would then mistake one page for many different websites.
# That is why it is good to work with urlparse's hostname and path instead.
1.
for tag in soup.find_all('a',href = True)
It finds every tag of the form <a href="...">, i.e. every anchor that actually has an href attribute. For example:
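A small, self-contained illustration (the HTML snippet is made up):

from bs4 import BeautifulSoup

sample = '<a href="/world">World</a> <a name="top">no href</a>'   # hypothetical HTML
soup = BeautifulSoup(sample)
for tag in soup.find_all('a', href=True):
    print tag['href']   # prints only "/world"; the <a> without href is skipped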
2.
urlparse.urlparse(tag['href'])
This splits the href found in step 1 into its components (scheme, hostname, path, and so on). The result is a ParseResult object rather than a plain string, and .hostname can even be None for a relative link, which is why the code wraps it in str() before printing.
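For instance (the URLs here are just illustrative):

import urlparse

parts = urlparse.urlparse("http://nytimes.com/section/world")
print parts.hostname                                   # nytimes.com
print parts.path                                       # /section/world
print urlparse.urlparse("/section/world").hostname     # None (a relative link has no hostname)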
Part 3. Find the smartest way to build a sub-URL
import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize   # simulate the browser

url = "http://sparkbrowser.com"
count = 0

# The smartest method ==> no hash anchors such as sparkbrowser.com#top
br = mechanize.Browser()   # just copy how the browser acts
br.open(url)
for link in br.links():
    newurl = urlparse.urljoin(link.base_url, link.url)
    b1 = urlparse.urlparse(newurl).path
    b2 = urlparse.urlparse(newurl).hostname
    print "http://" + b2 + b1
    count += 1
print count
# In this way, we don't include JavaScript anchors such as #top

# Method_3. DOES NOT WORK here: urlparse.urlparse(tag['href']).path / .hostname
'''
htmlfile = urllib.urlopen(url)
soup = BeautifulSoup(htmlfile)
for tag in soup.find_all('a', href=True):
    b1 = urlparse.urlparse(tag['href']).hostname
    b2 = urlparse.urlparse(tag['href']).path
    print "http://" + str(b1) + str(b2)
'''

# Method_1. Use BeautifulSoup
'''
htmlfile = urllib.urlopen(url)
soup = BeautifulSoup(htmlfile)
for tag in soup.find_all('a', href=True):
    print tag['href']
'''

# Method_2. Use mechanize
'''
br = mechanize.Browser()
br.open(url)
for link in br.links():
    newurl = urlparse.urljoin(link.base_url, link.url)
    print newurl
    count += 1
print count
'''
The best way was already introduced in Part 1. The second-best way is to use BeautifulSoup, but it cannot filter out some JavaScript traps (anchor fragments such as #top).
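A minimal sketch of why the hostname-plus-path trick removes the trap, reusing the sparkbrowser.com base URL from the code above:

import urlparse

newurl = urlparse.urljoin("http://sparkbrowser.com", "#top")   # a link that is only a fragment
print newurl                                                   # http://sparkbrowser.com#top
p = urlparse.urlparse(newurl)
print "http://" + p.hostname + p.path                          # http://sparkbrowser.com  (the #top trap is gone)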