Tuesday, January 28, 2014

Python Tutorial Part 2 - Page Scraper

Main purpose: build our own web scraper

Source: Chris Reeves' Python tutorial

In this post, we will use some useful modules for working with HTML, such as BeautifulSoup and Mechanize. This material is also based on Chris Reeves' Python tutorial.


Part 1. Page explorer - Beautiful Soup and Mechanize

import urllib
import urlparse
import mechanize
from bs4 import BeautifulSoup

url = raw_input("Input URL you want to scrape: ")
print urlparse.urlparse(url).hostname
br = mechanize.Browser()                # (1)

urls = [url]                            # (2)
visited = [url]

while len(urls) > 0:                    # (2)
    try:                                # (3)
        br.open(urls[0])                # (1)
        urls.pop(0)                     # (2)
        for link in br.links():         # (4)
            newurl = urlparse.urljoin(link.base_url, link.url)
            b1 = urlparse.urlparse(newurl).hostname
            b2 = urlparse.urlparse(newurl).path
            newurl = "http://" + b1 + b2

            if newurl not in visited and urlparse.urlparse(url).hostname in newurl:   # (5)
                visited.append(newurl)
                urls.append(newurl)
                print newurl
    except:                             # (3)
        urls.pop(0)

This is the classic page-explorer (spider) program. Here is the algorithm; the numbers match the comments in the code above.

  1. Open one stack (urls) and one list (visited) that keeps the historical record of the spider's visits.
  2. Start the while loop, which runs until the stack is empty. Starting with the base URL in the stack, we find every sub-URL of the base URL and push what we find onto the stack. Every time the spider visits a page, that page is removed from the stack. The workflow is breadth-first: each level of the site is swept horizontally before the spider moves deeper (a minimal sketch of this ordering follows the list).
  3. Once the program has read the HTML, delete that URL with urls.pop(0).
  4. This is also a classic way to rebuild every sub-URL:
    "http://" + urlparse.urlparse(url).hostname + urlparse.urlparse(url).path

  5. If the new URL is not in the visited list and the base hostname appears in it, keep working.
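
To see why step 2 gives a "horizontal first" (breadth-first) sweep, here is a minimal sketch that replaces the live website with a hypothetical, hard-coded page-to-links mapping:

pages = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/a1"],
    "http://example.com/b": [],
    "http://example.com/a1": [],
}

urls = ["http://example.com/"]
visited = ["http://example.com/"]

while len(urls) > 0:
    page = urls.pop(0)          # take the oldest URL first -> horizontal sweep
    for newurl in pages[page]:
        if newurl not in visited:
            visited.append(newurl)
            urls.append(newurl)
            print newurl        # prints /a and /b first, then /a1

Because new URLs go to the back of the list and urls.pop(0) takes from the front, each level of the site is printed before the next one.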

Here are some important pieces of code that we have to remember.

1.

br = mechanize.Browser() ## br behaves like a web browser
br.open(url) ## open a website and fetch its HTML
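
As a small sketch (the URL is only a placeholder), br.open() also returns a response object whose HTML can be read directly:

import mechanize

br = mechanize.Browser()
response = br.open("http://www.example.com")
html = response.read()      # the raw HTML, as a string
print br.title()            # mechanize also parses the <title> tag
print len(html)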

2.

urls = [url]
while len(urls) > 0:
       # after the page at urls[0] has been read, remove it from the stack
       urls.pop(0)

3. Try/except: if an error occurs, the code after except runs. If the website we are crawling contains a broken hyperlink, we can treat it as an error and simply keep going to the next task.
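
A minimal sketch of this idea, with a hypothetical dead address as the first URL:

import mechanize

br = mechanize.Browser()
urls = ["http://this-address-does-not-exist.invalid/", "http://www.example.com/"]

while len(urls) > 0:
    try:
        br.open(urls[0])            # the dead link raises an exception here
        print "opened:", urls[0]
    except:
        print "skipped:", urls[0]   # treat the broken link as an error ...
    urls.pop(0)                     # ... and keep going to the next task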

4.

for link in br.links(): ## br.links() iterates over every link on the current page; 'link' is the loop variable
     link.base_url  ## the base URL: the page the robot is exploring right now
     link.url ## the (possibly relative) target of each link
     newurl = urlparse.urljoin(link.base_url, link.url)
     b1 = urlparse.urlparse(newurl).hostname
     b2 = urlparse.urlparse(newurl).path ## a cleaner way to build the sub-URL that avoids JavaScript/anchor traps
     newurl = "http://" + b1 + b2

5.

if newurl not in visited and urlparse.urlparse(url).hostname in newurl: ## 1) not in the history, 2) same tree (same host)
     visited.append(newurl)
     urls.append(newurl)
     print newurl
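
A small sketch of how this condition filters candidate links (all the URLs below are hypothetical):

import urlparse

url = "http://example.com"
visited = ["http://example.com/about"]
candidates = ["http://example.com/about",      # already visited -> skipped
              "http://example.com/news",       # new and on the same host -> kept
              "http://twitter.com/example"]    # different host -> skipped

for newurl in candidates:
    if newurl not in visited and urlparse.urlparse(url).hostname in newurl:
        visited.append(newurl)
        print newurl                           # only http://example.com/news is printed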



Part 2. Use BeautifulSoup

# Another way to pull the hostname and path out of each href, even when it is not in the http://host/path form
import urllib
from bs4 import BeautifulSoup
import urlparse

url = "http://nytimes.com"
htmltext = urllib.urlopen(url)
soup = BeautifulSoup(htmltext)

for tag in soup.find_all('a', href=True):               # (1)
        raw = tag['href']
        b1 = urlparse.urlparse(tag['href']).hostname    # (2)
        b2 = urlparse.urlparse(tag['href']).path        # *
        print str(b1) + str(b2)

#* Spider trap:
        # We don't use the method from the last tutorial because the web server
        # sometimes hands each user a session ID in the URL, and the spider would
        # mistake those URLs for entirely different websites.
        # That is why it is better to use urlparse.urlparse(...).path.
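
For example (with a made-up session ID), two hrefs that point to the same article can look different, but hostname + path collapses them into one:

import urlparse

u1 = "http://example.com/article?sessionid=111"
u2 = "http://example.com/article?sessionid=222"

for u in (u1, u2):
    p = urlparse.urlparse(u)
    print p.hostname + p.path      # both print example.com/article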

1.

 
for tag in soup.find_all('a', href=True):

It finds every tag of the form <a href="...">.
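
A tiny hand-written page makes the filter visible: only tags that actually carry an href are returned.

from bs4 import BeautifulSoup

html = '<a href="/news">news</a> <a name="anchor">no href</a> <a href="#top">top</a>'
soup = BeautifulSoup(html)

for tag in soup.find_all('a', href=True):
    print tag['href']      # prints /news and #top, but skips the anchor-only tag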

2.

 
urlparse.urlparse(tag['href'])

This splits the href into its components (hostname, path, and so on). The hostname can be None for a relative link, which is why the code wraps each piece in str() before printing.
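
A short sketch (the nytimes path below is only illustrative) of what the pieces look like, including the None hostname of a relative link:

import urlparse

parts = urlparse.urlparse("http://www.nytimes.com/pages/world/index.html")
print parts.hostname                      # www.nytimes.com
print parts.path                          # /pages/world/index.html

parts = urlparse.urlparse("/pages/world/index.html")
print parts.hostname                      # None  (a relative link has no hostname)
print str(parts.hostname) + parts.path    # None/pages/world/index.html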



Part 3. Find the smartest way to build sub-URLs

import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize
# Simulate the browser

url  = "http://sparkbrowser.com"
count = 0

# The smartest method ==> no hash anchors such as sparkbrowser.com#top

br = mechanize.Browser()
# Just copy how the browser acts
br.open(url)
for link in br.links():
	newurl = urlparse.urljoin(link.base_url, link.url)
	b1 = urlparse.urlparse(newurl).path
	b2 = urlparse.urlparse(newurl).hostname
	print "http://" + b2 + b1
	count += 1
print count
		# In this way, we don't include JavaScript anchors such as #top
		# This is the best way



# Method_3 (does not work well): use urlparse.urlparse(tag['href']).path or .hostname
'''
htmlfile = urllib.urlopen(url)
soup = BeautifulSoup(htmlfile)

for tag in soup.find_all('a',href=True):
	b1 = urlparse.urlparse(tag['href']).hostname
	b2 = urlparse.urlparse(tag['href']).path
	print "http://"+str(b1)+str(b2)
'''

# Method_1. Use BeautifulSoup 
'''
htmlfile = urllib.urlopen(url)
soup = BeautifulSoup(htmlfile)

for tag in soup.find_all('a',href=True):
	print tag['href']
'''


# Method_2. Use mechanize 
'''
br = mechanize.Browser()
br.open(url)

for link in br.links():
	newurl = urlparse.urljoin(link.base_url,link.url)
	print newurl
	count += 1
print count
'''

The best way is the one already introduced in Part 1. The second-best way is to use BeautifulSoup, but it cannot filter out some JavaScript traps.
