Problem in Python crawler

hbamoria · February 20, 2016, 6:34pm

Hello!
I was making a Python crawler as per your video. I must admit that the tutorial has been extremely helpful in understanding the structure and how things work in a crawler. However, I am unable to parse data in one case. Here is the HTML/CSS structure of an item on that page:

<span class="standard-price">
  ::before
  "3999"
</span>

I’m trying to parse it with the following XPath query:

tree.xpath('//span[@class = "standard-price"]/text()')[0]

but it gives me an error: "Index out of range".

I believe this "::before" is creating the issue. How do we tackle such a thing? Please help me with this. I’m really stuck!

Thanks & regards.

adamarthurryan · February 22, 2016, 9:55pm

I’m not exactly clear what you mean by ::before - is it that the price is inserted by CSS with a rule like the following?

span::before {
  content: "3999";
}

In that case, the XPath won’t be able to pull it out, because there is no CSS engine built in to lxml.
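A minimal sketch of that case, assuming a made-up page where the price exists only in a stylesheet rule (the markup below is hypothetical, not the real page). lxml sees an empty span, so text() returns an empty list, and indexing that with [0] is what raises the "index out of range" error:

```python
from lxml import html

# Hypothetical page: the price lives only in the CSS rule, not in the DOM.
PAGE = """
<html>
  <head><style>span.standard-price::before { content: "3999"; }</style></head>
  <body><span class="standard-price"></span></body>
</html>
"""

tree = html.fromstring(PAGE)
prices = tree.xpath('//span[@class="standard-price"]/text()')

# The span itself matches, but it has no text node of its own.
print(prices)

# Guard against the empty result instead of indexing blindly with [0].
price = prices[0] if prices else None
print(price)
```

Checking the list before indexing at least turns the crash into a value you can test for.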

Another thing to check: does the <span> element have any other classes? If so, the XPath selector won’t match it. XPath doesn’t know anything about the multiple-class syntax and just compares against the whole attribute string.

See http://stackoverflow.com/questions/1604471/how-can-i-find-an-element-by-css-class-with-xpath for a solution.

hbamoria · February 23, 2016, 5:24am

I’ve uploaded a screenshot of the page in question. Here the span has only one class, i.e. 'standard-price', and inside the span tag it has "::before" written. The XPath query returns an empty list here.
Hope that makes my problem clear.
Thanks

adamarthurryan · February 23, 2016, 5:55am

Ah, I see. The class attribute is actually "standard-price " with an extra space at the end. CSS doesn’t care, but XPath does. You could try //span[@class="standard-price "]/text() (with the space).

In general, a more robust way to match class names is with ...[contains(concat(' ', @class, ' '), ' the-class-name ')]
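Here is a small sketch of the difference, using hypothetical markup with one span that has the trailing space and one that has a second class (the "sale" class and both prices are invented for illustration):

```python
from lxml import html

# Hypothetical markup: a trailing space in one class attribute,
# and an extra class in the other.
PAGE = """
<html><body>
  <span class="standard-price ">3999</span>
  <span class="standard-price sale">2999</span>
</body></html>
"""

tree = html.fromstring(PAGE)

# Exact comparison against the whole attribute string matches neither span.
exact = tree.xpath('//span[@class="standard-price"]/text()')

# Pad the attribute with spaces, then look for the space-padded class name.
robust = tree.xpath(
    '//span[contains(concat(" ", @class, " "), " standard-price ")]/text()'
)

print(exact)
print(robust)
```

The padding matters: a plain contains(@class, "standard-price") would also match an unrelated class such as "standard-price-old".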

adamarthurryan · February 23, 2016, 5:56am

PS. I also found this course extremely helpful!

hbamoria · February 23, 2016, 6:29am

Oh god! How could I not notice it! Thanks for the help, Adam! I spent hours trying to figure out the issue and eventually learnt to do this with BeautifulSoup. However, I was completely restless and wanted to do it with XPath as well. This solution brings peace to me!
Indeed this course was extremely helpful. Thanks again, for the help.
You guys are awesome.

hbamoria · February 23, 2016, 7:02pm

I’d like to disturb you once again. How do I parse the developer name, i.e. "By King", using BeautifulSoup?
I tried writing various combinations of:

developer_name = soup.find_all("div", {"class":"left"}, "h2")

but it doesn’t work. Any suggestions?

adamarthurryan · February 23, 2016, 7:57pm

What about soup.select("div.left h2")?

Beautiful Soup: CSS Selectors
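A rough sketch of how that behaves, using hypothetical markup modeled on what the thread describes (the real page will differ). As far as I can tell, the earlier find_all("div", {"class":"left"}, "h2") attempt fails because the third positional argument of find_all is `recursive`, so "h2" is never treated as a nested tag name:

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the structure described in the thread.
PAGE = """
<div class="left">
  <h1>Candy Crush Saga</h1>
  <h2>By King</h2>
</div>
"""

soup = BeautifulSoup(PAGE, "html.parser")

# "div.left h2" means: any <h2> that is a descendant of <div class="left">.
matches = soup.select("div.left h2")
developer = matches[0].get_text(strip=True) if matches else None
print(developer)

# The same lookup with find(): locate the div first, then search inside it.
left = soup.find("div", class_="left")
developer_via_find = left.find("h2").get_text(strip=True)
print(developer_via_find)
```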

hbamoria · February 23, 2016, 8:11pm

Awesome. Works like a charm. Didn’t know you could select nested elements like that. This was of much help.
Thanks for your prompt replies.

hbamoria · February 24, 2016, 3:33pm

I’ve finished writing the program. It works just fine for depth = 0 but throws an error at higher depth values, i.e. depth = 1, 2, 3… Apparently, there is some issue with how I call the requests.get() method, but I am not able to figure it out.
I am attaching a screenshot of the console showing the error, along with the program I’ve written. Could you please help me with this?

import requests
from bs4 import BeautifulSoup

class Appcrawler(object):
    def __init__(self, starting_url, depth):
        self.starting_url = starting_url
        self.depth = depth
        self.current_depth = 0
        self.apps = []
        self.depth_links = []

    def get_app_from_link(self, link):
        r = requests.get(link)
        soup_pre = r.content
        soup = BeautifulSoup(soup_pre, "lxml")

        name = soup.select("div .left h1")[0].text
        developer = soup.select('div .left h2')[0].text
        price = soup.select('div .price')[0].text
        link_getter = soup.select('div .center-stack .name')

        for item in link_getter:
            other_links = item.get("href")

        app = App(name, developer, price, other_links)
        return app

    def crawler(self):
        app = self.get_app_from_link(self.starting_url)
        self.apps.append(app)
        self.depth_links.append(app.other_links)

        while self.current_depth < self.depth:
            current_links = []

            for item in self.depth_links[self.current_depth]:
                item_app = self.get_app_from_link(item)
                self.apps.append(item_app)
                self.current_links.extend(item_app.links)
            self.current_depth += 1
            self.depth_links.append(current_links)


class App(object):
    def __init__(self, name, developer, price, other_links):
        self.name = name
        self.developer = developer
        self.price = price
        self.other_links = other_links

    def __str__(self):
        return ("Name: " + self.name +
                "\r\nDeveloper: " + self.developer +
                "\r\nPrice: " + self.price)

crawl = Appcrawler('https://itunes.apple.com/in/app/candy-crush-saga/id553834731?mt=8', 1)
crawl.crawler()

for item in crawl.apps:
    print item

adamarthurryan · February 24, 2016, 4:29pm

Well, the exception says "Invalid URL 'h' … perhaps you meant http://h?"

It looks to me like whatever you’re doing to extract the related item links in get_app_from_link() is not working properly and is not returning the right value. Why don’t you put some print statements in there and see what’s happening?
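One likely reading of the posted code: the loop in get_app_from_link() reassigns other_links on every pass, so it ends up as a single URL string rather than a list, and iterating over a string in crawler() yields one character at a time, starting with 'h'. Below is a sketch of collecting the links into a list instead. It is written for Python 3, extract_links is a hypothetical helper name, and the sample HTML and example.com URLs are invented for illustration:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(page_html, base_url):
    """Return every related-app URL on the page as a list of strings."""
    soup = BeautifulSoup(page_html, "html.parser")
    links = []
    for anchor in soup.select("div .center-stack .name"):
        href = anchor.get("href")
        if href:  # skip matched elements that carry no href
            links.append(urljoin(base_url, href))
    return links


# Hypothetical snippet standing in for a downloaded page.
SAMPLE = """
<div><div class="center-stack">
  <a class="name" href="https://example.com/app/one">One</a>
  <a class="name" href="/app/two">Two</a>
</div></div>
"""

links = extract_links(SAMPLE, "https://example.com/")
print(links)

# Iterating a list yields whole URLs; iterating a string yields characters.
first_from_string = [item for item in "https://example.com/app/one"][0]
print(first_from_string)
```

With a list in place, the loop in crawler() would also need to extend the local current_links with item_app.other_links, since neither self.current_links nor item_app.links is defined anywhere in the posted code.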

hbamoria · February 24, 2016, 8:52pm

I did it! (y)