Problem in python crawler


#1

Hello!
I was making a Python crawler as per your video. I must admit that the tutorial has been extremely helpful in understanding the structure of a crawler and how things work in it. However, I am unable to parse the data in one case. Here is the HTML/CSS structure of an item on that page:

<span class="standard-price">
  ::before
  "3999"
</span>

I'm trying to parse it with the following XPath query:

tree.xpath('//span[@class="standard-price"]/text()')[0]

but it gives me an error: "Index out of range".

I believe this "::before" is causing the issue. How do we tackle something like this? Please help me with this. I'm really stuck!

Thanks & regards.


#2

I’m not exactly clear what you mean by ::before - is it that the price is inserted by CSS with a rule like the following?

span::before {
  content: "3999";
}

In that case, the XPath won't be able to pull it out, because there is no CSS engine built into lxml.

Another thing to check: does the <span> element have any other classes? If so, the XPath selector won't match it, because XPath knows nothing about the multiple-class syntax and simply compares against the whole attribute string.

See http://stackoverflow.com/questions/1604471/how-can-i-find-an-element-by-css-class-with-xpath for a solution.


#3

I've uploaded a screenshot of the page in question. Here the span has only one class, i.e. 'standard-price', and "::before" is written inside the span tag. The XPath query returns an empty list here.
I hope that makes my problem clear.
Thanks


#4

Ah, I see. The class attribute is actually "standard-price " with an extra space at the end. CSS doesn't care, but XPath does. You could try //span[@class="standard-price "]/text() (with the space).

In general, a more robust way to match class names is with ...[contains(concat(' ', @class, ' '), ' the-class-name ')]
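
To illustrate why the trailing space matters, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption, so that it runs without extra packages). ElementTree's XPath subset has no contains() or concat(), so the token test is done in Python instead. The markup is a made-up fragment modelled on the page, and has_class is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: note the trailing space inside the class attribute.
FRAGMENT = '<div><span class="standard-price ">3999</span></div>'
root = ET.fromstring(FRAGMENT)

# Exact attribute comparison: the value without the space does not match.
exact_no_space = root.findall(".//span[@class='standard-price']")
exact_with_space = root.findall(".//span[@class='standard-price ']")


def has_class(element, name):
    """Hypothetical helper: treat @class as a whitespace-separated token list."""
    return name in element.get("class", "").split()


# Token-based match: the same idea as
# contains(concat(' ', @class, ' '), ' standard-price ')
by_token = [el for el in root.iter("span") if has_class(el, "standard-price")]
prices = [el.text for el in by_token]

print(len(exact_no_space), len(exact_with_space), prices)
```

With lxml, the contains(concat(...)) expression quoted above performs the same token test inside the query itself.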


#5

PS. I also found this course extremely helpful!


#6

Oh god! How could I not notice it! Thanks for the help, Adam! I spent hours trying to figure out the issue and eventually learnt to do this with BeautifulSoup. However, I was completely restless and wanted to do it with XPath as well. This solution brings me peace! :slight_smile:
Indeed, this course was extremely helpful. Thanks again for the help. :slightly_smiling:
You guys are awesome.


#7

Sorry to bother you once again. How do I parse the developer name, i.e. "By King", using BeautifulSoup?
I tried writing various combinations of:

developer_name = soup.find_all("div", {"class":"left"}, "h2")

but it doesn't work. Any suggestions?


#8

What about soup.select("div.left h2")?

Beautiful Soup: CSS Selectors
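
For comparison, a minimal sketch of the selector approach. The HTML is a made-up fragment (the real page layout is an assumption), and the built-in "html.parser" is used so that lxml is not required. In the find_all() call quoted above, the third positional argument is `recursive`, not a nested tag name, so "h2" never acts as a selector there.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for the app page.
HTML = """
<div class="left">
  <h1>Candy Crush Saga</h1>
  <h2>By King</h2>
</div>
"""
soup = BeautifulSoup(HTML, "html.parser")

# CSS descendant selector: every <h2> somewhere inside <div class="left">.
developer = soup.select("div.left h2")[0].get_text(strip=True)

# Equivalent two-step lookup: find the div first, then the h2 inside it.
left_div = soup.find("div", class_="left")
developer_via_find = left_div.find("h2").get_text(strip=True)

print(developer)
```

Note that select() always returns a list, so indexing with [0] raises an IndexError when nothing matches.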


#9

Awesome. Works like a charm. :slight_smile: Didn’t know about parsing the siblings. This was of much help.
Thanks for your prompt replies :slight_smile:


#10

I've finished writing the program. It works just fine for depth = 0 but throws an error at higher depth values, i.e. depth = 1, 2, 3… Apparently there is some issue with how requests.get() is being called, but I am not able to figure it out.
I am attaching a screenshot of the console showing the error, along with the program I've written. Could you please help me with this?

import requests
from bs4 import BeautifulSoup


class Appcrawler(object):
    def __init__(self, starting_url, depth):
        self.starting_url = starting_url
        self.depth = depth
        self.current_depth = 0
        self.apps = []
        self.depth_links = []

    def get_app_from_link(self, link):
        r = requests.get(link)
        soup_pre = r.content
        soup = BeautifulSoup(soup_pre, "lxml")

        name = soup.select("div .left h1")[0].text
        developer = soup.select('div .left h2')[0].text
        price = soup.select('div .price')[0].text
        link_getter = soup.select('div .center-stack .name')

        for item in link_getter:
            other_links = item.get("href")

        app = App(name, developer, price, other_links)
        return app

    def crawler(self):
        app = self.get_app_from_link(self.starting_url)
        self.apps.append(app)
        self.depth_links.append(app.other_links)

        while self.current_depth < self.depth:
            current_links = []

            for item in self.depth_links[self.current_depth]:
                item_app = self.get_app_from_link(item)
                self.apps.append(item_app)
                self.current_links.extend(item_app.links)
            self.current_depth += 1
            self.depth_links.append(current_links)


class App(object):
    def __init__(self, name, developer, price, other_links):
        self.name = name
        self.developer = developer
        self.price = price
        self.other_links = other_links

    def __str__(self):
        return ("Name: " + self.name +
                "\r\nDeveloper: " + self.developer +
                "\r\nPrice: " + self.price)


crawl = Appcrawler('https://itunes.apple.com/in/app/candy-crush-saga/id553834731?mt=8', 1)
crawl.crawler()

for item in crawl.apps:
    print item

#11

Well, the exception says "Invalid URL 'h': perhaps you meant http://h?"

It looks to me like whatever you’re doing to extract the related item links in get_app_from_link() is not working properly and is not returning the right value. Why don’t you put some print statements in there and see what’s happening?
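
The 'h' in that message is consistent with a string being iterated one character at a time. Here is a minimal sketch of that effect, assuming the hrefs below (made-up placeholders) are what the selector returns. The loop in get_app_from_link() overwrites other_links on every pass, so it may end up as a single string rather than a list of links.

```python
# Made-up hrefs standing in for the values returned by item.get("href").
hrefs = [
    "https://example.com/app/one",
    "https://example.com/app/two",
]

# Mirrors the posted loop: each pass overwrites the previous value,
# so other_links ends up as one string (the last href).
for href in hrefs:
    other_links = href

# Iterating a string yields characters, so the first "link" is just "h".
first_item = next(iter(other_links))

# Possible fix: collect every href into a list instead.
collected_links = [href for href in hrefs if href]

print(repr(first_item), collected_links)
```

Separately, the crawler() loop refers to self.current_links and item_app.links, neither of which is defined in the posted code. The local current_links list and the other_links attribute of App look like the intended names.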


#12

I did it! (y) :smiley: