Hello!
I was making a python crawler as per your video. I must admit that tutorial has been extremely helpful in understanding the structure and how things work in a crawler. However, I am unable to parse data in one case. Here is the HTML/CSS structure of an item on that page :
span class = “standard-price”
::before
"3999"
span
I’m trying to parse it through following xpath query :
tree.xpath(’//span[@class = “standard-price”]/text()’)[0]
but it gives me an error : “Index out of range”.
I believe this '::before" is creating the issue. How do we tackle such thing Please help me with this. I’m really stuck!
Thanks & regards.
I’m not exactly clear what you mean by ::before
- is it that the price is inserted by CSS with a rule like the following?
span::before {
content: "3999";
}
In that case, the XPath won’t be able to pull it out, because there is no CSS engine built in to lxml.
Another thing to check, does the <span>
element have any other classes? If so, the XPath selector won’t match them - it doesn’t know anything about the multiple class syntax and just matches on the whole attribute string.
See http://stackoverflow.com/questions/1604471/how-can-i-find-an-element-by-css-class-with-xpath for a solution.
I’ve uploaded the screenshot of the desired page. Here span has only one class i.e ‘standard-price’ and inside the ‘span’ tag it has “::before” written. Writing an xpath query returns an empty list over here.
Hope it makes my problem clear.
Thanks
Ah, I see. the class attribute is actually "standard-price "
with an extra space at the end. CSS doesn’t care, but XPath does. You could try //span[@class="standard-price "]/text()
(with the space).
In general, a more robust way to match class names is with ...[contains(concat(' ', @class, ' '), ' the-class-name ')]
3 Likes
PS. I also found this course extremely helpful!
Oh god! How couldn’t I notice it! Thanks for the help Adam! I spent hours on figuring out the issue and eventually learnt to do this by BeautifulSoup. However, I was completely restless and wanted to do it with xpath as well. This solution brings peace to me!
Indeed this course was extremely helpful. Thanks again, for the help.
You guys are awesome.
2 Likes
I’d like to disturb you once again. How do I parse the developer name i.e. “By King” using BeautifulSoup ?
I tried writing various combinations of:
developer_name = soup.find_all("div", {"class":"left"}, "h2")
but it doesn’t work. Any suggestions ?
What about soup.select("div.left h2")
?
Beautiful Soup: CSS Selectors
Awesome. Works like a charm. Didn’t know about parsing the siblings. This was of much help.
Thanks for your prompt replies
I’ve completed writing the program. It works just fine for depth = 0 but throws error at higher depth values i.e. depth = 1,2,3… Apparently, there is some issue with the syntax of requests.get() method. I am not able to figure it out.
I am attaching screenshot of the console showing the error and program I’ve written. Could you please help me with this ?
import requests
from bs4 import BeautifulSoup
class Appcrawler(object):
def __init__(self, starting_url, depth):
self.starting_url = starting_url
self.depth = depth
self.current_depth = 0
self.apps = []
self.depth_links = []
def get_app_from_link(self, link):
r = requests.get(link)
soup_pre = r.content
soup = BeautifulSoup(soup_pre, "lxml")
name = soup.select("div .left h1")[0].text
developer = soup.select('div .left h2')[0].text
price = soup.select('div .price')[0].text
link_getter = soup.select('div .center-stack .name')
for item in link_getter:
other_links = item.get("href")
app = App(name, developer, price, other_links)
return app
def crawler(self):
app = self.get_app_from_link(self.starting_url)
self.apps.append(app)
self.depth_links.append(app.other_links)
while self.current_depth < self.depth:
current_links = []
for item in self.depth_links[self.current_depth]:
item_app = self.get_app_from_link(item)
self.apps.append(item_app)
self.current_links.extend(item_app.links)
self.current_depth += 1
self.depth_links.append(current_links)
class App(object):
def __init__(self, name, developer, price, other_links):
self.name = name
self.developer = developer
self.price = price
self.other_links = other_links
def __str__(self):
return ("Name: " + self.name +
"\r\nDeveloper: " + self.developer +
"\r\nPrice: " + self.price )
crawl = Appcrawler('https://itunes.apple.com/in/app/candy-crush-saga/id553834731?mt=8', 1)
crawl.crawler()
for item in crawl.apps:
print item
Well, the exception says "Invalid URL 'h'
… perhaps you meant http://h?
"
It looks to me like whatever you’re doing to extract the related item links in get_app_from_link()
is not working properly and is not returning the right value. Why don’t you put some print
statements in there and see what’s happening?