What is Web Scraping?
Web scraping is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in table (spreadsheet) format.
Why is it useful?
Data displayed on a website can normally only be viewed in a web browser (Google Chrome, Firefox, etc.). Most sites offer no option to save that data for personal use, so the only alternative is to copy and paste it manually. For large amounts of data this is tedious work that can take many hours or sometimes days.
This is where web scraping helps. It automates the extraction, so instead of copying data from websites by hand, a program does the same task in a fraction of the time.
Methods of Web Scraping
1. Using software:
Web scraping software comes in two types: programs that are installed on your computer, and cloud-based services that run in the browser. WebHarvy, OutWit Hub, and Visual Web Ripper are examples of installable web scraping software, whereas import.io and Mozenda are examples of cloud data extraction platforms.
2. Writing code:
You can hire a developer to build custom data extraction software for your specific requirements. The developer can in turn use web scraping APIs, which make the software easier to develop. For example, apify.com provides APIs for scraping data from websites.
First, we import the requests library. Requests lets you send HTTP/1.1 requests from Python without manual work such as adding query strings to URLs by hand.
Next, we store the URL of the page we want to scrape in a variable named website_url.
website_url = "https://www.amazon.in/Test-Exclusive-746/product-reviews/B07DJHXTLJ/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
requests.get(website_url) sends a GET request to the specified URL and returns a response object. Its .text attribute holds the HTML of the page as a string, and its .content attribute holds the same body as bytes.
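In practice it helps to set a timeout and to check the status code before using the response. The following is a minimal sketch under stated assumptions rather than the only way to do it: the function name fetch_html, the User-Agent string, and the 10-second timeout are all made up for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Send a GET request to `url` and return the response body as text."""
    # Hypothetical User-Agent; some sites may block the default one sent by requests.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; scraping-tutorial)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, the page used in this tutorial would be fetched with html = fetch_html(website_url).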
Beautiful Soup 4 (bs4): Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
The prettify() method of a BeautifulSoup object returns the document as an indented string, which lets us see how the tags are nested.
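import requests
from bs4 import BeautifulSoup
page = requests.get(website_url)  # fetch the page; page.content holds the raw HTML
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())  # print the indented HTML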
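To see what prettify() produces without fetching a live page, the sketch below parses a small hand-written HTML string. The markup and the name in it are invented for illustration and are much simpler than a real review page.

```python
from bs4 import BeautifulSoup

# Invented sample markup, standing in for page.content.
sample_html = (
    "<div><span class='a-profile-name'>Asha</span>"
    "<span class='a-icon-alt'>5.0 out of 5 stars</span></div>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a string in which each tag sits on its own line,
# indented according to how deeply it is nested.
pretty = soup.prettify()
print(pretty)
```

Calling prettify with the parentheses matters: without them, print shows the method object instead of the formatted HTML.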
If you inspect the HTML carefully, the customer names we want to extract are inside <span> elements with the class 'a-profile-name'.
To extract all the <span> elements with that class, we use find_all(). The slice [2:] skips the first two matches.
names = soup.find_all('span', class_='a-profile-name')[2:]
names
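The ratings are in <span> elements with the class 'a-icon-alt'. Here the slice [3:] skips the first three matches.
stars = soup.find_all('span', class_='a-icon-alt')[3:]
stars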
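The same find_all() calls can be tried on a small invented HTML fragment. In the sketch below the names, the ratings, and the "out of 5 stars" wording are made up, and no slicing is applied because the sample contains no extra matches to skip.

```python
from bs4 import BeautifulSoup

# Invented sample markup with two reviews.
sample_html = """
<div class="review">
  <span class="a-profile-name">Asha</span>
  <span class="a-icon-alt">5.0 out of 5 stars</span>
</div>
<div class="review">
  <span class="a-profile-name">Ravi</span>
  <span class="a-icon-alt">3.0 out of 5 stars</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class,
# because `class` is a reserved word in Python.
names = soup.find_all("span", class_="a-profile-name")
stars = soup.find_all("span", class_="a-icon-alt")

# find_all() returns a list-like ResultSet of Tag objects;
# get_text() gives the text inside each tag.
print([tag.get_text() for tag in names])
print([tag.get_text() for tag in stars])
```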
From these elements, we have to extract the name of each customer and their rating of the product, which are held in names and stars.
To do that, we create two lists, cust_name and cust_rating, then take the text of each element and append it to the corresponding list.
cust_name = []
for i in range(0, len(names)):
    cust_name.append(names[i].get_text())
cust_name
cust_rating = []
for i in range(0, len(stars)):
    cust_rating.append(stars[i].get_text())
cust_rating
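Once both lists are filled, they can be paired up and written out as a table, which is the spreadsheet-style result described at the start. The sketch below uses only the standard library. The sample values, the helper name parse_rating, the file name reviews.csv, and the assumption that each rating string begins with a number (as in "5.0 out of 5 stars") are all illustrative.

```python
import csv
import io

# Sample values standing in for the scraped lists.
cust_name = ["Asha", "Ravi"]
cust_rating = ["5.0 out of 5 stars", "3.0 out of 5 stars"]


def parse_rating(text):
    """Return the leading number of a rating string as a float (hypothetical helper)."""
    return float(text.split()[0])


# zip() pairs each name with its rating and stops at the end of the shorter list.
rows = [(name, parse_rating(rating)) for name, rating in zip(cust_name, cust_rating)]

# Write to an in-memory buffer here; a real file opened with
# open("reviews.csv", "w", newline="") can be passed to csv.writer the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["customer", "rating"])
writer.writerows(rows)
print(buffer.getvalue())
```

Converting the rating to a number at this stage makes it easier to sort or average the reviews later.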