BeautifulSoup for web scraping

BeautifulSoup is a Python library for extracting data from HTML or XML files.

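To install this package with Anaconda, run the line below. On PyPI the library is published as beautifulsoup4, so with pip the command is pip install beautifulsoup4.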
conda install -c anaconda beautiful-soup

Getting the content

Before using BeautifulSoup we need to collect the content of the web page. To do this we send a GET request to the website with the requests library and read the response.
import requests
link = "https://www.plus2net.com/html_tutorial/html-canvas.php"
content = requests.get(link)
print(content.text)
The code above prints the full page (HTML code).
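A request can fail because of a wrong address, a timeout, or an error status returned by the server. The sketch below wraps the GET request in a small helper that returns None in those cases. The function name fetch_html and the 10 second timeout are assumptions made for this illustration, not part of the original example.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx / 5xx responses
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors
        print(f"Request failed: {exc}")
        return None
    return response.text

# Usage (needs network access):
# html = fetch_html("https://www.plus2net.com/html_tutorial/html-canvas.php")
```

Returning None keeps the calling code simple: it can check the result once before passing it to BeautifulSoup.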

Creating a BeautifulSoup object

We will reuse the code above and create a BeautifulSoup object from the response text.
from bs4 import BeautifulSoup
soup = BeautifulSoup(content.text, 'html.parser')
Note that soup is our object, and we will use it to traverse the HTML code and reach the different nodes. Here we have used html.parser, which is included with Python. Other parsers such as 'lxml' and 'html5lib' can be used for different requirements, but they have to be installed separately.
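If a script should prefer one parser but still run where that parser is not installed, one possible approach is to fall back to html.parser. This is only a sketch: the helper name make_soup and the choice of 'lxml' as the preferred parser are assumptions for the illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html, preferred="lxml"):
    """Build a soup with the preferred parser, else fall back to html.parser."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed on this machine
        return BeautifulSoup(html, "html.parser")

soup = make_soup("<p>Hello <b>world</b></p>")
print(soup.b.string)  # text inside the b tag
```

Different parsers can build slightly different trees from the same broken HTML, so it is safer to use the same parser wherever the script runs.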

Let us try to extract some common HTML tags from the page.
import requests
link = "https://www.plus2net.com/html_tutorial/html-canvas.php"
content = requests.get(link)

from bs4 import BeautifulSoup
soup = BeautifulSoup(content.text, 'html.parser')
print(soup.title)              # the full title tag
print(soup.h1)                 # the first H1 tag
print(soup.h1.string)          # text inside the H1 tag
#print(soup.title.parent)      # the parent tag with all its content
print(soup.title.parent.name)  # name of the parent tag, i.e. head
Output is here
<title>Canvas html <canvas> tag to draw lines or graphics or animation  in web page</canvas></title>
<h1 itemprop="headline"><canvas> HTML Canvas  tag</h1>
<canvas> HTML Canvas  tag
head
We can also use a string as input to create a BeautifulSoup object.
content = """<html>
<head>
<title>Your title of the page here</title>
<META NAME='DESCRIPTION' CONTENT='my description '>
<META NAME='KEYWORDS' CONTENT='kw1,kw2,kw3'>
</head>
<body>

Hello <br>
Welcome to plus2net.com

</body>
</html>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')
print(soup.title)              # the full title tag
print(soup.h1)                 # None, there is no H1 tag in this string
print(soup.title.name)         # name of the tag, i.e. title
print(soup.title.string)       # text inside the title tag
print(soup.title.parent)       # the parent tag (head) with all its content
print(soup.title.parent.name)  # name of the parent tag, i.e. head
Output is here
<title>Your title of the page here</title>
None
title
Your title of the page here
<head>
<title>Your title of the page here</title>
<meta content="my description " name="DESCRIPTION"/>
<meta content="kw1,kw2,kw3" name="KEYWORDS"/>
</head>
head
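The head of this string also holds two meta tags. As a sketch of how their attributes could be read, the code below finds each meta tag by its name attribute and then reads the content attribute like a dictionary key. The variable names are chosen only for this illustration.

```python
from bs4 import BeautifulSoup

html = """<html>
<head>
<title>Your title of the page here</title>
<META NAME='DESCRIPTION' CONTENT='my description '>
<META NAME='KEYWORDS' CONTENT='kw1,kw2,kw3'>
</head>
<body>Hello</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# html.parser lowercases tag and attribute names, but keeps attribute values as written
description_tag = soup.find("meta", attrs={"name": "DESCRIPTION"})
keywords_tag = soup.find("meta", attrs={"name": "KEYWORDS"})

description = description_tag["content"]        # attribute access by key
keywords = keywords_tag["content"].split(",")   # comma-separated list to Python list

print(description)
print(keywords)
```

The attrs dictionary is used here because name is already the first parameter of find(), where it means the tag name.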
Printing other tags
h2 = soup.h2          # the first H2 tag, or None if there is none
print(h2)
if h2 is not None:
    print(h2.string)  # text associated with the h2 tag
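A tag that is not present in the document comes back as None, and reading .string from None raises an AttributeError. Also, .string is None when a tag has more than one child, as the body tag has here. One way to handle both cases is sketched below with a hypothetical helper, tag_text, that uses get_text() instead.

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>Your title of the page here</title></head>"
        "<body>Hello <br>Welcome to plus2net.com</body></html>")
soup = BeautifulSoup(html, "html.parser")

def tag_text(soup, name):
    """Return the text of the first tag with this name, or None if it is missing."""
    tag = soup.find(name)
    if tag is None:
        return None
    # Join all text pieces inside the tag, stripped, with a single space
    return tag.get_text(" ", strip=True)

print(tag_text(soup, "title"))
print(tag_text(soup, "h2"))    # no h2 tag in this string
print(soup.body.string)        # body has several children
print(tag_text(soup, "body"))
```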
Before scraping, read the website’s terms on the use of its data, and avoid sending frequent requests to the website.
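As one possible way to follow this advice in code, the sketch below checks URLs against robots.txt rules with Python's standard urllib.robotparser module and pauses between URLs. The rules, the example.com addresses, and the helper name polite_urls are all hypothetical and are supplied inline so that the example runs without network access.

```python
import time
from urllib import robotparser

# Hypothetical robots.txt rules, given inline for this illustration
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

def polite_urls(urls, delay_seconds=1.0, user_agent="*"):
    """Yield only the URLs the rules allow, pausing before moving to the next one."""
    for url in urls:
        if parser.can_fetch(user_agent, url):
            yield url
            time.sleep(delay_seconds)  # pause before the next URL is handled

urls = [
    "https://example.com/index.html",
    "https://example.com/private/data.html",
]
allowed = list(polite_urls(urls, delay_seconds=0))
print(allowed)
```

In a real script the rules would be loaded from the site itself with the parser's set_url() and read() methods. A robots.txt file does not replace the website's terms of use, which still need to be read separately.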

