How to extract data from HTML using Python

In this tutorial we look at several ways of extracting data from HTML elements using Python and the Beautiful Soup library. We also list all the P tags that sit between H2 tags.

Without further delay, let's jump straight into the details, step by step.

Example HTML

I am going to save this to my local system as index.html

<html>
<body>
<head>
<title>Test Website</title>
</head>
<h2>Header 1</h2>
<p>List 1</p>
<p>List 2</p>
<h2>Header 2</h2>
<p>List 3</p>
<p>List 4</p>
<div id="main">
	This is the main content
</div>
<h2>Header 3</h2>
<p>List 5</p>
<p>List 6</p>
<div class="tool_tip">
This is a tip
</div>
<div class="tool_tip">
This is a tip 1
</div>
<div id="footer">
	This is the footer content
</div>
</body>
</html>

Install BeautifulSoup

You can install this library using pip/pip3. The latest version at the time of writing this tutorial is beautifulsoup4 4.11.1.

pip3 install beautifulsoup4

If you already have beautifulsoup4 installed but would like to update it to the latest version, run the first command below and then confirm the installed version with the second

# pip3 install beautifulsoup4 --upgrade
# pip3 list|grep beautifulsoup4
beautifulsoup4         4.11.1
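
The installed version can also be checked from inside Python. This is a minimal sketch that assumes beautifulsoup4 is already installed; the variable name bs4_version is only for illustration.

```python
import bs4

# The bs4 package exposes its version as a plain string
bs4_version = bs4.__version__
print(bs4_version)
```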

Load Webpage

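In the real world your requirement may be to read the HTML directly from a website. If that is the case, use the requests library to fetch the webpage first and parse it into the variable read_html as shown below

import requests
from bs4 import BeautifulSoup
# Replace <WEBSITE_URL> with the page you want to read
r = requests.get("<WEBSITE_URL>", timeout=10)
r.raise_for_status()  # stop early if the server returned an error status
read_html = BeautifulSoup(r.text, 'html.parser')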

Since this is an example and I have saved the file as index.html, I am going to read the local file and save the parsed content to the variable read_html instead of using the requests library.

from bs4 import BeautifulSoup
# The with block closes the file automatically after reading
with open('index.html', 'r') as f:
    r = f.read()
read_html = BeautifulSoup(r, 'html.parser')

If you would like to use the html5lib parser instead, install html5lib with the command below and change the read_html line to load html5lib instead of html.parser

pip3 install html5lib
>>> read_html= BeautifulSoup(r,'html5lib')
>>> read_html
<html>
<body>
<head>
<title>Test Website</title>
</head>
<h2>Header 1</h2>
<p>List 1</p>
<p>List 2</p>
<h2>Header 2</h2>
<p>List 3</p>
<p>List 4</p>
<div id="main">
	This is the main content
</div>
<h2>Header 3</h2>
<p>List 5</p>
<p>List 6</p>
<div class="tool_tip">
This is a tip
</div>
<div class="tool_tip">
This is a tip 1
</div>
<div id="footer">
	This is the footer content
</div>
</body>
</html>
>>>

As you can see above, we now have the HTML data inside the variable read_html.

Get value inside Title tags

To get the title tag of the webpage, simply run this

>>> read_html.title
<title>Test Website</title>

To get just the text inside the title tag

>>> read_html.title.text
'Test Website'
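
On a page without a title tag, read_html.title is None, so reading .text from it raises an error. The sketch below guards against that case. The helper name page_title and the sample strings are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

def page_title(html):
    """Return the stripped <title> text, or None when the page has no title."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when no <title> tag exists, so check before reading text
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)

with_title = page_title("<html><head><title> Test Website </title></head></html>")
without_title = page_title("<html><body><p>No title here</p></body></html>")
print(with_title)
print(without_title)
```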

Get value of P tags

To get the value of P tag use this approach

>>> read_html.find_all("p")
[<p>List 1</p>, <p>List 2</p>, <p>List 3</p>, <p>List 4</p>, <p>List 5</p>, <p>List 6</p>]
>>> len(read_html.find_all("p"))
6
>>> read_html.find_all("p")[0].text
'List 1'
>>> read_html.find_all("p")[1].text
'List 2'
>>>

As you can see above, len() gives the number of p tags in the HTML, and we can then use a for loop to print each value

>>> for i in range(len(read_html.find_all("p"))):
...  print(read_html.find_all("p")[i].text)
...
List 1
List 2
List 3
List 4
List 5
List 6
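
The loop above calls find_all() again on every pass. A possible alternative is to call it once and iterate over the result directly. This sketch uses a shortened copy of the example HTML so that it runs on its own; the names paragraphs and p_texts are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<h2>Header 1</h2>
<p>List 1</p>
<p>List 2</p>
<h2>Header 2</h2>
<p>List 3</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Search once, then reuse the result instead of searching inside the loop
paragraphs = soup.find_all("p")
p_texts = [p.text for p in paragraphs]

for text in p_texts:
    print(text)
```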

Get value of H2 tags

This is similar to the steps we followed to get the value of the p elements

>>> print(read_html.find_all("h2"))
[<h2>Header 1</h2>, <h2>Header 2</h2>, <h2>Header 3</h2>]
>>>
>>> print(read_html.find_all("h2")[2].text)
Header 3
>>>

You can use len() and loop through each h2 to get the value.
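
One way to write that loop is with enumerate(), which supplies the position of each header without a separate len() call. This is a sketch on a shortened copy of the example HTML; the names numbered_headers, index and heading are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<h2>Header 1</h2>
<p>List 1</p>
<h2>Header 2</h2>
<p>List 3</p>
<h2>Header 3</h2>
"""

soup = BeautifulSoup(html, "html.parser")

numbered_headers = []
# enumerate() pairs each h2 tag with a counter starting at 1
for index, heading in enumerate(soup.find_all("h2"), start=1):
    numbered_headers.append((index, heading.text))
    print(index, heading.text)
```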

Extract DIV tag content by ID

Just use the find function to get the div and pass the id you would like to extract as shown below

>>> read_html.find('div',{"id":"footer"}).text
'\n\tThis is the footer content\n'
>>> read_html.find('div',{"id":"footer"}).text.strip()
'This is the footer content'
>>>

Use strip() to remove the newline and tab characters.
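
find() returns None when no element has the requested id, so calling .text on the result fails for a missing id. The sketch below wraps the lookup in a small helper. The name text_by_id is hypothetical, and get_text(strip=True) is used here as an alternative to .text.strip().

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
	This is the main content
</div>
<div id="footer">
	This is the footer content
</div>
"""

def text_by_id(soup, element_id):
    """Return the stripped text of the element with this id, or None if absent."""
    tag = soup.find(id=element_id)
    if tag is None:
        return None
    # strip=True trims the surrounding newline and tab characters
    return tag.get_text(strip=True)

soup = BeautifulSoup(html, "html.parser")
footer_text = text_by_id(soup, "footer")
missing_text = text_by_id(soup, "sidebar")
print(footer_text)
print(missing_text)
```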

Extract DIV tag content by Class Name

We can use the find_all function to extract data from divs with a specific class name, which is especially useful when multiple elements share the same class

>>> read_html.find_all('div',{"class":"tool_tip"})
[<div class="tool_tip">
This is a tip
</div>, <div class="tool_tip">
This is a tip 1
</div>]
>>>

If you would like to extract just the first matching div

>>> read_html.find_all('div',{"class":"tool_tip"})[0].text.strip()
'This is a tip'
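
Beautiful Soup also accepts the class_ keyword argument and CSS selectors through select(). The sketch below shows both on a copy of the tool_tip divs and should return the same elements as the dictionary form used above. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="tool_tip">
This is a tip
</div>
<div class="tool_tip">
This is a tip 1
</div>
<div id="footer">
	This is the footer content
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because "class" is a reserved word in Python
tips_by_keyword = [d.get_text(strip=True) for d in soup.find_all("div", class_="tool_tip")]

# The same lookup written as a CSS selector
tips_by_selector = [d.get_text(strip=True) for d in soup.select("div.tool_tip")]

print(tips_by_keyword)
print(tips_by_selector)
```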

Read all P tags after H2

In this example we are going to read all the P tags that come after each H2. For that, use the script below and save it as read.py

#!/usr/bin/python3
from bs4 import BeautifulSoup

# Read the saved example page; the with block closes the file for us
with open('index.html', 'r') as f:
    s = f.read()
soup_get = BeautifulSoup(s, 'html.parser')
get_h2_tag = soup_get.find_all('h2')
for i in get_h2_tag:
    htag = i.text.strip()
    print(htag)
    if htag != "":
        # Walk the elements that follow this h2 at the same level
        for sib in i.next_siblings:
            if sib.name == 'p':
                print(sib.text)
            elif sib.name == 'h2':
                # The next header starts a new group, so stop here
                print("---------------------------------------")
                break

Output of the above script

# ./read.py
Header 1
List 1
List 2
---------------------------------------
Header 2
List 3
List 4
---------------------------------------
Header 3
List 5
List 6
#
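
If you need the result as data rather than printed lines, the same sibling walk can fill a dictionary. This is a sketch under the same assumption as the script, namely that the h2 and p tags are siblings at the same level. The function name paragraphs_by_heading and the shortened HTML are illustrative.

```python
from bs4 import BeautifulSoup

def paragraphs_by_heading(html):
    """Map each h2 text to the list of p texts that follow it before the next h2."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = {}
    for heading in soup.find_all("h2"):
        items = []
        for sibling in heading.next_siblings:
            if sibling.name == "h2":
                break  # the next header starts a new group
            if sibling.name == "p":
                items.append(sibling.get_text(strip=True))
        grouped[heading.get_text(strip=True)] = items
    return grouped

html = """
<h2>Header 1</h2>
<p>List 1</p>
<p>List 2</p>
<h2>Header 2</h2>
<p>List 3</p>
<div id="main">This is the main content</div>
<h2>Header 3</h2>
"""

grouped = paragraphs_by_heading(html)
print(grouped)
```

A header with no following p tags, like Header 3 in this shortened sample, maps to an empty list.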

More information on BeautifulSoup can be found in its official documentation.
