Web Scraping in Python Using Beautiful Soup

Sachin Pal
9 min read · Aug 9, 2022

[Cover image. Source: Author (GeekPython.in)]

The Internet is filled with digital data we might need for research or personal interest, and getting hold of it often requires some web scraping skills.

Python has powerful tools to carry out web scraping efficiently and effectively, even on large amounts of data.

What is web scraping?

Web scraping, or web data extraction, is the process of gathering information from the Internet. It can be as simple as copy-pasting data from a specific website or as advanced as collecting real-time data from many websites.

Some websites don't mind having their data extracted, while others strictly prohibit it.

If you are scraping websites for educational purposes, you're unlikely to run into problems, but if you are starting a large-scale project, check the website's Terms of Service first.

Why do we need it?

Not all websites provide APIs to fetch their content, so in those cases we are left with only one option: scrape the content ourselves.

Steps for web scraping

  • Inspecting the source of data
  • Getting the HTML content
  • Parsing the HTML with Beautiful Soup

Now let's move ahead and install the dependencies we'll need for this tutorial.

Installing the dependencies

pip install requests beautifulsoup4

Scraping the website

We are going to scrape the Wikipedia article on the Python programming language. This page contains almost every common HTML tag, which makes it a good test bed for the different features of Beautiful Soup.

1. Inspecting the source of data

Before writing any Python code, take a good look at the website you are going to scrape.

You need to understand the website's structure to extract the information relevant to your project.

Thoroughly go through the website, perform basic actions, understand how the website works, and check the URLs, routes, query parameters, etc.

Inspecting the webpage using Developer Tools

Now, it's time to inspect the website's DOM (Document Object Model) using Developer Tools.

Developer Tools help in understanding the structure of the website. It can do various things, from inspecting the loaded HTML, CSS, and JavaScript to showing the assets the page has requested and how long they took to load. All modern browsers come with Developer Tools installed.

To open the dev tools in Chrome on Windows, right-click on the webpage and click the Inspect option, or use the following keyboard shortcut -

Ctrl + Shift + I

On macOS, the shortcut is -

⌘ + ⌥ + I

Now it's time to look at the DOM of the webpage we will scrape.

[Image: the DOM of the webpage to be scraped. Source: Author (GeekPython.in)]

The HTML on the right represents the page’s structure, which we can see on the left side.

2. Get the HTML content

We need the requests library, which we have already installed, to fetch the website's HTML content.

Next, open up your favorite IDE or Code Editor and retrieve the site's HTML in just a few lines of Python code.

import requests

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Step 1: Get the HTML
r = requests.get(url)
htmlContent = r.content

# Getting the content as bytes
print(htmlContent)

# Getting the encoded content
print(r.text)

If we print r.text, we'll get the same output as the HTML we inspected earlier with the browser's developer tools. Now we have access to the site's HTML in our Python script.
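Before moving on, it's worth confirming that the request actually succeeded. A minimal sketch (the timeout value here is an illustrative choice, not from the original article):

```python
import requests

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# A timeout prevents the script from hanging forever on a slow server
r = requests.get(url, timeout=10)

# Raises requests.HTTPError if the server returned a 4xx/5xx status
r.raise_for_status()

print(r.status_code)    # 200 on success
print(type(r.content))  # <class 'bytes'>
print(type(r.text))     # <class 'str'>
```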

Now let's parse the HTML using Beautiful Soup.

3. Parse the HTML with Beautiful Soup

We have successfully scraped the website's HTML, but there is a problem: the response is one long wall of markup, with tags and attributes scattered everywhere. We need to parse this lengthy response to make it more readable and accessible.

Beautiful Soup helps us to parse the structured data. It is a Python library for pulling data from HTML and XML files.

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Step 1: Get the HTML
r = requests.get(url)
content = r.content

# Step 2: Parse the HTML
soup = BeautifulSoup(content, 'html.parser')
print(soup)

The second argument we passed to the BeautifulSoup constructor is html.parser, which tells Beautiful Soup which parser to use. It is important to choose a parser suited to the HTML content.
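To illustrate what the parser argument does, here is a small sketch using a made-up HTML snippet; the alternative parsers shown in the comments are optional third-party installs:

```python
from bs4 import BeautifulSoup

# A tiny stand-in document, just for illustration
html = "<html><body><p>Hello</p></body></html>"

# html.parser is built into Python and needs no extra install
soup = BeautifulSoup(html, "html.parser")

# Faster or more lenient parsers (require separate installs):
# soup = BeautifulSoup(html, "lxml")      # pip install lxml
# soup = BeautifulSoup(html, "html5lib")  # pip install html5lib

print(soup.find("p").text)  # Hello
```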

Find elements by ID

Elements on an HTML page can have an id attribute assigned to them, which makes an element uniquely identifiable on the page.

Beautiful Soup allows us to find a specific HTML element by its ID:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

r = requests.get(url)
content = r.content

soup = BeautifulSoup(content, 'html.parser')

id_content = soup.find(id="firstHeading")

To prettify the HTML for easier viewing, we can call .prettify() on any Beautiful Soup object. Here we call it on the id_content variable from above.

print(id_content.prettify())

Find elements by Tag

On an HTML page we encounter many tags, and we often want the data that resides inside them, such as the hyperlinks in the "a" (anchor) tags or the description text in the "p" (paragraph) tags.

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

r = requests.get(url)
content = r.content

soup = BeautifulSoup(content, 'html.parser')

# Getting the first <code> tag
find_tag = soup.find("code")
print(find_tag.prettify())

# Getting all the <pre> tags
all_pre_tag = soup.find_all("pre")

for pre_tag in all_pre_tag:
    print(pre_tag)

Find elements by HTML Class Name

An HTML page contains hundreds of elements like <div>, <p>, or <a>, many with classes attached, and through these classes we can access the content inside a specific element.

Beautiful Soup provides a class_ argument to find the content present inside an element with a specified class name.

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

r = requests.get(url)
content = r.content

soup = BeautifulSoup(content, 'html.parser')

# Getting the "div" element with class name "mw-highlight"
class_elem = soup.find("div", class_="mw-highlight")
print(class_elem.prettify())

The first argument we passed to find() is the tag name, and the second is the class name.
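As a side note, Beautiful Soup also understands CSS selectors through .select(), which is another way to search by class. A sketch using a made-up snippet rather than the real Wikipedia markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the real Wikipedia HTML
html = """
<div class="mw-highlight"><pre>print("hi")</pre></div>
<div class="other">ignore me</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first element carrying the given class
first = soup.find("div", class_="mw-highlight")
print(first.pre.text)  # print("hi")

# .select() takes a CSS selector and returns every match as a list
matches = soup.select("div.mw-highlight")
print(len(matches))  # 1
```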

Find elements by Text Content and Class name

Beautiful Soup provides a string argument that allows us to search for a string instead of a tag. We can pass in a string, a regular expression, a list, a function, or the value True.

# Getting all the strings whose value is "Python"
find_str = soup.find_all(string="Python")
print(find_str)

.........
['Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python']

We can also find the tags whose value matches the specified value for the string argument.

find_str_tag = soup.find_all("p", string="Python")

Here we are looking for <p> tags whose string is exactly "Python". But if we print the result, we get an empty list.

print(find_str_tag)

.........
[]
[]

When we use string=, our program looks for exactly the value we provide. Any difference in whitespace, spelling, or capitalization will prevent the element from matching.
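If an exact match is too strict, a compiled regular expression can be passed instead, which matches substrings. A small sketch on made-up markup:

```python
import re
from bs4 import BeautifulSoup

html = "<p>Python is great</p><p>I like Java</p><p>python, lowercase</p>"
soup = BeautifulSoup(html, "html.parser")

# Exact match: none of the <p> strings is exactly "Python"
print(soup.find_all("p", string="Python"))  # []

# A regular expression matches "Python" anywhere in the string
print(len(soup.find_all("p", string=re.compile("Python"))))  # 1

# re.IGNORECASE also catches the lowercase spelling
print(len(soup.find_all("p", string=re.compile("python", re.IGNORECASE))))  # 2
```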

If we provide the exact value, then the program will run successfully.

find_str_tag = soup.find_all("span", string="Typing")
print(find_str_tag)

.........
[<span class="toctext">Typing</span>, <span class="mw-headline" id="Typing">Typing</span>]

Passing a Function

In the section above, our attempt to find <p> tags containing the string "Python" came up empty.

But Beautiful Soup also allows us to pass a function as the string argument. With a function, we can make the earlier search work.

# A function passed to string= receives each tag's string
def has_python(text):
    return text is not None and "Python" in text

find_str_tag = soup.find_all("p", string=has_python)
print(len(find_str_tag))

Here we created a function called has_python, which receives a tag's string and returns True when it contains "Python". Note that only tags whose entire content is a single string have a .string value to test, so <p> tags with nested tags inside them are skipped.

Next, we passed the function itself (not its return value) as the string argument, and printed the number of matching <p> tags.

Extract Text from HTML elements

What if we do not want the content with the HTML tags attached to them? What if we want clean and simple text data from the elements and tags?

We can use .text or .get_text() to return only the text content of the HTML elements we pass in the Beautiful Soup object.

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

r = requests.get(url)
content = r.content

soup = BeautifulSoup(content, 'html.parser')

table_elements = soup.find_all("table", class_="wikitable")

for table_data in table_elements:
    table_body = table_data.find("tbody")

    print(table_body.text)
    # or
    print(table_body.get_text())

We'll get the whole table as output in text format, but with many whitespace characters between the text, so we'll strip them using the .strip() method.

print(table_body.text.strip())

There are other ways to remove the extra whitespace as well.
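One such way is get_text() itself, which accepts a separator and a strip flag. A sketch with a toy table:

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>  Name </td><td> Guido </td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Plain .text keeps all the original whitespace
print(repr(table.text))

# strip=True trims each text fragment; the separator joins them
print(table.get_text(" | ", strip=True))  # Name | Guido
```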

Extract Attributes from HTML elements

An HTML page has numerous attributes like href, src, style, title, and more. Since the page contains many <a> tags with href attributes, let's scrape all the href values on the page.

We cannot scrape attributes the same way we scraped elements in the examples above.

# Accessing the main content of the HTML page
anchor_in_body_content = soup.find(id="bodyContent")

# Finding all the anchor tags
anchors = anchor_in_body_content.find_all("a")

# Looping over all the anchor tags to get the href attribute
for link in anchors:
    links = link.get('href')
    print(links)

We looped over all the <a> tags in the main content of the page and called .get('href') on each to read its href attribute.
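Many of these href values are relative paths like /wiki/... rather than full URLs. The standard library's urljoin can resolve them against the page URL; a sketch using an illustrative anchor tag:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# An illustrative anchor with a relative href
html = '<a href="/wiki/Guido_van_Rossum">Guido</a>'
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a"):
    # urljoin resolves the relative path against the base URL
    print(urljoin(base_url, link.get("href")))
```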

You can do the same for the src attributes also.

# Accessing src in body of the HTML pageimg_in_body_content = soup.find(id="bodyContent")

# Finding all the img tags
media = img_in_body_content.find_all("img")

# Looping over all the img tags to get the src attribute
for img in media:
images = img.get('src')
print(images)

Access Parent and Sibling elements

Beautiful Soup allows us to access an element's parent using the .parent attribute.

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

r = requests.get(url)
content = r.content

soup = BeautifulSoup(content, 'html.parser')

id_content = soup.find(id="cite_ref-123")

parent_elem = id_content.parent
print(parent_elem)

We can also find the grandparent or great-grandparent of an element by chaining .parent.

id_content = soup.find(id="cite_ref-123")

grandparent_elem = id_content.parent.parent
print(grandparent_elem)

Another attribute Beautiful Soup provides is .parents, which lets us iterate over all of an element's parents.

id_content = soup.find(id="cite_ref-123")

for elem in id_content.parents:
    # print(elem)   # prints the whole element
    print(elem.name)  # prints only the name of each element

Note: This program might take a little time to complete, so wait until the program is finished.

p
div
div
div
div
body
html
[document]

Similarly, we can access an element's next and previous siblings using .next_sibling and .previous_sibling respectively.

id_content = soup.find(id="cite_ref-123")

# To print the next sibling of an element
next_sibling_elem = id_content.next_sibling

print(next_sibling_elem)

id_content = soup.find(id="cite_ref-123")

# To print the previous sibling of an element
previous_sibling_elem = id_content.previous_sibling

print(previous_sibling_elem)
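One thing to watch out for: the whitespace between tags in the raw HTML counts as a text node, so .next_sibling often returns a newline rather than the next tag. find_next_sibling() skips straight to the next tag. A sketch on a toy snippet:

```python
from bs4 import BeautifulSoup

html = """<ul>
<li id="first">one</li>
<li id="second">two</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find(id="first")

# The newline between the two <li> tags is itself a sibling
print(repr(first.next_sibling))  # '\n'

# find_next_sibling() skips text nodes and returns the next tag
print(first.find_next_sibling().get("id"))  # second
```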

Iterating over all the next siblings

id_content = soup.find(id="cite_ref-123")

for next_elem in id_content.next_siblings:
    print(next_elem)

Iterating over all the previous siblings

id_content = soup.find(id="cite_ref-123")

for previous_elem in id_content.previous_siblings:
    print(previous_elem)

Using Regular Expression

Lastly, we can use a regular expression to search for elements, tags, text, etc., in the HTML tree.

import requests
from bs4 import BeautifulSoup
import re

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

r = requests.get(url)
content = r.content

soup = BeautifulSoup(content, 'html.parser')

id_content = soup.find(id="bodyContent")

# Find all tags whose names start with "p" (e.g., <p> and <pre>)
for tag in id_content.find_all(re.compile("^p")):
    print(tag.name)

The pattern "^p" matches every tag whose name starts with the letter "p". Similarly, the pattern "\w" matches word characters: a-z, A-Z, 0-9, and the underscore. Since no tag name starts with a digit or an underscore, it simply returns every tag inside the element passed to find_all:

id_content = soup.find(id="bodyContent")

for tag in id_content.find_all(re.compile(r"\w")):
    print(tag.name)

Conclusion

Well, we learned how to scrape a static website. Things can be different for dynamic websites that return different data on each request, or for pages hidden behind authentication; for those, more powerful tools such as Selenium and Scrapy are available.

The requests library gives us access to the site's HTML, which Beautiful Soup then helps us pull the data out of.

Many more methods and functions are available that we haven't covered, but we discussed the ones used most commonly.

That's all for now

Keep Coding✌✌

Originally published at https://geekpython.in.
