Processing XML Data from Web Scraping in Python

Web scraping is a powerful way to extract data from websites, and many sites provide information in XML format, such as RSS feeds. Python makes it easy to parse and process XML data from such sources for applications like news aggregators, data monitoring tools, or personalized dashboards.

1. Fetching and Parsing XML Data from an RSS Feed

Here’s an example of fetching an RSS feed and extracting its data using Python’s requests library and ElementTree module:

Code Example

import requests
import xml.etree.ElementTree as ET

# URL of the RSS feed
rss_url = "https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml"

# Fetch the RSS feed
try:
    response = requests.get(rss_url, timeout=10)  # timeout avoids hanging on a slow server

    if response.status_code == 200:
        xml_content = response.text
        print("RSS feed fetched successfully.")
    else:
        print(f"Failed to fetch RSS feed. Status code: {response.status_code}")
        exit()
except requests.exceptions.RequestException as e:
    print("Error fetching RSS feed:", str(e))
    exit()

# Parse the XML content
try:
    root = ET.fromstring(xml_content)
except ET.ParseError as e:
    print("Error parsing XML content:", str(e))
    exit()

# Extract and print data
print("\nTechnology News Headlines:\n")
for item in root.findall('./channel/item'):
    title = item.find('title').text
    link = item.find('link').text
    pub_date = item.find('pubDate').text
    print(f"Title: {title}\nLink: {link}\nPublished on: {pub_date}\n")

Explanation:

  • requests.get(): Fetches the RSS feed from the specified URL.
  • ET.fromstring(): Parses the XML content into an ElementTree object.
  • root.findall(): Navigates to the <item> elements containing individual news articles.
  • item.find(): Extracts specific fields like <title>, <link>, and <pubDate>.
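Note that `item.find('title').text` raises an AttributeError when the element is missing, because find() returns None. A safer pattern is findtext(), which accepts a default value. Here is a minimal sketch using an inline sample feed (the URL and headline are placeholders, not real feed data); the item deliberately has no <pubDate>:

```python
import xml.etree.ElementTree as ET

# Inline sample feed so the sketch runs without a network request.
sample = """<rss><channel><item>
    <title>Example headline</title>
    <link>https://example.com/a</link>
</item></channel></rss>"""

root = ET.fromstring(sample)
for item in root.findall('./channel/item'):
    # findtext() returns the default instead of raising when a child is absent
    title = item.findtext('title', default='(no title)')
    link = item.findtext('link', default='')
    pub_date = item.findtext('pubDate', default='(no date)')  # missing in sample
    print(f"Title: {title}\nLink: {link}\nPublished on: {pub_date}")
```

The same substitution works in the main script above whenever a feed may omit optional elements.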
---

2. Storing the Extracted Data for Further Use

The extracted data can be stored in a structured format, such as a list of dictionaries, and saved as a CSV or JSON file for further processing.

Code Example

# Store data in a list
news_data = []
for item in root.findall('./channel/item'):
    news_data.append({
        'title': item.find('title').text,
        'link': item.find('link').text,
        'pubDate': item.find('pubDate').text,
    })

# Save to a CSV file
import pandas as pd
df = pd.DataFrame(news_data)
df.to_csv("news.csv", index=False)
print("News saved to news.csv successfully.")

Explanation:

  • news_data.append(): Collects the extracted data into a list of dictionaries.
  • pandas.DataFrame(): Converts the list into a Pandas DataFrame for easier manipulation.
  • to_csv(): Saves the DataFrame to a CSV file for further analysis.
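If you would rather avoid the pandas dependency, the same list of dictionaries can be written to JSON with the standard library alone. A small sketch, seeded with a placeholder entry so it runs on its own (in practice, reuse the `news_data` list built above):

```python
import json

# Placeholder data standing in for the news_data list built from the feed
news_data = [
    {'title': 'Sample headline', 'link': 'https://example.com/a',
     'pubDate': 'Sat, 04 Jan 2025 10:00:00 GMT'},
]

# ensure_ascii=False keeps non-ASCII characters readable in the output file
with open('news.json', 'w', encoding='utf-8') as f:
    json.dump(news_data, f, ensure_ascii=False, indent=2)
print("News saved to news.json successfully.")
```

JSON preserves the nested structure exactly, which makes it convenient when the records are read back by another Python script or a web front end.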
---

Efficiently Storing RSS Feed Data in SQLite Using Python

Learn how to fetch data from an RSS feed, parse its XML content, and store it in an SQLite database using Python. The script below checks for new updates and appends only unique records, keeping the stored data organized and free of duplicates.

Code Example
import requests
import xml.etree.ElementTree as ET
import sqlite3
from datetime import datetime

# RSS feed URL (BBC World News)
rss_url = "http://feeds.bbci.co.uk/news/world/rss.xml"

# SQLite database file (adjust this path for your system)
db_file = "E:\\testing\\sqlite\\rss_feed_data.db"

# Connect to SQLite database (or create if it doesn't exist)
conn = sqlite3.connect(db_file)
cursor = conn.cursor()

# Create a table to store RSS feed data (if not exists)
cursor.execute("""
CREATE TABLE IF NOT EXISTS rss_feed (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    link TEXT NOT NULL UNIQUE,
    pub_date TEXT NOT NULL
)
""")
conn.commit()

# Fetch RSS feed
response = requests.get(rss_url)
if response.status_code != 200:
    print(f"Failed to fetch RSS feed. Status code: {response.status_code}")
    exit()

# Parse the RSS feed
root = ET.fromstring(response.content)

# Extract the latest publication date from the database
cursor.execute("SELECT MAX(pub_date) FROM rss_feed")
last_pub_date = cursor.fetchone()[0]

if last_pub_date:
    last_pub_date = datetime.strptime(last_pub_date, "%a, %d %b %Y %H:%M:%S %Z")
else:
    last_pub_date = datetime.min  # Set to the earliest possible date

# Initialize a counter for new records
new_records = 0

# Parse the XML and insert new records into the database
for item in root.findall('./channel/item'):
    title = item.find('title').text
    link = item.find('link').text
    pub_date_text = item.find('pubDate').text
    pub_date = datetime.strptime(pub_date_text, "%a, %d %b %Y %H:%M:%S %Z")

    # Only insert if the publication date is later than the last date in the database
    if pub_date > last_pub_date:
        try:
            cursor.execute("INSERT INTO rss_feed (title, link, pub_date) VALUES (?, ?, ?)",
                           (title, link, pub_date_text))
            new_records += 1
        except sqlite3.IntegrityError:
            print(f"Duplicate record skipped: {title}")

# Commit changes to the database
conn.commit()

# Close the database connection
conn.close()

# Output the result
print(f"Script completed. {new_records} new record(s) added to the database.")

Code Explanation:

  • rss_url: The URL of the RSS feed to fetch data from.
  • sqlite3.connect: Connects to an SQLite database file or creates it if it doesn't exist.
  • CREATE TABLE IF NOT EXISTS: Ensures the `rss_feed` table is created with columns for `title`, `link`, and `pub_date`.
  • ET.fromstring: Parses the XML content into an ElementTree object for navigation.
  • last_pub_date: Fetches the latest publication date from the database to avoid duplicate entries.
  • INSERT INTO: Adds new records to the database only if they are not duplicates.
  • conn.commit: Saves the changes to the database.
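Once records are stored, they can be read back with an ordinary SELECT. The sketch below uses an in-memory database seeded with sample rows so it is self-contained; in practice, point sqlite3.connect at `rss_feed_data.db`. Note that it orders by `id` rather than `pub_date`, because RFC 822 date strings such as "Sat, 04 Jan 2025 ..." do not sort chronologically as text:

```python
import sqlite3

# In-memory database seeded with sample rows; use the rss_feed_data.db file
# created by the script above in real use.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE rss_feed (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    link TEXT NOT NULL UNIQUE,
    pub_date TEXT NOT NULL)""")
cur.executemany(
    "INSERT INTO rss_feed (title, link, pub_date) VALUES (?, ?, ?)",
    [("Older story", "https://example.com/1", "Fri, 03 Jan 2025 09:00:00 GMT"),
     ("Newer story", "https://example.com/2", "Sat, 04 Jan 2025 09:00:00 GMT")])

# id is assigned in insertion order, so DESC gives the most recent rows first
rows = cur.execute(
    "SELECT title, pub_date FROM rss_feed ORDER BY id DESC LIMIT 5").fetchall()
for title, pub_date in rows:
    print(f"{pub_date}  {title}")
conn.close()
```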

This script ensures that only new records are added to the database by checking the publication date, making it efficient for handling updates from RSS feeds.
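One caveat: `strptime` with the `%Z` directive only recognizes a handful of timezone names such as GMT and UTC, so a feed that emits numeric offsets like `+0530` would make the script fail. A more robust alternative is `email.utils.parsedate_to_datetime` from the standard library, which handles both RFC 822 forms and returns timezone-aware datetimes, so comparisons across feeds stay correct:

```python
from email.utils import parsedate_to_datetime

# Both RFC 822 variants parse cleanly, with the offset preserved
d1 = parsedate_to_datetime("Sat, 04 Jan 2025 10:00:00 GMT")
d2 = parsedate_to_datetime("Sat, 04 Jan 2025 16:30:00 +0530")

print(d1.isoformat())  # 2025-01-04T10:00:00+00:00
# 16:30 at +0530 is 11:00 UTC, so d1 (10:00 UTC) is earlier
print(d1 < d2)
```

Swapping this in for the two `datetime.strptime` calls in the script above would make the date comparison independent of how each feed formats its timezone.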

Integrating Google News RSS Feed

In this section, we will demonstrate how to use the requests library to fetch RSS feed data and ElementTree for parsing XML. We'll fetch stock market news from Google News.

The following script fetches the RSS feed for stock market news and displays the top 5 articles with their headlines and links.

import requests
import xml.etree.ElementTree as ET

# URL for the Google News RSS feed (Stock Market-related)
rss_url = "https://news.google.com/rss/search?q=Stock+Market&hl=en-IN&gl=IN&ceid=IN:en"

# Function to fetch and parse the RSS feed
def fetch_stock_news():
    try:
        # Fetch the RSS feed
        response = requests.get(rss_url)
        response.raise_for_status()  # Raise an exception for HTTP errors
        
        # Parse the XML content
        root = ET.fromstring(response.content)
        
        # Print the feed title
        channel = root.find("channel")
        feed_title = channel.find("title").text
        print(f"Feed Title: {feed_title}\n")

        # Extract and print the top 5 news items
        print("Latest Stock Market News:\n")
        items = channel.findall("item")  # Get all news items
        for item in items[:5]:  # Fetch only the top 5 entries
            title = item.find("title").text  # News headline
            link = item.find("link").text  # News URL
            print(f"- {title}")  # Print the title
            print(f"  Link: {link}\n")  # Print the URL

    except Exception as e:
        print(f"Error fetching news: {e}")

# Run the function
fetch_stock_news()

How It Works

  • Fetching the RSS Feed: The requests.get function fetches the XML content from the RSS feed URL.
  • Parsing the XML: The ElementTree library parses the XML structure to extract the feed title and individual news items.
  • Extracting Data: The script iterates over the <item> elements to retrieve the news headlines (<title>) and links (<link>).

Sample Output

Feed Title: Google News - Stock Market

Latest Stock Market News:

- Stock Market Hits Record High Amid Tech Rally
  Link: https://news.google.com/articles/ABC123

- Indian Stock Market Shows Positive Growth
  Link: https://news.google.com/articles/XYZ456

- Analysts Predict Bullish Trends for 2025
  Link: https://news.google.com/articles/DEF789

Enhancements

You can expand this script further with the following features:

  • Save News to File: Save the fetched news into a CSV or JSON file for offline use.
  • GUI Integration: Use Tkinter to display the news in a graphical application.
  • Advanced Parsing: Use BeautifulSoup if the XML structure is complex or requires advanced handling.
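The first enhancement, saving the news to a file, can be sketched with the standard csv module. The inline feed below stands in for the live Google News response so the example runs without network access (the headlines and URLs are placeholders); in practice you would call ET.fromstring on `response.content` instead:

```python
import csv
import xml.etree.ElementTree as ET

# Placeholder feed standing in for the live Google News RSS response
sample = """<rss><channel><title>Google News - Stock Market</title>
<item><title>Headline one</title><link>https://example.com/1</link></item>
<item><title>Headline two</title><link>https://example.com/2</link></item>
</channel></rss>"""

root = ET.fromstring(sample)
rows = [(item.findtext('title'), item.findtext('link'))
        for item in root.findall('./channel/item')]

# newline='' prevents blank lines in the CSV on Windows
with open('stock_news.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'link'])
    writer.writerows(rows)
print(f"Saved {len(rows)} articles to stock_news.csv")
```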

This script provides a robust foundation for working with RSS feeds and parsing XML data for real-time updates.

Conclusion

Processing XML data from web scraping allows you to extract useful information from RSS feeds and other XML sources. Using Python’s requests and ElementTree, you can parse and store the data efficiently. This technique is ideal for creating news aggregators, dashboards, or monitoring tools.


©2000-2024 plus2net.com All rights reserved worldwide