Web scraping is a powerful way to extract data from websites, and many sites provide information in XML format, such as RSS feeds. Python makes it easy to parse and process XML data from such sources for applications like news aggregators, data monitoring tools, or personalized dashboards.
Here’s an example of fetching an RSS feed and extracting its data using Python’s requests library and ElementTree module:
import requests
import xml.etree.ElementTree as ET

# URL of the RSS feed
rss_url = "https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml"

# Fetch the RSS feed
try:
    response = requests.get(rss_url, timeout=10)
    if response.status_code == 200:
        xml_content = response.text
        print("RSS feed fetched successfully.")
    else:
        print(f"Failed to fetch RSS feed. Status code: {response.status_code}")
        exit()
except requests.exceptions.RequestException as e:
    print("Error fetching the RSS feed:", str(e))
    exit()

# Parse the XML content
try:
    root = ET.fromstring(xml_content)
except ET.ParseError as e:
    print("Error parsing XML content:", str(e))
    exit()

# Extract and print data
print("\nTechnology News Headlines:\n")
for item in root.findall('./channel/item'):
    title = item.find('title').text
    link = item.find('link').text
    pub_date = item.find('pubDate').text
    print(f"Title: {title}\nLink: {link}\nPublished on: {pub_date}\n")
The script iterates over the <item> elements, each of which represents an individual news article, and reads its <title>, <link>, and <pubDate> children. The extracted data can be stored in a structured format, such as a list of dictionaries, and saved as a CSV or JSON file for further processing.
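Some feeds omit optional elements, in which case `item.find('title').text` raises an AttributeError because `find()` returns None. A minimal sketch of a safer extraction using `findtext()` with a default value; the two-item feed below is hypothetical sample data standing in for a real response:

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the second item has no <pubDate> element
xml_content = """
<rss><channel>
  <item><title>First story</title><link>https://example.com/1</link>
        <pubDate>Mon, 06 Jan 2025 09:00:00 GMT</pubDate></item>
  <item><title>Second story</title><link>https://example.com/2</link></item>
</channel></rss>
"""

root = ET.fromstring(xml_content)
articles = []
for item in root.findall("./channel/item"):
    # findtext() returns the default instead of raising when a tag is missing
    articles.append({
        "title": item.findtext("title", default="(no title)"),
        "link": item.findtext("link", default=""),
        "pubDate": item.findtext("pubDate", default="(no date)"),
    })

print(articles[1]["pubDate"])  # prints "(no date)"
```

This keeps the loop from crashing on a single malformed item, which matters when a script runs unattended against a live feed.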
import pandas as pd  # third-party: pip install pandas

# Store data in a list of dictionaries
news_data = []
for item in root.findall('./channel/item'):
    news_data.append({
        'title': item.find('title').text,
        'link': item.find('link').text,
        'pubDate': item.find('pubDate').text,
    })

# Save to a CSV file
df = pd.DataFrame(news_data)
df.to_csv("news.csv", index=False)
print("News saved to news.csv successfully.")
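The text above also mentions JSON as a storage format. A short sketch using only the standard library; the single-entry `news_data` list here is hypothetical sample data matching the structure built in the loop above:

```python
import json

# Hypothetical sample record; in the real script news_data comes from the feed
news_data = [
    {"title": "Example headline", "link": "https://example.com/a",
     "pubDate": "Mon, 06 Jan 2025 09:00:00 GMT"},
]

# ensure_ascii=False keeps non-ASCII characters readable in the file
with open("news.json", "w", encoding="utf-8") as f:
    json.dump(news_data, f, indent=2, ensure_ascii=False)

# Reading it back confirms a lossless round trip
with open("news.json", encoding="utf-8") as f:
    restored = json.load(f)
print(restored == news_data)  # prints True
```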
import requests
import xml.etree.ElementTree as ET
import sqlite3
from datetime import datetime

# RSS feed URL (BBC World News)
rss_url = "http://feeds.bbci.co.uk/news/world/rss.xml"

# SQLite database file
db_file = "E:\\testing\\sqlite\\rss_feed_data.db"

# Connect to the SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect(db_file)
cursor = conn.cursor()

# Create a table to store RSS feed data (if it doesn't exist).
# pub_date is stored in ISO 8601 format so that MAX() and string
# comparisons order the dates chronologically.
cursor.execute("""
CREATE TABLE IF NOT EXISTS rss_feed (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    link TEXT NOT NULL UNIQUE,
    pub_date TEXT NOT NULL
)
""")
conn.commit()

# Fetch the RSS feed
response = requests.get(rss_url, timeout=10)
if response.status_code != 200:
    print(f"Failed to fetch RSS feed. Status code: {response.status_code}")
    exit()

# Parse the RSS feed
root = ET.fromstring(response.content)

# Extract the latest publication date from the database
cursor.execute("SELECT MAX(pub_date) FROM rss_feed")
last_pub_date_text = cursor.fetchone()[0]
if last_pub_date_text:
    last_pub_date = datetime.fromisoformat(last_pub_date_text)
else:
    last_pub_date = datetime.min  # Earliest possible date

# Initialize a counter for new records
new_records = 0

# Parse the XML and insert new records into the database
for item in root.findall('./channel/item'):
    title = item.find('title').text
    link = item.find('link').text
    pub_date_text = item.find('pubDate').text
    # RSS dates look like "Mon, 06 Jan 2025 12:00:00 GMT"
    pub_date = datetime.strptime(pub_date_text, "%a, %d %b %Y %H:%M:%S %Z")

    # Only insert if the publication date is later than the last date stored
    if pub_date > last_pub_date:
        try:
            cursor.execute(
                "INSERT INTO rss_feed (title, link, pub_date) VALUES (?, ?, ?)",
                (title, link, pub_date.isoformat()))
            new_records += 1
        except sqlite3.IntegrityError:
            print(f"Duplicate record skipped: {title}")

# Commit changes and close the database connection
conn.commit()
conn.close()

# Output the result
print(f"Script completed. {new_records} new record(s) added to the database.")
This script ensures that only new records are added to the database by checking the publication date, making it efficient for handling updates from RSS feeds.
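To sanity-check what the script stored, you can query the table back. A self-contained sketch using an in-memory database with hypothetical rows (the real script writes to rss_feed_data.db instead); it also illustrates why the UNIQUE constraint on link lets the script skip already-stored articles:

```python
import sqlite3

# In-memory database so the sketch runs standalone
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS rss_feed (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    link TEXT NOT NULL UNIQUE,
    pub_date TEXT NOT NULL
)
""")

# Hypothetical rows, pub_date in ISO 8601 so string ordering is chronological
rows = [
    ("Older story", "https://example.com/1", "2025-01-05T10:00:00"),
    ("Newer story", "https://example.com/2", "2025-01-06T09:30:00"),
]
cursor.executemany(
    "INSERT INTO rss_feed (title, link, pub_date) VALUES (?, ?, ?)", rows)
conn.commit()

# Inserting the same link again violates the UNIQUE constraint
try:
    cursor.execute(
        "INSERT INTO rss_feed (title, link, pub_date) VALUES (?, ?, ?)",
        ("Duplicate", "https://example.com/1", "2025-01-06T10:00:00"))
except sqlite3.IntegrityError:
    print("Duplicate link rejected")

# Most recent article by publication date
cursor.execute("SELECT title FROM rss_feed ORDER BY pub_date DESC LIMIT 1")
latest = cursor.fetchone()[0]
print(latest)  # prints "Newer story"
conn.close()
```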
In this section, we will demonstrate how to use the requests
library to fetch RSS feed data and ElementTree
for parsing XML. We'll fetch stock market news from Google News.
The following script fetches the RSS feed for stock market news and displays the top 5 articles with their headlines and links.
import requests
import xml.etree.ElementTree as ET

# URL for the Google News RSS feed (stock market-related)
rss_url = "https://news.google.com/rss/search?q=Stock+Market&hl=en-IN&gl=IN&ceid=IN:en"

# Function to fetch and parse the RSS feed
def fetch_stock_news():
    try:
        # Fetch the RSS feed
        response = requests.get(rss_url, timeout=10)
        response.raise_for_status()  # Raise an exception for HTTP errors

        # Parse the XML content
        root = ET.fromstring(response.content)

        # Print the feed title
        channel = root.find("channel")
        feed_title = channel.find("title").text
        print(f"Feed Title: {feed_title}\n")

        # Extract and print the top 5 news items
        print("Latest Stock Market News:\n")
        items = channel.findall("item")   # Get all news items
        for item in items[:5]:            # Show only the top 5 entries
            title = item.find("title").text  # News headline
            link = item.find("link").text    # News URL
            print(f"- {title}")
            print(f"  Link: {link}\n")
    except Exception as e:
        print(f"Error fetching news: {e}")

# Run the function
fetch_stock_news()
The requests.get call fetches the XML content from the RSS feed URL, and ElementTree parses the XML structure to extract the feed title and the individual news items. The script then loops over the <item> elements to retrieve each headline (<title>) and link (<link>).

Sample output:

Feed Title: Google News - Stock Market
Latest Stock Market News:
- Stock Market Hits Record High Amid Tech Rally
Link: https://news.google.com/articles/ABC123
- Indian Stock Market Shows Positive Growth
Link: https://news.google.com/articles/XYZ456
- Analysts Predict Bullish Trends for 2025
Link: https://news.google.com/articles/DEF789
You can expand this script further with features such as keyword filtering, scheduled fetching, or saving the results to a database or CSV file. As it stands, it provides a solid foundation for working with RSS feeds and parsing XML data for real-time updates.
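One of those extensions, keyword filtering, can be sketched as a small helper applied to the parsed items before display. The articles list below is hypothetical sample data standing in for items parsed from the feed:

```python
# Hypothetical parsed items; in the real script these come from the RSS feed
articles = [
    {"title": "Stock Market Hits Record High Amid Tech Rally"},
    {"title": "Local Weather Update"},
    {"title": "Analysts Predict Bullish Trends for 2025"},
]

def filter_by_keywords(items, keywords):
    """Keep items whose title contains any keyword (case-insensitive)."""
    keywords = [k.lower() for k in keywords]
    return [it for it in items
            if any(k in it["title"].lower() for k in keywords)]

matches = filter_by_keywords(articles, ["stock", "bullish"])
for item in matches:
    print(item["title"])  # prints the two matching headlines
```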
Processing XML data from web scraping allows you to extract useful information from RSS feeds and other XML sources. Using Python’s requests and ElementTree, you can parse and store the data efficiently. This technique is ideal for creating news aggregators, dashboards, or monitoring tools.