This project is a lightweight, fully functional Python + BeautifulSoup webpage auditor that lets you fetch HTML from any set of URLs, store it in an SQLite database, analyze SEO-related tags, and export the results to Excel for reporting.
All scripts are independent, sharing a common configuration file, `settings.py`, so you can change database paths or root folder settings in one place. Users can download the complete ZIP containing the scripts and start running them immediately.
- `settings.py` defines `DB_PATH`, `ROOT_PATH`, the User-Agent string, and timeout values.
- Fetching: downloaded HTML is stored in the `pages` table. Each run can append new URLs without affecting others.
- Analysis: reads HTML from the `pages` table, extracts key SEO tags (title, description, keywords, canonical URL, schema, image alt coverage, etc.), and saves them in the `page_summary` table.
- Custom queries: edit the `query` variable for custom reports or deletions.
- Export: reads `page_summary` and exports the rows to Excel for easy sharing and further analysis.
- Archival: raw HTML can also be exported from the `pages` table for archival purposes.

For example, to find pages with short titles in `page_summary`:

```sql
SELECT url FROM page_summary WHERE LENGTH(title) < 50;
```
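The fetch step described above can be sketched as follows. This is a minimal illustration assuming a simple `pages(url, html)` schema; the real script reads `DB_PATH`, the User-Agent string, and the timeout from `settings.py`, so the constants here are placeholders:

```python
import sqlite3
import requests

DB_PATH = "seo_audit.db"  # placeholder; the real path comes from settings.py
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; SEOAuditor/1.0)"}
TIMEOUT = 10  # seconds per request

def ensure_schema(conn):
    """Create the pages table if it does not exist yet."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, html TEXT)")

def store_page(conn, url, html):
    """Append one fetched page; existing rows are left untouched."""
    conn.execute("INSERT INTO pages (url, html) VALUES (?, ?)", (url, html))

def fetch_pages(urls, db_path=DB_PATH):
    """Download each URL and append its raw HTML to the pages table."""
    conn = sqlite3.connect(db_path)
    ensure_schema(conn)
    for url in urls:
        try:
            resp = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
            store_page(conn, url, resp.text)
        except requests.RequestException as exc:
            print(f"Skipping {url}: {exc}")  # keep going on network errors
    conn.commit()
    conn.close()
```

Because each run only appends rows, you can audit new URL batches without disturbing earlier results.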
The Excel report is generated from the `page_summary` table and can be directly shared with your team or clients. Among other things, you can:

- Detect pages where the `title` or meta `description` is missing or too short.
- Find missing `canonical` tags or `og:*` social meta tags.

Unlike online SEO checkers, this tool lets you audit hundreds of pages offline, retain historical data in a database, and customize your own queries for deeper analysis. It's ideal for webmasters, SEO analysts, and developers who want full control over their audit process.
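As a rough illustration of how these fields are pulled out of the raw HTML, here is a hedged sketch of the extraction with BeautifulSoup. The actual logic lives in `page_summary.py` and may differ; the field names below are illustrative:

```python
from bs4 import BeautifulSoup

def summarize(html):
    """Extract a handful of SEO fields from raw HTML (illustrative subset)."""
    soup = BeautifulSoup(html, "html.parser")
    desc = soup.find("meta", attrs={"name": "description"})
    canonical = soup.find("link", rel="canonical")
    imgs = soup.find_all("img")
    with_alt = sum(1 for img in imgs if img.get("alt"))
    return {
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "description": desc.get("content") if desc else None,
        "canonical_url": canonical.get("href") if canonical else None,
        "img_alt_coverage": with_alt / len(imgs) if imgs else None,
    }
```

A `None` title or description is exactly what the "missing or too short" checks look for; alt coverage below 1.0 flags images without `alt` text.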
`page_summary.py` can be easily extended to capture additional HTML tags, schema types, or even custom string patterns relevant to your project.

Setup:

1. Update `DB_PATH` and `ROOT_PATH` in `settings.py` to match your local directory.
2. Install the dependencies:

```
pip install requests beautifulsoup4 lxml pandas openpyxl
```
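For reference, `settings.py` might look like this; every value here is illustrative, so point the paths at your own machine:

```python
# settings.py: shared configuration used by every script
# (illustrative values; adjust to your environment)
DB_PATH = "seo_audit.db"       # SQLite database file
ROOT_PATH = "./seo_audit"      # root folder for exports and reports
USER_AGENT = "Mozilla/5.0 (compatible; SEOAuditor/1.0)"  # sent with each request
TIMEOUT = 10                   # seconds to wait per HTTP request
```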
Raw HTML is stored in the `pages` table and summaries in `page_summary`. Edit the `query` line to answer a specific SEO question.

Missing or too-short `<title>` (runs against `pages`):

```sql
SELECT url
FROM pages
WHERE html NOT LIKE '%<title%'
   OR LENGTH(substr(html, instr(html, '<title>')+7,
                    instr(html, '</title>')-instr(html, '<title>')-7)) < 20;
```
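Each of these checks can be dropped into a tiny runner. A sketch, with the database path hard-coded as a placeholder (the real scripts import it from `settings.py`):

```python
import sqlite3

def run_query(sql, db_path):
    """Execute one SQL check against the audit database and return all rows."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()

if __name__ == "__main__":
    # Paste any check from this page into `query`.
    query = "SELECT url FROM pages WHERE html NOT LIKE '%<title%';"
    for (url,) in run_query(query, "seo_audit.db"):  # placeholder path
        print(url)
```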
Canonical URL mismatches (runs against `page_summary`):

```sql
SELECT url, canonical_url
FROM page_summary
WHERE canonical_url IS NOT NULL
  AND canonical_url != url;
```

Fix: point `rel="canonical"` to the preferred, crawlable URL.

Missing analytics tag (runs against `pages`):

```sql
SELECT url
FROM pages
WHERE html NOT LIKE '%G-%'      -- quick GA4 id presence check
   OR html NOT LIKE '%gtag(%';
```
Smallest pages by raw HTML size, a quick thin-content check (runs against `pages`):

```sql
SELECT url, LENGTH(html) AS bytes
FROM pages
ORDER BY LENGTH(html)
LIMIT 20;
```
Missing meta description (runs against `pages`):

```sql
SELECT url
FROM pages
WHERE html NOT LIKE '%<meta name="description"%';
```
Duplicate titles (runs against `page_summary`):

```sql
SELECT title, COUNT(*) AS cnt, GROUP_CONCAT(url, ' | ') AS urls
FROM page_summary
WHERE title IS NOT NULL AND title != ''
GROUP BY title
HAVING COUNT(*) > 1;
```
Empty or duplicate H1s (runs against `page_summary`):

```sql
-- Empty H1s
SELECT url
FROM page_summary
WHERE h1 IS NULL OR TRIM(h1) = '';

-- Duplicate H1s
SELECT h1, COUNT(*) AS cnt, GROUP_CONCAT(url, ' | ') AS urls
FROM page_summary
WHERE h1 IS NOT NULL AND TRIM(h1) != ''
GROUP BY h1
HAVING COUNT(*) > 1;
```
Broken links recorded during analysis (runs against `page_summary`):

```sql
SELECT url, broken_link
FROM page_summary
WHERE broken_link IS NOT NULL;
```
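The export step can be sketched with pandas. The `export_summary` helper below is hypothetical (the project's own export script may be structured differently); it dumps `page_summary` to Excel or CSV depending on the file extension:

```python
import sqlite3
import pandas as pd

def export_summary(db_path, out_path):
    """Dump the page_summary table to .xlsx or .csv, chosen by extension."""
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql_query("SELECT * FROM page_summary", conn)
    if out_path.endswith(".xlsx"):
        df.to_excel(out_path, index=False)  # requires openpyxl
    else:
        df.to_csv(out_path, index=False)
    return len(df)  # number of exported rows
```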
Paste any of the checks above into the `query` variable of your existing analysis script. Export `page_summary` to Excel/CSV for progress tracking.

## Author
🎥 Join me live on YouTube

Passionate about coding and teaching, I publish practical tutorials on PHP, Python, JavaScript, SQL, and web development. My goal is to make learning simple, engaging, and project-oriented with real examples and source code.