Python Webpage Auditor – Fetch, Store, Analyze & Export

Web / URLs HTML Sources fetch_store.py Fetch HTML & Store in DB page_summary.py Extract Tags & Data page_analysis_query.py Run Custom SQL Analysis page_summary_excel.py Export to Excel settings.py Shared Config

This project is a lightweight, fully functional Python + BeautifulSoup webpage auditor that lets you fetch HTML from any set of URLs, store it in an SQLite database, analyze SEO-related tags, and export the results to Excel for reporting.

All scripts are independent, sharing a common configuration file settings.py, so you can change database paths or root folder settings in one place. Users can download the complete ZIP containing the scripts and start running them immediately.

1. File Overview

  • settings.py – Common configuration file storing DB_PATH, ROOT_PATH, User-Agent string, and timeout values.
  • fetch_store.py – Fetches the HTML of given URLs and stores them in the pages table. Each run can append new URLs without affecting others.
  • page_summary.py – Reads the raw HTML from pages table, extracts key SEO tags (title, description, keywords, canonical URL, schema, image alt coverage, etc.), and saves them in page_summary table.
  • page_analysis_query.py – Lets you run any SQL query to filter, analyze, or clean data in your database. You can quickly modify the query variable for custom reports or deletions.
  • page_summary_excel.py – Reads all records from page_summary and exports them to Excel for easy sharing and further analysis.

2. How It Works

  1. Run fetch_store.py with your list of URLs. This stores the complete HTML in pages table for archival purposes.
  2. Run page_summary.py to extract SEO-critical elements from the HTML and store them in structured columns in page_summary.
  3. Use page_analysis_query.py to apply filters such as:
    SELECT url FROM page_summary WHERE LENGTH(title) < 50
  4. Run page_summary_excel.py to generate an Excel report from the page_summary table, which can be directly shared with your team or clients.

3. Example Use Cases

  • Check if title or meta description is missing or too short.
  • Find pages without GA4 tracking IDs.
  • Identify low image alt-text coverage for accessibility improvements.
  • Locate pages missing canonical tags or og:* social meta tags.
  • Audit schema presence (FAQPage, HowTo, VideoObject).

4. Why This Project is Useful

Unlike online SEO checkers, this tool lets you audit hundreds of pages offline, retain historical data in a database, and customize your own queries for deeper analysis. It's ideal for webmasters, SEO analysts, and developers who want full control over their audit process.

  • Easy switch over to MySQL – Change the connection code and table creation statements to migrate from SQLite to MySQL if your data grows larger.
  • 🛠 Expandable checkspage_summary.py can be easily extended to capture additional HTML tags, schema types, or even custom string patterns relevant to your project.
  • 📄 Flexible URL input – Start with a single text file containing one URL, then expand to a larger list as your audit scope grows.

5. How to Run

  1. Download the ZIP package containing all scripts.
  2. Update DB_PATH and ROOT_PATH in settings.py to match your local directory.
  3. Install dependencies:
    pip install requests beautifulsoup4 lxml pandas openpyxl
  4. Run each script from the terminal as per your audit workflow.

6. Download

📥 Download ZIP of Python Webpage Auditor


Page Analysis (SQLite): Practical SEO Checks You Can Run From Your Database

  • Store your crawl in pages (raw HTML) and summaries in page_summary.
  • Keep your Python script the same—only swap the query line to answer a specific SEO question.
  • Workflow: fetch → summarize → analyze → export. Fix issues, recrawl, re-run.

1) Missing or Too-Short <title> (active example)

  • Why: Titles <~20 chars or missing underperform on CTR and clarity.
  • Query (on pages):
SELECT url
FROM pages
WHERE html NOT LIKE '%<title%'
   OR LENGTH(substr(html, instr(html, '<title>')+7,
                    instr(html, '</title>')-instr(html, '<title>')-7)) < 20;
  • Fix: Write unique, descriptive titles (~50–60 chars) aligned to intent/keyword.

2) Canonical Mismatch vs. Actual URL

  • Why: Wrong canonicals split signals and cause duplicate confusion.
  • Query (on page_summary):
SELECT url, canonical_url
FROM page_summary
WHERE canonical_url IS NOT NULL
  AND canonical_url != url;
  • Fix: Point rel="canonical" to the preferred, crawlable URL.

3) GA4 / Analytics Not Present

  • Why: Missing analytics = lost engagement and conversion data.
  • Query (on pages):
SELECT url
FROM pages
WHERE html NOT LIKE '%G-%'      -- quick GA4 id presence check
   OR html NOT LIKE '%gtag(';
  • Fix: Add GA4 snippet in the base template; validate on key templates.

4) Extremely Short Pages (Possible Thin Content)

  • Why: Very small HTML often indicates placeholders or thin pages.
  • Query (on pages):
SELECT url, LENGTH(html) AS bytes
FROM pages
ORDER BY LENGTH(html)
LIMIT 20;
  • Fix: Enrich or consolidate; align with search intent and internal links.

5) Missing Meta Description

  • Why: Not a ranking factor, but boosts CTR and snippet quality.
  • Query (on pages):
SELECT url
FROM pages
WHERE html NOT LIKE '%<meta name="description"%';
  • Fix: Add unique, compelling 120–160 char descriptions summarizing value.

6) Duplicate Titles Across Pages

  • Why: Duplicate signals dilute relevance and reduce discoverability.
  • Query (on page_summary):
SELECT title, COUNT(*) AS cnt, GROUP_CONCAT(url, ' | ') AS urls
FROM page_summary
WHERE title IS NOT NULL AND title != ''
GROUP BY title
HAVING COUNT(*) > 1;
  • Fix: Differentiate titles to reflect each page’s unique topic/intent.

7) Empty or Duplicate H1s

  • Why: H1 conveys primary on-page focus; keep one clear H1 per page.
  • Queries (on page_summary):
-- Empty H1s
SELECT url
FROM page_summary
WHERE h1 IS NULL OR TRIM(h1) = '';

-- Duplicate H1s
SELECT h1, COUNT(*) AS cnt, GROUP_CONCAT(url, ' | ') AS urls
FROM page_summary
WHERE h1 IS NOT NULL AND TRIM(h1) != ''
GROUP BY h1
HAVING COUNT(*) > 1;
  • Fix: Ensure one descriptive H1 aligned with title and query intent.

8) Broken Internal Links (If Captured)

  • Why: Broken links waste crawl budget and harm UX.
  • Example shape (on page_summary):
SELECT url, broken_link
FROM page_summary
WHERE broken_link IS NOT NULL;
  • Fix: Update or remove; add redirects where appropriate.

How to Use This Section

  • Swap the query in your existing analysis script for any check above.
  • Inspect the output URLs and prioritize high-impact templates and money pages.
  • Fix issues, recrawl, and re-run the same query to ensure counts drop to zero.
  • Export from page_summary to Excel/CSV for progress tracking.



Sqlite Beatutifulsoup Simple Webpage Auditor
Subhendu Mohapatra — author at plus2net
Subhendu Mohapatra

Author

🎥 Join me live on YouTube

Passionate about coding and teaching, I publish practical tutorials on PHP, Python, JavaScript, SQL, and web development. My goal is to make learning simple, engaging, and project‑oriented with real examples and source code.



Subscribe to our YouTube Channel here



plus2net.com







Python Video Tutorials
Python SQLite Video Tutorials
Python MySQL Video Tutorials
Python Tkinter Video Tutorials
We use cookies to improve your browsing experience. . Learn more
HTML MySQL PHP JavaScript ASP Photoshop Articles Contact us
©2000-2025   plus2net.com   All rights reserved worldwide Privacy Policy Disclaimer