PHP Mini Webpage Auditor (DOMDocument + cURL) — Step by Step

This post walks you through a single-file PHP mini auditor that fetches a webpage and extracts key SEO signals: title, meta description, keywords, canonical, H1, Open Graph, Twitter/X tags, GA4 IDs, and internal/external link counts. It uses only cURL and DOMDocument + XPath, and it works in both browser and CLI modes.

Need a refresher on cURL? See: cURL basics and cURL support & installation.

What this script does

  • Fetches HTML with cURL (custom UA, timeouts, TLS verification).
  • Parses the DOM with DOMDocument and DOMXPath.
  • Extracts title, meta tags (description, keywords, viewport), canonical, all H1s.
  • Detects social tags: Open Graph (og:title/description/image) and Twitter/X tags.
  • Finds GA4 IDs (regex on raw HTML: G-XXXXXXXX).
  • Counts links: internal vs external based on the page host.
  • Dual output: text report in CLI, styled HTML in browser.
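
As a quick illustration of the DOMDocument + DOMXPath part of this list, here is a minimal sketch that runs on an inline HTML string instead of a fetched page. The sample markup and the example.com URL are made up for the demonstration.

```php
<?php
// Hypothetical sample markup; a real run would use the HTML fetched by cURL.
$sampleHtml = '<html><head><title> Demo Page </title>'
    . '<meta name="description" content="A short summary.">'
    . '<link rel="canonical" href="https://example.com/demo">'
    . '</head><body><h1>First</h1><h1>Second</h1></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($sampleHtml);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Text of the first <title>, trimmed
$titleNode = $xpath->query('//title')->item(0);
$title = $titleNode ? trim($titleNode->textContent) : null;

// Attribute values are read with getAttribute()
$descNode = $xpath->query("//meta[@name='description']")->item(0);
$description = $descNode ? $descNode->getAttribute('content') : null;

$canonNode = $xpath->query("//link[@rel='canonical']")->item(0);
$canonical = $canonNode ? $canonNode->getAttribute('href') : null;

// Collect every <h1>
$h1s = [];
foreach ($xpath->query('//h1') as $h) {
    $h1s[] = trim($h->textContent);
}

echo "Title: $title\n";
echo "Description: $description\n";
echo "Canonical: $canonical\n";
echo "H1s: " . implode(' | ', $h1s) . "\n";
```

The same pattern (query, take the first node, read text or an attribute) covers most of the head tags the auditor reports.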

How it works — Step by step

  1. Target URL: a default URL is set, but the script also reads ?url=... in the browser or $argv[1] in CLI. We trim the value and prepend https:// if it does not already start with http:// or https://.
  2. Fetch HTML with cURL: we set timeouts, a User-Agent string, TLS verification, and follow redirects. If cURL isn’t enabled, see the cURL guides linked above.
  3. Load DOM: DOMDocument loads the HTML; DOMXPath lets us query nodes, much like find() in Python’s BeautifulSoup (bs4).
  4. Extract fields:
    • //title, //meta[@name='description'] (fallback to og:description)
    • //meta[@name='keywords'], //meta[@name='viewport'], //link[@rel='canonical']
    • All //h1 nodes
    • Open Graph via @property='og:*', Twitter/X via @name='twitter:*' or 'x:*'
  5. GA4 IDs: we run a simple regex over the raw HTML: /G-[A-Z0-9]{6,12}/.
  6. Link classification: parse each <a href> and compare its host against the target URL’s host. Links with no host (relative URLs) or with the same host count as internal; all others count as external.
  7. Dual-mode output: if (PHP_SAPI === 'cli') prints a compact text summary; otherwise we render an HTML report.
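
The link classification in step 6 can be sketched as a small standalone function. classify_link() is a hypothetical helper written for this illustration, not a function from the script, and skipping mailto: and tel: links is a choice made here.

```php
<?php
/**
 * Classify one href against the audited page's host.
 * Returns 'internal', 'external', or 'skip'.
 * Hypothetical helper for illustration only.
 */
function classify_link(string $href, string $baseHost): string {
    $href = trim($href);
    // Empty hrefs, fragments, and non-web schemes are not counted
    if ($href === '' || $href[0] === '#' || preg_match('#^(javascript|mailto|tel):#i', $href)) {
        return 'skip';
    }
    $host = parse_url($href, PHP_URL_HOST);
    // No host means a relative URL, which stays on the same site
    if (!$host || strtolower($host) === strtolower($baseHost)) {
        return 'internal';
    }
    return 'external';
}

$samples = [
    '/python/set.php',
    'https://www.plus2net.com/php_tutorial/',
    'https://example.org/',
    '#top',
    'mailto:someone@example.com',
];
$counts = ['internal' => 0, 'external' => 0, 'skip' => 0];
foreach ($samples as $href) {
    $counts[classify_link($href, 'www.plus2net.com')]++;
}
print_r($counts); // internal => 2, external => 1, skip => 2
```

Note that an exact host comparison treats www.plus2net.com and plus2net.com as different sites; whether that is desirable depends on how you define "internal".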

Security considerations

  • Lock down allowed hosts (optional): allow only your domain(s) via an allowlist. This stops your server from being used to fetch arbitrary URLs, including internal addresses (server-side request forgery).
  • Disable URL parameter for public demo (see “Locked Demo” below). Use a hardcoded URL to prevent misuse.
  • Timeout & TLS: keep sensible timeouts and enable TLS verification (already on in this script).
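
One possible shape for the host allowlist is shown below. is_allowed_url() and $allowedHosts are hypothetical names introduced for this sketch; the idea is to call the check before passing the URL to cURL.

```php
<?php
/**
 * Return true only for http(s) URLs whose host is on the allowlist.
 * $allowedHosts is a hypothetical config value; list your own domains.
 */
function is_allowed_url(string $url, array $allowedHosts): bool {
    $parts = parse_url($url);
    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        return false;
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }
    return in_array(strtolower($parts['host']), $allowedHosts, true);
}

$allowedHosts = ['www.plus2net.com', 'plus2net.com'];
var_dump(is_allowed_url('https://www.plus2net.com/python/set.php', $allowedHosts)); // bool(true)
var_dump(is_allowed_url('http://127.0.0.1/admin', $allowedHosts));                  // bool(false)
```

Because the script follows redirects, an allowed host could still redirect to another site. Checking the final URL, or turning off CURLOPT_FOLLOWLOCATION, is one way to tighten this further.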

CLI & Browser usage

CLI:

php mini_auditor.php "https://www.plus2net.com/python/set.php"

Browser:

mini_auditor.php?url=https://www.plus2net.com/python/set.php

Locked Demo (no URL parameter)

For a public demo page, remove the ?url= option and hardcode one URL. This prevents other sites from embedding your tool to fetch arbitrary pages.

<?php
$DEFAULT_URL = "https://www.plus2net.com/python/set.php"; // fixed
$targetUrl = $DEFAULT_URL; // ignore $_GET['url'] and CLI args
// ... (reuse the same cURL + DOM parsing code, just skip reading ?url)
?>

Full Code (copy & use)

Save as mini_auditor.php. Ensure cURL is enabled in PHP.

<?php
/**
 * mini_auditor.php
 * Minimal PHP webpage auditor (DOMDocument + XPath, no external libs).
 * - Browser: mini_auditor.php?url=https://www.plus2net.com/python/set.php
 * - CLI:     php mini_auditor.php "https://www.plus2net.com/python/set.php"
 */

// ===== 1) Config =====
$DEFAULT_URL = "https://www.plus2net.com/python/set.php";
$UA = "Mozilla/5.0 (compatible; Plus2net-PHP-Auditor/1.0; +https://www.plus2net.com/)";
$TIMEOUT = 20;

// ===== 2) Resolve target URL =====
$targetUrl = $DEFAULT_URL;
if (PHP_SAPI === 'cli') {
    if (!empty($argv[1])) $targetUrl = $argv[1];
} else {
    if (!empty($_GET['url'])) $targetUrl = $_GET['url'];
}
$targetUrl = trim($targetUrl);
if (!preg_match('#^https?://#i', $targetUrl)) {
    $targetUrl = "https://" . ltrim($targetUrl, "/");
}

// ===== 3) Fetch HTML via cURL =====
function fetch_html($url, $ua, $timeout) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => $ua,
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_SSL_VERIFYPEER => true,
        CURLOPT_SSL_VERIFYHOST => 2,
        CURLOPT_HEADER         => false,
    ]);
    $html = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    $err   = curl_error($ch);
    curl_close($ch);
    return [$status, $html, $err];
}

list($status, $html, $err) = fetch_html($targetUrl, $UA, $TIMEOUT);

// ===== 4) Parse with DOMDocument + XPath =====
libxml_use_internal_errors(true);
$dom = new DOMDocument();
// Skip parsing on failure or an empty body: loadHTML() throws a ValueError for "" in PHP 8+
if (is_string($html) && $html !== '') {
    @$dom->loadHTML($html);
}
$xpath = new DOMXPath($dom);

function firstAttr(DOMNodeList $nodes, $attr) {
    if ($nodes->length > 0) {
        $n = $nodes->item(0);
        return trim($n->getAttribute($attr) ?? "");
    }
    return null;
}
function firstNodeText(DOMNodeList $nodes) {
    if ($nodes->length > 0) {
        return trim($nodes->item(0)->textContent ?? "");
    }
    return null;
}
function q($xpath, $expr) { return $xpath->query($expr); }

$parts = parse_url($targetUrl);
$baseHost = isset($parts['host']) ? strtolower($parts['host']) : "";

// Extract fields
$title = firstNodeText(q($xpath, "//title"));

$metaDescription = firstAttr(q($xpath, "//meta[translate(@name,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='description']"), "content");
if (!$metaDescription) {
    $metaDescription = firstAttr(q($xpath, "//meta[translate(@property,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='og:description']"), "content");
}
$metaKeywords   = firstAttr(q($xpath, "//meta[translate(@name,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='keywords']"), "content");
$metaViewport   = firstAttr(q($xpath, "//meta[translate(@name,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='viewport']"), "content");
$canonical      = firstAttr(q($xpath, "//link[translate(@rel,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='canonical']"), "href");

// H1s
$h1nodes = q($xpath, "//h1");
$h1s = [];
foreach ($h1nodes as $h) { $h1s[] = trim($h->textContent); }

// OG basics
$ogTitle       = firstAttr(q($xpath, "//meta[translate(@property,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='og:title']"), "content");
$ogDescription = firstAttr(q($xpath, "//meta[translate(@property,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='og:description']"), "content");
$ogImage       = firstAttr(q($xpath, "//meta[translate(@property,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='og:image']"), "content");
$ogPresent     = ($ogTitle || $ogDescription || $ogImage) ? 1 : 0;

// Twitter/X presence
$twMeta = q($xpath, "//meta[starts-with(translate(@name,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'twitter:') or starts-with(translate(@name,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'x:')]");
$twitterPresent = ($twMeta->length > 0) ? 1 : 0;

// GA4 IDs (regex on raw HTML)
$ga4Ids = [];
if (!empty($html)) {
    if (preg_match_all('/G-[A-Z0-9]{6,12}/', $html, $m)) {
        $ga4Ids = array_values(array_unique($m[0]));
    }
}

// Links (internal/external)
$internal = 0; $external = 0;
$anodes = q($xpath, "//a[@href]");
foreach ($anodes as $a) {
    $href = trim($a->getAttribute("href"));
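    // Skip empty hrefs, fragments, and non-web schemes (otherwise mailto:/tel: would count as internal)
    if ($href === "" || $href[0] === "#" || preg_match('#^(javascript|mailto|tel|data):#i', $href)) continue;
    $ph = parse_url($href, PHP_URL_HOST);
    if (!$ph || strtolower($ph) === $baseHost) $internal++; else $external++;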
}

// ===== 5) Output (HTML if browser, text if CLI) =====
$isCli = (PHP_SAPI === 'cli');

if ($isCli) {
    echo "=== PHP Mini Auditor ===\n";
    echo "URL:            $targetUrl\n";
    echo "HTTP Status:    $status\n";
    if ($err) echo "cURL error:     $err\n"; // surface the fetch error instead of discarding it
    echo "Title:          " . ($title ?: "—") . "\n";
    echo "Description:    " . ($metaDescription ?: "—") . "\n";
    echo "Keywords:       " . ($metaKeywords ?: "—") . "\n";
    echo "Viewport:       " . ($metaViewport ?: "—") . "\n";
    echo "Canonical:      " . ($canonical ?: "—") . "\n";
    echo "H1s:            " . (count($h1s) ? implode(" | ", $h1s) : "—") . "\n";
    echo "OG present:     " . ($ogPresent ? "Yes" : "No") . "\n";
    echo "Twitter present: " . ($twitterPresent ? "Yes" : "No") . "\n";
    echo "GA4 IDs:        " . (count($ga4Ids) ? implode(", ", $ga4Ids) : "—") . "\n";
    echo "Links:          internal=$internal, external=$external\n";
    exit;
}
?>
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>PHP Mini Auditor</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<style>
  body{font-family:system-ui,-apple-system,Segoe UI,Roboto,Ubuntu,"Helvetica Neue",Arial,sans-serif;margin:20px;line-height:1.6}
  .wrap{max-width:980px;margin:auto}
  .card{border:1px solid #e1e5ea;border-radius:8px;padding:16px;margin-bottom:16px}
  .muted{color:#6c757d}
  table{width:100%;border-collapse:collapse}
  th,td{border:1px solid #e1e5ea;padding:8px;vertical-align:top}
  th{background:#f8f9fa;text-align:left}
  .ok{color:#1b7f2a}
  .warn{color:#9a6700}
  .bad{color:#b42318}
  input[type=text]{width:100%;padding:8px;border:1px solid #ced4da;border-radius:6px}
  button{padding:8px 12px;border-radius:6px;border:1px solid #0d6efd;background:#0d6efd;color:#fff;cursor:pointer}
  .row{display:flex;gap:16px;flex-wrap:wrap}
  .col{flex:1 1 320px}
  code{background:#f6f8fa;padding:2px 4px;border-radius:4px}
</style>
</head>
<body>
<div class="wrap">
  <h1>PHP Mini Auditor</h1>
  <p class="muted">Quickly inspect key SEO tags of any webpage (no external PHP libraries). Enter a URL or use the default.</p>

  <form method="get" class="card">
    <label for="url"><strong>URL</strong></label>
    <input type="text" id="url" name="url" value="<?php echo htmlspecialchars($targetUrl); ?>">
    <div style="margin-top:8px"><button type="submit">Audit</button></div>
  </form>

  <div class="card">
    <h3>Summary</h3>
    <div class="row">
      <div class="col"><strong>URL:</strong> <?php echo htmlspecialchars($targetUrl); ?></div>
      <div class="col"><strong>HTTP Status:</strong> <?php echo (int)$status; ?></div>
      <div class="col"><strong>Links:</strong> internal=<?php echo (int)$internal; ?>, external=<?php echo (int)$external; ?></div>
    </div>
  </div>

  <div class="card">
    <h3>Meta & Head Tags</h3>
    <table>
      <tr><th>Title</th><td><?php echo htmlspecialchars($title ?: "—"); ?></td></tr>
      <tr><th>Meta Description</th><td><?php echo htmlspecialchars($metaDescription ?: "—"); ?></td></tr>
      <tr><th>Meta Keywords</th><td><?php echo htmlspecialchars($metaKeywords ?: "—"); ?></td></tr>
      <tr><th>Viewport</th><td><?php echo htmlspecialchars($metaViewport ?: "—"); ?></td></tr>
      <tr><th>Canonical</th><td><?php echo htmlspecialchars($canonical ?: "—"); ?></td></tr>
    </table>
  </div>

  <div class="card">
    <h3>Headings</h3>
    <?php if (count($h1s)): ?>
      <ul>
        <?php foreach ($h1s as $h): ?>
          <li><?php echo htmlspecialchars($h); ?></li>
        <?php endforeach; ?>
      </ul>
    <?php else: ?>
      <p class="muted">No &lt;h1&gt; found.</p>
    <?php endif; ?>
  </div>

  <div class="card">
    <h3>Social & Analytics</h3>
    <table>
      <tr><th>Open Graph present</th><td><?php echo $ogPresent ? "<span class='ok'>Yes</span>" : "<span class='bad'>No</span>"; ?></td></tr>
      <tr><th>Twitter/X tags present</th><td><?php echo $twitterPresent ? "<span class='ok'>Yes</span>" : "<span class='warn'>No</span>"; ?></td></tr>
      <tr><th>GA4 IDs</th><td><?php echo count($ga4Ids) ? htmlspecialchars(implode(", ", $ga4Ids)) : "<span class='warn'>None found</span>"; ?></td></tr>
    </table>
  </div>

  <p class="muted">Tip: From CLI, run <code>php mini_auditor.php "https://example.com"</code> to print a compact text report.</p>
</div>
</body>
</html>

FAQ: PHP Webpage Auditor Script

1. Can this PHP auditor script run without a database?

Yes. The script uses only cURL and DOMDocument and prints its results directly in the browser or CLI. No database is required.

2. How do I prevent users from misusing the demo by injecting URLs?

You can create a fixed demo version where the URL is hard-coded inside the script instead of being user-supplied.

3. Does the script support both browser and CLI execution?

Yes. The script checks PHP_SAPI to detect whether it is run from CLI or a web server, and formats the output accordingly.
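
A minimal sketch of that idea, assuming the report is held as a label => value array. format_report() and the sample rows are illustrative and not taken from the script.

```php
<?php
/**
 * Format a small report for the terminal or the browser.
 * format_report() and the $rows shape are illustrative only.
 */
function format_report(array $rows, bool $isCli): string {
    $out = '';
    foreach ($rows as $label => $value) {
        if ($isCli) {
            // Pad labels so the values line up in a terminal
            $out .= str_pad($label . ':', 16) . $value . "\n";
        } else {
            // Escape values before placing them in HTML
            $out .= '<tr><th>' . htmlspecialchars($label) . '</th><td>'
                  . htmlspecialchars($value) . "</td></tr>\n";
        }
    }
    return $isCli ? $out : "<table>\n$out</table>\n";
}

// Sample values, made up for the demonstration
$rows = ['Title' => 'Tips & Tricks', 'HTTP Status' => '200'];
echo format_report($rows, PHP_SAPI === 'cli');
```

Keeping the formatting in one function means the extraction code does not need to know where the output ends up.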

4. Can I extend this script to extract more metadata?

Absolutely. You can add more XPath queries (via DOMDocument and DOMXPath) or preg_match patterns to pick up tags the script does not yet read, such as robots meta, hreflang links, or schema markup.

5. Is cURL mandatory for this auditor to work?

Yes. Since the script relies on cURL to fetch external pages, cURL must be enabled in your PHP installation. You can confirm by checking phpinfo().
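
If you prefer a check from code rather than reading phpinfo(), something like the following could run at the top of the script. missing_extensions() is a hypothetical helper; extension_loaded() is the built-in it relies on.

```php
<?php
/**
 * Return the names of required extensions that are not loaded.
 * missing_extensions() is a hypothetical helper for this post.
 */
function missing_extensions(array $required): array {
    return array_values(array_filter($required, function ($ext) {
        return !extension_loaded($ext);
    }));
}

// The auditor needs cURL for fetching and DOM for parsing
$missing = missing_extensions(['curl', 'dom']);
if ($missing) {
    echo 'Missing PHP extensions: ' . implode(', ', $missing) . "\n";
} else {
    echo "cURL and DOM are available.\n";
}
```

From the command line, php -m lists the loaded modules as well.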

6. Troubleshooting: Common Issues with the Script

  • cURL not enabled: If you see “Call to undefined function curl_init()”, enable cURL in php.ini by uncommenting extension=curl.
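  • SSL errors: Adding curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); can help confirm the cause while testing, but never ship it. In production, keep verification on and point curl.cainfo in php.ini (or CURLOPT_CAINFO) at an up-to-date CA bundle.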
  • Timeout issues: Increase $TIMEOUT (it is passed to CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT), or check the server firewall if outbound requests are blocked.
  • Empty response: Ensure the target URL is correct and accessible. Some servers block bots, so try a different User-Agent string in $UA.

Next steps: You can add more fields (e.g., robots meta, hreflang, noindex flags), or save results into SQLite/MySQL and export to Excel, just like our Python auditor series.
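
As a starting point for the robots meta, noindex, and hreflang ideas, here is a hedged sketch that follows the same DOMXPath pattern as the main script. The sample markup and example.com URLs are made up, and the variable names are not from the script.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$sampleHtml = '<html><head>'
    . '<meta name="robots" content="noindex, follow">'
    . '<link rel="alternate" hreflang="en" href="https://example.com/en/">'
    . '<link rel="alternate" hreflang="fr" href="https://example.com/fr/">'
    . '</head><body></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($sampleHtml);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Robots meta: split the directive list and look for "noindex"
$robotsNode = $xpath->query("//meta[@name='robots']")->item(0);
$robots = $robotsNode ? strtolower($robotsNode->getAttribute('content')) : '';
$directives = array_filter(array_map('trim', explode(',', $robots)), 'strlen');
$noindex = in_array('noindex', $directives, true);

// hreflang: map language code => URL
$hreflang = [];
foreach ($xpath->query("//link[@rel='alternate'][@hreflang]") as $link) {
    $hreflang[$link->getAttribute('hreflang')] = $link->getAttribute('href');
}

echo 'noindex: ' . ($noindex ? 'yes' : 'no') . "\n";
foreach ($hreflang as $lang => $url) {
    echo "hreflang $lang => $url\n";
}
```

In the full script you would run these queries against the existing $xpath object rather than building a new document.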


Subhendu Mohapatra — author at plus2net