Description
Have you ever tried to generate a sitemap of your own website for marketing purposes, only to realise that the so-called “FREE” options produce a sub-standard sitemap, or map only a small portion of your site, leaving you with an incomplete sitemap for your website marketing effort?
The paid versions generally charge exorbitant monthly fees to get your site properly mapped regularly. Few small businesses can afford this!
You have put the energy into generating the web pages; now get the search engines to identify your pages and bring the customers to them.
This tool allows you to do exactly that. It is written in Python, so you will need to install Python on your system PLUS the other open-source programs on which this program depends.
This program was extensively tested over many months on a laptop running Linux Mint 22.3 (Zena). It has only 2 GB of RAM and an internet connection; for the rest it depends on the open-source software mentioned earlier.
This software is bought ONCE and remains yours (including the source code) to do with as you please. It requires NO SUBSCRIPTION from us.
A complete document is attached herewith to explain how to install, run and use this program, and detailed information on the software itself follows below.
THE PROGRAM CONSISTS OF:
GargoyleSiteMapper0.1.5.py
SOFTWARE PURPOSE:
Generate a sitemap.xml for your website with webpage and image end-points in order to enable Search Engine crawlers to find and map all your webpages.
SOFTWARE INTERFACE:
Graphical User Interface
SOFTWARE REQUIREMENTS (ALL OPEN-SOURCE):
python 3.12.3
beautifulsoup4==4.14.3
bs4==0.0.2
lxml==6.0.2
setuptools==80.10.2
soupsieve==2.8.3
typing_extensions==4.15.0
var_dump==1.2
– Memory (RAM): Minimum 2GB. The program is very light, but if you crawl a site with 10,000+ pages, the URL queue will sit in your RAM.
– Internet Connection: A stable connection is required. If your internet drops, the “Timeout” safety feature will trigger, and the crawler will mark those pages as an “Error.”
– The software needs Write Permissions in the folder where the script is located. It creates a new directory for every domain it crawls. It saves .xml and .txt files inside those directories. Ensure you are not running the script from a “Read-Only” location like a protected System32 folder or a locked USB drive.
SOFTWARE DESCRIPTION:
The Gargoyle Sitemap Generator v0.1.5 is a professional-grade web crawler designed to map the architecture of a website while prioritizing stealth, ethics, and diagnostic accuracy. It acts as an automated explorer that “reads” a website like a human would, but at the speed of multiple concurrent threads.
Here is a breakdown of how the program operates and the specific safety layers I have engineered into its architecture.
1. Functional Overview
The program starts at a “Seed URL” and parses the HTML to find every internal link. It follows these links recursively, building a tree-like map of the entire domain.
Engine: A multi-threaded orchestrator that manages a “Queue” (to-do list) and a “Checked” dictionary (history).
Output: It produces a standard sitemap.xml for SEO and a broken_links.txt for site maintenance.
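The XML output step can be sketched with Python's standard library. This is a minimal illustration only; the program's actual writer, element set and filenames may differ:

```python
# Minimal sketch: writing discovered URLs to a standard sitemap.xml
# (hypothetical helper; not the program's actual output code).
import xml.etree.ElementTree as ET

def write_sitemap(urls, path="sitemap.xml"):
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for u in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = u  # one <loc> entry per page
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

write_sitemap(["https://example.com/", "https://example.com/about"])
```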
2. Multi-Layered Safety Features
To ensure the program is a “Good Actor” on the web and doesn’t get your IP address banned or crash your server, it uses several safety protocols:
A. The “Good Citizen” Protocol (Robots.txt)
Before the first link is even crawled, the program fetches the site’s robots.txt file. It uses the RobotFileParser to ensure it never enters “Disallowed” directories. If a site owner has marked a folder as private, Gargoyle will skip it automatically.
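The check described above maps directly onto Python's standard `urllib.robotparser` module. A minimal sketch (the user-agent string and helper name are illustrative assumptions):

```python
# Sketch of the robots.txt gate using the standard library RobotFileParser.
from urllib.robotparser import RobotFileParser

def allowed(url, robots_txt, agent="GargoyleBot"):  # agent name is assumed
    rp = RobotFileParser()
    # In the real crawler the file is fetched from the site; here we parse
    # the rules directly from a string for illustration.
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

rules = "User-agent: *\nDisallow: /private/"
allowed("https://example.com/page", rules)       # permitted
allowed("https://example.com/private/x", rules)  # skipped automatically
```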
B. Human-Mimicry (Rate Limiting)
Standard bots hit a server thousands of times a second, which looks like a DDoS attack. Gargoyle uses Randomized Pauses (min_p and max_p). After every page visit, each thread sleeps for a random interval. This makes the traffic look like a natural human browsing pattern rather than a machine.
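The randomized pause boils down to a single `random.uniform()` call between requests. A sketch (function name assumed for illustration):

```python
import random
import time

def polite_sleep(min_p=0.5, max_p=3.0):
    # Sleep for a random interval to mimic human browsing cadence.
    delay = random.uniform(min_p, max_p)
    time.sleep(delay)
    return delay

d = polite_sleep(0.01, 0.02)  # short bounds used here just for the demo
```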
C. Domain Locking (The “Fence”)
To prevent the crawler from accidentally trying to “map the entire internet,” it uses strict Netloc Validation. If it finds a link to Facebook, Twitter, or an external blog, it recognizes that the “Network Location” doesn’t match your target domain and refuses to follow it.
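Netloc validation is a one-line comparison with `urllib.parse`. A sketch (the helper name and seed URL are assumptions):

```python
from urllib.parse import urlparse

def same_domain(link, seed="https://example.com"):
    # Compare the "Network Location" of the link against the target domain;
    # external links (Facebook, Twitter, etc.) fail this test and are skipped.
    return urlparse(link).netloc == urlparse(seed).netloc

same_domain("https://example.com/about")   # followed
same_domain("https://facebook.com/page")   # refused
```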
D. Keyword & Backend Filtering
The program includes a “Blacklist” of keywords (e.g., /admin, wp-login, ?). This prevents the crawler from getting stuck in “Spider Traps” (infinite loops caused by calendar filters) or attempting to access sensitive login portals.
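The keyword blacklist reduces to a simple substring check per URL. A sketch using the example keywords above (the actual list contents are assumptions):

```python
# Assumed blacklist contents, based on the examples given above.
EXCLUDE_KEYWORDS = ["/admin", "wp-login", "?"]

def is_blocked(url):
    # Skip any URL containing a blacklisted keyword (spider traps, logins).
    return any(k in url for k in EXCLUDE_KEYWORDS)

is_blocked("https://example.com/wp-login.php")  # skipped
is_blocked("https://example.com/contact")       # crawled
```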
E. Thread-Safe “Locking”
When multiple threads try to write to the same list at once, “Race Conditions” can occur, leading to data corruption. Gargoyle uses a mutual-exclusion lock via threading.Lock(). This ensures that only one thread can update the “Checked” list at a time, keeping your “Genesis Document” data 100% accurate.
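The locking pattern looks like this (an illustrative sketch; the program's actual data structures and function names may differ):

```python
import threading

# Shared "Checked" history guarded by a mutual-exclusion lock.
checked = {}
checked_lock = threading.Lock()

def mark_checked(url, status):
    # Only one thread may update the shared dictionary at a time,
    # preventing race conditions between workers.
    with checked_lock:
        checked[url] = status

threads = [threading.Thread(target=mark_checked,
                            args=(f"https://example.com/{i}", 200))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```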
F. The “Traffic Light” (Pause/Resume/Stop)
Unlike simpler scripts that you have to “kill” (potentially losing data), Gargoyle uses Condition Variables.
Pause: Gently tells threads to finish their current task and wait without consuming CPU.
Stop: Triggers an immediate “Safe Exit” that stops the engine and saves everything found up to that millisecond.
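The pause/resume mechanism above can be sketched with a `threading.Condition`. This is a minimal illustration of the pattern; the class and method names are assumptions, not the program's actual API:

```python
import threading

class TrafficLight:
    # Minimal sketch of pause/resume built on a Condition variable.
    def __init__(self):
        self._cond = threading.Condition()
        self._paused = False
        self.stopped = False

    def pause(self):
        with self._cond:
            self._paused = True

    def resume(self):
        with self._cond:
            self._paused = False
            self._cond.notify_all()  # wake every waiting worker

    def wait_if_paused(self):
        # Workers call this between pages; a paused worker blocks here
        # without consuming CPU until resume() or a stop is signalled.
        with self._cond:
            while self._paused and not self.stopped:
                self._cond.wait()
```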
3. Diagnostic Safety (Broken Link Detection)
The program treats errors (404, 403, 500) as valuable data rather than failures. By tracking the Referrer, it ensures that even if a page is “broken,” you know exactly which healthy page contains the bad link. This allows you to repair the site without manual searching.
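Referrer tracking amounts to storing, for each broken URL, the status code and the healthy page that linked to it. A sketch (structure and names assumed for illustration):

```python
# Hypothetical error ledger: broken URL -> (HTTP status, referring page).
broken = {}

def record_error(url, status, referrer):
    # The referrer tells you which healthy page contains the bad link.
    broken[url] = (status, referrer)

record_error("https://example.com/old-page", 404,
             "https://example.com/blog")

for url, (status, ref) in broken.items():
    print(f"{status}  {url}  found on: {ref}")
```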
TECHNICAL SPECIFICATIONS:
1. Core Architecture
Gargoyle is built on a Non-Blocking Multithreaded Orchestrator. It utilizes a “Producer-Consumer” model where the main engine manages a centralized queue of URLs, and worker threads consume those URLs to perform HTTP requests and HTML parsing.
Concurrency Model: threading.Thread workers with a threading.Condition synchronization primitive.
Parsing Engine: lxml for high-performance XPath-based link extraction.
Networking: urllib.request with custom User-Agent rotation headers.
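Rotating the User-Agent with `urllib.request` can be sketched as follows. The pool of agent strings is an assumption; the program's actual list may differ:

```python
import random
import urllib.request

# Assumed pool of User-Agent strings; the program's own list may differ.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
]

def build_request(url):
    # Attach a randomly chosen User-Agent header to each request.
    return urllib.request.Request(
        url, headers={"User-Agent": random.choice(USER_AGENTS)})

req = build_request("https://example.com/")
```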
2. Safety & Ethics Protocol (The “Politeness” Engine)
The program is engineered to adhere to the Web Robot Standards.
| Feature | Technical Implementation | Purpose |
|---|---|---|
| Robots.txt | urllib.robotparser | Prevents crawling of private or sensitive server directories. |
| Rate Limiting | random.uniform() | Prevents server strain by injecting human-like delays. |
| Domain Lock | urlparse().netloc | Ensures the bot never wanders onto third-party websites. |
| Keyword Filter | exclude_keywords list | Skips “Spider Traps” and administrative login portals. |
3. User Manual
A. Setting Up the Crawl
1. Target URL: Enter the full domain (e.g., https://example.com).
2. Threads: Recommended 4–6 for standard shared hosting; 8–10 for dedicated servers.
3. Timeout: Set to a minimum of 20 s. This prevents the program from hanging on slow, unresponsive pages.
4. Pauses: Set Min Pause to 0.5 and Max Pause to 3.0 to maintain a steady, non-aggressive flow.
B. Managing the Session
Pause/Resume: Use this if you notice your internet connection is lagging or if you need to temporarily free up system resources. Worker threads will complete their current page and “sleep” until resumed.
Stop: This is the “Graceful Exit.” It tells the engine to cease all operations and immediately compile the sitemap.xml and broken_links.txt using only the data collected so far.
C. Understanding the Output
Every crawl creates a folder named after the domain (e.g., mysite_co_za). Inside, you will find:
Sitemap (XML): Upload this to your root directory and submit it to Google Search Console.
Repair Report (TXT): Open this to see a list of broken links. It identifies the Source Page, making it easy for you to log into your CMS and fix the typo.
4. Safety Warnings
Note: While Gargoyle is designed to be safe, running too many threads (20+) with zero pauses may cause some firewalls (like Cloudflare) to temporarily block your IP address. Always start with the default settings.
INSTALLATION:
We highly recommend creating a virtual environment using: python -m venv <env_name>. The environment can then be activated from within the venv folder you just created using source ./bin/activate
Some variants of Linux will require you to install venv first using sudo apt install python3-venv
– Graphical User Interface (Tkinter) allows all settings and URL to be changed inside the program
– Set script to executable: `sudo chmod +x GargoyleSiteMapper0.1.5.py`.
– Running the script: `python3 GargoyleSiteMapper0.1.5.py` or `python GargoyleSiteMapper0.1.5.py` depending on how your system is configured.

