How to Check URLs in Bulk: Guide for Developers
If you maintain documentation, knowledge bases, or content databases with thousands of URLs, manual link checking is not an option. A broken link in your API docs costs developer trust. A dead URL in your knowledge base creates a support ticket. At 10,000+ links, you need automation.
This guide covers three approaches: a DIY Python script, cloud-based bulk checking, and CI/CD integration — so you can pick what fits your scale.
Why Bulk URL Checking Matters
Broken links in developer-facing content cause real damage:
- API documentation with hundreds of endpoint references that go stale after versioning
- Internal knowledge bases (Confluence, Notion, GitBook) with cross-linked articles that break during restructuring
- Content databases with aggregated links from multiple sources
- Third-party integrations referencing external URLs that disappear
When you have 30,000+ URLs across these systems, even a 2% monthly breakage rate means 600 broken links per month.
Approach 1: DIY Python Script
For developers who want full control, here is a concurrent URL checker with retry logic:
```python
import csv
import time
import requests
from concurrent.futures import ThreadPoolExecutor
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    session = requests.Session()
    retries = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

def check_url(url, session):
    try:
        # HEAD is cheaper than GET; some servers reject it, so fall back on 405
        response = session.head(url, timeout=10, allow_redirects=True)
        if response.status_code == 405:
            response = session.get(url, timeout=10, allow_redirects=True)
        return {
            'url': url,
            'status_code': response.status_code,
            'final_url': response.url,
            'response_time': response.elapsed.total_seconds(),
            'error': '',
        }
    except requests.exceptions.RequestException as e:
        return {
            'url': url,
            'status_code': 'ERROR',
            'final_url': url,
            'response_time': '',
            'error': str(e),
        }

def check_urls_bulk(urls, max_workers=10):
    session = create_session()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        for url in urls:
            futures.append(executor.submit(check_url, url, session))
            time.sleep(0.1)  # throttle submissions for basic rate limiting
        return [future.result() for future in futures]

# Usage
with open('urls.txt') as f:
    urls = f.read().splitlines()
results = check_urls_bulk(urls)

# Save results
with open('results.csv', 'w', newline='') as f:
    writer = csv.DictWriter(
        f, fieldnames=['url', 'status_code', 'final_url', 'response_time', 'error']
    )
    writer.writeheader()
    writer.writerows(results)
```
Where this breaks down
- Hits rate limits (429 errors) at 5,000+ URLs with no proxy rotation
- Runs on your machine — ties up local resources for hours on large batches
- No soft 404 detection (pages that return 200 but show error content)
- No redirect chain tracking
- No persistent history or trend tracking
This approach works for one-off checks under 5,000 URLs. Beyond that, you need infrastructure.
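Soft 404 detection can be bolted onto the script above with a simple heuristic. This is a sketch, not a comprehensive detector: the phrase list below is an assumption about common error-page wording and should be tuned for the sites you check. Feed it `response.status_code` and `response.text` from a GET request (HEAD responses have no body to inspect):

```python
import re

# Assumed list of common error-page phrases; adjust for your own content
SOFT_404_PATTERNS = re.compile(
    r"page not found|doesn't exist|no longer available|nothing was found",
    re.IGNORECASE,
)

def is_soft_404(status_code, body):
    """Flag pages that return 200 but whose content looks like an error page."""
    if status_code != 200:
        return False  # hard errors are already caught by the status code
    # Error pages are usually short; the first few KB is enough to check
    return bool(SOFT_404_PATTERNS.search(body[:8192]))
```

Real soft 404 detectors compare page content against a known error template or check for thin content; a phrase list is the cheapest approximation.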
Approach 2: Cloud-Based Bulk URL Checker
For production workloads with 10,000-75,000 URLs, a cloud-based checker solves the scaling problems:
- Upload a CSV and walk away. Cloud infrastructure processes your batch. You get an email when the report is ready.
- Automatic proxy rotation. Handles 429/403 responses so your entire batch completes.
- Soft 404 detection. Identifies pages that return 200 but actually show error content.
- CSV/JSON export. Get results in developer-friendly formats for programmatic analysis.
- Dashboard with filtering. Search by status code, filter broken links, view redirect chains.
Check Up to 75,000 URLs — Free to Start
Upload your CSV, get your report by email. 300 free URL checks, no credit card required.
Check URLs Free →
Real Example: Checking a Documentation Site
Here is a practical workflow for checking links in a documentation site with 5,000+ pages.
Step 1: Extract URLs
For static site generators (Next.js, Hugo, Gatsby, MkDocs):
```bash
# Extract all external links from built HTML
grep -r -o 'https://[^"]*' ./build > urls.txt

# Deduplicate
sort -u urls.txt > unique_urls.txt
echo "Found $(wc -l < unique_urls.txt) unique URLs"
```
Step 2: Convert to CSV
```bash
echo "url" > urls.csv
cat unique_urls.txt >> urls.csv
```
Step 3: Upload and check
Upload the CSV to a bulk URL checker. For 5,000 URLs, processing typically takes 1-3 hours depending on target server response times.
Step 4: Filter broken links
```python
import pandas as pd

df = pd.read_csv('results.csv')

# status_code may contain 'ERROR' strings; coerce to numeric (ERROR -> NaN)
df['status_code'] = pd.to_numeric(df['status_code'], errors='coerce')

# Find broken links (4xx and 5xx)
broken = df[df['status_code'] >= 400]
print(f"Found {len(broken)} broken links")

# Group by status code
print(broken.groupby('status_code')['url'].count())

# Export for fixing
broken.to_csv('broken_links.csv', index=False)
```
CI/CD Integration
For continuous validation, run URL checks as part of your deployment pipeline:
```yaml
# .github/workflows/check-links.yml
name: Check Documentation Links

on:
  schedule:
    - cron: '0 0 * * 1' # Every Monday
  workflow_dispatch:

jobs:
  check-links:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Extract URLs from docs
        run: |
          grep -r -o 'https://[^"]*' ./docs/build > urls.txt
          sort -u urls.txt > unique_urls.txt
          echo "url" > urls.csv
          cat unique_urls.txt >> urls.csv
          echo "Extracted $(wc -l < unique_urls.txt) URLs"

      - name: Upload to Bulk URL Checker
        run: |
          # Upload CSV to your bulk checker
          # Process results and flag broken links
          echo "Upload urls.csv to app.bulkurlchecker.com"
          echo "Review results in dashboard"
```
Comparison: Which Approach to Use
| Feature | DIY Script | Desktop Tool | Cloud-Based |
|---|---|---|---|
| Max URLs (practical) | ~5,000 | ~10,000 | 75,000 |
| Rate limit handling | Manual | Limited | Automatic (proxy rotation) |
| Soft 404 detection | No | Yes | Yes |
| Local resources | High | High | None |
| Babysitting required | Yes | Yes | No |
| Cost | Free | £149/year | From $9.99 |
Under 5,000 URLs: The Python script works fine. 5,000-10,000: Desktop tools like Screaming Frog can handle it. 10,000-75,000: Cloud-based bulk checkers are the only practical option.
Best Practices
- Check regularly. Monthly checks catch broken links before users find them.
- Prioritize internal links. Broken internal links hurt SEO more than external ones.
- Track redirect chains. Long chains (3+ hops) slow page load and should be shortened.
- Monitor response times. Slow external resources affect your page performance.
- Use proper User-Agent headers. Identify yourself to avoid being blocked as a bot.
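Two of these practices, redirect chain tracking and a proper User-Agent, take only a few lines with requests. This is a sketch: the User-Agent string and contact URL are placeholders you should replace with your own, and the 3-hop threshold mirrors the guideline above:

```python
import requests

# Placeholder identity; point the URL at a page describing your checker
HEADERS = {'User-Agent': 'DocsLinkChecker/1.0 (+https://example.com/bot)'}

def redirect_chain(url, session=None):
    """Return the full redirect chain as (status_code, url) hops."""
    session = session or requests.Session()
    response = session.get(url, headers=HEADERS, timeout=10, allow_redirects=True)
    # response.history holds each intermediate redirect response, in order
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops

def needs_flattening(hops, max_hops=3):
    """A chain with 3+ redirects (i.e. 4+ entries) is worth shortening."""
    return len(hops) - 1 >= max_hops
```

Point links directly at the final URL whenever a chain trips `needs_flattening`; each extra hop adds a round trip for every visitor.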
For more on keeping documentation links healthy, see our dedicated guide. And check our comparison table to see how cloud-based checking stacks up.
Ready to Check Your URLs?
300 free URL checks. No credit card. Upload your CSV and get your report by email.
Check URLs Free →
Related Articles
How to Check for 404 Errors on Your Website →
Find and fix 404 errors hurting your SEO with Google Search Console, crawlers, and bulk checkers.
Free vs Paid Broken Link Checkers →
When free tools are enough and when you need a paid broken link checker.
How to Find Broken Links on Any Website (2026 Guide) →
Free methods, browser tools, and bulk checking to find and fix broken links on any website.