How to Check URLs in Bulk: Guide for Developers
If you maintain documentation, knowledge bases, or content databases with thousands of URLs, manual link checking is not an option. A broken link in your API docs costs developer trust. A dead URL in your knowledge base creates a support ticket. At 10,000+ links, you need automation.
This guide covers three approaches: a DIY Python script, cloud-based bulk checking, and CI/CD integration — so you can pick what fits your scale.
Why Bulk URL Checking Matters
Broken links in developer-facing content cause real damage:
- API documentation with hundreds of endpoint references that go stale after versioning
- Internal knowledge bases (Confluence, Notion, GitBook) with cross-linked articles that break during restructuring
- Content databases with aggregated links from multiple sources
- Third-party integrations referencing external URLs that disappear
When you have 30,000+ URLs across these systems, even a 2% monthly breakage rate means 600 broken links per month.
Approach 1: DIY Python Script
For developers who want full control, here is a concurrent URL checker with retry logic:
```python
import csv
import time
import requests
from concurrent.futures import ThreadPoolExecutor
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    session = requests.Session()
    retries = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

def check_url(url, session):
    try:
        # HEAD is cheaper than GET; some servers reject it, so fall back on 405
        response = session.head(url, timeout=10, allow_redirects=True)
        if response.status_code == 405:
            response = session.get(url, timeout=10, allow_redirects=True)
        return {
            'url': url,
            'status_code': response.status_code,
            'final_url': response.url,
            'response_time': response.elapsed.total_seconds(),
            'error': '',
        }
    except requests.exceptions.RequestException as e:
        return {
            'url': url,
            'status_code': 'ERROR',
            'final_url': url,
            'response_time': '',
            'error': str(e),
        }

def check_urls_bulk(urls, max_workers=10):
    session = create_session()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        for url in urls:
            futures.append(executor.submit(check_url, url, session))
            time.sleep(0.1)  # throttle submissions for basic rate limiting
        return [future.result() for future in futures]

# Usage
with open('urls.txt') as f:
    urls = f.read().splitlines()
results = check_urls_bulk(urls)

# Save results
with open('results.csv', 'w', newline='') as f:
    writer = csv.DictWriter(
        f, fieldnames=['url', 'status_code', 'final_url', 'response_time', 'error']
    )
    writer.writeheader()
    writer.writerows(results)
```
Where this breaks down
- Hits rate limits (429 errors) at 5,000+ URLs with no proxy rotation
- Runs on your machine — ties up local resources for hours on large batches
- No soft 404 detection (pages that return 200 but show error content)
- No redirect chain tracking
- No persistent history or trend tracking
This approach works for one-off checks under 5,000 URLs. Beyond that, you need infrastructure.
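Soft 404 detection can be bolted onto the script above with a simple heuristic. This is a sketch, not a comprehensive detector: the phrase list below is an assumption about common error-page wording and should be tuned for the sites you check. Feed it `response.status_code` and `response.text` from a GET request (HEAD responses have no body to inspect):

```python
import re

# Assumed list of common error-page phrases; adjust for your own content
SOFT_404_PATTERNS = re.compile(
    r"page not found|doesn't exist|no longer available|nothing was found",
    re.IGNORECASE,
)

def is_soft_404(status_code, body):
    """Flag pages that return 200 but whose content looks like an error page."""
    if status_code != 200:
        return False  # hard errors are already caught by the status code
    # Error pages are usually short; the first few KB is enough to check
    return bool(SOFT_404_PATTERNS.search(body[:8192]))
```

Real soft 404 detectors compare page content against a known error template or check for thin content; a phrase list is the cheapest approximation.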
Approach 2: Cloud-Based Bulk URL Checker
For production workloads with 10,000-75,000 URLs, a cloud-based checker solves the scaling problems:
- Upload a CSV and walk away. Cloud infrastructure processes your batch. You get an email when the report is ready.
- Automatic proxy rotation. Handles 429/403 responses so your entire batch completes.
- Soft 404 detection. Identifies pages that return 200 but actually show error content.
- CSV/JSON export. Get results in developer-friendly formats for programmatic analysis.
- Dashboard with filtering. Search by status code, filter broken links, view redirect chains.
Check Up to 75,000 URLs — Free to Start
Upload your CSV, get your report by email. 300 free URL checks, no credit card required.
Check URLs Free →
Real Example: Checking a Documentation Site
Here is a practical workflow for checking links in a documentation site with 5,000+ pages.
Step 1: Extract URLs
For static site generators (Next.js, Hugo, Gatsby, MkDocs):
```bash
# Extract all external links from built HTML
grep -r -o 'https://[^"]*' ./build > urls.txt

# Deduplicate
sort -u urls.txt > unique_urls.txt
echo "Found $(wc -l < unique_urls.txt) unique URLs"
```
Step 2: Convert to CSV
```bash
echo "url" > urls.csv
cat unique_urls.txt >> urls.csv
```
Step 3: Upload and check
Upload the CSV to a bulk URL checker. For 5,000 URLs, processing typically takes 1-3 hours depending on target server response times.
Step 4: Filter broken links
```python
import pandas as pd

df = pd.read_csv('results.csv')

# status_code may contain 'ERROR' strings; coerce to numeric (ERROR -> NaN)
df['status_code'] = pd.to_numeric(df['status_code'], errors='coerce')

# Find broken links (4xx and 5xx)
broken = df[df['status_code'] >= 400]
print(f"Found {len(broken)} broken links")

# Group by status code
print(broken.groupby('status_code')['url'].count())

# Export for fixing
broken.to_csv('broken_links.csv', index=False)
```
CI/CD Integration
For continuous validation, run URL checks as part of your deployment pipeline:
```yaml
# .github/workflows/check-links.yml
name: Check Documentation Links

on:
  schedule:
    - cron: '0 0 * * 1' # Every Monday
  workflow_dispatch:

jobs:
  check-links:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Extract URLs from docs
        run: |
          grep -r -o 'https://[^"]*' ./docs/build > urls.txt
          sort -u urls.txt > unique_urls.txt
          echo "url" > urls.csv
          cat unique_urls.txt >> urls.csv
          echo "Extracted $(wc -l < unique_urls.txt) URLs"

      - name: Upload to Bulk URL Checker
        run: |
          # Upload CSV to your bulk checker
          # Process results and flag broken links
          echo "Upload urls.csv to app.bulkurlchecker.com"
          echo "Review results in dashboard"
```
Comparison: Which Approach to Use
| Feature | DIY Script | Desktop Tool | Cloud-Based |
|---|---|---|---|
| Max URLs (practical) | ~5,000 | ~10,000 | 75,000 |
| Rate limit handling | Manual | Limited | Automatic (proxy rotation) |
| Soft 404 detection | No | Yes | Yes |
| Local resources | High | High | None |
| Babysitting required | Yes | Yes | No |
| Cost | Free | £149/year | From $9.99 |
Under 5,000 URLs: The Python script works fine. 5,000-10,000: Desktop tools like Screaming Frog can handle it. 10,000-75,000: Cloud-based bulk checkers are the only practical option.
Best Practices
- Check regularly. Monthly checks catch broken links before users find them.
- Prioritize internal links. Broken internal links hurt SEO more than external ones.
- Track redirect chains. Long chains (3+ hops) slow page load and should be shortened.
- Monitor response times. Slow external resources affect your page performance.
- Use proper User-Agent headers. Identify yourself to avoid being blocked as a bot.
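Two of these practices, redirect chain tracking and a proper User-Agent, take only a few lines with requests. This is a sketch: the User-Agent string and contact URL are placeholders you should replace with your own, and the 3-hop threshold mirrors the guideline above:

```python
import requests

# Placeholder identity; point the URL at a page describing your checker
HEADERS = {'User-Agent': 'DocsLinkChecker/1.0 (+https://example.com/bot)'}

def redirect_chain(url, session=None):
    """Return the full redirect chain as (status_code, url) hops."""
    session = session or requests.Session()
    response = session.get(url, headers=HEADERS, timeout=10, allow_redirects=True)
    # response.history holds each intermediate redirect response, in order
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops

def needs_flattening(hops, max_hops=3):
    """A chain with 3+ redirects (i.e. 4+ entries) is worth shortening."""
    return len(hops) - 1 >= max_hops
```

Point links directly at the final URL whenever a chain trips `needs_flattening`; each extra hop adds a round trip for every visitor.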
For more on keeping documentation links healthy, see our dedicated guide. And check our comparison table to see how cloud-based checking stacks up.
Ready to Check Your URLs?
300 free URL checks. No credit card. Upload your CSV and get your report by email.
Check URLs Free →
Related Articles
How to Check for 404 Errors on Your Website →
Find and fix 404 errors hurting your SEO with Google Search Console, crawlers, and bulk checkers.
Free vs Paid Broken Link Checkers →
When free tools are enough and when you need a paid broken link checker.
How to Find Broken Links on Any Website (2026 Guide) →
Free methods, browser tools, and bulk checking to find and fix broken links on any website.