# Web Scraping Best Practices for Reliable Data Collection
Building a web scraper that works once is easy. Building one that works reliably at scale, day after day, is a different problem entirely. This guide covers the practices that separate production-grade scraping from quick scripts that break after a week.
## 1. Use Residential Proxies for Protected Sites
Most websites with valuable data have some form of anti-bot protection. Datacenter IPs are the first thing they block.
Residential proxies route your requests through real consumer IP addresses, making them indistinguishable from normal user traffic. FastWebScraper supports multiple proxy types:
- Residential: Best for sites with strong anti-bot protection. Higher success rates but slower.
- Datacenter: Faster and cheaper. Good for sites with minimal protection.
- ISP proxies: A middle ground — datacenter speed with residential-level trust.

```typescript
// Use residential proxies for protected sites
const response = await fetch('https://api.fastwebscraper.com/v1/scrape/sync', {
  method: 'POST',
  headers: {
    'X-API-Key': 'YOUR_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://protected-site.com/data',
    mode: 'auto',
    country: 'US', // Target a specific geo for localized content
  }),
});
```

## 2. Wait for Dynamic Content
Modern websites load content dynamically with JavaScript. If you scrape too early, you get empty containers instead of data.
Use the `waitForSelector` parameter to tell the scraper to wait until specific elements are present in the DOM:

```typescript
const response = await fetch('https://api.fastwebscraper.com/v1/scrape/sync', {
  method: 'POST',
  headers: {
    'X-API-Key': 'YOUR_API_KEY',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com/products',
    mode: 'auto',
    waitForSelector: '.product-card, [data-product-id]',
  }),
});

const data = await response.json();
// HTML now contains fully rendered product cards
```

## 3. Handle Errors Gracefully
Scraping jobs fail. Sites go down, change their structure, or block your request. Your system needs to handle this without losing data or crashing.
Retry strategy:
- Retry failed requests with exponential backoff (1s, 2s, 4s, 8s)
- Cap retries at 3-5 attempts
- Use different proxy IPs on each retry
- Log failures for debugging

```typescript
async function scrapeWithRetry(url: string, maxRetries = 3): Promise<any> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const response = await fetch('https://api.fastwebscraper.com/v1/scrape/sync', {
      method: 'POST',
      headers: {
        'X-API-Key': 'YOUR_API_KEY',
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ url, mode: 'auto' }),
    });
    const result = await response.json();
    if (response.ok && result.data?.html) {
      return result;
    }
    // Only sleep if another attempt remains
    if (attempt < maxRetries) {
      // Exponential backoff: 1s, 2s, 4s...
      const delay = Math.pow(2, attempt - 1) * 1000;
      console.log(`Attempt ${attempt} failed. Retrying in ${delay}ms...`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error(`Failed to scrape ${url} after ${maxRetries} attempts`);
}
```

## 4. Respect Rate Limits
Even with proxies, sending too many requests too fast is counterproductive. It triggers defenses, wastes credits, and can harm the target site.
Guidelines:
- Start with 1-2 concurrent requests per domain and increase gradually
- Add 1-3 second delays between requests to the same domain
- Spread requests across time rather than sending bursts
- Use the async API for large batches — queue jobs and poll for results

```typescript
const urls = [
  'https://example.com/page/1',
  'https://example.com/page/2',
  'https://example.com/page/3',
];

for (const url of urls) {
  const response = await fetch('https://api.fastwebscraper.com/v1/scrape/async', {
    method: 'POST',
    headers: {
      'X-API-Key': 'YOUR_API_KEY',
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url, mode: 'auto' }),
  });
  const { data } = await response.json();
  console.log(`Queued: ${url} -> Job ${data.jobId}`);

  // Pause between submissions
  await new Promise(r => setTimeout(r, 1000));
}
```

## 5. Validate Extracted Data
HTML parsing is fragile. A missing element, changed class name, or empty container can produce garbage data. Always validate what you extract.
Validation checklist:
- Check that extracted values are non-empty
- Validate data types (prices should be numbers, dates should parse correctly)
- Set reasonable bounds (a product price of $0.00 or $999,999 is probably wrong)
- Compare against previous values to detect anomalies
- Log validation failures separately from scraping failures

```typescript
function validatePrice(priceText: string): number | null {
  // Remove currency symbols, thousands separators, and whitespace
  const cleaned = priceText.trim().replace(/[^\d.]/g, '');
  if (!cleaned) return null;
  const price = parseFloat(cleaned);
  if (isNaN(price) || price <= 0 || price > 100_000) {
    return null;
  }
  return price;
}

// Usage
const price = validatePrice('$29.99');
console.log(price); // 29.99
```

## 6. Use Async Scraping for Large Jobs
When scraping hundreds or thousands of URLs, synchronous requests create a bottleneck. Use the async API to queue jobs in bulk and process results as they complete.
Pattern:
- Submit all URLs as async jobs
- Collect the job IDs
- Poll for completion in batches
- Process results as they arrive
This approach maximizes throughput while keeping your code simple.
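The steps above can be sketched as a submit-then-poll loop. The `POST /v1/scrape/async` endpoint appears earlier in this guide; the `GET /v1/scrape/jobs/:id` status route and its response shape are assumptions here, so check the API Reference for the actual polling endpoint.

```typescript
const API_BASE = 'https://api.fastwebscraper.com/v1';
const HEADERS = {
  'X-API-Key': 'YOUR_API_KEY',
  'Content-Type': 'application/json',
};

// Split job IDs into batches so each polling pass stays bounded
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Step 1 + 2: submit every URL and collect the job IDs
async function submitJobs(urls: string[]): Promise<string[]> {
  const jobIds: string[] = [];
  for (const url of urls) {
    const res = await fetch(`${API_BASE}/scrape/async`, {
      method: 'POST',
      headers: HEADERS,
      body: JSON.stringify({ url, mode: 'auto' }),
    });
    const { data } = await res.json();
    jobIds.push(data.jobId);
  }
  return jobIds;
}

// Step 3 + 4: poll in batches and process results as they arrive
async function pollUntilDone(jobIds: string[], intervalMs = 5000): Promise<Map<string, any>> {
  const pending = new Set(jobIds);
  const results = new Map<string, any>();
  while (pending.size > 0) {
    for (const batch of chunk([...pending], 10)) {
      for (const id of batch) {
        // Hypothetical status endpoint -- see the API Reference
        const res = await fetch(`${API_BASE}/scrape/jobs/${id}`, { headers: HEADERS });
        const { data } = await res.json();
        if (data.status === 'completed') {
          results.set(id, data); // process each result as it completes
          pending.delete(id);
        } else if (data.status === 'failed') {
          pending.delete(id); // log and optionally re-queue
        }
      }
    }
    if (pending.size > 0) {
      await new Promise(r => setTimeout(r, intervalMs));
    }
  }
  return results;
}
```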
## 7. Store Raw HTML Before Parsing
Always save the raw HTML before parsing it. If your parsing logic has a bug or the site structure changes, you can re-parse stored HTML without re-scraping.
Benefits:
- Debug parsing issues without making new requests
- Backfill new data fields from historical scrapes
- Audit trail of what the page looked like at each scrape time
- Reduce API usage and costs
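This practice takes only a few lines. The sketch below stores each page under a URL-derived, timestamped filename; the directory layout and naming scheme are illustrative, not prescribed by the API.

```typescript
import { mkdirSync, writeFileSync } from 'node:fs';
import { createHash } from 'node:crypto';
import { join } from 'node:path';

// Persist raw HTML before any parsing runs, and return the saved path
// so it can be recorded alongside the parsed output.
function storeRawHtml(url: string, html: string, dir = 'raw-html'): string {
  mkdirSync(dir, { recursive: true });
  // Hash the URL so the filename is stable and filesystem-safe
  const urlHash = createHash('sha256').update(url).digest('hex').slice(0, 12);
  const stamp = new Date().toISOString().replace(/[:.]/g, '-');
  const file = join(dir, `${urlHash}-${stamp}.html`);
  writeFileSync(file, html, 'utf8');
  return file;
}
```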
## 8. Monitor Your Scraping Pipeline
Treat your scraping pipeline like any production system:
- Success rate: Track what percentage of scrapes return valid data
- Latency: Monitor how long scrapes take — slowdowns often indicate detection
- Data freshness: Ensure data is being updated on schedule
- Alerting: Set up alerts for success rate drops or complete failures
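As a minimal sketch, the first two metrics can be tracked in-process; in production you would export them to whatever monitoring stack you already run. The thresholds below are illustrative.

```typescript
// Track scrape outcomes and latency, and flag when the success rate
// drops below a threshold (only once enough samples have accumulated).
class ScrapeMetrics {
  private total = 0;
  private ok = 0;
  private latencies: number[] = [];

  record(success: boolean, latencyMs: number): void {
    this.total++;
    if (success) this.ok++;
    this.latencies.push(latencyMs);
  }

  successRate(): number {
    return this.total === 0 ? 1 : this.ok / this.total;
  }

  // Rising p95 latency often indicates the scraper is being challenged
  p95LatencyMs(): number {
    if (this.latencies.length === 0) return 0;
    const sorted = [...this.latencies].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95));
    return sorted[idx];
  }

  shouldAlert(minSuccessRate = 0.9): boolean {
    return this.total >= 20 && this.successRate() < minSuccessRate;
  }
}
```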
## Summary
| Practice | Why It Matters |
|---|---|
| Residential proxies | Higher success rate on protected sites |
| Wait for selectors | Get fully rendered dynamic content |
| Retry with backoff | Recover from transient failures |
| Rate limiting | Avoid blocks and be a good internet citizen |
| Data validation | Catch bad data before it enters your system |
| Async scraping | Scale to thousands of URLs efficiently |
| Store raw HTML | Enable re-parsing and debugging |
| Pipeline monitoring | Catch issues before they impact your data |
Following these practices will save you significant debugging time and produce higher quality data. For more details on the API parameters mentioned here, see the API Reference.