Security Best Practices
Protect your data and API keys with security best practices
Security is Critical
Compromised API keys can lead to unauthorized access, data breaches, and unexpected charges. Following these security best practices is essential to protect your account and data.
API Key Security
Never Commit Keys to Version Control
Don't Do This
# ❌ NEVER hardcode API keys
api_key = "sk_live_1234567890" # This is BAD!
client = ScrapeHubClient(api_key=api_key)
Do This Instead
# ✅ Use environment variables
import os
api_key = os.getenv('SCRAPEHUB_API_KEY')
client = ScrapeHubClient(api_key=api_key)
Environment Variables
Store API keys in environment variables or secure credential managers:
# .env file (add to .gitignore!)
SCRAPEHUB_API_KEY=sk_live_xxxx_xxxx
SCRAPEHUB_WEBHOOK_SECRET=whsec_xxxx_xxxx
# Never commit this file to git!
# Add to .gitignore
.env
.env.local
.env.*.local
*.key
secrets.json
credentials.json
Loading Environment Variables
# Python with python-dotenv
from dotenv import load_dotenv
import os
load_dotenv() # Load from .env file
api_key = os.getenv('SCRAPEHUB_API_KEY')
if not api_key:
    raise ValueError("API key not found in environment variables")
// Node.js with dotenv
require('dotenv').config();
const apiKey = process.env.SCRAPEHUB_API_KEY;
if (!apiKey) {
  throw new Error('API key not found in environment variables');
}
API Key Management
Use Multiple Keys
Create separate API keys for development, staging, and production environments (see the sketch after these cards)
Rotate Regularly
Rotate API keys every 90 days or immediately if compromised
Monitor Usage
Regularly review API key usage in your dashboard for suspicious activity
Revoke Unused Keys
Delete API keys that are no longer in use to minimize risk
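Separate keys per environment, as recommended above, are easiest to manage through a small loader. The sketch below is a minimal example, assuming you define SCRAPEHUB_API_KEY_DEV, SCRAPEHUB_API_KEY_STAGING, and SCRAPEHUB_API_KEY_PROD plus an APP_ENV variable yourself; those names are illustrative, not a ScrapeHub convention.
import os

# Hypothetical variable names -- use whatever convention fits your deployment
KEY_VARS = {
    "development": "SCRAPEHUB_API_KEY_DEV",
    "staging": "SCRAPEHUB_API_KEY_STAGING",
    "production": "SCRAPEHUB_API_KEY_PROD",
}

def load_api_key(environment=None):
    """Return the API key for the current environment, failing loudly if it is missing."""
    environment = environment or os.getenv("APP_ENV", "development")
    var_name = KEY_VARS.get(environment)
    if not var_name:
        raise ValueError(f"Unknown environment: {environment}")
    api_key = os.getenv(var_name)
    if not api_key:
        raise ValueError(f"{var_name} is not set")
    return api_key

client = ScrapeHubClient(api_key=load_api_key())
Keeping the mapping in one place also simplifies scheduled rotation: you swap the value behind each variable, not the code.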
Secure Data Handling
Sanitize Scraped Data
Always sanitize and validate scraped data before using it:
import html
import re
def sanitize_text(text):
    """Remove potentially harmful content from scraped text"""
    if not text:
        return ""
    # HTML entity decode
    text = html.unescape(text)
    # Remove script tags and content
    text = re.sub(r'<script[^>]*>.*?</script>', '', text, flags=re.DOTALL | re.IGNORECASE)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
# Use sanitization
result = client.scrape("https://example.com")
clean_title = sanitize_text(result.data.get('title'))
clean_description = sanitize_text(result.data.get('description'))
Validate URLs
from urllib.parse import urlparse
def is_safe_url(url, allowed_domains=None):
    """Validate URL before scraping"""
    try:
        parsed = urlparse(url)
        # Check scheme
        if parsed.scheme not in ['http', 'https']:
            return False
        # Check domain whitelist
        if allowed_domains:
            if parsed.netloc not in allowed_domains:
                return False
        # Avoid localhost and private IPs
        if parsed.netloc in ['localhost', '127.0.0.1', '0.0.0.0']:
            return False
        return True
    except Exception:
        return False
# Use validation
url = user_provided_url # From user input
if is_safe_url(url, allowed_domains=['example.com', 'api.example.com']):
    result = client.scrape(url)
else:
    raise ValueError("Invalid or unsafe URL")
Encrypt Sensitive Data
from cryptography.fernet import Fernet
import os
# Generate or load encryption key
encryption_key = os.getenv('ENCRYPTION_KEY')
if not encryption_key:
    encryption_key = Fernet.generate_key()
    print(f"Save this key securely: {encryption_key.decode()}")
cipher = Fernet(encryption_key)
# Encrypt sensitive scraped data
result = client.scrape("https://example.com")
sensitive_data = str(result.data).encode()
encrypted_data = cipher.encrypt(sensitive_data)
# Store encrypted_data in database
# Later, decrypt when needed
decrypted_data = cipher.decrypt(encrypted_data).decode()
print(decrypted_data)
Network Security
Use HTTPS
Always use HTTPS for API requests and webhook endpoints:
# Configure client to enforce HTTPS
client = ScrapeHubClient(
    api_key=api_key,
    enforce_https=True  # Reject non-HTTPS URLs
)
# Webhook endpoints must use HTTPS in production
webhook_url = "https://your-server.com/webhook" # ✅ HTTPS
# webhook_url = "http://your-server.com/webhook" # ❌ HTTP
IP Whitelisting
Restrict API access to specific IP addresses in the dashboard:
# In ScrapeHub Dashboard → Settings → Security
# Add allowed IP addresses:
# 203.0.113.0/24 (your server IP range)
# 198.51.100.50 (your office IP)
# API requests from other IPs will be rejected
Rate Limiting
from time import sleep
from datetime import datetime, timedelta
class RateLimiter:
    def __init__(self, max_requests, time_window):
        self.max_requests = max_requests
        self.time_window = time_window  # seconds
        self.requests = []

    def wait_if_needed(self):
        now = datetime.now()
        cutoff = now - timedelta(seconds=self.time_window)
        # Remove old requests
        self.requests = [r for r in self.requests if r > cutoff]
        if len(self.requests) >= self.max_requests:
            # Wait until oldest request expires
            sleep_time = (self.requests[0] - cutoff).total_seconds()
            sleep(sleep_time)
            self.requests.pop(0)
        self.requests.append(now)

# Use rate limiter
limiter = RateLimiter(max_requests=10, time_window=60)
for url in urls:
    limiter.wait_if_needed()
    result = client.scrape(url)
Error Handling
Don't Expose Sensitive Info in Errors
import logging
# Configure logging (don't log sensitive data)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
try:
    result = client.scrape(url)
except Exception as e:
    # ❌ Bad: Exposes API key in logs
    # logger.error(f"Scraping failed with key {api_key}: {e}")
    # ✅ Good: Error message without credentials
    logger.error(f"Scraping failed for {url}: {e}")
    # Don't expose internal errors to end users
    # ❌ Bad: raise e
    # ✅ Good: raise a generic error
    raise Exception("Unable to fetch data. Please try again later.")
Webhook Security
Always Verify Signatures
import hmac
import hashlib
import logging
import os
from flask import Flask, request, abort

app = Flask(__name__)
logger = logging.getLogger(__name__)
WEBHOOK_SECRET = os.getenv('SCRAPEHUB_WEBHOOK_SECRET')

@app.route('/webhook', methods=['POST'])
def webhook():
    # Get signature
    signature = request.headers.get('X-ScrapeHub-Signature')
    if not signature:
        logger.warning("Webhook request missing signature")
        abort(401)
    # Verify signature
    payload = request.get_data()
    expected = hmac.new(
        WEBHOOK_SECRET.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(signature, expected):
        logger.warning("Invalid webhook signature")
        abort(401)
    # Process webhook
    data = request.json
    process_webhook(data)
    return {'status': 'success'}, 200
Compliance & Privacy
Respect robots.txt
Check and honor robots.txt directives before scraping. ScrapeHub provides built-in robots.txt parsing.
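If you also fetch pages outside ScrapeHub, or want an explicit pre-flight check in your own pipeline, Python's standard library can evaluate robots.txt for you. The sketch below uses urllib.robotparser and only illustrates the idea; it is not ScrapeHub's built-in parser, and the MyScraperBot user agent is a placeholder.
from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

def is_allowed_by_robots(url, user_agent="MyScraperBot"):
    """Illustrative pre-flight robots.txt check (not a ScrapeHub API)."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    parser = RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except Exception:
        # If robots.txt cannot be fetched, err on the side of caution
        return False
    return parser.can_fetch(user_agent, url)

# Only scrape pages the site permits
if is_allowed_by_robots("https://example.com/products"):
    result = client.scrape("https://example.com/products")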
Handle Personal Data Carefully
If scraping personal information, ensure compliance with GDPR, CCPA, and other privacy regulations.
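What careful handling looks like in code depends on your obligations, but a common first step is redacting obvious identifiers before scraped content is stored or shared. The sketch below masks email addresses and phone-like numbers with deliberately simple regular expressions; treat the patterns as illustrative placeholders, not a complete PII solution.
import re

# Intentionally simple patterns -- tune them for your data and jurisdiction
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
PHONE_RE = re.compile(r'\+?\d[\d\s().-]{7,}\d')

def redact_pii(text):
    """Mask email addresses and phone-like numbers in scraped text."""
    if not text:
        return ""
    text = EMAIL_RE.sub('[REDACTED_EMAIL]', text)
    text = PHONE_RE.sub('[REDACTED_PHONE]', text)
    return text

# Redact before persisting scraped content
result = client.scrape("https://example.com")
safe_description = redact_pii(sanitize_text(result.data.get('description')))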
Terms of Service
Always review and comply with website terms of service before scraping.
Security Checklist
Pre-Production Security Checklist
- API keys stored in environment variables, not in code
- .env and credential files added to .gitignore
- Separate keys for development and production
- HTTPS enforced for all API requests and webhooks
- Webhook signature verification implemented
- Input validation for URLs and scraped data
- Rate limiting configured appropriately
- Error handling doesn't expose sensitive information
- Logging configured to exclude API keys and secrets
- Regular API key rotation scheduled
- IP whitelisting configured if applicable
- Security monitoring and alerts set up
Incident Response
If Your API Key is Compromised
- Immediately revoke the compromised key in the ScrapeHub dashboard
- Generate a new API key with a different name
- Update your applications with the new key
- Review recent activity for unauthorized usage
- Check your billing for unexpected charges
- Contact support if you notice suspicious activity
- Audit your codebase to prevent future exposures
Report Security Issues
If you discover a security vulnerability in ScrapeHub, please email security@scrapehub.io immediately. Do not post security issues publicly.