Compliance
Understand legal and regulatory compliance requirements
ScrapeHub Compliance Commitment
ScrapeHub is committed to maintaining the highest standards of data protection and regulatory compliance. We continuously monitor and adapt to evolving regulations to ensure our platform meets global compliance requirements.
GDPR Compliance
The General Data Protection Regulation (GDPR) governs the processing of personal data of individuals in the European Union. ScrapeHub helps you stay compliant when scraping and processing personal information.
Key GDPR Principles
Data Minimization
Only collect and process data that is necessary for your specific purpose
Purpose Limitation
Use collected data only for the purposes you specified when collecting it
Data Accuracy
Ensure personal data is accurate, up-to-date, and corrected when necessary
Storage Limitation
Retain personal data only for as long as necessary for processing purposes
Implementing GDPR-Compliant Scraping
from scrapehub import ScrapeHubClient
import hashlib
from datetime import datetime, timedelta

client = ScrapeHubClient(api_key="your_api_key")

# Configure data minimization
result = client.scrape(
    url="https://example.com",
    selectors={
        # Only extract necessary fields
        "product_name": ".product-title",
        "price": ".product-price",
        # Don't extract personal data unless required
        # "user_email": ".user-email"  # ❌ Avoid unless necessary
    }
)

# Pseudonymize personal data if collection is required
def pseudonymize(data):
    """Hash personal identifiers for GDPR compliance"""
    if 'email' in data:
        data['email_hash'] = hashlib.sha256(
            data['email'].encode()
        ).hexdigest()
        del data['email']  # Remove original email
    return data

# Apply retention policy
retention_period = timedelta(days=90)
data_expiry = datetime.now() + retention_period

# Store with expiry metadata
scraped_data = {
    'data': result.data,
    'collected_at': datetime.now().isoformat(),
    'expires_at': data_expiry.isoformat(),
    'purpose': 'price_monitoring'  # Document purpose
}

print(f"Data will be deleted after: {data_expiry}")
Data Subject Rights
GDPR grants individuals specific rights over their personal data. Ensure your systems can support:
Right to Access
Individuals can request copies of their personal data you hold
Right to Erasure
Individuals can request deletion of their personal data ("right to be forgotten")
Right to Rectification
Individuals can request corrections to inaccurate personal data
Right to Data Portability
Individuals can request their data in a machine-readable format
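Below is a minimal sketch of how these rights might map onto request handlers, assuming a hypothetical in-memory user_store keyed by subject ID. Your real implementation will live in your own systems; none of this is ScrapeHub API.

import json

user_store = {}  # hypothetical store: subject_id -> personal data dict

def handle_access(subject_id):
    """Right to Access: return a copy of the data held on the subject."""
    return dict(user_store.get(subject_id, {}))

def handle_erasure(subject_id):
    """Right to Erasure: delete all personal data for the subject."""
    user_store.pop(subject_id, None)

def handle_rectification(subject_id, corrections):
    """Right to Rectification: apply corrections to stored data."""
    user_store.setdefault(subject_id, {}).update(corrections)

def handle_portability(subject_id):
    """Right to Data Portability: export in a machine-readable format."""
    return json.dumps(user_store.get(subject_id, {}))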
CCPA Compliance
The California Consumer Privacy Act (CCPA) provides privacy rights and consumer protection for California residents.
CCPA Requirements
from scrapehub import ScrapeHubClient
from datetime import datetime

client = ScrapeHubClient(api_key="your_api_key")

class CCPACompliantScraper:
    def __init__(self):
        self.data_collection_log = []

    def scrape_with_notice(self, url, purpose):
        """Scrape with documented purpose (CCPA notice requirement)"""
        # Log collection activity
        self.data_collection_log.append({
            'url': url,
            'timestamp': datetime.now().isoformat(),
            'purpose': purpose,
            'categories': ['commercial_info', 'online_identifiers']
        })
        result = client.scrape(url)
        return result

    def opt_out_handler(self, user_id):
        """Handle California residents' opt-out requests"""
        # Delete all data associated with the user
        # Stop future data collection for this user
        print(f"Processing opt-out for user: {user_id}")
        # Implementation depends on your database

    def provide_data_access(self, user_id):
        """Provide users access to their collected data"""
        # Return all data collected about the user
        # Must be provided within 45 days of the request
        user_data = self.get_user_data(user_id)  # your own lookup helper
        return {
            'data': user_data,
            'categories': ['personal_info', 'commercial_info'],
            'sources': ['web_scraping'],
            'purposes': ['analytics', 'price_monitoring']
        }

scraper = CCPACompliantScraper()
result = scraper.scrape_with_notice(
    "https://example.com",
    purpose="competitive_price_analysis"
)
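CCPA responses are deadline-bound: access requests must be fulfilled within 45 days. A small follow-up sketch tracking that window for the hypothetical scraper above (user_123 is a placeholder):

from datetime import datetime, timedelta

# Record when the request arrived; CCPA allows 45 days to respond
request_received = datetime.now()
response_deadline = request_received + timedelta(days=45)

scraper.opt_out_handler("user_123")  # placeholder user ID
print(f"Respond to access requests by: {response_deadline.date()}")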
robots.txt Compliance
Respecting robots.txt is a fundamental ethical requirement for web scraping, and ignoring it can create legal risk.
Automatic robots.txt Checking
from scrapehub import ScrapeHubClient

client = ScrapeHubClient(
    api_key="your_api_key",
    respect_robots_txt=True  # Enable automatic checking
)

# ScrapeHub will automatically check robots.txt
# and reject requests to disallowed paths
try:
    result = client.scrape("https://example.com/admin")
except client.RobotsTxtError as e:
    print(f"Scraping blocked by robots.txt: {e}")
    # Handle appropriately

# Check robots.txt manually
robots_info = client.check_robots_txt("https://example.com")
print(f"Can scrape: {robots_info.can_scrape}")
print(f"Crawl delay: {robots_info.crawl_delay} seconds")
print(f"Disallowed paths: {robots_info.disallowed_paths}")
Understanding robots.txt
# Example robots.txt file
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 10
User-agent: ScrapeHubBot
Allow: /public-data/
Crawl-delay: 5
# Always respect these directives
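If you need to evaluate directives like these yourself, Python's standard library includes a parser. A minimal sketch using urllib.robotparser, independent of the ScrapeHub client; the URL and bot name are placeholders:

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the live file

agent = "MyCompanyBot"
allowed = rp.can_fetch(agent, "https://example.com/some/path")
print(f"Can fetch: {allowed}")

# Honor Crawl-delay between requests
delay = rp.crawl_delay(agent) or 1  # default to 1 second if unspecified
time.sleep(delay)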
Terms of Service Compliance
Review Before Scraping
- Always read the target website's Terms of Service before scraping
- Some websites explicitly prohibit automated data collection
- Respect rate limits and crawl delays specified by websites
- Identify your scraper with a proper User-Agent string
- Consider reaching out to site owners for permission or API access
Setting a Proper User-Agent
from scrapehub import ScrapeHubClient
client = ScrapeHubClient(
api_key="your_api_key",
user_agent="MyCompanyBot/1.0 (+https://mycompany.com/bot)"
)
# This helps website owners:
# - Identify your scraper
# - Contact you if needed
# - Apply appropriate rate limits
result = client.scrape(
url="https://example.com",
headers={
"User-Agent": "MyCompanyBot/1.0 (+https://mycompany.com/bot)",
"From": "bot@mycompany.com" # Contact email
}
)Data Protection Certifications
Data Protection Certifications
SOC 2 Type II
ScrapeHub maintains SOC 2 Type II certification for security, availability, and confidentiality
ISO 27001
Certified for information security management system standards
GDPR Compliant
Full compliance with EU General Data Protection Regulation
CCPA Compliant
Adheres to California Consumer Privacy Act requirements
Industry-Specific Compliance
Healthcare (HIPAA)
If scraping healthcare-related data, ensure HIPAA compliance:
- Do not scrape Protected Health Information (PHI) without proper authorization
- Implement encryption for PHI at rest and in transit
- Maintain audit logs of all PHI access and processing (see the sketch below)
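A minimal sketch of the last two items, using the third-party cryptography package for encryption at rest; the field names and log path are illustrative assumptions, not HIPAA guidance:

import json
from datetime import datetime
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()  # in production, load from a key management service
fernet = Fernet(key)

def encrypt_phi(record: dict) -> bytes:
    """Encrypt a PHI record at rest (transit encryption is handled by HTTPS)."""
    return fernet.encrypt(json.dumps(record).encode())

def log_phi_access(user: str, action: str, record_id: str):
    """Append an audit-log entry for every PHI access or processing step."""
    entry = {
        'user': user,
        'action': action,
        'record_id': record_id,
        'timestamp': datetime.now().isoformat(),
    }
    with open("phi_audit.log", "a") as f:
        f.write(json.dumps(entry) + "\n")

token = encrypt_phi({'patient_id': 'p-001', 'diagnosis': '...'})
log_phi_access(user="analyst_1", action="encrypt_and_store", record_id="p-001")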
Financial Services (PCI DSS)
Never Scrape Payment Card Data
- Do not extract credit card numbers, CVV codes, or cardholder data
- Scraping payment information is illegal in most jurisdictions
- PCI DSS compliance requires strict controls that scraping cannot meet (a defensive redaction sketch follows)
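Even when you never target payment fields, card-like strings can slip into free-text content. A minimal defensive sketch that redacts candidate card numbers (Luhn-validated) before anything is stored; the regex and redaction marker are assumptions for illustration:

import re

CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum: filters out most random digit runs."""
    digits = [int(d) for d in number[::-1]]
    total = sum(digits[0::2])
    for d in digits[1::2]:
        d *= 2
        total += d - 9 if d > 9 else d
    return total % 10 == 0

def redact_card_numbers(text: str) -> str:
    """Replace Luhn-valid card-like sequences before storage."""
    def _redact(match):
        digits = re.sub(r"[ -]", "", match.group())
        return "[REDACTED]" if luhn_valid(digits) else match.group()
    return CARD_PATTERN.sub(_redact, text)

print(redact_card_numbers("Card on file: 4111 1111 1111 1111"))
# -> Card on file: [REDACTED]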
Compliance Best Practices
Compliance Checklist
- Document the purpose and legal basis for data collection
- Implement data retention and deletion policies
- Maintain records of processing activities (see the sketch after this checklist)
- Conduct regular compliance audits
- Provide clear privacy notices to data subjects
- Establish processes for data subject rights requests
- Train team members on compliance requirements
- Review and update compliance practices regularly
- Consult with legal counsel for specific use cases
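For the records-of-processing item, a minimal sketch of what one register entry might capture; the field names are illustrative, and GDPR Article 30 lists the contents actually required:

from datetime import datetime

def processing_record(purpose, data_categories, recipients, retention):
    """One entry in a records-of-processing register (illustrative fields)."""
    return {
        'purpose': purpose,
        'data_categories': data_categories,
        'recipients': recipients,
        'retention': retention,
        'recorded_at': datetime.now().isoformat(),
    }

register = [
    processing_record(
        purpose="price_monitoring",
        data_categories=["commercial_info"],
        recipients=["internal_analytics"],
        retention="90 days",
    )
]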
Data Processing Agreements
ScrapeHub acts as a data processor when you use our service. We provide standard Data Processing Agreements (DPAs) for GDPR compliance.
# Request a DPA
# 1. Log in to ScrapeHub Dashboard
# 2. Navigate to Settings → Legal & Compliance
# 3. Download the standard DPA
# 4. For custom DPAs, contact enterprise@scrapehub.io