This tutorial will show you how to use OpenClaw for web scraping and automated data collection. You'll learn to extract data from websites, monitor prices, collect content, and build reusable scraping workflows. Estimated time: 25-35 minutes.

What You'll Build

By the end of this tutorial, you'll have:

  • Automated web scraping workflows
  • Price monitoring system
  • Content extraction pipelines
  • Data collection and storage automation
  • Reusable scraping patterns

Prerequisites

Before starting, make sure OpenClaw is installed and configured, and that its browser automation is working (the Browser Automation tutorial linked under Next Steps covers this).

⚠️ Important: Always respect website terms of service and robots.txt. Use reasonable delays between requests and don't overload servers.

Step 1: Basic Data Extraction

Start with simple data extraction requests:

Extract Product Information
"Go to https://example.com/product and extract:
- Product name
- Price
- Description
- Rating
- Availability
Save this data to product.json"

OpenClaw will navigate to the page, identify the elements, extract the data, and save it in your preferred format.
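
The exact fields depend on what the page exposes, but the saved product.json will typically look something like this (all values are placeholders):

{
  "name": "Example Product",
  "price": "$49.99",
  "description": "Short product description extracted from the page",
  "rating": "4.5 out of 5",
  "availability": "In stock"
}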

Step 2: Price Monitoring

Set up automated price monitoring for products you're interested in:

Price Monitoring Setup
"Every day at 9am, check the price of [product URL]:
1. Visit the product page
2. Extract the current price
3. Compare with previous price
4. Send me a notification if the price drops below $X
5. Save price history to prices.csv"

OpenClaw will create a cron job to monitor prices automatically and notify you of changes.
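
The resulting job looks roughly like the entries shown in Step 6, for example:

{
  "schedule": "0 9 * * *",
  "command": "agent --message 'Check the price of [product URL] and update prices.csv'"
}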

Multi-Product Monitoring

Monitor multiple products:

"Monitor these products daily:
- https://example.com/product1
- https://example.com/product2
- https://example.com/product3
Create a price comparison report each day"

Step 3: Content Collection

Collect content from websites for research or archiving:

Article Collection

Collect Articles
"Visit https://example.com/articles and:
1. Extract all article links
2. Visit each article page
3. Extract title, author, content, date
4. Save each article as a markdown file
5. Create an index file with all articles"
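
The index file is simply a table of contents for the collected articles; a minimal version might look like this (titles, authors, and dates are placeholders):

# Collected Articles

- Example Article Title, by Example Author (2024-01-15): example-article-title.md
- Another Article Title, by Another Author (2024-01-16): another-article-title.md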

News Aggregation

"Collect today's top stories from:
- https://news.ycombinator.com
- https://www.reddit.com/r/programming
- https://example.com/tech-news
Extract headlines, summaries, and links
Create a daily news digest document"

Step 4: Scraping Multiple Pages

Automate scraping across multiple pages:

Pagination Scraping
"Scrape all products from this e-commerce site:
1. Visit the first page
2. Extract all product links
3. Click 'Next' button
4. Repeat for all pages
5. Visit each product page
6. Extract product details
7. Save all data to products.json"

OpenClaw will handle pagination, navigation, and data extraction automatically.
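
For large catalogs it helps to bound the crawl in the same request, for example by adding: "Stop after the first 20 pages and wait 3 seconds between page loads."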

Step 5: Data Processing and Storage

Process and store scraped data effectively:

Save as JSON

"Extract all products and save as JSON:
- Format: array of product objects
- Include: name, price, description, url
- Save to: ~/data/products.json"
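
The resulting file is a plain JSON array; one entry might look like this (values are placeholders):

[
  {
    "name": "Example Product",
    "price": "$49.99",
    "description": "Short product description",
    "url": "https://example.com/product1"
  }
]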

Save as CSV

"Extract price history and save as CSV:
- Columns: date, price, product_name
- Append to: ~/data/prices.csv"
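
Because the data is appended, each run adds new rows to the same file; the result looks roughly like:

date,price,product_name
2024-01-15,49.99,Example Product
2024-01-16,44.99,Example Product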

Save as Markdown

"Extract articles and save as markdown files:
- One file per article
- Filename: YYYY-MM-DD-title.md
- Include: title, content, source, date"
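
A saved article following this pattern might look like this (all values are placeholders):

2024-01-15-example-article-title.md

# Example Article Title

Source: https://example.com/articles/example-article-title
Date: 2024-01-15

Article content extracted from the page...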

Step 6: Scheduled Scraping

Set up automated, scheduled scraping tasks:

{
  "cron": {
    "jobs": [
      {
        "schedule": "0 9 * * *",
        "command": "agent --message 'Check prices for monitored products and send me a summary'"
      },
      {
        "schedule": "0 12 * * *",
        "command": "agent --message 'Collect today's news articles and create a digest'"
      }
    ]
  }
}
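
Both jobs use standard five-field cron syntax: "0 9 * * *" runs every day at 09:00 and "0 12 * * *" runs every day at 12:00.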

Step 7: Best Practices

Respectful Scraping

  • Check robots.txt before scraping
  • Add delays between requests (2-5 seconds)
  • Respect rate limits
  • Set a descriptive User-Agent header
  • Don't overload servers
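
You can build these constraints directly into your requests, for example: "Before scraping, check robots.txt and skip any disallowed paths. Wait 3 seconds between page loads and stop if the site starts returning errors."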

Error Handling

  • Handle page load failures
  • Verify elements exist before extraction
  • Retry failed requests
  • Log errors for debugging

Data Validation

  • Verify extracted data format
  • Check for missing fields
  • Validate data types
  • Handle edge cases
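
One simple approach is to make validation part of the request itself, for example: "After extracting the products, list any entries with missing names or prices, and flag prices that aren't valid numbers instead of saving them silently."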

Real-World Examples

Example 1: Job Board Scraping

"Scrape job listings from [job board]:
- Extract: title, company, location, salary, description
- Filter: remote jobs only, salary > $X
- Save: matching jobs to jobs.json
- Send: daily summary of new jobs"

Example 2: Research Paper Collection

"Collect research papers on [topic]:
- Visit academic database
- Search for papers
- Extract: title, authors, abstract, PDF link
- Download PDFs
- Create bibliography document"

Example 3: Stock Price Tracking

"Track stock prices every hour:
- Visit stock quote page
- Extract current price
- Compare with previous price
- Calculate percentage change
- Alert if change > 5%
- Log price history"

Troubleshooting

Elements Not Found

  • Wait for page to fully load
  • Check element selectors are correct
  • Use browser dev tools to verify
  • Try more specific selectors

Scraping Blocked

  • Check robots.txt
  • Add delays between requests
  • Try a different User-Agent string
  • Use the site's API if one is available

Data Quality Issues

  • Verify selectors are accurate
  • Handle missing data gracefully
  • Validate extracted data
  • Review and clean data

Next Steps

Now that you can scrape websites, explore these related topics:

🌐 Browser Automation

Complete browser automation guide


🔍 Research Assistant

Automate research tasks


🌐 Browser Reference

Complete browser features


💡 Pro Tip

Combine web scraping with other OpenClaw features for powerful workflows. Use memory to track scraping history, automation for scheduled tasks, and skills for specialized scraping needs.