This tutorial will show you how to use OpenClaw for web scraping and automated data collection. You'll learn to extract data from websites, monitor prices, collect content, and build reusable scraping workflows. Estimated time: 25-35 minutes.

What You'll Build

By the end of this tutorial, you'll have:

  • Automated web scraping workflows
  • Price monitoring system
  • Content extraction pipelines
  • Data collection and storage automation
  • Reusable scraping patterns

Prerequisites

Before starting, make sure OpenClaw is installed and configured, and that its browser automation is working (the Browser Automation tutorial linked under Next Steps covers this).

⚠️ Important: Always respect website terms of service and robots.txt. Use reasonable delays between requests and don't overload servers.

Step 1: Basic Data Extraction

Start with simple data extraction requests:

Extract Product Information
"Go to https://example.com/product and extract:
- Product name
- Price
- Description
- Rating
- Availability
Save this data to product.json"

OpenClaw will navigate to the page, identify the elements, extract the data, and save it in your preferred format.
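
The exact fields depend on what the page exposes, but the saved product.json will typically look something like this (all values are placeholders):

{
  "name": "Example Product",
  "price": "$49.99",
  "description": "Short product description extracted from the page",
  "rating": "4.5 out of 5",
  "availability": "In stock"
}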

Step 2: Price Monitoring

Set up automated price monitoring for products you're interested in:

Price Monitoring Setup
"Every day at 9am, check the price of [product URL]:
1. Visit the product page
2. Extract the current price
3. Compare with previous price
4. Send me a notification if the price drops below $X
5. Save price history to prices.csv"

OpenClaw will create a cron job to monitor prices automatically and notify you of changes.
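
The resulting job looks roughly like the entries shown in Step 6, for example:

{
  "schedule": "0 9 * * *",
  "command": "agent --message 'Check the price of [product URL] and update prices.csv'"
}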

Multi-Product Monitoring

Monitor multiple products:

"Monitor these products daily:
- https://example.com/product1
- https://example.com/product2
- https://example.com/product3
Create a price comparison report each day"

Step 3: Content Collection

Collect content from websites for research or archiving:

Article Collection

Collect Articles
"Visit https://example.com/articles and:
1. Extract all article links
2. Visit each article page
3. Extract title, author, content, date
4. Save each article as a markdown file
5. Create an index file with all articles"
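
The index file is simply a table of contents for the collected articles; a minimal version might look like this (titles, authors, and dates are placeholders):

# Collected Articles

- Example Article Title, by Example Author (2024-01-15): example-article-title.md
- Another Article Title, by Another Author (2024-01-16): another-article-title.md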

News Aggregation

"Collect today's top stories from:
- https://news.ycombinator.com
- https://www.reddit.com/r/programming
- https://example.com/tech-news
Extract headlines, summaries, and links
Create a daily news digest document"

Step 4: Scraping Multiple Pages

Automate scraping across multiple pages:

Pagination Scraping
"Scrape all products from this e-commerce site:
1. Visit the first page
2. Extract all product links
3. Click 'Next' button
4. Repeat for all pages
5. Visit each product page
6. Extract product details
7. Save all data to products.json"

OpenClaw will handle pagination, navigation, and data extraction automatically.
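
For large catalogs it helps to bound the crawl in the same request, for example by adding: "Stop after the first 20 pages and wait 3 seconds between page loads."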

Step 5: Data Processing and Storage

Process and store scraped data effectively:

Save as JSON

"Extract all products and save as JSON:
- Format: array of product objects
- Include: name, price, description, url
- Save to: ~/data/products.json"
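
The resulting file is a plain JSON array; one entry might look like this (values are placeholders):

[
  {
    "name": "Example Product",
    "price": "$49.99",
    "description": "Short product description",
    "url": "https://example.com/product1"
  }
]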

Save as CSV

"Extract price history and save as CSV:
- Columns: date, price, product_name
- Append to: ~/data/prices.csv"
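
Because the data is appended, each run adds new rows to the same file; the result looks roughly like:

date,price,product_name
2024-01-15,49.99,Example Product
2024-01-16,44.99,Example Product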

Save as Markdown

"Extract articles and save as markdown files:
- One file per article
- Filename: YYYY-MM-DD-title.md
- Include: title, content, source, date"
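
A saved article following this pattern might look like this (all values are placeholders):

2024-01-15-example-article-title.md

# Example Article Title

Source: https://example.com/articles/example-article-title
Date: 2024-01-15

Article content extracted from the page...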

Step 6: Scheduled Scraping

Set up automated, scheduled scraping tasks:

{
  "cron": {
    "jobs": [
      {
        "schedule": "0 9 * * *",
        "command": "agent --message 'Check prices for monitored products and send me a summary'"
      },
      {
        "schedule": "0 12 * * *",
        "command": "agent --message 'Collect today's news articles and create a digest'"
      }
    ]
  }
}
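
Both jobs use standard five-field cron syntax: "0 9 * * *" runs every day at 09:00 and "0 12 * * *" runs every day at 12:00.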

Step 7: Best Practices

Respectful Scraping

  • Check robots.txt before scraping
  • Add delays between requests (2-5 seconds)
  • Respect rate limits
  • Set a descriptive User-Agent header
  • Don't overload servers
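
You can build these constraints directly into your requests, for example: "Before scraping, check robots.txt and skip any disallowed paths. Wait 3 seconds between page loads and stop if the site starts returning errors."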

Error Handling

  • Handle page load failures
  • Verify elements exist before extraction
  • Retry failed requests
  • Log errors for debugging

Data Validation

  • Verify extracted data format
  • Check for missing fields
  • Validate data types
  • Handle edge cases
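
One simple approach is to make validation part of the request itself, for example: "After extracting the products, list any entries with missing names or prices, and flag prices that aren't valid numbers instead of saving them silently."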

Real-World Examples

Example 1: Job Board Scraping

"Scrape job listings from [job board]:
- Extract: title, company, location, salary, description
- Filter: remote jobs only, salary > $X
- Save: matching jobs to jobs.json
- Send: daily summary of new jobs"

Example 2: Research Paper Collection

"Collect research papers on [topic]:
- Visit academic database
- Search for papers
- Extract: title, authors, abstract, PDF link
- Download PDFs
- Create bibliography document"

Example 3: Stock Price Tracking

"Track stock prices every hour:
- Visit stock quote page
- Extract current price
- Compare with previous price
- Calculate percentage change
- Alert if change > 5%
- Log price history"

Troubleshooting

Elements Not Found

  • Wait for page to fully load
  • Check element selectors are correct
  • Use browser dev tools to verify
  • Try more specific selectors

Scraping Blocked

  • Check robots.txt
  • Add delays between requests
  • Try a different User-Agent string
  • Use the site's API if one is available

Data Quality Issues

  • Verify selectors are accurate
  • Handle missing data gracefully
  • Validate extracted data
  • Review and clean data

Next Steps

Now that you can scrape websites, explore these related topics:

🌐 Browser Automation

Complete browser automation guide


🔍 Research Assistant

Automate research tasks


🌐 Browser Reference

Complete browser features


💡 Pro Tip

Combine web scraping with other OpenClaw features for powerful workflows. Use memory to track scraping history, automation for scheduled tasks, and skills for specialized scraping needs.