Tuesday, July 22nd, 2025

Data is the new oil, and extracting valuable information from the web is a crucial skill for businesses and individuals alike. Whether you’re tracking competitor activity, monitoring news mentions, or performing market research, web crawling can provide you with the insights you need. While coding custom crawlers can be complex, tools like n8n empower you to build powerful automation workflows with a visual, low-code interface.

This guide will walk you through the process of building a keyword-based web crawler using n8n, helping you extract the data that matters most to you. Let’s dive in! 🚀


1. What is n8n and Why Use it for Web Crawling?

n8n (pronounced “n-eight-n”) is a powerful, open-source workflow automation tool. It allows you to connect various applications, APIs, and services to automate tasks and data flows without writing extensive code.

Why n8n for Web Crawling?

  • Visual Workflow Builder: Drag-and-drop interface makes building complex logic straightforward and intuitive. No more deciphering lines of code! 🎨
  • Low-Code/No-Code: Great for both developers and non-developers. You can achieve sophisticated results with minimal coding, or even none at all.
  • Powerful Built-in Nodes: n8n comes with pre-built nodes for HTTP requests, HTML parsing (Cheerio), data manipulation, and integration with databases, spreadsheets, and cloud services.
  • Self-Hostable or Cloud: You have the flexibility to run n8n on your own server for maximum control and data privacy, or use their cloud service.
  • Extensibility: For advanced scenarios, you can use the Code node to write custom JavaScript, giving you ultimate flexibility.
  • Scheduling: Automate your crawls to run at specific intervals (hourly, daily, weekly) to keep your data fresh. ⏰

2. Prerequisites

Before we start building our crawler, ensure you have the following:

  • An n8n Instance:
    • Local: You can run n8n on your machine using Docker or npm (example commands follow this list).
    • Cloud: Sign up for an n8n Cloud account.
  • Basic Understanding of n8n: Familiarity with nodes, connecting them, and workflow execution.
  • Basic Understanding of HTML/CSS Selectors (Optional but Recommended): Knowing how to identify elements on a webpage using browser developer tools will greatly assist in extracting specific data.
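
If you don’t have an instance yet, the two local options above each boil down to a single command. A minimal sketch, assuming Docker or a recent Node.js is already installed (adjust the port and data directory to your setup):

  # Option 1: official Docker image, persisting data in ~/.n8n
  docker run -it --rm --name n8n -p 5678:5678 -v ~/.n8n:/home/node/.n8n docker.n8n.io/n8nio/n8n

  # Option 2: run directly via npm's package runner
  npx n8n

Either way, the editor should then be reachable at http://localhost:5678 in your browser.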

3. Key n8n Nodes for Web Crawling

These are the primary nodes you’ll be using to build your keyword-based web crawler:

  • Start / Trigger Node:
    • Initiates your workflow. Can be a Manual trigger for testing, or a Cron node for scheduled runs.
  • HTTP Request Node:
    • Sends HTTP requests (GET, POST, etc.) to fetch the content of a webpage. This is where you specify the URL of the site you want to crawl.
    • Pro-Tip: Always set a User-Agent header to mimic a real browser to avoid getting blocked.
  • Cheerio Node:
    • The workhorse for parsing HTML. Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server. It allows you to select specific elements from the HTML content using CSS selectors (e.g., .post-title, #main-content, a[href]); a standalone sketch of this selector logic appears just after this list. 🎯
  • Set Node:
    • Used to add, modify, or remove data fields in your workflow. We’ll use this to add our “keyword found” flag or the search keyword itself.
  • Code (Function) Node:
    • For advanced logic, custom data transformations, or when Cheerio or Set nodes aren’t flexible enough. You can write JavaScript to perform regular expression matching (regex) for more sophisticated keyword detection.
  • IF Node:
    • Conditional logic. Useful for filtering data, for example, only keeping items where your keyword was found.
  • Write to File / Google Sheets / Database Node:
    • For storing your extracted data. Choose the destination that suits your needs (e.g., a CSV file, a Google Sheet, or a database like PostgreSQL). 💾
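
To make the Cheerio node’s role more concrete, here is a minimal standalone sketch of the same selector logic using the cheerio npm library directly. The class names, fields, and sample HTML are placeholders for illustration only; in n8n you configure this in the node’s fields rather than writing it by hand:

  // Load raw HTML and pull out structured data with CSS selectors (what the node does for you)
  const cheerio = require('cheerio');

  const html = '<div class="post"><a class="post-title" href="/ai-news">AI News</a></div>';
  const $ = cheerio.load(html);

  const articles = [];
  $('.post-title').each((_, el) => {
    articles.push({
      articleTitle: $(el).text().trim(), // the element's text content
      articleUrl: $(el).attr('href'),    // the value of its href attribute
    });
  });

  console.log(articles); // [ { articleTitle: 'AI News', articleUrl: '/ai-news' } ]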

4. Building a Keyword Crawling Workflow: A Step-by-Step Example

Let’s build a practical example: We’ll crawl a hypothetical tech blog to find articles related to a specific keyword, say “AI” or “Automation.”

Our Goal:

  1. Visit a blog’s main page.
  2. Extract the titles and URLs of articles.
  3. Check if the article title (or snippet) contains our target keyword.
  4. Save the relevant articles to a CSV file.

Scenario Walkthrough:

Imagine we want to crawl https://n8n.io/blog/ to find articles related to “Artificial Intelligence.”

  1. Start Node (Manual Trigger)

    • Drag and drop a Start node onto your canvas.

    • Select Manual to easily test your workflow.

    • Why: This node gets your workflow going. For recurring crawls, you’d use a Cron node.

  2. HTTP Request Node (Fetch the Blog Page)

    • Add an HTTP Request node and connect it to the Start node.

    • Method: GET

    • URL: https://n8n.io/blog/ (Replace with your target blog URL)

    • Headers: Click “Add Header”

      • Name: User-Agent
      • Value: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 (Using a common user agent helps avoid detection as a bot).
    • Why: This node sends a request to the blog’s server and retrieves the entire HTML content of the page.

  3. Cheerio Node (Parse HTML and Extract Data)

    • Add a Cheerio node and connect it to the HTTP Request node.

    • Operation: Extract Data

    • HTML Data: Current Node (This will automatically use the output from the HTTP Request node).

    • Add Fields:

      • Click “Add Field”
      • Field Name: articleTitle
      • CSS Selector: .blog-post-card__title a (You’ll need to inspect the target website’s HTML to find the correct selector for titles. Use your browser’s developer tools!)
      • Click “Add Field” again
      • Field Name: articleUrl
      • CSS Selector: .blog-post-card__title a (Often the URL is linked to the title element)
      • Attribute: href (Specify that you want the href attribute’s value).
      • (Optional) Field Name: articleSnippet
      • (Optional) CSS Selector: .blog-post-card__description (Find a selector for the article’s summary or first paragraph).
    • Why: The Cheerio node takes the raw HTML and lets you pluck out specific pieces of information (like titles, links, summaries) using CSS selectors, turning them into structured data.

  4. Code Node (Keyword Detection – Advanced)

    • A simple includes() check could be done with an expression in a Set or IF node, but a Code node offers more robust regex matching for keywords.
    • Add a Code node and connect it to the Cheerio node.
    • Code:

      // Define the keyword and build a whole-word, case-insensitive regex for it
      const keyword = 'AI'; // Your target keyword
      const regex = new RegExp(`\\b${keyword}\\b`, 'i'); // \b for whole word, i for case-insensitive

      // Check every article item produced by the Cheerio node
      for (const item of $input.all()) {
        const title = item.json.articleTitle || '';
        const snippet = item.json.articleSnippet || '';

        const keywordFound = regex.test(title) || regex.test(snippet);

        // Attach the results so downstream nodes (IF, Write to File) can use them
        item.json.keywordDetected = keywordFound;
        item.json.searchedKeyword = keyword;
      }

      // Return all items, now enriched with the two new fields
      return $input.all();
    • Why: This node iterates through each article extracted by Cheerio. It checks if either the title or the snippet contains our keyword using a case-insensitive regular expression (regex.test()). It then adds two new fields: keywordDetected (boolean) and searchedKeyword (for reference).
  5. IF Node (Filter by Keyword)

    • Add an IF node and connect it to the Code node.

    • Conditions:

      • Value 1: {{ $json.keywordDetected }}
      • Operation: Is True
      • Value 2: (Leave blank; a Boolean “Is True” check needs no second value)
    • Why: This node acts as a filter. Only articles where keywordDetected is true (meaning the keyword was found) will pass through the “True” branch. The rest will go down the “False” branch (which you can leave unconnected or use for logging rejected items).

  6. Write to File Node (Save Results)

    • Add a Write to File node and connect it to the “True” branch of the IF node.

    • File Path: keyword_articles.csv (You can specify a full path like /tmp/keyword_articles.csv on Linux or C:\temp\keyword_articles.csv on Windows, or just a filename if n8n can write to its current directory).

    • Mode: Append (to add new articles to the file without overwriting).

    • File Content: CSV

    • Add Fields: Map the fields you want to save:

      • articleTitle
      • articleUrl
      • searchedKeyword
    • Why: This node takes the filtered data and saves it into a structured CSV file, ready for analysis. You could also choose Google Sheets or a Database node here depending on your needs.

Visualizing the Workflow:

Your n8n workflow canvas should look something like this (conceptually):

Start ➡️ HTTP Request ➡️ Cheerio ➡️ Code (Keyword Detect) ➡️ IF (Keyword Found?) ➡️ Write to File (via the “True” branch)
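
One common extension is matching more than one keyword at a time (our stated goal mentioned both “AI” and “Automation”). The Code node from step 4 can be adapted along these lines; this is a minimal sketch that assumes the same articleTitle and articleSnippet fields coming out of the Cheerio node:

  // Flag items that match any of several keywords (whole word, case-insensitive)
  const keywords = ['AI', 'Automation'];
  const regexes = keywords.map(k => new RegExp(`\\b${k}\\b`, 'i'));

  for (const item of $input.all()) {
    const text = `${item.json.articleTitle || ''} ${item.json.articleSnippet || ''}`;
    const matched = keywords.filter((k, i) => regexes[i].test(text));

    item.json.keywordDetected = matched.length > 0; // still works with the IF node
    item.json.matchedKeywords = matched.join(', '); // e.g. "AI, Automation"
  }

  return $input.all();

The IF node and Write to File node downstream don’t need to change; you would simply map the extra matchedKeywords field if you want it in the CSV.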


5. Advanced Considerations & Best Practices

As you scale your crawling efforts, consider these points:

  • Pagination: Many websites split content across multiple pages. To handle this, you’ll need to:
    • Identify the URL pattern for pagination (e.g., /blog?page=1, /blog?page=2).
    • Use a Code node or Set node to generate a list of all page URLs (see the sketch after this list).
    • Use a Split In Batches node, then loop back with another HTTP Request and Cheerio for each page.
  • Dynamic Content (JavaScript-rendered): The HTTP Request node only fetches the initial HTML. If content loads via JavaScript after the page loads (e.g., infinite scroll, single-page applications), you’ll need a headless-browser approach, such as the community Puppeteer or Playwright nodes. These launch a real browser in the background to render the page fully before you extract data. 🎭
  • Error Handling: Use n8n’s error-handling features, such as a node’s “Continue On Fail” setting or a dedicated Error Workflow (via the Error Trigger node), to handle problems (e.g., network issues, website changes) gracefully without stopping your entire workflow.
  • Rate Limiting & Delays: To avoid overwhelming the target server or getting blocked, use the Delay node between HTTP Request calls. Be a good net citizen! 🐢
  • Proxies: For large-scale crawling, consider using proxy services to rotate IP addresses and avoid IP bans.
  • robots.txt: Always check the website’s robots.txt file (e.g., https://n8n.io/robots.txt) before crawling. This file tells crawlers which parts of the site they are allowed to access. Respect these rules!
  • Legal & Ethical Considerations: Be aware of terms of service, data privacy regulations (like GDPR, CCPA), and copyright laws. Not all publicly available data is fair game for scraping.
  • Data Cleaning and Transformation: Once you have the raw data, you might need additional Set or Code nodes to clean up text, remove unwanted characters, or reformat values before saving.
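
To illustrate the pagination point above, a small Code node placed before the HTTP Request node can generate the list of page URLs to loop over. A minimal sketch, assuming a simple ?page=N pattern and a known page count (the URL and count here are placeholders):

  // Emit one item per page URL; feed these into Split In Batches + HTTP Request
  const baseUrl = 'https://example.com/blog'; // placeholder: use your target site
  const totalPages = 5;                       // adjust to the real number of pages

  return Array.from({ length: totalPages }, (_, i) => ({
    json: { url: `${baseUrl}?page=${i + 1}` },
  }));

In the HTTP Request node you would then reference {{ $json.url }} as the URL so each iteration fetches a different page.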

6. Conclusion

n8n offers an incredibly powerful and accessible way to automate web crawling, especially for keyword-based data extraction. Its visual interface, robust node library, and flexibility allow you to build sophisticated workflows without being a seasoned programmer.

Start with simple workflows, experiment with different nodes, and gradually add complexity as you become more comfortable. The world of web data is waiting for you to unlock its potential! Happy crawling! 🕸️✨
