Data is the new oil, and extracting valuable information from the web is a crucial skill for businesses and individuals alike. Whether you’re tracking competitor activity, monitoring news mentions, or performing market research, web crawling can provide you with the insights you need. While coding custom crawlers can be complex, tools like n8n empower you to build powerful automation workflows with a visual, low-code interface.
This guide will walk you through the process of building a keyword-based web crawler using n8n, helping you extract the data that matters most to you. Let’s dive in! 🚀
1. What is n8n and Why Use it for Web Crawling?
n8n (pronounced “n-eight-n”) is a powerful, open-source workflow automation tool. It allows you to connect various applications, APIs, and services to automate tasks and data flows without writing extensive code.
Why n8n for Web Crawling?
- Visual Workflow Builder: Drag-and-drop interface makes building complex logic straightforward and intuitive. No more deciphering lines of code! 🎨
- Low-Code/No-Code: Great for both developers and non-developers. You can achieve sophisticated results with minimal coding, or even none at all.
- Powerful Built-in Nodes: n8n comes with pre-built nodes for HTTP requests, HTML parsing (Cheerio), data manipulation, and integration with databases, spreadsheets, and cloud services.
- Self-Hostable or Cloud: You have the flexibility to run n8n on your own server for maximum control and data privacy, or use their cloud service.
- Extensibility: For advanced scenarios, you can use the `Code` node to write custom JavaScript, giving you ultimate flexibility.
- Scheduling: Automate your crawls to run at specific intervals (hourly, daily, weekly) to keep your data fresh. ⏰
2. Prerequisites
Before we start building our crawler, ensure you have the following:
- An n8n Instance:
  - Local: You can run n8n on your machine using Docker or npm.
  - Cloud: Sign up for an n8n Cloud account.
- Basic Understanding of n8n: Familiarity with nodes, connecting them, and workflow execution.
- Basic Understanding of HTML/CSS Selectors (Optional but Recommended): Knowing how to identify elements on a webpage using browser developer tools will greatly assist in extracting specific data.
3. Key n8n Nodes for Web Crawling
These are the primary nodes you’ll be using to build your keyword-based web crawler:
- `Start`/`Trigger` Node: Initiates your workflow. Can be a `Manual` trigger for testing, or a `Cron` node for scheduled runs.
- `HTTP Request` Node: Sends HTTP requests (GET, POST, etc.) to fetch the content of a webpage. This is where you specify the URL of the site you want to crawl.
  - Pro-Tip: Always set a `User-Agent` header to mimic a real browser to avoid getting blocked.
- `Cheerio` Node: The workhorse for parsing HTML. Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server. It allows you to select specific elements from the HTML content using CSS selectors (e.g., `.post-title`, `#main-content`, `a[href]`). 🎯
- `Set` Node: Used to add, modify, or remove data fields in your workflow. We’ll use this to add our “keyword found” flag or the search keyword itself.
- `Code (Function)` Node: For advanced logic, custom data transformations, or when `Cheerio` or `Set` nodes aren’t flexible enough. You can write JavaScript to perform regular expression matching (regex) for more sophisticated keyword detection.
- `IF` Node: Conditional logic. Useful for filtering data, for example, only keeping items where your keyword was found.
- `Write to File`/`Google Sheets`/`Database` Node: For storing your extracted data. Choose the destination that suits your needs (e.g., a CSV file, a Google Sheet, or a database like PostgreSQL). 💾
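To make the selector idea concrete, here is a minimal standalone sketch using the cheerio npm package (the same library the node description above refers to), run outside of n8n. The HTML snippet and selectors are illustrative only, not from a real site:

```javascript
// Minimal cheerio sketch: selecting elements with CSS selectors.
// The HTML and the ".post-title" selector below are made up for illustration.
const cheerio = require('cheerio');

const html = `
  <article><h2 class="post-title"><a href="/posts/ai-trends">AI Trends</a></h2></article>
  <article><h2 class="post-title"><a href="/posts/automation">Automation Basics</a></h2></article>
`;

const $ = cheerio.load(html);

// '.post-title a' matches every link inside an element with class "post-title"
const results = $('.post-title a')
  .map((i, el) => ({
    title: $(el).text().trim(), // the visible link text
    url: $(el).attr('href'),    // the href attribute's value
  }))
  .get();

console.log(results);
// [ { title: 'AI Trends', url: '/posts/ai-trends' },
//   { title: 'Automation Basics', url: '/posts/automation' } ]
```

This is exactly the kind of selector-to-structured-data step the node performs for you inside the workflow.
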
4. Building a Keyword Crawling Workflow: A Step-by-Step Example
Let’s build a practical example: We’ll crawl a hypothetical tech blog to find articles related to a specific keyword, say “AI” or “Automation.”
Our Goal:
- Visit a blog’s main page.
- Extract the titles and URLs of articles.
- Check if the article title (or snippet) contains our target keyword.
- Save the relevant articles to a CSV file.
Scenario Walkthrough:
Imagine we want to crawl `https://n8n.io/blog/` to find articles related to “Artificial Intelligence.”

Step 1: Start Node (Manual Trigger)
- Drag and drop a `Start` node onto your canvas.
- Select `Manual` to easily test your workflow.
- Why: This node gets your workflow going. For recurring crawls, you’d use a `Cron` node.

Step 2: HTTP Request Node (Fetch the Blog Page)
- Add an `HTTP Request` node and connect it to the `Start` node.
- Method: `GET`
- URL: `https://n8n.io/blog/` (Replace with your target blog URL)
- Headers: Click “Add Header”
  - Name: `User-Agent`
  - Value: `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36` (Using a common user agent helps avoid detection as a bot.)
- Why: This node sends a request to the blog’s server and retrieves the entire HTML content of the page.

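Roughly speaking, this step is the same as making a plain GET request with a browser-like User-Agent header. A minimal Node.js sketch (assuming Node 18+, which provides a global `fetch`), using the URL and header from the step above:

```javascript
// Fetch the raw HTML of the blog page with a browser-like User-Agent header.
// Assumes Node.js 18+ (global fetch). URL and header values mirror the step above.
const url = 'https://n8n.io/blog/';

async function fetchBlogPage() {
  const response = await fetch(url, {
    method: 'GET',
    headers: {
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    },
  });

  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }

  return response.text(); // this raw HTML is what the next node parses
}

fetchBlogPage().then((html) => console.log(html.slice(0, 200))); // peek at the first 200 characters
```
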
Step 3: Cheerio Node (Parse HTML and Extract Data)
- Add a `Cheerio` node and connect it to the `HTTP Request` node.
- Operation: `Extract Data`
- HTML Data: `Current Node` (This will automatically use the output from the `HTTP Request` node.)
- Add Fields:
  - Click “Add Field”
  - Field Name: `articleTitle`
  - CSS Selector: `.blog-post-card__title a` (You’ll need to inspect the target website’s HTML to find the correct selector for titles. Use your browser’s developer tools!)
  - Click “Add Field” again
  - Field Name: `articleUrl`
  - CSS Selector: `.blog-post-card__title a` (Often the URL is linked to the title element)
  - Attribute: `href` (Specify that you want the `href` attribute’s value.)
  - (Optional) Field Name: `articleSnippet`
  - (Optional) CSS Selector: `.blog-post-card__description` (Find a selector for the article’s summary or first paragraph.)
- Why: The `Cheerio` node takes the raw HTML and lets you pluck out specific pieces of information (like titles, links, summaries) using CSS selectors, turning them into structured data; an example of that structured output is sketched below.

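After this step, each extracted article becomes its own item in the workflow. Assuming the selectors above match two posts, the output would look roughly like the hypothetical items below (titles, URLs, and snippets are invented purely for illustration):

```javascript
// Hypothetical shape of the data after extraction: one object per article.
// All values below are placeholders, not real posts.
const exampleItems = [
  {
    articleTitle: 'Example Post About AI Workflows',
    articleUrl: '/blog/example-post-about-ai-workflows/',
    articleSnippet: 'A placeholder summary mentioning AI and automation...',
  },
  {
    articleTitle: 'Example Post About Webhooks',
    articleUrl: '/blog/example-post-about-webhooks/',
    articleSnippet: 'A placeholder summary about triggering workflows...',
  },
];
```
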
Step 4: Code Node (Keyword Detection – Advanced)
- While the `Set` node can do a simple `includes()`, a `Code` node offers more robust regex matching for keywords.
- Add a `Code` node and connect it to the `Cheerio` node.
- Code:

```javascript
const keyword = 'AI'; // Your target keyword
const regex = new RegExp(`\\b${keyword}\\b`, 'i'); // \b for whole word, i for case-insensitive

for (const item of $input.all()) {
  const title = item.json.articleTitle || '';
  const snippet = item.json.articleSnippet || '';
  const keywordFound = regex.test(title) || regex.test(snippet);

  item.json.keywordDetected = keywordFound;
  item.json.searchedKeyword = keyword;
}

return $input.all();
```

- Why: This node iterates through each article extracted by `Cheerio`. It checks if either the title or the snippet contains our `keyword` using a case-insensitive regular expression (`regex.test()`). It then adds two new fields: `keywordDetected` (boolean) and `searchedKeyword` (for reference).

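One detail worth highlighting is the `\b` word boundary in that regex: it keeps the keyword from matching inside longer words. A quick illustration of how it behaves:

```javascript
// Behavior of the word-boundary, case-insensitive regex used in the Code node.
const keyword = 'AI';
const regex = new RegExp(`\\b${keyword}\\b`, 'i');

console.log(regex.test('New AI features announced'));  // true  – "AI" appears as a whole word
console.log(regex.test('Top ai automation tips'));     // true  – case-insensitive match
console.log(regex.test('Maintaining legacy systems')); // false – "ai" inside "Maintaining" is not a whole word
```
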
Step 5: IF Node (Filter by Keyword)
- Add an `IF` node and connect it to the `Code` node.
- Conditions:
  - Value 1: `{{ $json.keywordDetected }}`
  - Operation: `Is True`
  - Value 2: (Leave blank)
- Why: This node acts as a filter. Only articles where `keywordDetected` is `true` (meaning the keyword was found) will pass through the “True” branch. The rest will go down the “False” branch (which you can leave unconnected or use for logging rejected items).

Step 6: Write to File Node (Save Results)
- Add a `Write to File` node and connect it to the “True” branch of the `IF` node.
- File Path: `keyword_articles.csv` (You can specify a full path like `/tmp/keyword_articles.csv` on Linux or `C:\temp\keyword_articles.csv` on Windows, or just a filename if n8n can write to its current directory.)
- Mode: `Append` (to add new articles to the file without overwriting).
- File Content: `CSV`
- Add Fields: Map the fields you want to save: `articleTitle`, `articleUrl`, `searchedKeyword`
- Why: This node takes the filtered data and saves it into a structured CSV file, ready for analysis. You could also choose a `Google Sheets` or `Database` node here depending on your needs.

Visualizing the Workflow:
Your n8n workflow canvas should look something like this (conceptually):
`Start` ➡️ `HTTP Request` ➡️ `Cheerio` ➡️ `Code (Keyword Detect)` ➡️ `IF (Keyword Found?)` ➡️ (True branch) `Write to File`
5. Advanced Considerations & Best Practices
As you scale your crawling efforts, consider these points:
- Pagination: Many websites split content across multiple pages. To handle this, you’ll need to:
  - Identify the URL pattern for pagination (e.g., `/blog?page=1`, `/blog?page=2`).
  - Use a `Code` node or `Set` node to generate a list of all page URLs (see the sketch after this list).
  - Use a `Split In Batches` node, then loop back with another `HTTP Request` and `Cheerio` for each page.
- Dynamic Content (JavaScript-rendered): The `HTTP Request` node only fetches the initial HTML. If content loads via JavaScript after the page loads (e.g., infinite scroll, single-page applications), you’ll need more advanced tools like the `Puppeteer` or `Playwright` nodes. These nodes launch a headless browser to render the page fully before you can extract data. 🎭
- Error Handling: Use `Try/Catch` nodes to gracefully handle errors (e.g., network issues, website changes) without stopping your entire workflow.
- Rate Limiting & Delays: To avoid overwhelming the target server or getting blocked, use the `Delay` node between `HTTP Request` calls. Be a good net citizen! 🐢
- Proxies: For large-scale crawling, consider using proxy services to rotate IP addresses and avoid IP bans.
- `robots.txt`: Always check the website’s `robots.txt` file (e.g., `https://n8n.io/robots.txt`) before crawling. This file tells crawlers which parts of the site they are allowed to access. Respect these rules!
- Legal & Ethical Considerations: Be aware of terms of service, data privacy regulations (like GDPR, CCPA), and copyright laws. Not all publicly available data is fair game for scraping.
- Data Cleaning and Transformation: Once you have the raw data, you might need additional `Set` or `Code` nodes to clean up text, remove unwanted characters, or reformat values before saving.
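To make the pagination point concrete, here is a minimal `Code` node sketch that generates one item per page URL, ready to feed into a `Split In Batches` loop and a second `HTTP Request`. The base URL and page count are assumptions to adapt to your target site:

```javascript
// Code node sketch: emit one item per paginated URL so downstream nodes
// can fetch and parse each page in turn.
// baseUrl and totalPages are assumptions – adjust them for your target site.
const baseUrl = 'https://example.com/blog';
const totalPages = 5;

const items = [];
for (let page = 1; page <= totalPages; page++) {
  items.push({
    json: {
      pageUrl: `${baseUrl}?page=${page}`,
    },
  });
}

return items;
```
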
6. Conclusion
n8n offers an incredibly powerful and accessible way to automate web crawling, especially for keyword-based data extraction. Its visual interface, robust node library, and flexibility allow you to build sophisticated workflows without being a seasoned programmer.
Start with simple workflows, experiment with different nodes, and gradually add complexity as you become more comfortable. The world of web data is waiting for you to unlock its potential! Happy crawling! 🕸️✨