Web Scraper

The Web Extractor node is ideal for pulling information from one or more web pages and reusing it in other parts of your flow. Below, we explore its functions and configurations so you can get the most out of it within AI Content Labs.

What is the Web Extractor Node and what is it used for?

The Web Extractor (labeled Web Scraper in the interface) collects content from one or more websites so other nodes can process it. For example, you can extract the text of articles, pull products from an online store, or even capture a screenshot that a Prompt Node with a vision model can analyze later.

Web Scraper Node Overview.

Its main advantages are:

  • Flexibility in sources: you can use internal scraping or a specialized external service.
  • Ease of integration: it connects with other nodes, such as the Text Splitter Node to divide the text into parts and process them independently.
  • Customization: define what data to collect (full content, headers, etc.) and include or exclude images, links, and more.

Configurations

To start using the Web Extractor, you must select the content source and set the options for how the data will be obtained. These are the main configurations:

1. Source

The available sources may vary depending on the plan and the providers active in your account. For example:

  • Url Content AI Content Labs
  • Url Content Frase
  • Url Content ScrapeOwl
  • Screenshot ScrapeOwl
  • Url Content Scrape.do
  • Screenshot Scrape.do

List of providers and scraping options.

Choose the option that best suits the protection level or complexity of the site you want to scrape. For example, a “Screenshot” source is useful when you plan to run a visual analysis afterward.

2. Variables and URLs

In the URLs field, you can add one or more addresses to extract information. It is even possible to use results from previous nodes to dynamically generate the list of sites.
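Conceptually, dynamic URL generation works like the Python sketch below. Note that the slugs, the base URL, and the templating syntax are all made-up examples for illustration; the actual variable syntax depends on your AI Content Labs flow.

```python
# Conceptual sketch (not the actual AI Content Labs syntax): building the
# URL list dynamically from a previous node's output.
previous_node_output = ["pricing", "features", "changelog"]  # hypothetical results

base_url = "https://example.com/{page}"
urls = [base_url.format(page=page) for page in previous_node_output]

for url in urls:
    print(url)
# https://example.com/pricing
# https://example.com/features
# https://example.com/changelog
```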

3. Scraping Options

There are several functions to adjust how the information will be obtained:

Scraper Node Configuration Options.

  • Retry: if one provider fails to extract the information, the node retries with another.
  • Premium Proxies: for websites with advanced protection, enable this option to route requests through a specialized proxy service.
  • Render Javascript: renders the page's scripts before extraction, necessary when the content is generated by JavaScript.
  • CSS Elements: extracts only specific sections of the site, for example a particular div containing the content you care about.
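To see what the CSS Elements option does conceptually, here is a stdlib-only Python sketch that keeps just the text inside a chosen element and discards the rest of the page. A real scraper supports full CSS selectors; this simplified version matches a single tag/class pair, and the HTML snippet is invented for the example.

```python
from html.parser import HTMLParser

class DivExtractor(HTMLParser):
    """Collect the text inside a given tag with a given class attribute."""
    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.depth = 0          # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:         # nested same-name tag: track depth
                self.depth += 1
        elif tag == self.tag and dict(attrs).get("class") == self.cls:
            self.depth = 1              # entered the target element

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:                  # only keep text inside the target
            self.chunks.append(data)

html = '<div class="nav">Menu</div><div class="content">Article <b>body</b></div>'
parser = DivExtractor("div", "content")
parser.feed(html)
print("".join(parser.chunks).strip())   # prints "Article body"
```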

4. Data to Return

Decide what information will be returned:

  • Raw Content: includes all the HTML of the page.
  • Headers: limited to H1, H2 tags, etc., useful for identifying the structure of an article.
  • Exclude Images or Exclude Links: selective removal to focus on the text.
  • Word Count: calculates the total number of words, ideal for measuring the length of the content.
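The Headers and Word Count options can be pictured as post-processing steps on the raw HTML, roughly like this sketch. Real scrapers use a proper HTML parser; the regex and sample markup below are only for illustration on well-formed input.

```python
import re

# Invented sample page for the example.
html = "<h1>Title</h1><p>First paragraph here.</p><h2>Section</h2><p>More text.</p>"

# "Headers"-style output: only the heading tags.
headers = re.findall(r"<h[1-6][^>]*>(.*?)</h[1-6]>", html, re.S)

# "Word Count"-style output: strip tags, then count words.
text = re.sub(r"<[^>]+>", " ", html)
word_count = len(text.split())

print(headers)      # ['Title', 'Section']
print(word_count)   # 7
```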

5. Output Settings

The Web Extractor includes the same output options as other nodes, such as hiding its result or skipping the webhook. The notable option here is Separator Pattern, which inserts a pattern of your choice between the content of each URL. If you scrape several pages, you can then split the combined result apart easily with a Text Splitter Node.

Output settings, with focus on Separator Pattern.
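The join-then-split round trip behind Separator Pattern can be sketched in a few lines of Python. The separator string and the page contents are arbitrary examples; any pattern works as long as it never appears inside the scraped content itself.

```python
# Example separator; choose any pattern unlikely to occur in real content.
SEPARATOR = "\n-----\n"

# Invented results for two scraped URLs.
page_results = {
    "https://example.com/a": "Content of page A",
    "https://example.com/b": "Content of page B",
}

# What the scraper node emits: one string with the pattern between pages.
combined = SEPARATOR.join(page_results.values())

# What a Text Splitter node can do later: recover the individual pages.
parts = combined.split(SEPARATOR)
print(len(parts))   # 2
```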

Usage Tips

  • Combine with Text Splitter Node: if you extract content from multiple links, configure a pattern in “Separator Pattern” and then use a “Text Splitter” node to divide the result into more manageable sections.
  • Include a Prompt Node: after scraping, pass the information to a “Prompt Node” so that a language model can perform a summary, translation or analysis of the extracted text.
  • Screenshots for Visual Intelligence: if you need to analyze the appearance of the website or graphic elements, select a “Screenshot” provider and then connect it to a Prompt Node with a vision model for processing.
  • Optimize extraction: enable CSS selectors and Render Javascript only when necessary. This saves time and resources in your flow.

In short, the Web Extractor (Scraper) node is your starting point for working with external content, offering a wide range of customization options. By combining it with other nodes, you get a solid, automated flow for collecting, cleaning, and processing information from the web.