Web Scraper

The Web Extractor node is ideal for pulling information from one or more web pages and reusing it in other parts of your flow. Below, we explore its functions and configurations so you can get the most out of it within AI Content Labs.

What is the Web Extractor Node and what is it used for?

The Web Extractor (labeled Web Scraper in the interface) collects content from one or more websites so other nodes can process it. For example, you can extract the text of articles, pull products from an online store, or even capture a screenshot that a Prompt Node with a vision model can analyze later.

Web Scraper Node Overview.

Its main advantages are:

  • Flexibility in sources: you can use internal scraping or a specialized external service.
  • Ease of integration: it connects with other nodes, such as the Text Splitter Node to divide the text into parts and process them independently.
  • Customization: define what data to collect (full content, headers, etc.) and include or exclude images, links, and more.

Configurations

To start using the Web Extractor, you must select the content source and set the options for how the data will be obtained. These are the main configurations:

1. Source

The available sources may vary depending on the plan and the providers active in your account. For example:

  • Url Content AI Content Labs
  • Url Content Frase
  • Url Content ScrapeOwl
  • Screenshot ScrapeOwl
  • Url Content Scrape.do
  • Screenshot Scrape.do

List of providers and scraping options.

Choose the option that best suits the protection level or complexity of the site you want to scrape. For example, a “Screenshot” source is useful when you plan to run a visual analysis afterward.

2. Variables and URLs

In the URLs field, you can add one or more addresses to extract information. It is even possible to use results from previous nodes to dynamically generate the list of sites.
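Conceptually, dynamic URL generation works like the Python sketch below. Note that the slugs, the base URL, and the templating syntax are all made-up examples for illustration; the actual variable syntax depends on your AI Content Labs flow.

```python
# Conceptual sketch (not the actual AI Content Labs syntax): building the
# URL list dynamically from a previous node's output.
previous_node_output = ["pricing", "features", "changelog"]  # hypothetical results

base_url = "https://example.com/{page}"
urls = [base_url.format(page=page) for page in previous_node_output]

for url in urls:
    print(url)
# https://example.com/pricing
# https://example.com/features
# https://example.com/changelog
```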

3. Scraping Options

There are several functions to adjust how the information will be obtained:

Scraper Node Configuration Options.

  • Retry: if one provider fails to extract the information, the node retries with another.
  • Premium Proxies: for websites with advanced protection, enable this option to route requests through a specialized proxy service.
  • Render Javascript: renders the page's scripts before extraction, necessary when the content is generated by JavaScript.
  • CSS Elements: extracts only specific sections of the site, for example a particular div containing the content you care about.
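To see what the CSS Elements option does conceptually, here is a stdlib-only Python sketch that keeps just the text inside a chosen element and discards the rest of the page. A real scraper supports full CSS selectors; this simplified version matches a single tag/class pair, and the HTML snippet is invented for the example.

```python
from html.parser import HTMLParser

class DivExtractor(HTMLParser):
    """Collect the text inside a given tag with a given class attribute."""
    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.depth = 0          # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:         # nested same-name tag: track depth
                self.depth += 1
        elif tag == self.tag and dict(attrs).get("class") == self.cls:
            self.depth = 1              # entered the target element

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:                  # only keep text inside the target
            self.chunks.append(data)

html = '<div class="nav">Menu</div><div class="content">Article <b>body</b></div>'
parser = DivExtractor("div", "content")
parser.feed(html)
print("".join(parser.chunks).strip())   # prints "Article body"
```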

4. Data to Return

Decide what information will be returned:

  • Raw Content: includes all the HTML of the page.
  • Headers: limited to H1, H2 tags, etc., useful for identifying the structure of an article.
  • Exclude Images or Exclude Links: selective removal to focus on the text.
  • Word Count: calculates the total number of words, ideal for measuring the length of the content.
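The Headers and Word Count options can be pictured as post-processing steps on the raw HTML, roughly like this sketch. Real scrapers use a proper HTML parser; the regex and sample markup below are only for illustration on well-formed input.

```python
import re

# Invented sample page for the example.
html = "<h1>Title</h1><p>First paragraph here.</p><h2>Section</h2><p>More text.</p>"

# "Headers"-style output: only the heading tags.
headers = re.findall(r"<h[1-6][^>]*>(.*?)</h[1-6]>", html, re.S)

# "Word Count"-style output: strip tags, then count words.
text = re.sub(r"<[^>]+>", " ", html)
word_count = len(text.split())

print(headers)      # ['Title', 'Section']
print(word_count)   # 7
```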

5. Output Settings

The Web Extractor includes the same output options as other nodes, such as hiding its result or skipping the webhook. The notable option here is Separator Pattern, which inserts a pattern of your choice between the content of each URL. If you scrape several pages, you can then split the combined result apart easily with a Text Splitter Node.

Output settings, with focus on Separator Pattern.
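The join-then-split round trip behind Separator Pattern can be sketched in a few lines of Python. The separator string and the page contents are arbitrary examples; any pattern works as long as it never appears inside the scraped content itself.

```python
# Example separator; choose any pattern unlikely to occur in real content.
SEPARATOR = "\n-----\n"

# Invented results for two scraped URLs.
page_results = {
    "https://example.com/a": "Content of page A",
    "https://example.com/b": "Content of page B",
}

# What the scraper node emits: one string with the pattern between pages.
combined = SEPARATOR.join(page_results.values())

# What a Text Splitter node can do later: recover the individual pages.
parts = combined.split(SEPARATOR)
print(len(parts))   # 2
```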

Usage Tips

  • Combine with Text Splitter Node: if you extract content from multiple links, configure a pattern in “Separator Pattern” and then use a “Text Splitter” node to divide the result into more manageable sections.
  • Include a Prompt Node: after scraping, pass the information to a “Prompt Node” so that a language model can perform a summary, translation or analysis of the extracted text.
  • Screenshots for Visual Intelligence: if you need to analyze the appearance of the website or graphic elements, select a “Screenshot” provider and then connect it to a Prompt Node with a vision model for processing.
  • Optimize extraction: enable CSS selectors and Render Javascript only when necessary. This saves time and resources in your flow.

In short, the Web Extractor (Scraper) node is your starting point for working with external content, offering a wide range of customization options. By combining it with other nodes, you get a solid, automated flow for collecting, cleaning, and processing information from the web.