AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
Abstract
The paper introduces AutoCrawler, a two-stage framework that leverages the hierarchical structure of HTML for progressive understanding to generate executable action sequences for web crawler tasks. The key contributions are:
- Proposing the web crawler generation task and the paradigm of utilizing large language models (LLMs) to generate crawlers.
- Introducing the AutoCrawler framework with a heuristic algorithm of top-down and step-back operations to progressively prune and refine the HTML content.
- Demonstrating the effectiveness of the AutoCrawler framework through comprehensive experiments with multiple LLMs.
Q&A
[01] Introduction
1. What are the limitations of traditional web automation methods like wrappers?
- Traditional web automation methods like wrappers have limited adaptability and scalability when faced with new or altered website structures. They depend on manually annotated examples for each website, which is not scalable.
2. What are the challenges in using LLMs for web automation?
- LLMs have limited exposure to markup languages like HTML during pre-training, which limits their understanding of HTML's complex structure and semantics.
- HTML contains both structured (tags, attributes) and unstructured (textual content) elements, making it challenging for LLMs to accurately capture and utilize the hierarchical structure.
- While LLMs excel at textual comprehension, they still fall short in understanding the structural information of lengthy HTML documents.
3. What is the key idea behind the AutoCrawler framework?
- AutoCrawler leverages the hierarchical structure of HTML for progressive understanding, using a heuristic algorithm of top-down and step-back operations to refine and prune the HTML content.
- This process helps correct erroneous executions and progressively identify the relevant parts of the HTML to successfully extract the target information.
[02] Preliminaries
1. How is the crawler generation task formulated? The crawler generation task is formulated as follows:
- Given a set of webpages on the same website describing a subject entity, and a predefined target attribute,
- The task is to generate an executable rule/action sequence to extract the target information from all the webpages.
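To make the formulation concrete, here is a minimal sketch that assumes an action sequence is an ordered list of XPath steps, where each step narrows the node set produced by the previous one. The function name `run_action_sequence`, the toy pages, and the selectors are illustrative assumptions, not taken from the paper, and the standard-library parser used here only handles well-formed markup.

```python
import xml.etree.ElementTree as ET


def run_action_sequence(page_source, xpaths):
    """Apply each XPath step in order; every step narrows the current node set."""
    nodes = [ET.fromstring(page_source)]
    for xpath in xpaths:
        narrowed = []
        for node in nodes:
            narrowed.extend(node.findall(xpath))
        nodes = narrowed
    # Return the visible text of whatever nodes survive the last step.
    return ["".join(node.itertext()).strip() for node in nodes]


# Two toy pages from the "same website": same template, slightly different layout.
pages = [
    "<html><body><div class='info'><span class='price'>$10</span></div></body></html>",
    "<html><body><p>ad</p><div class='info'><span class='price'>$25</span></div></body></html>",
]

# One reusable sequence: first prune to the info block, then pick the price.
sequence = [".//div[@class='info']", "./span[@class='price']"]
results = [run_action_sequence(page, sequence) for page in pages]
# results == [['$10'], ['$25']]
```

The point of the task is that `sequence` is generated once and then reused on every page of the site, rather than querying the model per page.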
2. What datasets are used for the experiments, and how are they preprocessed? The experiments use the following datasets:
- SWDE, Extended SWDE, and DS1, which contain webpages and annotations from various domains.
- The datasets are preprocessed by removing non-semantic elements (scripts, styles) and normalizing the annotations.
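The cleaning step can be sketched roughly as follows. This is an assumption about what "removing non-semantic elements" might look like, not the authors' preprocessing code: the helper `strip_non_semantic` is hypothetical, and it relies on the standard-library XML parser, so it only works on well-formed markup (real pages would usually need a tolerant HTML parser).

```python
import xml.etree.ElementTree as ET


def strip_non_semantic(page_source, drop=("script", "style")):
    """Remove elements that carry no extractable content (assumed: script, style)."""
    root = ET.fromstring(page_source)
    # Snapshot the elements first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in drop:
                # Note: this also discards any tail text attached to the child.
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


raw = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><script>track()</script><p>Hi</p></body></html>"
)
cleaned = strip_non_semantic(raw)
```

Shorter input matters here because the model has to reason over the whole document, so every removed block reduces the structure it must understand.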
3. How are the evaluation metrics defined for the crawler generation task? The evaluation metrics are defined as:
- Executable Evaluation: Measures the proportion of generated action sequences that are correct, over-extractive, unexecutable, or have other issues.
- IE Evaluation: Measures the precision, recall, and F1-score of the extracted information compared to the ground truth.
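The IE metrics follow the usual definitions. The sketch below assumes exact-match comparison over sets of extracted strings; the matching rule actually used in the paper is not stated in this summary, so treat `prf1` as an illustration only.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 over sets of extracted values (exact match assumed)."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# An over-extractive sequence: it finds both gold values plus one extra string.
precision, recall, f1 = prf1(["$10", "$25", "Sale"], ["$10", "$25"])
# precision is 2/3, recall is 1.0, F1 is approximately 0.8

# An empty extraction scores zero on all three metrics.
empty_scores = prf1([], ["$10"])
```

In this example recall is perfect while precision drops, which is the pattern an over-extractive action sequence would produce.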
[03] AutoCrawler
1. How does the AutoCrawler framework generate the action sequences? AutoCrawler generates action sequences in two phases:
- Progressive Generation: Uses a heuristic algorithm of top-down and step-back operations to progressively refine and prune the HTML content, generating an executable action sequence.
- Synthesis: Generates multiple action sequences from seed webpages and selects the one that can extract the target information from all webpages.
2. How does the progressive generation phase work? The progressive generation phase:
- Starts from the root node of the DOM tree and progressively refines down to the specific node containing the target information (top-down).
- If execution fails, it moves up the DOM tree to choose a more reliable and broadly applicable node as a foundation (step-back).
- This process helps correct erroneous executions and prune irrelevant parts of the HTML.
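The loop described above can be sketched as follows. This is a loose illustration under several assumptions: `propose_xpath` is a hypothetical stand-in for the LLM call and always returns the same selector, the step-back choice is a simple rule (climb to the nearest ancestor containing the target value) rather than a model decision, and the expected value on the seed page is taken as known.

```python
import xml.etree.ElementTree as ET

PAGE = (
    "<html><body>"
    "<div id='nav'><span class='price'>Prices</span></div>"
    "<div id='main'><div class='item'><span class='price'>$10</span></div></div>"
    "</body></html>"
)


def text_of(node):
    return "".join(node.itertext()).strip()


def propose_xpath(scope):
    # Hypothetical stand-in for the LLM: always proposes the same selector.
    return ".//span[@class='price']"


def step_back(hit, parents, target):
    """Climb from a hit towards the root; stop at the first ancestor containing target."""
    node = parents.get(hit)
    while node is not None and target not in text_of(node):
        node = parents.get(node)
    return node


def progressive_generate(root, target, max_rounds=3):
    parents = {child: parent for parent in root.iter() for child in parent}
    scope, trace = root, []
    for _ in range(max_rounds):
        xpath = propose_xpath(scope)          # top-down: write a path from the scope
        hits = scope.findall(xpath)
        values = [text_of(hit) for hit in hits]
        trace.append((scope.tag, values))
        if values == [target]:                # execution matches the expected value
            return xpath, scope, trace
        # Step-back: re-anchor on an ancestor of the correct hit, pruning the rest.
        anchor = next((hit for hit in hits if text_of(hit) == target), None)
        scope = step_back(anchor, parents, target) if anchor is not None else None
        if scope is None:
            break
    return None, scope, trace


xpath, scope, trace = progressive_generate(ET.fromstring(PAGE), "$10")
# trace == [('html', ['Prices', '$10']), ('div', ['$10'])]
```

On the first round the selector over-extracts from the whole page; after one step-back the scope shrinks to the `item` block and the same selector returns only the target.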
3. How does the synthesis phase improve the reusability of the action sequences? The synthesis phase:
- Generates multiple action sequences from randomly selected seed webpages.
- Executes the different action sequences on the seed webpages and selects the one that can extract the target information from all webpages.
- This enhances the reusability of the action sequences by addressing differences in the specific location and structure of the target information across webpages.
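The selection step might look like the sketch below. It assumes each candidate is a single XPath and scores candidates by how many seed pages they extract correctly, then keeps the best one; picking the highest count is a relaxation of the "works on all pages" criterion for the case where no candidate is perfect. The names `extract` and `synthesize` and the toy pages are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def extract(page_source, xpath):
    root = ET.fromstring(page_source)
    return ["".join(node.itertext()).strip() for node in root.findall(xpath)]


def synthesize(candidates, seed_pages, expected_values):
    """Run every candidate on every seed page and keep the most reusable one."""
    def pages_solved(xpath):
        return sum(
            extract(page, xpath) == [value]
            for page, value in zip(seed_pages, expected_values)
        )
    return max(candidates, key=pages_solved)


seed_pages = [
    "<html><body><div id='offer-1'><span class='price'>$10</span></div></body></html>",
    "<html><body><div id='offer-2'><span class='price'>$25</span></div></body></html>",
]
expected_values = ["$10", "$25"]

candidates = [
    ".//div[@id='offer-1']/span",   # overfits to the first seed page
    ".//span[@class='price']",      # relies on a shared template feature
]
best = synthesize(candidates, seed_pages, expected_values)
# best == ".//span[@class='price']"
```

A sequence generated from one seed page can latch onto page-specific details such as an id; cross-checking against the other seeds filters those out.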
[04] Experiment
1. How do the different LLMs and methods perform in the crawler generation task? The experimental results show that:
- AutoCrawler outperforms the baseline methods (CoT and Reflexion), producing a higher proportion of correct action sequences and a lower proportion of unexecutable ones.
- Larger LLMs (e.g., GPT-4, Mixtral 8×7B) demonstrate more stable performance in the task compared to smaller models.
- Traditional information extraction evaluation metrics (precision, recall, F1) do not fully capture the success rate of the crawler generation task, as they do not account for unexecutable or empty extractions.
2. How does the performance change when providing the golden label of the instruction? When provided with the golden label of the extraction target, the results show that:
- AutoCrawler still effectively enhances the model's performance compared to the baseline methods.
- However, LLMs still struggle to accurately understand the hierarchical structure of webpages, even with the golden label.
- Open-source LLMs are unable to achieve sustained performance improvement, indicating that the bottleneck lies in understanding the webpage structure rather than the content.
3. In which scenarios do current frameworks still not perform well? The experiments reveal that:
- Open-source LLMs with smaller parameter sizes (e.g., Mistral 7B) have significant difficulties in understanding and writing executable action sequences, making them challenging to apply in this task.
- Even with the progressive understanding framework of AutoCrawler, LLMs still struggle to fully capture the complex hierarchical structure of HTML, limiting their performance in the crawler generation task.