Extract Content — Wikipedia Agentic Workflow

Extract article title, first paragraph, and main content from the current Wikipedia page

Available free v1.0.0 Browser LLM
$ sidebutton install wikipedia

A focused extractor for Wikipedia article pages. It captures the article title, the lead paragraph, and the main body content — skipping the sidebar, references, navigation boxes, and other chrome. The output is clean plain text suitable for downstream summarisation, knowledge extraction, or quoting.

The workflow assumes the browser is already on an article page; chain it after the open-article workflow when the starting point is a topic name. Language is preserved: if Wikipedia redirected to a localised edition, the extracted content comes from that edition rather than being silently switched to English.
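
Because the extracted text stays in whatever edition the browser landed on, downstream code may want to know which edition that was. The following is a minimal sketch, not part of the workflow itself: `wikipedia_edition` is a hypothetical helper, and it assumes the caller can obtain the current page URL by some other means.

```python
from urllib.parse import urlsplit


def wikipedia_edition(url):
    """Return the edition subdomain (e.g. 'de') of a Wikipedia URL, or None.

    Hypothetical helper for illustration; the workflow does not expose it.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Expect <edition>.wikipedia.org or <edition>.m.wikipedia.org (mobile).
    if len(labels) < 3 or labels[-2:] != ["wikipedia", "org"]:
        return None
    edition = labels[0]
    # www.wikipedia.org is the multilingual portal, not an edition.
    return None if edition == "www" else edition


print(wikipedia_edition("https://de.wikipedia.org/wiki/Berlin"))
print(wikipedia_edition("https://en.m.wikipedia.org/wiki/Berlin"))
```

The helper returns the subdomain as written, so editions such as `simple` come back unchanged rather than being mapped to a language code.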

Steps

  1. Extract text from a selector (browser.extract)
     selector: #firstHeading
     as: title
  2. Extract text from a selector (browser.extract)
     selector: #mw-content-text .mw-parser-output > p:not([class])
     as: first_paragraph
  3. Extract text from every match of a selector (browser.extractAll)
     selector: #mw-content-text .mw-parser-output > p:not([class])
     as: content
     separator: \n\n
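
To make the paragraph selector concrete, here is a rough standard-library emulation of what `#mw-content-text .mw-parser-output > p:not([class])` matches. It is a sketch under assumptions: the sample markup is made up, the parser only handles well-formed HTML, and it assumes `browser.extract` returns the text of the first match while `browser.extractAll` joins every match with the separator. The real steps run in the browser, not through this code.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}


class ParagraphCollector(HTMLParser):
    """Collect text of '#mw-content-text .mw-parser-output > p:not([class])'."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []            # attribute dicts of the open elements
        self.capture_depth = None  # stack depth at which the captured <p> opened
        self.buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.capture_depth is None and tag == "p" and "class" not in attrs and self.stack:
            parent = self.stack[-1]
            # '>' requires the direct parent to carry the class ...
            parent_ok = "mw-parser-output" in (parent.get("class") or "").split()
            # ... and the descendant combinator requires an ancestor with the id.
            ancestor_ok = any(a.get("id") == "mw-content-text" for a in self.stack[:-1])
            if parent_ok and ancestor_ok:
                self.capture_depth = len(self.stack)
        self.stack.append(attrs)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        self.stack.pop()
        if self.capture_depth is not None and len(self.stack) == self.capture_depth:
            self.paragraphs.append("".join(self.buffer).strip())
            self.buffer = []
            self.capture_depth = None

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.buffer.append(data)


# Made-up markup that mimics the shape of an article page.
SAMPLE = """
<div id="mw-content-text">
  <div class="mw-parser-output">
    <p class="mw-empty-elt"></p>
    <table class="infobox"><tr><td><p>Infobox text</p></td></tr></table>
    <p><b>Ada Lovelace</b> was an English mathematician.</p>
    <h2>Biography</h2>
    <p>She worked on the Analytical Engine.</p>
    <div class="reflist"><p>Reference entry</p></div>
  </div>
</div>
"""

collector = ParagraphCollector()
collector.feed(SAMPLE)
paragraphs = collector.paragraphs

first_paragraph = paragraphs[0]      # assumed browser.extract behaviour
content = "\n\n".join(paragraphs)    # assumed browser.extractAll behaviour
print(first_paragraph)
print(content)
```

In this sample the classed empty paragraph, the paragraph nested in the infobox table, and the paragraph inside the reference list are all skipped, because none of them is an unclassed direct child of `.mw-parser-output`.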

Workflow definition

schema_version: 1
version: "1.0.0"
last_verified: "2025-12-21"
id: wikipedia_extract_content
title: "Extract Content"
description: "Extract article title, first paragraph, and main content from the current Wikipedia page"
overview: |
  A focused extractor for Wikipedia article pages. It captures the article title, the lead paragraph, and the main body content — skipping the sidebar, references, navigation boxes, and other chrome. The output is clean plain text suitable for downstream summarisation, knowledge extraction, or quoting.

  The workflow assumes the browser is already on an article page; chain it after the open-article workflow when the starting point is a topic name. Language is preserved: if Wikipedia redirected to a localised edition, the extracted content comes from that edition rather than being silently switched to English.

category:
  level: task
  domain: research
  reusable: true
policies:
  allowed_domains:
    - wikipedia.org
    - "*.wikipedia.org"
steps:
  # Extract article title
  - type: browser.extract
    selector: "#firstHeading"
    as: title

  # Extract first paragraph (the lead/summary)
  - type: browser.extract
    selector: "#mw-content-text .mw-parser-output > p:not([class])"
    as: first_paragraph

  # Extract main content (multiple paragraphs)
  - type: browser.extractAll
    selector: "#mw-content-text .mw-parser-output > p:not([class])"
    as: content
    separator: "\n\n"
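
The `allowed_domains` policy above lists a bare domain and a wildcard pattern. How the runner evaluates these patterns is not documented here; the sketch below assumes simple glob matching against the URL's hostname, and `is_allowed` is a hypothetical name used only for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Patterns copied from the workflow's policies.allowed_domains list.
ALLOWED_DOMAINS = ["wikipedia.org", "*.wikipedia.org"]


def is_allowed(url, patterns=ALLOWED_DOMAINS):
    """Return True if the URL's hostname matches any pattern (assumed glob semantics)."""
    host = (urlsplit(url).hostname or "").lower()
    return any(fnmatchcase(host, pattern) for pattern in patterns)


for candidate in (
    "https://en.wikipedia.org/wiki/Ada_Lovelace",
    "https://wikipedia.org/",
    "https://wikipedia.org.example.com/",
):
    print(candidate, is_allowed(candidate))
```

Under this reading the two entries are both needed: the wildcard pattern covers edition subdomains such as `en.wikipedia.org`, while the bare entry covers `wikipedia.org` itself.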