Wikipedia Agentic Workflow
Extract Content — Wikipedia Agentic Workflow
Extract article title, first paragraph, and main content from the current Wikipedia page
sidebutton install wikipedia
A focused extractor for Wikipedia article pages. It captures the article title, the lead paragraph, and the body paragraphs, skipping the sidebar, references, navigation boxes, and other chrome. The output is clean plain text suitable for downstream summarisation, knowledge extraction, or quoting.
Assumes the browser is already on an article page; chain it after the open-article workflow when the starting point is a topic name. Language is preserved: if Wikipedia redirected to a localised edition, the extracted content matches that edition rather than silently switching to English.
Steps
1. Extract text from a selector (browser.extract)
   - selector: #firstHeading
   - as: title
2. Extract text from a selector (browser.extract)
   - selector: #mw-content-text .mw-parser-output > p:not([class])
   - as: first_paragraph
3. Extract text from every element matching a selector (browser.extractAll)
   - selector: #mw-content-text .mw-parser-output > p:not([class])
   - as: content
   - separator: \n\n
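To make the selector logic concrete, here is a minimal Python sketch that emulates steps 2 and 3 on a small HTML fragment using only the standard library. It is not how sidebutton implements browser.extract or browser.extractAll; the LeadParagraphs class, the extract helper, and the SAMPLE markup are hypothetical, and the sketch assumes well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LeadParagraphs(HTMLParser):
    """Collect text of <p> elements matching
    '#mw-content-text .mw-parser-output > p:not([class])' (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []       # currently open elements as (tag, attrs dict)
        self.paragraphs = []  # text of each matching paragraph, in page order
        self._buf = None      # text chunks of the paragraph being captured
        self._depth = 0       # stack depth at which the captured <p> opened

    def _matches(self, attrs):
        # p:not([class]) -> the paragraph itself carries no class attribute.
        if "class" in attrs or not self.stack:
            return False
        # '>' -> the direct parent must have the class mw-parser-output.
        parent_classes = (self.stack[-1][1].get("class") or "").split()
        if "mw-parser-output" not in parent_classes:
            return False
        # Descendant combinator -> some ancestor of that parent has the id.
        return any(a.get("id") == "mw-content-text" for _, a in self.stack[:-1])

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p" and self._buf is None and self._matches(attrs):
            self._buf = []
            self._depth = len(self.stack)
        if tag not in VOID_TAGS:
            self.stack.append((tag, attrs))

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        # The captured <p> is closed once the stack shrinks back to its depth.
        if self._buf is not None and len(self.stack) <= self._depth:
            self.paragraphs.append(" ".join("".join(self._buf).split()))
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def extract(html, separator="\n\n"):
    """Return first_paragraph and content the way steps 2 and 3 name them."""
    parser = LeadParagraphs()
    parser.feed(html)
    parser.close()
    paras = parser.paragraphs
    return {
        "first_paragraph": paras[0] if paras else "",
        "content": separator.join(paras),
    }


# Hypothetical fragment shaped like a Wikipedia article body.
SAMPLE = """
<div id="mw-content-text">
  <div class="mw-content-ltr mw-parser-output">
    <p class="mw-empty-elt"></p>
    <p><b>Ada Lovelace</b> was a mathematician.</p>
    <table class="infobox"><tr><td><p>Infobox text</p></td></tr></table>
    <p>She worked with Charles Babbage.</p>
  </div>
</div>
"""

result = extract(SAMPLE)
print(result["first_paragraph"])
print(result["content"].split("\n\n"))
```

In this fragment the class-less filter drops the empty placeholder paragraph (class mw-empty-elt), and the child combinator (>) keeps the paragraph nested inside the infobox table out of the result. Splitting content on the separator recovers the individual paragraphs.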
Workflow definition
schema_version: 1
version: "1.0.0"
last_verified: "2025-12-21"
id: wikipedia_extract_content
title: "Extract Content"
description: "Extract article title, first paragraph, and main content from the current Wikipedia page"
overview: |
  A focused extractor for Wikipedia article pages. It captures the article title, the lead paragraph, and the body paragraphs, skipping the sidebar, references, navigation boxes, and other chrome. The output is clean plain text suitable for downstream summarisation, knowledge extraction, or quoting.
  Assumes the browser is already on an article page; chain it after the open-article workflow when the starting point is a topic name. Language is preserved: if Wikipedia redirected to a localised edition, the extracted content matches that edition rather than silently switching to English.
category:
  level: task
  domain: research
  reusable: true
policies:
  allowed_domains:
    - wikipedia.org
    - "*.wikipedia.org"
steps:
  # Extract article title
  - type: browser.extract
    selector: "#firstHeading"
    as: title
  # Extract first paragraph (the lead/summary)
  - type: browser.extract
    selector: "#mw-content-text .mw-parser-output > p:not([class])"
    as: first_paragraph
  # Extract main content (multiple paragraphs)
  - type: browser.extractAll
    selector: "#mw-content-text .mw-parser-output > p:not([class])"
    as: content
    separator: "\n\n"
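
The allowed_domains policy lists both the bare domain and a wildcard entry. The page does not say how sidebutton evaluates these patterns, so the following Python sketch only assumes glob-style matching against the URL's hostname; is_allowed and ALLOWED_DOMAINS are hypothetical names. Under that assumption it shows why both entries are needed: the wildcard alone would not cover wikipedia.org itself.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Patterns copied from the policies block of the workflow definition.
ALLOWED_DOMAINS = ["wikipedia.org", "*.wikipedia.org"]


def is_allowed(url, patterns=ALLOWED_DOMAINS):
    """Assumed semantics: glob-match the lowercase hostname against each pattern."""
    host = (urlsplit(url).hostname or "").lower()
    return any(fnmatchcase(host, pattern) for pattern in patterns)


for url in (
    "https://en.wikipedia.org/wiki/Ada_Lovelace",  # language edition
    "https://wikipedia.org/",                      # bare domain
    "https://evilwikipedia.org/",                  # look-alike, no dot boundary
    "https://en.wikipedia.org.example.com/",       # suffix on another domain
):
    print(url, is_allowed(url))
```

Matching the whole hostname, rather than searching for a substring, is what keeps look-alike hosts from passing in this sketch.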