Mastering XPath Expressions
Learn to navigate HTML documents with precision. Write XPath expressions, test them live, and extract exactly the data you need.
XPath Syntax Cheat Sheet
Your quick reference for every XPath expression you'll need in web scraping.
| Expression | Description |
|---|---|
//tag | Select all elements of this tag anywhere |
/html/body/h1 | Absolute path from root |
@attr | Select an attribute |
* | Wildcard — matches any element |
text() | Text content of a node |
//tag[1] | First element (XPath is 1-indexed) |
//tag[last()] | Last element |
[@attr='val'] | Filter by exact attribute value |
[@attr] | Has this attribute (any value) |
[position()<=3] | First N elements |
contains() | Partial string match |
starts-with() | Match beginning of string |
normalize-space() | Strip extra whitespace |
not() | Logical NOT |
following-sibling:: | Next siblings at same level |
preceding-sibling:: | Previous siblings |
parent:: | Parent node |
ancestor:: | All ancestors |
A | B | Union — select from both paths |
//tag[@a][@b] | Multiple predicates (AND) |
Key Insight
Use contains(@class, "value") instead of @class="value" — HTML elements often have multiple classes, and exact matching will fail.
Interactive Playground
Write XPath expressions and see results instantly against a practice HTML document.
Quick Examples
Practice HTML Document
hkbu-comm.html<html> <head> <title>HKBU Comm News</title> </head> <body> <header> <h1 class="site-title">HKBU Communication Studies</h1> <nav> <a href="/about">About</a> <a href="/research">Research</a> <a href="/people">People</a> </nav> </header> <main> <section id="news"> <h2>Latest News</h2> <ul class="news-list"> <li data-year="2024"> <a href="/news/1" class="news-link"> AI in Journalism Research </a> <span class="date">2024-03-01</span> </li> <li data-year="2024"> <a href="/news/2" class="news-link"> New Media Lab Opens </a> <span class="date">2024-02-15</span> </li> <li data-year="2023"> <a href="/news/3" class="news-link"> Annual Conference Report </a> <span class="date">2023-12-10</span> </li> </ul> </section> <section id="staff"> <h2>Academic Staff</h2> <table class="staff-table"> <thead> <tr> <th>Name</th> <th>Title</th> <th>Email</th> </tr> </thead> <tbody> <tr> <td class="name">Prof. Chan Tai Man</td> <td class="title">Professor</td> <td class="email"> <a href="mailto:[email protected]">[email protected]</a> </td> </tr> <tr> <td class="name">Dr. Lee Siu Fong</td> <td class="title">Associate Professor</td> <td class="email"> <a href="mailto:[email protected]">[email protected]</a> </td> </tr> <tr> <td class="name">Dr. Wong Ka Wai</td> <td class="title">Assistant Professor</td> <td class="email"> <a href="mailto:[email protected]">[email protected]</a> </td> </tr> </tbody> </table> </section> </main> <footer> <p class="copyright">© 2024 HKBU. All rights reserved.</p> </footer> </body> </html>
Press Enter or click Run to evaluate
Practice Exercises
Four levels of difficulty — from basic selection to real-world scraping patterns.
Level 1: Warm-Up — Basic node selection — no filters needed
Select all <h2> headings on the page
Get the text content of the page <title>
Select all <a> tags that have an href attribute
Get the src attribute value of every <img> tag
Practice on a Real Hong Kong Website
This page replicates the exact HTML structure of lihkg.com/category/1 — same hashed class names, same DOM tree. XPath expressions you write here will work directly on the real site.
Tools & Workflow
The recommended workflow: test in browser → transfer to Python.
Chrome DevTools
Built-in browser console
Steps
- 1Open any webpage → Press F12 (or right-click → "Inspect")
- 2Go to the Console tab
- 3Type: $x("//your/xpath")
- 4Press Enter — matched elements appear below
- 5Hover over results to highlight them on the page
// In Chrome Console:
$x('//h1/text()')
// → ["HKBU Communication Studies"]
$x('//a[@class="news-link"]/@href')
// → ["/news/1", "/news/2", "/news/3"]
$x('//li[contains(@class,"active")]')
// → [li.active]XPath Helper
Chrome Extension
Steps
- 1Search "XPath Helper" in Chrome Web Store and install
- 2Press Ctrl+Shift+X to open the XPath Helper panel
- 3Type your XPath — matching elements highlight instantly
- 4Hold Shift and hover any element to auto-generate its XPath
- 5Copy the XPath directly into your Python code
// XPath Helper shows: // Expression: //td[@class="name"]/text() // Results: // "Prof. Chan Tai Man" // "Dr. Lee Siu Fong" // "Dr. Wong Ka Wai" // Count: 3
Python lxml
For production scraping
Steps
- 1Install: pip install requests lxml
- 2Fetch the page with requests.get()
- 3Parse with lxml.html.fromstring(response.content)
- 4Apply .xpath() method with your expression
- 5Results are Python lists — iterate or zip() them
import requests
from lxml import html
url = "https://www.comm.hkbu.edu.hk/..."
resp = requests.get(url)
tree = html.fromstring(resp.content)
# Extract data
names = tree.xpath('//td[@class="name"]/text()')
emails = tree.xpath('//td[@class="email"]/a/@href')
for name, email in zip(names, emails):
print(name, email.replace("mailto:",""))Common Mistakes & Fixes
| Common Mistake | Problem | Fix |
|---|---|---|
@class="nav active" | Exact match fails when element has multiple classes | contains(@class, "nav") |
/html/body/div[3]/ul/li | Absolute path breaks when layout changes | //ul[@class="nav-list"]/li |
//h1 (returns element, not text) | Returns element object, not the string value | //h1/text() |
//li[0] (zero index) | XPath is 1-indexed, not 0-indexed like Python | //li[1] for the first element |
XPath returns [] in Python | Page content loaded by JavaScript (dynamic) | Use Selenium or Playwright to render JS first |