XPath
Web Scraping Workshop 3 · HKBU · Dept. of Communication Studies

Mastering XPath Expressions

Learn to navigate HTML documents with precision. Write XPath expressions, test them live, and extract exactly the data you need.

01

XPath Syntax Cheat Sheet

Your quick reference for every XPath expression you'll need in web scraping.

Basic · Predicates · Functions · Axes · Advanced

Expression            Description
//tag                 Select all elements with this tag, anywhere in the document
/html/body/h1         Absolute path from the root
@attr                 Select an attribute
*                     Wildcard: matches any element
text()                Text nodes that are direct children of an element
//tag[1]              First element among its siblings (XPath is 1-indexed)
//tag[last()]         Last element among its siblings
[@attr='val']         Filter by exact attribute value
[@attr]               Has this attribute (any value)
[position()<=3]       First 3 elements (use <=N for the first N)
contains()            Partial string match
starts-with()         Match the beginning of a string
normalize-space()     Trim leading/trailing whitespace and collapse internal runs
not()                 Logical NOT
following-sibling::   Later siblings at the same level
preceding-sibling::   Earlier siblings at the same level
parent::              Parent node
ancestor::            All ancestors, up to the root
A | B                 Union: nodes selected by either path
//tag[@a][@b]         Multiple predicates (AND)
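
To see several of these expressions in action, here is a minimal sketch using lxml (assumed installed). SNIPPET is a trimmed, single-line-per-item copy of the news list from the practice document, and the variable names are illustrative only.

```python
from lxml import html

# Trimmed copy of the practice document's news list (whitespace removed for clarity).
SNIPPET = """
<ul class="news-list">
  <li data-year="2024"><a href="/news/1" class="news-link">AI in Journalism Research</a><span class="date">2024-03-01</span></li>
  <li data-year="2024"><a href="/news/2" class="news-link">New Media Lab Opens</a><span class="date">2024-02-15</span></li>
  <li data-year="2023"><a href="/news/3" class="news-link">Annual Conference Report</a><span class="date">2023-12-10</span></li>
</ul>
"""
tree = html.fromstring(SNIPPET)

# Predicates: position and attribute filters
first = tree.xpath('//li[1]/a/text()')
last = tree.xpath('//li[last()]/a/text()')
hrefs_2024 = tree.xpath('//li[@data-year="2024"]/a/@href')
first_two_dates = tree.xpath('//li[position()<=2]/span/text()')

# Functions: starts-with() and not()
dates_2023 = tree.xpath('//span[starts-with(text(), "2023")]/text()')
not_2024 = tree.xpath('//li[not(@data-year="2024")]/a/text()')

# Axes: move sideways to a sibling, or up to the parent
date_after_link = tree.xpath('//a[@href="/news/2"]/following-sibling::span/text()')
year_of_date = tree.xpath('//span[text()="2023-12-10"]/parent::li/@data-year')

# Union: collect link targets and dates in a single query
union = tree.xpath('//a/@href | //span/text()')

print(first, last, hrefs_2024, first_two_dates)
print(dates_2023, not_2024, date_after_link, year_of_date, len(union))
```

Every call returns a Python list, even when only one node matches, so index with [0] only after checking that the list is not empty.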
💡

Key Insight

Use contains(@class, "value") instead of @class="value": HTML elements often carry several classes, and an exact match fails whenever the attribute holds anything besides that one value.
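
The difference is easy to check with lxml. The sketch below uses a made-up three-item list (not part of the practice document) and also shows a caveat: contains() is a substring test, so "active" also matches "inactive". The third expression is a common stricter idiom that matches whole class tokens only.

```python
from lxml import html

# Hypothetical markup, invented for this illustration.
doc = html.fromstring("""
<ul>
  <li class="item">one</li>
  <li class="item active">two</li>
  <li class="item inactive">three</li>
</ul>
""")

# Exact match: fails because no element has class="active" and nothing else.
exact = doc.xpath('//li[@class="active"]/text()')

# Substring match: finds "active", but also "inactive".
loose = doc.xpath('//li[contains(@class, "active")]/text()')

# Whole-token match: pad the class list with spaces, then look for " active ".
token = doc.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " active ")]/text()'
)

print(exact)  # []
print(loose)  # ['two', 'three']
print(token)  # ['two']
```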

02

Interactive Playground

Write XPath expressions and see results instantly against a practice HTML document.


Practice HTML Document

hkbu-comm.html
<html>
  <head>
    <title>HKBU Comm News</title>
  </head>
  <body>
    <header>
      <h1 class="site-title">HKBU Communication Studies</h1>
      <nav>
        <a href="/about">About</a>
        <a href="/research">Research</a>
        <a href="/people">People</a>
      </nav>
    </header>

    <main>
      <section id="news">
        <h2>Latest News</h2>
        <ul class="news-list">
          <li data-year="2024">
            <a href="/news/1" class="news-link">
              AI in Journalism Research
            </a>
            <span class="date">2024-03-01</span>
          </li>
          <li data-year="2024">
            <a href="/news/2" class="news-link">
              New Media Lab Opens
            </a>
            <span class="date">2024-02-15</span>
          </li>
          <li data-year="2023">
            <a href="/news/3" class="news-link">
              Annual Conference Report
            </a>
            <span class="date">2023-12-10</span>
          </li>
        </ul>
      </section>

      <section id="staff">
        <h2>Academic Staff</h2>
        <table class="staff-table">
          <thead>
            <tr>
              <th>Name</th>
              <th>Title</th>
              <th>Email</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td class="name">Prof. Chan Tai Man</td>
              <td class="title">Professor</td>
              <td class="email">
                <a href="mailto:[email protected]">[email protected]</a>
              </td>
            </tr>
            <tr>
              <td class="name">Dr. Lee Siu Fong</td>
              <td class="title">Associate Professor</td>
              <td class="email">
                <a href="mailto:[email protected]">[email protected]</a>
              </td>
            </tr>
            <tr>
              <td class="name">Dr. Wong Ka Wai</td>
              <td class="title">Assistant Professor</td>
              <td class="email">
                <a href="mailto:[email protected]">[email protected]</a>
              </td>
            </tr>
          </tbody>
        </table>
      </section>
    </main>

    <footer>
      <p class="copyright">© 2024 HKBU. All rights reserved.</p>
    </footer>
  </body>
</html>
03

Practice Exercises

Four levels of difficulty — from basic selection to real-world scraping patterns.

Level 1: Warm-Up — Basic node selection — no filters needed

Select all <h2> headings on the page

Get the text content of the page <title>

Select all <a> tags that have an href attribute

Get the src attribute value of every <img> tag

Use the Playground above to test your answers before revealing them.
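
Once you have tried the four exercises yourself, you can compare against this sketch. It shows one possible set of answers, run with lxml (assumed installed) against a shortened copy of the practice document; PAGE and the variable names are illustrative. The practice document contains no <img> tags, so the last query correctly returns an empty list.

```python
from lxml import html

# Shortened copy of the practice document: same tags, fewer rows.
PAGE = """
<html>
  <head><title>HKBU Comm News</title></head>
  <body>
    <h1 class="site-title">HKBU Communication Studies</h1>
    <nav><a href="/about">About</a><a href="/research">Research</a></nav>
    <section id="news"><h2>Latest News</h2></section>
    <section id="staff"><h2>Academic Staff</h2></section>
  </body>
</html>
"""
tree = html.fromstring(PAGE)

# 1. All <h2> headings (elements; read their text with text_content())
h2_texts = [h.text_content() for h in tree.xpath('//h2')]

# 2. Text content of the page <title>
title = tree.xpath('//title/text()')

# 3. All <a> tags that have an href attribute
links = tree.xpath('//a[@href]')

# 4. The src attribute of every <img> tag
img_srcs = tree.xpath('//img/@src')

print(h2_texts)    # ['Latest News', 'Academic Staff']
print(title)       # ['HKBU Comm News']
print(len(links))  # 2
print(img_srcs)    # []
```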
Real Structure · LIHKG 吹水台 (Chit-Chat board)

Practice on a Real Hong Kong Website

This page replicates the HTML structure of lihkg.com/category/1, with the same hashed class names and the same DOM tree, so XPath expressions you write here should carry over to the real site for as long as its markup stays unchanged.

Open LIHKG Practice →
04

Tools & Workflow

The recommended workflow: test in browser → transfer to Python.

Chrome DevTools → XPath Helper → Python lxml → Production Scraper
🔍

Chrome DevTools

Built-in browser console

No install required

Steps

  1. Open any webpage → Press F12 (or right-click → "Inspect")
  2. Go to the Console tab
  3. Type: $x("//your/xpath")
  4. Press Enter; matched elements appear below
  5. Hover over results to highlight them on the page
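// In the Chrome console. $x() returns an array of DOM nodes, so map them to strings:
$x('//h1/text()').map(n => n.textContent)
// → ["HKBU Communication Studies"]

$x('//a[@class="news-link"]/@href').map(a => a.value)
// → ["/news/1", "/news/2", "/news/3"]

// On a page that has an <li> with the class "active":
$x('//li[contains(@class,"active")]')
// → [li.active]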
Tip: This is the fastest way to test XPath. Always start here before writing Python.
🧩

XPath Helper

Chrome Extension

Free extension

Steps

  1. Search "XPath Helper" in Chrome Web Store and install
  2. Press Ctrl+Shift+X to open the XPath Helper panel
  3. Type your XPath; matching elements highlight instantly
  4. Hold Shift and hover any element to auto-generate its XPath
  5. Copy the XPath directly into your Python code
// XPath Helper shows:
// Expression: //td[@class="name"]/text()
// Results:
//   "Prof. Chan Tai Man"
//   "Dr. Lee Siu Fong"
//   "Dr. Wong Ka Wai"
// Count: 3
Tip: The Shift+hover feature auto-generates XPath for any element — great for getting started quickly.
🐍

Python lxml

For production scraping

pip install requests lxml

Steps

  1. Install: pip install requests lxml
  2. Fetch the page with requests.get()
  3. Parse with lxml.html.fromstring(response.content)
  4. Apply the .xpath() method with your expression
  5. Results are Python lists: iterate over them or zip() them
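import requests
from lxml import html

url = "https://www.comm.hkbu.edu.hk/..."  # placeholder: substitute the full page URL
resp = requests.get(url, timeout=10)      # avoid hanging forever on a slow server
resp.raise_for_status()                   # stop early on HTTP errors (4xx/5xx)
tree = html.fromstring(resp.content)

# Extract data: each .xpath() call returns a list of strings
names  = tree.xpath('//td[@class="name"]/text()')
emails = tree.xpath('//td[@class="email"]/a/@href')

# Pair names with emails by position (assumes every row has both)
for name, email in zip(names, emails):
    print(name.strip(), email.replace("mailto:", ""))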
Tip: Always test your XPath in Chrome DevTools first, then paste it directly into tree.xpath().
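
Pairing two separate lists with zip() silently misaligns when a row is missing a field. A hedged alternative is to select each row first and then run relative XPath expressions (no leading //) inside it. The sketch below assumes lxml is installed and parses an inline string rather than fetching a page; STAFF is a cut-down staff table whose email addresses are placeholders invented for this example.

```python
from lxml import html

# Cut-down staff table. The address is a made-up placeholder, and the
# second row deliberately has no email link.
STAFF = """
<table class="staff-table">
  <tbody>
    <tr>
      <td class="name">Prof. Chan Tai Man</td>
      <td class="title">Professor</td>
      <td class="email"><a href="mailto:chan@example.org">chan@example.org</a></td>
    </tr>
    <tr>
      <td class="name">Dr. Lee Siu Fong</td>
      <td class="title">Associate Professor</td>
      <td class="email"></td>
    </tr>
  </tbody>
</table>
"""
tree = html.fromstring(STAFF)

rows = []
for tr in tree.xpath('//table[@class="staff-table"]/tbody/tr'):
    # Relative paths are evaluated from this <tr>, not from the document root.
    name = tr.xpath('string(td[@class="name"])').strip()
    title = tr.xpath('string(td[@class="title"])').strip()
    hrefs = tr.xpath('td[@class="email"]/a/@href')
    email = hrefs[0].replace("mailto:", "") if hrefs else None
    rows.append({"name": name, "title": title, "email": email})

for row in rows:
    print(row)
```

Each dictionary stays tied to one table row, so a missing email becomes None instead of shifting every later address up by one.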

Common Mistakes & Fixes

Common Mistake              Problem                                                 Fix
@class="nav active"         Exact match fails when the class list differs at all    contains(@class, "nav")
/html/body/div[3]/ul/li     Absolute path breaks when the layout changes            //ul[@class="nav-list"]/li
//h1 when you want text     Returns the element object, not its string value        //h1/text()
//li[0]                     XPath is 1-indexed, unlike Python; [0] matches nothing  //li[1] for the first element
XPath returns [] in Python  Content is loaded by JavaScript after the initial HTML  Use Selenium or Playwright to render JS first
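
Two of these mistakes are easy to reproduce locally. The sketch below assumes lxml is installed and uses a small invented fragment (not the practice document) to show what the zero index and the missing text() step actually return.

```python
from lxml import html

# Invented fragment; the extra spaces inside <h1> are intentional.
doc = html.fromstring("""
<div>
  <h1>  Department   News </h1>
  <ul>
    <li>Home</li>
    <li>About</li>
  </ul>
</div>
""")

# Zero index: not an error, just an empty result.
zero = doc.xpath('//li[0]')
first = doc.xpath('//li[1]/text()')

# Element versus text: //h1 gives an element object, text() gives strings.
h1_elements = doc.xpath('//h1')
h1_text = doc.xpath('//h1/text()')

# normalize-space() returns one cleaned-up string instead of a list.
h1_clean = doc.xpath('normalize-space(//h1)')

print(zero)                # []
print(first)               # ['Home']
print(h1_elements[0].tag)  # h1
print(h1_text)             # ['  Department   News ']
print(h1_clean)            # Department News
```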