Web Scraping Workshop 3 · HKBU · Dept. of Communication Studies

Mastering XPath Expressions

Learn to navigate HTML documents with precision. Write XPath expressions, test them live, and extract exactly the data you need.

⚡ Live Playground

🎯 4-Level Exercises

📋 Syntax Cheat Sheet

🔧 Tool Guide

$x('//h2[@class="title"]')

→ [h2, h2, h2] ✓ 3 matches

$x('//a/@href')

→ [attr, attr...] ✓ 12 matches

$x('//li[contains(@class,"active")]')

→ [li] ✓ 1 match

XPath Syntax Cheat Sheet

Your quick reference for every XPath expression you'll need in web scraping.


Expression	Description	Example	Category
`//tag`	Select all elements of this tag anywhere	`//div`	Basic
`/html/body/h1`	Absolute path from root	`/html/body/h1`	Basic
`@attr`	Select an attribute	`//a/@href`	Basic
`*`	Wildcard — matches any element	`//*[@id]`	Basic
`text()`	Direct text-node children of an element	`//h1/text()`	Basic
`//tag[1]`	First element among its siblings (XPath is 1-indexed); `(//tag)[1]` is the first in the whole document	`//li[1]`	Predicates
`//tag[last()]`	Last element among its siblings	`//li[last()]`	Predicates
`[@attr='val']`	Filter by exact attribute value	`//div[@class="news"]`	Predicates
`[@attr]`	Has this attribute (any value)	`//a[@href]`	Predicates
`[position()<=3]`	First N elements among their siblings (here N = 3)	`//tr[position()<=3]`	Predicates
`contains()`	Partial string match	`//div[contains(@class,"card")]`	Functions
`starts-with()`	Match beginning of string	`//a[starts-with(@href,"/news")]`	Functions
`normalize-space()`	Trim leading and trailing whitespace and collapse internal runs to one space	`//p[normalize-space()="Hello"]`	Functions
`not()`	Logical NOT	`//li[not(@class)]`	Functions
`following-sibling::`	Next siblings at same level	`//h2/following-sibling::p`	Axes
`preceding-sibling::`	Previous siblings	`//li[3]/preceding-sibling::li`	Axes
`parent::`	Parent node	`//span/parent::div`	Axes
`ancestor::`	All ancestors	`//a/ancestor::nav`	Axes
`A | B`	Union: select nodes from both paths	`//a | //button`	Advanced
`//tag[@a][@b]`	Multiple predicates (AND)	`//div[@class="x"][@data-id]`	Advanced
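
A few rows of the cheat sheet are easier to read next to concrete results. The sketch below assumes `lxml` is installed and runs several of the expressions against a small HTML snippet invented for this illustration (it is not the practice document). It also shows the difference between `//li[1]` and `(//li)[1]`.

```python
from lxml import html

# Hypothetical snippet, made up for this illustration only.
doc = html.fromstring("""
<div>
  <h2>Fruit</h2>
  <p>Intro</p>
  <ul><li>apple</li><li>banana</li><li>cherry</li></ul>
  <ul><li>date</li><li>elderberry</li></ul>
  <a href="/more">More</a>
  <button>Load</button>
</div>
""")

# Predicates: a position applies within each parent, not across the document.
first_of_each = doc.xpath('//li[1]/text()')       # first <li> of every <ul>
first_overall = doc.xpath('(//li)[1]/text()')     # first <li> in the document
last_of_each = doc.xpath('//li[last()]/text()')   # last <li> of every <ul>

# Axes: move sideways from a known node.
after_h2 = doc.xpath('//h2/following-sibling::p/text()')
before_third = doc.xpath('//li[3]/preceding-sibling::li/text()')

# Union: nodes from both paths, returned in document order.
clickable = [el.tag for el in doc.xpath('//a | //button')]

print(first_of_each, first_overall, last_of_each)
print(after_h2, before_third, clickable)
```

Here `//li[1]` returns one item per list (`apple` and `date`), while wrapping the path in parentheses first collects every `<li>` and then takes the first one overall.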

💡

Key Insight

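Use contains(@class, "value") instead of @class="value": HTML elements often carry several classes, and an exact match fails as soon as the attribute holds more than one. Note that contains() matches substrings, so "card" also matches "card-header"; for a whole-token match, use contains(concat(" ", normalize-space(@class), " "), " value ").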
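
The three ways of matching a class can be compared side by side. This is a minimal sketch assuming `lxml` is installed; the HTML and its class names are hypothetical and chosen only to expose the difference.

```python
from lxml import html

# Hypothetical markup: one plain card, one card with an extra class,
# and one element whose class merely contains the substring "card".
doc = html.fromstring("""
<div>
  <div class="card">A</div>
  <div class="card featured">B</div>
  <div class="card-header">C</div>
</div>
""")

# Exact match: only the element whose class attribute is exactly "card".
exact = doc.xpath('//div[@class="card"]/text()')

# Substring match: also catches "card-header".
loose = doc.xpath('//div[contains(@class, "card")]/text()')

# Whole-token match: pad the attribute with spaces and look for " card ".
token = doc.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " card ")]/text()'
)

print(exact)  # ['A']
print(loose)  # ['A', 'B', 'C']
print(token)  # ['A', 'B']
```

The whole-token form is longer, but it is the one that behaves like a CSS `.card` selector.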

Interactive Playground

Write XPath expressions and see results instantly against a practice HTML document.


Practice HTML Document

hkbu-comm.html

<html>
  <head>
    <title>HKBU Comm News</title>
  </head>
  <body>
    <header>
      <h1 class="site-title">HKBU Communication Studies</h1>
      <nav>
        <a href="/about">About</a>
        <a href="/research">Research</a>
        <a href="/people">People</a>
      </nav>
    </header>

    <main>
      <section id="news">
        <h2>Latest News</h2>
        <ul class="news-list">
          <li data-year="2024">
            <a href="/news/1" class="news-link">
              AI in Journalism Research
            </a>
            <span class="date">2024-03-01</span>
          </li>
          <li data-year="2024">
            <a href="/news/2" class="news-link">
              New Media Lab Opens
            </a>
            <span class="date">2024-02-15</span>
          </li>
          <li data-year="2023">
            <a href="/news/3" class="news-link">
              Annual Conference Report
            </a>
            <span class="date">2023-12-10</span>
          </li>
        </ul>
      </section>

      <section id="staff">
        <h2>Academic Staff</h2>
        <table class="staff-table">
          <thead>
            <tr>
              <th>Name</th>
              <th>Title</th>
              <th>Email</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td class="name">Prof. Chan Tai Man</td>
              <td class="title">Professor</td>
              <td class="email">
                <a href="mailto:[email protected]">[email protected]</a>
              </td>
            </tr>
            <tr>
              <td class="name">Dr. Lee Siu Fong</td>
              <td class="title">Associate Professor</td>
              <td class="email">
                <a href="mailto:[email protected]">[email protected]</a>
              </td>
            </tr>
            <tr>
              <td class="name">Dr. Wong Ka Wai</td>
              <td class="title">Assistant Professor</td>
              <td class="email">
                <a href="mailto:[email protected]">[email protected]</a>
              </td>
            </tr>
          </tbody>
        </table>
      </section>
    </main>

    <footer>
      <p class="copyright">© 2024 HKBU. All rights reserved.</p>
    </footer>
  </body>
</html>
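
To try expressions against this document outside the browser, here is a minimal Python sketch, assuming `lxml` is installed. It embeds an abridged copy of the news section only, and the helper name `x` is a hypothetical stand-in for the console's `$x()`, not part of any library.

```python
from lxml import html

# Abridged copy of the news section of the practice document.
PRACTICE_HTML = """
<html><body>
  <h1 class="site-title">HKBU Communication Studies</h1>
  <section id="news">
    <h2>Latest News</h2>
    <ul class="news-list">
      <li data-year="2024">
        <a href="/news/1" class="news-link"> AI in Journalism Research </a>
        <span class="date">2024-03-01</span>
      </li>
      <li data-year="2024">
        <a href="/news/2" class="news-link"> New Media Lab Opens </a>
        <span class="date">2024-02-15</span>
      </li>
      <li data-year="2023">
        <a href="/news/3" class="news-link"> Annual Conference Report </a>
        <span class="date">2023-12-10</span>
      </li>
    </ul>
  </section>
</body></html>
"""
tree = html.fromstring(PRACTICE_HTML)


def x(expr):
    """Hypothetical helper: evaluate an XPath expression on the practice tree."""
    return tree.xpath(expr)


links = x('//a[@class="news-link"]/@href')
# Link text carries surrounding whitespace in the source, so strip it.
titles_2024 = [t.strip() for t in x('//li[@data-year="2024"]/a/text()')]
dates = x('//ul[@class="news-list"]/li/span[@class="date"]/text()')
heading = x('string(//section[@id="news"]/h2)')

print(links)
print(titles_2024)
print(dates)
print(heading)
```

Unlike the browser console, `lxml` returns plain Python strings for `text()` and `@attr` expressions, which makes the results easy to print or store.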


Practice Exercises

Four levels of difficulty — from basic selection to real-world scraping patterns.

Level 1: Warm-Up. Basic node selection, no filters needed.

Select all <h2> headings on the page

Get the text content of the page <title>

Select all <a> tags that have an href attribute

Get the src attribute value of every <img> tag

Use the Playground above to test your answers before revealing them.
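
Once you have tried the Level 1 tasks yourself, one possible set of answers can be checked in Python. This sketch assumes `lxml` is installed and uses a small hypothetical page (the practice document has no `<img>` tags, so one is invented here); other expressions may solve the same tasks equally well.

```python
from lxml import html

# Hypothetical page, invented so that every Level 1 task has something to match.
page = html.fromstring("""
<html>
  <head><title>Demo Page</title></head>
  <body>
    <h2>First</h2>
    <a href="/one">One</a>
    <a name="anchor-only">No link</a>
    <img src="/img/logo.png" alt="Logo">
    <h2>Second</h2>
    <img src="/img/banner.jpg" alt="Banner">
  </body>
</html>
""")

# 1. All <h2> headings (elements; read their text in Python).
h2_texts = [h.text_content() for h in page.xpath('//h2')]

# 2. Text content of the page <title>.
title = page.xpath('//title/text()')

# 3. <a> tags that have an href attribute (the second <a> has none).
linked = [a.text for a in page.xpath('//a[@href]')]

# 4. src attribute value of every <img>.
img_srcs = page.xpath('//img/@src')

print(h2_texts, title, linked, img_srcs)
```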

Real Structure · LIHKG 吹水台 (Chit-Chat board)

Practice on a Real Hong Kong Website

This page replicates the HTML structure of lihkg.com/category/1, with the same hashed class names and the same DOM tree, so XPath expressions you write here should carry over to the real site for as long as its markup stays unchanged.

Open LIHKG Practice →

Tools & Workflow

The recommended workflow: test in browser → transfer to Python.

Chrome DevTools → XPath Helper → Python lxml → Production Scraper

🔍

Chrome DevTools

Built-in browser console

No install required

Steps

1. Open any webpage, then press F12 (or right-click → "Inspect")
2. Go to the Console tab
3. Type: $x("//your/xpath")
4. Press Enter; matched elements appear below
5. Hover over results to highlight them on the page

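// In Chrome Console ($x returns DOM nodes, not plain strings):
$x('//h1/text()')
// → one Text node; its textContent is "HKBU Communication Studies"

$x('//a[@class="news-link"]/@href')
// → three Attr nodes; their values are "/news/1", "/news/2", "/news/3"

$x('//li[contains(@class,"active")]')
// → [li.active] on a page with an active item; [] on the practice document, where no <li> has a class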

Tip: This is the fastest way to test XPath. Always start here before writing Python.

🧩

XPath Helper

Chrome Extension

Free extension

Steps

1. Search "XPath Helper" in Chrome Web Store and install
2. Press Ctrl+Shift+X to open the XPath Helper panel
3. Type your XPath; matching elements highlight instantly
4. Hold Shift and hover any element to auto-generate its XPath
5. Copy the XPath directly into your Python code

// XPath Helper shows:
// Expression: //td[@class="name"]/text()
// Results:
//   "Prof. Chan Tai Man"
//   "Dr. Lee Siu Fong"
//   "Dr. Wong Ka Wai"
// Count: 3

Tip: The Shift+hover feature auto-generates XPath for any element — great for getting started quickly.

🐍

Python lxml

For production scraping

pip install lxml

Steps

1. Install: pip install requests lxml
2. Fetch the page with requests.get()
3. Parse with lxml.html.fromstring(response.content)
4. Apply the .xpath() method with your expression
5. Results are Python lists: iterate over them or zip() them

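import requests
from lxml import html

url = "https://www.comm.hkbu.edu.hk/..."
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error page
tree = html.fromstring(resp.content)

# Extract names and mailto: links from the staff table
names  = [n.strip() for n in tree.xpath('//td[@class="name"]/text()')]
emails = tree.xpath('//td[@class="email"]/a/@href')

# zip() pairs by position, so it assumes every row has both a name and an email
for name, email in zip(names, emails):
    print(name, email.replace("mailto:", ""))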

Tip: Always test your XPath in Chrome DevTools first, then paste it directly into tree.xpath().
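
Pairing two flat lists with `zip()` works only while every row has every field. A more defensive variant, sketched below under the assumption that `lxml` is installed, loops over the table rows and evaluates relative expressions (starting with `.`) inside each one. The HTML is an inline stand-in for a fetched page, and the `example.org` addresses are hypothetical placeholders, not the real staff emails.

```python
from lxml import html

# Inline stand-in for a fetched page; the second row deliberately has no email.
STAFF_HTML = """
<table class="staff-table">
  <tbody>
    <tr>
      <td class="name">Prof. Chan Tai Man</td>
      <td class="title">Professor</td>
      <td class="email"><a href="mailto:chan@example.org">chan@example.org</a></td>
    </tr>
    <tr>
      <td class="name">Dr. Lee Siu Fong</td>
      <td class="title">Associate Professor</td>
      <td class="email"></td>
    </tr>
    <tr>
      <td class="name">Dr. Wong Ka Wai</td>
      <td class="title">Assistant Professor</td>
      <td class="email"><a href="mailto:wong@example.org">wong@example.org</a></td>
    </tr>
  </tbody>
</table>
"""
table = html.fromstring(STAFF_HTML)

staff = []
for row in table.xpath('//table[@class="staff-table"]/tbody/tr'):
    # The leading "." keeps each expression inside the current row.
    name = row.xpath('string(./td[@class="name"])').strip()
    title = row.xpath('string(./td[@class="title"])').strip()
    hrefs = row.xpath('./td[@class="email"]/a/@href')
    email = hrefs[0].replace("mailto:", "") if hrefs else None
    staff.append({"name": name, "title": title, "email": email})

# For comparison: the flat list has only two entries for three rows,
# so zip() would silently pair Dr. Lee with Dr. Wong's address.
flat_emails = table.xpath('//td[@class="email"]/a/@href')

for person in staff:
    print(person)
```

Row-wise extraction costs a few more lines, but a missing cell becomes an explicit `None` rather than a silent misalignment.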

Common Mistakes & Fixes

Common Mistake	Problem	Fix
`@class="nav active"`	Exact match fails when element has multiple classes	`contains(@class, "nav")`
`/html/body/div[3]/ul/li`	Absolute path breaks when layout changes	`//ul[@class="nav-list"]/li`
`//h1`	Returns the element object, not its string value	`//h1/text()`
`//li[0]`	XPath is 1-indexed, not 0-indexed like Python, so `[0]` matches nothing	`//li[1]` for the first element
XPath returns `[]` in Python	Page content is loaded by JavaScript (dynamic), so it is absent from the fetched HTML	Use Selenium or Playwright to render the JavaScript first
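
Several of these mistakes can be reproduced in a few lines. The sketch below assumes `lxml` is installed and uses a hypothetical snippet; it covers the indexing, element-versus-text, and class-matching rows of the table, and it leaves out the JavaScript case, which needs a real browser.

```python
from lxml import html

# Hypothetical snippet, invented to trigger the mistakes in the table.
doc = html.fromstring("""
<html><body>
  <h1>Site Title</h1>
  <ul class="nav-list">
    <li class="nav active">Home</li>
    <li class="nav">News</li>
  </ul>
</body></html>
""")

# Zero index: no node has position 0, so the result is an empty list (no error).
zero_index = doc.xpath('//li[0]')
first_item = doc.xpath('//li[1]/text()')

# Element vs text: //h1 yields an element object, //h1/text() yields strings.
h1_nodes = doc.xpath('//h1')
h1_text = doc.xpath('//h1/text()')

# Exact class match misses the item that also carries "active".
exact_nav = doc.xpath('//li[@class="nav"]/text()')
all_nav = doc.xpath('//li[contains(@class, "nav")]/text()')

print(zero_index, first_item)
print(type(h1_nodes[0]).__name__, h1_text)
print(exact_nav, all_nav)
```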