XPath Practice Lab — HKBU Comm

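OpenRice Scraper
XPath + requests + lxml + Cookie
⭐⭐ Intermediate · 🍪 Cookie required · SICSS Singapore 2023 · 2-Part Scraper
OpenRice Hong Kong Restaurants
XPath Scraping Tutorial
This tutorial is adapted from the SICSS Singapore 2023 workshop code and has two parts: Part A scrapes the restaurant search results list (10 fields), and Part B scrapes each restaurant's detail page and all of its user reviews. Note: the bookmark page is rendered dynamically with JavaScript, so the tutorial uses the search page, which is static HTML. OpenRice requires a login cookie for normal access.
Cookie required to run
1. Log in to openrice.com in Chrome
2. Press F12, open the Network tab, and refresh the page
3. Click any openrice.com request, open Headers, and find the Cookie: line
4. Copy the entire value and paste it into COOKIE_STR in Cell 3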
TARGET URL: https://www.openrice.com/zh/hongkong/restaurants?page={N}
PART A: XPATH FIELDS (restaurant search list, 10 fields)
Field	XPath Expression	Syntax focus
Restaurant name	`.//div[@class='poi-name poi-list-cell-link']/text()`	Exact class match
Restaurant link	`.//a[@class='poi-list-cell-desktop-right-link-overlay']/@href`	@href attribute extraction
Address	`.//div[@class='poi-list-cell-desktop-right-top-wrapper-main']/div[2]/text()`	Positional index [2]
District	`.//span[contains(@class,'poi-list-cell-line-info-link')][1]//text()`	contains() + positional predicate [1]
Cuisine	`.//span[contains(@class,'poi-list-cell-line-info-link')][2]//text()`	contains() + positional predicate [2]
Price per person	`.//span[contains(@class,'poi-list-cell-line-info-link')][3]//text()`	contains() + positional predicate [3]
Smile count	`.//div[@class='smile icon-wrapper big-score']//div[@class='text']/text()`	Descendant search with //
Cry count	`.//div[@class='cry icon-wrapper']//div[@class='text']/text()`	Descendant search with //
Bookmark count	`.//div[@class='tbb-count']/text()`	Exact class match
Tags	`.//span[@class='desktop-poi-tag']/text()`	Multi-node extraction
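The syntax points in the table can be tried offline before touching the live site. The sketch below runs the same expressions against a small invented HTML fragment; the markup is an assumption made for illustration and may not match what OpenRice actually serves, and `info_text` is a hypothetical helper name.

```python
from lxml import etree

# Hypothetical markup for illustration only; the real OpenRice page may differ.
SAMPLE_HTML = """
<div class="poi-list-cell poi-list-cell-desktop-container">
  <div class="poi-name poi-list-cell-link">Sample Cafe</div>
  <span class="poi-list-cell-line-info-link extra"><span class="icon"></span>Mong Kok</span>
  <span class="poi-list-cell-line-info-link">Japanese</span>
  <span class="poi-list-cell-line-info-link">$101-200</span>
  <span class="desktop-poi-tag">Quiet</span>
  <span class="desktop-poi-tag">Dessert</span>
</div>
"""

root = etree.HTML(SAMPLE_HTML)
card = root.xpath("//div[@class='poi-list-cell poi-list-cell-desktop-container']")[0]

# Exact match: @class must equal the whole attribute string.
name = card.xpath(".//div[@class='poi-name poi-list-cell-link']/text()")

# Exact match misses the first span because its class has an extra token.
exact_spans = card.xpath(".//span[@class='poi-list-cell-line-info-link']")
# contains() matches any class attribute that includes the substring.
loose_spans = card.xpath(".//span[contains(@class,'poi-list-cell-line-info-link')]")


def info_text(card, position):
    """Return the last non-empty text inside the N-th info-link span (1-based)."""
    path = f".//span[contains(@class,'poi-list-cell-line-info-link')][{position}]//text()"
    texts = [t.strip() for t in card.xpath(path) if t.strip()]
    return texts[-1] if texts else ''


district = info_text(card, 1)
cuisine = info_text(card, 2)

# Multi-node extraction: one text node per matching span.
tags = card.xpath(".//span[@class='desktop-poi-tag']/text()")

print(name, len(exact_spans), len(loose_spans), district, cuisine, tags)
```

An exact `@class='...'` test is stricter than `contains()`, so it stops matching as soon as the site adds or reorders a class token; `contains()` is more tolerant but can also match unrelated classes that share the substring.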
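Setup 📦 Cell 1: Install packages
Install jsonlines (reads and writes the .jsonl format), tqdm (progress bars), and lxml (the XPath parsing engine).
# Install the required packages
!pip install jsonlines   # read/write JSON Lines (one JSON object per line)
!pip install tqdm        # progress bar showing how much has been scraped
!pip install lxml        # fast XML/HTML parser with XPath support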
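Import 📚 Cell 2: Import packages
Import all the Python packages the notebook needs.
import requests      # send HTTP requests
import time           # sleep() to space out requests
import csv            # save results as CSV
import jsonlines      # read/write .jsonl files
from tqdm import tqdm           # progress bar
from lxml import etree          # XPath parsing engine
from google.colab import files  # Colab download helper (only available inside Colab)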
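The `google.colab` import above only succeeds inside Google Colab. If the notebook may also be run locally, one possible pattern is an optional import with a flag. This is a sketch, not part of the workshop code: `IN_COLAB` and `maybe_download` are hypothetical names introduced here.

```python
try:
    from google.colab import files  # available only inside Google Colab
    IN_COLAB = True
except ImportError:
    files = None
    IN_COLAB = False


def maybe_download(filename):
    """Trigger a Colab download when possible; return True if one was triggered."""
    if IN_COLAB:
        files.download(filename)
        return True
    # Outside Colab the file simply stays in the working directory.
    print(f"Not running in Colab; {filename} is saved in the working directory.")
    return False
```

With this guard, the remaining cells can run unchanged on a local Jupyter installation, and only the download step behaves differently.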
Cookie ⚙️ Cell 3: Cookie setup (required)
OpenRice requires a login cookie to access the restaurant list normally (requests without one are redirected). Follow the steps below to obtain your cookie.
# ═══════════════════════════════════════════════════════
# How to get the cookie (required reading):
# 1. Log in to https://www.openrice.com in Chrome
# 2. Press F12 → Network tab → refresh the page
# 3. Click any openrice.com request → Headers → find "Cookie:"
# 4. Copy the whole Cookie value and paste it into COOKIE_STR below
# ═══════════════════════════════════════════════════════

COOKIE_STR = "PASTE_YOUR_COOKIE_HERE"  # ← paste your cookie here

def process_cookie(cookie_str):
    """
    Convert a browser cookie string into a Python dict.

    Input:  "key1=val1; key2=val2; key3=val3"
    Output: {"key1": "val1", "key2": "val2", "key3": "val3"}

    Why: the cookies= argument of requests.get() expects a dict (or CookieJar),
    not the raw header string.
    """
    real_cookie = {}
    for each_kv in cookie_str.split(';'):
        each_kv = each_kv.strip()
        if '=' in each_kv:
            # Split on the first '=' only, so values that contain '=' stay intact
            k, v = each_kv.split('=', 1)
            real_cookie[k.strip()] = v.strip()
    return real_cookie

COOKIES = process_cookie(COOKIE_STR)

# HTTP headers: imitate a real Chrome browser to reduce the chance of being blocked
HEADERS = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'accept-language': 'zh-HK,zh;q=0.9,en;q=0.8',
    'referer': 'https://www.openrice.com/zh/hongkong/restaurants',
}

# Settings
OPENRICE_BASE_URL = 'https://www.openrice.com'  # prefix for relative URLs
SEARCH_URL        = 'https://www.openrice.com/zh/hongkong/restaurants'  # restaurant search page
MAX_PAGES         = 5                           # pages to scrape (about 15 restaurants per page)

import random
def sleep_random(min_s=1.0, max_s=2.5):
    """Sleep for a random min_s to max_s seconds to mimic human browsing and avoid anti-scraping blocks."""
    time.sleep(random.uniform(min_s, max_s))

print("✅ Cookie setup complete")
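To see what the cookie conversion in this cell produces, the sketch below parses a made-up cookie string, including a value that itself contains `=` and a fragment with no `=` at all. `parse_cookie_header` and `is_placeholder` are hypothetical names used only for this illustration, and the cookie values are invented.

```python
def parse_cookie_header(cookie_str):
    """Split a 'k1=v1; k2=v2' header value into a dict, keeping '=' inside values."""
    cookies = {}
    for pair in cookie_str.split(';'):
        # partition() splits at the first '=' only and never raises.
        key, sep, value = pair.strip().partition('=')
        if sep and key:  # skip fragments without '=' or without a name
            cookies[key.strip()] = value.strip()
    return cookies


def is_placeholder(cookie_str):
    """Return True when the cookie string was never filled in."""
    return cookie_str.strip() in ('', 'PASTE_YOUR_COOKIE_HERE')


# Made-up cookie string for illustration; real values come from your browser.
example = "sessionid=abc123; token=a=b==; lang=zh-HK; malformed"
parsed = parse_cookie_header(example)
print(parsed)
print(is_placeholder("PASTE_YOUR_COOKIE_HERE"))
```

Checking for the placeholder before sending any request gives a clearer error than an unexpected redirect later in the run.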
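XPath Core 🔍 Cell 4: Scrape the restaurant search list (XPath core)
Use XPath to extract the restaurant name, URL, address, district, cuisine, price per person, smile and cry counts, bookmark count, and tags from the OpenRice restaurant search page. Note: the bookmark page is rendered with JavaScript, so the static-HTML search page is used instead.
# ═══════════════════════════════════════════════════════
# Target URL:
# https://www.openrice.com/zh/hongkong/restaurants?page=N
# Note: the bookmark page (bookmarkrestaurant.htm) is rendered by JavaScript,
#       so requests cannot fetch its data; use the static-HTML search page instead
# ═══════════════════════════════════════════════════════

print("=" * 60)
print("Part A: scraping the OpenRice restaurant search list")
print(f"Pages: 1 to {MAX_PAGES} (about 15 restaurants per page)")
print("=" * 60)

bookmark_records = []  # collected restaurant records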

for page_num in tqdm(range(1, MAX_PAGES + 1), desc="Search pages"):
    sleep_random()  # pause before each request to avoid anti-scraping blocks

    # Build the paginated URL (the page parameter sets the page number)
    url = f'{SEARCH_URL}?page={page_num}'
    response = requests.get(url, headers=HEADERS, cookies=COOKIES)
    
    # Parse the HTML into an lxml element tree so XPath can run on it
    html = etree.HTML(response.text)

    # ─── XPath 1: select all restaurant cards ───────────────────
    # Exact match on the full class attribute (two class values)
    cards = html.xpath("//div[@class='poi-list-cell poi-list-cell-desktop-container']")
    print(f"  Page {page_num}: found {len(cards)} restaurant cards")

    for card in cards:
        rec = {}

        # ─── XPath 2: restaurant name ───────────────────────────
        # Exact match on a div with two class values
        name_nodes = card.xpath(".//div[@class='poi-name poi-list-cell-link']/text()")
        rec['name'] = name_nodes[0].strip() if name_nodes else ''

        # ─── XPath 3: link to the restaurant detail page ────────
        # Uses the overlay link (a transparent link covering the whole card)
        url_nodes = card.xpath(".//a[@class='poi-list-cell-desktop-right-link-overlay']/@href")
        rec['url'] = OPENRICE_BASE_URL + url_nodes[0] if url_nodes else ''

        # ─── XPath 4: address (second child div) ────────────────
        addr_nodes = card.xpath(".//div[@class='poi-list-cell-desktop-right-top-wrapper-main']/div[2]/text()")
        rec['address'] = addr_nodes[0].strip() if addr_nodes else ''

        # ─── XPath 5: district (first info-link) ────────────────
        # contains() is used because the class attribute may hold several values
        # //text() collects every descendant text node (including text inside child <span> elements)
        # After dropping blanks, keep the last non-empty text (skips the icon span's empty text)
        district_nodes = [t.strip() for t in card.xpath(".//span[contains(@class,'poi-list-cell-line-info-link')][1]//text()") if t.strip()]
        rec['district'] = district_nodes[-1] if district_nodes else ''

        # ─── XPath 6: cuisine (second info-link) ────────────────
        cuisine_nodes = [t.strip() for t in card.xpath(".//span[contains(@class,'poi-list-cell-line-info-link')][2]//text()") if t.strip()]
        rec['cuisine'] = cuisine_nodes[-1] if cuisine_nodes else ''

        # ─── XPath 7: price per person (third info-link) ────────
        price_nodes = [t.strip() for t in card.xpath(".//span[contains(@class,'poi-list-cell-line-info-link')][3]//text()") if t.strip()]
        rec['price'] = price_nodes[-1] if price_nodes else ''

        # ─── XPath 8: smile count (positive ratings) ────────────
        # Descendant search with // finds the text div inside the big-score container
        smile_nodes = card.xpath(".//div[@class='smile icon-wrapper big-score']//div[@class='text']/text()")
        rec['smile_count'] = smile_nodes[0].strip() if smile_nodes else '0'

        # ─── XPath 9: cry count (negative ratings) ──────────────
        cry_nodes = card.xpath(".//div[@class='cry icon-wrapper']//div[@class='text']/text()")
        rec['cry_count'] = cry_nodes[0].strip() if cry_nodes else '0'

        # ─── XPath 10: bookmark count ───────────────────────────
        bookmark_nodes = card.xpath(".//div[@class='tbb-count']/text()")
        rec['bookmark_count'] = bookmark_nodes[0].strip() if bookmark_nodes else '0'

        # ─── XPath 11: restaurant tags (multiple) ───────────────
        # Collect the text of every desktop-poi-tag span and join with commas
        tag_nodes = card.xpath(".//span[@class='desktop-poi-tag']/text()")
        rec['tags'] = ', '.join([t.strip() for t in tag_nodes if t.strip()])

        if rec['name']:  # keep only records that have a name
            bookmark_records.append(rec)

print(f"\n✅ Part A complete: collected {len(bookmark_records)} restaurants")
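The cell above builds page URLs with an f-string and turns relative links into absolute ones by string concatenation. A slightly more defensive variant, sketched here with the standard library only, uses `urlencode` and `urljoin`. `page_url` and `absolute_url` are hypothetical helper names, and the restaurant path in the example is invented.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = 'https://www.openrice.com'
SEARCH_URL = 'https://www.openrice.com/zh/hongkong/restaurants'


def page_url(page_num):
    """Build the search URL for one results page."""
    return f"{SEARCH_URL}?{urlencode({'page': page_num})}"


def absolute_url(href):
    """Resolve an href against the site root; absolute hrefs pass through unchanged."""
    return urljoin(BASE_URL, href) if href else ''


# The restaurant path below is a made-up example, not a real listing.
relative = absolute_url('/zh/hongkong/r-sample-r000000')
already_absolute = absolute_url('https://www.openrice.com/zh/hongkong/r-sample-r000000')

print(page_url(3))
print(relative)
print(already_absolute)
```

Plain concatenation would produce a doubled host if the site ever returned an absolute `href`; `urljoin` handles both cases.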
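Export 💾 Cell 5: Save the restaurant list as CSV and download it
Save the restaurant search list as a CSV file and trigger a download automatically in Google Colab.
PART_A_CSV = 'openrice_restaurants.csv'

# encoding='utf-8-sig' prepends a BOM (Byte Order Mark)
# so that Excel on Windows displays Chinese characters correctly
if not bookmark_records:
    print("⚠️ No data; run Cell 4 first")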
else:
    fieldnames = ['name', 'url', 'address', 'district', 'cuisine', 'price',
                  'smile_count', 'cry_count', 'bookmark_count', 'tags']
    with open(PART_A_CSV, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        writer.writerows(bookmark_records)

    print(f"✅ Saved {len(bookmark_records)} restaurants to {PART_A_CSV}")

    # ── Download the CSV (JavaScript trigger plus a fallback link) ──
    def colab_download(filename):
        """
        Download a file from Google Colab.
        Method 1: trigger a browser download directly with JavaScript
        Method 2: also display a fallback link that can be clicked manually
        (files.download() is the standard Colab approach; it is not used here.)
        """
        import base64
        from IPython.display import display, Javascript, HTML

        # Method 1: trigger the download with JavaScript
        with open(filename, 'rb') as f:
            data = base64.b64encode(f.read()).decode('utf-8')
        # Fill the placeholders with .format(); the JavaScript has no braces of its own,
        # so nothing needs escaping
        js = """
        var link = document.createElement('a');
        link.href = 'data:text/csv;base64,{data}';
        link.download = '{filename}';
        document.body.appendChild(link);
        link.click();
        document.body.removeChild(link);
        """.format(data=data, filename=filename)
        display(Javascript(js))
        print(f"📥 Download triggered: {filename}")

        # Method 2: also show a fallback link (click it if the browser blocks the download)
        html_link = '<a href="data:text/csv;base64,' + data + '" download="' + filename + '" style="color:#4ade80;font-family:monospace">⬇️ Click here to download ' + filename + ' manually</a>'
        display(HTML(html_link))

    colab_download(PART_A_CSV)
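The effect of `encoding='utf-8-sig'` can be checked directly. The sketch below writes a one-row CSV to a temporary directory, confirms that the file starts with the three BOM bytes, and reads it back. The record and the file name `bom_demo.csv` are made up for the example.

```python
import csv
import os
import tempfile

rows = [{'name': '樣本餐廳', 'district': 'Mong Kok'}]  # made-up record

path = os.path.join(tempfile.mkdtemp(), 'bom_demo.csv')
with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'district'])
    writer.writeheader()
    writer.writerows(rows)

# The UTF-8 BOM is the byte sequence EF BB BF at the very start of the file.
with open(path, 'rb') as f:
    first_bytes = f.read(3)

# Reading with utf-8-sig strips the BOM again, so the header stays 'name'.
with open(path, newline='', encoding='utf-8-sig') as f:
    read_back = list(csv.DictReader(f))

print(first_bytes == b'\xef\xbb\xbf', read_back)
```

When loading the file later in Python or pandas, pass `encoding='utf-8-sig'` as well; reading it as plain `utf-8` leaves the BOM character attached to the first column name.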
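🚀 How to use
1. Log in to OpenRice and obtain your cookie as described above
2. Open Google Colab and create a new notebook
3. Run Part A first (Cells 1–5) to download the restaurant search list CSV
4. Then run Part B (Cells 6–8) to download the details and reviews CSV
5. ⚠️ Note: the cookie is valid for about 7 days; obtain a new one after it expires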