XPath Practice Lab — HKBU Comm

JobsDB Scraper: XPath + Selenium + lxml
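⭐⭐⭐ Advanced · React SPA · Selenium Required · data-automation Attributes
JobsDB Hong Kong Media Jobs
XPath Scraping Tutorial
JobsDB is a React single-page application (SPA), so its HTML can only be extracted after Selenium has executed the page's JavaScript. The site does, however, expose stable data-automation attributes (such as jobTitle and jobCompany), which make better XPath targets than class names.
⚡ Key difference from RTHK / OpenRice
The RTHK and OpenRice labs fetch HTML directly with requests. JobsDB is a React SPA, so Selenium must render the JavaScript first, and driver.page_source is then passed to lxml for XPath parsing. This is the core distinction between static and dynamic scraping.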
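The difference can be illustrated offline. The sketch below runs the same card XPath against two hand-written HTML strings: one resembling the empty shell a plain HTTP fetch of a SPA might return, and one resembling the markup after rendering. Both strings are invented for illustration and are not real JobsDB responses.

```python
from lxml import etree

CARD_XPATH = '//article[starts-with(@id, "jobsdb")]'

# Assumed shape of an un-rendered SPA page: an empty mount point plus a script tag.
STATIC_SHELL = (
    '<html><body><div id="app"></div>'
    '<script src="/bundle.js"></script></body></html>'
)

# Assumed shape of driver.page_source after React has rendered one job card.
RENDERED = """<html><body><div id="app">
  <article id="jobsdb-hk-job-ad-0001">
    <a data-automation="jobTitle">Reporter</a>
  </article>
</div></body></html>"""

cards_before = etree.HTML(STATIC_SHELL).xpath(CARD_XPATH)
cards_after = etree.HTML(RENDERED).xpath(CARD_XPATH)

print(f"cards in static shell:  {len(cards_before)}")
print(f"cards in rendered page: {len(cards_after)}")
```

The XPath expression is identical in both cases; only the HTML it is given changes, which is why the rendering step has to come first.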
TARGET URL: https://hk.jobsdb.com/jobs-in-media-communications
XPATH FIELDS: 10 scrapable fields
Field	XPath Expression	Syntax note
Job title	`.//a[@data-automation="jobTitle"]/text()`	data-automation attribute
Company name	`.//a[@data-automation="jobCompany"]/text()`	data-automation attribute
Location	`.//a[@data-automation="jobLocation"]/text()`	data-automation attribute
Salary range	`.//span[@data-automation="jobSalary"]/text()`	span + data-automation
Work type	`.//span[@data-automation="jobWorkType"]/text()`	work type label
Listing date	`.//span[@data-automation="jobListingDate"]/text()`	date text
Job link	`.//a[@data-automation="jobTitle"]/@href`	selecting the @href attribute
Company logo	`.//img[@data-automation="company-logo"]/@src`	selecting the @src attribute
Short description	`.//span[@data-automation="jobShortDescription"]/text()`	multi-line text
Job ID	`@id`	id attribute of the article card itself
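One detail worth checking before relying on these expressions: /text() returns only the text nodes that are direct children of the matched element, while //text() and string() also pick up text inside nested tags. The following is a minimal sketch on a hand-written fragment; the markup, job ID, and company name are invented for illustration and are not taken from JobsDB.

```python
from lxml import etree

# Hypothetical card in which the title text is wrapped in nested tags.
SAMPLE_HTML = """
<article id="jobsdb-hk-job-ad-0001">
  <a data-automation="jobTitle" href="/job/0001"><span>Video <b>Editor</b></span></a>
  <a data-automation="jobCompany">Example Media Ltd</a>
</article>
"""

tree = etree.HTML(SAMPLE_HTML)
card = tree.xpath('//article[starts-with(@id, "jobsdb")]')[0]

# /text(): direct child text nodes only (none here, the text sits in <span>/<b>)
direct = card.xpath('.//a[@data-automation="jobTitle"]/text()')

# //text(): every descendant text node, joined in Python
nested = card.xpath('.//a[@data-automation="jobTitle"]//text()')
title = "".join(nested).strip()

# string(): the concatenated text of the first matching element
company = card.xpath('string(.//a[@data-automation="jobCompany"])').strip()

# Attribute selection returns a list of attribute values
href = card.xpath('.//a[@data-automation="jobTitle"]/@href')[0]

print(direct, title, company, href)
```

If the real markup has no nested tags, both forms return the same text, so //text() is the more forgiving choice.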
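Setup 📦 Cell 1: Install packages
JobsDB is rendered with React, so Selenium is needed to execute the JavaScript before the HTML can be extracted. Running Selenium in Google Colab requires installing Chromium and ChromeDriver separately.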
# Install the required packages (Google Colab setup)
!pip install selenium lxml beautifulsoup4 requests

# Install Chromium and ChromeDriver in Colab
!apt-get update -qq
!apt-get install -y chromium-browser chromium-chromedriver -qq

# Confirm that the installation succeeded
import subprocess
result = subprocess.run(['chromium-browser', '--version'], capture_output=True, text=True)
print(f"✅ Chrome version: {result.stdout.strip()}")
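If the version line prints empty, it can help to confirm that both binaries are actually on the PATH. This is a small sketch using only the standard library; the helper name find_missing_binaries is hypothetical, and the binary names are assumed to match what the apt packages above install.

```python
import shutil


def find_missing_binaries(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed binary names for the Colab install above.
missing = find_missing_binaries(["chromium-browser", "chromedriver"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Chromium and ChromeDriver are both on PATH")
```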
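Import 📚 Cell 2: Import packages
Import Selenium, lxml, and the other packages used in this lab. Chrome will run in headless mode, meaning the browser runs in an environment without a display; that mode is configured in Cell 3.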
from selenium import webdriver                          # browser automation
from selenium.webdriver.chrome.options import Options   # Chrome settings
from selenium.webdriver.common.by import By             # element locator strategies
from selenium.webdriver.support.ui import WebDriverWait  # wait for elements to appear
from selenium.webdriver.support import expected_conditions as EC
from lxml import etree                                  # XPath parsing engine
import csv                                              # save as CSV
import time                                             # control waiting time
import random                                           # random pauses (anti-bot courtesy)
from datetime import datetime                           # record the scrape time
from google.colab import files                          # automatic download in Colab

print("✅ All packages imported")
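Config ⚙️ Cell 3: Configure the Selenium Chrome driver
Configure Chrome's headless mode and several anti-detection options. JobsDB checks for automated browsers, so the driver is set up to resemble a regular user's browser.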
# ═══════════════════════════════════════════
# Configuration
# ═══════════════════════════════════════════

# Target URL: JobsDB Hong Kong media jobs
TARGET_URL = "https://hk.jobsdb.com/jobs-in-media-communications"

# Output CSV file name
OUTPUT_FILE = "jobsdb_media_jobs.csv"

# Maximum number of jobs to scrape per page
MAX_JOBS = 100

def create_driver():
    """
    Create a Selenium Chrome driver in headless mode.

    Returns:
        WebDriver: a configured Chrome browser instance
    """
    # Chrome options
    options = Options()

    # Headless mode: no browser window (required in Colab)
    options.add_argument("--headless=new")

    # Options needed to run reliably inside the Colab container
    options.add_argument("--no-sandbox")                    # required in Colab
    options.add_argument("--disable-dev-shm-usage")        # avoid shared-memory problems

    # Options that make the browser look less like an automated one
    options.add_argument("--disable-blink-features=AutomationControlled")  # hide the automation flag
    options.add_argument("--window-size=1920,1080")         # mimic a real screen size
    options.add_argument(
        "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )

    # Path to the Chromium browser binary (Colab install location)
    options.binary_location = "/usr/bin/chromium-browser"

    # Create the driver
    driver = webdriver.Chrome(options=options)

    # Hide navigator.webdriver. The script is registered through CDP so that it
    # runs on every new document; a plain execute_script() call at this point
    # would be lost as soon as driver.get() loads the target page.
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
    )

    return driver

print("✅ Chrome driver factory defined")
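XPath Core 🔍 Cell 4: XPath parsing function (core)
JobsDB marks up its job cards with stable data-automation attributes. They are presumably there for the site's own automated testing rather than for scrapers, but they change far less often than class names, so the XPath expressions select elements through @data-automation.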
def parse_job_listings(html_content):
    """
    Parse a JobsDB job listing page with XPath.

    JobsDB HTML structure (after React rendering):
    <article id="jobsdb-hk-job-ad-XXXXXXXX">
      <a data-automation="jobTitle">job title</a>
      <a data-automation="jobCompany">company name</a>
      <a data-automation="jobLocation">location</a>
      <span data-automation="jobSalary">salary range</span>
      <span data-automation="jobWorkType">work type</span>
      <span data-automation="jobListingDate">listing date</span>
    </article>
    """
    # Parse the HTML string into an lxml tree
    tree = etree.HTML(html_content)

    # ─── XPath 1: select every job card ──────────────────────
    # starts-with(@id, "jobsdb") keeps only the article elements whose id
    # begins with "jobsdb", i.e. the job cards.
    # starts-with() is an XPath string function; it works like contains()
    # but only matches at the beginning of the string.
    job_cards = tree.xpath('//article[starts-with(@id, "jobsdb")]')

    print(f"   Found {len(job_cards)} jobs")

    results = []

    for card in job_cards:
        # ─── XPath 2: job ID (from the article's id attribute) ──
        # @id reads the id attribute of the article element itself
        job_id_list = card.xpath('@id')
        job_id = job_id_list[0] if job_id_list else ""

        # ─── XPath 3: job title ──────────────────────────────
        # Locate the element through @data-automation="jobTitle";
        # this attribute is more stable than a class name.
        title_list = card.xpath('.//a[@data-automation="jobTitle"]//text()')
        title = "".join(title_list).strip()

        # ─── XPath 4: company name ───────────────────────────
        company_list = card.xpath('.//a[@data-automation="jobCompany"]//text()')
        company = "".join(company_list).strip()

        # ─── XPath 5: location ───────────────────────────────
        location_list = card.xpath('.//a[@data-automation="jobLocation"]//text()')
        location = "".join(location_list).strip()

        # ─── XPath 6: salary range ───────────────────────────
        # A span element located through @data-automation="jobSalary"
        salary_list = card.xpath('.//span[@data-automation="jobSalary"]//text()')
        salary = "".join(salary_list).strip()

        # ─── XPath 7: work type (full-time / part-time / contract) ──
        work_type_list = card.xpath('.//span[@data-automation="jobWorkType"]//text()')
        work_type = "".join(work_type_list).strip()

        # ─── XPath 8: listing date ───────────────────────────
        date_list = card.xpath('.//span[@data-automation="jobListingDate"]//text()')
        listing_date = "".join(date_list).strip()

        # ─── XPath 9: job link ───────────────────────────────
        # @href reads the link attribute; relative links get the site prefix
        link_list = card.xpath('.//a[@data-automation="jobTitle"]/@href')
        link = link_list[0] if link_list else ""
        if link and not link.startswith("http"):
            link = "https://hk.jobsdb.com" + link

        # ─── XPath 10: short job description ─────────────────
        # //text() returns every text node; the whitespace is cleaned up in
        # Python by stripping each fragment and dropping the empty ones.
        desc_parts = card.xpath('.//span[@data-automation="jobShortDescription"]//text()')
        description = " ".join([p.strip() for p in desc_parts if p.strip()])

        # Record the scrape time
        scraped_at = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

        # Keep only the cards that have a title
        if title:
            results.append({
                "job_id": job_id,             # job ID
                "title": title,               # job title
                "company": company,           # company name
                "location": location,         # location
                "salary": salary,             # salary range
                "work_type": work_type,       # work type
                "listing_date": listing_date, # listing date
                "link": link,                 # job link
                "description": description,   # short description
                "scraped_at": scraped_at,     # scrape time
            })

    return results

print("✅ XPath parsing function defined")
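XPath's normalize-space() function offers an alternative way to clean up the description text inside the query itself. The sketch below compares it with the strip-and-join approach on a hand-written fragment; the markup and job ID are invented for illustration.

```python
from lxml import etree

# Hypothetical card whose description contains line breaks and a nested tag.
SAMPLE_CARD = """
<article id="jobsdb-hk-job-ad-0002">
  <span data-automation="jobShortDescription">
    Edit short videos
    <b>for social</b>   media
  </span>
</article>
"""

card = etree.HTML(SAMPLE_CARD).xpath("//article")[0]

# Option 1: collect every text node and clean up in Python.
parts = card.xpath('.//span[@data-automation="jobShortDescription"]//text()')
python_cleaned = " ".join(p.strip() for p in parts if p.strip())

# Option 2: let XPath trim the ends and collapse runs of whitespace.
xpath_cleaned = card.xpath(
    'normalize-space(.//span[@data-automation="jobShortDescription"])'
)

print(repr(python_cleaned))
print(repr(xpath_cleaned))
```

normalize-space() returns a single string rather than a list, and an empty string when the element is missing, so no list indexing is needed.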
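Fetch 🌐 Cell 5: Scrape and parse with Selenium
Open the JobsDB page with Selenium, wait for the JavaScript to finish rendering, then pass the HTML to lxml for XPath parsing. The cell also scrolls automatically to load more jobs.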
def scrape_jobsdb():
    """
    Scrape the JobsDB job listing with Selenium.

    Steps:
    1. Launch the browser and open the URL
    2. Wait for the page's JavaScript to finish rendering
    3. Scroll automatically to load more jobs
    4. Extract the HTML and parse it with XPath
    """
    driver = None

    try:
        print("🚀 Launching Chrome...")
        driver = create_driver()

        print(f"📡 Visiting: {TARGET_URL}")
        driver.get(TARGET_URL)

        # Wait up to 15 seconds for a job card to appear.
        # EC.presence_of_element_located waits until the element exists in the DOM.
        wait = WebDriverWait(driver, 15)
        wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, 'article[id^="jobsdb"]')
        ))
        print("   ✅ Page loaded")

        # ─── Scroll automatically to load more jobs ──────────
        print("   📜 Scrolling to load more jobs...")
        last_height = driver.execute_script("return document.body.scrollHeight")
        scroll_count = 0
        max_scrolls = 5  # scroll at most 5 times

        while scroll_count < max_scrolls:
            # Scroll to the bottom of the page
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

            # Wait a random 2-4 seconds for new content to load
            sleep_time = random.uniform(2.0, 4.0)
            print(f"   💤 Waiting {sleep_time:.1f} s for new content...")
            time.sleep(sleep_time)

            # If the page height did not grow, no new content was loaded
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                print("   ✅ Reached the bottom of the page, stopping")
                break

            last_height = new_height
            scroll_count += 1
            print(f"   📜 Scroll {scroll_count} done")

        # Grab the full HTML as rendered by JavaScript
        html_content = driver.page_source
        print(f"   📄 Page size: {len(html_content):,} characters")

        # Parse with XPath and cap the result at MAX_JOBS (defined in Cell 3)
        jobs = parse_job_listings(html_content)

        return jobs[:MAX_JOBS]

    except Exception as e:
        print(f"❌ Scrape failed: {e}")
        return []

    finally:
        # Always close the browser, even after an error
        if driver:
            driver.quit()
            print("   🔒 Browser closed")


# Run the scrape
print("=" * 50)
all_jobs = scrape_jobsdb()
print("=" * 50)
print(f"✅ Scraped {len(all_jobs)} jobs in total")

# Preview the first 3 jobs
if all_jobs:
    print("\n📋 Preview of the first 3 jobs:")
    for i, job in enumerate(all_jobs[:3], 1):
        print(f"  {i}. {job['title']}")
        print(f"     Company: {job['company']} | Location: {job['location']}")
        print(f"     Salary: {job['salary']} | Type: {job['work_type']}")
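The scroll loop can also be pulled out into a small function, which makes the stopping rule testable without launching a browser. This is a sketch under stated assumptions: scroll_until_stable and FakeDriver are hypothetical names, and FakeDriver only imitates the two execute_script calls the loop relies on.

```python
import time


def scroll_until_stable(driver, max_scrolls=5, pause=0.0):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    loaded = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height unchanged: nothing more to load
        last_height = new_height
        loaded += 1
    return loaded


class FakeDriver:
    """Stand-in for a Selenium driver: each scroll moves to the next height."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0

    def execute_script(self, script):
        if script.startswith("window.scrollTo"):
            # Advance until the last height, then stay there.
            if self._index < len(self._heights) - 1:
                self._index += 1
            return None
        return self._heights[self._index]


# The page grows twice (1000 -> 2000 -> 3000) and then stops growing.
print(scroll_until_stable(FakeDriver([1000, 2000, 3000])))  # prints 2
```

With a real driver, a pause such as random.uniform(2.0, 4.0) would be passed in, matching the delay used in the cell above.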
Export 💾 Cell 6: Save to CSV and download
Save the job data as a CSV file and download it automatically.
def save_and_download_csv(data, filename):
    """
    Save the data as a CSV file and trigger a download in Colab.
    """
    if not data:
        print("❌ No data to save")
        return

    fieldnames = [
        "job_id",        # job ID
        "title",         # job title
        "company",       # company name
        "location",      # location
        "salary",        # salary range
        "work_type",     # work type
        "listing_date",  # listing date
        "link",          # job link
        "description",   # short description
        "scraped_at",    # scrape time
    ]

    # encoding="utf-8-sig" writes a BOM so that Excel displays Chinese text correctly
    with open(filename, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

    print(f"✅ Saved {len(data)} jobs to {filename}")

    # Trigger the download in Google Colab
    print(f"📥 Downloading {filename}...")
    files.download(filename)


# Save and download
save_and_download_csv(all_jobs, OUTPUT_FILE)
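To see what utf-8-sig actually changes, the sketch below writes a one-row CSV to a temporary file and inspects the raw bytes. The row content and file name are invented for illustration.

```python
import csv
import os
import tempfile

# Hypothetical sample row containing Chinese text.
rows = [{"title": "編輯", "company": "Example Media Ltd"}]
path = os.path.join(tempfile.mkdtemp(), "sample.csv")

# Write with utf-8-sig: Python prepends the three-byte UTF-8 BOM.
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "company"])
    writer.writeheader()
    writer.writerows(rows)

# Inspect the raw bytes: the file starts with EF BB BF.
with open(path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(b"\xef\xbb\xbf")

# Reading with utf-8-sig strips the BOM again, so the header stays "title".
with open(path, newline="", encoding="utf-8-sig") as f:
    round_trip = list(csv.DictReader(f))

print(has_bom, round_trip)
```

Reading the same file with plain utf-8 would leave the BOM glued to the first column name, which is why Cell 7 also reads with utf-8-sig.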
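Analysis 📊 Cell 7: View result statistics
Analyse the job data with pandas: work type distribution, how many listings include salary information, and which companies post the most jobs.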
import pandas as pd

df = pd.read_csv(OUTPUT_FILE, encoding="utf-8-sig")

print("📊 JobsDB scrape statistics")
print("=" * 40)
print(f"  Total jobs: {len(df)}")
print()

# Work type distribution
if "work_type" in df.columns:
    print("💼 Work type distribution:")
    print(df["work_type"].value_counts().to_string())
    print()

# Salary coverage (jobs that include salary information)
if "salary" in df.columns:
    has_salary = df[df["salary"].notna() & (df["salary"] != "")]
    print(f"💰 Jobs with salary information: {len(has_salary)} of {len(df)}")
    print()

# Company ranking (companies with the most listings)
if "company" in df.columns:
    print("🏢 Companies with the most listings (top 10):")
    print(df["company"].value_counts().head(10).to_string())
    print()

# Show the first 5 rows
print("📋 First 5 jobs:")
print(df[["title", "company", "location", "salary"]].head().to_string(index=False))
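The salary column is free text, so a real salary distribution needs the numbers extracted first. The sketch below shows one way to do that with a regular expression. The helper name parse_salary_range is hypothetical, and the input format ("$18,000 - $25,000 per month") is an assumption; real listings may use other formats, such as hourly rates or "18k", which this sketch does not handle.

```python
import re


def parse_salary_range(text):
    """Return (low, high) as integers, or None if no number is found."""
    if not isinstance(text, str):
        return None  # e.g. NaN, which pandas uses for an empty cell
    # Find digit groups that may contain thousands separators, e.g. "18,000".
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", text)]
    if not numbers:
        return None
    return (min(numbers), max(numbers))


# Assumed example strings, not taken from real listings.
for sample in ["$18,000 - $25,000 per month", "$20,000 per month", ""]:
    print(repr(sample), "->", parse_salary_range(sample))
```

Applied to the DataFrame, something like df["salary"].map(parse_salary_range) would give a column of tuples that can then be split into low and high values.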
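🚀 How to use
1. Open Google Colab and create a new notebook.
2. Click "Copy All Code" in the top-right corner and paste it into the first cell.
3. ⚠️ Cell 1 (installing Chrome + ChromeDriver) must be run first; it takes 1-2 minutes.
4. Run the cells one at a time starting from Cell 1 (Shift + Enter).
5. Cell 5 scrolls automatically to load more jobs (about 30-60 seconds).
6. After Cell 6 runs, the CSV file downloads automatically.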