[Python] Web Scraping with Selenium: Sample Code Explained

1 Development environment
2 What is Selenium for Python?
3 How to use Selenium in Python
4 Scraping with Selenium in Python
5 Conclusion

Development environment

Python version: Python 3.10.11
selenium: 4.17.2

What is Selenium for Python?

Selenium is a widely used framework for controlling web browsers automatically, mainly for testing web applications and for scraping.

Selenium was created in 2004 by Jason Huggins as a JavaScript-based tool for automating tests of web pages.

Selenium WebDriver was introduced later and made it possible to control the browser directly, which turned Selenium into a broader and more effective browser automation tool.

Selenium provides bindings for several programming languages, including Python.

Official documentation: https://www.selenium.dev/ja/documentation/

How to use Selenium in Python

This section explains how to set up Selenium for Python.

Installing Selenium

First, install Selenium by running the following command.

pip install selenium
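To confirm that the installation succeeded, the installed version can be read from the package metadata. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of Selenium.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a version string such as the one listed under "Development
    # environment" when Selenium is installed, otherwise None.
    print(installed_version("selenium"))
```

If this prints None, the pip command above was probably run for a different Python environment than the one executing the script.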

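Installing WebDriver

Next, install a WebDriver.

ChromeDriver, the WebDriver for Google Chrome, can be downloaded from "https://chromedriver.chromium.org/downloads". Choose the driver version that matches your installed Chrome.

If you use Chrome version 115 or later, see the Chrome for Testing availability dashboard at "https://googlechromelabs.github.io/chrome-for-testing/" instead.

The driver can also be downloaded with the wget or curl command, but that is not covered here.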

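Moving WebDriver to a suitable location

Next, move the downloaded WebDriver to a suitable location.

On Linux it is usually placed under "/usr/local/bin/", but placing it in the same directory as your Python script also works.

Note: for ChromeDriver, the file is "chromedriver" on macOS and Linux, and "chromedriver.exe" on Windows.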
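Because the file name differs by operating system, a script that should run on several platforms can pick the name at runtime. The sketch below assumes the driver sits in a directory you choose; the helper names `driver_filename` and `local_driver_path` are hypothetical.

```python
import platform
from pathlib import Path


def driver_filename(system: str) -> str:
    """Return the ChromeDriver file name for an OS name from platform.system()."""
    # platform.system() returns "Windows", "Linux", or "Darwin" (macOS)
    return "chromedriver.exe" if system == "Windows" else "chromedriver"


def local_driver_path(base_dir: Path, system: str) -> Path:
    """Build the expected driver path inside base_dir for the given OS name."""
    return base_dir / driver_filename(system)


if __name__ == "__main__":
    # Look for the driver in the current working directory
    path = local_driver_path(Path.cwd(), platform.system())
    print(path, "(found)" if path.is_file() else "(not found)")
```

The resulting path can then be passed as executable_path when creating the Service object.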

Setup is now complete.

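Scraping with Selenium in Python

Next, let's walk through sample code that scrapes with Selenium in Python.

Scraping sample code

The following simple web scraping sample uses Selenium to run a keyword search on Google and collect the site name, title, and URL of each search result.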

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Create the WebDriver instance (on macOS/Linux the file is './chromedriver')
service = Service(executable_path='./chromedriver.exe')
driver = webdriver.Chrome(service=service)

# Open the URL
target_url = 'https://www.google.co.jp/'
driver.get(target_url)

# Wait (up to 10 seconds) until the search box is present;
# wait.until() returns the located element
wait = WebDriverWait(driver, 10)
search_box = wait.until(EC.presence_of_element_located((By.ID, 'APjFqb')))

# Type the keyword ("scraping" in Japanese) into the search box and press Enter
search_box.send_keys('スクレイピング')
search_box.send_keys(Keys.RETURN)

# Wait (up to 10 seconds) until the search results container is present
wait = WebDriverWait(driver, 10)
search_result = wait.until(EC.presence_of_element_located((By.ID, 'center_col')))

# Collect the site name, title, and URL from each search result
elements = driver.find_elements(By.CLASS_NAME, 'MjjYud')
data_list = []

for element in elements:
    # Skip blocks that contain an element with the "oIk2Cb" class
    exclusion = element.find_elements(By.CLASS_NAME, 'oIk2Cb')
    if len(exclusion) > 0:
        continue

    data = {
        'site': element.find_element(By.CLASS_NAME, 'VuuXrf').text,
        'title': element.find_element(By.CSS_SELECTOR, 'h3.LC20lb.MBeuO.DKV0Md').text,
        'url': element.find_element(By.CSS_SELECTOR, 'a[jsname="UWckNb"]').get_attribute('href'),
    }

    data_list.append(data)

# Print all collected data
print(data_list)

# Close the browser
driver.quit()

Creating the WebDriver instance

The following code creates the WebDriver instance with ChromeDriver specified and launches the Chrome browser.

service = Service(executable_path='./chromedriver.exe')
driver = webdriver.Chrome(service=service)

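Selenium can find and use an installed ChromeDriver automatically (Selenium Manager, bundled since Selenium 4.6, handles this), so executable_path is only needed when you require a specific ChromeDriver version or a custom setup.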
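Both ways of starting Chrome can be wrapped in one small function. This is a sketch, not the only way to structure it: the function name `make_chrome_driver` and its early path check are assumptions added for illustration, while `webdriver.Chrome` and `Service(executable_path=...)` are the same calls used in the sample code.

```python
from pathlib import Path
from typing import Optional


def make_chrome_driver(driver_path: Optional[str] = None):
    """Start Chrome, using an explicit ChromeDriver only when a path is given."""
    # Fail early with a clear message if an explicit path does not exist
    if driver_path is not None and not Path(driver_path).is_file():
        raise FileNotFoundError(f"ChromeDriver not found: {driver_path}")

    # Imported here so that the path check above runs before Selenium is loaded
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if driver_path is None:
        # No path given: let Selenium locate a suitable driver by itself
        return webdriver.Chrome()

    # Explicit path: use exactly this ChromeDriver binary
    return webdriver.Chrome(service=Service(executable_path=driver_path))
```

Calling `make_chrome_driver()` relies on automatic driver discovery, while `make_chrome_driver('./chromedriver.exe')` pins the driver to the given file.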

Accessing the Google search page

The following code opens the Google search page.

target_url = 'https://www.google.co.jp/'
driver.get(target_url)

Entering a keyword and running the search

The following code enters a keyword and runs the search.

wait = WebDriverWait(driver, 10)
search_box = wait.until(EC.presence_of_element_located((By.ID, 'APjFqb')))
search_box.send_keys('スクレイピング')
search_box.send_keys(Keys.RETURN)

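WebDriverWait is used here so that the script waits until the search box has loaded. Because wait.until() returns the located element, a separate find_element call for the search box is not needed.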

Waiting for the search results to load

The following code waits until the search results have loaded.

wait = WebDriverWait(driver, 10)
search_result = wait.until(EC.presence_of_element_located((By.ID, 'center_col')))

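If the next step runs before the search results have loaded, the target elements may not be found yet, which leads to errors or empty results, so take care.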
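Conceptually, an explicit wait keeps checking a condition until it succeeds or a time limit is reached. The plain-Python sketch below illustrates that idea only; it is not Selenium's implementation, and the function name `wait_until` and its parameters are hypothetical. The real WebDriverWait raises Selenium's own TimeoutException rather than the built-in TimeoutError used here.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # The condition is satisfied: hand back whatever it returned
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Not ready yet: pause briefly before checking again
        time.sleep(poll_interval)


if __name__ == "__main__":
    attempts = []

    def ready_on_third_call():
        # Simulates a page element that appears only after a short delay
        attempts.append(1)
        return "ready" if len(attempts) >= 3 else None

    print(wait_until(ready_on_third_call, timeout=2.0, poll_interval=0.01))
```

Polling like this is generally preferable to a fixed time.sleep() call, because the script continues as soon as the condition holds instead of always waiting the full duration.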

Getting the information of each element from the search results

The following code gets the information of each element (site name, title, URL) from the list of search results and appends the data to a list.

elements = driver.find_elements(By.CLASS_NAME, 'MjjYud')
data_list = []

for element in elements:
    # Skip blocks that contain an element with the "oIk2Cb" class
    exclusion = element.find_elements(By.CLASS_NAME, 'oIk2Cb')
    if len(exclusion) > 0:
        continue

    data = {
        'site': element.find_element(By.CLASS_NAME, 'VuuXrf').text,
        'title': element.find_element(By.CSS_SELECTOR, 'h3.LC20lb.MBeuO.DKV0Md').text,
        'url': element.find_element(By.CSS_SELECTOR, 'a[jsname="UWckNb"]').get_attribute('href'),
    }

    data_list.append(data)

Elements can be located by the HTML id attribute, name attribute, class attribute, or tag name, as well as by XPath queries or CSS selectors.
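Since several locator strategies are available, one option is to try them in order and use the first one that matches. This can help when the class names in the sample code stop matching the page. The following is a minimal sketch; the helper name `find_with_fallback` is hypothetical. It relies only on `find_elements`, which returns an empty list when nothing matches.

```python
def find_with_fallback(root, locators):
    """Return the elements matched by the first locator that matches anything.

    root is a WebDriver or WebElement; locators is a list of (by, value)
    tuples, for example (By.CLASS_NAME, 'MjjYud').
    """
    for by, value in locators:
        found = root.find_elements(by, value)
        if found:
            # Stop at the first strategy that yields at least one element
            return found
    # No locator matched anything
    return []
```

With a running driver, it could be called as `find_with_fallback(driver, [(By.CLASS_NAME, 'MjjYud'), (By.XPATH, '//div[@id="center_col"]//h3')])`. The XPath here is only a hypothetical alternative and would need to be checked against the actual page.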

Outputting the collected data

Finally, print all the collected data and close the browser.

print(data_list)
driver.quit()
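Printing a list of dictionaries directly produces one long line. For easier reading, the same data could be formatted as indented JSON, as in the sketch below. The helper name `format_results` and the sample record are placeholders for illustration, not actual search results.

```python
import json


def format_results(results):
    """Format a list of result dictionaries as indented JSON text."""
    # ensure_ascii=False keeps non-ASCII text such as Japanese titles readable
    return json.dumps(results, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Placeholder record with the same keys as the sample code's dictionaries
    sample = [
        {"site": "Example", "title": "スクレイピング入門", "url": "https://example.com/"},
    ]
    print(format_results(sample))
```

In the sample code, `print(format_results(data_list))` would take the place of `print(data_list)`.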

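Conclusion

This article explained how to scrape with Selenium in Python.

Although Selenium was developed as a library for automating web tests, it can also be used for web scraping, that is, extracting information from web pages by parsing their HTML.

It is useful for collecting data from specific websites and for performing specific actions on them, so it is well worth mastering.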