November 8, 2021 – moware's blog

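# Alternative script, improved version
import pandas as pd
from bs4 import BeautifulSoup

# `driver` is assumed to be a Selenium WebDriver that has already loaded the page

# Encode the page source as UTF-8
html = driver.page_source.encode('utf-8')

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Take the third table (index 2) among those with cellspacing="1"
table = soup.find_all("table", attrs={"cellspacing": "1"})[2]
rows = table.find_all("tr")

list_rows = []
for row in rows:
    list_row = []
    for cell in row.find_all(['td', 'th']):
        text = cell.get_text()
        # Strip double quotes, newlines, and half-width and full-width spaces
        text2 = text.replace('"','').replace("\n","").replace(" ","").replace("　","")
        list_row.append(text2)
    list_rows.append(list_row)

# Split the 2-D list into header and data
header = list_rows[0]
data = list_rows[1:]

# Convert to a DataFrame
df = pd.DataFrame(data, columns=header)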
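The cell cleaning and the header/data split in the improved script can be tried on their own, without a browser or a parsed page. Below is a minimal sketch of just those two steps. The helper names `clean_cell` and `split_header` and the sample rows are hypothetical and are not part of the original script.

```python
def clean_cell(text):
    """Remove double quotes, newlines, and half-width and full-width spaces."""
    for ch in ('"', "\n", " ", "\u3000"):
        text = text.replace(ch, "")
    return text


def split_header(rows):
    """Treat the first row as the header and the remaining rows as data."""
    if not rows:
        return [], []
    return rows[0], rows[1:]


# Hypothetical sample rows standing in for the text of each table cell.
raw_rows = [
    ["Da te", "Close\n"],
    ["\u30002021/11/08", '"1,234"'],
]

cleaned = [[clean_cell(cell) for cell in row] for row in raw_rows]
header, data = split_header(cleaned)

print(header)  # ['Date', 'Close']
print(data)    # [['2021/11/08', '1,234']]
```

Every value stays a string with this approach, so a column such as `Close` would still need an explicit conversion before doing arithmetic on it.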

At the moment, if you set up your libraries on an M1 Mac with pip alone, lxml fails to install, so any code that relies on pandas.read_html needs an alternative script.
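If the same script has to run both on machines that have lxml and on machines that do not, one option is to check for the module at runtime and pick the code path accordingly. This is a minimal sketch using only the standard library. The names `has_module` and `USE_READ_HTML` are hypothetical, and the check only tells you whether the module can be found, not whether it imports cleanly.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


# Hypothetical flag: use pandas.read_html only when lxml is available,
# otherwise fall back to the BeautifulSoup-based alternative script.
USE_READ_HTML = has_module("lxml")
print(USE_READ_HTML)
```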

I rewrote my script as follows. It writes the table out to a CSV file first and then reads that file back in as a DataFrame. It is roundabout, but it can't be helped.

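# Alternative script
import csv

import pandas as pd
from bs4 import BeautifulSoup

# Encode the page source as UTF-8
# (`driver` is assumed to be a Selenium WebDriver that has already loaded the page)
html = driver.page_source.encode('utf-8')

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Take the third table (index 2) among those with cellspacing="1"
table = soup.find_all("table", attrs={"cellspacing": "1"})[2]
rows = table.find_all("tr")

filename = "table.csv"
# newline='' lets the csv module control line endings itself
with open(filename, "w", encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    for row in rows:
        csvRow = []
        for cell in row.find_all(['td', 'th']):
            text = cell.get_text()
            # Strip double quotes, newlines, and half-width and full-width spaces
            text2 = text.replace('"','').replace("\n","").replace(" ","").replace("　","")
            csvRow.append(text2)
        writer.writerow(csvRow)

# Read the CSV file into a DataFrame
df = pd.read_csv(filename)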

# Old script

# Encode the page source as UTF-8
html = driver.page_source.encode('utf-8')

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Collect the tables with cellspacing="1" and read the third one (index 2)
table_data = soup.find_all("table",attrs={"cellspacing" : "1"})
df_stock_specific = pd.read_html(str(table_data), header=0)[2]
labels_specific = ['A','B','C','D','E']
df_stock_specific2 = df_stock_specific.reindex(labels_specific, axis=1)



[Python] 317 Improved alternative script for pandas.read_html when lxml is unavailable

[Python] 316 Alternative script for pandas.read_html when lxml is unavailable