不写爬虫，也能读取网页的表格数据

发布时间：2022-08-18 18:35

引言

pandas中的read_html()函数是将HTML的表格转换为DataFrame的一种快速方便的方法，这个函数对于快速合并来自不同网页上的表格非常有用。在合并时，不需要用爬虫获取站点的HTML。但是，在分析数据之前，数据的清理和格式化可能会遇到一些问题。在本文中，我将讨论如何使用pandas的read_html()来读取和清理来自维基百科的多个HTML表格，以便对它们做进一步的数值分析。

基本方法

在第一个例子中，我们将尝试解析一个表格。这个表格来自维基百科页面中明尼苏达州的政治部分(https://en.wikipedia.org/wiki/Minnesota)。

read_html的基本用法非常简单，在许多维基百科页面上都能运行良好，因为表格并不复杂。首先，要导入一些库，在后面的数据清理中都会用到：

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize

table_MN = pd.read_html('https://en.wikipedia.org/wiki/Minnesota')

特别注意，上面代码中得到的table_MN是页面上所有表格的列表：

print(f'Total tables: {len(table_MN)}')

Total tables: 38

很难在38张表格中找到你需要的那张，要想容易地找出来，可以设置match参数，如下面的代码所示，用mathch参数指明要选择标题为“Election results from statewide races”的那张表格。

table_MN = pd.read_html('https://en.wikipedia.org/wiki/Minnesota', match='Election results from statewide races')
len(table_MN)

# 输出
1

df = table_MN[0]
df.head()

输出：

显然，用Pandas能够很容易地读取到了表格，此外，从上面的输出结果可以看出，跨多行的Year列也得到了很好地处理，这要比自己写爬虫工具专门收集数据简单多了。

总的来说，这样的操作看起来还不错，然而，如果用df.info()来查看数据类型:


RangeIndex: 24 entries, 0 to 23
Data columns (total 5 columns):
#   Column  Non-Null Count  Dtype
---  ------  --------------  -----
0   Year    24 non-null     int64
1   Office  24 non-null     object
2   GOP     24 non-null     object
3   DFL     24 non-null     object
4   Others  24 non-null     object
dtypes: int64(1), object(4)
memory usage: 1.1+ KB

如果想对这些数据进行分析，需要将GOP、DFL和其他类型为object的列转换为数值。

如果这么操作：

df['GOP'].astype('float')

系统就会报错：

ValueError: could not convert string to float: '42.4%'

最有可能的罪魁祸首是%，下面用pandas的replace()函数删除它。

df['GOP'].replace({'%':''}, regex=True).astype('float')

效果看起来不错：

0     42.4
1     36.2
2     42.4
3     44.9
<...>
21    63.3
22    49.1
23    31.9
Name: GOP, dtype: float64

注意，必须使用参数regex=True才能完美地删除，因为%是字符串的一部分，而不是完整的字符串值。

现在，我们可以用pd.to_numeric()和apply()替换所有的%值，并将其转换为数字。

df = df.replace({'%': ''}, regex=True)
df[['GOP', 'DFL', 'Others']] = df[['GOP', 'DFL', 'Others']].apply(pd.to_numeric)
df.info()

# 输出

RangeIndex: 24 entries, 0 to 23
Data columns (total 5 columns):
#   Column  Non-Null Count  Dtype
---  ------  --------------  -----
0   Year    24 non-null     int64
1   Office  24 non-null     object
2   GOP     24 non-null     float64
3   DFL     24 non-null     float64
4   Others  24 non-null     float64
dtypes: float64(3), int64(1), object(1)
memory usage: 1.1+ KB

df.head()

输出：

这个基本过程进展顺利，下面看一个有点难度的。

高级的数据清理方法

前面的例子展示了基本概念，数据清理是任何数据科学项目都不可或缺的，下面看一个有点难度的示例。在接下来的示例中继续使用维基百科，但是这些方法同样适用于其他含有表格的HTML页面。

例如读取美国GDP的数据表：

现在，就不能用match参数指定要获得的那个表格标题——因为这表格没有标题，但是可以将其值设置为“Nominal GDP”，这样依然能匹配到我们想要的表格。

table_GDP = pd.read_html('https://en.wikipedia.org/wiki/Economy_of_the_United_States', match='Nominal GDP')
df_GDP = table_GDP[0]
df_GDP.info()

# 输出

RangeIndex: 41 entries, 0 to 40
Data columns (total 9 columns):
#   Column                                            Non-Null Count  Dtype
---  ------                                            --------------  -----
0   Year                                              41 non-null     object
1   Nominal GDP(in bil. US-Dollar)                    41 non-null     float64
2   GDP per capita(in US-Dollar)                      41 non-null     int64
3   GDP growth(real)                                  41 non-null     object
4   Inflation rate(in percent)                        41 non-null     object
5   Unemployment (in percent)                         41 non-null     object
6   Budget balance(in % of GDP)[107]                  41 non-null     object
7   Government debt held by public(in % of GDP)[108]  41 non-null     object
8   Current account balance(in % of GDP)              41 non-null     object
dtypes: float64(1), int64(1), object(7)
memory usage: 3.0+ KB

不出所料，数据清理是避免不了得了。根据前面的经验，先删除%。

df_GDP['GDP growth(real)'].replace({'%': ''}, regex=True).astype('float')

很遗憾，报错了：

ValueError: could not convert string to float: '−5.9\xa0'

问题的根源在于有一个隐藏字符xa0，它导致了错误，它是一个特殊字符，即“non-breaking Latin1 (ISO 8859-1) space”，对应的实体是，即空格。

我所使用的一个方法是使用replace直接替换，这种方法奏效了，但我担心它将来是否会与其他字符产生冲突。

在深入研究了Unicode这个坑之后，我决定使用normalize来清理这个值。

我还发现，在其他的一些表格的数据中也有多余的空格。于是编写了一个函数，对所有文本进行清理。

from unicodedata import normalize

def clean_normalize_whitespace(x):
    if isinstance(x, str):
        return normalize('NFKC', x).strip()
    else:
        return x

用applymap将这个函数用于整个DataFrame上：

df_GDP = df_GDP.applymap(clean_normalize_whitespace)

需要注意的是：applymap函数非常慢，所以在使用applymap时应该慎重。

applymap函数是一个非常低效的pandas函数，不推荐你经常使用它。但在本例中，DataFrame很小，像这样的清理又很棘手，所以我认为这是一个有用的权衡。

applymap不能处理列名称，例如：

df_GDP.columns[7]

# 输出
'Government debt held by public(in\xa0% of GDP)[108]'

在列的名称中有可怕的xa0%。解决此问题的方法有多种，在这里还是继续使用clean_normalize_whitespace()函数，将列转换为Series对象，并使用apply来调用这个函数。有点麻烦了，不知道pandas在以后的版本是否会考虑到这里的问题，让操作简化。

df_GDP.columns = df_GDP.columns.to_series().apply(clean_normalize_whitespace)
df_GDP.columns[7]

# 输出
'Government debt held by public(in % of GDP)[108]'

现在我们清理掉了一些隐藏的字符。下一步会怎样呢？

再试一次：

df_GDP['GDP growth(real)'].replace({'%': ''}, regex=True).astype('float')

# 输出
ValueError: could not convert string to float: '−5.9 '

真的很棘手。如果你仔细观察，你可能会发现：−和-看起来有点不同，但真的很难看出，在Unicode中，破折号和减号之间实际上是有区别的。

幸运的是，我们也可以使用replace来清理：

df_GDP['GDP growth(real)'].replace({'%': '', '−': '-'}, regex=True).astype('float')

# 输出
0    -5.9
1     2.2
2     3.0
3     2.3
4     1.7
<...>
38   -1.8
39    2.6
40   -0.2
Name: GDP growth(real), dtype: float64

现在来关注列Year，例如表示“2020年”的值是2020(est)，需要去掉其中的(est)，还要将列转换为整数型。

df['Year'].replace({'%': '', '−': '-', '\(est\)': ''}, regex=True).astype('int')

# 输出
0     2020
1     2019
2     2018
3     2017
4     2016
<...>
40    1980
Name: Year, dtype: int64

在DataFrame中的各列的值，除了整数型之外，其他的是浮点数型，在转化的时候，如果使用pd.numeric()虽然能够实现，但略显笨拙。我们可以使用astype()同时又不需要为每一列手动输入类型信息。

astype()函数可以接受含有列名和数据类型的字典。这真的很有用，直到我写了这篇文章我才知道这一点。下面是对列与其数据类型映射字典：

col_type = {
    'Year': 'int',
    'Nominal GDP(in bil. US-Dollar)': 'float',
    'GDP per capita(in US-Dollar)': 'int',
    'GDP growth(real)': 'float',
    'Inflation rate(in percent)': 'float',
    'Unemployment (in percent)': 'float',
    'Budget balance(in % of GDP)[107]': 'float',
    'Government debt held by public(in % of GDP)[108]': 'float',
    'Current account balance(in % of GDP)': 'float'
}

如果你觉得键入上面这个词典很慢，可以用下面的快捷方法。要注意，这样建立的字典，默认值为float，还需要手动将Year对应的值修改为int：

dict.fromkeys(df_GDP.columns, 'float')

# 输出
{'Year': 'float',
'Nominal GDP(in bil. US-Dollar)': 'float',
'GDP per capita(in US-Dollar)': 'float',
'GDP growth(real)': 'float',
'Inflation rate(in percent)': 'float',
'Unemployment (in percent)': 'float',
'Budget balance(in % of GDP)[107]': 'float',
'Government debt held by public(in % of GDP)[108]': 'float',
'Current account balance(in % of GDP)': 'float'}

再创建了一个字典，其中包含要替换的值：

clean_dict = {'%': '', '−': '-', '\(est\)': ''}

现在我们可以调用这个DataFrame的replace方法，转换为所需的类型，并获得干净的数据：

df_GDP = df_GDP.replace(clean_dict, regex=True).replace({'-n/a ': np.nan}).astype(col_type)
df_GDP.info()

# 输出

RangeIndex: 41 entries, 0 to 40
Data columns (total 9 columns):
#   Column                                            Non-Null Count  Dtype
---  ------                                            --------------  -----
0   Year                                              41 non-null     int64
1   Nominal GDP(in bil. US-Dollar)                    41 non-null     float64
2   GDP per capita(in US-Dollar)                      41 non-null     int64
3   GDP growth(real)                                  41 non-null     float64
4   Inflation rate(in percent)                        41 non-null     float64
5   Unemployment (in percent)                         41 non-null     float64
6   Budget balance(in % of GDP)[107]                  40 non-null     float64
7   Government debt held by public(in % of GDP)[108]  41 non-null     float64
8   Current account balance(in % of GDP)              40 non-null     float64
dtypes: float64(7), int64(2)
memory usage: 3.0 KB

结果如下所示：

为了证明上述操作的效果，我们可以把这些数据绘制成图表：

plt.style.use('seaborn-whitegrid')
df_clean.plot.line(x='Year', y=['Inflation rate(in percent)', 'Unemployment (in percent)'])

如果你紧跟我的思路，可能已经注意到链式方式调用replace的方法：

.replace({'-n/a ': np.nan})

我这样做的原因是我不知道如何使用第一个字典replace来清理n/a。我认为问题的症结在于：我无法预测这些数据的清理顺序，所以不得不分两个阶段来执行替换。

如果读者有更好的方法，请不吝赐教。

完整的代码

最后，把上面的过程，集中用下面的代码实现。从HTML网页上的表格获取数据，并把这些数据转化为DataFrame对象。

import pandas as pd
import numpy as np
from unicodedata import normalize

def clean_normalize_whitespace(x):
    """ 
    Normalize unicode characters and strip trailing spaces
    """
    if isinstance(x, str):
        return normalize('NFKC', x).strip()
    else:
        return x

# Read in the Wikipedia page and get the DataFrame
table_GDP = pd.read_html(
    'https://en.wikipedia.org/wiki/Economy_of_the_United_States',
    match='Nominal GDP')
df_GDP = table_GDP[0]

# Clean up the DataFrame and Columns
df_GDP = df_GDP.applymap(clean_normalize_whitespace)
df_GDP.columns = df_GDP.columns.to_series().apply(clean_normalize_whitespace)

# Determine numeric types for each column
col_type = {
    'Year': 'int',
    'Nominal GDP(in bil. US-Dollar)': 'float',
    'GDP per capita(in US-Dollar)': 'int',
    'GDP growth(real)': 'float',
    'Inflation rate(in percent)': 'float',
    'Unemployment (in percent)': 'float',
    'Budget balance(in % of GDP)[107]': 'float',
    'Government debt held by public(in % of GDP)[108]': 'float',
    'Current account balance(in % of GDP)': 'float'
}

# Values to replace
clean_dict = {'%': '', '−': '-', '\(est\)': ''}

# Replace values and convert to numeric values
df_GDP = df_GDP.replace(clean_dict, regex=True).replace({
    '-n/a ': np.nan
}).astype(col_type)

总结

pandas的read_html()函数对于快速解析页面中的 HTML表格非常有用，尤其是维基百科页面。从HTML页面直接获得的数据，通常不会像你所需要的那样干净，并且清理各种Unicode字符可能会非常耗时。本文展示的几种技术可以用于清理数据、并将其转换为正确的数字格式。如果你需要从维基百科或其他HTML表格中获取数据，这些技巧应该可以为你节省一些时间。

参考: https://pbpython.com/pandas-html-table.html

不写爬虫，也能读取网页的表格数据

引言

基本方法

高级的数据清理方法

完整的代码

总结

相关推荐