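Approaches to Processing Large Text Files in Python

Published: 2023-07-30 15:30

Recently I needed to batch-process some text files larger than 4 GB with Python. These are my notes on how I approached it.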

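1 Viewing the text

When I get a new dataset, I always want to open it first and look at the fields and the data. However, my machine has only 16 GB of RAM, and opening a text file larger than 4 GB directly in Notepad, Notepad++, or a similar editor loads the whole file into memory at once, so it opens very slowly or not at all.

EmEditor handles large files comfortably. It is not free and requires registration: EmEditor (Text Editor) – Text Editor for Windows supporting large files and Unicode!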
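
When no large-file editor is at hand, the first few lines can also be inspected from Python without loading the whole file. A minimal sketch, assuming a UTF-8 file; the helper name `peek_lines` and the temporary demo file are made up for illustration:

```python
import itertools
import os
import tempfile

def peek_lines(path, n=5, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, "r", encoding=encoding, errors="replace") as f:
        # islice stops after n lines, so only the start of the file is read
        return [line.rstrip("\n") for line in itertools.islice(f, n)]

# Demo on a small temporary file standing in for the real data
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("a\t1\nb\t2\nc\t3\n")
    demo_path = tmp.name

preview = peek_lines(demo_path, n=2)
print(preview)
os.remove(demo_path)
```

Because the file object is iterated lazily, the cost depends on the number of lines requested, not on the size of the file.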

2 Reading the text

2.1 Reading the text in chunks

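import pandas as pd

table = pd.read_csv(r"G:\data.txt",
    sep = '\t', # tab-separated
    header = None, # this file has no header row
    encoding = 'utf-8',
    on_bad_lines = 'warn', # skip malformed rows with a warning (pandas >= 1.3; replaces error_bad_lines / warn_bad_lines)
    iterator=True, # return an iterator instead of one DataFrame
    chunksize=10000 # each chunk holds 10000 rows
    )

path = r"G:\test"

# Write each chunk to its own CSV file: G:\test_test_1.csv, G:\test_test_2.csv, ...
for i, item in enumerate(table, start=1):
    print("Processing chunk {}".format(i))
    item.to_csv(path + "_test_" + str(i) + ".csv", index=False, encoding='utf-8')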
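
Writing each chunk to its own file is one option; another is to keep a running result while looping, so that nothing large is ever held in memory. A rough sketch, using a tiny in-memory table in place of the real file (the sample data and the row-count and sum aggregation are assumptions for illustration):

```python
import io
import pandas as pd

# Small stand-in for the large tab-separated file (no header row)
sample = "a\t1\nb\t2\nc\t3\nd\t4\ne\t5\n"

total_rows = 0
column_sum = 0
# chunksize makes read_csv return an iterator of DataFrames
reader = pd.read_csv(io.StringIO(sample), sep="\t", header=None, chunksize=2)
for chunk in reader:
    total_rows += len(chunk)
    column_sum += int(chunk[1].sum())  # column 1 holds the numbers

print(total_rows, column_sum)
```

Only one chunk is alive at a time, so peak memory is governed by `chunksize` rather than by the file size.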

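2.2 Detecting the encoding of Chinese text

When reading Chinese text with pandas' read_csv, you first need to know the file's encoding and pass it in the encoding parameter; otherwise the data comes out garbled. EmEditor shows a file's encoding and delimiter type directly.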
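
To make the failure concrete, here is a small illustration of what a wrong encoding does to Chinese text. The sample string is arbitrary; the point is only that decoding with the wrong codec either raises an error or quietly yields garbage:

```python
text = "中文数据"
raw = text.encode("gbk")  # bytes as a GBK-encoded file would store them

# Decoding with the right encoding recovers the text
assert raw.decode("gbk") == text

# Decoding with the wrong encoding can fail outright...
try:
    raw.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True

# ...or silently produce garbled characters
garbled = text.encode("utf-8").decode("gbk", errors="replace")
print(wrong_decode_failed, garbled)
```

The silent case is the more dangerous one, since read_csv returns a DataFrame without any error and the problem only shows up when the values are inspected.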

The chardet package for Python can also detect the encoding.

# Method 1
import chardet

def get_encoding(filename):
    """
    Return the detected encoding of a file. The whole file is read into
    memory in one go, so this is best suited to small files.
    """
    with open(filename, 'rb') as f:
        return chardet.detect(f.read())['encoding']

original_file = r"G:\data.txt"

print(get_encoding(original_file))

 
# Method 2
from chardet.universaldetector import UniversalDetector

original_file = r"G:\data.txt"

detector = UniversalDetector()
with open(original_file, 'rb') as usock:
    for line in usock:  # iterate lazily; readlines() would load the whole file
        detector.feed(line)
        if detector.done:  # stop as soon as the detector is confident
            break
detector.close()
print(detector.result)
 

# chardet cannot always guess correctly. If the data has to be processed correctly, you really need to know its encoding.

2.3 Converting the encoding of Chinese text

EmEditor can convert encodings; the code below does the same, converting a file to UTF-8.

import codecs

def handleEncoding(original_file, newfile):
    #newfile=original_file[0:original_file.rfind('.')]+'_copy.csv'
    block_size = 1024 * 1024  # amount read per step, so the whole file is never held in memory

    ##### Determine the source encoding by trial decoding, block by block.
    # gb2312 is a subset of gbk and cp936 is Python's alias for gbk,
    # so neither needs a separate attempt.
    source_encoding = None
    for candidate in ('utf-8', 'gbk', 'gb18030', 'big5'):
        decoder = codecs.getincrementaldecoder(candidate)()
        try:
            with open(original_file, 'rb') as f:
                for block in iter(lambda: f.read(block_size), b''):
                    decoder.decode(block)
                decoder.decode(b'', final=True)  # fail on a character cut off at end of file
        except UnicodeDecodeError:
            continue
        source_encoding = candidate
        break
    if source_encoding is None:
        raise ValueError('could not determine the encoding of ' + original_file)

    ##### Read the file with the detected encoding and save it as UTF-8.
    # newline='' keeps the original line endings unchanged.
    with open(original_file, 'r', encoding=source_encoding, newline='') as f:
        with open(newfile, 'w', encoding='utf-8', newline='') as f2:
            while True:
                content = f.read(block_size)
                if not content:
                    break
                f2.write(content)

original_file = r"G:\data.txt"
newfile = r"G:\data_new.txt"
handleEncoding(original_file, newfile)
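
After converting, it can be worth confirming that the new file really decodes as UTF-8. One possible check, streaming the file in blocks; the helper names `is_valid_utf8` and `_write_temp` and the demo files are hypothetical, not part of the original workflow:

```python
import codecs
import os
import tempfile

def is_valid_utf8(path, block_size=1024 * 1024):
    """Return True if the whole file decodes as UTF-8, reading it in blocks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(block_size), b""):
                decoder.decode(block)
            decoder.decode(b"", final=True)  # catches a character cut off at end of file
    except UnicodeDecodeError:
        return False
    return True

def _write_temp(data):
    """Write bytes to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path

good = _write_temp("中文数据".encode("utf-8"))
bad = _write_temp("中文数据".encode("gbk"))
good_ok, bad_ok = is_valid_utf8(good), is_valid_utf8(bad)
print(good_ok, bad_ok)
os.remove(good)
os.remove(bad)
```

The incremental decoder carries partial multi-byte characters across block boundaries, so a character split between two reads is not reported as an error.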

2.4 Parallel text processing

To run the function Fuction_test(x) on all of the chunks at the same time, I am considering parallel processing.

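#Parallelism with the dask package (dask itself schedules work across CPU cores or a cluster; running on a GPU needs additional libraries)
Still looking into this……

#CPU parallelism: the Parallel function from the joblib package
Still looking into this……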

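from joblib import Parallel, delayed

def Fuction_test(x):
    # Placeholder computation; x + 10 only works if the chunk's columns are numeric
    y = x + 10

    return y

# table must be a freshly created chunk reader: one that has already been looped over yields nothing.
# n_jobs=-1 uses all CPU cores; the results come back as a list in chunk order.
results = Parallel(n_jobs=-1)(delayed(Fuction_test)(item) for item in table)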
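
For reference, here is a small end-to-end sketch of the joblib pattern on toy data. The in-memory table and the `add_ten` function are illustrative assumptions, and n_jobs=2 is chosen only to keep the demo light:

```python
import io
import pandas as pd
from joblib import Parallel, delayed

def add_ten(chunk):
    """Toy per-chunk computation: add 10 to every value."""
    return chunk + 10

# Numeric stand-in for the real file, read in chunks of two rows
sample = "1\t2\n3\t4\n5\t6\n7\t8\n"
reader = pd.read_csv(io.StringIO(sample), sep="\t", header=None, chunksize=2)

# Each chunk is handed to a worker; results come back in chunk order
results = Parallel(n_jobs=2)(delayed(add_ten)(chunk) for chunk in reader)
combined = pd.concat(results, ignore_index=True)
print(combined)
```

Each chunk is pickled and sent to a worker process, so this pays off mainly when the per-chunk computation is expensive compared with the cost of moving the data.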