How to separate English and Chinese text in a file with Python



试图成为大佬… · Published 2022-07-31 00:51:52

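Method 1: If English and Chinese are never mixed on the same line, an is_chinese() function is all you need. \u4e00 is the first code point of the Unicode CJK Unified Ideographs block, and \u9fa5 is the last ideograph of the block's original allocation (the block itself extends to \u9fff), so this range covers the commonly used Chinese characters.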

is_chinese(strings) returns True as soon as the string contains any Chinese character.
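To make the range concrete, here is a small sketch that tests a few sample characters against it. in_cjk_range is a hypothetical helper introduced only for this illustration. One thing it shows is that full-width punctuation such as 。 lies outside the range, so this test does not count it as a Chinese character.

```python
# Hypothetical helper for illustration: test ONE character against the range.
def in_cjk_range(ch):
    return '\u4e00' <= ch <= '\u9fa5'


# Print each sample character, its code point, and whether it is in the range.
for ch in ['这', 'T', '2', '。']:
    print(repr(ch), hex(ord(ch)), in_cjk_range(ch))
```

Because the comparison is done per character, a line such as 这是一个测试。 is still detected as Chinese: the ideographs match even though the final 。 does not.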

File contents:
This is a test.
这是一个测试。

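def is_chinese(strings):
    # Return True as soon as any character falls in the Chinese range.
    for _char in strings:
        if '\u4e00' <= _char <= '\u9fa5':
            return True
    return False


with open('./测试文件', mode="r", encoding="utf-8") as file:
    for line in file:
        if not is_chinese(line):    # keep only lines without Chinese, i.e. the English lines
            print(line, end="")     # line already ends with a newline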

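Result (only the English line is printed; remove the "not" in the condition to print the Chinese lines instead):

This is a test.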
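The same check can also be written as a regular expression, which makes it easy to collect both groups in one pass. The sketch below is an alternative to the loop above, not the original method. CHINESE_RE and partition_lines are hypothetical names, and a list of strings stands in for the file so the example runs on its own.

```python
import re

# Same code point range as is_chinese(), expressed as a character class.
CHINESE_RE = re.compile(r'[\u4e00-\u9fa5]')


def partition_lines(lines):
    """Split lines into (english, chinese) lists by whether they contain Chinese."""
    english, chinese = [], []
    for line in lines:
        line = line.rstrip("\n")
        if CHINESE_RE.search(line):
            chinese.append(line)
        else:
            english.append(line)
    return english, chinese


# A list of strings stands in for the lines read from the file.
english, chinese = partition_lines(["This is a test.\n", "这是一个测试。\n"])
print(english)
print(chinese)
```

Any iterable of lines works, so an open file object can be passed in place of the list.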

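Method 2: If English and Chinese are mixed on the same line, with the English first and the Chinese after it, define a split_chinese() function that checks each character of the line and returns the first Chinese character it finds.

find() then gives the index of that character, and slicing the line at that index separates the English from the Chinese.

File contents: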

This is a test.,这是一个测试。
This is a test2.,这是一个测试2。

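def split_chinese(strings):
    # Return the first Chinese character in the string, or None if there is none.
    for _char in strings:
        if '\u4e00' <= _char <= '\u9fa5':
            return _char
    return None


with open('./测试文件', mode="r", encoding="utf-8") as file:
    for line in file:
        line = line.rstrip("\n")
        _char = split_chinese(line)
        if _char is None:           # no Chinese on this line; find(None) would raise TypeError
            print("English:", line)
            continue
        index = line.find(_char)    # index of the first Chinese character
        print("English:", line[:index])
        print("Chinese:", line[index:])

Result:

English: This is a test.,
Chinese: 这是一个测试。
English: This is a test2.,
Chinese: 这是一个测试2。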
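The lookup of the character and the find() call can be folded into one step, since a regex match object already carries the position of the match. The following is a sketch of that variant, under the same assumption that the English comes first on each line. split_at_first_chinese is a hypothetical name.

```python
import re

CHINESE_RE = re.compile(r'[\u4e00-\u9fa5]')


def split_at_first_chinese(line):
    """Return (english, chinese), split at the first Chinese character."""
    match = CHINESE_RE.search(line)
    if match is None:       # no Chinese at all: everything is English
        return line, ""
    index = match.start()   # position of the first Chinese character
    return line[:index], line[index:]


english, chinese = split_at_first_chinese("This is a test.,这是一个测试。")
print("English:", english)
print("Chinese:", chinese)
```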

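Method 3: If English and Chinese alternate several times within a line, define a split_chinese() function that returns the separator characters, that is, the first character of each new segment. The index of each separator is then looked up, and the line is sliced at those indices into Chinese and English parts.

is_english(_char) checks whether the input character is a letter: it returns True if it is, and False otherwise.

Analysis:

To extract "这是一个测试" from the text, we need the indices of "这" and "A". split_chinese() collects the characters that start each segment, and their indices are then found with find().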
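As a worked example of that idea, the sketch below extracts the first Chinese segment from the first sample line by hand. The two separator characters are hard-coded here purely for illustration.

```python
line = "This is a test.,这是一个测试,Are you OK?,你还好吗？,I split.,我裂开"

start = line.find('这')         # first Chinese character of the segment
end = line.find('A', start)     # first letter after the Chinese run
print(start, end)
print(line[start:end])          # the Chinese segment, including its trailing comma
```

Passing start as the second argument to find() restricts the search to the text after the Chinese segment begins, so an earlier "A" in the line could not be matched by mistake.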

File contents:

This is a test.,这是一个测试,Are you OK?,你还好吗？,I split.,我裂开
This is a test2.,这是一个测试2,Are you OK?2,你还好吗？2,I split.,我裂开2

def is_english(_char):
    # Return True if the string contains an ASCII letter (A-Z or a-z).
    for s in _char:
        if ('\u0041' <= s <= '\u005a') or ('\u0061' <= s <= '\u007a'):
            return True
    return False


def split_chinese(strings):
    # Return the character that starts each segment after the leading English one.
    chars = []
    append = True   # True: waiting for the next Chinese character; False: waiting for the next letter
    for _char in strings:
        if is_english(_char):
            if not append:  # first letter after a Chinese run: keep it
                chars.append(_char)
                append = True
        if '\u4e00' <= _char <= '\u9fa5':
            if append:      # first Chinese character overall, or the first one after an English run: keep it
                chars.append(_char)
                append = False
    return chars


with open('./测试文件', mode="r", encoding="utf-8") as file:
    for line in file:
        line = line.rstrip("\n")
        chars = split_chinese(line)
        print(chars)
        start = 0           # where the current segment begins
        label = "English:"  # text before the first separator is treated as English
        for _char in chars:
            # Search from start so an earlier occurrence of the same character is skipped.
            index = line.find(_char, start)
            if index > start:   # skip the empty leading segment when a line starts with Chinese
                print(label, line[start:index])
            # A letter starts an English segment; anything else here starts a Chinese one.
            label = "English:" if is_english(_char) else "Chinese:"
            start = index
        print(label, line[start:])  # last segment

Result:

['这', 'A', '你', 'I', '我']
English: This is a test.,
Chinese: 这是一个测试,
English: Are you OK?,
Chinese: 你还好吗？,
English: I split.,
Chinese: 我裂开
['这', 'A', '你', 'I', '我']
English: This is a test2.,
Chinese: 这是一个测试2,
English: Are you OK?2,
Chinese: 你还好吗？2,
English: I split.,
Chinese: 我裂开2
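The separator lookup can be avoided altogether by building the segments in a single pass over the line. The sketch below is one possible alternative, not the method above. split_runs is a hypothetical name, and it assumes that digits, punctuation and spaces belong to whichever segment they follow, and that text before the first letter or Chinese character counts as English. It uses str.isascii(), which requires Python 3.7 or later.

```python
def split_runs(line):
    """Split a line into (label, text) runs of English and Chinese."""
    runs = []
    label, buf = "English", []  # assumption: leading neutral text counts as English
    for ch in line:
        if '\u4e00' <= ch <= '\u9fa5':
            kind = "Chinese"
        elif ch.isascii() and ch.isalpha():
            kind = "English"
        else:
            kind = label        # digits, punctuation, spaces: stay in the current run
        if kind != label and buf:
            runs.append((label, "".join(buf)))  # language switched: close the run
            buf = []
        label = kind
        buf.append(ch)
    if buf:
        runs.append((label, "".join(buf)))      # flush the last run
    return runs


sample = "This is a test.,这是一个测试,Are you OK?,你还好吗？,I split.,我裂开"
for label, text in split_runs(sample):
    print(label + ":", text)
```

Since each character is classified exactly once, repeated characters in the line cannot confuse the split, which is the case the find()-based lookup has to guard against.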