Several effective strategies for comparing and merging files in Python

ztj100 2025-02-18 14:24

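Comparing and merging multiple files is a common task in everyday programming and data analysis. With solid built-in file handling and broad library support, Python is well suited to it.

Below we look at several effective strategies for comparing and merging files, each with a code example and an explanation.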

1. Basic file reading and writing

Knowing how to read and write files is the foundation for everything that follows.

# Open and read content from the input file
with open('input_file.txt', 'r') as input_file:  
    data = input_file.readlines()  # Read all lines from the input file

# Open the output file and write the content into it
with open('output_file.txt', 'w') as output_file:  
    for line in data:  
        output_file.write(line)  # Write each line to the output file
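
The snippet above loads every line into a list before writing. As a minimal sketch, the same copy can be streamed one line at a time with an explicit encoding. The function name copy_text and the UTF-8 default are assumptions made for this illustration, not part of the original example.

```python
def copy_text(src_path, dst_path, encoding='utf-8'):
    """Copy a text file line by line and return the number of lines written."""
    count = 0
    with open(src_path, 'r', encoding=encoding) as src, \
         open(dst_path, 'w', encoding=encoding) as dst:
        for line in src:  # the file object yields one line at a time
            dst.write(line)
            count += 1
    return count
```

A call such as copy_text('input_file.txt', 'output_file.txt') would mirror the example above while keeping only one line in memory at a time.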

2. Comparing file contents

Use the standard-library difflib module to show the differences between two files.

# Import difflib for file comparison and sys for writing the result
import difflib
import sys

# Open both input files
with open('input_file1.txt', 'r') as input_file1, open('input_file2.txt', 'r') as input_file2:

    # Compare the content of the two files using unified_diff
    diff = difflib.unified_diff(
        input_file1.readlines(), input_file2.readlines(),
        fromfile='input_file1.txt', tofile='input_file2.txt')

    # The diff lines already end with newlines, so write them out unchanged
    sys.stdout.writelines(diff)
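
A unified diff is meant to be read by a person. When only a quick numeric summary is needed, one option is difflib.ndiff, which prefixes every output line with a two-character code. The sketch below counts added and removed lines; the helper name count_changes is hypothetical.

```python
import difflib

def count_changes(old_lines, new_lines):
    """Return (added, removed) line counts between two lists of lines."""
    added = removed = 0
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith('+ '):    # line present only in new_lines
            added += 1
        elif line.startswith('- '):  # line present only in old_lines
            removed += 1
        # lines starting with '  ' are unchanged, '? ' lines are hints
    return added, removed
```

The two arguments are lists of lines, such as the results of readlines() on the two input files.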

3. Merging CSV files

For CSV files, the pandas library can handle the merge.

# Import pandas library for data manipulation
import pandas as pd  

# Read the first CSV file into a DataFrame
df1 = pd.read_csv('data_file1.csv')  

# Read the second CSV file into a DataFrame
df2 = pd.read_csv('data_file2.csv')  

# Merge the two DataFrames by concatenating them, assuming matching column names
merged_df = pd.concat([df1, df2], ignore_index=True)  

# Save the merged DataFrame to a new CSV file
merged_df.to_csv('output_merged.csv', index=False)
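
Where pandas is not available, a similar row-wise concatenation can be sketched with the standard-library csv module. This is a different technique from pd.concat, shown only for comparison. The function name concat_csv is hypothetical, and the sketch assumes UTF-8 files with a header row that fit in memory. Columns missing from a file are written as empty strings.

```python
import csv

def concat_csv(paths, out_path):
    """Append the rows of several CSV files; return the number of rows written."""
    rows, fieldnames = [], []
    for path in paths:
        with open(path, newline='', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            # Collect the union of column names in first-seen order
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames,
                                restval='', extrasaction='ignore')
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```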

4. Column-wise CSV merge

Merge on specific columns, for example by joining files on a common key.

# Import pandas library for data manipulation
import pandas as pd  

# Read the first CSV file into a DataFrame
df1 = pd.read_csv('data_file1.csv')  

# Read the second CSV file into a DataFrame
df2 = pd.read_csv('data_file2.csv')

# Merge the two DataFrames based on a common column named 'common_key'
# 'how="outer"' ensures that all rows from both DataFrames are included, 
# with missing values filled as NaN where data does not match
merged_df = pd.merge(df1, df2, on='common_key', how='outer')  

# Save the merged DataFrame to a new CSV file
merged_df.to_csv('output_merged_by_key.csv', index=False)  
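
After an outer join it is often useful to know which rows matched. As a small sketch, pd.merge accepts indicator=True, which adds a '_merge' column. The in-memory frames and the column names 'name' and 'score' below are made up for illustration and stand in for the two CSV files.

```python
import pandas as pd

# Small in-memory frames stand in for data_file1.csv and data_file2.csv
left = pd.DataFrame({'common_key': [1, 2], 'name': ['a', 'b']})
right = pd.DataFrame({'common_key': [2, 3], 'score': [90, 80]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
merged = pd.merge(left, right, on='common_key', how='outer', indicator=True)

# Map each key to its origin for a quick check of unmatched rows
origin = dict(zip(merged['common_key'].tolist(),
                  merged['_merge'].astype(str).tolist()))
print(origin)
```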

5. Line-based merge

When the files share a similar line structure, simply iterate over them and append the lines.

# Initialize an empty list to store the content from all input files
data = []  

# List of input text files to be read and merged
for filename in ['input_file1.txt', 'input_file2.txt']:  
    # Open each file in read mode
    with open(filename, 'r') as file:  
        # Read all lines from the current file and add them to the data list
        data.extend(file.readlines())  

# Open the output file in write mode
with open('output_merged_file.txt', 'w') as merged_file:  
    # Write each line from the data list into the output file
    for line in data:  
        merged_file.write(line)
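
If an input file does not end with a newline, its last line runs into the first line of the next file in the output. A minimal sketch that guards against this while streaming the lines is shown below; the function name merge_lines and the UTF-8 default are assumptions.

```python
def merge_lines(paths, out_path, encoding='utf-8'):
    """Concatenate text files line by line, ending every line with a newline."""
    with open(out_path, 'w', encoding=encoding) as out:
        for path in paths:
            with open(path, 'r', encoding=encoding) as src:
                for line in src:
                    # Only the last line of a file can lack a trailing newline
                    out.write(line if line.endswith('\n') else line + '\n')
```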

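6. Merge with deduplication

Use a set to drop duplicate lines before merging. A set does not preserve order, which is why the example sorts the lines before writing them.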

# Initialize a set to store unique lines from all input files
unique_lines = set()  

# List of input text files to be read and merged
for filename in ['input_file1.txt', 'input_file2.txt']:  
    # Open each file in read mode
    with open(filename, 'r') as file:  
        # Add all lines from the current file to the set (duplicates are automatically removed)
        unique_lines.update(file.readlines())  

# Open the output file in write mode
with open('output_merged_unique.txt', 'w') as merged_file:  
    # Sort the unique lines to ensure consistent output order
    for line in sorted(unique_lines):  
        # Write each unique line into the output file
        merged_file.write(line)
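
Sorting changes the original line order. When the first-seen order matters, one possible variant keeps a set of lines already written and streams the rest through. The function name merge_unique is hypothetical. Lines are compared without their trailing newline, so a final line lacking one still counts as a duplicate.

```python
def merge_unique(paths, out_path, encoding='utf-8'):
    """Merge files, keeping only the first occurrence of each line, in order."""
    seen = set()
    with open(out_path, 'w', encoding=encoding) as out:
        for path in paths:
            with open(path, 'r', encoding=encoding) as src:
                for line in src:
                    key = line.rstrip('\n')  # compare without the newline
                    if key not in seen:
                        seen.add(key)
                        out.write(key + '\n')
    return len(seen)  # number of unique lines written
```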

7. Binary comparison of text files

Use the filecmp module to compare the binary contents of files.

# Import the filecmp module for file comparison
import filecmp  

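# Compare the binary contents of 'input_file1.txt' and 'input_file2.txt'
# shallow=False forces a byte-by-byte comparison; by default, files with
# identical os.stat() signatures are reported equal without being read
if filecmp.cmp('input_file1.txt', 'input_file2.txt', shallow=False):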
    print("Files are identical.")  # Output message if files are identical
else:
    print("Files differ.")  # Output message if files differ
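
When the same file has to be checked against many others, or a fingerprint needs to be stored for later, comparing hashes is a common alternative to filecmp. The sketch below reads each file in chunks so large files do not have to fit in memory. The helper names file_digest and same_content, the 64 KiB chunk size, and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def same_content(path1, path2):
    """Report whether two files have the same SHA-256 digest."""
    return file_digest(path1) == file_digest(path2)
```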

8. Efficient comparison of large files

For large files, read and compare them line by line to save memory.

# zip_longest keeps going when one file is longer, padding the shorter one with None
# (plain zip would stop at the end of the shorter file and miss the extra lines)
from itertools import zip_longest

# Open both large files for reading
with open('input_large_file1.txt', 'r') as f1, open('input_large_file2.txt', 'r') as f2:

    # Read lines from both files simultaneously and compare them
    for line1, line2 in zip_longest(f1, f2):
        # If a difference is found between the two lines, print a message and stop the comparison
        if line1 != line2:
            print("Difference found!")
            break  # Exit the loop as the first difference has been found
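
The loop above reports that a difference exists but not where. As a sketch, the same line-by-line approach can return the position of the first mismatch. The function name first_difference is hypothetical. It treats a file that ends early as differing at the first missing line.

```python
from itertools import zip_longest

def first_difference(path1, path2, encoding='utf-8'):
    """Return the 1-based number of the first differing line, or None."""
    with open(path1, 'r', encoding=encoding) as f1, \
         open(path2, 'r', encoding=encoding) as f2:
        pairs = zip_longest(f1, f2)  # pads the shorter file with None
        for number, (line1, line2) in enumerate(pairs, start=1):
            if line1 != line2:
                return number
    return None  # the files are identical line for line
```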

9. Dynamically merging multiple files

Use a loop to merge every file in a list of file paths.

# Generate a list of file paths for input files ('input_file1.txt' to 'input_file3.txt')
file_paths = ['input_file{}.txt'.format(i) for i in range(1, 4)]  

# Open the output file ('output_merged_all.txt') in write mode
with open('output_merged_all.txt', 'w') as merged:  
    # Iterate through the list of input file paths
    for path in file_paths:  
        # Open each file in read mode
        with open(path, 'r') as file:  
            # Write the content of the current file to the merged output file
            # Add a newline character to separate the content of different files
            merged.write(file.read() + '\n')
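
When the number of input files is not known in advance, the path list can be discovered with a glob pattern instead of being generated from a range. The sketch below uses pathlib and shutil.copyfileobj to stream each file in binary mode, so no separator is added between files. The function name merge_matching is hypothetical. Paths are sorted as strings, so 'input_file10.txt' would come before 'input_file2.txt'.

```python
import shutil
from pathlib import Path

def merge_matching(directory, pattern, out_path):
    """Concatenate all files in directory matching pattern; return their names."""
    out_path = Path(out_path)
    # Collect the sources first and skip the output file if it matches too
    sources = sorted(p for p in Path(directory).glob(pattern)
                     if p.resolve() != out_path.resolve())
    with open(out_path, 'wb') as out:
        for path in sources:
            with open(path, 'rb') as src:
                shutil.copyfileobj(src, out)  # copies in chunks, not all at once
    return [p.name for p in sources]
```

For example, merge_matching('.', 'input_file*.txt', 'output_merged_all.txt') would pick up every matching file in the current directory.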

10. Advanced strategy: smart merging

For more complex merge criteria, such as merging by date or ID, sort the data as part of the merge.

# Import pandas library for data manipulation
import pandas as pd  

# Read CSV files ('input_file1.csv' and 'input_file2.csv') into DataFrames
dfs = [pd.read_csv(f) for f in ['input_file1.csv', 'input_file2.csv']]  

# Concatenate the DataFrames and sort by the 'date_column', assuming it's the column holding the date data
sorted_df = pd.concat(dfs).sort_values(by='date_column')  

# Save the merged and sorted DataFrame to a new CSV file
sorted_df.to_csv('output_smart_merged.csv', index=False)
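
If each input file is already sorted by its key, the files can be combined without loading them fully, using the standard-library heapq.merge. This is a different technique from the pandas sort above. The sketch assumes headerless text files in which every line starts with an ISO-format date (which sorts correctly as a string) followed by a comma. The function name merge_sorted is hypothetical.

```python
import heapq
from contextlib import ExitStack

def merge_sorted(paths, out_path, encoding='utf-8'):
    """Merge files that are each sorted by their first comma-separated field."""
    def sort_key(line):
        return line.split(',', 1)[0]  # e.g. '2024-01-31' from '2024-01-31,foo'

    with ExitStack() as stack:
        files = [stack.enter_context(open(p, 'r', encoding=encoding))
                 for p in paths]
        out = stack.enter_context(open(out_path, 'w', encoding=encoding))
        # heapq.merge reads lazily, holding one pending line per input file
        for line in heapq.merge(*files, key=sort_key):
            out.write(line if line.endswith('\n') else line + '\n')
```

If the inputs are not sorted beforehand, the output will not be sorted either; in that case sorting the combined data, as in the pandas example, is the safer choice.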
