In today’s data-driven world, the ability to combine Excel files using Python and Pandas is essential for handling large amounts of data from various sources. We often find ourselves working with multiple Excel files that need to be merged into a single file for further analysis.
In this blog post, we’ll delve into a straightforward yet powerful Python script that harnesses the capabilities of the Pandas library to consolidate Excel files, streamlining your data analysis workflow and enhancing efficiency.
Some of the things mentioned in this post I learned from the Zerto To Mastery academy here. If You decide to sign-up, I will earn a commission.
The Problem – having multiple Excel files with important data
I recently ran into a business situation where users exported multiple (over hundreds) Excel spreadsheets from a single application.
After the export the data needed to be worked with – even modified, and there were formulas added.
The challenge here was that they were losing precious time copying and pasting data into a single Excel .xlsx file. Tedious work that could be automated easily.
An important distinction in this case – the users needed all of the data to be in one Excel file and on the same sheet!
To reiterate the problem:
- Over hundreds of .xlsx files exported from a third-party application.
- Users are manually copying and pasting data from the files into one single file to work with the data.
- Data needs to be in the same file and the same sheet in Excel.
How can we solve this conundrum?
Enter Python
I thought about doing this with Excel Power Query but after some research, I decided on Python instead.
Why?
Because of Pandas!
Pandas is a powerful, open-source Python library designed for data manipulation, analysis, and cleaning. It provides flexible data structures like DataFrames and Series, making it easy to handle structured data in an intuitive and efficient manner.
Set Up Your Environment with Essential Libraries
Before diving into the code, let’s ensure we have the required libraries installed.
The script relies on two key libraries: os
and pandas
. While os
is a built-in Python library, you’ll need to install pandas
. Additionally, you will need the openpyxl
library for reading and writing Excel files. You can install both libraries using pip:
pip install pandas openpyxl
Breaking down the Python script line by line to combine Excel files
Now that our environment is set up, let’s break down the code and understand how it works. I will share the complete script at the end of the post!
Import the necessary libraries:
import os
import pandas as pd
Set the folder path containing your Excel files:
folder_path = 'C:/Yourfolderpath'
os.chdir(folder_path)
Create a new Excel file to store the combined data:
output_file = 'combined_data.xlsx'
output_writer = pd.ExcelWriter(output_file, engine='openpyxl')
Initialize an empty DataFrame to store the combined data:
combined_data = pd.DataFrame()
Working with the Excel files
Loop through all Excel files in the folder, read their content, and concatenate the data:
for file in os.listdir(folder_path):
if file.endswith('.xlsx'):
try:
df = pd.read_excel(file, engine='openpyxl')
combined_data = pd.concat([combined_data, df], ignore_index=True)
except Exception as e:
print(f"Error processing file '{file}': {e}")
Write the combined data to the output file as a single sheet:
combined_data.to_excel(output_writer, index=False, sheet_name='Combined Data')
Save and close the output file:
output_writer.save()
print("Merged files saved as", output_file)
The full code block to combine Excel Files using Python and Pandas
Here’s the full code for Your convenience.
# Change the directory to the folder containing your .xlsx files
folder_path = 'C:/Yourfolderpath'
os.chdir(folder_path)
# Create a new Excel file to store the combined data
output_file = 'combined_data.xlsx'
output_writer = pd.ExcelWriter(output_file, engine='openpyxl')
# Initialize an empty DataFrame to store the combined data
combined_data = pd.DataFrame()
# Loop through all .xlsx files in the folder
for file in os.listdir(folder_path):
if file.endswith('.xlsx'):
try:
# Read the content of the file using pandas
df = pd.read_excel(file, engine='openpyxl')
# Concatenate the data to the combined_data DataFrame
combined_data = pd.concat([combined_data, df], ignore_index=True)
except Exception as e:
print(f"Error processing file '{file}': {e}")
# Write the combined data to the output file as a single sheet
combined_data.to_excel(output_writer, index=False, sheet_name='Combined Data')
# Save and close the output file
output_writer.save()
print("Merged files saved as", output_file)
There You have it! Now You can effortlessly combine Your own Excel files using Python and Pandas!
Some of the things mentioned in this post I learned from the Zerto To Mastery academy here. If You decide to sign-up, I will earn a commission.