How to Effortlessly Combine Excel Files using Python and Pandas

In today’s data-driven world, the ability to combine Excel files using Python and Pandas is essential for handling large amounts of data from various sources. We often find ourselves working with multiple Excel files that need to be merged into a single file for further analysis.

In this blog post, we’ll delve into a straightforward yet powerful Python script that harnesses the capabilities of the Pandas library to consolidate Excel files, streamlining your data analysis workflow and enhancing efficiency.

Some of the things mentioned in this post I learned from the Zerto To Mastery academy here. If You decide to sign-up, I will earn a commission.

The Problem – having multiple Excel files with important data

A lot of people working with multiple files – we need to simplify the workflow!

I recently ran into a business situation where users exported multiple (over hundreds) Excel spreadsheets from a single application.

After the export the data needed to be worked with – even modified, and there were formulas added.

The challenge here was that they were losing precious time copying and pasting data into a single Excel .xlsx file. Tedious work that could be automated easily.

An important distinction in this case – the users needed all of the data to be in one Excel file and on the same sheet!

To reiterate the problem:

  1. Over hundreds of .xlsx files exported from a third-party application.
  2. Users are manually copying and pasting data from the files into one single file to work with the data.
  3. Data needs to be in the same file and the same sheet in Excel.

How can we solve this conundrum?

Enter Python

And nope – not this guy.

I thought about doing this with Excel Power Query but after some research, I decided on Python instead.

Why?

Because of Pandas!

Pandas is a powerful, open-source Python library designed for data manipulation, analysis, and cleaning. It provides flexible data structures like DataFrames and Series, making it easy to handle structured data in an intuitive and efficient manner.

Set Up Your Environment with Essential Libraries

Before we do any coding with Python and Pandas – let us set up the environment properly.

Before diving into the code, let’s ensure we have the required libraries installed.

The script relies on two key libraries: os and pandas. While os is a built-in Python library, you’ll need to install pandas. Additionally, you will need the openpyxl library for reading and writing Excel files. You can install both libraries using pip:

pip install pandas openpyxl

Breaking down the Python script line by line to combine Excel files

Reading code is fun and it can also make Your head explode!

Now that our environment is set up, let’s break down the code and understand how it works. I will share the complete script at the end of the post!

Import the necessary libraries:

import os
import pandas as pd

Set the folder path containing your Excel files:

folder_path = 'C:/Yourfolderpath'
os.chdir(folder_path)

Create a new Excel file to store the combined data:

output_file = 'combined_data.xlsx'
output_writer = pd.ExcelWriter(output_file, engine='openpyxl')

Initialize an empty DataFrame to store the combined data:

combined_data = pd.DataFrame()

Working with the Excel files

We work with the Excel files at the end of the code block in this specific case.

Loop through all Excel files in the folder, read their content, and concatenate the data:

for file in os.listdir(folder_path):
    if file.endswith('.xlsx'):
        try:
            df = pd.read_excel(file, engine='openpyxl')
            combined_data = pd.concat([combined_data, df], ignore_index=True)
        except Exception as e:
            print(f"Error processing file '{file}': {e}")

Write the combined data to the output file as a single sheet:

combined_data.to_excel(output_writer, index=False, sheet_name='Combined Data')

Save and close the output file:

output_writer.save()
print("Merged files saved as", output_file)

The full code block to combine Excel Files using Python and Pandas

We finally did it!

Here’s the full code for Your convenience.

# Change the directory to the folder containing your .xlsx files
folder_path = 'C:/Yourfolderpath'
os.chdir(folder_path)

# Create a new Excel file to store the combined data
output_file = 'combined_data.xlsx'
output_writer = pd.ExcelWriter(output_file, engine='openpyxl')

# Initialize an empty DataFrame to store the combined data
combined_data = pd.DataFrame()

# Loop through all .xlsx files in the folder
for file in os.listdir(folder_path):
    if file.endswith('.xlsx'):
        try:
            # Read the content of the file using pandas
            df = pd.read_excel(file, engine='openpyxl')

            # Concatenate the data to the combined_data DataFrame
            combined_data = pd.concat([combined_data, df], ignore_index=True)

        except Exception as e:
            print(f"Error processing file '{file}': {e}")

# Write the combined data to the output file as a single sheet
combined_data.to_excel(output_writer, index=False, sheet_name='Combined Data')

# Save and close the output file
output_writer.save()
print("Merged files saved as", output_file)

There You have it! Now You can effortlessly combine Your own Excel files using Python and Pandas!

Some of the things mentioned in this post I learned from the Zerto To Mastery academy here. If You decide to sign-up, I will earn a commission.

Leave a Comment