Merging multiple files for Analysis¶

In practice it's often the case that you will be presented with multiple data files that need to be merged together. Perhaps it's survey data that was collected on a monthly basis over a period of a year, or performance data that was collected from a number of locations. When it's only 4-5 files it's extremely easy to manually load all of them, concatenate them, and save the file out.

But, what if it's not 10 files? What if it's 100? or 1000? What then?

At the end of the day Python is interacting with your computer, so there are a number of ways to look inside folders. Contrary to what many would believe, but it's still possible to navigate a computer's directory without your mouse :) Let's start with a brief overview of the terminal/command prompt and some basic commands, then we'll switch over to python and leveraging the os package to do similar things.

A quick note: I'm working in a Linux operating system, so the terminal commands will work in Linux and Unix (i.e. Mac's), but will not work in Windows. You can still do the same thing, but the commands are slightly different. For example if I want to list the folders in a directory in linux I'd use ls in windows you will use dir. It's extremely easy to just Google ls in Windows, etc.

The Terminal¶

There are many reasons why knowing the terminal is valuable.

It helps us wrap our head around the Python os package.
If you ever move into the cloud to do your work, you will need to know how to use the terminal to get stuff done on the cloud machine. In most instances it doesn't make sense to set up an RDP (Remote Desktop) for cloud machines, so using the terminal is your new desktop.

The terminal is what people used to use before OS GUIs were invented. It looked something like this back in the day :)

The terminals always remind me of my favorite game when I was a little kid, Oregon Trail, which on my parent's computer ran straight out of DOS.

Basics¶

Think about using the terminal in the same way you'd navigate your directory via GUI. If I have a folder with files in it and I double click on it I move into the new folder. That same methodology holds true for the terminal if you use the cd command.

Important Commands:

cd (change directory)
ls (list directory contents)
mkdir (make a new directory)
cp (cp a file from one directory to another)
.. (allows you to move back directories)

Here is an example of all of them at work

so what did we do?

we changed directories to a folder on our desktop titled "test"
we printed the working directory
we made a folder titled "additional_folder"
we listed the files in the working directory, which now include the additional_folder directory and a csv file
we navigated to the additional_folder directory
we used .. to navigate back to the test directory
we copied the test_data.csv file to the additional_folder directory
we navigated to the additional_folder directory to confirm we had copied the file there.

Pretty, easy, right?

Now that we have a sense of how to navigate the computers' directories via the terminal let's move on to python and figure out how to do similar things within python.

In [1]:

import pandas as pd
import numpy as np
from os import listdir
from os.path import isfile, join

OS¶

The OS module in python allows you to interact with your operating system within python. Why might this be important for loading multiple files? Because we will want to see every file that's in a working directory in order to identify which files to load.

for this we will use the following os functions:

listdir
isfile
join

These are all extremely straightforward. Let's quickly look at them. We'll start by setting a path to start from.

In [2]:

mypath = "data/reviews/splits/"

In [3]:

len(listdir(mypath))

Out[3]:

So we can see here that there are 215 files listed in this directory. Now let's use list comprehension and isfile and join to pull the names of all of the files out of the path into a list.

In [4]:

onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

In [5]:

print("number of files in the directory: ",len(onlyfiles))
print("example of file names: ",onlyfiles[:5])

number of files in the directory:  215
example of file names:  ['171_df.csv', '133_df.csv', '96_df.csv', '188_df.csv', '128_df.csv']

We get 215 unique filenames like we see above and we can look at a few of the example file names.

So in reality I just split several rather large dataframes into 215 smaller dataframes for this example, so I just named them based off of their number for simplicity.

Now let's go ahead and load each of the 215 files. Because we know they are all dataframes we can just loop through the files and load them each as a pandas dataframe and then store each pandas dataframe in a list of dataframes.

In [6]:

data_frames = []
for i in onlyfiles:
    df = pd.read_csv(mypath+i)
    data_frames.append(df)

In [7]:

len(data_frames)

Out[7]:

We see here that we have 215 dataframes.

In [8]:

len(data_frames[0])

Out[8]:

Each of the 215 files is of length 100.

Next step is to concatenate them into one large dataframe for analysis purposes and then reset the index for the potential of indexing at a later point.

In [9]:

full_df = pd.concat(data_frames)
full_df.reset_index(inplace=True)

In [10]:

len(full_df)

Out[10]:

Now we have the separate files all in one dataframe, which is what we wanted. But we can assume that these files are randomly added, so let's assume we need to put them in some sort of order. How about based off of date of review?

Let's ensure that the dates are actually of type datetime.

In [11]:

import datetime

In [12]:

full_df['date'] = pd.to_datetime(full_df['date'])

In [13]:

full_df = full_df.sort_values('date')

In [14]:

full_df.head(3)

Out[14]:

	level_0	index	date	employee_title	location	employee_status	review_title	years_at_company	pros	cons	advice_to_mgmt	rating_overall	rating_balance	rating_culture	rating_career	rating_comp	rating_mgmt	company
14496	96	1378	2008-07-22	Stock Clerk	Fairlawn, OH	Former Employee	Working at Giant Eagle is not that "Giant	NaN	flexible schedule, location, proximity from ho...	no respect for employees, minimum wages for a ...	When your employees quit their menial jobs, as...	2.0	3.0	2.5	2.5	2.5	NaN	geagle
14493	93	1375	2008-09-22	Stock Clerk	Pittsburgh, PA	Former Employee	good place to work	NaN	Very good scheduling and management. They will...	Sometime I felt like it was hard to work your ...	remember the little people	4.0	4.0	4.0	3.5	3.5	NaN	geagle
14495	95	1377	2008-09-24	Engineering Manager	Pittsburgh, PA	Current Employee	If it wasn't close to home, I wouldn't be here.	NaN	Local, privately held company. Big company, wi...	The leadership talks about the right things bu...	Live your values, stop talking about them. Fla...	2.0	4.0	2.0	2.5	2.0	NaN	geagle

In [15]:

full_df.tail(3)

Out[15]:

	level_0	index	date	employee_title	location	employee_status	review_title	years_at_company	pros	cons	advice_to_mgmt	rating_overall	rating_balance	rating_culture	rating_career	rating_comp	rating_mgmt	company
18218	18	0	2019-01-26	Front End Team Leader	Youngstown, OH	Current Employee	9 Great Years of Employment!	More than 8 years	flexibility, ability to change careers at any ...	Too much of focus on wage/dollars % and not gu...	Ease off on the "special projects" and let sto...	4.0	3.0	3.0	4.0	4.0	3.0	geagle
7663	63	54	2019-01-26	Cashier	Jonesboro, GA	Former Employee	This isn't a real review I just wanted premium	Less than a year	The job is really easy to get	The pay isn't great and the work is extra	Do your job better mate	3.0	3.0	3.0	5.0	3.0	3.0	mcd
17949	49	2	2019-01-26	NaN	Madison, AL	Current Employee	Pharmacist	More than a year	Kroger offers a family like feel with benefits...	This is a VERY corporate company. There are so...	Reduce the number of emails sent out and make ...	5.0	3.0	4.0	4.0	5.0	2.0	kr

In [16]:

full_df['date'].describe()

Out[16]:

count                   21456
unique                   1681
top       2018-08-20 00:00:00
freq                       87
first     2008-07-22 00:00:00
last      2019-01-26 00:00:00
Name: date, dtype: object

How did I separate all of the files?¶

I had to think through the easiest way to do this because there wasn't something readily available online to separate the files. I figured I'd leave in my approach if there is ever a reason anyone would need to do this. I'd also be interested to see how others have done it, because I'm fairly certain this isn't the most efficient approach.

In [17]:

lst = list(range(1,223))

In [18]:

sub_frames = []
start = 0
end = 100
for i in lst:
    sub_frames.append(full_df[start:end])
    start +=100
    end +=100

In [19]:

index = 1
for i in sub_frames:
    i.to_csv('data/reviews/splits/'+str(index)+'_df.csv',index=False)
    index+=1

AIO: Bridging the Gap Between Data Science and IO

Loading Multiple Files

N. Koenig

Merging multiple files for Analysis¶

The Terminal¶

Basics¶

OS¶

How did I separate all of the files?¶

Comments