Merging Multiple Files for Analysis

In practice it's often the case that you will be presented with multiple data files that need to be merged together. Perhaps it's survey data that was collected on a monthly basis over a period of a year, or performance data that was collected from a number of locations. When it's only 4-5 files it's extremely easy to manually load all of them, concatenate them, and save the file out.

But what if it's more than that? What if it's 100 files? Or 1,000? What then?

At the end of the day Python is interacting with your computer, so there are a number of ways to look inside folders. Contrary to what many would believe, it's still possible to navigate a computer's directories without your mouse :) Let's start with a brief overview of the terminal/command prompt and some basic commands, then we'll switch over to Python and leverage the os package to do similar things.

A quick note: I'm working on a Linux operating system, so the terminal commands will work in Linux and Unix (i.e. Macs), but will not work in Windows. You can still do the same things, but the commands are slightly different. For example, to list the contents of a directory, in Linux I'd use ls, while in Windows you'd use dir. It's extremely easy to just Google the Windows equivalent of ls, etc.

The Terminal

There are many reasons why knowing the terminal is valuable.

  1. It helps us wrap our heads around the Python os package.
  2. If you ever move into the cloud to do your work, you'll need to know how to use the terminal to get things done on the cloud machine. In most instances it doesn't make sense to set up RDP (Remote Desktop) for a cloud machine, so the terminal becomes your new desktop.

The terminal is what people used to use before OS GUIs were invented. It looked something like this back in the day :)

img

Terminals always remind me of my favorite game from when I was a little kid, Oregon Trail, which on my parents' computer ran straight out of DOS.

img

Basics

Think about using the terminal the same way you'd navigate your directories via the GUI. If I have a folder with files in it and I double-click on it, I move into that folder. The same methodology holds true for the terminal if you use the cd command.

Important Commands:

  • cd (change directory)
  • pwd (print the working directory)
  • ls (list directory contents)
  • mkdir (make a new directory)
  • cp (copy a file from one directory to another)
  • .. (used with cd to move back up a directory)

Here is an example of all of them at work

img

So what did we do?

  1. we changed directories to a folder on our desktop titled "test"
  2. we printed the working directory
  3. we made a folder titled "additional_folder"
  4. we listed the files in the working directory, which now include the additional_folder directory and a csv file
  5. we navigated to the additional_folder directory
  6. we used .. to navigate back to the test directory
  7. we copied the test_data.csv file to the additional_folder directory
  8. we navigated to the additional_folder directory to confirm we had copied the file there.

Pretty easy, right?
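
If you're curious what those same eight steps look like from inside Python, here's a minimal sketch using the standard-library os and shutil modules (the folder and file names are just the hypothetical ones from the walkthrough above; we'll cover the os package properly in the next section):

import os
import shutil

os.chdir(os.path.expanduser("~/Desktop/test"))       # 1. change into the test folder on the desktop
print(os.getcwd())                                    # 2. print the working directory
os.mkdir("additional_folder")                         # 3. make a folder called additional_folder
print(os.listdir("."))                                # 4. list the contents of the working directory
os.chdir("additional_folder")                         # 5. move into additional_folder
os.chdir("..")                                        # 6. move back up to the test directory
shutil.copy("test_data.csv", "additional_folder/")    # 7. copy test_data.csv into additional_folder
print(os.listdir("additional_folder"))                # 8. confirm the file landed there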

Now that we have a sense of how to navigate the computer's directories via the terminal, let's switch over to Python properly and figure out how to do similar things there.

In [1]:
import pandas as pd
import numpy as np
from os import listdir
from os.path import isfile, join

OS

The os module in Python allows you to interact with your operating system from within Python. Why does this matter for loading multiple files? Because we'll want to see every file in the working directory in order to identify which files to load.

For this we'll use the following functions from os and os.path:

  • listdir
  • isfile
  • join

These are all extremely straightforward. Let's quickly look at them, starting by setting the path we want to work from.

In [2]:
mypath = "data/reviews/splits/"
In [3]:
len(listdir(mypath))
Out[3]:
215

So we can see here that there are 215 entries listed in this directory. Now let's use a list comprehension with isfile and join to pull the names of just the files out of the path into a list.

In [4]:
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
In [5]:
print("number of files in the directory: ",len(onlyfiles))
print("example of file names: ",onlyfiles[:5])
number of files in the directory:  215
example of file names:  ['171_df.csv', '133_df.csv', '96_df.csv', '188_df.csv', '128_df.csv']

We get 215 filenames, just as we saw above, and we can look at a few example file names.

In reality I just split several rather large dataframes into 215 smaller ones for this example, and I named them by number for simplicity.
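
As an aside, if you only care about files matching a particular pattern (here, the *_df.csv names), the standard-library glob module is a handy alternative to the listdir/isfile combination. This is just a sketch of that alternative, not what we use below:

import glob

# returns the matching paths (with the folder prefix included) in a single call
csv_paths = glob.glob(join(mypath, "*_df.csv"))
print(len(csv_paths))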

Now let's go ahead and load each of the 215 files. Because we know they all contain dataframes with the same columns, we can simply loop through the files, read each one into a pandas dataframe, and store the results in a list of dataframes.

In [6]:
data_frames = []
for i in onlyfiles:
    df = pd.read_csv(mypath+i)
    data_frames.append(df)
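
A small optional variation on the loop above (not something the original workflow does): tagging each dataframe with the file it came from makes it much easier to trace rows back to their source later. The source_file column name here is just a hypothetical choice.

data_frames = []
for f in onlyfiles:
    df = pd.read_csv(join(mypath, f))
    df["source_file"] = f   # hypothetical column, purely for traceability
    data_frames.append(df)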
In [7]:
len(data_frames)
Out[7]:
215

We see here that we have 215 dataframes.

In [8]:
len(data_frames[0])
Out[8]:
100

The first file has 100 rows. (In fact every split is 100 rows except the very last one, which is why the total below isn't an exact multiple of 100.)

The next step is to concatenate them into one large dataframe for analysis and then reset the index so we have a single, unique index to work with later.

In [9]:
full_df = pd.concat(data_frames)
full_df.reset_index(inplace=True)
In [10]:
len(full_df)
Out[10]:
21456
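
One thing worth knowing: reset_index keeps the old index around as a new column, which is likely where the extra level_0 column in the output below comes from. If you don't need to preserve the per-file indices at all, you can let concat build a clean index instead. A sketch of that alternative:

# build a fresh 0..n-1 index during the concat and discard the old per-file indices
full_df = pd.concat(data_frames, ignore_index=True)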

Now we have all of the separate files in one dataframe, which is what we wanted. But the files were added in no particular order, so let's say we need to put them in some sort of order. How about by date of review?

Let's ensure that the dates are actually of type datetime.

In [11]:
import datetime
In [12]:
full_df['date'] = pd.to_datetime(full_df['date'])
In [13]:
full_df = full_df.sort_values('date')
In [14]:
full_df.head(3)
Out[14]:
level_0 index date employee_title location employee_status review_title years_at_company helpful pros cons advice_to_mgmt rating_overall rating_balance rating_culture rating_career rating_comp rating_mgmt company
14496 96 1378 2008-07-22 Stock Clerk Fairlawn, OH Former Employee Working at Giant Eagle is not that "Giant NaN 0 flexible schedule, location, proximity from ho... no respect for employees, minimum wages for a ... When your employees quit their menial jobs, as... 2.0 3.0 2.5 2.5 2.5 NaN geagle
14493 93 1375 2008-09-22 Stock Clerk Pittsburgh, PA Former Employee good place to work NaN 0 Very good scheduling and management. They will... Sometime I felt like it was hard to work your ... remember the little people 4.0 4.0 4.0 3.5 3.5 NaN geagle
14495 95 1377 2008-09-24 Engineering Manager Pittsburgh, PA Current Employee If it wasn't close to home, I wouldn't be here. NaN 0 Local, privately held company. Big company, wi... The leadership talks about the right things bu... Live your values, stop talking about them. Fla... 2.0 4.0 2.0 2.5 2.0 NaN geagle
In [15]:
full_df.tail(3)
Out[15]:
level_0 index date employee_title location employee_status review_title years_at_company helpful pros cons advice_to_mgmt rating_overall rating_balance rating_culture rating_career rating_comp rating_mgmt company
18218 18 0 2019-01-26 Front End Team Leader Youngstown, OH Current Employee 9 Great Years of Employment! More than 8 years 0 flexibility, ability to change careers at any ... Too much of focus on wage/dollars % and not gu... Ease off on the "special projects" and let sto... 4.0 3.0 3.0 4.0 4.0 3.0 geagle
7663 63 54 2019-01-26 Cashier Jonesboro, GA Former Employee This isn't a real review I just wanted premium Less than a year 0 The job is really easy to get The pay isn't great and the work is extra Do your job better mate 3.0 3.0 3.0 5.0 3.0 3.0 mcd
17949 49 2 2019-01-26 NaN Madison, AL Current Employee Pharmacist More than a year 0 Kroger offers a family like feel with benefits... This is a VERY corporate company. There are so... Reduce the number of emails sent out and make ... 5.0 3.0 4.0 4.0 5.0 2.0 kr
In [16]:
full_df['date'].describe()
Out[16]:
count                   21456
unique                   1681
top       2018-08-20 00:00:00
freq                       87
first     2008-07-22 00:00:00
last      2019-01-26 00:00:00
Name: date, dtype: object
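
As a quick aside (not from the original notebook): if you want a programmatic confirmation that the sort actually worked, pandas Series expose an is_monotonic_increasing flag you can check.

# sanity check that the dataframe really is in chronological order
print(full_df['date'].is_monotonic_increasing)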

How did I separate all of the files?

I had to think through the easiest way to do this because there wasn't anything readily available online for splitting the files. I figured I'd leave my approach in here in case anyone ever needs to do the same thing. I'd also be interested to see how others have done it, because I'm fairly certain this isn't the most efficient approach.

In [17]:
lst = list(range(1,223))  # enough chunk indices to cover the whole dataframe in 100-row pieces
In [18]:
sub_frames = []
start = 0
end = 100
for i in lst:
    sub_frames.append(full_df[start:end])  # grab the next 100-row slice (slicing past the end just gives an empty frame)
    start += 100
    end += 100
In [19]:
index = 1
for i in sub_frames:
    # write each chunk out to its own csv, named 1_df.csv, 2_df.csv, ...
    i.to_csv('data/reviews/splits/'+str(index)+'_df.csv', index=False)
    index += 1
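
For what it's worth, here's a sketch of a slightly tidier way to do the same split, computing the chunk boundaries from the dataframe's length instead of hard-coding the number of chunks. The chunk_size and out_dir names are my own; the output folder matches the one used above.

import os

chunk_size = 100
out_dir = "data/reviews/splits/"
os.makedirs(out_dir, exist_ok=True)   # create the output folder if it doesn't exist yet

# step through the dataframe in 100-row windows; the final chunk just ends up shorter
for n, start in enumerate(range(0, len(full_df), chunk_size), start=1):
    chunk = full_df.iloc[start:start + chunk_size]
    chunk.to_csv(os.path.join(out_dir, f"{n}_df.csv"), index=False)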