Basics of Git for a Social/Data Scientist

When I started working, I was often asked to run analyses on our data, so I did what I thought everyone else did. You fire up SPSS, you click through the menus, you get your output, then you paste it into Excel, right? I quickly realized that I was getting the same types of requests fairly often, and it became cumbersome to click through the entire process each time, sometimes making very small tweaks when slightly different questions came back.

This is when I discovered the power of syntax within SPSS. I could paste the syntax once and then save it. I just needed to point it to the right dataset and make small tweaks in the syntax file each time a new request came in. I realized it could save me a lot of time. Over time, though, you start making edits to the script, and each edit means saving yet another copy of the file.

Eventually, you'd have a folder with 30 different variations of the syntax. Then, if you had to share them with your colleagues at work, you either emailed them or saved them all to a shared drive. Which ones had you already uploaded, and which ones had you not? Had you made changes to certain files that you didn't remember? Do you just re-save the entire directory each time?

Long story short: keeping track of all the versions of your syntax required a lot of organization.

Enter Git. Git is a version control system that keeps track of the changes you make to the files in a project.

version_control_joke

I am by no means an expert in Git, and there are plenty of resources out there that explain all of the amazing stuff you can do, like creating branches off of the master branch to work on specific features. This is especially important if multiple people will be updating the files at the same time. But for the purposes of this post I will just share the basics, like how to use Git to quickly send your files to GitHub, which now has free private repos.

So, this article will walk you through:

  1. Setting up a github account
  2. Making your first repository
  3. Cloning the repository
  4. Creating a small jupyter notebook script
  5. Using git to make a commit and pushing the updates to github

I think this provides the basics of using git and is something that is valuable to all social scientists.

1. Setting up a Github account

When you go to https://github.com/ you will be immediately asked if you want to start a github account. You can enter a username, your email and a password and you are ready to go.

2. Making your first repository

  • Under your user on the top right of the page you will see a dropdown menu. One of the options is Your repositories. Let's click there.
  • After that you will see a New option next to the Find a repository search bar. Click there.
  • Let's give our repo a name; for the purposes of this I'll call it test-repository. I'll give it a description of "This is a test", I'll set it to private, and I'll check the box that says initialize a README. See an image of it below.

repo

Congratulations!! You now have your first github repo!

3. Clone your repository

This simply means copying your repository to your local machine (you could also clone it to a cloud machine if you had a live instance). The great thing about GitHub is that you can clone anyone's public repository. I'd typically recommend forking it first, so that you can use this same process to make any changes to it. Here's a quick article on the differences between cloning and forking for those interested.

To clone your repo, click the green Clone or download button; pressing the little clipboard icon copies the URL for you.

clone_repo

Now we can clone our repo. One quick note is that if you followed my earlier tutorial on setting up a python environment and you downloaded miniconda or anaconda, it should have come with git. If you did not, you may need to download git. You can follow the instructions for your specific OS here.

Let's open up the bash shell/terminal and navigate to where we want the repository to go. I'm going to go to my desktop and the Projects folder. Then we type git clone and paste the link we copied above.

clone_bash_1

Now, because I made this a private repo, it will ask me for my username and password (note that GitHub no longer accepts your account password here; you enter a personal access token instead). If you had made it public, it would just clone it for you automatically.

I found a good resource for translating between Unix and Windows command-line commands. It might be useful to keep handy as you work through the shell steps with me, since I will be using Unix commands; if you are on Windows, your commands might be slightly different.

For example, I want to use the ls command now to view the directories in the Projects folder and make sure we did in fact clone the repo. If you are on Windows, according to that resource, you will need to use dir instead.

clone_bash_2

Now as you can see we do have a directory/folder titled test-repository and if you navigate to that repo you can see we have a README.md, which is exactly what we had in our github repo. So now you have the repository locally.
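If you'd like to rehearse this step without touching GitHub, here is a minimal sketch of the same clone-and-list sequence. It rests on a few assumptions of mine: a temporary folder stands in for the Projects folder, and a local bare repository stands in for the GitHub URL, so it runs without a network connection or a login. The names sandbox and seed and the user details are placeholders I made up.

```shell
set -e

# Assumption: a throwaway sandbox replaces Desktop/Projects and GitHub.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/remote" "$sandbox/Projects"
remote="$sandbox/remote/test-repository.git"

# Build the stand-in "GitHub" repo and give it a README, which is roughly
# what ticking the "initialize a README" box does for you.
git init --quiet --bare "$remote"
git --git-dir="$remote" symbolic-ref HEAD refs/heads/master
seed="$sandbox/seed"
git init --quiet "$seed"
cd "$seed"
git symbolic-ref HEAD refs/heads/master      # name the first branch master
git config user.name "Example User"          # placeholder identity
git config user.email "user@example.com"
echo "This is a test" > README.md
git add README.md
git commit --quiet -m "Initial commit"
git push --quiet "$remote" master

# The part that mirrors the tutorial: go to Projects, clone, then list.
cd "$sandbox/Projects"
git clone --quiet "$remote"   # with GitHub, paste the URL you copied instead
ls                            # should list test-repository
ls test-repository            # should list README.md
```

With a real repo, the only lines you need are the last four: the cd, the git clone with the URL you copied, and the two ls commands.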

4. Now let's create a simple Jupyter Notebook so we can make our first commit and push it to github.

  • Open jupyter, navigate to the repo folder and open up a notebook

I wrote a basic function and tested it with a couple of unit tests. I saved it as First_commit.ipynb.

Hey... I can do AI!!!

first_commit

5. Now let's do our first commit

  • After you've written your function and saved your .ipynb file, we can go back to the bash shell
  • Make sure you are still in the repo directory; if you use the command to list directories you should see your .ipynb file
  • First let's do a git status to see that we have untracked files present
  • We can then use the command git add . which stages all of the new and changed files in the current directory
  • To do your first commit, type the command git commit -m "This is my first commit to this repo" where the -m flag lets you attach a message describing the commit

commit
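Typed out, the status, add and commit steps look roughly like the sketch below. To keep it safe to run anywhere, I'm assuming a fresh temporary repository instead of the cloned one, a placeholder file in place of the real notebook, and a made-up user name and email (git needs some identity before it will commit).

```shell
set -e

# Assumption: a throwaway repo stands in for the cloned test-repository.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# Placeholder for the notebook saved from Jupyter (not a real notebook).
echo '{}' > First_commit.ipynb

git status        # the notebook shows up under untracked files
git add .         # stage everything new or changed in this directory
git status        # the notebook is now listed as a change to be committed
git commit -m "This is my first commit to this repo"

git --no-pager log --oneline   # one line per commit, newest first
```

Running git status before and after git add is a cheap way to see what is about to go into the commit.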

So now you've just made your first commit, but in order to get it to github you need to do one more thing.

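  • git push -u origin master will push it to the master branch on origin, our GitHub remote (master is the only branch we currently have); the -u flag also sets that branch as the upstream, so later pushes know where to go. On newer GitHub repos the default branch is named main, so use main in place of master there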

push
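Here is a small sketch of the push step that you can run offline. The assumptions are mine: a local bare repository plays the role of GitHub, the remote is added by hand (a real clone already has origin set up), and the file and user details are placeholders.

```shell
set -e

sandbox=$(mktemp -d)

# Assumption: this bare repo stands in for the GitHub remote.
git init --quiet --bare "$sandbox/remote.git"
git --git-dir="$sandbox/remote.git" symbolic-ref HEAD refs/heads/master

# A local repo with one commit, standing in for the cloned test-repository.
git init --quiet "$sandbox/test-repository"
cd "$sandbox/test-repository"
git symbolic-ref HEAD refs/heads/master      # name the first branch master
git config user.name "Example User"          # placeholder identity
git config user.email "user@example.com"
git remote add origin "$sandbox/remote.git"  # a real clone has this already
echo '{}' > First_commit.ipynb               # placeholder notebook
git add .
git commit --quiet -m "This is my first commit to this repo"

# Push master to origin and remember origin/master as its upstream.
git push --quiet -u origin master

git status -sb    # the first line shows the branch and the remote it tracks
```

Because -u records the upstream, a plain git push from this branch is enough for later commits.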

Now let's go check out our GitHub repo!!!

And look what we have! We have our notebook in our repo, with my most recent commit message next to my username.

github

The TLDR version:

Now you have successfully leveraged Git. Moving forward, if you structure all of your projects like this, it's as simple as the following four commands in the bash shell after navigating to the directory.

  1. git status
  2. git add .
  3. git commit -m "notes about the commit"
  4. git push -u origin master

I hope I communicated how important this is. Even if your company does not use GitHub, the first three commands still work on your local machine, and there are ways to go back in history to look at previous versions and the changes that have been made to specific files.

Also, this is a very basic use of Git; it has a ton more features.
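As a taste of what going back in history looks like, here is a minimal sketch that makes two commits to one file and then inspects them. The temporary repo, the file name analysis.txt, its contents and the user details are all placeholders I made up for illustration.

```shell
set -e

# Assumption: a throwaway repo with two versions of one file.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "first draft of the analysis" > analysis.txt
git add analysis.txt
git commit --quiet -m "Add first draft"

echo "second draft of the analysis" > analysis.txt
git add analysis.txt
git commit --quiet -m "Revise the analysis"

# Every commit that touched the file, newest first.
git --no-pager log --oneline -- analysis.txt

# What changed in the file between the previous commit and the latest one.
git --no-pager diff HEAD~1 HEAD -- analysis.txt

# The file exactly as it was one commit ago.
git --no-pager show HEAD~1:analysis.txt
```

None of this needs GitHub; the whole history lives in the repository folder on your machine.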

Additional Resources:

You are well on your way to becoming a technologically savvy social scientist!!