Version Control for Analysis Projects
Real scenario:
“I broke my analysis script and need to go back to yesterday’s version”
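A minimal sketch of how that recovery could look once the script is under version control. The file name, commit messages, and demo identity are placeholders invented for illustration, and the whole thing runs in a throwaway directory.

```shell
# Work in a throwaway directory so nothing real is touched
cd "$(mktemp -d)"
git init --quiet
# Placeholder identity, set only for this demo repository
git config user.name "Demo User"
git config user.email "demo@example.com"

# "Yesterday": a working script is committed
echo 'print("mean = 4.2")' > analysis.py
git add analysis.py
git commit --quiet -m "Add working analysis script"

# "Today": a breaking change is committed
echo 'print("mean = " 4.2' > analysis.py
git commit --quiet -am "Refactor output formatting"

# List the history, then bring back the file as it was one commit ago
git log --oneline
git checkout HEAD~1 -- analysis.py
cat analysis.py
```

The restored file is staged but not yet committed, so you can inspect it before deciding to keep it.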
Key distinction: Git = version control tool on your computer (local), GitHub = hosting service for Git repositories (remote)
✅ DO track:
❌ DON’T track:
Repository (repo): A project folder that Git tracks
Commit: Snapshot of your work at a point in time
Branch: Separate line of development (we’ll keep it simple today)
1. Modify files
2. Stage changes (git add)
3. Commit changes (git commit)
4. Repeat!
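The four steps above can be sketched as one short session. The file name and commit messages are hypothetical, and the demo identity is a placeholder so the commits work on a machine where Git has not been configured yet.

```shell
cd "$(mktemp -d)"
git init --quiet
git config user.name "Demo User"
git config user.email "demo@example.com"

# 1. Modify files
echo "x <- c(1, 2, 3)" > clean_data.R

# 2. Stage changes
git status --short      # shows "?? clean_data.R" (untracked)
git add clean_data.R
git status --short      # shows "A  clean_data.R" (staged)

# 3. Commit changes
git commit --quiet -m "Add data cleaning script"

# 4. Repeat: edit, review the change, stage, commit
echo "x <- x[!is.na(x)]" >> clean_data.R
git diff                # unstaged change since the last commit
git add clean_data.R
git commit --quiet -m "Drop missing values before analysis"
git log --oneline
```

Running git status and git diff before each git add is optional, but it is a cheap way to confirm that only the intended changes go into the commit.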
Creating a simple analysis repository
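One way this could look from the command line is sketched below. The project name, file contents, and demo identity are all assumptions made for illustration, not part of the original demo.

```shell
# Hypothetical project name; any folder name works
project="$(mktemp -d)/penguin-analysis"
mkdir -p "$project" && cd "$project"

# Turn the folder into a Git repository
git init --quiet
git config user.name "Demo User"
git config user.email "demo@example.com"

# Starter files: description, script, dependencies
printf '# Penguin Analysis\n\nExploratory analysis of penguin measurements.\n' > README.md
echo 'print("analysis goes here")' > analysis.py
echo 'pandas' > requirements.txt

git add README.md analysis.py requirements.txt
git commit --quiet -m "Initial commit: README, analysis script, requirements"

# Name the branch "main", the branch name GitHub uses by default
git branch -M main
git log --oneline
```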
Two new commands:
# Connect to GitHub (one time)
git remote add origin https://github.com/username/repo.git
# Push your commits
git push -u origin main
Now anyone with access to the repo can see and use your analysis!
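To try the two commands without a GitHub account, the sketch below uses a bare repository on the local disk as a stand-in for the GitHub URL. That substitution, the folder names, and the demo identity are assumptions for illustration; with a real GitHub repository only the URL changes.

```shell
work="$(mktemp -d)"

# Stand-in for GitHub: an empty bare repository on the local disk
git init --quiet --bare "$work/remote.git"

# A local repository with one commit on a branch named "main"
mkdir "$work/analysis" && cd "$work/analysis"
git init --quiet
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# My analysis" > README.md
git add README.md
git commit --quiet -m "Add README"
git branch -M main

# Connect (one time), then push; -u remembers origin/main as the upstream
git remote add origin "$work/remote.git"
git push --quiet -u origin main

git remote -v       # lists the fetch and push URLs for origin
git status -sb      # first line shows "## main...origin/main"
```

Because -u records the upstream branch, later pushes from this branch need only git push.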
Essential files:
- Scripts (.py, .R)
- requirements.txt or renv.lock
- README.md with project description
Use .gitignore for:
- Data files (*.csv, *.parquet) - usually too large
- Credentials (.env, API keys)
- Virtual environments (venv/, renv/)
- Notebook checkpoints (.ipynb_checkpoints/)
Create .gitignore in your repo:
# Data
*.csv
*.parquet
data/
# Credentials
.env
secrets.json
# Outputs
*.html
figures/
# Python
__pycache__/
venv/
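To confirm that the rules behave as intended, Git can report which pattern matches a given path. The sketch below uses a shortened version of the file above and a few hypothetical project files.

```shell
cd "$(mktemp -d)"
git init --quiet

# A shortened version of the .gitignore above
printf '%s\n' '*.csv' 'data/' '.env' 'figures/' '__pycache__/' 'venv/' > .gitignore

# Hypothetical project files
mkdir data figures
echo "id,value" > data/raw.csv
echo "id,value" > results.csv
echo "API_KEY=not-a-real-key" > .env
echo 'print("hello")' > analysis.py

# Ask Git which rule, if any, matches each path
git check-ignore -v results.csv .env data/raw.csv

# Only files that are not ignored show up as untracked
git status --short
```

Note that .gitignore only affects untracked files. A file that was committed before the rule was added stays tracked until it is removed from the index with git rm --cached.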
Challenge: Notebooks contain outputs and metadata
Tips:
- Use nbstripout to automate clearing notebook outputs

- Initialize a repository (git init)
Bonus: Add a .gitignore file
Commit early, commit often, write clear messages
Good commit message: "Add data cleaning function for missing values"
Bad commit message: "updates" or "fixed stuff"
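A short sketch of how a clear message pays off in the history. The file, the function, and the reason given in the message body are hypothetical.

```shell
cd "$(mktemp -d)"
git init --quiet
git config user.name "Demo User"
git config user.email "demo@example.com"

# Hypothetical cleaning function in a hypothetical file
printf 'def drop_missing(rows):\n    return [r for r in rows if r is not None]\n' > cleaning.py
git add cleaning.py

# Subject line says what changed; the optional second -m adds the why
git commit --quiet \
  -m "Add data cleaning function for missing values" \
  -m "Rows with missing entries broke the summary step, so drop them first."

# One line per commit: clear subjects make this list easy to scan
git log --oneline
```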
Remember: Everyone struggles with Git at first. It gets easier with practice!