What’s your stack?
In computing, a solution stack, also called software stack and tech stack is a set of software subsystems or components needed to create a complete platform such that no additional software is needed to support applications. Applications are said to “run on” or “run on top of” the resulting platform. (Wikipedia)
I was in a virtual meeting where the guest was a marketing analyst at a toy company and someone asked him, “What’s your stack?” Someone else asked in the chat, “What’s a stack?” It made me think of how many of us who call ourselves “data scientists” are not actually computer scientists or engineers, and have to borrow terms from those fields.
For me, data science started back in 2006 or 2007 at a think tank where I used to work, with the need to analyze data in an environment where the predominant data storage, analysis, and presentation tool was the Excel worksheet. Then one day they sent us to a workshop on Data Analysis for Public Policy with an Argentinian guy in charge of a team that did all the World Bank household data processing for Latin America. Their weapon of choice was Stata, and that’s where I learned concepts that still put food on my table today, such as variables, functions, recursion, data types, and formats. He was, and still is, a world-class economist, but these were programming terms, not economic theory. I was hooked.
I became good at the Stata + Excel + Word/PowerPoint combo and I continued to use mainly Excel for everything. I only used Stata whenever household surveys were involved. I was sort of the “Household Survey” guy at that think tank. Around that time, I became interested in Linux too (first Debian and then Kubuntu) and R was one of those things that you had to try out. All you had to do was sudo apt install r-base and that was it. No licenses or anything. However, it felt very archaic and I was still a Stata guy, until I left the think tank and had to pay for my first Stata license myself. Ouch!
I dabbled in R for some time, never really taking it seriously, until one day my master’s thesis supervisor told me I had to deliver the analysis of huge World Input-Output matrices covering decades’ worth of information in about a week. Otherwise, I would have to wait for him to come back from an almost month-long trip. I bought a book by Robert Kabacoff called “R in Action” (first edition at the time), did all the exercises of the first few chapters overnight on a Thursday, and by Monday the analysis was in my professor’s inbox. R saved the day, or the year, I should say.
That’s when “my stack” really changed. I started using R for everything, only exporting to Excel at the last second, just so I could collaborate with colleagues and clients. The Word part has been more persistent, though. I have been trying to step away from it ever since I found out about Pandoc (GitHub says my first Pandoc site was done in 2015, ten years ago). However, Word is so ingrained in corporate culture that it has been nearly impossible to ditch.
My datasets are never so big that they don’t fit in memory, but out of curiosity I have dabbled in databases. Especially when working with geographic data, I found it so much easier to use SQL queries with PostGIS and PostgreSQL. I have also used SQLite and lately DuckDB for my more complex projects (such as a global emissions database for CGE studies), but a database remains largely overkill for most of my pipelines and applications. Normally, it’s just easier for me to save things as an R dataset (RDS), which is compressed by default and, in my experience, beats most other formats I have tried on file size, giving me files that are small enough to push to GitHub.
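To give a flavor of what the database route looks like, here is a minimal sketch in Python using only the standard library’s sqlite3 module. The table name, column names, and numbers are all made up for illustration; none of them come from a real project of mine.

```python
import sqlite3

# Hypothetical records (country, sector, emissions), invented for illustration.
rows = [
    ("ARG", "energy", 120.5),
    ("ARG", "transport", 80.0),
    ("BRA", "energy", 300.25),
    ("BRA", "transport", 150.75),
]


def total_by_country(records):
    """Load records into an in-memory SQLite database and sum emissions per country."""
    con = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema: one row per country and sector.
        con.execute(
            "CREATE TABLE emissions (country TEXT, sector TEXT, mt_co2 REAL)"
        )
        con.executemany("INSERT INTO emissions VALUES (?, ?, ?)", records)
        query = """
            SELECT country, SUM(mt_co2) AS total
            FROM emissions
            GROUP BY country
            ORDER BY country
        """
        # The aggregation happens inside the database engine, not in Python.
        return con.execute(query).fetchall()
    finally:
        con.close()


totals = total_by_country(rows)
print(totals)
```

For a table this small the database is clearly overkill, which is the point made above: the SQL route only starts to pay off when the queries get complicated or the data stops fitting comfortably in a data frame.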
Recently, the good guys at Posit created something called Quarto, which sits on top of Pandoc and makes it so much easier to mix prose and code in documents that can be exported to HTML, PDF, Word, and other formats. I’ve been using it in combination with GitHub Pages, as a way of documenting my process, making a GitHub/Quarto repository out of each R or Python project. As a matter of fact, this website was created using Quarto.
My current “stack” is just a Linux desktop running Linux Mint and a laptop running Fedora for work on the go. I use GitHub as a “cloud” repository provider. I code and write prose in Positron, where I use Quarto to document code and reports. I still keep data in either RDS, CSV, or their original SPSS format, and I still use Microsoft Office to collaborate with colleagues and clients.
As a one-man operation, I make do with everything inside of one machine, but I am curious about infrastructure these days. Is it worth using the cloud, servers, containers, databases? I don’t know, but that might be the next step to be able to grow in this field.
Any typos you find here are actually on purpose.