Over the past six months several people in the charity sector have asked me how I got started with R. The question that usually comes with that is: why R over Python?
Some background
One of the key approaches to my work in data analytics is that I strive to make my life as easy as possible. I first tried to learn R about 10 years ago. I found it exceptionally difficult to find any good training materials: help websites like Stack Overflow didn’t exist, Massive Open Online Courses (MOOCs) didn’t exist, let alone offer courses for software like R, and RStudio wouldn’t have its first release for another four years. In my day-to-day job at the time I was using SAS an awful lot, as that was what one of my key clients was using, but I was looking for something with similar power that would allow me to deliver streamlined analysis (i.e. not point and click in Excel) for our clients who couldn’t afford the exorbitant pricing of SAS.
At the time it wasn’t to be (for me at least). Step forward 7 years and all of the things I just mentioned were now in existence. Several lecturers from Johns Hopkins University had created a MOOC on Coursera, Stack Overflow was a thriving community of people asking and answering technical questions (including questions about R), and data analytics was becoming an increasingly important area for businesses to understand and focus on.
So why did I finally start with R? I was working for a company that ran an industry-wide analysis program which, at the time, was analysing about 200GB of data from 70 organisations (still not big data by many standards). The scale of the data wasn’t the issue, though, as we were doing the bulk of the analysis using database tools.
The struggle I had was how we took our summarised data and prepared it for presentation. Each organisation participating at a particular level would receive a detailed presentation of the findings and what they meant for their organisation. The first year I did this, it was done with Excel spreadsheets with charts linked to a PowerPoint deck. I won’t go into all the challenges and frustrations, other than to say it took at least 2 hours per presentation and there would still be mistakes that crept into the final document. It was far from ideal.
The Motivation
What got me motivated was the desire not to have to spend two solid weeks of my life preparing those reports in that manner again. Basically, I had an immediate real world problem I needed to solve. I needed to produce 40-50 custom reports based on aggregated data and organisation specific data. The reports would have the same structure, and typically all have the same types of data, but I was looking to create a repeatable process that a computer could do without my intervention. Have I mentioned I like things easy?
Being aware of R and its focus on statistics and analytics, I looked for some good materials to get myself started. I came across the Data Science specialisation from Johns Hopkins University on Coursera, which was where I started. Much of the early material was familiar, but the introduction to R was a great way to start learning the language. It taught me the basics and allowed me to start solving my real world problem.
TL;DR: Getting Started
If you’re entirely new to the R language there are now a number of great places to start. I would still recommend looking at a MOOC or some other online training course.
Courses
- Coursera Data Science Specialisation - When I did this, Coursera allowed you to enrol for free and gave the option of a paid certificate if you wanted one. It is still possible to get access to the materials if you ‘Audit’ the course. If you’re looking for some professional development I would suggest paying; if you’ve got a tight budget and need to learn a new skill, you could consider auditing the course instead.
- DataCamp - They have a number of free and paid courses relating to R, including introductory and intermediate courses as well as specialist areas.
- Code School - They have an introductory course to R.
Getting help
The best place I have found to get help is on Stack Overflow. I usually start with a Google search of "[R] my problem here". While Google is good, it helps to have the R in square brackets as it is more likely to get recognised as a tag in help sites rather than just an alphabet character.
Usually several of the results will come from Stack Overflow and are likely to address your immediate need. If that doesn’t give you quite what you are after, post a question on Stack Overflow.
Explaining your problem to someone can be difficult at times, but it always helps to include an example of what you’ve tried as well as some data to work with. In the process of explaining or showing what you’ve done already you should create a reproducible example (reprex) or a minimal working example (MWE). The term reprex is more widely used, and R even has a package now to help you create one. Some examples of questions I’ve asked with reprexes are:
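For anyone who hasn’t seen one before, here is a made-up sketch of the shape a reprex usually takes (it is not one of my Stack Overflow questions). The `donations` data frame and its columns are invented for illustration, and it only uses base R so anyone can run it as is.

```r
# A made-up reprex: everything needed to reproduce the problem is in the script.

# 1. A tiny, self-contained data set (invented for illustration).
donations <- data.frame(
  region = c("North", "South", "North", "South"),
  amount = c(50, 20, NA, 30)
)

# 2. What I tried: the mean donation comes back as NA.
attempt <- mean(donations$amount)
print(attempt)

# 3. What I expected: the mean of the three known amounts.
expected <- (50 + 20 + 30) / 3
print(expected)

# dput() prints the data as R code that can be pasted straight into a question.
dput(donations)
```

If you have the reprex package installed, copying a script like this and running `reprex::reprex()` should render the code together with its output, ready to paste into a question.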
Reading
There are a few books you might consider familiarising yourself with early on as you establish your skills. The first two come with examples and questions you can use to develop your skills as you read.
The community
One of the reasons I stuck with R rather than switching languages is the community. There is a great community of people using R who I’ve found to be quite open and supportive.
The following are a few people you might wish to follow on Twitter. You can also search the #rstats tag every now and then.
If you’ve got any questions as you’re starting your R journey feel free to get in touch. I’d be more than happy to help.
So finally, did I manage to solve my real world problem? Yes, I did. Through the use of a number of great packages I was able to draw data out of the database, create 200-300 charts and push them all into client-specific PowerPoint presentations. I was also able to build a script that looped through all the necessary clients. That took me from 2 hours of human time per presentation to about 5 minutes of computational time per report, with almost zero errors; the errors that remained were usually down to poor quality client data.
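To give a feel for the looping idea, here is a minimal sketch in base R. It is not the script I actually used: the `results` data, the `make_client_chart()` helper and the file names are all hypothetical, the data is typed in rather than pulled from a database, and it writes one PDF chart per client instead of building a PowerPoint deck (for that step you would need an extra package, officer being one option).

```r
# Hypothetical summarised data: one row per client per metric.
results <- data.frame(
  client = rep(c("Org A", "Org B", "Org C"), each = 3),
  metric = rep(c("Donors", "Gifts", "Income"), times = 3),
  value  = c(120, 340, 5600, 80, 150, 2100, 200, 610, 9800)
)

# Hypothetical helper: draw one chart for one client and return the file path.
make_client_chart <- function(client_name, data, out_dir) {
  client_data <- data[data$client == client_name, ]
  # Turn the client name into a safe file name, e.g. "Org A" -> "Org_A.pdf".
  file_name <- paste0(gsub("[^A-Za-z0-9]+", "_", client_name), ".pdf")
  out_file <- file.path(out_dir, file_name)
  pdf(out_file)
  on.exit(dev.off(), add = TRUE)  # always close the graphics device
  barplot(client_data$value, names.arg = client_data$metric, main = client_name)
  out_file
}

# Write the charts to a temporary folder so the sketch cleans up after itself.
out_dir <- file.path(tempdir(), "client_reports")
dir.create(out_dir, showWarnings = FALSE)

# Loop through every client without any manual intervention.
clients <- unique(results$client)
report_files <- vapply(clients, make_client_chart, character(1),
                       data = results, out_dir = out_dir)
print(report_files)
```

The point of the design is that the per-client work lives in one function, so adding a client means adding rows of data rather than editing another spreadsheet and slide deck by hand.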