This is my second blog post for the data management and analytics module. My assignment is to write a blog post about R, a language and environment for statistical computing and graphics. I am totally new to this, so please bear with me; I hope it won’t be too boring. The first thing I did was complete the Try R course from Code School. I have to be honest: it wasn’t too difficult, because every time I got stuck I just clicked on the answer that was already provided at the beginning of each exercise. I know this is a pretty lazy approach, but I knew very little about R and I knew I had to take my time and start learning from the beginning. After I finished the course I started to search online for help. I quickly learned that there are plenty of courses, articles and videos about R online.
So here is the completed course that was required by our teacher.
First, let’s take a look at what R is and what it is used for.
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS (https://www.r-project.org). It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.
The R environment
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes
- an effective data handling and storage facility,
- a suite of operators for calculations on arrays, in particular matrices,
- a large, coherent, integrated collection of intermediate tools for data analysis,
- graphical facilities for data analysis and display either on-screen or on hardcopy, and
- a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities (https://www.r-project.org/about.html).
Back to exploring R in a more practical way.
The first free course I found online was on DataCamp. The course is designed for complete beginners; it takes you from the basics through to an advanced level, and you can do all this at your own pace. I decided to give it a go.
As a complete beginner, I started with Introduction to R.
Each chapter contains many exercises, and the program only allows you to progress to the next level if you manage to answer each exercise correctly. But don’t worry: if you get stuck you get help, and if you really can’t finish the exercise you can even see the result. Guilty :(
In the introduction course I learned how to use the console as a calculator and how to assign variables. Variables are used to store a value or an object. By calling the name of a variable we can access the value or object that is stored in it.
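To make this concrete, here is a small sketch of the kind of thing the first chapter covers. The variable names are my own illustrations, not taken from the course.

```r
# Use the console as a calculator
5 + 3     # addition
10 / 4    # division
2^3       # exponentiation
17 %% 5   # modulo (remainder after division)

# Store values in variables with the assignment operator <-
apples  <- 5
oranges <- 6

# Calling a variable by its name returns the stored value
apples

# Variables can be combined to create new variables
fruit <- apples + oranges
fruit
```

Typing a variable name on its own line prints its value in the console.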
The first chapter was pretty easy, with many more to go.
The rest of the chapters covered vectors, matrices, factors, data frames and lists. By the third chapter the difficulty level rises, but there is plenty of help if you get stuck, and if you really can’t finish an exercise you can even see the result. Guilty again :(
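As a rough reminder of what each of these structures looks like, here is a minimal sketch in base R. All the names and values are made up for illustration.

```r
# Vector: a one-dimensional collection of values of the same type
ages <- c(22, 38, 26)
names(ages) <- c("Ann", "Ben", "Cara")

# Matrix: values of one type arranged in rows and columns
m <- matrix(1:6, nrow = 2, byrow = TRUE)

# Factor: a categorical variable with a fixed set of levels
sex <- factor(c("female", "male", "female"))

# Data frame: a table whose columns can have different types
people <- data.frame(name = names(ages), age = ages, sex = sex)

# List: a container that can hold objects of any kind
bundle <- list(ages = ages, grid = m, table = people)

str(people)          # compact overview of the data frame
bundle$grid[2, 3]    # element in row 2, column 3 of the matrix
```

The `$` operator picks an element of a list (or a column of a data frame) by name, and square brackets select by position.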
The estimated time to complete this course is four hours. I will be honest: I spent a good six hours finishing it, with a few minutes’ break between chapters.
How The Titanic Sank
Creating a graph in R
The dataset I chose for this blog is the Titanic Dataset. The script for this exercise is available on https://www.kaggle.com/mrisdal/titanic/exploring-survival-on-the-titanic.
The first thing I did was install the library packages and then load them into RStudio.
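Installing and loading can be wrapped in a small helper so that a package is only downloaded when it is missing. This is a sketch only: `load_or_install` is a hypothetical helper name, and the package names in the comments are assumptions, since the exact packages depend on the script you follow.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# Example with a package that ships with R, so nothing is downloaded
load_or_install("stats")

# For a project like this one the calls might look as follows
# (package names are assumptions, not taken from the post):
# load_or_install("ggplot2")
# load_or_install("dplyr")
```

In RStudio the same result can be reached by running `install.packages()` once and then calling `library()` at the top of each script.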
I downloaded my sample data in Excel and saved it in my folders in CSV (comma-separated values) format. Using the read.csv command I imported the data into R and loaded it into a data frame.
The resulting data frame is what we can work with: 1309 observations of 12 variables, such as the passenger’s name, sex, age, fare, ticket, etc.
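Here is a minimal sketch of the import step. So that it runs anywhere, it first writes a tiny CSV to a temporary file; the three rows are invented placeholders, not real passenger records, and the column names are assumed to follow the Kaggle Titanic files.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "PassengerId,Survived,Name,Sex,Age,SibSp,Parch",
  "1,0,\"Doe, Mr. John\",male,30,1,0",
  "2,1,\"Doe, Mrs. Jane\",female,28,1,0",
  "3,1,\"Roe, Miss. Mary\",female,4,0,2"
), csv_path)

# Import the file into a data frame; keep text columns as plain character
full <- read.csv(csv_path, stringsAsFactors = FALSE)

str(full)   # column types and first values
dim(full)   # number of observations and number of variables
```

For a real file you would pass its own path to `read.csv()` instead of `csv_path`.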
The relationship between family size and survival
The passenger name variable can be broken down into surname, title and sex. We can use this information to represent families.
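One way to split the name is with regular expressions. The sketch below assumes names are laid out as "Surname, Title. First name" and uses made-up names; it shows how the title and surname could be pulled out, not necessarily the exact code of the script I followed.

```r
# Made-up names in the "Surname, Title. First name" layout
name <- c("Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Mary")

# Title: remove everything up to ", " and everything from the first "." onwards
title <- gsub("(.*, )|(\\..*)", "", name)

# Surname: keep only the text before the first comma
surname <- sub(",.*", "", name)

data.frame(name, surname, title)
```

Passengers who share a surname are candidates for belonging to the same family, and the title (Mr, Mrs, Miss) hints at the passenger's sex.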
After splitting the passenger’s name into new variables, we created a new variable called family size, based on the number of siblings/spouse(s) and the number of children/parents.
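A family size variable could be built roughly like this. The column names `SibSp` and `Parch` are assumed to follow the Kaggle data, the values are made up, and adding 1 to count the passenger themselves is my assumption about how the size is defined.

```r
# Made-up counts of siblings/spouses (SibSp) and parents/children (Parch)
full <- data.frame(
  SibSp = c(1, 1, 0, 4),
  Parch = c(0, 0, 2, 2)
)

# Family size = siblings/spouses + parents/children + the passenger themselves
full$Fsize <- full$SibSp + full$Parch + 1

full
```

With this definition a passenger travelling alone has a family size of 1.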
Now we can plot the data to see the relationship between survival and family size.
The red columns represent the people who did not survive and the blue columns those who did. The survival rate for single people wasn’t good, families with 2, 3 or 4 members had a better chance of surviving, and families larger than 4 also had little chance of surviving.
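A chart of this kind can be sketched with base R graphics, which I use here instead of an add-on plotting package so that nothing extra needs installing. The data below is a made-up toy sample, so the bars will not match the real chart.

```r
# Made-up toy data: one row per passenger
toy <- data.frame(
  Fsize    = c(1, 1, 1, 2, 2, 3, 3, 4, 5, 6),
  Survived = c(0, 0, 1, 1, 0, 1, 1, 1, 0, 0)
)

# Count passengers for every combination of outcome and family size
counts <- table(Survived = toy$Survived, Fsize = toy$Fsize)

# Grouped bar chart: red = did not survive, blue = survived
barplot(counts, beside = TRUE, col = c("red", "blue"),
        xlab = "Family size", ylab = "Number of passengers",
        legend.text = c("Did not survive", "Survived"))
```

`beside = TRUE` draws the two outcomes next to each other for every family size instead of stacking them.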
Survival by age and gender
First of all, if you were a man, you were outta luck. The overall survival rate for men was 20%. For women, it was 74%, and for children, 52%. Yes, it was indeed “women and children first.”
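Rates like these can be computed in a couple of lines. The sketch below uses the `Titanic` table that ships with R rather than the Kaggle file; it is a different tabulation (it includes the crew, for example), so the percentages it prints will not exactly match the figures quoted above.

```r
# Titanic is a 4-way table built into R: Class x Sex x Age x Survived
by_sex <- margin.table(Titanic, c(2, 4))   # collapse to Sex x Survived

# Row-wise proportions: share of each sex that did or did not survive
rates <- prop.table(by_sex, 1)

round(100 * rates, 1)   # as percentages
```

The second argument of `prop.table()` chooses the direction: 1 means each row (each sex) sums to 100%.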
This last graph explores the significance of age and gender in terms of survival. Children and female passengers between the ages of 20 and 30 had the highest rate of survival. Men had little chance of surviving.
What other ideas/concepts could be represented via R Graphics if you had more time?
If I had more time for this project, I would probably have tried to plot the data separated by gender, age and class, and created a graph to get more insight into which class, gender and age had the best chance of survival.
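As a starting point for that idea, here is a rough sketch using R's built-in `Titanic` table again (not the Kaggle file, and with age only split into Child and Adult). It computes the survival share for every class, sex and age group and plots the adults.

```r
# Share of survivors within each Class / Sex / Age group.
# Groups with no passengers (there were no children among the crew) give NaN.
rates <- prop.table(Titanic, c(1, 2, 3))[, , , "Yes"]

# Keep the adults: a Class x Sex matrix of survival shares
adult <- rates[, , "Adult"]
round(100 * adult, 1)

# One group of bars per class, one bar per sex
barplot(t(adult), beside = TRUE, legend.text = TRUE,
        col = c("steelblue", "salmon"), ylim = c(0, 1),
        xlab = "Class", ylab = "Share of adults who survived")
```

Swapping `"Adult"` for `"Child"` gives the same view for children, apart from the crew column, which has no data.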