
Data Carpentry Workshop - Day 1

Source: 知库网
Data Carpentry

A Chance Encounter

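By chance in early May, my overseas supervisor learned that I was teaching myself R and recommended some courses to me. Among them, an announcement for a Data workshop organised by a university body caught my eye. On learning that it was a free course with limited places, open to PhD students, I signed up anyway, just to try my luck. After the registration deadline I also emailed the organisers and was told I would get a reply within the following week. At the end of May I received formal confirmation for two courses, and also learned that many other applicants were still on the waiting list. I felt lucky to have such a hard-won opportunity and resolved to make the most of it.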

Introduction to Data Carpentry

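Data Carpentry is very popular abroad. It is a kind of training activity that volunteers regularly organise for students and researchers. This particular workshop was a data analysis course run by several KCL teachers for wet-lab PhD students. The first course was aimed at students with no R background but a pressing need for data analysis. It was taught as a small class of 40, with two instructors taking turns to explain and demonstrate live. The course provided free breakfast and lunch as well as refreshments during breaks. Most participants were life sciences students, and the venue was the Seminar Room at the well-known hospital on the Guy's campus. The course ran over two days. Day one started from the familiar Excel, covering the problems and pitfalls of entering and processing research data in Excel, then introduced R with a gentle start. Day two started with some common data frame operations and then introduced two widely used packages: tidyverse and ggplot2.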


Lunch

Impressions of the Course

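The course was compact and rich in content, truly an excellent introduction to R for beginners. Because it was taught entirely in English and demonstrated mainly on a Mac, I still found the second day hard going despite some self-taught background. Having lived in the UK for three months, I could follow the instructors' explanations, but sitting at the back I could not see the code very clearly, and I was not yet comfortable with typing code or using keyboard shortcuts. The later parts were therefore difficult, and I need to consolidate promptly after class. Among the 40 students, I came across two who seemed to be of Chinese origin, but since both spoke fluent English I felt too shy to talk to them in Chinese, so I am not sure of their background. The other students were very active in class and quick to respond. A few men next to me were doing their own data analysis and slides while following the lesson, and I admired their efficiency. A woman sitting beside me also kept up with the instructors completely and helped me a lot. During the breaks we chatted about research life, which made me feel all the more that I should work harder.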

Class Notes: Day 1

1. Data organization in spreadsheets

1.1 Don'ts in spreadsheets

DON'T:

  • modify your raw data. Always make a duplicate copy and work on that.
  • combine multiple values in one cell (units, numbers, etc.).
  • make calculations in the spreadsheet. When you export to a text file, the formulae are not exported.
  • color code things. The computer does not care.

DO:

  • export as a text-based file (csv or txt) so that R can read it.
1.2 Names for columns:
  • do not use spaces
  • use UpperCaseLikeThis
  • use Underscore_to_separate_words
1.3 Dealing with missing values:
  • Do not use 0, because 0 is data sometimes
  • NA is the best
  • blank cells also work
  • Do not combine columns
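
As a minimal sketch of why NA or a blank cell is safer than 0, the R snippet below reads a small made-up CSV and treats both NA and empty cells as missing. The column names plot_id and weight and all values are invented for illustration; they are not from the workshop data.

```r
# Hypothetical data: plot_id and weight are invented for illustration.
csv_text <- "plot_id,weight
1,12.5
2,NA
3,
4,0"

# na.strings tells read.csv which cell contents count as missing.
surveys <- read.csv(text = csv_text, na.strings = c("NA", ""))

missing_rows <- is.na(surveys$weight)                 # rows 2 and 3 are missing
mean_weight  <- mean(surveys$weight, na.rm = TRUE)    # the genuine 0 is kept as data
```

Had the missing weights been typed as 0, they would have been indistinguishable from the real zero in row 4.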
1.4 Dealing with Dates
  • Use built-in functions
  • create new columns to split the year from the month and day
  • use the formula =YEAR(cell), clicking on the cell you want to split
  • double click on the bottom right of the cell where the little cross appears. It will apply the formula all the way down
  • reconstructing the date: =DATE(cell1; cell2; cell3) reconstructs the date from the year, month and day (depending on your locale, the separator may be a comma instead of a semicolon)
  • string format: the date can also be stored as a single string of digits
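
The same split-and-rebuild idea can be sketched in R. This is only an illustration under the assumption that the dates are stored as text in YYYY-MM-DD form; the two dates are invented.

```r
# Hypothetical dates, stored as text in ISO format (YYYY-MM-DD).
dates <- as.Date(c("2018-05-01", "2018-06-15"))

# Split each date into separate year, month and day columns.
parts <- data.frame(
  year  = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  day   = as.integer(format(dates, "%d"))
)

# Reconstruct the date from its three parts (zero-padded so it parses cleanly).
rebuilt <- as.Date(sprintf("%04d-%02d-%02d", parts$year, parts$month, parts$day))
```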

2. Introduction to R

2.1 R and RStudio
  • R allows you to handle large datasets. It has lots of 'packages'.
  • RStudio is a separate program that gives you a more user-friendly environment for working with R.
  • Every 'window' gives you information. The upper left corner is where you write your script: you write your instructions like a lab protocol. The Console (bottom left) is the window in which your commands are executed. The upper right is the Environment. The bottom right includes Files, Plots, Packages and Help.
  • Pipeline: one script after the other that takes you through all the actions that you need to do to deal with your data.
2.2 Advantages of R
  • it is free
  • it has thousands of functions built in, so that you don't have to write them yourself!
  • it is much more user-friendly than other programs
  • there is a large community to ask questions (and you will get an answer!)
2.3 Tips to start using R
  • Be very organised. Make sub-folders that organise your project (data, outputs, figures, scripts).
  • A path shows you the way: this is a series of folders and subfolders to show you where your documents are.
  • Start a New Project: always whenever you are starting something new.
  • Start a new R Script: this is where you will type all of your commands- your script.
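
The folder layout above could be created from R itself. The sketch below is an assumption-laden illustration: the project name my_project is invented, and everything is created inside a temporary directory so that running it does not touch your real files.

```r
# Create a hypothetical project skeleton inside a temporary directory.
project    <- file.path(tempdir(), "my_project")   # "my_project" is an invented name
subfolders <- c("data", "outputs", "figures", "scripts")

for (folder in subfolders) {
  # recursive = TRUE also creates the parent project folder if needed.
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

created <- list.files(project)   # names of the sub-folders that now exist
```

In a real project you would replace the temporary directory with the path of your own project folder.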
2.4 You can change the appearance in RStudio
  • Windows: Tools --> Global Options
  • Mac: In the 'RStudio' menu, choose 'Preferences'
2.5 Object
  • <- is the assign operator. This is how we assign a value to an object.
  • Shortcut: Windows/Linux: "Alt" + "-" | Mac: "Option" + "-".
  • Object names: with underscores, meaningful, and do not start with a number.
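
A short illustration of assignment, using invented object names and values:

```r
weight_kg <- 55                  # assign the value 55 to the object weight_kg
weight_kg                        # typing the name prints the value

weight_lb <- 2.2 * weight_kg     # objects can be used in calculations
weight_kg <- 100                 # reassigning weight_kg does not update weight_lb
weight_lb                        # still based on the old value of weight_kg
```

Note that weight_lb keeps the value computed from the old weight_kg: R does not recalculate it the way a spreadsheet formula would.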
2.6 Useful Commands
  • getwd(): shows you where you are in your computer; it tells you the working directory
  • setwd(): sets the working directory
  • ls(): lists the objects that are in your 'workspace'
  • rm(): removes objects. THERE IS NO WAY OF RECOVERING THEM!!
  • sqrt(): square root
  • round(): rounds a number to the number of decimals that you want/need
  • length(): tells you the number of values in a vector
  • class(): tells you the type of object that you are dealing with
  • str(): tells you the structure of the object
  • ?function_name: gives you the help page for that function
  • print(): prints the value on the screen
  • (x <- value): wrapping an assignment in parentheses also prints the value on the screen
  • #: lets you annotate your script with comments
  • mean(): calculates the mean of a vector of numbers
  • args(function_name): tells you the arguments of a function
  • c(): combines values into one vector
  • [ ]: subsets elements from vectors. Indexing starts at 1.
  • !: logical NOT; it means 'opposite'
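
Several of these commands can be tried together in one short script. The numbers and object names below are invented for illustration.

```r
values <- c(4, 9, 16.5)          # c() combines the numbers into one vector

roots     <- sqrt(values)        # square root of each element
rounded   <- round(3.14159, digits = 2)
n_values  <- length(values)      # number of elements in the vector
value_cls <- class(values)       # type of object
str(values)                      # compact summary of the structure
args(round)                      # shows the arguments that round() accepts

(avg <- mean(values))            # parentheses around an assignment also print it
first_value <- values[1]         # indexing starts at 1

rm(avg)                          # avg is removed and cannot be recovered
avg_still_there <- "avg" %in% ls()
```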
2.7 Type of Data
  • character - text
  • numeric - numbers
  • integer - numbers without decimals
  • double - numbers with decimals
  • logical - TRUE or FALSE
  • In R there is a coercion hierarchy among these types: logical → numeric → character ← logical. When types are mixed in one vector, everything is converted to the most general type.
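
The hierarchy can be checked directly with class() on small invented vectors:

```r
cls_logical   <- class(c(TRUE, FALSE))   # "logical"
cls_integer   <- class(c(1L, 2L))        # "integer" (the L suffix marks an integer)
cls_numeric   <- class(c(1.5, 2))        # "numeric"
cls_character <- class(c("a", "b"))      # "character"

# Mixing types coerces every element to the most general type present.
mixed_num <- c(TRUE, 2)                  # logical -> numeric: TRUE becomes 1
mixed_chr <- c(1, "a")                   # numeric -> character: 1 becomes "1"
mixed_lgl <- c(TRUE, "a")                # logical -> character: TRUE becomes "TRUE"
```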
2.8 Vectors
  • This is another type of object in R.
  • This is just a series of values that you put together in an object using c().
2.9 Functions
  • A function is a command that executes some action on your input.
  • A function has 'arguments' in it: the things you input on your function so that it gets executed with your particular parameters.
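
To make the idea of arguments concrete, here is a small sketch with round(), whose second argument digits has a default value of 0:

```r
no_digits  <- round(3.14159)                   # digits defaults to 0
two_digits <- round(3.14159, digits = 2)       # argument given by name
swapped    <- round(digits = 2, x = 3.14159)   # named arguments can come in any order
```

Naming the arguments makes a script easier to read, even when the order alone would be enough.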
2.10 Subsetting vectors
  • to extract values from a vector, use [ ].
Conditional subsetting
  • AND: &
  • OR: |
  • Equal to: ==
  • Greater than or equal to: >=
  • Less than or equal to: <=
  • Greater than: >
  • Less than: <
  • %in%: tests whether each element is found in a set of values
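
The operators above can be combined inside [ ] to pick out elements. The vectors below are invented for illustration.

```r
weights <- c(21, 34, 39, 54, 55)             # invented values
animals <- c("mouse", "rat", "dog", "cat")

by_position <- weights[c(1, 3)]                        # elements 1 and 3
heavy       <- weights[weights > 50]                   # 54 55
middle      <- weights[weights >= 30 & weights < 50]   # 34 39
extremes    <- weights[weights < 30 | weights == 55]   # 21 55
found       <- animals[animals %in% c("rat", "cat", "duck")]   # "rat" "cat"
not_found   <- animals[!(animals %in% c("rat", "cat"))]        # "mouse" "dog"
```

Each condition produces a vector of TRUE and FALSE values, and [ ] keeps only the elements where the condition is TRUE.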
2.11 Missing Data
  • Missing data are represented as NA in a vector.
  • na.rm = TRUE: tells a function such as mean() to ignore the missing data
  • x[!is.na(x)]: extracts the elements of x that are not missing.
  • na.omit(x): returns x with the incomplete cases removed.
  • x[complete.cases(x)]: returns only the complete cases of x.
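
A minimal sketch of these options on an invented vector with one missing value:

```r
heights <- c(2, 4, 4, NA, 6)                       # invented values, one missing

mean_with_na    <- mean(heights)                   # NA: missing values propagate
mean_without_na <- mean(heights, na.rm = TRUE)     # 4
not_missing     <- heights[!is.na(heights)]        # 2 4 4 6
omitted         <- na.omit(heights)                # same values, plus a record of what was dropped
complete        <- heights[complete.cases(heights)]
```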

Coming Up Next

Starting with Data
Data Manipulation using dplyr and tidyr
Data visualization with ggplot2