📊 R Introduction

🎯 Complete Definition

R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. Created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, R is an implementation of the S programming language with lexical scoping semantics inspired by Scheme.

🔬 Core Characteristics

  • Statistical Computing: Built-in statistical models, tests, and analyses
  • Data Visualization: Powerful graphics capabilities (base graphics, ggplot2, lattice)
  • Vectorized Operations: Operations apply to entire vectors without explicit loops
  • Functional Programming: Functions are first-class citizens
  • Package Ecosystem: CRAN hosts over 18,000 packages
  • Data Wrangling: Excellent tools for data manipulation (dplyr, data.table)
  • Reproducible Research: R Markdown, knitr, Sweave
  • Interoperability: Connects with Python, C++, SQL, and more

📊 Industry Usage

R is widely used in academia for statistics and data science. It is also used extensively in pharmaceuticals (clinical trials), finance (risk modeling), healthcare and bioinformatics, government (census data), and tech companies (data analysis). Major users include Google, Facebook, Pfizer, and the FDA.

# R Introduction - CodeOrbitPro
# Basic R operations
print("📊 Welcome to R Pro Track!")

# Create variables
language <- "R"
year <- 1993
version <- "4.3.1"

# Print with concatenation
cat("Hello,", language, "! Created in", year, "\n")
cat("Current version:", version)

# Simple statistics
numbers <- c(1, 2, 3, 4, 5)
mean_value <- mean(numbers)
sd_value <- sd(numbers)
cat("\nMean:", mean_value)
cat("\nStandard deviation:", sd_value)

🔤 Basics & Syntax

🎯 Complete Definition

R syntax uses <- for assignment (though = also works), supports both functional and object-oriented paradigms, and has special handling for vectors and missing data.

📋 Key Elements

  • Assignment: x <- 10 or x = 10 (preferred <-)
  • Comments: # for single-line comments (no multi-line comments)
  • Case Sensitivity: R is case-sensitive (x and X are different)
  • Console I/O: print(), cat(), readline(), scan()
  • Packages: install.packages(), library(), require()
  • Help System: ?function, help(function), example(function)
  • Working Directory: getwd(), setwd()
  • Special Values: NA (missing value), NULL (the empty object, i.e. no value at all), NaN (not a number), Inf/-Inf
# R Basics & Syntax

# Assignment operators
x <- 10    # Preferred
y = 20     # Also works but less common
30 -> z    # Right assignment (rare)

# Printing
print(x)                           # [1] 10
cat("x =", x, "y =", y, "\n")      # x = 10 y = 20

# Getting help
help(mean)       # or ?mean
example(mean)    # Run examples

# Working directory
getwd()                          # Current directory
# setwd("path/to/directory")     # Change directory

# Special values
missing <- NA
undefined <- NULL
not_number <- 0/0    # NaN
infinite <- 1/0      # Inf

# Check types
is.na(missing)          # TRUE
is.null(undefined)      # TRUE
is.nan(not_number)      # TRUE
is.infinite(infinite)   # TRUE

# Listing objects
ls()         # List all objects in environment
objects()    # Same as ls()

# Removing objects
rm(x)              # Remove single object
rm(list = ls())    # Remove all objects (clear workspace)
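The difference between NA and NULL in the special-values list is easiest to see by putting both into a vector. The following is a minimal sketch; the variable names are made up for illustration.

```r
# NA is a placeholder for a missing value and still occupies a slot
x <- c(4, NA, 6)
length(x)                 # 3
mean(x)                   # NA: missing values propagate through arithmetic
mean(x, na.rm = TRUE)     # 5: drop NAs before computing

# NULL is the empty object and vanishes when combined into a vector
y <- c(4, NULL, 6)
length(y)                 # 2

# Comparing with == yields NA, so test with is.na() instead
NA == NA                  # NA
is.na(x)                  # FALSE TRUE FALSE
```

Most summary functions in base R accept na.rm = TRUE, which is usually the simplest way to handle missing values in a quick analysis.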

📦 Data Types

🎯 Complete Definition

R data types include atomic vectors (homogeneous) and lists (heterogeneous). R uses dynamic typing with type coercion and has specific classes for statistical data.

📊 Basic Types

  • numeric: Double-precision floating point (default for numbers)
  • integer: Integer values (suffix L: 42L)
  • character: String values in quotes
  • logical: TRUE, FALSE (or T, F - not recommended)
  • complex: Complex numbers (3 + 2i)
  • raw: Raw bytes
  • factor: Categorical data with levels (a class built on integer vectors rather than a separate basic type)
  • Date/POSIXct: Date and time classes (built on numeric vectors)

🔄 Type Checking & Coercion

  • Check: class(), typeof(), mode(), is.numeric(), is.character()
  • Coerce: as.numeric(), as.character(), as.logical(), as.factor()
# R Data Types

# Numeric (double by default)
num1 <- 10.5
num2 <- 42
typeof(num1)    # "double"
class(num1)     # "numeric"

# Integer (explicit)
int1 <- 42L
typeof(int1)    # "integer"

# Character
char1 <- "Hello, R!"
char2 <- 'Single quotes also work'
typeof(char1)   # "character"

# Logical
bool1 <- TRUE
bool2 <- FALSE
bool3 <- T      # Not recommended, can be overwritten
typeof(bool1)   # "logical"

# Complex
comp <- 3 + 4i
typeof(comp)    # "complex"
Re(comp)        # Real part: 3
Im(comp)        # Imaginary part: 4

# Type checking
is.numeric(num1)      # TRUE
is.integer(int1)      # TRUE
is.character(char1)   # TRUE
is.logical(bool1)     # TRUE

# Type coercion (automatic)
mixed <- c(1, "hello", TRUE)
print(mixed)    # All converted to character: "1" "hello" "TRUE"

# Explicit coercion
as.numeric("123")    # 123
as.character(42)     # "42"
as.logical(0)        # FALSE
as.logical(1)        # TRUE
as.numeric(TRUE)     # 1
as.numeric(FALSE)    # 0

# Checking for NA/NaN
x <- c(1, 2, NA, 4, NaN)
is.na(x)     # FALSE FALSE TRUE FALSE TRUE
is.nan(x)    # FALSE FALSE FALSE FALSE TRUE
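The checking functions listed above answer different questions: class() reports how an object behaves, while typeof() reports how it is stored. A short sketch with a factor and a Date, both of which sit on top of simpler storage types (the variable names are illustrative):

```r
f <- factor(c("low", "high", "low"))
class(f)     # "factor"
typeof(f)    # "integer": the level codes are stored as integers

d <- as.Date("2024-01-15")
class(d)     # "Date"
typeof(d)    # "double": stored as days since 1970-01-01
unclass(d)   # 19737
```

This is why a factor or a Date can pass through numeric code without an error while producing values that are not the ones intended.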

🔢 Vectors

🎯 Complete Definition

Vectors are the most fundamental data structure in R. They are homogeneous (same type) sequences of elements. Everything in R is built on vectors - even single values are vectors of length 1.

📏 Vector Operations

  • Creation: c(), vector(), seq(), rep(), : operator
  • Indexing: [ ] with positive integers, negative integers (exclude), logical vectors, names
  • Vectorized operations: Operations apply element-wise
  • Recycling: Shorter vectors are recycled to match longer ones
  • Named vectors: Elements can have names for labeled access
  • Attributes: names(), dim(), class() can be attached

🔧 Key Functions

  • Creation: seq(from, to, by), seq_len(n), rep(x, times), rep_len(x, n)
  • Manipulation: length(), unique(), duplicated(), sort(), order(), rank()
  • Math: sum(), mean(), sd(), var(), min(), max(), range()
# Vectors in R

# Creating vectors
v1 <- c(1, 2, 3, 4, 5)                  # Combine function
v2 <- 1:10                              # Sequence operator
v3 <- seq(from = 0, to = 1, by = 0.1)   # Sequence with step
v4 <- rep(1:3, times = 4)               # Repeat vector
v5 <- rep(1:3, each = 3)                # Repeat each element

print(v1)    # [1] 1 2 3 4 5
print(v2)    # [1] 1 2 3 4 5 6 7 8 9 10
print(v3)    # [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
print(v4)    # [1] 1 2 3 1 2 3 1 2 3 1 2 3
print(v5)    # [1] 1 1 1 2 2 2 3 3 3

# Named vectors
grades <- c(Alice = 95, Bob = 87, Charlie = 92)
print(grades)
#   Alice     Bob Charlie
#      95      87      92

# Access by name
grades["Alice"]    # 95

# Vector operations (vectorized)
a <- c(1, 2, 3)
b <- c(4, 5, 6)
a + b      # [1] 5 7 9
a * b      # [1] 4 10 18
a^2        # [1] 1 4 9
sqrt(a)    # [1] 1.000000 1.414214 1.732051
log(a)     # Natural log

# Recycling (shorter vector recycled)
c(1, 2, 3, 4) + c(10, 20)    # [1] 11 22 13 24

# Indexing
v <- c(10, 20, 30, 40, 50)
v[1]             # First element: 10
v[3]             # Third element: 30
v[1:3]           # First three: 10 20 30
v[c(1, 3, 5)]    # Specific positions: 10 30 50
v[-2]            # Exclude second: 10 30 40 50
v[-c(1, 5)]      # Exclude first and last: 20 30 40

# Logical indexing
v > 25       # [1] FALSE FALSE TRUE TRUE TRUE
v[v > 25]    # [1] 30 40 50

# Vector functions
length(v)                 # 5
unique(c(1,1,2,2,3))      # [1] 1 2 3
sort(c(3,1,4,2))          # [1] 1 2 3 4
order(c(3,1,4,2))         # [1] 2 4 1 3 (indices that would sort)

# Summary statistics
x <- rnorm(100)    # 100 random normal numbers
mean(x)
median(x)
sd(x)
summary(x)         # Min, 1st Qu., Median, Mean, 3rd Qu., Max
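Two of the creation helpers listed under Key Functions, seq_len() and rep_len(), exist mainly to avoid edge cases with the : operator and with recycling. The sketch below shows the cases where they differ; the variable names are illustrative.

```r
n <- 0
1:n           # 1 0  (counts down, rarely what a loop wants)
seq_len(n)    # integer(0), so a loop over it runs zero times

iterations <- 0
for (i in seq_len(n)) iterations <- iterations + 1
iterations    # 0

# Recycling still works when lengths are not multiples, but R warns about it
res <- suppressWarnings(c(1, 2, 3) + c(10, 20))
res           # 11 22 13

# rep_len() repeats up to an exact length without a warning
rep_len(c("a", "b"), 5)    # "a" "b" "a" "b" "a"
```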

📋 Lists & Factors

🎯 Complete Definition

Lists are recursive vectors that can contain elements of different types, including other lists. Factors are vectors used for categorical data, storing integers with corresponding character labels (levels).

📋 Lists

  • Creation: list(), pairlist()
  • Access: [ ] returns list, [[ ]] returns element, $ for named elements
  • Nesting: Lists can contain other lists
  • Conversion: unlist() flattens to vector, as.list() converts vector to list
  • Applications: Function returns, complex data structures, JSON-like data

📊 Factors

  • Creation: factor(), as.factor(), gl() (generate levels)
  • Properties: levels(), nlevels(), is.factor(), ordered() for ordinal
  • Manipulation: droplevels() removes unused levels, relevel() changes reference
  • Table creation: table() creates contingency tables from factors
  • Important: Factors behave like integers in some contexts (caution needed)
# Lists and Factors in R

# ===== LISTS =====
# Creating lists
my_list <- list(
  name = "John Doe",
  age = 30,
  scores = c(85, 92, 78),
  is_student = FALSE,
  nested = list(a = 1, b = 2)
)
print(my_list)

# Accessing list elements
my_list[1]           # Returns list with first element: $name
my_list[[1]]         # Returns content: "John Doe"
my_list$name         # Returns content: "John Doe"
my_list[["name"]]    # Returns content: "John Doe"
my_list$scores[2]    # 92 (second score)
my_list$nested$a     # 1

# Adding to list
my_list$new_element <- "added later"

# List operations
length(my_list)           # Number of top-level elements
names(my_list)            # Get or set names
unlist(my_list$scores)    # Flatten to vector (already vector here)

# Combining lists
list1 <- list(a = 1, b = 2)
list2 <- list(c = 3, d = 4)
combined <- c(list1, list2)
print(combined)

# ===== FACTORS =====
# Creating factors
gender <- factor(c("male", "female", "female", "male", "male"))
print(gender)

# Check levels
levels(gender)     # [1] "female" "male"
nlevels(gender)    # 2
class(gender)      # "factor"

# Ordered factors
education <- factor(
  c("high school", "bachelor", "master", "phd", "bachelor"),
  levels = c("high school", "bachelor", "master", "phd"),
  ordered = TRUE
)
print(education)                # Values plus the ordered levels
education[1] < education[2]     # TRUE (high school < bachelor)

# Table from factors
gender_education <- table(gender, education)
print(gender_education)

# Changing factor levels
levels(gender) <- c("F", "M")   # Order matters: first level ("female") becomes "F"
print(gender)

# Generate factor with gl()
gl(3, 2, labels = c("Low", "Medium", "High"))

# Dropping unused levels
factor_with_unused <- factor(c("A", "B", "A"), levels = c("A", "B", "C"))
droplevels(factor_with_unused)  # Removes level "C"
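The caution about factors behaving like integers matters most when a factor holds numbers stored as text. A small sketch of the pitfall and two common workarounds (the data is made up):

```r
f <- factor(c("10", "20", "10", "30"))

as.numeric(f)                  # 1 2 1 3  (level codes, not the values)
as.numeric(as.character(f))    # 10 20 10 30
as.numeric(levels(f))[f]       # 10 20 10 30 (converts each level only once)
```

Converting through as.character() is the more readable option; indexing the converted levels does the same job with less work on long factors.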

📑 Data Frames

🎯 Complete Definition

Data frames are the fundamental data structure for tabular data in R. They are lists of equal-length vectors (columns), combining features of lists and matrices. Each column can be a different type, making them ideal for datasets with mixed variable types.

📋 Data Frame Operations

  • Creation: data.frame(), as.data.frame(), read.table()/read.csv()
  • Access: [rows, cols], $ for columns, [[ ]], subset()
  • Inspection: head(), tail(), str(), summary(), dim(), nrow(), ncol()
  • Manipulation: rbind() (add rows), cbind() (add columns), merge() (join)
  • Subsetting: subset(), filter with logical conditions
  • Transformation: transform(), within()
  • Special: attach()/detach() (use with caution)
# Data Frames in R

# ===== CREATING DATA FRAMES =====
# From vectors
name <- c("Alice", "Bob", "Charlie", "Diana", "Eve")
age <- c(25, 30, 35, 28, 32)
salary <- c(50000, 60000, 75000, 55000, 70000)
department <- c("IT", "HR", "IT", "Finance", "HR")
employed <- c(TRUE, TRUE, FALSE, TRUE, TRUE)

df <- data.frame(
  Name = name,
  Age = age,
  Salary = salary,
  Department = department,
  Employed = employed,
  stringsAsFactors = FALSE    # Don't convert strings to factors
)
print(df)

# ===== INSPECTING DATA FRAMES =====
head(df, 3)     # First 3 rows
tail(df, 2)     # Last 2 rows
str(df)         # Structure
summary(df)     # Summary statistics
dim(df)         # Dimensions: [1] 5 5
nrow(df)        # Number of rows: 5
ncol(df)        # Number of columns: 5
names(df)       # Column names
colnames(df)    # Same as names()
rownames(df)    # Row names

# ===== ACCESSING DATA =====
# By column
df$Name           # Vector: "Alice" "Bob" "Charlie" "Diana" "Eve"
df[["Age"]]       # Vector: 25 30 35 28 32
df[, "Salary"]    # Vector: 50000 60000 75000 55000 70000
df[, 3]           # Third column (Salary)

# By row and column
df[1, ]                    # First row (all columns)
df[1:3, ]                  # First three rows
df[1, "Name"]              # "Alice"
df[1, 1]                   # "Alice"
df[1, c("Name", "Age")]    # First row, Name and Age

# By condition
df[df$Age > 30, ]                        # Rows with Age > 30
df[df$Department == "IT", ]              # IT department only
df[df$Salary > 60000 & df$Employed, ]    # Employed with Salary > 60000

# ===== SUBSET FUNCTION =====
subset(df, Age > 30)
subset(df, Department == "IT", select = c(Name, Salary))
subset(df, Salary > 60000, select = -c(Department))

# ===== ADDING/REMOVING COLUMNS =====
# Add column
df$Bonus <- df$Salary * 0.1
df[["YearsEmployed"]] <- c(2, 5, 0, 3, 4)

# Add multiple columns using transform
df <- transform(df,
  TotalComp = Salary + Bonus,
  AgeGroup = ifelse(Age < 30, "Young", "Senior"))

# Remove columns by assigning NULL
df$Bonus <- NULL                 # Remove Bonus column
df[["YearsEmployed"]] <- NULL    # Remove YearsEmployed
df$TotalComp <- NULL             # Also dropped so the rows added below have matching columns
df$AgeGroup <- NULL

# ===== ADDING ROWS =====
# Add single row (rbind requires the same columns in both data frames)
new_employee <- data.frame(
  Name = "Frank",
  Age = 29,
  Salary = 62000,
  Department = "Finance",
  Employed = TRUE,
  stringsAsFactors = FALSE
)
df <- rbind(df, new_employee)

# Add multiple rows
more_employees <- data.frame(
  Name = c("Grace", "Henry"),
  Age = c(27, 33),
  Salary = c(58000, 72000),
  Department = c("IT", "HR"),
  Employed = c(TRUE, TRUE),
  stringsAsFactors = FALSE
)
df <- rbind(df, more_employees)

# ===== COMBINING DATA FRAMES =====
# Create second data frame
df2 <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Henry"),
  City = c("NYC", "LA", "Chicago", "NYC", "LA", "Chicago", "NYC", "LA"),
  stringsAsFactors = FALSE
)

# Merge (join) data frames
df_merged <- merge(df, df2, by = "Name")
print(df_merged)

# ===== IMPORTANT FUNCTIONS =====
# Order/sort data frame
df_ordered <- df[order(df$Age, decreasing = TRUE), ]    # Sort by Age descending
df_ordered <- df[order(df$Department, df$Salary), ]     # Sort by Department, then Salary

# Remove rows with NA
df_complete <- na.omit(df)

# Unique rows
df_unique <- unique(df)

# Apply functions to columns
lapply(df[, c("Age", "Salary")], mean)
sapply(df[, c("Age", "Salary")], summary)

# ===== AGGREGATION =====
# Aggregate by department
agg <- aggregate(Salary ~ Department, data = df, FUN = mean)
print(agg)

# Table of counts
table(df$Department, df$Employed)
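merge() performs an inner join by default, so rows without a match in the other data frame are dropped silently. The sketch below uses two small made-up tables to show how the all.x and all arguments change that.

```r
staff  <- data.frame(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 30, 35))
cities <- data.frame(Name = c("Alice", "Bob", "Zed"), City = c("NYC", "LA", "Boston"))

inner <- merge(staff, cities, by = "Name")                  # only names present in both
left  <- merge(staff, cities, by = "Name", all.x = TRUE)    # keep every staff row
full  <- merge(staff, cities, by = "Name", all = TRUE)      # keep rows from both sides

nrow(inner)    # 2
nrow(left)     # 3
nrow(full)     # 4
left$City[left$Name == "Charlie"]    # NA: no matching city
```

Comparing nrow() before and after a merge is a cheap way to notice rows that were lost or duplicated by the join.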

🔲 Matrices & Arrays

🎯 Complete Definition

Matrices are two-dimensional arrays where all elements are the same type. Arrays generalize matrices to any number of dimensions. They are fundamental for linear algebra operations and multidimensional data.

🔲 Matrix Operations

  • Creation: matrix(), rbind(), cbind(), diag()
  • Dimensions: dim(), nrow(), ncol(), length()
  • Access: [row, col] indexing
  • Linear Algebra: %*% (matrix multiplication), t() (transpose), solve() (inverse), eigen()
  • Row/Column Operations: rowSums(), colSums(), rowMeans(), colMeans()
  • Special Matrices: diag() (diagonal), outer() (outer product)

📊 Arrays

  • Creation: array(data, dim)
  • Dimensions: dim() to get/set dimensions
  • Indexing: [dim1, dim2, dim3, ...]
  • aperm(): Permute array dimensions (like transpose for higher dimensions)
# Matrices and Arrays in R

# ===== CREATING MATRICES =====
# By column (default)
m1 <- matrix(1:12, nrow = 3, ncol = 4)
print(m1)

# By row
m2 <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
print(m2)

# From vectors using rbind and cbind
row1 <- c(1, 2, 3)
row2 <- c(4, 5, 6)
m3 <- rbind(row1, row2)    # Bind by row
print(m3)

col1 <- c(1, 4)
col2 <- c(2, 5)
col3 <- c(3, 6)
m4 <- cbind(col1, col2, col3)    # Bind by column
print(m4)

# Diagonal matrix
diag(1, nrow = 3)      # Identity matrix
diag(c(10, 20, 30))    # Diagonal with specified values

# ===== MATRIX PROPERTIES =====
m <- matrix(1:9, nrow = 3)
dim(m)       # [1] 3 3
nrow(m)      # [1] 3
ncol(m)      # [1] 3
length(m)    # [1] 9 (total elements)

# ===== INDEXING MATRICES =====
m <- matrix(1:12, nrow = 3)
m[2, 3]        # Element at row 2, column 3: 8
m[1, ]         # First row: 1 4 7 10
m[, 2]         # Second column: 4 5 6
m[1:2, 3:4]    # Rows 1-2, columns 3-4
m[c(1, 3), ]   # Rows 1 and 3

# ===== MATRIX OPERATIONS =====
A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)

# Element-wise operations
A + B
A * B    # Element-wise multiplication, NOT matrix multiplication
A / B
A^2

# Matrix multiplication
A %*% B    # True matrix multiplication

# Transpose
t(A)

# Determinant
det(matrix(c(1,2,3,4), nrow = 2))

# Inverse
solve(A)    # Inverse of A (if invertible)

# Eigenvalues and eigenvectors
eigen(A)

# ===== ROW/COLUMN STATISTICS =====
m <- matrix(1:12, nrow = 3)
rowSums(m)     # Sum of each row: 22 26 30
colSums(m)     # Sum of each column: 6 15 24 33
rowMeans(m)    # Mean of each row: 5.5 6.5 7.5
colMeans(m)    # Mean of each column: 2 5 8 11

apply(m, 1, sum)     # Same as rowSums (apply over rows)
apply(m, 2, mean)    # Same as colMeans (apply over columns)

# ===== ARRAYS (3+ dimensions) =====
# Create 3D array (2 rows, 3 columns, 4 layers)
arr <- array(1:24, dim = c(2, 3, 4))
print(arr)

# Accessing array elements
arr[1, 2, 3]    # Row 1, Col 2, Layer 3: 15
arr[, , 1]      # First layer (2x3 matrix)
arr[1, , ]      # First row across all layers
arr[, 2, ]      # Second column across all layers

# Dimensions of array
dim(arr)    # [1] 2 3 4

# Permute dimensions
aperm(arr, c(2, 1, 3))    # Swap first two dimensions

# ===== OUTER PRODUCT =====
x <- 1:3
y <- 4:6
outer(x, y, FUN = "*")    # Multiplication table
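solve() also accepts a right-hand side, which solves a linear system directly instead of forming the inverse first. A minimal sketch with a made-up 2x2 system:

```r
# Solve  2x + 1y = 3
#        1x + 3y = 5
A <- matrix(c(2, 1, 1, 3), nrow = 2)    # filled column by column
b <- c(3, 5)

x <- solve(A, b)
x                         # 0.8 1.4

# Check the solution by multiplying back
as.vector(A %*% x)        # 3 5
```

Calling solve(A, b) is generally preferable to solve(A) %*% b because it skips computing the full inverse.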

🔄 Control Flow

🎯 Complete Definition

Control flow statements in R include conditional execution and loops. R also provides vectorized alternatives that are often more efficient than explicit loops.

🔀 Conditional Statements

  • if/else: if (condition) expr else expr
  • ifelse(): Vectorized conditional: ifelse(test, yes, no)
  • switch(): Multi-way branch based on value

🔄 Loops

  • for: for (var in sequence) { expr }
  • while: while (condition) { expr }
  • repeat: repeat { expr; if (condition) break }
  • break/next: Exit loop or skip iteration

⚡ Vectorized Alternatives

  • Operations on vectors are usually faster than loops
  • ifelse() for vectorized conditional
  • apply family functions for iterative operations
# Control Flow in R

# ===== IF/ELSE STATEMENTS =====
x <- 10
if (x > 5) {
  print("x is greater than 5")
} else {
  print("x is less than or equal to 5")
}

# Multiple conditions
score <- 85
if (score >= 90) {
  grade <- "A"
} else if (score >= 80) {
  grade <- "B"
} else if (score >= 70) {
  grade <- "C"
} else {
  grade <- "F"
}
print(paste("Grade:", grade))

# ===== VECTORIZED IFELSE =====
ages <- c(15, 25, 35, 45, 55, 65)
age_groups <- ifelse(ages < 30, "Young",
                     ifelse(ages < 50, "Middle", "Senior"))
print(age_groups)

# Example with data frame
df <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  score = c(92, 67, 88, 45)
)
df$result <- ifelse(df$score >= 70, "Pass", "Fail")
print(df)

# ===== SWITCH STATEMENT =====
# switch can work with numbers or strings
operation <- "mean"
values <- c(10, 20, 30, 40, 50)
result <- switch(operation,
  "mean" = mean(values),
  "median" = median(values),
  "sum" = sum(values),
  "min" = min(values),
  "max" = max(values),
  stop("Unknown operation")
)
print(result)

# ===== FOR LOOPS =====
# Basic for loop
for (i in 1:5) {
  print(paste("Iteration:", i))
}

# Loop over vector
fruits <- c("apple", "banana", "cherry", "date")
for (fruit in fruits) {
  print(paste("I like", fruit))
}

# Loop with index
for (i in seq_along(fruits)) {
  print(paste("Fruit", i, "is", fruits[i]))
}

# ===== WHILE LOOPS =====
counter <- 1
while (counter <= 5) {
  print(paste("Counter:", counter))
  counter <- counter + 1
}

# While with condition
x <- 1
while (x < 100) {
  print(x)
  x <- x * 2
}

# ===== BREAK AND NEXT =====
# break - exit loop
for (i in 1:10) {
  if (i == 6) {
    break
  }
  print(i)
}

# next - skip iteration
for (i in 1:10) {
  if (i %% 2 == 0) {    # Skip even numbers
    next
  }
  print(i)
}
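To make the point about vectorized alternatives concrete, here is the same conditional transformation written as an explicit loop and as a single ifelse() call. This is a small sketch on made-up data; both versions produce the same vector.

```r
values <- c(3, 8, 1, 9, 4)

# Explicit loop: preallocate the result, then fill it element by element
doubled_loop <- numeric(length(values))
for (i in seq_along(values)) {
  doubled_loop[i] <- if (values[i] > 3) values[i] * 2 else values[i]
}

# Vectorized equivalent: one call over the whole vector
doubled_vec <- ifelse(values > 3, values * 2, values)

doubled_vec                               # 3 16 1 18 8
all.equal(doubled_loop, doubled_vec)      # TRUE
```

When a loop is unavoidable, preallocating the result as above avoids growing the vector on every iteration.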

⚙️ Functions

🎯 Complete Definition

Functions in R are first-class objects that encapsulate reusable code. They take arguments, perform operations, and return values. R supports functional programming concepts like closures, lexical scoping, and higher-order functions.

⚙️ Function Components

  • Name: Optional identifier for the function
  • Arguments: Formal parameters (can have default values)
  • Body: Code inside { } that implements the function
  • Return value: Last evaluated expression or explicit return()
  • Environment: Where the function looks for variables

📋 Function Features

  • Default arguments: function(x, y = 10)
  • Lazy evaluation: Arguments evaluated only when used
  • ... (ellipsis): Pass arbitrary number of arguments
  • Anonymous functions: Functions without names
  • Closures: Functions that capture the environment they were created in (typically functions returned by other functions)
  • Recursion: Functions that call themselves
# Functions in R

# ===== BASIC FUNCTION DEFINITION =====
# Simple function
greet <- function(name) {
  paste("Hello,", name, "!")
}

# Call the function
greet("Alice")

# Function with multiple arguments
add_numbers <- function(a, b) {
  result <- a + b
  return(result)    # Explicit return
}
add_numbers(5, 3)

# Function without return (last expression returned)
multiply <- function(a, b) {
  a * b    # Implicit return
}
multiply(4, 5)

# ===== DEFAULT ARGUMENTS =====
power <- function(x, exponent = 2) {
  x^exponent
}
power(5)       # [1] 25 (default exponent = 2)
power(5, 3)    # [1] 125 (exponent = 3)

# ===== MULTIPLE RETURN VALUES =====
# Return a list for multiple values
stats <- function(x) {
  list(
    mean = mean(x),
    median = median(x),
    sd = sd(x),
    min = min(x),
    max = max(x)
  )
}
values <- c(10, 20, 30, 40, 50)
result <- stats(values)
print(result$mean)

# ===== ANONYMOUS FUNCTIONS =====
# Used on the fly without naming
squared <- sapply(1:5, function(x) x^2)
print(squared)

# ===== CLOSURES (FUNCTIONS RETURNING FUNCTIONS) =====
power_factory <- function(exponent) {
  function(x) {
    x^exponent
  }
}
square <- power_factory(2)
cube <- power_factory(3)
square(5)    # [1] 25
cube(5)      # [1] 125

# ===== RECURSIVE FUNCTIONS =====
# Factorial
factorial_rec <- function(n) {
  if (n <= 1) {
    return(1)
  } else {
    return(n * factorial_rec(n - 1))
  }
}
factorial_rec(5)    # [1] 120
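Lazy evaluation and the ... argument from the feature list are easier to follow with a concrete case. The functions below are hypothetical helpers written only for this sketch.

```r
# Lazy evaluation: an argument that is never used is never evaluated
first_only <- function(a, b) a
first_only(1, stop("never evaluated"))    # 1, no error is raised

# Defaults can refer to other arguments because they are evaluated on first use
scale_by <- function(x, divisor = length(x)) x / divisor
scale_by(c(2, 4))                         # 1 2

# ... forwards any extra arguments to another function
mean_of <- function(x, ...) mean(x, ...)
mean_of(c(1, NA, 3))                      # NA
mean_of(c(1, NA, 3), na.rm = TRUE)        # 2
```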

🚀 Apply Family

🎯 Complete Definition

The apply family provides functions for applying an operation over the margins of arrays or over the elements of lists and vectors, replacing explicit loops with a single call. They are core to R's functional programming style and are usually more concise than a hand-written loop, although they are not necessarily faster, since they still iterate internally.

📋 Apply Functions Overview

  • apply(): Apply function to margins of arrays/matrices
  • lapply(): Apply function to each element of a list, return list
  • sapply(): Simplify lapply result to vector or matrix if possible
  • vapply(): Safer version of sapply with pre-specified return type
  • mapply(): Multivariate version of sapply
  • tapply(): Apply function over subsets (by factor)
# Apply Family Functions in R

# ===== APPLY (on matrices/arrays) =====
# Create a matrix
m <- matrix(1:12, nrow = 3)
print(m)

# Apply over rows (MARGIN = 1)
row_sums <- apply(m, 1, sum)
print(row_sums)

# Apply over columns (MARGIN = 2)
col_means <- apply(m, 2, mean)
print(col_means)

# ===== LAPPLY (list apply, returns list) =====
# Create a list
my_list <- list(
  a = 1:5,
  b = 6:10,
  c = 11:15
)

# Apply mean to each element
means_list <- lapply(my_list, mean)
print(means_list)

# lapply on data frame (data frame is a list of columns)
df <- data.frame(
  age = c(25, 30, 35, 28, 32),
  salary = c(50000, 60000, 75000, 55000, 70000),
  years = c(2, 5, 8, 3, 6)
)
lapply(df, mean)    # Means of each column

# ===== SAPPLY (simplified lapply) =====
# Returns vector when possible
means_vector <- sapply(my_list, mean)
print(means_vector)

# Returns matrix when appropriate
stats_matrix <- sapply(my_list, function(x) {
  c(mean = mean(x), sd = sd(x), n = length(x))
})
print(stats_matrix)

# ===== VAPPLY (type-safe sapply) =====
# Specify return type for safety
vapply(my_list, mean, numeric(1))

# ===== MAPPLY (multivariate apply) =====
# Steps through several arguments together: rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)
mapply(rep, 1:4, times = 4:1)

# ===== TAPPLY (apply by groups) =====
# Grouped operations
data <- data.frame(
  group = rep(c("A", "B", "C"), each = 5),
  value = c(rnorm(5, mean = 10), rnorm(5, mean = 20), rnorm(5, mean = 30))
)

# Calculate mean by group
tapply(data$value, data$group, mean)

# Grouping by a numeric column (tapply treats cyl as a factor)
tapply(mtcars$mpg, mtcars$cyl, mean)
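The reason vapply() is described as the safer choice shows up on unusual inputs, where sapply() quietly changes the shape of its result. A short sketch with made-up inputs:

```r
empty <- list()
sapply(empty, length)                # list()  (not an integer vector)
vapply(empty, length, integer(1))    # integer(0)

# sapply adapts to whatever comes back; vapply raises an error instead
dim(sapply(list(1:2, 3:4), range))   # 2 2  (a matrix, since range() returns two values)
mismatch <- tryCatch(
  vapply(list(1:2, 3:4), range, numeric(1)),    # declares one value per element
  error = function(e) "error"
)
mismatch                             # "error"
```

Inside functions and packages, that early error is usually preferable to a result whose type depends on the data.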

📊 dplyr & tidyverse

🎯 Complete Definition

dplyr is a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. It's part of the tidyverse, a collection of R packages designed for data science that share an underlying design philosophy, grammar, and data structures.

📋 Core dplyr Verbs

  • filter(): Subset rows based on conditions
  • select(): Choose columns by name
  • mutate(): Create new columns
  • summarise(): Collapse multiple values to single summary
  • arrange(): Reorder rows
  • group_by(): Group data for operations
  • join functions: left_join(), right_join(), inner_join(), full_join()

🔧 Additional Tidyverse Packages

  • tidyr: Tidy data (pivot_longer, pivot_wider, separate, unite)
  • purrr: Functional programming tools
  • tibble: Modern reimagining of data frames
  • stringr: String manipulation
  • forcats: Factor manipulation

⚡ Key Features

  • Pipe operator (%>%): Chain operations together
  • Consistent syntax: First argument is data, returns data frame
# dplyr and tidyverse in R

# Load libraries
library(dplyr)

# ===== PIPE OPERATOR (%>%) =====
# Without pipe
result <- arrange(
  summarise(
    group_by(filter(mtcars, cyl == 6), gear),
    avg_mpg = mean(mpg)
  ),
  desc(avg_mpg)
)

# With pipe (much more readable)
result <- mtcars %>%
  filter(cyl == 6) %>%
  group_by(gear) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  arrange(desc(avg_mpg))
print(result)

# ===== FILTER (row selection) =====
# Filter rows based on conditions
mtcars %>%
  filter(mpg > 20) %>%
  head()

# Multiple conditions
mtcars %>%
  filter(cyl == 6 & hp > 100)

# ===== SELECT (column selection) =====
# Select columns by name
mtcars %>%
  select(mpg, cyl, hp) %>%
  head()

# Select with helpers
mtcars %>%
  select(starts_with("c"))

# Remove columns
mtcars %>%
  select(-hp, -wt) %>%
  head()

# ===== MUTATE (create/modify columns) =====
# Create new columns
mtcars %>%
  mutate(
    mpg_per_cyl = mpg / cyl,
    high_mpg = mpg > 25
  ) %>%
  select(mpg, cyl, mpg_per_cyl, high_mpg) %>%
  head()

# ===== SUMMARISE (aggregate) =====
# Basic summary
mtcars %>%
  summarise(
    avg_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    min_mpg = min(mpg),
    max_mpg = max(mpg),
    n = n()
  )

# ===== GROUP_BY (grouped operations) =====
# Group by single variable
mtcars %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg),
    avg_hp = mean(hp),
    count = n()
  )

# ===== ARRANGE (sorting) =====
# Ascending (default)
mtcars %>%
  arrange(mpg) %>%
  select(mpg, cyl, hp) %>%
  head()

# Descending
mtcars %>%
  arrange(desc(mpg)) %>%
  select(mpg, cyl, hp) %>%
  head()

# ===== JOIN OPERATIONS =====
# Create two data frames for joining
employees <- tibble(
  emp_id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  dept_id = c(101, 102, 101, 103, 102)
)
departments <- tibble(
  dept_id = c(101, 102, 103, 104),
  dept_name = c("IT", "HR", "Finance", "Marketing"),
  location = c("NYC", "LA", "Chicago", "NYC")
)

# Inner join (keep only matches in both)
inner_join(employees, departments, by = "dept_id")

# Left join (keep all employees)
left_join(employees, departments, by = "dept_id")

# ===== TIDYR FUNCTIONS =====
library(tidyr)

# pivot_longer (gather) - wide to long
wide_data <- tibble(
  id = 1:3,
  Q1 = c(85, 90, 88),
  Q2 = c(78, 92, 85),
  Q3 = c(92, 88, 90)
)
long_data <- wide_data %>%
  pivot_longer(
    cols = starts_with("Q"),
    names_to = "quarter",
    values_to = "score"
  )
print(long_data)

# pivot_wider (spread) - long to wide
long_data %>%
  pivot_wider(
    names_from = quarter,
    values_from = score
  )

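For readers who want to see what the core verbs do without any packages, the sketch below reproduces a filter, group, summarise, and arrange sequence on mtcars in base R. It is one possible equivalent, not how dplyr works internally, and the object names are illustrative.

```r
six <- subset(mtcars, cyl == 6)                          # like filter(cyl == 6)
avg <- aggregate(mpg ~ gear, data = six, FUN = mean)     # like group_by(gear) + summarise()
names(avg)[names(avg) == "mpg"] <- "avg_mpg"             # name the summary column
avg <- avg[order(avg$avg_mpg, decreasing = TRUE), ]      # like arrange(desc(avg_mpg))
avg
```

The dplyr version reads top to bottom as one pipeline, which is the main practical argument for the verbs once an analysis has more than a few steps.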
🎨 ggplot2

🎯 Complete Definition

ggplot2 is a data visualization package for R based on the Grammar of Graphics. Created by Hadley Wickham, it provides a consistent, layered approach to creating statistical graphics. Plots are built by adding layers, allowing complex visualizations to be constructed intuitively.

📋 Grammar of Graphics Components

  • Data: The dataset to plot
  • Aesthetics (aes): Mapping variables to visual properties (x, y, color, size, shape)
  • Geometries (geom): Geometric objects (points, lines, bars, etc.)
  • Facets: Subplots (small multiples)
  • Statistics: Statistical transformations (smoothing, binning)
  • Coordinates: Coordinate system (cartesian, polar, etc.)
  • Themes: Visual styling (titles, legends, backgrounds)

🔧 Common Geometries

  • geom_point(): Scatter plots
  • geom_line(): Line charts
  • geom_bar(): Bar charts
  • geom_histogram(): Histograms
  • geom_boxplot(): Box plots
  • geom_density(): Density plots
  • geom_smooth(): Smoothed conditional means
# ggplot2 Visualization in R

# Load libraries
library(ggplot2)

# ===== BASIC STRUCTURE =====
# ggplot(data) + geom_function(aes(mappings))
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# ===== SCATTER PLOTS =====
# Basic scatter plot
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Add color by variable
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3)

# Add smooth line
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

# ===== BAR CHARTS =====
# Simple bar chart of counts
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "steelblue")

# Bar chart with fill by another variable
ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(position = "dodge")

# ===== HISTOGRAMS =====
# Basic histogram
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "skyblue", color = "black")

# ===== BOX PLOTS =====
# Basic box plot
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "lightgreen")

# Box plot with points
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5)

# ===== DENSITY PLOTS =====
# Basic density
ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "orange", alpha = 0.5)

# Multiple densities
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5)

# ===== FACETING =====
# Facet wrap (by one variable)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl)

# ===== CUSTOMIZING PLOTS =====
# Adding labels and title
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  labs(
    title = "Fuel Efficiency vs Weight",
    subtitle = "By Number of Cylinders",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon",
    color = "Cylinders"
  ) +
  theme_minimal()

📈 Statistics

🎯 Complete Definition

R's statistical capabilities are among its greatest strengths. It provides comprehensive functionality for descriptive statistics, probability distributions, statistical tests, linear and nonlinear modeling, time series analysis, and machine learning.

📊 Descriptive Statistics

  • summary(): Five-number summary + mean
  • fivenum(): Tukey's five numbers
  • quantile(): Sample quantiles
  • IQR(): Interquartile range
  • cor(), cov(): Correlation and covariance
  • table(), prop.table(): Frequency tables

📈 Probability Distributions

  • Normal: dnorm(), pnorm(), qnorm(), rnorm()
  • Binomial: dbinom(), pbinom(), qbinom(), rbinom()
  • Poisson: dpois(), ppois(), qpois(), rpois()
  • Uniform: dunif(), punif(), qunif(), runif()
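
The four prefixes follow one naming scheme: d for density (or probability mass), p for cumulative probability, q for quantile, and r for random draws. The sketch below checks how they relate to each other; the specific numbers are arbitrary.

```r
# q* inverts p*
p <- pnorm(1.5)
qnorm(p)                                      # 1.5

# d* gives probabilities for discrete distributions, which sum to 1
sum(dbinom(0:10, size = 10, prob = 0.3))      # 1

# p* is the running total of d*
cdf_direct <- ppois(2, lambda = 3)
cdf_summed <- sum(dpois(0:2, lambda = 3))
all.equal(cdf_direct, cdf_summed)             # TRUE

# r* draws random values; set.seed() makes the draws repeatable
set.seed(42); first  <- rnorm(3)
set.seed(42); second <- rnorm(3)
identical(first, second)                      # TRUE
```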

📋 Statistical Tests

  • t.test(): One and two-sample t-tests
  • wilcox.test(): Mann-Whitney/Wilcoxon tests
  • chisq.test(): Chi-squared tests
  • cor.test(): Test for association
  • shapiro.test(): Normality test

🔧 Linear Models

  • lm(): Linear regression
  • glm(): Generalized linear models
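  • anova(): Analysis of variance tables for one or more fitted models (also used to compare nested models)
  • aov(): Fit an analysis of variance model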
  • predict(): Model predictions
# Statistics in R

# ===== DESCRIPTIVE STATISTICS =====
data(mtcars)

# Summary statistics
summary(mtcars$mpg)

# Mean, median
mean(mtcars$mpg)
median(mtcars$mpg)

# Spread measures
sd(mtcars$mpg)
var(mtcars$mpg)
IQR(mtcars$mpg)

# Correlation
cor(mtcars$mpg, mtcars$wt)

# ===== FREQUENCY TABLES =====
# One-way table
table(mtcars$cyl)

# Two-way table
table(mtcars$cyl, mtcars$gear)

# ===== PROBABILITY DISTRIBUTIONS =====
# Normal distribution
x <- seq(-4, 4, length.out = 100)
plot(x, dnorm(x), type = "l", main = "Normal Distribution")
pnorm(1.96)     # Cumulative probability: 0.975
qnorm(0.975)    # Quantile: 1.959964
rnorm(10)       # Random sample

# ===== STATISTICAL TESTS =====
# One-sample t-test
t.test(mtcars$mpg, mu = 20)

# Two-sample t-test
t.test(mpg ~ am, data = mtcars)

# Chi-square test
chisq.test(table(mtcars$cyl, mtcars$gear))

# Correlation test
cor.test(mtcars$mpg, mtcars$wt)

# Test for normality
shapiro.test(mtcars$mpg)

# ===== LINEAR MODELS =====
# Simple linear regression
model1 <- lm(mpg ~ wt, data = mtcars)
summary(model1)

# Multiple linear regression
model2 <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model2)

# Extract model components
coef(model2)         # Coefficients
residuals(model2)    # Residuals
fitted(model2)       # Fitted values
anova(model2)        # ANOVA table

# Predictions
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5),
                       hp = c(100, 150, 200),
                       cyl = c(4, 6, 8))
predict(model2, new_cars)

# ===== ANALYSIS OF VARIANCE (ANOVA) =====
# One-way ANOVA
aov_model <- aov(mpg ~ factor(cyl), data = mtcars)
summary(aov_model)

# Tukey HSD post-hoc test
TukeyHSD(aov_model)
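glm() from the Linear Models list follows the same formula interface as lm(), with a family argument that selects the error distribution. As a minimal sketch, the model below is a logistic regression of transmission type (am) on weight in mtcars; the choice of variables is only for illustration.

```r
# Logistic regression: probability of a manual transmission (am = 1) given weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)
coef(fit)    # the wt coefficient is negative: heavier cars are less likely to be manual

# type = "response" returns probabilities rather than log-odds
new_cars <- data.frame(wt = c(2.5, 3.5))
prob <- predict(fit, newdata = new_cars, type = "response")
prob
```

Without type = "response", predict() returns values on the link scale (log-odds here), which is a common source of confusion.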

🔬 Advanced R

🎯 Complete Definition

Advanced R covers sophisticated programming techniques, performance optimization, object-oriented systems, metaprogramming, and interfacing with other languages. These concepts enable building robust, efficient, and maintainable R code.

🔬 Object-Oriented Systems

  • S3: Informal, generic functions, class attribute
  • S4: Formal classes, slots, validation
  • R6: Encapsulated, reference-based OOP

⚡ Performance Optimization

  • Profiling: Rprof(), profvis
  • Byte compilation: compiler package
  • Vectorization: Avoiding loops
  • Parallel computing: parallel, foreach
  • Rcpp: C++ integration
  • data.table: High-performance data manipulation

📝 Metaprogramming

  • Non-standard evaluation: substitute(), quote(), eval()
  • Tidy evaluation: {{ }}, enquo(), !!
  • Environments: new.env(), parent.env()
  • Closures: Functions that capture environments
# Advanced R Programming

# ===== S3 OBJECT SYSTEM =====
# Create an S3 class
person <- list(name = "Alice", age = 30)
class(person) <- "person"

# Define a generic function
greet <- function(x, ...) {
  UseMethod("greet")
}

# Define methods for the person class
greet.person <- function(x, ...) {
  paste("Hello, I'm", x$name, "and I'm", x$age, "years old.")
}

# Test
greet(person)

# ===== R6 CLASSES =====
library(R6)

Person <- R6Class("Person",
  public = list(
    name = NULL,
    age = NULL,
    initialize = function(name, age) {
      self$name <- name
      self$age <- age
    },
    introduce = function() {
      paste("I'm", self$name, ", age", self$age)
    },
    have_birthday = function() {
      self$age <- self$age + 1
      invisible(self)
    }
  )
)

# Create and use R6 object
alice <- Person$new("Alice", 30)
alice$introduce()
alice$have_birthday()
alice$age

# ===== ENVIRONMENTS =====
# Environments are fundamental to R's scoping
e1 <- new.env()
e1$x <- 10
e1$y <- 20

e2 <- new.env(parent = e1)
e2$x <- 5
e2$x                    # 5 (own x)
e2$y                    # NULL: $ only looks in e2 itself
get("y", envir = e2)    # 20 (get() searches the parent environments)

# ===== CLOSURES =====
# Functions that capture their environment
make_counter <- function(start = 0) {
  count <- start
  function() {
    count <<- count + 1
    count
  }
}
counter1 <- make_counter()
counter1()
counter1()

# ===== NON-STANDARD EVALUATION =====
# quote() captures expression without evaluating
expr <- quote(x + y)
expr

# eval() evaluates captured expression
x <- 5
y <- 3
eval(expr)

# ===== TIDY EVALUATION =====
library(dplyr)

# Correct tidy evaluation
good_var_summary <- function(df, var) {
  df %>%
    group_by(cyl) %>%
    summarise(avg = mean({{ var }}))    # Embrace with {{ }}
}
good_var_summary(mtcars, mpg)

# ===== RCPP: C++ INTEGRATION =====
library(Rcpp)

# Define C++ function in R
cppFunction('
int fibonacci_cpp(int n) {
  if (n < 2) return n;
  return fibonacci_cpp(n-1) + fibonacci_cpp(n-2);
}
')
fibonacci_cpp(10)

# ===== DATA.TABLE =====
library(data.table)

# Create data.table
dt <- data.table(
  id = 1:1e5,
  group = sample(letters[1:10], 1e5, replace = TRUE),
  value = rnorm(1e5)
)

# Fast operations
dt[, mean(value), by = group]
dt[, value_sq := value^2]

# ===== PARALLEL COMPUTING =====
library(parallel)

# Detect cores
ncores <- detectCores()
print(paste("Number of cores:", ncores))
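
substitute() appears in the metaprogramming list but not in the code above. It captures the expression a caller typed for an argument, which is how functions such as subset() accept bare column names. The helper label_of() below is hypothetical and exists only for this sketch.

```r
# substitute() returns the unevaluated argument expression; deparse() turns it into text
label_of <- function(x) deparse(substitute(x))
label_of(height + 1)    # "height + 1"  (height does not need to exist)

# A quoted expression can be evaluated with a data frame supplying the variables
ratio_expr <- quote(mpg / cyl)
ratio <- eval(ratio_expr, mtcars)
head(ratio, 2)          # 3.5 3.5
```

Because the expression is never evaluated inside label_of(), the same trick works for building plot labels or messages from whatever the caller passed.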