📊 R Introduction
🎯 Complete Definition
R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. Created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, R is an implementation of the S programming language with lexical scoping semantics inspired by Scheme.
🔬 Core Characteristics
- Statistical Computing: Built-in statistical models, tests, and analyses
- Data Visualization: Powerful graphics capabilities (base graphics, ggplot2, lattice)
- Vectorized Operations: Operations apply to entire vectors without explicit loops
- Functional Programming: Functions are first-class citizens
- Package Ecosystem: CRAN hosts over 18,000 packages
- Data Wrangling: Excellent tools for data manipulation (dplyr, data.table)
- Reproducible Research: R Markdown, knitr, Sweave
- Interoperability: Connects with Python, C++, SQL, and more
📊 Industry Usage
R is widely used in academia for statistics and data science. It is also used extensively in pharmaceuticals (clinical trials), finance (risk modeling), healthcare (bioinformatics), government (census data), and tech companies (data analysis). Major users include Google, Facebook, Pfizer, and the FDA.
# R Introduction - CodeOrbitPro
# Basic R operations
print("📊 Welcome to R Pro Track!")
# Create variables
language <- "R"
year <- 1993
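r_version <- "4.3.1" # Not named `version`, which would mask R's built-in version object
# Print with concatenation (sep = "" avoids stray spaces before punctuation)
cat("Hello, ", language, "! Created in ", year, ".\n", sep = "")
cat("Current version:", r_version)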
# Simple statistics
numbers <- c(1, 2, 3, 4, 5)
mean_value <- mean(numbers)
sd_value <- sd(numbers)
cat("\nMean:", mean_value)
cat("\nStandard deviation:", sd_value)
🔤 Basics & Syntax
🎯 Complete Definition
R syntax uses <- for assignment (though = also works), supports both functional and object-oriented paradigms, and has special handling for vectors and missing data.
📋 Key Elements
- Assignment: x <- 10 or x = 10 (<- is preferred)
- Comments: # for single-line comments (no multi-line comments)
- Case Sensitivity: R is case-sensitive (x and X are different)
- Console I/O: print(), cat(), readline(), scan()
- Packages: install.packages(), library(), require()
- Help System: ?function, help(function), example(function)
- Working Directory: getwd(), setwd()
- Special Values: NA (missing value), NULL (empty object, no value), NaN (not a number), Inf/-Inf (infinity)
# R Basics & Syntax
# Assignment operators
x <- 10 # Preferred
y = 20 # Also works but less common
30 -> z # Right assignment (rare)
# Printing
print(x) # [1] 10
cat("x =", x, "y =", y, "\n") # x = 10 y = 20
# Getting help
help(mean) # or ?mean
example(mean) # Run examples
# Working directory
getwd() # Current directory
# setwd("path/to/directory") # Change directory
# Special values
missing <- NA
undefined <- NULL
not_number <- 0/0 # NaN
infinite <- 1/0 # Inf
# Check types
is.na(missing) # TRUE
is.null(undefined) # TRUE
is.nan(not_number) # TRUE
is.infinite(infinite) # TRUE
# Listing objects
ls() # List all objects in environment
objects() # Same as ls()
# Removing objects
rm(x) # Remove single object
rm(list = ls()) # Remove all objects (clear workspace)
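The special values listed above behave quite differently once they sit inside a vector or a calculation. The following is a minimal sketch of those differences; the variable names are illustrative only.

```r
# How NA, NULL, NaN, and Inf behave in vectors and arithmetic
values <- c(4, NA, 10)

# NA propagates through arithmetic and most summaries...
na_sum <- sum(values)                    # NA
# ...unless missing values are dropped explicitly
clean_sum <- sum(values, na.rm = TRUE)   # 14

# NULL has length 0 and disappears when combined into a vector
with_null <- c(1, NULL, 3)
length(NULL)         # 0
length(with_null)    # 2

# NaN counts as missing for is.na(), but NA is not NaN
is.na(NaN)           # TRUE
is.nan(NA)           # FALSE

# Inf follows the usual rules for limits
inf_diff <- Inf - Inf   # NaN
```

In practice this means is.na() is the safer general check for "unusable" values, while is.nan() is only for distinguishing undefined arithmetic results.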
📦 Data Types
🎯 Complete Definition
R data types include atomic vectors (homogeneous) and lists (heterogeneous). R uses dynamic typing with type coercion and has specific classes for statistical data.
📊 Basic Types
- numeric: Double-precision floating point (default for numbers)
- integer: Integer values (suffix L: 42L)
- character: String values in quotes
- logical: TRUE, FALSE (or T, F - not recommended)
- complex: Complex numbers (3 + 2i)
- raw: Raw bytes
- factor: Categorical data with levels (a class built on integer vectors, not an atomic type)
- Date/POSIXct: Date and time classes (built on numeric vectors)
🔄 Type Checking & Coercion
- Check: class(), typeof(), mode(), is.numeric(), is.character()
- Coerce: as.numeric(), as.character(), as.logical(), as.factor()
# R Data Types
# Numeric (double by default)
num1 <- 10.5
num2 <- 42
typeof(num1) # "double"
class(num1) # "numeric"
# Integer (explicit)
int1 <- 42L
typeof(int1) # "integer"
# Character
char1 <- "Hello, R!"
char2 <- 'Single quotes also work'
typeof(char1) # "character"
# Logical
bool1 <- TRUE
bool2 <- FALSE
bool3 <- T # Not recommended, can be overwritten
typeof(bool1) # "logical"
# Complex
comp <- 3 + 4i
typeof(comp) # "complex"
Re(comp) # Real part: 3
Im(comp) # Imaginary part: 4
# Type checking
is.numeric(num1) # TRUE
is.integer(int1) # TRUE
is.character(char1) # TRUE
is.logical(bool1) # TRUE
# Type coercion (automatic)
mixed <- c(1, "hello", TRUE)
print(mixed) # All converted to character: "1" "hello" "TRUE"
# Explicit coercion
as.numeric("123") # 123
as.character(42) # "42"
as.logical(0) # FALSE
as.logical(1) # TRUE
as.numeric(TRUE) # 1
as.numeric(FALSE) # 0
# Checking for NA/NaN
x <- c(1, 2, NA, 4, NaN)
is.na(x) # FALSE FALSE TRUE FALSE TRUE
is.nan(x) # FALSE FALSE FALSE FALSE TRUE
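Automatic coercion follows a fixed order: logical, then integer, then double, then character. The sketch below illustrates that order and two related pitfalls; all variable names are made up for the example.

```r
# c() promotes its inputs to the most flexible type present
lgl_int <- c(TRUE, 2L)     # logical + integer   -> integer
int_dbl <- c(2L, 3.5)      # integer + double    -> double
dbl_chr <- c(3.5, "a")     # double + character  -> character

typeof(lgl_int)   # "integer"
typeof(int_dbl)   # "double"
typeof(dbl_chr)   # "character"

# Logical values count as 0/1 in arithmetic
n_true <- sum(c(TRUE, FALSE, TRUE))   # 2

# A failed conversion yields NA (with a warning), not an error
bad <- suppressWarnings(as.numeric("abc"))   # NA

# 42 and 42L compare equal but are stored as different types
same_value <- 42 == 42L              # TRUE
same_object <- identical(42, 42L)    # FALSE
```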
🔢 Vectors
🎯 Complete Definition
Vectors are the most fundamental data structure in R. They are homogeneous (same type) sequences of elements. Everything in R is built on vectors - even single values are vectors of length 1.
📏 Vector Operations
- Creation: c(), vector(), seq(), rep(), : operator
- Indexing: [ ] with positive integers, negative integers (exclude), logical vectors, names
- Vectorized operations: Operations apply element-wise
- Recycling: Shorter vectors are recycled to match longer ones
- Named vectors: Elements can have names for labeled access
- Attributes: names(), dim(), class() can be attached
🔧 Key Functions
- Creation: seq(from, to, by), seq_len(n), rep(x, times), rep_len(x, n)
- Manipulation: length(), unique(), duplicated(), sort(), order(), rank()
- Math: sum(), mean(), sd(), var(), min(), max(), range()
# Vectors in R
# Creating vectors
v1 <- c(1, 2, 3, 4, 5) # Combine function
v2 <- 1:10 # Sequence operator
v3 <- seq(from = 0, to = 1, by = 0.1) # Sequence with step
v4 <- rep(1:3, times = 4) # Repeat vector
v5 <- rep(1:3, each = 3) # Repeat each element
print(v1) # [1] 1 2 3 4 5
print(v2) # [1] 1 2 3 4 5 6 7 8 9 10
print(v3) # [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
print(v4) # [1] 1 2 3 1 2 3 1 2 3 1 2 3
print(v5) # [1] 1 1 1 2 2 2 3 3 3
# Named vectors
grades <- c(Alice = 95, Bob = 87, Charlie = 92)
print(grades)
# Alice Bob Charlie
# 95 87 92
# Access by name
grades["Alice"] # 95
# Vector operations (vectorized)
a <- c(1, 2, 3)
b <- c(4, 5, 6)
a + b # [1] 5 7 9
a * b # [1] 4 10 18
a^2 # [1] 1 4 9
sqrt(a) # [1] 1.000000 1.414214 1.732051
log(a) # Natural log
# Recycling (shorter vector recycled)
c(1, 2, 3, 4) + c(10, 20) # [1] 11 22 13 24
# Indexing
v <- c(10, 20, 30, 40, 50)
v[1] # First element: 10
v[3] # Third element: 30
v[1:3] # First three: 10 20 30
v[c(1, 3, 5)] # Specific positions: 10 30 50
v[-2] # Exclude second: 10 30 40 50
v[-c(1, 5)] # Exclude first and last: 20 30 40
# Logical indexing
v > 25 # [1] FALSE FALSE TRUE TRUE TRUE
v[v > 25] # [1] 30 40 50
# Vector functions
length(v) # 5
unique(c(1,1,2,2,3)) # [1] 1 2 3
sort(c(3,1,4,2)) # [1] 1 2 3 4
order(c(3,1,4,2)) # [1] 2 4 1 3 (indices that would sort)
# Summary statistics
x <- rnorm(100) # 100 random normal numbers
mean(x)
median(x)
sd(x)
summary(x) # Min, 1st Qu., Median, Mean, 3rd Qu., Max
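Two details of recycling and the : operator tend to surprise newcomers. This is a small sketch of both, assuming nothing beyond base R; the variable names are illustrative.

```r
# Recycling is silent when the longer length is a multiple of the shorter one
ok <- c(1, 2, 3, 4) + c(10, 20)                        # 11 22 13 24

# When it is not a multiple, R still computes a result but issues a warning
partial <- suppressWarnings(c(1, 2, 3) + c(10, 20))    # 11 22 13

# 1:n counts DOWN when n is 0, so 1:length(x) misbehaves on empty input
empty <- numeric(0)
bad_index <- 1:length(empty)      # 1 0  (two iterations in a for loop)
good_index <- seq_along(empty)    # integer(0)  (zero iterations)
```

For that reason seq_along(x) and seq_len(n) are usually the safer choice for loop indices.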
📋 Lists & Factors
🎯 Complete Definition
Lists are recursive vectors that can contain elements of different types, including other lists. Factors are vectors used for categorical data, storing integers with corresponding character labels (levels).
📋 Lists
- Creation: list(), pairlist()
- Access: [ ] returns list, [[ ]] returns element, $ for named elements
- Nesting: Lists can contain other lists
- Conversion: unlist() flattens to vector, as.list() converts vector to list
- Applications: Function returns, complex data structures, JSON-like data
📊 Factors
- Creation: factor(), as.factor(), gl() (generate levels)
- Properties: levels(), nlevels(), is.factor(), ordered() for ordinal
- Manipulation: droplevels() removes unused levels, relevel() changes reference
- Table creation: table() creates contingency tables from factors
- Important: Factors behave like integers in some contexts (caution needed)
# Lists and Factors in R
# ===== LISTS =====
# Creating lists
my_list <- list(
name = "John Doe",
age = 30,
scores = c(85, 92, 78),
is_student = FALSE,
nested = list(a = 1, b = 2)
)
print(my_list)
# Accessing list elements
my_list[1] # Returns list with first element: $name
my_list[[1]] # Returns content: "John Doe"
my_list$name # Returns content: "John Doe"
my_list[["name"]] # Returns content: "John Doe"
my_list$scores[2] # 92 (second score)
my_list$nested$a # 1
# Adding to list
my_list$new_element <- "added later"
# List operations
length(my_list) # Number of top-level elements
names(my_list) # Get or set names
unlist(my_list$scores) # Flatten to vector (already vector here)
# Combining lists
list1 <- list(a = 1, b = 2)
list2 <- list(c = 3, d = 4)
combined <- c(list1, list2)
print(combined)
# ===== FACTORS =====
# Creating factors
gender <- factor(c("male", "female", "female", "male", "male"))
print(gender)
# Check levels
levels(gender) # [1] "female" "male"
nlevels(gender) # 2
class(gender) # "factor"
# Ordered factors
education <- factor(
c("high school", "bachelor", "master", "phd", "bachelor"),
levels = c("high school", "bachelor", "master", "phd"),
ordered = TRUE
)
print(education)
# Comparing levels of an ordered factor
education[1] < education[2] # TRUE (high school < bachelor)
# Table from factors
gender_education <- table(gender, education)
print(gender_education)
# Changing factor levels
levels(gender) <- c("F", "M") # Order matters: first level becomes "F"
print(gender)
# Generate factor with gl()
gl(3, 2, labels = c("Low", "Medium", "High"))
# Dropping unused levels
factor_with_unused <- factor(c("A", "B", "A"), levels = c("A", "B", "C"))
droplevels(factor_with_unused) # Removes level "C"
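The caution about factors behaving like integers matters most when a factor holds numbers stored as text. A minimal sketch of the pitfall and two common workarounds follows; the readings vector is invented for illustration.

```r
# A factor whose labels happen to look like numbers
readings <- factor(c("10", "20", "10", "30"))

# as.numeric() returns the internal integer codes, not the labels
codes <- as.numeric(readings)                        # 1 2 1 3

# Convert through character to recover the actual values
values <- as.numeric(as.character(readings))         # 10 20 10 30

# Equivalent alternative: convert the levels once, then index by the factor
values2 <- as.numeric(levels(readings))[readings]    # 10 20 10 30
```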
📑 Data Frames
🎯 Complete Definition
Data frames are the fundamental data structure for tabular data in R. They are lists of equal-length vectors (columns), combining features of lists and matrices. Each column can be a different type, making them ideal for datasets with mixed variable types.
📋 Data Frame Operations
- Creation: data.frame(), as.data.frame(), read.table()/read.csv()
- Access: [rows, cols], $ for columns, [[ ]], subset()
- Inspection: head(), tail(), str(), summary(), dim(), nrow(), ncol()
- Manipulation: rbind() (add rows), cbind() (add columns), merge() (join)
- Subsetting: subset(), filter with logical conditions
- Transformation: transform(), within()
- Special: attach()/detach() (use with caution)
# Data Frames in R
# ===== CREATING DATA FRAMES =====
# From vectors
name <- c("Alice", "Bob", "Charlie", "Diana", "Eve")
age <- c(25, 30, 35, 28, 32)
salary <- c(50000, 60000, 75000, 55000, 70000)
department <- c("IT", "HR", "IT", "Finance", "HR")
employed <- c(TRUE, TRUE, FALSE, TRUE, TRUE)
df <- data.frame(
Name = name,
Age = age,
Salary = salary,
Department = department,
Employed = employed,
stringsAsFactors = FALSE # Keep strings as character (the default since R 4.0.0)
)
print(df)
# ===== INSPECTING DATA FRAMES =====
head(df, 3) # First 3 rows
tail(df, 2) # Last 2 rows
str(df) # Structure
summary(df) # Summary statistics
dim(df) # Dimensions: [1] 5 5
nrow(df) # Number of rows: 5
ncol(df) # Number of columns: 5
names(df) # Column names
colnames(df) # Same as names()
rownames(df) # Row names
# ===== ACCESSING DATA =====
# By column
df$Name # Vector: "Alice" "Bob" "Charlie" "Diana" "Eve"
df[["Age"]] # Vector: 25 30 35 28 32
df[, "Salary"] # Vector: 50000 60000 75000 55000 70000
df[, 3] # Third column (Salary)
# By row and column
df[1, ] # First row (all columns)
df[1:3, ] # First three rows
df[1, "Name"] # "Alice"
df[1, 1] # "Alice"
df[1, c("Name", "Age")] # First row, Name and Age
# By condition
df[df$Age > 30, ] # Rows with Age > 30
df[df$Department == "IT", ] # IT department only
df[df$Salary > 60000 & df$Employed, ] # Employed with Salary > 60000
# ===== SUBSET FUNCTION =====
subset(df, Age > 30)
subset(df, Department == "IT", select = c(Name, Salary))
subset(df, Salary > 60000, select = -c(Department))
# ===== ADDING/REMOVING COLUMNS =====
# Add column
df$Bonus <- df$Salary * 0.1
df[["YearsEmployed"]] <- c(2, 5, 0, 3, 4)
# Add multiple columns using transform
# (stored in a separate object so df keeps its original columns for rbind() below)
df_extended <- transform(df,
TotalComp = Salary + Bonus,
AgeGroup = ifelse(Age < 30, "Young", "Senior"))
# Remove column
df$Bonus <- NULL # Remove Bonus column
df[["YearsEmployed"]] <- NULL # Remove YearsEmployed
# ===== ADDING ROWS =====
# Add single row
new_employee <- data.frame(
Name = "Frank",
Age = 29,
Salary = 62000,
Department = "Finance",
Employed = TRUE,
stringsAsFactors = FALSE
)
df <- rbind(df, new_employee)
# Add multiple rows
more_employees <- data.frame(
Name = c("Grace", "Henry"),
Age = c(27, 33),
Salary = c(58000, 72000),
Department = c("IT", "HR"),
Employed = c(TRUE, TRUE),
stringsAsFactors = FALSE
)
df <- rbind(df, more_employees)
# ===== COMBINING DATA FRAMES =====
# Create second data frame
df2 <- data.frame(
Name = c("Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Henry"),
City = c("NYC", "LA", "Chicago", "NYC", "LA", "Chicago", "NYC", "LA"),
stringsAsFactors = FALSE
)
# Merge (join) data frames
df_merged <- merge(df, df2, by = "Name")
print(df_merged)
# ===== IMPORTANT FUNCTIONS =====
# Order/sort data frame
df_ordered <- df[order(df$Age, decreasing = TRUE), ] # Sort by Age descending
df_ordered <- df[order(df$Department, df$Salary), ] # Sort by Department, then Salary
# Remove rows with NA
df_complete <- na.omit(df)
# Unique rows
df_unique <- unique(df)
# Apply functions to columns
lapply(df[, c("Age", "Salary")], mean)
sapply(df[, c("Age", "Salary")], summary)
# ===== AGGREGATION =====
# Aggregate by department
agg <- aggregate(Salary ~ Department, data = df, FUN = mean)
print(agg)
# Table of counts
table(df$Department, df$Employed)
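Two behaviors of the access and merge operations listed above are easy to miss: single-column selection silently drops to a vector, and merge() performs an inner join unless told otherwise. The sketch below uses small made-up data frames (staff, cities) to show both.

```r
staff <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35)
)

# Selecting one column with [ , ] returns a plain vector...
one_col_vec <- staff[, "Age"]
# ...unless drop = FALSE is given, which keeps the data frame structure
one_col_df <- staff[, "Age", drop = FALSE]

cities <- data.frame(
  Name = c("Alice", "Bob"),
  City = c("NYC", "LA")
)

# merge() defaults to an inner join: Charlie has no match and is dropped
inner <- merge(staff, cities, by = "Name")

# all.x = TRUE keeps every row of the first data frame (a left join)
left <- merge(staff, cities, by = "Name", all.x = TRUE)
```

Similarly, all.y = TRUE gives a right join and all = TRUE a full outer join.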
🔲 Matrices & Arrays
🎯 Complete Definition
Matrices are two-dimensional arrays where all elements are the same type. Arrays generalize matrices to any number of dimensions. They are fundamental for linear algebra operations and multidimensional data.
🔲 Matrix Operations
- Creation: matrix(), rbind(), cbind(), diag()
- Dimensions: dim(), nrow(), ncol(), length()
- Access: [row, col] indexing
- Linear Algebra: %*% (matrix multiplication), t() (transpose), solve() (inverse), eigen()
- Row/Column Operations: rowSums(), colSums(), rowMeans(), colMeans()
- Special Matrices: diag() (diagonal), outer() (outer product)
📊 Arrays
- Creation: array(data, dim)
- Dimensions: dim() to get/set dimensions
- Indexing: [dim1, dim2, dim3, ...]
- aperm(): Permute array dimensions (like transpose for higher dimensions)
# Matrices and Arrays in R
# ===== CREATING MATRICES =====
# By column (default)
m1 <- matrix(1:12, nrow = 3, ncol = 4)
print(m1)
# By row
m2 <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
print(m2)
# From vectors using rbind and cbind
row1 <- c(1, 2, 3)
row2 <- c(4, 5, 6)
m3 <- rbind(row1, row2) # Bind by row
print(m3)
col1 <- c(1, 4)
col2 <- c(2, 5)
col3 <- c(3, 6)
m4 <- cbind(col1, col2, col3) # Bind by column
print(m4)
# Diagonal matrix
diag(1, nrow = 3) # Identity matrix
diag(c(10, 20, 30)) # Diagonal with specified values
# ===== MATRIX PROPERTIES =====
m <- matrix(1:9, nrow = 3)
dim(m) # [1] 3 3
nrow(m) # [1] 3
ncol(m) # [1] 3
length(m) # [1] 9 (total elements)
# ===== INDEXING MATRICES =====
m <- matrix(1:12, nrow = 3)
m[2, 3] # Element at row 2, column 3: 8
m[1, ] # First row: 1 4 7 10
m[, 2] # Second column: 4 5 6
m[1:2, 3:4] # Rows 1-2, columns 3-4
m[c(1, 3), ] # Rows 1 and 3
# ===== MATRIX OPERATIONS =====
A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)
# Element-wise operations
A + B
A * B # Element-wise multiplication, NOT matrix multiplication
A / B
A^2
# Matrix multiplication
A %*% B # True matrix multiplication
# Transpose
t(A)
# Determinant
det(matrix(c(1,2,3,4), nrow = 2))
# Inverse
solve(A) # Inverse of A (if invertible)
# Eigenvalues and eigenvectors
eigen(A)
# ===== ROW/COLUMN STATISTICS =====
m <- matrix(1:12, nrow = 3)
rowSums(m) # Sum of each row: 22 26 30
colSums(m) # Sum of each column: 6 15 24 33
rowMeans(m) # Mean of each row: 5.5 6.5 7.5
colMeans(m) # Mean of each column: 2 5 8 11
apply(m, 1, sum) # Same as rowSums (apply over rows)
apply(m, 2, mean) # Same as colMeans (apply over columns)
# ===== ARRAYS (3+ dimensions) =====
# Create 3D array (2 rows, 3 columns, 4 layers)
arr <- array(1:24, dim = c(2, 3, 4))
print(arr)
# Accessing array elements
arr[1, 2, 3] # Row 1, Col 2, Layer 3: 15
arr[, , 1] # First layer (2x3 matrix)
arr[1, , ] # First row across all layers
arr[, 2, ] # Second column across all layers
# Dimensions of array
dim(arr) # [1] 2 3 4
# Permute dimensions
aperm(arr, c(2, 1, 3)) # Swap first two dimensions
# ===== OUTER PRODUCT =====
x <- 1:3
y <- 4:6
outer(x, y, FUN = "*") # Multiplication table
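Beyond computing an inverse, solve() can solve a linear system A x = b directly, which is generally preferable to forming the inverse first. A small sketch with an invented 2x2 system:

```r
# System:  2x + 1y = 3
#          1x + 3y = 5
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # filled column by column
b <- c(3, 5)

# Solve A %*% x = b without computing the inverse explicitly
x <- solve(A, b)                        # 0.8 1.4

# Check the solution by substituting it back
check <- as.vector(A %*% x)

# The inverse times the original matrix gives the identity (up to rounding),
# so compare with all.equal() rather than ==
is_identity <- isTRUE(all.equal(solve(A) %*% A, diag(2)))
```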
🔄 Control Flow
🎯 Complete Definition
Control flow statements in R include conditional execution and loops. R also provides vectorized alternatives that are often more efficient than explicit loops.
🔀 Conditional Statements
- if/else: if (condition) expr else expr
- ifelse(): Vectorized conditional: ifelse(test, yes, no)
- switch(): Multi-way branch based on value
🔄 Loops
- for: for (var in sequence) { expr }
- while: while (condition) { expr }
- repeat: repeat { expr; if (condition) break }
- break/next: Exit loop or skip iteration
⚡ Vectorized Alternatives
- Operations on vectors are usually faster than loops
- ifelse() for vectorized conditional
- apply family functions for iterative operations
# Control Flow in R
# ===== IF/ELSE STATEMENTS =====
x <- 10
if (x > 5) {
print("x is greater than 5")
} else {
print("x is less than or equal to 5")
}
# Multiple conditions
score <- 85
if (score >= 90) {
grade <- "A"
} else if (score >= 80) {
grade <- "B"
} else if (score >= 70) {
grade <- "C"
} else {
grade <- "F"
}
print(paste("Grade:", grade))
# ===== VECTORIZED IFELSE =====
ages <- c(15, 25, 35, 45, 55, 65)
age_groups <- ifelse(ages < 30, "Young",
ifelse(ages < 50, "Middle", "Senior"))
print(age_groups)
# Example with data frame
df <- data.frame(
name = c("Alice", "Bob", "Charlie", "Diana"),
score = c(92, 67, 88, 45)
)
df$result <- ifelse(df$score >= 70, "Pass", "Fail")
print(df)
# ===== SWITCH STATEMENT =====
# switch can work with numbers or strings
operation <- "mean"
values <- c(10, 20, 30, 40, 50)
result <- switch(operation,
"mean" = mean(values),
"median" = median(values),
"sum" = sum(values),
"min" = min(values),
"max" = max(values),
stop("Unknown operation")
)
print(result)
# ===== FOR LOOPS =====
# Basic for loop
for (i in 1:5) {
print(paste("Iteration:", i))
}
# Loop over vector
fruits <- c("apple", "banana", "cherry", "date")
for (fruit in fruits) {
print(paste("I like", fruit))
}
# Loop with index
for (i in seq_along(fruits)) {
print(paste("Fruit", i, "is", fruits[i]))
}
# ===== WHILE LOOPS =====
counter <- 1
while (counter <= 5) {
print(paste("Counter:", counter))
counter <- counter + 1
}
# While with condition
x <- 1
while (x < 100) {
print(x)
x <- x * 2
}
# ===== BREAK AND NEXT =====
# break - exit loop
for (i in 1:10) {
if (i == 6) {
break
}
print(i)
}
# next - skip iteration
for (i in 1:10) {
if (i %% 2 == 0) { # Skip even numbers
next
}
print(i)
}
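The repeat loop and the numeric form of switch() appear in the lists above but not in the examples. The following sketch fills that gap and also shows how to test a whole vector inside if(); all names and values are illustrative.

```r
# repeat runs until break is reached: halve a value until it drops below 1
value <- 100
steps <- 0
repeat {
  value <- value / 2
  steps <- steps + 1
  if (value < 1) break
}
steps   # 7

# switch() with a number selects by position
pick <- switch(2, "first", "second", "third")   # "second"

# With a string and no default branch, an unmatched value returns NULL
no_match <- switch("z", a = 1, b = 2)

# if() expects a single TRUE/FALSE (a longer condition is an error in R >= 4.2.0),
# so reduce a vector test with any() or all() first
scores <- c(55, 80, 91)
status <- if (any(scores < 60)) "someone failed" else "all passed"
```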
⚙️ Functions
🎯 Complete Definition
Functions in R are first-class objects that encapsulate reusable code. They take arguments, perform operations, and return values. R supports functional programming concepts like closures, lexical scoping, and higher-order functions.
⚙️ Function Components
- Name: Optional identifier for the function
- Arguments: Formal parameters (can have default values)
- Body: Code inside { } that implements the function
- Return value: Last evaluated expression or explicit return()
- Environment: Where the function looks for variables
📋 Function Features
- Default arguments: function(x, y = 10)
- Lazy evaluation: Arguments evaluated only when used
- ... (ellipsis): Pass arbitrary number of arguments
- Anonymous functions: Functions without names
- Closures: Functions that capture the environment they were created in (typically functions returned by other functions)
- Recursion: Functions that call themselves
# Functions in R
# ===== BASIC FUNCTION DEFINITION =====
# Simple function
greet <- function(name) {
paste0("Hello, ", name, "!") # paste0 avoids the stray space before "!"
}
# Call the function
greet("Alice")
# Function with multiple arguments
add_numbers <- function(a, b) {
result <- a + b
return(result) # Explicit return
}
add_numbers(5, 3)
# Function without return (last expression returned)
multiply <- function(a, b) {
a * b # Implicit return
}
multiply(4, 5)
# ===== DEFAULT ARGUMENTS =====
power <- function(x, exponent = 2) {
x^exponent
}
power(5) # [1] 25 (default exponent = 2)
power(5, 3) # [1] 125 (exponent = 3)
# ===== MULTIPLE RETURN VALUES =====
# Return a list for multiple values
stats <- function(x) {
list(
mean = mean(x),
median = median(x),
sd = sd(x),
min = min(x),
max = max(x)
)
}
values <- c(10, 20, 30, 40, 50)
result <- stats(values)
print(result$mean)
# ===== ANONYMOUS FUNCTIONS =====
# Used on the fly without naming
squared <- sapply(1:5, function(x) x^2)
print(squared)
# ===== CLOSURES (FUNCTIONS RETURNING FUNCTIONS) =====
power_factory <- function(exponent) {
force(exponent) # Evaluate the argument now, guarding against lazy-evaluation surprises
function(x) {
x^exponent
}
}
square <- power_factory(2)
cube <- power_factory(3)
square(5) # [1] 25
cube(5) # [1] 125
# ===== RECURSIVE FUNCTIONS =====
# Factorial
factorial_rec <- function(n) {
if (n <= 1) {
return(1)
} else {
return(n * factorial_rec(n - 1))
}
}
factorial_rec(5) # [1] 120
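Lazy evaluation and the ... argument are listed among the function features above but are not shown in the examples. Here is a minimal sketch of both; the helper names (mean_of, ignore_second, scale_to, count_args) are hypothetical.

```r
# ... forwards any extra arguments to another function
mean_of <- function(x, ...) {
  mean(x, ...)
}
r1 <- mean_of(c(1, NA, 3), na.rm = TRUE)   # 2

# Lazy evaluation: an argument that is never used is never evaluated,
# so the stop() call below does not run
ignore_second <- function(a, b) {
  a * 2
}
r2 <- ignore_second(5, stop("never evaluated"))   # 10

# Default values are evaluated inside the function when first needed,
# so they may refer to other arguments
scale_to <- function(x, upper = max(x)) {
  x / upper
}
r3 <- scale_to(c(1, 2, 4))   # 0.25 0.50 1.00

# list(...) captures the extra arguments for inspection
count_args <- function(...) {
  length(list(...))
}
n_args <- count_args(1, "a", TRUE)   # 3
```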
🚀 Apply Family
🎯 Complete Definition
The apply family provides functions for applying operations over margins of arrays, lists, or vectors, offering concise functional alternatives to explicit loops. They are core to R's functional programming style and are usually shorter and clearer than hand-written loops, though not necessarily faster, since most of them still call the supplied function once per element.
📋 Apply Functions Overview
- apply(): Apply function to margins of arrays/matrices
- lapply(): Apply function to each element of a list, return list
- sapply(): Simplify lapply result to vector or matrix if possible
- vapply(): Safer version of sapply with pre-specified return type
- mapply(): Multivariate version of sapply
- tapply(): Apply function over subsets (by factor)
# Apply Family Functions in R
# ===== APPLY (on matrices/arrays) =====
# Create a matrix
m <- matrix(1:12, nrow = 3)
print(m)
# Apply over rows (MARGIN = 1)
row_sums <- apply(m, 1, sum)
print(row_sums)
# Apply over columns (MARGIN = 2)
col_means <- apply(m, 2, mean)
print(col_means)
# ===== LAPPLY (list apply, returns list) =====
# Create a list
my_list <- list(
a = 1:5,
b = 6:10,
c = 11:15
)
# Apply mean to each element
means_list <- lapply(my_list, mean)
print(means_list)
# lapply on data frame (data frame is a list of columns)
df <- data.frame(
age = c(25, 30, 35, 28, 32),
salary = c(50000, 60000, 75000, 55000, 70000),
years = c(2, 5, 8, 3, 6)
)
lapply(df, mean) # Means of each column
# ===== SAPPLY (simplified lapply) =====
# Returns vector when possible
means_vector <- sapply(my_list, mean)
print(means_vector)
# Returns matrix when appropriate
stats_matrix <- sapply(my_list, function(x) {
c(mean = mean(x), sd = sd(x), n = length(x))
})
print(stats_matrix)
# ===== VAPPLY (type-safe sapply) =====
# Specify return type for safety
vapply(my_list, mean, numeric(1))
# ===== MAPPLY (multivariate apply) =====
# Iterates over several arguments in step (element by element); this is not multi-core parallelism
mapply(rep, 1:4, times = 4:1)
# ===== TAPPLY (apply by groups) =====
# Grouped operations
data <- data.frame(
group = rep(c("A", "B", "C"), each = 5),
value = c(rnorm(5, mean = 10), rnorm(5, mean = 20), rnorm(5, mean = 30))
)
# Calculate mean by group
tapply(data$value, data$group, mean)
# With factor
tapply(mtcars$mpg, mtcars$cyl, mean)
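The reason vapply() is described as the safer variant is that the return shape of sapply() depends on the data. This sketch shows the difference on edge cases, using invented inputs.

```r
# On empty input, sapply() returns an empty list, vapply() a typed empty vector
empty_input <- list()
s <- sapply(empty_input, length)               # list()
v <- vapply(empty_input, length, integer(1))   # integer(0)

# sapply() simplifies only when every result has the same length
regular <- sapply(list(1:2, 3:4), function(x) x * 2)   # 2 x 2 matrix
ragged <- sapply(list(1:2, 1:3), function(x) x * 2)    # falls back to a list

# vapply() fails immediately when a result does not match the declared template
res <- tryCatch(
  vapply(list(1:2, 1:3), function(x) x * 2, numeric(2)),
  error = function(e) "error"
)
```

Inside functions and packages, where inputs are not known in advance, vapply() makes such shape changes visible instead of letting them propagate.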
📊 dplyr & tidyverse
🎯 Complete Definition
dplyr is a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. It's part of the tidyverse, a collection of R packages designed for data science that share an underlying design philosophy, grammar, and data structures.
📋 Core dplyr Verbs
- filter(): Subset rows based on conditions
- select(): Choose columns by name
- mutate(): Create new columns
- summarise(): Collapse multiple values to single summary
- arrange(): Reorder rows
- group_by(): Group data for operations
- join functions: left_join(), right_join(), inner_join(), full_join()
🔧 Additional Tidyverse Packages
- tidyr: Tidy data (pivot_longer, pivot_wider, separate, unite)
- purrr: Functional programming tools
- tibble: Modern reimagining of data frames
- stringr: String manipulation
- forcats: Factor manipulation
⚡ Key Features
- Pipe operator (%>%): Chain operations together (R 4.1.0 and later also provide a native |> pipe)
- Consistent syntax: First argument is data, returns data frame
# dplyr and tidyverse in R
# Load libraries
library(dplyr)
# ===== PIPE OPERATOR (%>%) =====
# Without pipe
result <- arrange(summarise(group_by(filter(mtcars, cyl == 6),
gear), avg_mpg = mean(mpg)), desc(avg_mpg))
# With pipe (much more readable)
result <- mtcars %>%
filter(cyl == 6) %>%
group_by(gear) %>%
summarise(avg_mpg = mean(mpg)) %>%
arrange(desc(avg_mpg))
print(result)
# ===== FILTER (row selection) =====
# Filter rows based on conditions
mtcars %>%
filter(mpg > 20) %>%
head()
# Multiple conditions
mtcars %>%
filter(cyl == 6 & hp > 100)
# ===== SELECT (column selection) =====
# Select columns by name
mtcars %>%
select(mpg, cyl, hp) %>%
head()
# Select with helpers
mtcars %>%
select(starts_with("c"))
# Remove columns
mtcars %>%
select(-hp, -wt) %>%
head()
# ===== MUTATE (create/modify columns) =====
# Create new columns
mtcars %>%
mutate(
mpg_per_cyl = mpg / cyl,
high_mpg = mpg > 25
) %>%
select(mpg, cyl, mpg_per_cyl, high_mpg) %>%
head()
# ===== SUMMARISE (aggregate) =====
# Basic summary
mtcars %>%
summarise(
avg_mpg = mean(mpg),
sd_mpg = sd(mpg),
min_mpg = min(mpg),
max_mpg = max(mpg),
n = n()
)
# ===== GROUP_BY (grouped operations) =====
# Group by single variable
mtcars %>%
group_by(cyl) %>%
summarise(
avg_mpg = mean(mpg),
avg_hp = mean(hp),
count = n()
)
# ===== ARRANGE (sorting) =====
# Ascending (default)
mtcars %>%
arrange(mpg) %>%
select(mpg, cyl, hp) %>%
head()
# Descending
mtcars %>%
arrange(desc(mpg)) %>%
select(mpg, cyl, hp) %>%
head()
# ===== JOIN OPERATIONS =====
# Create two data frames for joining
employees <- tibble(
emp_id = 1:5,
name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
dept_id = c(101, 102, 101, 103, 102)
)
departments <- tibble(
dept_id = c(101, 102, 103, 104),
dept_name = c("IT", "HR", "Finance", "Marketing"),
location = c("NYC", "LA", "Chicago", "NYC")
)
# Inner join (keep only matches in both)
inner_join(employees, departments, by = "dept_id")
# Left join (keep all employees)
left_join(employees, departments, by = "dept_id")
# ===== TIDYR FUNCTIONS =====
library(tidyr)
# pivot_longer (gather) - wide to long
wide_data <- tibble(
id = 1:3,
Q1 = c(85, 90, 88),
Q2 = c(78, 92, 85),
Q3 = c(92, 88, 90)
)
long_data <- wide_data %>%
pivot_longer(
cols = starts_with("Q"),
names_to = "quarter",
values_to = "score"
)
print(long_data)
# pivot_wider (spread) - long to wide
long_data %>%
pivot_wider(
names_from = quarter,
values_from = score
)
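For comparison, the first piped example in the code above (mean mpg per gear for six-cylinder cars, sorted in descending order) can be reproduced in base R. This is only a sketch for cross-checking results without dplyr installed; six_cyl and by_gear are illustrative names.

```r
# filter(cyl == 6)
six_cyl <- subset(mtcars, cyl == 6)

# group_by(gear) + summarise(avg_mpg = mean(mpg))
by_gear <- aggregate(mpg ~ gear, data = six_cyl, FUN = mean)
names(by_gear)[2] <- "avg_mpg"

# arrange(desc(avg_mpg))
by_gear <- by_gear[order(by_gear$avg_mpg, decreasing = TRUE), ]
print(by_gear)
```

The base version needs an intermediate variable and a renaming step, which is the kind of bookkeeping the dplyr verbs are designed to remove.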
🎨 ggplot2
🎯 Complete Definition
ggplot2 is a data visualization package for R based on the Grammar of Graphics. Created by Hadley Wickham, it provides a consistent, layered approach to creating statistical graphics. Plots are built by adding layers, allowing complex visualizations to be constructed intuitively.
📋 Grammar of Graphics Components
- Data: The dataset to plot
- Aesthetics (aes): Mapping variables to visual properties (x, y, color, size, shape)
- Geometries (geom): Geometric objects (points, lines, bars, etc.)
- Facets: Subplots (small multiples)
- Statistics: Statistical transformations (smoothing, binning)
- Coordinates: Coordinate system (cartesian, polar, etc.)
- Themes: Visual styling (titles, legends, backgrounds)
🔧 Common Geometries
- geom_point(): Scatter plots
- geom_line(): Line charts
- geom_bar(): Bar charts
- geom_histogram(): Histograms
- geom_boxplot(): Box plots
- geom_density(): Density plots
- geom_smooth(): Smoothed conditional means
# ggplot2 Visualization in R
# Load libraries
library(ggplot2)
# ===== BASIC STRUCTURE =====
# ggplot(data) + geom_function(aes(mappings))
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()
# ===== SCATTER PLOTS =====
# Basic scatter plot
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()
# Add color by variable
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 3)
# Add smooth line
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE)
# ===== BAR CHARTS =====
# Simple bar chart of counts
ggplot(mtcars, aes(x = factor(cyl))) +
geom_bar(fill = "steelblue")
# Bar chart with fill by another variable
ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
geom_bar(position = "dodge")
# ===== HISTOGRAMS =====
# Basic histogram
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(bins = 10, fill = "skyblue", color = "black")
# ===== BOX PLOTS =====
# Basic box plot
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
geom_boxplot(fill = "lightgreen")
# Box plot with points
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
geom_boxplot() +
geom_jitter(width = 0.2, alpha = 0.5)
# ===== DENSITY PLOTS =====
# Basic density
ggplot(mtcars, aes(x = mpg)) +
geom_density(fill = "orange", alpha = 0.5)
# Multiple densities
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
geom_density(alpha = 0.5)
# ===== FACETING =====
# Facet wrap (by one variable)
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
facet_wrap(~ cyl)
# ===== CUSTOMIZING PLOTS =====
# Adding labels and title
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
labs(
title = "Fuel Efficiency vs Weight",
subtitle = "By Number of Cylinders",
x = "Weight (1000 lbs)",
y = "Miles per Gallon",
color = "Cylinders"
) +
theme_minimal()
📈 Statistics
🎯 Complete Definition
R's statistical capabilities are among its greatest strengths. It provides comprehensive functionality for descriptive statistics, probability distributions, statistical tests, linear and nonlinear modeling, time series analysis, and machine learning.
📊 Descriptive Statistics
- summary(): Five-number summary + mean
- fivenum(): Tukey's five numbers
- quantile(): Sample quantiles
- IQR(): Interquartile range
- cor(), cov(): Correlation and covariance
- table(), prop.table(): Frequency tables
📈 Probability Distributions
- Normal: dnorm(), pnorm(), qnorm(), rnorm()
- Binomial: dbinom(), pbinom(), qbinom(), rbinom()
- Poisson: dpois(), ppois(), qpois(), rpois()
- Uniform: dunif(), punif(), qunif(), runif()
📋 Statistical Tests
- t.test(): One and two-sample t-tests
- wilcox.test(): Mann-Whitney/Wilcoxon tests
- chisq.test(): Chi-squared tests
- cor.test(): Test for association
- shapiro.test(): Normality test
🔧 Linear Models
- lm(): Linear regression
- glm(): Generalized linear models
- anova(): ANOVA table for a fitted model, or a comparison of nested models
- aov(): Fit an analysis of variance model
- predict(): Model predictions
# Statistics in R
# ===== DESCRIPTIVE STATISTICS =====
data(mtcars)
# Summary statistics
summary(mtcars$mpg)
# Mean, median
mean(mtcars$mpg)
median(mtcars$mpg)
# Spread measures
sd(mtcars$mpg)
var(mtcars$mpg)
IQR(mtcars$mpg)
# Correlation
cor(mtcars$mpg, mtcars$wt)
# ===== FREQUENCY TABLES =====
# One-way table
table(mtcars$cyl)
# Two-way table
table(mtcars$cyl, mtcars$gear)
# ===== PROBABILITY DISTRIBUTIONS =====
# Normal distribution
x <- seq(-4, 4, length.out = 100) # Full argument name instead of relying on partial matching
plot(x, dnorm(x), type = "l", main = "Normal Distribution")
pnorm(1.96) # Cumulative probability: 0.975
qnorm(0.975) # Quantile: 1.959964
rnorm(10) # Random sample
# ===== STATISTICAL TESTS =====
# One-sample t-test
t.test(mtcars$mpg, mu = 20)
# Two-sample t-test (Welch's test by default, since var.equal = FALSE)
t.test(mpg ~ am, data = mtcars)
# Chi-square test (expect an approximation warning here because of small cell counts)
chisq.test(table(mtcars$cyl, mtcars$gear))
# Correlation test
cor.test(mtcars$mpg, mtcars$wt)
# Test for normality
shapiro.test(mtcars$mpg)
# ===== LINEAR MODELS =====
# Simple linear regression
model1 <- lm(mpg ~ wt, data = mtcars)
summary(model1)
# Multiple linear regression
model2 <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model2)
# Extract model components
coef(model2) # Coefficients
residuals(model2) # Residuals
fitted(model2) # Fitted values
anova(model2) # ANOVA table
# Predictions
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5),
hp = c(100, 150, 200),
cyl = c(4, 6, 8))
predict(model2, new_cars)
# ===== ANALYSIS OF VARIANCE (ANOVA) =====
# One-way ANOVA
aov_model <- aov(mpg ~ factor(cyl), data = mtcars)
summary(aov_model)
# Tukey HSD post-hoc test
TukeyHSD(aov_model)
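The d/p/q/r prefixes in the distribution list above are related to each other in a fixed way: p is the cumulative form of d, q inverts p, and r draws random values. The following sketch checks those relationships numerically; the parameter values are arbitrary.

```r
# q is the inverse of p
p <- pnorm(1.5)
back <- qnorm(p)                      # 1.5

# For a discrete distribution, d gives probabilities that sum to 1
probs <- dbinom(0:10, size = 10, prob = 0.3)
total <- sum(probs)                   # 1

# p is the running total of d
cum <- pbinom(3, size = 10, prob = 0.3)
cum_manual <- sum(dbinom(0:3, size = 10, prob = 0.3))

# r draws random values; set.seed() makes the draws reproducible
set.seed(42)
draws1 <- rnorm(5)
set.seed(42)
draws2 <- rnorm(5)
same_draws <- identical(draws1, draws2)   # TRUE
```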
🔬 Advanced R
🎯 Complete Definition
Advanced R covers sophisticated programming techniques, performance optimization, object-oriented systems, metaprogramming, and interfacing with other languages. These concepts enable building robust, efficient, and maintainable R code.
🔬 Object-Oriented Systems
- S3: Informal, generic functions, class attribute
- S4: Formal classes, slots, validation
- R6: Encapsulated, reference-based OOP
⚡ Performance Optimization
- Profiling: Rprof(), profvis
- Byte compilation: compiler package
- Vectorization: Avoiding loops
- Parallel computing: parallel, foreach
- Rcpp: C++ integration
- data.table: High-performance data manipulation
📝 Metaprogramming
- Non-standard evaluation: substitute(), quote(), eval()
- Tidy evaluation: {{ }}, enquo(), !!
- Environments: new.env(), parent.env()
- Closures: Functions that capture environments
# Advanced R Programming
# ===== S3 OBJECT SYSTEM =====
# Create an S3 class
person <- list(name = "Alice", age = 30)
class(person) <- "person"
# Define a generic function
greet <- function(x, ...) {
UseMethod("greet")
}
# Define methods for the person class
greet.person <- function(x, ...) {
paste("Hello, I'm", x$name, "and I'm", x$age, "years old.")
}
# Test
greet(person)
# ===== R6 CLASSES =====
library(R6)
Person <- R6Class("Person",
public = list(
name = NULL,
age = NULL,
initialize = function(name, age) {
self$name <- name
self$age <- age
},
introduce = function() {
paste0("I'm ", self$name, ", age ", self$age)
},
have_birthday = function() {
self$age <- self$age + 1
invisible(self)
}
)
)
# Create and use R6 object
alice <- Person$new("Alice", 30)
alice$introduce()
alice$have_birthday()
alice$age
# ===== ENVIRONMENTS =====
# Environments are fundamental to R's scoping
e1 <- new.env()
e1$x <- 10
e1$y <- 20
e2 <- new.env(parent = e1)
e2$x <- 5
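e2$x # 5 (own x)
e2$y # NULL ($ looks only in e2 itself, not in parent environments)
get("y", envir = e2) # 20 (get() searches e2, then its parent e1)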
# ===== CLOSURES =====
# Functions that capture their environment
make_counter <- function(start = 0) {
count <- start
function() {
count <<- count + 1
count
}
}
counter1 <- make_counter()
counter1()
counter1()
# ===== NON-STANDARD EVALUATION =====
# quote() captures expression without evaluating
expr <- quote(x + y)
expr
# eval() evaluates captured expression
x <- 5
y <- 3
eval(expr)
# ===== TIDY EVALUATION =====
library(dplyr)
# Correct tidy evaluation
good_var_summary <- function(df, var) {
df %>%
group_by(cyl) %>%
summarise(avg = mean({{ var }})) # Embrace with {{ }}
}
good_var_summary(mtcars, mpg)
# ===== RCPP: C++ INTEGRATION =====
library(Rcpp)
# Define C++ function in R
cppFunction('
int fibonacci_cpp(int n) {
if (n < 2) return n;
return fibonacci_cpp(n-1) + fibonacci_cpp(n-2);
}
')
fibonacci_cpp(10)
# ===== DATA.TABLE =====
library(data.table)
# Create data.table
dt <- data.table(
id = 1:1e5,
group = sample(letters[1:10], 1e5, replace = TRUE),
value = rnorm(1e5)
)
# Fast operations
dt[, mean(value), by = group]
dt[, value_sq := value^2]
# ===== PARALLEL COMPUTING =====
library(parallel)
# Detect cores
ncores <- detectCores()
print(paste("Number of cores:", ncores))
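S4 is listed among the object-oriented systems above but has no example in the code. The sketch below shows formal slots, a validity check, and a generic with a method. The class name Account and the generic deposit are hypothetical, chosen only for illustration.

```r
library(methods)

# A formal class with typed slots and a validity rule
setClass("Account",
  slots = c(owner = "character", balance = "numeric"),
  validity = function(object) {
    if (length(object@balance) != 1 || object@balance < 0) {
      "balance must be a single non-negative number"
    } else {
      TRUE
    }
  }
)

# A generic function and its method for the Account class
setGeneric("deposit", function(account, amount) standardGeneric("deposit"))
setMethod("deposit", "Account", function(account, amount) {
  account@balance <- account@balance + amount
  validObject(account)   # Re-run the validity rule after modifying a slot
  account                # S4 objects have copy semantics, so return the updated copy
})

acct <- new("Account", owner = "Alice", balance = 100)
acct <- deposit(acct, 50)
acct@balance   # 150

# Invalid objects are rejected at construction time
invalid <- tryCatch(
  new("Account", owner = "Bob", balance = -5),
  error = function(e) "rejected"
)
```

Unlike the R6 object earlier, the S4 object is not modified in place, which is why the result of deposit() is assigned back to acct.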