📊 R Introduction

🎯 Complete Definition

R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. Created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, R is an implementation of the S programming language with lexical scoping semantics inspired by Scheme.

🔬 Core Characteristics

Statistical Computing: Built-in statistical models, tests, and analyses
Data Visualization: Powerful graphics capabilities (base graphics, ggplot2, lattice)
Vectorized Operations: Operations apply to entire vectors without explicit loops
Functional Programming: Functions are first-class citizens
Package Ecosystem: CRAN hosts over 18,000 packages
Data Wrangling: Excellent tools for data manipulation (dplyr, data.table)
Reproducible Research: R Markdown, knitr, Sweave
Interoperability: Connects with Python, C++, SQL, and more

📊 Industry Usage

R is widely used in academia for statistics and data science. It is also used extensively in pharmaceuticals (clinical trials), finance (risk modeling), healthcare and bioinformatics, government (census data), and tech companies (data analysis). Major users include Google, Facebook, Pfizer, and the FDA.

# R Introduction

# Basic R operations
print("📊 Welcome to R Pro Track!")

# Create variables
language <- "R"
year <- 1993
r_version <- "4.3.1"  # Named r_version to avoid masking R's built-in `version` object

# Print with concatenation
cat("Hello,", language, "! Created in", year, "\n")
cat("Current version:", r_version)

# Simple statistics
numbers <- c(1, 2, 3, 4, 5)
mean_value <- mean(numbers)
sd_value <- sd(numbers)

cat("\nMean:", mean_value)
cat("\nStandard deviation:", sd_value)
    

🔤 Basics & Syntax

🎯 Complete Definition

R syntax uses <- for assignment (though = also works), supports both functional and object-oriented paradigms, and has special handling for vectors and missing data.

📋 Key Elements

Assignment: x <- 10 or x = 10 (<- is preferred)
Comments: # for single-line comments (no multi-line comments)
Case Sensitivity: R is case-sensitive (x and X are different)
Console I/O: print(), cat(), readline(), scan()
Packages: install.packages(), library(), require()
Help System: ?function, help(function), example(function)
Working Directory: getwd(), setwd()
Special Values: NA (missing), NULL (empty object, absence of a value), NaN (not a number), Inf/-Inf

# R Basics & Syntax

# Assignment operators
x <- 10        # Preferred
y = 20         # Also works but less common
30 -> z        # Right assignment (rare)

# Printing
print(x)                       # [1] 10
cat("x =", x, "y =", y, "\n")  # x = 10 y = 20

# Getting help
help(mean)     # or ?mean
example(mean)  # Run examples

# Working directory
getwd()        # Current directory
# setwd("path/to/directory")  # Change directory

# Special values
missing <- NA
undefined <- NULL
not_number <- 0/0  # NaN
infinite <- 1/0    # Inf

# Check types
is.na(missing)     # TRUE
is.null(undefined) # TRUE
is.nan(not_number) # TRUE
is.infinite(infinite) # TRUE

# Listing objects
ls()             # List all objects in environment
objects()        # Same as ls()

# Removing objects
rm(x)            # Remove single object
rm(list = ls())  # Remove all objects (clear workspace)
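
The package functions listed under Key Elements (install.packages(), library(), require()) are not exercised in the code above. A minimal sketch of loading a package defensively follows. The helper name use_package is hypothetical, and the bundled stats package is used only because it ships with every R installation.

```r
# Hypothetical helper (not part of base R): attach a package if it is installed,
# otherwise print a hint and return FALSE instead of stopping with an error.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Package '", pkg, "' is not installed. Try: install.packages(\"", pkg, "\")")
    return(FALSE)
  }
  library(pkg, character.only = TRUE)  # character.only: pkg holds the name as a string
  TRUE
}

loaded <- use_package("stats")                    # 'stats' ships with R
missing_pkg <- use_package("notARealPackage123")  # assumed not to exist on this machine
print(loaded)
print(missing_pkg)
```

By comparison, library() stops with an error when a package is missing, while require() returns FALSE with a warning, which is why scripts usually prefer library() and reserve checks like the one above for optional dependencies.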
    

📦 Data Types

🎯 Complete Definition

R data types include atomic vectors (homogeneous) and lists (heterogeneous). R uses dynamic typing with type coercion and has specific classes for statistical data.

📊 Basic Types

numeric: Double-precision floating point (default for numbers)
integer: Integer values (suffix L: 42L)
character: String values in quotes
logical: TRUE, FALSE (or T, F - not recommended)
complex: Complex numbers (3 + 2i)
raw: Raw bytes
factor: Categorical data with levels (a class built on integer codes, not an atomic type)
Date/POSIXct: Date and time classes (stored internally as numbers)

🔄 Type Checking & Coercion

Check: class(), typeof(), mode(), is.numeric(), is.character()
Coerce: as.numeric(), as.character(), as.logical(), as.factor()

# R Data Types

# Numeric (double by default)
num1 <- 10.5
num2 <- 42
typeof(num1)  # "double"
class(num1)   # "numeric"

# Integer (explicit)
int1 <- 42L
typeof(int1)  # "integer"

# Character
char1 <- "Hello, R!"
char2 <- 'Single quotes also work'
typeof(char1)  # "character"

# Logical
bool1 <- TRUE
bool2 <- FALSE
bool3 <- T     # Not recommended, can be overwritten
typeof(bool1)  # "logical"

# Complex
comp <- 3 + 4i
typeof(comp)   # "complex"
Re(comp)       # Real part: 3
Im(comp)       # Imaginary part: 4

# Type checking
is.numeric(num1)     # TRUE
is.integer(int1)     # TRUE
is.character(char1)  # TRUE
is.logical(bool1)    # TRUE

# Type coercion (automatic)
mixed <- c(1, "hello", TRUE)
print(mixed)  # All converted to character: "1" "hello" "TRUE"

# Explicit coercion
as.numeric("123")     # 123
as.character(42)      # "42"
as.logical(0)         # FALSE
as.logical(1)         # TRUE
as.numeric(TRUE)      # 1
as.numeric(FALSE)     # 0

# Checking for NA/NaN
x <- c(1, 2, NA, 4, NaN)
is.na(x)      # FALSE FALSE TRUE FALSE TRUE
is.nan(x)     # FALSE FALSE FALSE FALSE TRUE
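
The coercion and missing-value rules described in this section can be sketched a little further. The example below is illustrative only; the variable names (converted, scores) are made up for the demonstration.

```r
# Implicit coercion follows a hierarchy: logical < integer < double < character
typeof(c(TRUE, 1L))    # "integer"
typeof(c(1L, 2.5))     # "double"
typeof(c(2.5, "a"))    # "character"

# Logical values count as 0/1 in arithmetic
true_count <- sum(c(TRUE, FALSE, TRUE))   # 2

# Failed explicit coercion yields NA (with a warning, suppressed here)
converted <- suppressWarnings(as.numeric(c("1.5", "abc")))
print(converted)       # 1.5 NA

# NA propagates through most computations unless it is removed explicitly
scores <- c(90, NA, 70)
mean(scores)                 # NA
mean(scores, na.rm = TRUE)   # 80
```

Most summary functions (sum(), mean(), sd(), min(), max()) accept na.rm = TRUE, so missing values must be dropped deliberately rather than silently.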
    

🔢 Vectors

🎯 Complete Definition

Vectors are the most fundamental data structure in R. They are homogeneous (same type) sequences of elements. Everything in R is built on vectors - even single values are vectors of length 1.

📏 Vector Operations

Creation: c(), vector(), seq(), rep(), : operator
Indexing: [ ] with positive integers, negative integers (exclude), logical vectors, names
Vectorized operations: Operations apply element-wise
Recycling: Shorter vectors are recycled to match longer ones
Named vectors: Elements can have names for labeled access
Attributes: names(), dim(), class() can be attached

🔧 Key Functions

Creation: seq(from, to, by), seq_len(n), rep(x, times), rep_len(x, n)
Manipulation: length(), unique(), duplicated(), sort(), order(), rank()
Math: sum(), mean(), sd(), var(), min(), max(), range()

# Vectors in R

# Creating vectors
v1 <- c(1, 2, 3, 4, 5)           # Combine function
v2 <- 1:10                        # Sequence operator
v3 <- seq(from = 0, to = 1, by = 0.1)  # Sequence with step
v4 <- rep(1:3, times = 4)         # Repeat vector
v5 <- rep(1:3, each = 3)          # Repeat each element

print(v1)  # [1] 1 2 3 4 5
print(v2)  # [1] 1 2 3 4 5 6 7 8 9 10
print(v3)  # [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
print(v4)  # [1] 1 2 3 1 2 3 1 2 3 1 2 3
print(v5)  # [1] 1 1 1 2 2 2 3 3 3

# Named vectors
grades <- c(Alice = 95, Bob = 87, Charlie = 92)
print(grades)
# Alice   Bob Charlie 
#    95     87     92 

# Access by name
grades["Alice"]  # 95

# Vector operations (vectorized)
a <- c(1, 2, 3)
b <- c(4, 5, 6)

a + b         # [1] 5 7 9
a * b         # [1] 4 10 18
a^2           # [1] 1 4 9
sqrt(a)       # [1] 1.000000 1.414214 1.732051
log(a)        # Natural log

# Recycling (shorter vector recycled)
c(1, 2, 3, 4) + c(10, 20)  # [1] 11 22 13 24

# Indexing
v <- c(10, 20, 30, 40, 50)

v[1]          # First element: 10
v[3]          # Third element: 30
v[1:3]        # First three: 10 20 30
v[c(1, 3, 5)] # Specific positions: 10 30 50
v[-2]         # Exclude second: 10 30 40 50
v[-c(1, 5)]   # Exclude first and last: 20 30 40

# Logical indexing
v > 25        # [1] FALSE FALSE TRUE TRUE TRUE
v[v > 25]     # [1] 30 40 50

# Vector functions
length(v)     # 5
unique(c(1,1,2,2,3))  # [1] 1 2 3
sort(c(3,1,4,2))      # [1] 1 2 3 4
order(c(3,1,4,2))     # [1] 2 4 1 3 (indices that would sort)

# Summary statistics
x <- rnorm(100)  # 100 random normal numbers
mean(x)
median(x)
sd(x)
summary(x)       # Min, 1st Qu., Median, Mean, 3rd Qu., Max
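
Several functions named under Key Functions (seq_len(), rep_len(), duplicated(), rank()) do not appear in the code above. The following sketch shows them on small inputs, including the common reason to prefer seq_len(n) over 1:n; all variable names are illustrative.

```r
# 1:n counts DOWN when n is 0, which silently breaks loops; seq_len(n) does not
n <- 0
bad <- 1:n           # 1 0
good <- seq_len(n)   # integer(0), so a for loop over it runs zero times
length(bad)          # 2
length(good)         # 0

# rep_len() recycles a vector to an exact length
padded <- rep_len(1:3, 7)               # 1 2 3 1 2 3 1

# duplicated() flags repeats of earlier elements
dups <- duplicated(c(1, 1, 2, 2, 3))    # FALSE TRUE FALSE TRUE FALSE

# rank() gives each element's position in sorted order
ranks <- rank(c(30, 10, 20))            # 3 1 2

# which() converts a logical vector into the positions that are TRUE
big <- which(c(10, 20, 30, 40, 50) > 25)  # 3 4 5
```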
    

📋 Lists & Factors

🎯 Complete Definition

Lists are recursive vectors that can contain elements of different types, including other lists. Factors are vectors used for categorical data, storing integers with corresponding character labels (levels).

📋 Lists

Creation: list(), pairlist()
Access: [ ] returns list, [[ ]] returns element, $ for named elements
Nesting: Lists can contain other lists
Conversion: unlist() flattens to vector, as.list() converts vector to list
Applications: Function returns, complex data structures, JSON-like data

📊 Factors

Creation: factor(), as.factor(), gl() (generate levels)
Properties: levels(), nlevels(), is.factor(), ordered() for ordinal
Manipulation: droplevels() removes unused levels, relevel() changes reference
Table creation: table() creates contingency tables from factors
Important: Factors are stored as integer codes, so as.numeric() on a factor returns the codes rather than the labels (caution needed)

# Lists and Factors in R

# ===== LISTS =====
# Creating lists
my_list <- list(
  name = "John Doe",
  age = 30,
  scores = c(85, 92, 78),
  is_student = FALSE,
  nested = list(a = 1, b = 2)
)

print(my_list)

# Accessing list elements
my_list[1]        # Returns list with first element: $name
my_list[[1]]      # Returns content: "John Doe"
my_list$name      # Returns content: "John Doe"
my_list[["name"]] # Returns content: "John Doe"

my_list$scores[2] # 92 (second score)
my_list$nested$a  # 1

# Adding to list
my_list$new_element <- "added later"

# List operations
length(my_list)           # Number of top-level elements
names(my_list)            # Get or set names
unlist(my_list$scores)    # Flatten to vector (already vector here)

# Combining lists
list1 <- list(a = 1, b = 2)
list2 <- list(c = 3, d = 4)
combined <- c(list1, list2)
print(combined)

# ===== FACTORS =====
# Creating factors
gender <- factor(c("male", "female", "female", "male", "male"))
print(gender)

# Check levels
levels(gender)      # [1] "female" "male"
nlevels(gender)     # 2
class(gender)       # "factor"

# Ordered factors
education <- factor(
  c("high school", "bachelor", "master", "phd", "bachelor"),
  levels = c("high school", "bachelor", "master", "phd"),
  ordered = TRUE
)
print(education)

# Comparing levels of an ordered factor
education[1] < education[2]  # TRUE (high school < bachelor)

# Table from factors
gender_education <- table(gender, education)
print(gender_education)

# Changing factor levels
levels(gender) <- c("F", "M")  # Order matters: first level becomes "F"
print(gender)

# Generate factor with gl()
gl(3, 2, labels = c("Low", "Medium", "High"))

# Dropping unused levels
factor_with_unused <- factor(c("A", "B", "A"), levels = c("A", "B", "C"))
droplevels(factor_with_unused)  # Removes level "C"
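
The caution about factors behaving like integers, and the relevel() function mentioned under Manipulation, can be illustrated with a short sketch. The data here (readings, sizes) is invented for the example.

```r
# A factor whose labels look like numbers
readings <- factor(c("10", "20", "10", "30"))

# as.numeric() returns the internal integer codes, not the labels
codes <- as.numeric(readings)                   # 1 2 1 3

# Convert through character to recover the original numbers
values <- as.numeric(as.character(readings))    # 10 20 10 30

# relevel() moves one level to the front (the reference level in models)
sizes <- factor(c("small", "large", "medium"))
levels(sizes)                                   # "large" "medium" "small" (alphabetical)
sizes_ref <- relevel(sizes, ref = "small")
levels(sizes_ref)                               # "small" "large" "medium"
```

The reference level matters mainly in modeling functions such as lm(), where coefficients for a factor are reported relative to its first level.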
    

📑 Data Frames

🎯 Complete Definition

Data frames are the fundamental data structure for tabular data in R. They are lists of equal-length vectors (columns), combining features of lists and matrices. Each column can be a different type, making them ideal for datasets with mixed variable types.

📋 Data Frame Operations

Creation: data.frame(), as.data.frame(), read.table()/read.csv()
Access: [rows, cols], $ for columns, [[ ]], subset()
Inspection: head(), tail(), str(), summary(), dim(), nrow(), ncol()
Manipulation: rbind() (add rows), cbind() (add columns), merge() (join)
Subsetting: subset(), filter with logical conditions
Transformation: transform(), within()
Special: attach()/detach() (use with caution)

# Data Frames in R

# ===== CREATING DATA FRAMES =====
# From vectors
name <- c("Alice", "Bob", "Charlie", "Diana", "Eve")
age <- c(25, 30, 35, 28, 32)
salary <- c(50000, 60000, 75000, 55000, 70000)
department <- c("IT", "HR", "IT", "Finance", "HR")
employed <- c(TRUE, TRUE, FALSE, TRUE, TRUE)

df <- data.frame(
  Name = name,
  Age = age,
  Salary = salary,
  Department = department,
  Employed = employed,
  stringsAsFactors = FALSE  # Keep strings as character (the default since R 4.0.0)
)

print(df)

# ===== INSPECTING DATA FRAMES =====
head(df, 3)           # First 3 rows
tail(df, 2)           # Last 2 rows
str(df)               # Structure
summary(df)           # Summary statistics
dim(df)               # Dimensions: [1] 5 5
nrow(df)              # Number of rows: 5
ncol(df)              # Number of columns: 5
names(df)             # Column names
colnames(df)          # Same as names()
rownames(df)          # Row names

# ===== ACCESSING DATA =====
# By column
df$Name               # Vector: "Alice" "Bob" "Charlie" "Diana" "Eve"
df[["Age"]]           # Vector: 25 30 35 28 32
df[, "Salary"]        # Vector: 50000 60000 75000 55000 70000
df[, 3]               # Third column (Salary)

# By row and column
df[1, ]               # First row (all columns)
df[1:3, ]             # First three rows
df[1, "Name"]         # "Alice"
df[1, 1]              # "Alice"
df[1, c("Name", "Age")]  # First row, Name and Age

# By condition
df[df$Age > 30, ]          # Rows with Age > 30
df[df$Department == "IT", ] # IT department only
df[df$Salary > 60000 & df$Employed, ]  # Employed with Salary > 60000

# ===== SUBSET FUNCTION =====
subset(df, Age > 30)
subset(df, Department == "IT", select = c(Name, Salary))
subset(df, Salary > 60000, select = -c(Department))

# ===== ADDING/REMOVING COLUMNS =====
# Add column
df$Bonus <- df$Salary * 0.1
df[["YearsEmployed"]] <- c(2, 5, 0, 3, 4)

# Add multiple columns using transform
# (stored in a separate data frame so df keeps its original columns for rbind() below)
df_extended <- transform(df, 
                TotalComp = Salary + Bonus,
                AgeGroup = ifelse(Age < 30, "Young", "Senior"))

# Remove column
df$Bonus <- NULL           # Remove Bonus column
df[["YearsEmployed"]] <- NULL  # Remove YearsEmployed

# ===== ADDING ROWS =====
# Add single row
new_employee <- data.frame(
  Name = "Frank",
  Age = 29,
  Salary = 62000,
  Department = "Finance",
  Employed = TRUE,
  stringsAsFactors = FALSE
)
df <- rbind(df, new_employee)

# Add multiple rows
more_employees <- data.frame(
  Name = c("Grace", "Henry"),
  Age = c(27, 33),
  Salary = c(58000, 72000),
  Department = c("IT", "HR"),
  Employed = c(TRUE, TRUE),
  stringsAsFactors = FALSE
)
df <- rbind(df, more_employees)

# ===== COMBINING DATA FRAMES =====
# Create second data frame
df2 <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Henry"),
  City = c("NYC", "LA", "Chicago", "NYC", "LA", "Chicago", "NYC", "LA"),
  stringsAsFactors = FALSE
)

# Merge (join) data frames
df_merged <- merge(df, df2, by = "Name")
print(df_merged)

# ===== IMPORTANT FUNCTIONS =====
# Order/sort data frame
df_ordered <- df[order(df$Age, decreasing = TRUE), ]  # Sort by Age descending
df_ordered <- df[order(df$Department, df$Salary), ]   # Sort by Department, then Salary

# Remove rows with NA
df_complete <- na.omit(df)

# Unique rows
df_unique <- unique(df)

# Apply functions to columns
lapply(df[, c("Age", "Salary")], mean)
sapply(df[, c("Age", "Salary")], summary)

# ===== AGGREGATION =====
# Aggregate by department
agg <- aggregate(Salary ~ Department, data = df, FUN = mean)
print(agg)

# Table of counts
table(df$Department, df$Employed)
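
The merge() call above joins two data frames whose keys all match. A minimal sketch of how merge() behaves when keys do not match, plus with() as a scoped alternative to attach(), is shown below. The data frames staff and cities are hypothetical and independent of the df used earlier.

```r
staff <- data.frame(Name = c("Alice", "Bob", "Zed"), Age = c(25, 30, 41))
cities <- data.frame(Name = c("Alice", "Bob", "Yan"), City = c("NYC", "LA", "Rome"))

# Inner join (default): only names present in both data frames
inner <- merge(staff, cities, by = "Name")
nrow(inner)   # 2

# Left join: keep every row of staff; unmatched City becomes NA
left <- merge(staff, cities, by = "Name", all.x = TRUE)
nrow(left)    # 3

# Full join: keep unmatched rows from both sides
full <- merge(staff, cities, by = "Name", all = TRUE)
nrow(full)    # 4

# with() evaluates an expression inside a data frame without attach()
mean_age <- with(staff, mean(Age))   # 32
```

Unlike attach(), with() leaves the search path untouched, so column names cannot accidentally mask or be masked by other objects.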
    

🔲 Matrices & Arrays

🎯 Complete Definition

Matrices are two-dimensional arrays where all elements are the same type. Arrays generalize matrices to any number of dimensions. They are fundamental for linear algebra operations and multidimensional data.

🔲 Matrix Operations

Creation: matrix(), rbind(), cbind(), diag()
Dimensions: dim(), nrow(), ncol(), length()
Access: [row, col] indexing
Linear Algebra: %*% (matrix multiplication), t() (transpose), solve() (inverse), eigen()
Row/Column Operations: rowSums(), colSums(), rowMeans(), colMeans()
Special Matrices: diag() (diagonal), outer() (outer product)

📊 Arrays

Creation: array(data, dim)
Dimensions: dim() to get/set dimensions
Indexing: [dim1, dim2, dim3, ...]
aperm(): Permute array dimensions (like transpose for higher dimensions)

# Matrices and Arrays in R

# ===== CREATING MATRICES =====
# By column (default)
m1 <- matrix(1:12, nrow = 3, ncol = 4)
print(m1)

# By row
m2 <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
print(m2)

# From vectors using rbind and cbind
row1 <- c(1, 2, 3)
row2 <- c(4, 5, 6)
m3 <- rbind(row1, row2)  # Bind by row
print(m3)

col1 <- c(1, 4)
col2 <- c(2, 5)
col3 <- c(3, 6)
m4 <- cbind(col1, col2, col3)  # Bind by column
print(m4)

# Diagonal matrix
diag(1, nrow = 3)  # Identity matrix
diag(c(10, 20, 30))  # Diagonal with specified values

# ===== MATRIX PROPERTIES =====
m <- matrix(1:9, nrow = 3)
dim(m)        # [1] 3 3
nrow(m)       # [1] 3
ncol(m)       # [1] 3
length(m)     # [1] 9 (total elements)

# ===== INDEXING MATRICES =====
m <- matrix(1:12, nrow = 3)

m[2, 3]       # Element at row 2, column 3: 8
m[1, ]        # First row: 1 4 7 10
m[, 2]        # Second column: 4 5 6
m[1:2, 3:4]   # Rows 1-2, columns 3-4

m[c(1, 3), ]  # Rows 1 and 3

# ===== MATRIX OPERATIONS =====
A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)

# Element-wise operations
A + B
A * B  # Element-wise multiplication, NOT matrix multiplication
A / B
A^2

# Matrix multiplication
A %*% B  # True matrix multiplication

# Transpose
t(A)

# Determinant
det(matrix(c(1,2,3,4), nrow = 2))

# Inverse
solve(A)  # Inverse of A (if invertible)

# Eigenvalues and eigenvectors
eigen(A)

# ===== ROW/COLUMN STATISTICS =====
m <- matrix(1:12, nrow = 3)
rowSums(m)    # Sum of each row: 22 26 30
colSums(m)    # Sum of each column: 6 15 24 33
rowMeans(m)   # Mean of each row: 5.5 6.5 7.5
colMeans(m)   # Mean of each column: 2 5 8 11

apply(m, 1, sum)     # Same as rowSums (apply over rows)
apply(m, 2, mean)    # Same as colMeans (apply over columns)

# ===== ARRAYS (3+ dimensions) =====
# Create 3D array (2 rows, 3 columns, 4 layers)
arr <- array(1:24, dim = c(2, 3, 4))
print(arr)

# Accessing array elements
arr[1, 2, 3]  # Row 1, Col 2, Layer 3: 15
arr[, , 1]    # First layer (2x3 matrix)
arr[1, , ]    # First row across all layers
arr[, 2, ]    # Second column across all layers

# Dimensions of array
dim(arr)      # [1] 2 3 4

# Permute dimensions
aperm(arr, c(2, 1, 3))  # Swap first two dimensions

# ===== OUTER PRODUCT =====
x <- 1:3
y <- 4:6
outer(x, y, FUN = "*")  # Multiplication table
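
Two points about the matrix operations above are easy to miss: solve() can solve a linear system directly, and single-row indexing drops the matrix structure unless told otherwise. The sketch below uses small made-up values (coef_mat, rhs, grid) to show both.

```r
# Solve the system coef_mat %*% solution = rhs without forming the inverse
coef_mat <- matrix(c(2, 1, 1, 3), nrow = 2)
rhs <- c(3, 5)
solution <- solve(coef_mat, rhs)
print(solution)                                   # 0.8 1.4

# Check the answer, and check that inverse times matrix is the identity
all.equal(as.vector(coef_mat %*% solution), rhs)  # TRUE
inv <- solve(coef_mat)
all.equal(coef_mat %*% inv, diag(2))              # TRUE

# Indexing one row returns a plain vector unless drop = FALSE is given
grid <- matrix(1:12, nrow = 3)
dim(grid[1, ])                  # NULL (a vector)
dim(grid[1, , drop = FALSE])    # 1 4 (still a matrix)
```

Comparisons use all.equal() rather than == because floating-point results of solve() can differ from the exact values by tiny rounding errors.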
    

🔄 Control Flow

🎯 Complete Definition

Control flow statements in R include conditional execution and loops. R also provides vectorized alternatives that are often more efficient than explicit loops.

🔀 Conditional Statements

if/else: if (condition) expr else expr
ifelse(): Vectorized conditional: ifelse(test, yes, no)
switch(): Multi-way branch based on value

🔄 Loops

for: for (var in sequence) { expr }
while: while (condition) { expr }
repeat: repeat { expr; if (condition) break }
break/next: Exit loop or skip iteration

⚡ Vectorized Alternatives

Operations on vectors are usually faster than loops
ifelse() for vectorized conditional
apply family functions for iterative operations

# Control Flow in R

# ===== IF/ELSE STATEMENTS =====
x <- 10

if (x > 5) {
  print("x is greater than 5")
} else {
  print("x is less than or equal to 5")
}

# Multiple conditions
score <- 85

if (score >= 90) {
  grade <- "A"
} else if (score >= 80) {
  grade <- "B"
} else if (score >= 70) {
  grade <- "C"
} else {
  grade <- "F"
}
print(paste("Grade:", grade))

# ===== VECTORIZED IFELSE =====
ages <- c(15, 25, 35, 45, 55, 65)

age_groups <- ifelse(ages < 30, "Young", 
                    ifelse(ages < 50, "Middle", "Senior"))
print(age_groups)

# Example with data frame
df <- data.frame(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  score = c(92, 67, 88, 45)
)

df$result <- ifelse(df$score >= 70, "Pass", "Fail")
print(df)

# ===== SWITCH STATEMENT =====
# switch can work with numbers or strings
operation <- "mean"
values <- c(10, 20, 30, 40, 50)

result <- switch(operation,
  "mean" = mean(values),
  "median" = median(values),
  "sum" = sum(values),
  "min" = min(values),
  "max" = max(values),
  stop("Unknown operation")
)
print(result)

# ===== FOR LOOPS =====
# Basic for loop
for (i in 1:5) {
  print(paste("Iteration:", i))
}

# Loop over vector
fruits <- c("apple", "banana", "cherry", "date")
for (fruit in fruits) {
  print(paste("I like", fruit))
}

# Loop with index
for (i in seq_along(fruits)) {
  print(paste("Fruit", i, "is", fruits[i]))
}

# ===== WHILE LOOPS =====
counter <- 1
while (counter <= 5) {
  print(paste("Counter:", counter))
  counter <- counter + 1
}

# While with condition
x <- 1
while (x < 100) {
  print(x)
  x <- x * 2
}

# ===== BREAK AND NEXT =====
# break - exit loop
for (i in 1:10) {
  if (i == 6) {
    break
  }
  print(i)
}

# next - skip iteration
for (i in 1:10) {
  if (i %% 2 == 0) {  # Skip even numbers
    next
  }
  print(i)
}
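
The repeat loop and the numeric form of switch() are listed above but not demonstrated, and neither is the claim that vectorized code can replace a loop. A short sketch, with illustrative variable names:

```r
# repeat runs until an explicit break
value <- 1
iterations <- 0
repeat {
  value <- value * 2
  iterations <- iterations + 1
  if (value > 100) break
}
print(value)        # 128
print(iterations)   # 7

# switch() with a number selects by position
choice <- switch(2, "first", "second", "third")   # "second"

# switch() with a string and no matching or default branch returns NULL
no_match <- switch("z", a = 1, b = 2)
is.null(no_match)   # TRUE

# The same computation as a loop and as a vectorized expression
squares_loop <- numeric(5)
for (i in seq_len(5)) {
  squares_loop[i] <- i^2
}
squares_vec <- (1:5)^2
all.equal(squares_loop, squares_vec)   # TRUE
```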
    

⚙️ Functions

🎯 Complete Definition

Functions in R are first-class objects that encapsulate reusable code. They take arguments, perform operations, and return values. R supports functional programming concepts like closures, lexical scoping, and higher-order functions.

⚙️ Function Components

Name: Optional identifier for the function
Arguments: Formal parameters (can have default values)
Body: Code inside { } that implements the function
Return value: Last evaluated expression or explicit return()
Environment: Where the function looks for variables

📋 Function Features

Default arguments: function(x, y = 10)
Lazy evaluation: Arguments evaluated only when used
... (ellipsis): Pass arbitrary number of arguments
Anonymous functions: Functions without names
Closures: Functions that capture variables from the environment where they were created (for example, functions returned by other functions)
Recursion: Functions that call themselves

# Functions in R

# ===== BASIC FUNCTION DEFINITION =====
# Simple function
greet <- function(name) {
  paste("Hello,", name, "!")
}

# Call the function
greet("Alice")

# Function with multiple arguments
add_numbers <- function(a, b) {
  result <- a + b
  return(result)  # Explicit return
}

add_numbers(5, 3)

# Function without return (last expression returned)
multiply <- function(a, b) {
  a * b  # Implicit return
}

multiply(4, 5)

# ===== DEFAULT ARGUMENTS =====
power <- function(x, exponent = 2) {
  x^exponent
}

power(5)     # [1] 25 (default exponent = 2)
power(5, 3)  # [1] 125 (exponent = 3)

# ===== MULTIPLE RETURN VALUES =====
# Return a list for multiple values
stats <- function(x) {
  list(
    mean = mean(x),
    median = median(x),
    sd = sd(x),
    min = min(x),
    max = max(x)
  )
}

values <- c(10, 20, 30, 40, 50)
result <- stats(values)
print(result$mean)

# ===== ANONYMOUS FUNCTIONS =====
# Used on the fly without naming
squared <- sapply(1:5, function(x) x^2)
print(squared)

# ===== CLOSURES (FUNCTIONS RETURNING FUNCTIONS) =====
power_factory <- function(exponent) {
  function(x) {
    x^exponent
  }
}

square <- power_factory(2)
cube <- power_factory(3)

square(5)  # [1] 25
cube(5)    # [1] 125

# ===== RECURSIVE FUNCTIONS =====
# Factorial
factorial_rec <- function(n) {
  if (n <= 1) {
    return(1)
  } else {
    return(n * factorial_rec(n - 1))
  }
}

factorial_rec(5)  # [1] 120
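
Three features from the list above (the ... argument, lazy evaluation, and closures that keep state) are not covered by the examples so far. The sketch below is one way to illustrate them; every function name in it (describe, first_only, make_counter) is hypothetical.

```r
# ... collects any number of arguments
describe <- function(...) {
  values <- c(...)
  c(n = length(values), total = sum(values))
}
summary_vec <- describe(1, 2, 3)   # n = 3, total = 6

# Lazy evaluation: y is never used, so the stop() call is never run
first_only <- function(x, y) {
  x * 2
}
lazy_result <- first_only(5, stop("never evaluated"))   # 10

# A closure keeps its own private state between calls
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # <<- modifies count in the enclosing environment
    count
  }
}
next_id <- make_counter()
first_call <- next_id()    # 1
second_call <- next_id()   # 2
```

Each call to make_counter() creates a fresh environment, so two counters created this way advance independently.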
    

🚀 Apply Family

🎯 Complete Definition

The apply family provides functions for applying an operation over the margins of arrays or over the elements of lists and vectors, as a functional alternative to explicit loops. They are core to R's functional programming style and are often more concise than explicit loops, although they still iterate internally and are not necessarily faster.

📋 Apply Functions Overview

apply(): Apply function to margins of arrays/matrices
lapply(): Apply function to each element of a list, return list
sapply(): Simplify lapply result to vector or matrix if possible
vapply(): Safer version of sapply with pre-specified return type
mapply(): Multivariate version of sapply
tapply(): Apply function over subsets (by factor)

# Apply Family Functions in R

# ===== APPLY (on matrices/arrays) =====
# Create a matrix
m <- matrix(1:12, nrow = 3)
print(m)

# Apply over rows (MARGIN = 1)
row_sums <- apply(m, 1, sum)
print(row_sums)

# Apply over columns (MARGIN = 2)
col_means <- apply(m, 2, mean)
print(col_means)

# ===== LAPPLY (list apply, returns list) =====
# Create a list
my_list <- list(
  a = 1:5,
  b = 6:10,
  c = 11:15
)

# Apply mean to each element
means_list <- lapply(my_list, mean)
print(means_list)

# lapply on data frame (data frame is a list of columns)
df <- data.frame(
  age = c(25, 30, 35, 28, 32),
  salary = c(50000, 60000, 75000, 55000, 70000),
  years = c(2, 5, 8, 3, 6)
)

lapply(df, mean)  # Means of each column

# ===== SAPPLY (simplified lapply) =====
# Returns vector when possible
means_vector <- sapply(my_list, mean)
print(means_vector)

# Returns matrix when appropriate
stats_matrix <- sapply(my_list, function(x) {
  c(mean = mean(x), sd = sd(x), n = length(x))
})
print(stats_matrix)

# ===== VAPPLY (type-safe sapply) =====
# Specify return type for safety
vapply(my_list, mean, numeric(1))

# ===== MAPPLY (multivariate apply) =====
# Iterates over several arguments together, element by element (not parallel computing)
mapply(rep, 1:4, times = 4:1)

# ===== TAPPLY (apply by groups) =====
# Grouped operations
data <- data.frame(
  group = rep(c("A", "B", "C"), each = 5),
  value = c(rnorm(5, mean = 10), rnorm(5, mean = 20), rnorm(5, mean = 30))
)

# Calculate mean by group
tapply(data$value, data$group, mean)

# With a built-in dataset (cyl is numeric and is treated as a grouping factor)
tapply(mtcars$mpg, mtcars$cyl, mean)
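
The description of vapply() as a safer sapply() is easier to see on edge cases. The following sketch, using made-up inputs, shows sapply() changing its return type on empty input while vapply() keeps the declared type or fails loudly.

```r
# On empty input sapply() returns a list, vapply() returns the declared type
empty <- list()
s_empty <- sapply(empty, length)               # list()
v_empty <- vapply(empty, length, integer(1))   # integer(0)

# vapply() raises an error when a result does not match the declared shape
ragged <- list(a = 1:3, b = 4:6)
vapply_outcome <- tryCatch(
  vapply(ragged, function(x) x * 2, numeric(1)),   # each result has length 3, not 1
  error = function(e) "vapply rejected the result shape"
)
print(vapply_outcome)

# mapply() walks several vectors together, element by element
sums <- mapply(function(x, y) x + y, 1:3, 4:6)     # 5 7 9
```

This predictability is the usual argument for using vapply() inside functions and packages, keeping sapply() for interactive work.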
    

📊 dplyr & tidyverse

🎯 Complete Definition

dplyr is a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. It's part of the tidyverse, a collection of R packages designed for data science that share an underlying design philosophy, grammar, and data structures.

📋 Core dplyr Verbs

filter(): Subset rows based on conditions
select(): Choose columns by name
mutate(): Create new columns
summarise(): Collapse multiple values to single summary
arrange(): Reorder rows
group_by(): Group data for operations
join functions: left_join(), right_join(), inner_join(), full_join()

🔧 Additional Tidyverse Packages

tidyr: Tidy data (pivot_longer, pivot_wider, separate, unite)
purrr: Functional programming tools
tibble: Modern reimagining of data frames
stringr: String manipulation
forcats: Factor manipulation

⚡ Key Features

Pipe operator (%>%): Chain operations together (provided by magrittr and re-exported by dplyr; R 4.1+ also has a native |> pipe)
Consistent syntax: First argument is data, returns data frame

# dplyr and tidyverse in R

# Load libraries
library(dplyr)

# ===== PIPE OPERATOR (%>%) =====
# Without pipe
result <- arrange(summarise(group_by(filter(mtcars, cyl == 6), 
                                     gear), avg_mpg = mean(mpg)), desc(avg_mpg))

# With pipe (much more readable)
result <- mtcars %>%
  filter(cyl == 6) %>%
  group_by(gear) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  arrange(desc(avg_mpg))

print(result)

# ===== FILTER (row selection) =====
# Filter rows based on conditions
mtcars %>%
  filter(mpg > 20) %>%
  head()

# Multiple conditions
mtcars %>%
  filter(cyl == 6 & hp > 100)

# ===== SELECT (column selection) =====
# Select columns by name
mtcars %>%
  select(mpg, cyl, hp) %>%
  head()

# Select with helpers
mtcars %>%
  select(starts_with("c"))

# Remove columns
mtcars %>%
  select(-hp, -wt) %>%
  head()

# ===== MUTATE (create/modify columns) =====
# Create new columns
mtcars %>%
  mutate(
    mpg_per_cyl = mpg / cyl,
    high_mpg = mpg > 25
  ) %>%
  select(mpg, cyl, mpg_per_cyl, high_mpg) %>%
  head()

# ===== SUMMARISE (aggregate) =====
# Basic summary
mtcars %>%
  summarise(
    avg_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    min_mpg = min(mpg),
    max_mpg = max(mpg),
    n = n()
  )

# ===== GROUP_BY (grouped operations) =====
# Group by single variable
mtcars %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg),
    avg_hp = mean(hp),
    count = n()
  )

# ===== ARRANGE (sorting) =====
# Ascending (default)
mtcars %>%
  arrange(mpg) %>%
  select(mpg, cyl, hp) %>%
  head()

# Descending
mtcars %>%
  arrange(desc(mpg)) %>%
  select(mpg, cyl, hp) %>%
  head()

# ===== JOIN OPERATIONS =====
# Create two data frames for joining
employees <- tibble(
  emp_id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  dept_id = c(101, 102, 101, 103, 102)
)

departments <- tibble(
  dept_id = c(101, 102, 103, 104),
  dept_name = c("IT", "HR", "Finance", "Marketing"),
  location = c("NYC", "LA", "Chicago", "NYC")
)

# Inner join (keep only matches in both)
inner_join(employees, departments, by = "dept_id")

# Left join (keep all employees)
left_join(employees, departments, by = "dept_id")

# ===== TIDYR FUNCTIONS =====
library(tidyr)

# pivot_longer (gather) - wide to long
wide_data <- tibble(
  id = 1:3,
  Q1 = c(85, 90, 88),
  Q2 = c(78, 92, 85),
  Q3 = c(92, 88, 90)
)

long_data <- wide_data %>%
  pivot_longer(
    cols = starts_with("Q"),
    names_to = "quarter",
    values_to = "score"
  )
print(long_data)

# pivot_wider (spread) - long to wide
long_data %>%
  pivot_wider(
    names_from = quarter,
    values_from = score
  )
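
The join examples above cover inner and left joins. As a further sketch, assuming dplyr is installed, the code below combines a join with count() and uses anti_join() to find rows without a match. The tables mirror the employees and departments example, and the column name n_staff is made up for the illustration.

```r
library(dplyr)  # assumes dplyr is installed

employees <- tibble(
  emp_id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  dept_id = c(101, 102, 101, 103, 102)
)
departments <- tibble(
  dept_id = c(101, 102, 103, 104),
  dept_name = c("IT", "HR", "Finance", "Marketing")
)

# Join, then count employees per department and sort by headcount
headcount <- employees %>%
  left_join(departments, by = "dept_id") %>%
  count(dept_name, name = "n_staff") %>%
  arrange(desc(n_staff))
print(headcount)

# anti_join() keeps rows of the first table with no match in the second
unstaffed <- anti_join(departments, employees, by = "dept_id")
print(unstaffed)   # the Marketing row
```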
    

🎨 ggplot2

🎯 Complete Definition

ggplot2 is a data visualization package for R based on the Grammar of Graphics. Created by Hadley Wickham, it provides a consistent, layered approach to creating statistical graphics. Plots are built by adding layers, allowing complex visualizations to be constructed intuitively.

📋 Grammar of Graphics Components

Data: The dataset to plot
Aesthetics (aes): Mapping variables to visual properties (x, y, color, size, shape)
Geometries (geom): Geometric objects (points, lines, bars, etc.)
Facets: Subplots (small multiples)
Statistics: Statistical transformations (smoothing, binning)
Coordinates: Coordinate system (cartesian, polar, etc.)
Themes: Visual styling (titles, legends, backgrounds)

🔧 Common Geometries

geom_point(): Scatter plots
geom_line(): Line charts
geom_bar(): Bar charts
geom_histogram(): Histograms
geom_boxplot(): Box plots
geom_density(): Density plots
geom_smooth(): Smoothed conditional means

# ggplot2 Visualization in R

# Load libraries
library(ggplot2)

# ===== BASIC STRUCTURE =====
# ggplot(data) + geom_function(aes(mappings))
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# ===== SCATTER PLOTS =====
# Basic scatter plot
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Add color by variable
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3)

# Add smooth line
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

# ===== BAR CHARTS =====
# Simple bar chart of counts
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "steelblue")

# Bar chart with fill by another variable
ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(position = "dodge")

# ===== HISTOGRAMS =====
# Basic histogram
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "skyblue", color = "black")

# ===== BOX PLOTS =====
# Basic box plot
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "lightgreen")

# Box plot with points
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5)

# ===== DENSITY PLOTS =====
# Basic density
ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "orange", alpha = 0.5)

# Multiple densities
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5)

# ===== FACETING =====
# Facet wrap (by one variable)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl)

# ===== CUSTOMIZING PLOTS =====
# Adding labels and title
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  labs(
    title = "Fuel Efficiency vs Weight",
    subtitle = "By Number of Cylinders",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon",
    color = "Cylinders"
  ) +
  theme_minimal()
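
Because plots are built by adding layers, a plot can be stored in a variable, extended later, and written to disk. A minimal sketch, assuming ggplot2 is installed; the object names and the output file name are hypothetical.

```r
library(ggplot2)  # assumes ggplot2 is installed

# Store a base plot, then extend it without repeating the shared parts
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

faceted <- base_plot +
  facet_wrap(~ cyl) +
  labs(title = "MPG vs weight by cylinders")

# Save to a temporary location (hypothetical file name); size is in inches
out_file <- file.path(tempdir(), "mpg_by_cyl.png")
ggsave(out_file, plot = faceted, width = 6, height = 4, dpi = 150)
file.exists(out_file)
```

Passing the plot explicitly to ggsave() avoids relying on its default of saving whichever plot was displayed last.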
    

📈 Statistics

🎯 Complete Definition

R's statistical capabilities are among its greatest strengths. It provides comprehensive functionality for descriptive statistics, probability distributions, statistical tests, linear and nonlinear modeling, time series analysis, and machine learning.

📊 Descriptive Statistics

summary(): Five-number summary + mean
fivenum(): Tukey's five numbers
quantile(): Sample quantiles
IQR(): Interquartile range
cor(), cov(): Correlation and covariance
table(), prop.table(): Frequency tables

📈 Probability Distributions

Normal: dnorm(), pnorm(), qnorm(), rnorm()
Binomial: dbinom(), pbinom(), qbinom(), rbinom()
Poisson: dpois(), ppois(), qpois(), rpois()
Uniform: dunif(), punif(), qunif(), runif()

📋 Statistical Tests

t.test(): One and two-sample t-tests
wilcox.test(): Mann-Whitney/Wilcoxon tests
chisq.test(): Chi-squared tests
cor.test(): Test for association
shapiro.test(): Shapiro-Wilk normality test
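
These test functions return a list-like object of class "htest", so individual results can be extracted rather than read off the printed output. A minimal sketch on mtcars follows; the object names are illustrative, and exact = FALSE is passed to wilcox.test() because the data contain ties.

```r
# Welch two-sample t-test (the default when var.equal is not set)
tt <- t.test(mpg ~ am, data = mtcars)

class(tt)        # "htest"
tt$statistic     # t statistic
tt$p.value       # p-value as a plain number
tt$conf.int      # confidence interval for the difference in means

# Rank-based alternative; exact = FALSE uses the normal approximation
wt_test <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)
wt_test$p.value
```

Working with the components directly makes it easier to collect results from many tests into a table.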

🔧 Linear Models

lm(): Linear regression
glm(): Generalized linear models
anova(): ANOVA tables for fitted models; compares nested models
aov(): Fits an analysis of variance model
predict(): Model predictions
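
As one illustration of glm() and of anova() used for model comparison, the sketch below fits a logistic regression and compares two nested linear models on mtcars. The choice of predictors and the object names are assumptions made for the example only.

```r
# Logistic regression: probability of a manual transmission (am = 1)
# as a function of weight
logit_model <- glm(am ~ wt, data = mtcars, family = binomial)
print(coef(logit_model))

# Predicted probabilities; type = "response" returns the probability scale
new_wt <- data.frame(wt = c(2.5, 3.5))
probs <- predict(logit_model, newdata = new_wt, type = "response")
print(probs)

# Compare nested linear models: does adding hp improve the fit?
m_small <- lm(mpg ~ wt, data = mtcars)
m_large <- lm(mpg ~ wt + hp, data = mtcars)
cmp <- anova(m_small, m_large)   # F-test for the added term
print(cmp)
```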

# Statistics in R

# ===== DESCRIPTIVE STATISTICS =====
data(mtcars)

# Summary statistics
summary(mtcars$mpg)

# Mean, median
mean(mtcars$mpg)
median(mtcars$mpg)

# Spread measures
sd(mtcars$mpg)
var(mtcars$mpg)
IQR(mtcars$mpg)

# Correlation
cor(mtcars$mpg, mtcars$wt)

# ===== FREQUENCY TABLES =====
# One-way table
table(mtcars$cyl)

# Two-way table
table(mtcars$cyl, mtcars$gear)

# ===== PROBABILITY DISTRIBUTIONS =====
# Normal distribution
x <- seq(-4, 4, length.out = 100)
plot(x, dnorm(x), type = "l", main = "Normal Distribution")
pnorm(1.96)  # Cumulative probability: 0.975
qnorm(0.975) # Quantile: 1.959964
rnorm(10)    # Random sample

# ===== STATISTICAL TESTS =====
# One-sample t-test
t.test(mtcars$mpg, mu = 20)

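# Two-sample t-test (Welch's test by default; add var.equal = TRUE for the pooled version)
t.test(mpg ~ am, data = mtcars)

# Chi-square test (warns here because some expected counts are small)
chisq.test(table(mtcars$cyl, mtcars$gear))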

# Correlation test
cor.test(mtcars$mpg, mtcars$wt)

# Test for normality
shapiro.test(mtcars$mpg)

# ===== LINEAR MODELS =====
# Simple linear regression
model1 <- lm(mpg ~ wt, data = mtcars)
summary(model1)

# Multiple linear regression
model2 <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model2)

# Extract model components
coef(model2)           # Coefficients
residuals(model2)      # Residuals
fitted(model2)         # Fitted values
anova(model2)          # ANOVA table

# Predictions
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5), 
                       hp = c(100, 150, 200),
                       cyl = c(4, 6, 8))
predict(model2, new_cars)

# ===== ANALYSIS OF VARIANCE (ANOVA) =====
# One-way ANOVA
aov_model <- aov(mpg ~ factor(cyl), data = mtcars)
summary(aov_model)

# Tukey HSD post-hoc test
TukeyHSD(aov_model)
    

🔬 Advanced R

🎯 Complete Definition

Advanced R covers sophisticated programming techniques, performance optimization, object-oriented systems, metaprogramming, and interfacing with other languages. These concepts enable building robust, efficient, and maintainable R code.

🔬 Object-Oriented Systems

S3: Informal, generic functions, class attribute
S4: Formal classes, slots, validation
R6: Encapsulated, reference-based OOP
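
To show what "formal classes, slots, validation" means for S4, here is a minimal sketch. The class name PersonS4 and the generic describe_person are hypothetical names chosen for this example.

```r
# Define a formal S4 class with typed slots and a validity check
setClass("PersonS4",
  slots = c(name = "character", age = "numeric"),
  validity = function(object) {
    if (length(object@age) != 1 || object@age < 0) {
      "age must be a single non-negative number"
    } else {
      TRUE
    }
  }
)

# S4 generics and methods are registered explicitly
setGeneric("describe_person", function(x, ...) standardGeneric("describe_person"))
setMethod("describe_person", "PersonS4", function(x, ...) {
  paste0(x@name, " is ", x@age, " years old")
})

bob <- new("PersonS4", name = "Bob", age = 41)
print(describe_person(bob))

# The validity function rejects bad objects at construction time
bad <- tryCatch(
  new("PersonS4", name = "Eve", age = -1),
  error = function(e) "invalid"
)
print(bad)
```

Slots are accessed with @ rather than $, and the validity check runs whenever new() is called with slot values.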

⚡ Performance Optimization

Profiling: Rprof(), profvis
Byte compilation: compiler package
Vectorization: Avoiding loops
Parallel computing: parallel, foreach
Rcpp: C++ integration
data.table: High-performance data manipulation
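
The vectorization point can be checked directly by timing a loop against the equivalent vectorized expression. This is a rough sketch only: the vector size is arbitrary and timings vary by machine, so no specific speedup is claimed.

```r
n <- 1e5
x <- seq_len(n)

# Loop version: preallocate the result, then fill element by element
squares_loop <- numeric(n)
loop_time <- system.time(
  for (i in seq_len(n)) squares_loop[i] <- x[i]^2
)

# Vectorized version: one call operates on the whole vector
vec_time <- system.time(squares_vec <- x^2)

# Both approaches produce the same values
print(isTRUE(all.equal(squares_loop, squares_vec)))
print(loop_time[["elapsed"]])
print(vec_time[["elapsed"]])
```

For finer-grained measurements than system.time(), the profiling tools listed above are the better choice.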

📝 Metaprogramming

Non-standard evaluation: substitute(), quote(), eval()
Tidy evaluation: {{ }}, enquo(), !!
Environments: new.env(), parent.env()
Closures: Functions that capture environments
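
A short sketch of how these pieces fit together: substitute() captures the expression a caller wrote, eval() runs an expression in a chosen environment, and a closure keeps its enclosing environment alive. The function names show_expr and make_adder are hypothetical.

```r
# substitute() captures the unevaluated expression passed as an argument
show_expr <- function(arg) {
  deparse(substitute(arg))
}
print(show_expr(a + b * 2))   # the symbols a and b are never evaluated

# Evaluate a quoted expression inside a specific environment
env <- new.env()
env$a <- 2
env$b <- 3
result <- eval(quote(a + b * 2), envir = env)   # 2 + 3 * 2
print(result)

# A closure's enclosing environment holds the variables it captured
make_adder <- function(n) function(x) x + n
add5 <- make_adder(5)
captured <- get("n", envir = environment(add5))
print(captured)
print(add5(1))
```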

# Advanced R Programming

# ===== S3 OBJECT SYSTEM =====
# Create an S3 class
person <- list(name = "Alice", age = 30)
class(person) <- "person"

# Define a generic function
greet <- function(x, ...) {
  UseMethod("greet")
}

# Define methods for the person class
greet.person <- function(x, ...) {
  paste("Hello, I'm", x$name, "and I'm", x$age, "years old.")
}

# Test
greet(person)

# ===== R6 CLASSES =====
library(R6)

Person <- R6Class("Person",
  public = list(
    name = NULL,
    age = NULL,
    
    initialize = function(name, age) {
      self$name <- name
      self$age <- age
    },
    
    introduce = function() {
      paste0("I'm ", self$name, ", age ", self$age)
    },
    
    have_birthday = function() {
      self$age <- self$age + 1
      invisible(self)
    }
  )
)

# Create and use R6 object
alice <- Person$new("Alice", 30)
alice$introduce()
alice$have_birthday()  # modifies the object in place (reference semantics)
alice$age              # 31

# ===== ENVIRONMENTS =====
# Environments are fundamental to R's scoping
e1 <- new.env()
e1$x <- 10
e1$y <- 20

e2 <- new.env(parent = e1)
e2$x <- 5

e2$x                  # 5 (own x)
e2$y                  # NULL: $ does not search parent environments
get("y", envir = e2)  # 20 (get() follows the parent chain to e1)

# ===== CLOSURES =====
# Functions that capture their environment
make_counter <- function(start = 0) {
  count <- start
  function() {
    count <<- count + 1
    count
  }
}

counter1 <- make_counter()
counter1()  # 1
counter1()  # 2 (count persists between calls)

# ===== NON-STANDARD EVALUATION =====
# quote() captures expression without evaluating
expr <- quote(x + y)
expr

# eval() evaluates captured expression
x <- 5
y <- 3
eval(expr)

# ===== TIDY EVALUATION =====
library(dplyr)

# Pass a bare column name through a function with the embrace operator
good_var_summary <- function(df, var) {
  df %>%
    group_by(cyl) %>%
    summarise(avg = mean({{ var }}))  # Embrace with {{ }}
}

good_var_summary(mtcars, mpg)

# ===== RCPP: C++ INTEGRATION =====
library(Rcpp)

# Define C++ function in R
cppFunction('
int fibonacci_cpp(int n) {
  if (n < 2) return n;
  return fibonacci_cpp(n-1) + fibonacci_cpp(n-2);
}
')

fibonacci_cpp(10)  # 55

# ===== DATA.TABLE =====
library(data.table)

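# Create data.table
set.seed(123)  # make the random columns reproducible
dt <- data.table(
  id = 1:1e5,
  group = sample(letters[1:10], 1e5, replace = TRUE),
  value = rnorm(1e5)
)

# Fast operations
dt[, .(mean_value = mean(value)), by = group]  # grouped mean with a named result column
dt[, value_sq := value^2]                      # add a column by reference (no copy)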

# ===== PARALLEL COMPUTING =====
library(parallel)

# Detect cores (may return NA if the number is unknown)
ncores <- detectCores()
print(paste("Number of cores:", ncores))