Machine Learning Basics
Machine learning is the science of getting computers to learn from data and improve at tasks through experience, rather than following explicitly hand-coded rules. Learn the fundamentals of ML with Python's scikit-learn library, including supervised and unsupervised learning algorithms.
What is Machine Learning?
Machine Learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. It involves algorithms that can identify patterns in data and make predictions or decisions based on new data.
"Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed."
Types of Machine Learning
- Supervised Learning: Learning from labeled data to predict outcomes (contrasted with unsupervised learning in the sketch after this list)
- Unsupervised Learning: Finding patterns in unlabeled data
- Reinforcement Learning: Learning through interaction and rewards
- Semi-supervised Learning: Combination of labeled and unlabeled data
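As a quick illustration of the first two categories, the minimal sketch below fits a supervised classifier on labeled data and an unsupervised clustering model on the same data without its labels. The toy dataset and the choice of estimators here are arbitrary, purely for illustration:
# Supervised vs. unsupervised learning in scikit-learn (toy example)
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
X, y = make_blobs(n_samples=200, centers=3, random_state=0)  # features X, labels y
# Supervised: the model sees both X and the labels y
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted labels:", clf.predict(X[:5]))
# Unsupervised: the model sees only X and discovers groups on its own
km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
print("Cluster assignments:", km.labels_[:5])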
Installing Required Libraries
Setting up the machine learning environment:
# Install required packages
# pip install scikit-learn pandas numpy matplotlib seaborn
import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
print(f"Scikit-learn version: {sklearn.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
Data Preprocessing
Preparing data for machine learning:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
# Create sample dataset
np.random.seed(42)
data = {
'Age': np.random.normal(35, 10, 100),
'Salary': np.random.normal(50000, 15000, 100),
'Experience': np.random.randint(1, 20, 100),
'Education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 100),
'Department': np.random.choice(['HR', 'IT', 'Finance', 'Marketing'], 100),
'Performance': np.random.choice([1, 2, 3, 4, 5], 100) # Target variable
}
# Introduce some missing values
df = pd.DataFrame(data)
mask = np.random.random((100, 3)) < 0.1 # 10% missing values
df.loc[mask[:, 0], 'Age'] = np.nan
df.loc[mask[:, 1], 'Salary'] = np.nan
df.loc[mask[:, 2], 'Experience'] = np.nan
print("Dataset with missing values:")
print(df.head())
print(f"\nMissing values:\n{df.isnull().sum()}")
# Handle missing values
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary', 'Experience']] = imputer.fit_transform(df[['Age', 'Salary', 'Experience']])
print(f"\nAfter imputation:\n{df.isnull().sum()}")
# Encode categorical variables
label_encoder = LabelEncoder()
df['Education_encoded'] = label_encoder.fit_transform(df['Education'])
# One-hot encoding for department
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')
dept_encoded = onehot_encoder.fit_transform(df[['Department']])
dept_columns = ['Department_' + cat for cat in onehot_encoder.categories_[0][1:]]
df_dept = pd.DataFrame(dept_encoded, columns=dept_columns)
df = pd.concat([df, df_dept], axis=1)
print(f"\nAfter encoding:")
print(df[['Education', 'Education_encoded', 'Department'] + dept_columns].head())
# Feature scaling
scaler = StandardScaler()
numerical_features = ['Age', 'Salary', 'Experience']
df[numerical_features] = scaler.fit_transform(df[numerical_features])
print(f"\nAfter scaling (first 5 rows):")
print(df[numerical_features].head())
# Split features and target
X = df.drop(['Education', 'Department', 'Performance'], axis=1)
y = df['Performance']
print(f"\nFeatures shape: {X.shape}")
print(f"Target shape: {y.shape}")
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"\nTrain set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
Supervised Learning - Classification
Building classification models:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
# Create a simple classification dataset
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
n_informative=5, n_redundant=2, random_state=42)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("Classification Models Comparison")
print("=" * 40)
# Logistic Regression
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_test_scaled)
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, lr_pred):.3f}")
# Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train) # Decision trees don't need scaling
dt_pred = dt_model.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, dt_pred):.3f}")
# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_pred):.3f}")
# Support Vector Machine
svm_model = SVC(random_state=42)
svm_model.fit(X_train_scaled, y_train)
svm_pred = svm_model.predict(X_test_scaled)
print(f"SVM Accuracy: {accuracy_score(y_test, svm_pred):.3f}")
# Detailed classification report for Random Forest
print(f"\nRandom Forest Classification Report:")
print(classification_report(y_test, rf_pred))
# Confusion Matrix
cm = confusion_matrix(y_test, rf_pred)
print(f"\nConfusion Matrix:")
print(cm)
# Feature importance for Random Forest
feature_importance = pd.DataFrame({
'feature': [f'feature_{i}' for i in range(X.shape[1])],
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print(f"\nTop 5 Feature Importances:")
print(feature_importance.head())
# Cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(rf_model, X, y, cv=5)
print(f"\nCross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
Supervised Learning - Regression
Building regression models:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import matplotlib.pyplot as plt
# Create a regression dataset
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Regression Models Comparison")
print("=" * 30)
# Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)
print(f"Linear Regression R²: {r2_score(y_test, lr_pred):.3f}")
print(f"Linear Regression RMSE: {np.sqrt(mean_squared_error(y_test, lr_pred)):.3f}")
# Decision Tree Regressor
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
print(f"Decision Tree R²: {r2_score(y_test, dt_pred):.3f}")
print(f"Decision Tree RMSE: {np.sqrt(mean_squared_error(y_test, dt_pred)):.3f}")
# Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print(f"Random Forest R²: {r2_score(y_test, rf_pred):.3f}")
print(f"Random Forest RMSE: {np.sqrt(mean_squared_error(y_test, rf_pred)):.3f}")
# Compare predictions vs actual
comparison_df = pd.DataFrame({
'Actual': y_test,
'Linear Regression': lr_pred,
'Decision Tree': dt_pred,
'Random Forest': rf_pred
})
print(f"\nPrediction Comparison (first 10 samples):")
print(comparison_df.head(10))
# Feature importance for Random Forest
feature_importance = pd.DataFrame({
'feature': [f'feature_{i}' for i in range(X.shape[1])],
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print(f"\nFeature Importances:")
print(feature_importance)
# Residual analysis
residuals = y_test - rf_pred
print(f"\nResidual Statistics:")
print(f"Mean residual: {residuals.mean():.3f}")
print(f"Residual std: {residuals.std():.3f}")
print(f"Max residual: {residuals.max():.3f}")
print(f"Min residual: {residuals.min():.3f}")
# Plot actual vs. predicted values (matplotlib is imported above)
plt.scatter(y_test, rf_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted')
plt.show()
Unsupervised Learning - Clustering
Grouping data without labels:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Create sample clustering data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Clustering Algorithms Comparison")
print("=" * 35)
# K-Means Clustering
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)
print(f"K-Means Silhouette Score: {silhouette_score(X_scaled, kmeans_labels):.3f}")
print(f"K-Means Calinski-Harabasz Score: {calinski_harabasz_score(X_scaled, kmeans_labels):.3f}")
print(f"K-Means Cluster Centers: {kmeans.cluster_centers_}")
# DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=10)
dbscan_labels = dbscan.fit_predict(X_scaled)
# DBSCAN can find varying number of clusters (including noise labeled as -1)
n_clusters_dbscan = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_noise = list(dbscan_labels).count(-1)
print(f"\nDBSCAN found {n_clusters_dbscan} clusters and {n_noise} noise points")
if n_clusters_dbscan > 1:
    # Only calculate scores if we have more than 1 cluster
    mask = dbscan_labels != -1  # Exclude noise points
    if sum(mask) > 1:
        print(f"DBSCAN Silhouette Score (excluding noise): {silhouette_score(X_scaled[mask], dbscan_labels[mask]):.3f}")
# Agglomerative Clustering
agglo = AgglomerativeClustering(n_clusters=4)
agglo_labels = agglo.fit_predict(X_scaled)
print(f"\nAgglomerative Clustering Silhouette Score: {silhouette_score(X_scaled, agglo_labels):.3f}")
print(f"Agglomerative Clustering Calinski-Harabasz Score: {calinski_harabasz_score(X_scaled, agglo_labels):.3f}")
# Elbow method for K-Means (finding optimal k)
inertias = []
silhouette_scores = []
k_range = range(2, 11)
for k in k_range:
    kmeans_temp = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_temp = kmeans_temp.fit_predict(X_scaled)
    inertias.append(kmeans_temp.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, labels_temp))
print(f"\nElbow Method Results:")
for k, inertia, sil_score in zip(k_range, inertias, silhouette_scores):
    print(f"k={k}: Inertia={inertia:.2f}, Silhouette={sil_score:.3f}")
# Optimal k based on silhouette score
optimal_k = k_range[np.argmax(silhouette_scores)]
print(f"\nOptimal number of clusters (based on silhouette score): {optimal_k}")
# Cluster analysis
cluster_df = pd.DataFrame(X_scaled, columns=['Feature1', 'Feature2'])
cluster_df['KMeans_Cluster'] = kmeans_labels
cluster_df['DBSCAN_Cluster'] = dbscan_labels
cluster_df['Agglo_Cluster'] = agglo_labels
print(f"\nCluster Distribution (K-Means):")
print(cluster_df['KMeans_Cluster'].value_counts().sort_index())
print(f"\nCluster Distribution (DBSCAN):")
print(cluster_df['DBSCAN_Cluster'].value_counts().sort_index())
# Cluster centroids (for K-Means)
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
print(f"\nCluster Centroids (original scale):")
for i, centroid in enumerate(centroids):
    print(f"Cluster {i}: ({centroid[0]:.2f}, {centroid[1]:.2f})")
# Visualize the K-Means clusters (matplotlib is imported above)
plt.figure(figsize=(8, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans_labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', marker='x')
plt.title('K-Means Clustering')
plt.show()
Dimensionality Reduction
Reducing feature dimensions:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target
print(f"Original data shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# PCA (Principal Component Analysis)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"\nPCA Results:")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative explained variance: {np.cumsum(pca.explained_variance_ratio_)}")
print(f"PCA transformed shape: {X_pca.shape}")
# Find optimal number of components
pca_full = PCA()
pca_full.fit(X_scaled)
# Plot explained variance
explained_variance = pca_full.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)
print(f"\nExplained variance by component:")
for i, (var, cum_var) in enumerate(zip(explained_variance[:10], cumulative_variance[:10])):
    print(f"PC{i+1}: {var:.4f} (cumulative: {cum_var:.4f})")
# t-SNE (t-Distributed Stochastic Neighbor Embedding)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)  # uses the default iteration count; the n_iter argument was renamed max_iter in recent scikit-learn releases
X_tsne = tsne.fit_transform(X_scaled)
print(f"\nt-SNE transformed shape: {X_tsne.shape}")
# Compare PCA and t-SNE
print(f"\nDimensionality Reduction Comparison:")
print(f"PCA - Preserves global structure, fast, deterministic")
print(f"t-SNE - Preserves local structure, slower, non-deterministic")
# Clustering on reduced dimensions
from sklearn.cluster import KMeans
# K-Means on original data
kmeans_original = KMeans(n_clusters=10, random_state=42, n_init=10)
labels_original = kmeans_original.fit_predict(X_scaled)
# K-Means on PCA data
kmeans_pca = KMeans(n_clusters=10, random_state=42, n_init=10)
labels_pca = kmeans_pca.fit_predict(X_pca)
# K-Means on t-SNE data
kmeans_tsne = KMeans(n_clusters=10, random_state=42, n_init=10)
labels_tsne = kmeans_tsne.fit_predict(X_tsne)
from sklearn.metrics import adjusted_rand_score
ari_original = adjusted_rand_score(y, labels_original)
ari_pca = adjusted_rand_score(y, labels_pca)
ari_tsne = adjusted_rand_score(y, labels_tsne)
print(f"\nClustering Performance (Adjusted Rand Index):")
print(f"Original data: {ari_original:.3f}")
print(f"PCA reduced: {ari_pca:.3f}")
print(f"t-SNE reduced: {ari_tsne:.3f}")
# Feature importance in PCA
pca_components = pd.DataFrame(
pca.components_,
columns=[f'Pixel_{i}' for i in range(X.shape[1])],
index=['PC1', 'PC2']
)
print(f"\nPCA Component Analysis:")
print("Top 5 features for PC1:")
top_features_pc1 = pca_components.loc['PC1'].abs().sort_values(ascending=False).head()
print(top_features_pc1)
print("\nTop 5 features for PC2:")
top_features_pc2 = pca_components.loc['PC2'].abs().sort_values(ascending=False).head()
print(top_features_pc2)
# Reconstruction error
X_reconstructed = pca.inverse_transform(X_pca)
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"\nPCA Reconstruction MSE: {reconstruction_error:.6f}")
# Visualize the PCA and t-SNE projections and the cumulative explained variance
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.7)
plt.title('PCA Projection')
plt.colorbar()
plt.subplot(1, 3, 2)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', alpha=0.7)
plt.title('t-SNE Projection')
plt.colorbar()
plt.subplot(1, 3, 3)
plt.plot(range(1, len(explained_variance) + 1), cumulative_variance, 'bo-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Explained Variance')
plt.grid(True)
plt.tight_layout()
plt.show()
Model Evaluation and Validation
Evaluating and validating machine learning models:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, roc_curve, classification_report
from sklearn.metrics import confusion_matrix, mean_squared_error, r2_score
from sklearn.datasets import make_classification, make_regression
import matplotlib.pyplot as plt
# Classification evaluation
print("Classification Model Evaluation")
print("=" * 35)
X_clf, y_clf = make_classification(n_samples=1000, n_features=10, n_classes=2,
random_state=42)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
X_clf, y_clf, test_size=0.3, random_state=42)
# Train a model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_clf, y_train_clf)
y_pred_clf = clf.predict(X_test_clf)
y_pred_proba_clf = clf.predict_proba(X_test_clf)[:, 1]
# Classification metrics
accuracy = accuracy_score(y_test_clf, y_pred_clf)
precision = precision_score(y_test_clf, y_pred_clf)
recall = recall_score(y_test_clf, y_pred_clf)
f1 = f1_score(y_test_clf, y_pred_clf)
auc = roc_auc_score(y_test_clf, y_pred_proba_clf)
print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-Score: {f1:.3f}")
print(f"AUC-ROC: {auc:.3f}")
# Confusion Matrix
cm = confusion_matrix(y_test_clf, y_pred_clf)
print(f"\nConfusion Matrix:")
print(cm)
# Classification Report
print(f"\nClassification Report:")
print(classification_report(y_test_clf, y_pred_clf))
# Regression evaluation
print("\n\nRegression Model Evaluation")
print("=" * 30)
X_reg, y_reg = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
X_reg, y_reg, test_size=0.3, random_state=42)
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train_reg, y_train_reg)
y_pred_reg = reg.predict(X_test_reg)
# Regression metrics
mse = mean_squared_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_test_reg - y_pred_reg))
r2 = r2_score(y_test_reg, y_pred_reg)
print(f"Mean Squared Error: {mse:.3f}")
print(f"Root Mean Squared Error: {rmse:.3f}")
print(f"Mean Absolute Error: {mae:.3f}")
print(f"R² Score: {r2:.3f}")
# Cross-validation
print(f"\nCross-Validation Scores:")
cv_scores_clf = cross_val_score(clf, X_clf, y_clf, cv=5, scoring='accuracy')
cv_scores_reg = cross_val_score(reg, X_reg, y_reg, cv=5, scoring='r2')
print(f"Classification CV Accuracy: {cv_scores_clf.mean():.3f} (+/- {cv_scores_clf.std() * 2:.3f})")
print(f"Regression CV R²: {cv_scores_reg.mean():.3f} (+/- {cv_scores_reg.std() * 2:.3f})")
# Hyperparameter tuning with GridSearchCV
print(f"\nHyperparameter Tuning:")
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=3,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X_train_clf, y_train_clf)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
# Evaluate best model
best_model = grid_search.best_estimator_
best_pred = best_model.predict(X_test_clf)
best_accuracy = accuracy_score(y_test_clf, best_pred)
print(f"Best model test accuracy: {best_accuracy:.3f}")
# Learning curves: compare training and validation scores as the training set grows
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(clf, X_clf, y_clf, cv=5)
print(f"\nLearning Curves Analysis:")
print(f"Training set sizes: {train_sizes}")
print(f"Mean training accuracy: {np.round(train_scores.mean(axis=1), 3)}")
print(f"Mean validation accuracy: {np.round(val_scores.mean(axis=1), 3)}")
# Plot training vs. validation scores against train_sizes to check for overfitting/underfitting
# Feature importance
feature_importance = pd.DataFrame({
'feature': [f'feature_{i}' for i in range(X_clf.shape[1])],
'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)
print(f"\nTop 5 Feature Importances:")
print(feature_importance.head())
# ROC curve for the binary classifier (roc_curve and matplotlib are imported above)
print(f"\nROC Curve Analysis:")
fpr, tpr, thresholds = roc_curve(y_test_clf, y_pred_proba_clf)
plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
Best Practices
- Data preprocessing: Clean, normalize, and encode data properly
- Train-test split: Always evaluate on unseen data
- Cross-validation: Use k-fold CV for robust evaluation
- Hyperparameter tuning: Optimize model parameters systematically
- Feature engineering: Create meaningful features from raw data
- Model selection: Choose appropriate algorithms for your problem
- Ensemble methods: Combine multiple models for better performance
- Regularization: Prevent overfitting with appropriate techniques (see the sketch after this list)
- Interpretability: Understand model decisions when possible
- Scalability: Consider computational requirements
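To make the regularization point concrete, the minimal sketch below compares an unregularized linear model with ridge regression (an L2 penalty) via cross-validation on a small, noisy dataset with many features. The dataset parameters and the alpha value are arbitrary choices for illustration, not a recommendation:
# Regularization sketch: ridge (L2 penalty) vs. plain linear regression
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
# Few samples, many features: a setting where overfitting is easy
X, y = make_regression(n_samples=80, n_features=60, n_informative=10, noise=20.0, random_state=42)
for name, model in [('LinearRegression', LinearRegression()), ('Ridge(alpha=10)', Ridge(alpha=10.0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"{name}: mean CV R2 = {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")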
Common Challenges and Solutions
- Overfitting: Use regularization, cross-validation, early stopping
- Underfitting: Increase model complexity, add features
- Imbalanced data: Use SMOTE, class weights, resampling (class weights are shown in the sketch after this list)
- Missing data: Imputation, model-based handling
- Feature scaling: Standardize/normalize features appropriately
- Categorical variables: One-hot encoding, label encoding
- High dimensionality: Feature selection, dimensionality reduction
- Computational cost: Optimize algorithms, use sampling
- Data leakage: Proper train-test separation
- Model deployment: Consider production requirements
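As one example of handling imbalanced data, the sketch below compares a default classifier with one that uses class_weight='balanced' on a skewed binary problem. The dataset and class proportions are illustrative; resampling approaches such as SMOTE (from the separate imbalanced-learn package) are another option not shown here:
# Imbalanced data sketch: class weights vs. default fitting
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Roughly 95% of samples in class 0, 5% in class 1
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_train, y_train)
print("Without class weights:")
print(classification_report(y_test, plain.predict(X_test)))
print("With class_weight='balanced':")
print(classification_report(y_test, weighted.predict(X_test)))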
Machine learning is a rapidly evolving field with countless applications. Scikit-learn provides an excellent foundation for building ML models, but remember that the quality of your data and feature engineering often matter more than the choice of algorithm. Always start with simple models and gradually increase complexity as needed.