The Basics of Python: SciKit-Learn
Scikit-learn is a Python library that is widely used for machine learning tasks such as classification, regression, and clustering. It provides a wide range of machine learning algorithms and tools for data preprocessing, model selection, and evaluation.
Some of the most important functions and classes in scikit-learn include:
sklearn.model_selection.train_test_split(): splits data into training and testing sets for model training and evaluation
sklearn.preprocessing.StandardScaler(): scales each feature to have a mean of 0 and a standard deviation of 1
sklearn.pipeline.Pipeline(): chains together multiple machine learning steps into a single pipeline
sklearn.linear_model.LogisticRegression(): fits a logistic regression classifier, most often used for binary classification (multiclass problems are also supported)
sklearn.ensemble.RandomForestClassifier(): performs random forest classification for both binary and multiclass problems
sklearn.cluster.KMeans(): performs K-means clustering for unsupervised learning problems
sklearn.metrics.accuracy_score(): computes the accuracy (fraction of correct predictions) of a machine learning model on test data
sklearn.metrics.precision_score(): computes the precision (fraction of predicted positives that are actually positive) of a machine learning model on test data
sklearn.metrics.recall_score(): computes the recall (fraction of actual positives that the model finds) of a machine learning model on test data
sklearn.metrics.f1_score(): computes the F1 score (harmonic mean of precision and recall) of a machine learning model on test data
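To make the four metrics in the list above concrete, here is a minimal sketch that applies them to a small set of labels. The label values are made up for illustration and do not come from any real dataset; the variable names are assumptions as well.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true labels and model predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]

# Counting by hand: 2 true positives, 1 false positive,
# 2 false negatives, 3 true negatives.
accuracy = accuracy_score(y_true, y_pred)    # (2 + 3) / 8
precision = precision_score(y_true, y_pred)  # 2 / (2 + 1)
recall = recall_score(y_true, y_pred)        # 2 / (2 + 2)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 score:', f1)
```

Because the four numbers differ on this toy data, the sketch also shows why accuracy alone can hide how a model treats the positive class.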
Here's an example of how to use scikit-learn to build and evaluate a machine learning model:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# load data from a CSV file
data = pd.read_csv('data.csv')
# split data into features (X) and target (y)
X = data.drop(columns='target')
y = data['target']
# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# create a pipeline with scaling and logistic regression steps
pipeline = Pipeline(steps=[('scaler', StandardScaler()), ('logreg', LogisticRegression())])
# fit the pipeline on the training data
pipeline.fit(X_train, y_train)
# predict the target values for the test data
y_pred = pipeline.predict(X_test)
# evaluate the accuracy of the model on the test data
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
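The example above depends on a data.csv file with a target column. As a self-contained variation, the sketch below follows the same workflow on synthetic data generated with make_classification, swaps in RandomForestClassifier, and reports precision, recall, and F1 alongside accuracy. The dataset sizes, the step name 'forest', and the hyperparameters are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for data.csv (an assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# Hold out 20% of the rows for testing; stratify keeps the class balance similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Same pipeline shape as before, with a random forest as the final step.
# Tree-based models do not need scaling; the scaler is kept only to mirror
# the structure of the earlier example.
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('forest', RandomForestClassifier(n_estimators=100, random_state=42)),
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Evaluate with several metrics rather than accuracy alone.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 score:', f1)
```

Fixing random_state in the data generator, the split, and the classifier makes the run repeatable, which helps when comparing models.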