Data Preprocessing and Visualization for AI: A Complete Guide (Lecture 5)

In this lecture, we’ll cover data preprocessing—the crucial step to ensure your AI models work with clean, structured, and meaningful data. We’ll also explore data visualization techniques to better understand your dataset.


Table of Contents

{% toc %}


1) Why Data Preprocessing Matters

Model performance depends heavily on data quality. Even the most advanced algorithms can fail if the input data is noisy or inconsistent.

Preprocessing Goals:

  1. Handle missing values
  2. Detect and manage outliers
  3. Scale features for model compatibility
  4. Encode categorical variables
  5. Use visualization for deeper insight

2) Handling Missing Values

2.1 Checking for Missing Data

import pandas as pd

df = pd.read_csv("data.csv")
print(df.isnull().sum())
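
If you also want the share of missing values per column (useful when deciding whether to drop or fill), one quick way is to take the column-wise mean of the boolean mask:

# Fraction of missing values per column, largest first
print(df.isnull().mean().sort_values(ascending=False))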

2.2 Filling or Removing Missing Values

  • Drop rows (or columns) that contain missing values:

df = df.dropna()  # drops rows by default; use df.dropna(axis=1) to drop columns instead

  • Fill with the mean, median, or mode (a mode example for categorical columns follows below):

df['age'] = df['age'].fillna(df['age'].mean())
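
For categorical columns the mean isn't defined, so the mode (most frequent value) is the usual choice. A minimal sketch, assuming a hypothetical text column named 'city':

# Fill a categorical column with its most frequent value (mode)
df['city'] = df['city'].fillna(df['city'].mode()[0])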

3) Detecting and Handling Outliers

  • Outlier: a data point that lies unusually far from the rest of the distribution.

  • Detection Methods:

    • Statistical: IQR (Interquartile Range)
    • Visualization: Boxplots
Q1 = df['value'].quantile(0.25)  # 25th percentile
Q3 = df['value'].quantile(0.75)  # 75th percentile
IQR = Q3 - Q1
outliers = df[(df['value'] < Q1 - 1.5*IQR) | (df['value'] > Q3 + 1.5*IQR)]  # 1.5*IQR rule
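
Detection is only half the job. Two common ways to handle the flagged points are dropping them or capping (clipping) them at the IQR bounds; a short sketch building on the Q1, Q3, and IQR values computed above:

lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR

# Option 1: drop the outlier rows
df_no_outliers = df[(df['value'] >= lower) & (df['value'] <= upper)]

# Option 2: cap (winsorize) values at the bounds instead of dropping them
df['value_capped'] = df['value'].clip(lower=lower, upper=upper)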

4) Feature Scaling

Scaling puts features on comparable ranges so that no feature dominates the model simply because its raw values are larger.

  • Standardization: Mean = 0, Std = 1
  • Normalization: Range = 0–1
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['feature1', 'feature2']])
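
MinMaxScaler is imported above but not used; for completeness, here is the normalization counterpart, which maps each feature into the 0–1 range (a sketch using the same placeholder feature1/feature2 columns):

min_max = MinMaxScaler()
df_normalized = min_max.fit_transform(df[['feature1', 'feature2']])
# fit_transform returns a NumPy array; every column now lies in [0, 1]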

5) Encoding Categorical Variables

Most models can't work with raw text categories, so they must be converted to numeric form.

One-Hot Encoding:

df_encoded = pd.get_dummies(df, columns=['category'])
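
It's worth checking what the encoding produced: get_dummies adds one indicator column per category level. If the redundant column bothers you, drop_first=True removes one level per variable (an optional tweak, not required):

# Inspect the indicator columns created for 'category'
print(df_encoded.filter(like='category').head())

# Optional: drop one indicator per variable to avoid redundancy
df_encoded = pd.get_dummies(df, columns=['category'], drop_first=True)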

6) Data Visualization

6.1 Histogram

import matplotlib.pyplot as plt

df['age'].hist(bins=20)
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

6.2 Boxplot

df.boxplot(column='value', by='category')
plt.show()

6.3 Scatter Plot

plt.scatter(df['feature1'], df['feature2'])
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

7) Lab: Preprocessing and Visualizing the Iris Dataset

from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Load dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Check missing values
print(df.isnull().sum())

# Standardize features
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df.iloc[:, :-1])

# Visualization
plt.scatter(df['sepal length (cm)'], df['sepal width (cm)'], c=df['target'])
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.show()
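
The lab standardizes the features but never inspects the result. As an optional follow-up (a sketch, not part of the original lab), you can wrap the scaled array back into a labeled DataFrame and re-plot it:

# Wrap the scaled NumPy array back into a DataFrame with the original column names
df_scaled = pd.DataFrame(df_scaled, columns=iris.feature_names)
print(df_scaled.describe())  # each column should now have mean ~0 and std ~1

plt.scatter(df_scaled['sepal length (cm)'], df_scaled['sepal width (cm)'], c=df['target'])
plt.xlabel('Sepal Length (standardized)')
plt.ylabel('Sepal Width (standardized)')
plt.show()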

8) Key Takeaways

  • Preprocessing improves data quality and model accuracy.
  • Handle missing values before training.
  • Detect and mitigate outliers to avoid skewed results.
  • Scale features for fair model contribution.
  • Visualize data to identify trends and patterns.

9) What’s Next?

In Lecture 6, we’ll move into Supervised Learning Practice—building classification and regression models from scratch.