Reducing input and output features is at the core of model design and optimization.
By default, everything is fully connected. Your model's first layer (nn.Linear(784, 128)) takes 784 input features because each MNIST image is 28×28 pixels flattened into one vector.
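For reference, a minimal sketch of that fully connected baseline (the 128-unit hidden layer comes from the original snippet; the second layer and activation are assumptions):

```python
import torch
import torch.nn as nn

class BaselineMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)  # 784 inputs: one per flattened pixel
        self.fc2 = nn.Linear(128, 10)   # 10 MNIST classes (assumed output head)

    def forward(self, x):
        x = x.view(x.size(0), -1)       # flatten (batch, 1, 28, 28) -> (batch, 784)
        return self.fc2(torch.relu(self.fc1(x)))
```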
One way to shrink this is to insert convolution and pooling before the fully connected layers:

```python
self.conv1 = nn.Conv2d(1, 16, 3, padding=1)  # 1 channel -> 16 feature maps, stays 28x28 (padding=1)
self.pool = nn.MaxPool2d(2, 2)               # 28x28 -> 14x14
```
Each pooling step halves width and height, dramatically reducing the spatial size before flattening.
Result
Watch the channel count, though: 16×14×14 = 3136 flattened features, which is larger than the original 784, not smaller. To genuinely shrink the input to the fully connected layer, pool a second time (14×14 → 7×7) or use fewer feature maps, e.g. 8×7×7 = 392 instead of 784, and those 392 values are learned features rather than raw pixels.
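Putting it together, a sketch with a second conv + pool stage (the second conv layer and its 8 channels are assumptions, not from the snippet above):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)  # 1x28x28 -> 16x28x28
        self.conv2 = nn.Conv2d(16, 8, 3, padding=1)  # 16x14x14 -> 8x14x14 (assumed)
        self.pool = nn.MaxPool2d(2, 2)               # halves width and height
        self.fc1 = nn.Linear(8 * 7 * 7, 128)         # 392 inputs instead of 784

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))     # -> 16x14x14
        x = self.pool(torch.relu(self.conv2(x)))     # -> 8x7x7
        x = x.view(x.size(0), -1)                    # flatten to (batch, 392)
        return self.fc1(x)
```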
Before feeding data into your model
- Apply Principal Component Analysis (PCA) to reduce redundant features.
- Or pretrain an autoencoder to compress the data into a smaller latent vector (see the sketch after the PCA example).
Example (PCA using sklearn):

```python
from sklearn.decomposition import PCA

# Keep the 100 strongest components; fit on training data only,
# then reuse pca.transform() on validation/test data
pca = PCA(n_components=100)
X_reduced = pca.fit_transform(X_original)  # (n_samples, 784) -> (n_samples, 100)
```
➡ Instead of 784 features, your input layer can be nn.Linear(100, 128), which is much smaller and faster.
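For the autoencoder route mentioned above, a minimal sketch (layer sizes and the 100-dim latent are assumptions):

```python
import torch.nn as nn

# Compress 784 pixels into a 100-dim latent code, then reconstruct
class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 100),               # latent vector of size 100
        )
        self.decoder = nn.Sequential(
            nn.Linear(100, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid()  # pixel values in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```

Train it to reconstruct its input (e.g. with nn.MSELoss), then feed encoder(x) into nn.Linear(100, 128) exactly as in the PCA case.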
Another option is to downsample the images themselves with torchvision:

```python
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((14, 14)),  # reduce from 28x28 to 14x14
    transforms.ToTensor()
])
```
This reduces 784 → 196 features per image.
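Hooked up to the standard MNIST dataset, for example (the "data" root path is a placeholder):

```python
from torch.utils.data import DataLoader
from torchvision import datasets

# Each image now arrives as a 1x14x14 tensor: 196 features when flattened
train_set = datasets.MNIST(root="data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
```

The model's first layer then becomes nn.Linear(196, 128).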
The output size is determined by your task
- MNIST = 10 digits → output size = 10
- CIFAR-100 = 100 classes → output size = 100
- Binary classification → output size = 1
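As a sketch, the matching output heads and losses (the 64-unit hidden size is an assumption):

```python
import torch.nn as nn

multiclass_head = nn.Linear(64, 10)  # 10 logits, pair with nn.CrossEntropyLoss
binary_head = nn.Linear(64, 1)       # 1 logit, pair with nn.BCEWithLogitsLoss
```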
If you don't need fine-grained labels, merge similar classes. Example:
- "cat", "dog", "rabbit" → "animal"
- "car", "bus", "truck" → "vehicle"
```python
self.fc3 = nn.Linear(64, 5)  # 5 merged classes instead of 10
```
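Merging also means remapping the labels. A sketch using torchvision's target_transform (the specific fine-to-coarse grouping is a made-up placeholder):

```python
from torchvision import datasets, transforms

# Hypothetical mapping of 10 fine labels onto 5 merged labels
fine_to_coarse = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4}

train_set = datasets.MNIST(root="data", train=True, download=True,
                           transform=transforms.ToTensor(),
                           target_transform=lambda y: fine_to_coarse[y])
```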
Instead of one output head predicting everything, split the problem into stages (see the sketch after this list):
- Stage 1: Predict the category type (animal vs vehicle)
- Stage 2: Predict the sub-class (cat vs dog)

This reduces the size of each output layer and improves interpretability.
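A minimal two-head sketch of the idea (all layer sizes are assumptions; here both heads share one trunk, while a stricter two-stage pipeline would train a separate sub-class model per category):

```python
import torch
import torch.nn as nn

class TwoStageClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
        self.stage1 = nn.Linear(128, 2)  # category: animal vs vehicle
        self.stage2 = nn.Linear(128, 3)  # sub-class within the category

    def forward(self, x):
        h = self.trunk(x)
        return self.stage1(h), self.stage2(h)
```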
Benefits
| Benefit | Explanation |
|---|---|
| Faster training | Fewer weights to update |
| Less overfitting | Model focuses on key patterns |
| Lower memory cost | Smaller tensors and gradients |
| Better generalization | Simpler model → less noise fitting |
If you reduce too much:
- The model might lose important information (underfitting)
- Accuracy can drop sharply
➡ Always monitor validation loss, and apply dimensionality reduction and model tuning together.
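A minimal early-stopping sketch for that monitoring (train_one_epoch and evaluate are hypothetical stand-ins for your training and validation loops):

```python
best_val = float("inf")
patience, bad_epochs = 3, 0

for epoch in range(50):
    train_one_epoch(model, train_loader)    # hypothetical training step
    val_loss = evaluate(model, val_loader)  # hypothetical validation pass

    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0  # improvement: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= patience:          # no improvement for 3 epochs in a row
            print(f"Early stopping at epoch {epoch}")
            break
```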
| Goal | Method | Example |
|---|---|---|
| Reduce Input | CNN, PCA, Downsampling | 28×28 → 14×14 or 784 → 100 |
| Reduce Output | Class merging, hierarchical classification | 10 → 5 or 10 → 2-stage |
| Keep Accuracy | Regularization, dropout, early stopping | Monitor validation loss for underfitting |