Since we do not know in advance which feature's weight should be decreased, we add a regularization term for every feature's weight.
The regularization term is added to the cost function, and the combined cost is then minimized.
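As a sketch, assuming the squared-error cost of linear regression with weights $w$ and bias $b$, the regularized cost function looks like this (the notation $f_{w,b}$, $x^{(i)}$, $y^{(i)}$ is assumed, not taken from the text above):

$$
J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2
$$

The first sum is the usual data-fitting term; the second sum penalizes every weight $w_j$, which is why no single feature has to be singled out in advance.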
A small lambda value can cause overfitting:
A large lambda value can cause underfitting:
Lambda is divided by 2m to keep the regularization term on the same scale as the first term. This way, as the number of training examples m grows, the same lambda value can still be used.
b^2 could also be added to the penalty, but minimizing it has minimal effect, so the bias term is usually left unregularized.
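The cost described above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation; the function name and arguments are chosen for the example. Note that the penalty sums only the weights `w`, not the bias `b`, and that lambda is divided by `2 * m` to match the scale of the data-fitting term.

```python
import numpy as np

def regularized_cost(X, y, w, b, lam):
    """L2-regularized squared-error cost for linear regression (illustrative sketch).

    X   : (m, n) feature matrix
    y   : (m,)   targets
    w   : (n,)   weights
    b   : float  bias (not penalized)
    lam : float  regularization strength (lambda)
    """
    m = X.shape[0]
    predictions = X @ w + b
    # Data-fitting term: mean squared error, divided by 2
    mse_term = np.sum((predictions - y) ** 2) / (2 * m)
    # Penalty term: lambda / (2m) keeps it on the same scale as the first term,
    # so the same lambda works as m grows. The bias b is deliberately excluded.
    reg_term = (lam / (2 * m)) * np.sum(w ** 2)
    return mse_term + reg_term
```

With `lam = 0` this reduces to the plain squared-error cost; increasing `lam` shrinks the weights toward zero, trading a worse fit on the training data for a simpler model.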