🤖 ML

Session 10 - Linear Regression

📅 May 13, 2026

Linear Regression Notes

📊 LINEAR REGRESSION

Complete Study Notes - From Basics to Advanced

1. Introduction to Linear Regression

🎯 In One Line

Linear Regression is a supervised learning algorithm that fits a straight-line relationship between an input and an output, then uses that line to predict new values.

In simple words: "Give me an input and I'll predict the output - by following a straight line."

🏠 Real Life Examples

Input (X)	→	Output (Y)
House size (sqft)	→	House price
Hours studied	→	Exam marks
Age	→	Salary
Advertising spend	→	Sales
Temperature	→	Ice cream sales

📐 Two Kinds of Variables

1. Independent Variable (X): the input feature - the value we already know. (Cause)

2. Dependent Variable (Y): the output / target - the value we want to predict. (Effect)

💡 Memory Trick

X = Cause → "X causes Y"

Y = Effect → Y depends on X

That is why Y is called "dependent" and X is called "independent".

This is only a naming aid: a fitted line shows how X and Y move together, and does not by itself prove that X causes Y.

📊 The Main Equation

y = β₀ + β₁x

(Same form as the school equation y = mx + c, with m = β₁ and c = β₀)

Here:

y = predicted value (output)
x = input feature
β₀ (beta zero) = Intercept
β₁ (beta one) = Coefficient / Slope

2. Understanding the Intercept and the Coefficient

🎯 What is the Intercept (β₀)?

Definition: Intercept = the value at which the line crosses the Y-axis.

In other words: the value of Y when X = 0 is the intercept.

It is the line's "starting point".

Real-life Example:

Price = 50,000 + 3000 × Size

A model that predicts house price

Here β₀ = 50,000 means:

Even at Size = 0 sqft, the model gives a base price of 50,000.
In practice the size is never 0, but mathematically this is the starting point.
It can be read as a fixed cost, such as the base value of the land or registration fees.

📈 What is the Coefficient (β₁)?

Definition: Coefficient = the slope of the line.

In other words: how much Y rises or falls when X increases by 1 unit.

It shows how steep the line is.

In the example above, β₁ = 3000 means:

→ Each additional 1 sqft of size raises the predicted price by 3000.

3 Interpretations of the Coefficient:

Value of β₁	Relationship	Example
β₁ > 0 (Positive)	Y increases as X increases	Study time ↑ → marks ↑
β₁ < 0 (Negative)	Y decreases as X increases	Temperature ↑ → heater sales ↓
β₁ = 0	No linear relationship	Shoe size → IQ

💡 Memory Trick

β₀ = "Where to start" - where does the line begin?

β₁ = "How fast" - how steep is the line?

Like a taxi fare: β₀ = base fare, β₁ = per km charge
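A minimal sketch of the house-price line from this section, to show the two roles side by side. The names `BETA_0`, `BETA_1` and `predict_price` are illustrative only and are not part of the original notes.

```python
# Intercept and slope taken from the example: Price = 50,000 + 3000 × Size
BETA_0 = 50_000  # intercept: predicted price when size is 0
BETA_1 = 3_000   # slope: change in price per extra sqft

def predict_price(size_sqft):
    """Return the price predicted by the straight-line model."""
    return BETA_0 + BETA_1 * size_sqft

print(predict_price(1000))                        # 3050000
print(predict_price(1001) - predict_price(1000))  # 3000, i.e. the slope
```

Moving one sqft along the line always changes the prediction by exactly β₁, wherever you start.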

3. Formulas and Manual Calculation

β₀ and β₁ are found with the Ordinary Least Squares (OLS) method.

📐 The Two Main Formulas

Formula for the coefficient (β₁):

β₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²

Formula for the intercept (β₀):

β₀ = ȳ - β₁ × x̄

(Compute β₁ first, then β₀)

where x̄ and ȳ are the means (averages) of X and Y.

🔢 Full Calculation Example

Suppose our data looks like this - hours studied vs marks:

X (Hours studied)	Y (Marks)
1	2
2	4
3	5
4	4
5	5

Step 1: Compute the means

x̄ = (1+2+3+4+5) / 5 = 3

ȳ = (2+4+5+4+5) / 5 = 4

Step 2: Build the calculation table

xᵢ	yᵢ	xᵢ - x̄	yᵢ - ȳ	(xᵢ-x̄)(yᵢ-ȳ)	(xᵢ-x̄)²
1	2	-2	-2	4	4
2	4	-1	0	0	1
3	5	0	1	0	0
4	4	1	0	0	1
5	5	2	1	2	4
Sum:				6	10

Step 3: Compute β₁

β₁ = 6 / 10 = 0.6

Step 4: Compute β₀

β₀ = 4 - (0.6 × 3) = 4 - 1.8 = 2.2

✨ Final Best Fit Line:

y = 2.2 + 0.6x

This is our trained model!

🔮 How to Predict?

If someone studies for 6 hours, what marks do we expect?

y = 2.2 + 0.6 × 6 = 5.8 marks

💡 OLS in 3 Steps

1. Compute the means (x̄ and ȳ)

2. Compute β₁ (from the formula)

3. Use β₁ to compute β₀ → best fit line ready!
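The three steps can be sketched in plain Python on the same study-hours data. The function name `fit_simple_ols` is an assumption made for this example.

```python
xs = [1, 2, 3, 4, 5]   # hours studied
ys = [2, 4, 5, 4, 5]   # marks

def fit_simple_ols(xs, ys):
    """Return (beta_0, beta_1) from the two OLS formulas."""
    n = len(xs)
    x_mean = sum(xs) / n                       # step 1: means
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta_1 = s_xy / s_xx                       # step 2: slope
    beta_0 = y_mean - beta_1 * x_mean          # step 3: intercept
    return beta_0, beta_1

beta_0, beta_1 = fit_simple_ols(xs, ys)
print(round(beta_0, 3), round(beta_1, 3))      # 2.2 0.6
print(round(beta_0 + beta_1 * 6, 3))           # 5.8 (prediction for 6 hours)
```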

4. What is Error (Loss)?

Definition of Error

The difference between the actual value (y) and the predicted value (ŷ) is the error.

Error = y - ŷ

Our goal: make this error as small as possible.

📏 3 Common Metrics for Measuring Error

1️⃣ Mean Squared Error (MSE) - the most popular

MSE = (1/n) × Σ(yᵢ - ŷᵢ)²

2️⃣ Root Mean Squared Error (RMSE)

RMSE = √MSE

3️⃣ Mean Absolute Error (MAE)

MAE = (1/n) × Σ|yᵢ - ŷᵢ|
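As a rough illustration, the three metrics can be computed for the line y = 2.2 + 0.6x on the study-hours data from Section 3. Variable names are assumptions for this sketch.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
y_hats = [2.2 + 0.6 * x for x in xs]          # predictions from the fitted line

n = len(ys)
errors = [y - y_hat for y, y_hat in zip(ys, y_hats)]
mse = sum(e * e for e in errors) / n           # mean of squared errors
rmse = math.sqrt(mse)                          # back in the units of y
mae = sum(abs(e) for e in errors) / n          # mean of absolute errors

print(round(mse, 3), round(rmse, 3), round(mae, 3))   # 0.48 0.693 0.64
```

RMSE and MAE are in the same units as y (marks here), which makes them easier to read than MSE.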

❓ Why Square the Errors in MSE?

Reason	Explanation
1. No cancellation	Positive and negative errors cannot cancel each other out
2. Penalty for large errors	Squaring makes large errors count much more
3. Differentiable	Easy to differentiate (needed for gradient descent)

5. Method 1: OLS (Ordinary Least Squares)

What is OLS?

OLS = a method that uses calculus to get the exact solution directly.

No iteration is needed - the answer comes out in one shot.

The calculation we did in Section 3 above is OLS.

🧠 The Core Idea of OLS

Take the partial derivatives of MSE with respect to β₀ and β₁ and set them equal to 0.

∂MSE/∂β₀ = 0

∂MSE/∂β₁ = 0
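A quick numerical check of this idea, assuming the study-hours data from Section 3: at the OLS solution (2.2, 0.6) both partial derivatives of MSE should vanish. The helper `mse_gradient` is a name invented for this sketch.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def mse_gradient(beta_0, beta_1):
    """Partial derivatives of MSE = (1/n) Σ (y - β₀ - β₁x)²."""
    n = len(xs)
    residuals = [y - (beta_0 + beta_1 * x) for x, y in zip(xs, ys)]
    d_beta_0 = -2 / n * sum(residuals)
    d_beta_1 = -2 / n * sum(r * x for r, x in zip(residuals, xs))
    return d_beta_0, d_beta_1

print(mse_gradient(2.2, 0.6))   # both ~0, up to floating-point rounding
print(mse_gradient(0.0, 0.0))   # clearly non-zero away from the solution
```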

✅ Pros and Cons of OLS

✅ Pros	❌ Cons
Direct exact answer	Slow on large datasets with many features
No iteration needed	Matrix inversion is expensive (O(p³) in the number of features p)
Very fast on small data	Can be impractical when the data does not fit in memory (e.g. millions of rows with many features)
No hyperparameters to tune	Not suited to online learning

6. Method 2: Gradient Descent (In Detail)

🏔️ Intuition (Mountain Analogy)

Imagine

You are at the top of a Himalayan peak and need to get down to the valley. But thick fog means you cannot see anything.

What do you do?

→ Feel with the soles of your feet - which way does the ground slope down?

→ Take a small step in that direction.

→ Repeat again and again.

→ Gradually you reach the valley.

Gradient Descent does exactly this!

Mountain Analogy	In Gradient Descent
Mountain top	Initial β₀, β₁ (random, or simply 0)
Valley	Best β₀, β₁ (minimum error)
Direction of the slope	Gradient (derivative); we step against it
Step size	Learning rate (α)
Reaching the valley	Convergence - the best fit line is found

📐 Cost Function

J(β₀, β₁) = (1/2n) × Σ(yᵢ - ŷᵢ)²

where ŷᵢ = β₀ + β₁ × xᵢ

🔧 Update Rules (Main Formulas)

In every iteration β₀ and β₁ are updated like this:

β₀ = β₀ - α × (∂J/∂β₀)

β₁ = β₁ - α × (∂J/∂β₁)

Partial Derivatives:

∂J/∂β₀ = -(1/n) × Σ(yᵢ - ŷᵢ)

∂J/∂β₁ = -(1/n) × Σ(yᵢ - ŷᵢ) × xᵢ
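The update rules translate almost line for line into code. This is a minimal sketch on the study-hours data from Section 3; the function names and the choice of α = 0.01 with zero starting values are assumptions for illustration.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def gradients(beta_0, beta_1):
    """Return (∂J/∂β₀, ∂J/∂β₁) for J = (1/2n) Σ (y - ŷ)²."""
    n = len(xs)
    errors = [y - (beta_0 + beta_1 * x) for x, y in zip(xs, ys)]
    grad_0 = -sum(errors) / n
    grad_1 = -sum(e * x for e, x in zip(errors, xs)) / n
    return grad_0, grad_1

def gd_step(beta_0, beta_1, alpha):
    """One update; both parameters use gradients from the same old values."""
    grad_0, grad_1 = gradients(beta_0, beta_1)
    return beta_0 - alpha * grad_0, beta_1 - alpha * grad_1

print(gradients(0.0, 0.0))             # (-4.0, -13.2)
print(gd_step(0.0, 0.0, alpha=0.01))   # approximately (0.04, 0.132)
```

Computing both gradients before touching either parameter matters: updating β₀ first and then reusing it for β₁ would be a different algorithm.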

🔢 Step-by-Step Calculation

Initial Setup

β₀ = 0, β₁ = 0, α = 0.01, n = 5 (same study-hours data as in Section 3)

Goal: find the best β₀ and β₁

🔄 Iteration 1

Step A: Compute the predictions

ŷᵢ = 0 + 0 × xᵢ = 0

Step B: Errors table

xᵢ	yᵢ	ŷᵢ	(yᵢ - ŷᵢ)	(yᵢ - ŷᵢ)·xᵢ
1	2	0	2	2
2	4	0	4	8
3	5	0	5	15
4	4	0	4	16
5	5	0	5	25
Sum:			20	66

Step C: Calculate the gradients

∂J/∂β₀ = -(1/5) × 20 = -4

∂J/∂β₁ = -(1/5) × 66 = -13.2

Step D: Update β₀ and β₁

β₀ = 0 - 0.01 × (-4) = 0.04

β₁ = 0 - 0.01 × (-13.2) = 0.132

Step E: Compute the cost (J) for the predictions above (that is, at β₀ = 0, β₁ = 0, before the update)

J = (1/10) × [4+16+25+16+25] = 8.6

Iteration 1 Result

β₀ = 0.04, β₁ = 0.132, Cost before the update = 8.6

🔄 Iteration 2

New predictions: ŷ = 0.04 + 0.132·x

xᵢ	yᵢ	ŷᵢ	(yᵢ - ŷᵢ)	(yᵢ - ŷᵢ)·xᵢ
1	2	0.172	1.828	1.828
2	4	0.304	3.696	7.392
3	5	0.436	4.564	13.692
4	4	0.568	3.432	13.728
5	5	0.700	4.300	21.500
Sum:			17.820	58.140

∂J/∂β₀ = -17.820/5 = -3.564

∂J/∂β₁ = -58.140/5 = -11.628

β₀ = 0.04 + 0.03564 = 0.076

β₁ = 0.132 + 0.11628 = 0.248

Cost for these predictions: J = (1/10) × Σ(yᵢ - ŷᵢ)² = 68.10 / 10 = 6.81

Iteration 2 Result

β₀ = 0.076, β₁ = 0.248, Cost before the update = 6.81

🎉 The cost fell from 8.6 to 6.81 after a single update! The algorithm is working.

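📈 Progress over Many Iterations

(β values after the given number of updates with α = 0.01, and the cost J at those β values. Rows from 10 onward are rounded approximations.)

Iteration	β₀	β₁	Cost (J)	Status
0 (start)	0.000	0.000	8.600	Starting point (zeros)
1	0.040	0.132	6.810	Falling
2	0.076	0.248	5.418	Falling
10	≈0.25	≈0.80	≈1.20	Fast
50	≈0.46	≈1.08	≈0.51	Slowing down
100	≈0.61	≈1.04	≈0.47	Gradual
500	≈1.39	≈0.82	≈0.30	Getting closer
1000	≈1.85	≈0.70	≈0.25	Closer still
2000	≈2.14	≈0.62	≈0.24	Almost there

🎯 Final Result

With enough iterations the values settle at β₀ = 2.2, β₁ = 0.6, where the cost reaches its minimum, J = 0.24.

Best Fit Line: y = 2.2 + 0.6x

This is exactly the answer OLS gave!

Gradient Descent found the same best fit line through iteration.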
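The whole loop can be sketched as follows, assuming the same data, zero starting values and α = 0.01. The 20,000-iteration budget is an assumption picked so that this slow learning rate has time to settle.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

def cost(beta_0, beta_1):
    """J = (1/2n) Σ (y - ŷ)²."""
    return sum((y - (beta_0 + beta_1 * x)) ** 2 for x, y in zip(xs, ys)) / (2 * n)

beta_0, beta_1, alpha = 0.0, 0.0, 0.01
history = [cost(beta_0, beta_1)]           # cost before any update
for _ in range(20_000):
    errors = [y - (beta_0 + beta_1 * x) for x, y in zip(xs, ys)]
    grad_0 = -sum(errors) / n
    grad_1 = -sum(e * x for e, x in zip(errors, xs)) / n
    beta_0, beta_1 = beta_0 - alpha * grad_0, beta_1 - alpha * grad_1
    history.append(cost(beta_0, beta_1))

print(round(beta_0, 3), round(beta_1, 3))            # 2.2 0.6
print(round(history[0], 3), round(history[-1], 3))   # 8.6 0.24
```

Keeping `history` around makes it easy to plot the cost curve later.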

⏹️ When to Stop? (Convergence)

Rule	Condition	Example
1. Max Iterations	The set number of iterations is reached	Stop after 1000 iterations
2. Small Cost Change	|J_new - J_old| < threshold	< 0.0001
3. Gradient ≈ 0	The bottom of the valley is reached	Derivative ≈ 0
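One way the first two stopping rules might be combined is sketched below. The function name, α = 0.05, the threshold `tol` and the iteration cap are all assumed values for illustration.

```python
def gradient_descent(xs, ys, alpha=0.05, max_iters=100_000, tol=1e-9):
    """Run gradient descent until the cost stops changing or the cap is hit."""
    n = len(xs)
    beta_0 = beta_1 = 0.0
    prev_cost = float("inf")
    for i in range(1, max_iters + 1):          # rule 1: cap on iterations
        errors = [y - (beta_0 + beta_1 * x) for x, y in zip(xs, ys)]
        cost = sum(e * e for e in errors) / (2 * n)
        if abs(prev_cost - cost) < tol:        # rule 2: cost barely changes
            break
        prev_cost = cost
        beta_0 += alpha * sum(errors) / n      # same as β₀ - α × ∂J/∂β₀
        beta_1 += alpha * sum(e * x for e, x in zip(errors, xs)) / n
    return beta_0, beta_1, i

b0, b1, iters = gradient_descent([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(b0, 2), round(b1, 2), iters)
```

A looser `tol` stops earlier but leaves β₀ and β₁ further from the exact OLS values.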

7. How to Choose Alpha (Learning Rate)?

What is Alpha (α)?

Alpha = the step size.

It controls how much β₀ and β₁ change in each iteration.

It is the most important hyperparameter of Gradient Descent.

📊 3 Scenarios for Alpha

🐌 α too small (e.g. 0.0001):

✓ Safe, it does reach the minimum.
✗ Very slow - needs thousands of iterations.

✅ α about right (e.g. 0.01 to 0.1):

✓ Moves smoothly toward the minimum.
✓ Converges in relatively few iterations.

💥 α too large (e.g. above 1.0; the exact limit depends on the data, and for the unscaled study-hours data above it is only about 0.17):

✗ Jumps over the minimum and lands on the other side.
✗ The cost goes up instead of down (diverges).
✗ Never converges.

🧪 Practical Comparison

(Study-hours data, starting from β₀ = β₁ = 0; values are rounded.)

α	After 1000 Iterations	Result	Verdict
0.001	β₀≈0.6, β₁≈1.0	Still far from converged	❌ Too slow
0.01	β₀≈1.9, β₁≈0.7	Getting close, not there yet	⚠️ Works, but slow
0.1	β₀≈2.20, β₁≈0.60	Converged	✅ Best of these
0.5	Cost: 8.6 → 194.7 → … → ∞	Diverges!	❌ Fails
1.5	Cost: 8.6 → 2,253.9 → … → ∞	Fails immediately	❌ Fails badly

💡 Strategy for Choosing Alpha

1. Start from 0.01 (a common default)

2. Look at the cost plot - is it falling smoothly?

✓ If yes, it's fine

✗ If it oscillates, reduce α (by 10x)

✗ If it rises, α is too large

3. Try 0.001, 0.01 and 0.1 and pick the best of the three
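The "try a few values" strategy can be sketched as a small experiment on the study-hours data from Section 3. The helper `run`, the 1000-iteration budget and the `blow_up` threshold used to flag divergence are assumptions for this example.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

def run(alpha, iters=1000, blow_up=1e6):
    """Return (β₀, β₁) after `iters` updates, or None if the cost explodes."""
    beta_0 = beta_1 = 0.0
    for _ in range(iters):
        errors = [y - (beta_0 + beta_1 * x) for x, y in zip(xs, ys)]
        cost = sum(e * e for e in errors) / (2 * n)
        if cost > blow_up:                 # cost exploded: treat as diverged
            return None
        grad_0 = -sum(errors) / n
        grad_1 = -sum(e * x for e, x in zip(errors, xs)) / n
        beta_0, beta_1 = beta_0 - alpha * grad_0, beta_1 - alpha * grad_1
    return beta_0, beta_1

for alpha in (0.001, 0.01, 0.1, 0.5):
    result = run(alpha)
    label = "diverged" if result is None else tuple(round(b, 2) for b in result)
    print(alpha, label)
```

On this data the target is (2.2, 0.6), so the printed pairs show directly how far each learning rate gets within the same budget.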

💡 More Practical Tips

1. Feature Scaling matters:

If the values of X are large, standardize them first (x̄ = mean of X, σ = standard deviation of X):

x_scaled = (x - x̄) / σ
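A small sketch of this scaling step using only the standard library. The house sizes are made-up numbers, and `standardize` is a name chosen for this example.

```python
import statistics

def standardize(values):
    """Return (x - mean) / std for every value."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)        # population standard deviation
    return [(v - mean) / std for v in values]

sizes = [800, 1200, 1500, 2000, 2500]      # made-up house sizes in sqft
scaled = standardize(sizes)
print([round(v, 2) for v in scaled])
```

After scaling, the feature has mean 0 and standard deviation 1, which usually lets gradient descent use a larger α without diverging.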

2. Learning Rate Schedule:

Start with a larger α and make it smaller over time. This gives convergence that is both fast and accurate.
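A sketch of one possible schedule, inverse-time decay, on the study-hours data from Section 3. The notes do not name a specific schedule, so the formula α_t = α₀ / (1 + decay × t) and the values of `alpha_0`, `decay` and the iteration count are all assumptions.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

alpha_0, decay = 0.1, 0.001                # assumed starting rate and decay factor
beta_0 = beta_1 = 0.0
for t in range(5000):
    alpha = alpha_0 / (1 + decay * t)      # step size shrinks as t grows
    errors = [y - (beta_0 + beta_1 * x) for x, y in zip(xs, ys)]
    grad_0 = -sum(errors) / n
    grad_1 = -sum(e * x for e, x in zip(errors, xs)) / n
    beta_0, beta_1 = beta_0 - alpha * grad_0, beta_1 - alpha * grad_1

print(round(beta_0, 3), round(beta_1, 3))  # 2.2 0.6
```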

3. Plot the Cost:

Plot the cost at each iteration to see how it behaves. This tells you whether α is well chosen.

8. Multiple Linear Regression

What is Multiple Linear Regression?

In the real world a single feature is rarely enough.

There are many features (X₁, X₂, X₃...).

Using all of them together to predict Y is Multiple Linear Regression.

📐 Equation

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ... + βₙxₙ

🏠 Real Example: House Price

Price = β₀ + β₁(Size) + β₂(Bedrooms) + β₃(Age) + β₄(Location)

Coefficient	Feature	Meaning (other features held fixed)
β₀	Intercept	Base price
β₁	Size	Price change per extra 1 sqft
β₂	Bedrooms	Price change per extra bedroom
β₃	Age	Price drop per extra year of age
β₄	Location	Price change per 1-point rise in location score
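To make the equation concrete, here is a sketch with hypothetical coefficients. None of these numbers come from the notes or from a fitted model; they are chosen only to show how the terms add up.

```python
# Hypothetical coefficients, chosen only to illustrate the equation
beta_0 = 50_000
betas = {"size": 3_000, "bedrooms": 10_000, "age": -2_000, "location": 15_000}

def predict_price(features):
    """Price = β₀ + Σ βⱼ × xⱼ over all features."""
    return beta_0 + sum(betas[name] * value for name, value in features.items())

house = {"size": 1000, "bedrooms": 3, "age": 10, "location": 7}
print(predict_price(house))   # 3165000
```

The negative coefficient on `age` is what makes an older house cheaper when everything else is equal.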

🧮 Normal Equation (Closed-form)

β = (XᵀX)⁻¹ XᵀY

A formula that gives all the β values in a single step (X here includes a leading column of 1s for the intercept)
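A minimal NumPy sketch of the normal equation on the study-hours data from Section 3. It solves the linear system (XᵀX)β = XᵀY with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and not something the notes prescribe.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Design matrix: a column of 1s (for β₀) next to the feature column
X = np.column_stack([np.ones_like(x), x])

# Solve (XᵀX) β = XᵀY instead of computing the inverse
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta.round(3))   # [2.2 0.6]
```

With more features, the only change is adding more columns to `X`.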

9. Assumptions and Model Evaluation

📋 The 5 Assumptions of Linear Regression

#	Assumption	Meaning
1	Linearity	The relationship between X and Y should be linear
2	Independence	Observations should be independent of each other
3	Homoscedasticity	The variance of the errors should be constant
4	Normality	The errors should be normally distributed
5	No Multicollinearity	Features should not be highly correlated with each other

💡 Memory Trick - LIHNN

L - Linearity

I - Independence

H - Homoscedasticity

N - Normality

N - No Multicollinearity

📊 R-squared (R²) - How Good is the Model?

R² = 1 - (SS_res / SS_tot)

where:

SS_res = Σ(yᵢ - ŷᵢ)² → Residual sum of squares (the model's error)
SS_tot = Σ(yᵢ - ȳ)² → Total sum of squares (difference from the mean)
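As a worked check, R² can be computed for the line y = 2.2 + 0.6x on the study-hours data from Section 3. Variable names are assumptions for this sketch.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

y_mean = sum(ys) / len(ys)
ss_res = sum((y - (2.2 + 0.6 * x)) ** 2 for x, y in zip(xs, ys))  # model error
ss_tot = sum((y - y_mean) ** 2 for y in ys)                       # spread around the mean
r2 = 1 - ss_res / ss_tot

print(round(ss_res, 3), round(ss_tot, 3), round(r2, 3))   # 2.4 6.0 0.6
```

So the fitted line explains 60% of the variance in marks on this tiny dataset.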

Interpreting R² (a rough guide; what counts as good varies by field):

R² Value	Meaning	Quality
1.0	Perfect fit - explains 100% of the variance	🏆 Excellent
0.85	Explains 85% of the variance	✅ Very Good
0.60	Explains 60% of the variance	👍 Acceptable
0.30	Explains only 30%	⚠️ Poor
0	No better than always predicting the mean	❌ Useless

10. Regularization (Advanced)

Problem: Overfitting

→ The model does very well on the training data

→ But poorly on the test data

→ This is called Overfitting

Solution: Regularization

🎯 Ridge Regression (L2)

J = MSE + λ × Σβᵢ²

Shrinks the coefficients
But does not make them exactly 0
Keeps all features - but reduces their influence

🎯 Lasso Regression (L1)

J = MSE + λ × Σ|βᵢ|

Sets some coefficients to exactly 0
→ Automatic feature selection!
Removes useless features

🎯 Elastic Net

A combination of Ridge + Lasso. Both penalties are added.

Use it when there are many features and you don't know which ones matter.

💡 Which regularization, when?

Ridge → all features seem important; you just want smaller coefficients

Lasso → many features, some of which may be useless

Elastic Net → not sure - you want the benefits of both
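The different behaviour of the two penalties can be seen even with a single feature. The sketch below uses the study-hours data from Section 3 and the closed-form slopes that follow from minimising MSE + λβ₁² (Ridge) and MSE + λ|β₁| (Lasso). It assumes the intercept is left unpenalised, which is a common convention and not something the notes state; function names are illustrative.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))   # 6
s_xx = sum((x - x_mean) ** 2 for x in xs)                          # 10

def ridge_slope(lam):
    # Minimises MSE + lam * beta_1**2: the slope shrinks but stays non-zero
    return s_xy / (s_xx + n * lam)

def lasso_slope(lam):
    # Minimises MSE + lam * |beta_1| (soft-thresholding): can hit exactly 0
    shrunk = max(abs(s_xy) - n * lam / 2, 0.0)
    return (1 if s_xy >= 0 else -1) * shrunk / s_xx

for lam in (0.0, 1.0, 3.0):
    print(lam, round(ridge_slope(lam), 3), round(lasso_slope(lam), 3))
# 0.0 0.6 0.6
# 1.0 0.4 0.35
# 3.0 0.24 0.0
```

At λ = 3 the Ridge slope is small but still non-zero, while the Lasso slope is exactly 0, which is the feature-selection effect described above.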

11. Summary and Quick Revision

📋 Important Formulas Cheat Sheet

Concept	Formula
Simple Equation	y = β₀ + β₁x
Coefficient	β₁ = Σ(xᵢ-x̄)(yᵢ-ȳ) / Σ(xᵢ-x̄)²
Intercept	β₀ = ȳ - β₁x̄
MSE	(1/n) × Σ(yᵢ - ŷᵢ)²
Cost Function	J = (1/2n) × Σ(yᵢ - ŷᵢ)²
GD Update β₀	β₀ = β₀ - α × ∂J/∂β₀
GD Update β₁	β₁ = β₁ - α × ∂J/∂β₁
R²	1 - (SS_res / SS_tot)
Normal Equation	β = (XᵀX)⁻¹ XᵀY
Ridge	J = MSE + λΣβᵢ²
Lasso	J = MSE + λΣ|βᵢ|

🧠 10 Key Things to Remember

Linear Regression = finding a straight line (the best fit line).
β₀ = where it starts (Y-intercept). β₁ = how steep (slope).
MSE is the most popular error metric - squaring penalizes large errors more heavily.
OLS = the answer from a direct formula. Ideal for small data.
Gradient Descent = moving toward the minimum step by step through iteration.
Alpha (α) has to be right - too small is slow, too large diverges.
Start from α = 0.01 as a default.
Multiple Regression = using many features at once.
R² shows how good the model is. Closer to 1 is better.
If the model overfits, use Ridge / Lasso / Elastic Net.

Now it's time to practice:

→ Code it in Python (scikit-learn)

→ Try it on your own dataset

→ Take part in a Kaggle competition

Linear Regression Jupyter notebook


Implementation of Linear Regression

Step 1 Generate the Data

In [2]:

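import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

# Generate 10,000 synthetic samples with one feature and Gaussian noise
x, y = make_regression(n_features=1, noise=5, n_samples=10000, random_state=1)

# Scatter plot of the generated data
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()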

[Figure: scatter plot of the generated data, x on the horizontal axis and y on the vertical axis]

Step 2 Split the Data into Training and Testing Sets

In [3]:

from sklearn.model_selection import train_test_split
# Hold out 20% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [15]:

y_test

Out[15]:

array([ 102.1991915 ,   75.21963318, -137.04699982, ...,   16.7730244 ,
          2.1072226 ,    4.02596433], shape=(2000,))

Step 3 Train the model

In [4]:

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

Out[4]:

LinearRegression()
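A possible next step, which the notebook does not include, is to predict on the held-out test set and score the model. The sketch below rebuilds the same data, split and model so that it runs on its own; `mean_squared_error` and `r2_score` come from `sklearn.metrics`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Same data, split and model as in Steps 1-3
x, y = make_regression(n_features=1, noise=5, n_samples=10000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)

# Predict on unseen data and evaluate
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("intercept (β₀):", regressor.intercept_)
print("slope (β₁):", regressor.coef_[0])
print("test MSE:", mse)
print("test R²:", r2)
```

Because the data was generated with noise of standard deviation 5, a test MSE in the neighbourhood of 25 would be expected for a well-fitted line.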

