What is AWS Glue? 🤔
📅 Apr 13, 2026
💡 In one line: AWS's automatic worker that picks up messy data from scattered places, cleans it up, and puts it where it's actually useful.
🤔 Section 1 — What is AWS Glue?
AWS Glue is a fully managed, serverless ETL service. It pulls data from different data sources, processes it, and gets it ready for analytics.
- ☁️ Serverless — no servers to manage; AWS auto-scales
- 🔄 ETL — Extract → Transform → Load
- 🛠️ Managed — AWS manages all the infrastructure
- ⚡ PySpark based — runs on the Apache Spark engine
- 🎨 Visual Editor — build ETL jobs by drag-and-drop
📥 Extract (pull the data) → ⚙️ Transform (reshape the data) → 📤 Load (store the data)
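To make the three stages concrete, here is a minimal sketch of a Glue PySpark job; the database, table, and bucket names (my_db, sales, my-bucket) are hypothetical. Glue Studio generates boilerplate very much like this.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: Glue passes --JOB_NAME at runtime
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# 📥 Extract — read a table registered in the Data Catalog (names are placeholders)
raw = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="sales")

# ⚙️ Transform — e.g. drop rows with a missing amount
clean = raw.filter(f=lambda row: row["amount"] is not None)

# 📤 Load — write the result back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=clean, connection_type="s3",
    connection_options={"path": "s3://my-bucket/clean/"}, format="parquet")

job.commit()
```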
📚 Section 2 — Data Catalog
The Data Catalog is like a library catalog: it doesn't store the actual data, only the metadata.
| Information | Example |
|---|---|
| 📍 Where is the data? | s3://my-bucket/sales/ |
| 📁 What format is it? | Parquet / CSV / JSON |
| 🗂️ What columns does it have? | id: int, amount: double, date: string |
| 📅 When was it last updated? | 2024-01-15 09:30 AM |
| 🔀 Any partitions? | year=2024, month=01 |
💡 Build the catalog once, and Athena, Redshift Spectrum, and EMR all use it automatically!
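Because the Catalog holds only metadata, you can inspect it with plain boto3. A minimal sketch, assuming a hypothetical database my_db and table sales_table:

```python
import boto3

glue = boto3.client("glue")

# Fetch the metadata the Catalog holds for one table (no actual data is read)
table = glue.get_table(DatabaseName="my_db", Name="sales_table")["Table"]

print(table["StorageDescriptor"]["Location"])      # e.g. s3://my-bucket/sales/
for col in table["StorageDescriptor"]["Columns"]:  # schema: column name + type
    print(col["Name"], col["Type"])
```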
🕷️ Section 3 — Crawler
A Crawler is like a robot: it scans a data source and automatically creates a table in the Data Catalog.
📦 Data Source (S3 / RDS / JDBC) → 🕷️ Crawler (scan + detect) → 📚 Data Catalog (table created ✓)
| Setting | Description |
|---|---|
| 📌 Data Source | S3 / JDBC / DynamoDB |
| 🔑 IAM Role | Needs read permission on the source |
| ⏰ Schedule | On-demand or a cron expression |
| 📤 Output Database | Which Catalog database to write the table to |
# Crawler output example
Scanned s3://bucket/sales/2024/01/
→ Format: CSV
→ Columns: id(int), name(str), amount(float)
→ Partitions: year=2024, month=01
✓ 'sales_table' successfully created!
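A crawler can also be created from code. A hedged boto3 sketch, where the crawler name, IAM role, bucket path, and schedule are all assumptions:

```python
import boto3

glue = boto3.client("glue")

# Define the crawler: what to scan, which IAM role, where to write tables
glue.create_crawler(
    Name="sales_crawler",
    Role="GlueCrawlerRole",                      # needs read access to the bucket
    DatabaseName="my_db",                        # Catalog database for new tables
    Targets={"S3Targets": [{"Path": "s3://bucket/sales/"}]},
    Schedule="cron(0 2 * * ? *)",                # every night at 2 AM UTC
)

glue.start_crawler(Name="sales_crawler")         # or run it on demand
```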
⚙️ Section 4 — ETL Job
ETL Job = Extract → Transform → Load. A PySpark or Python script cleans, filters, joins, and aggregates the data.
| Operation | What it does | Example |
|---|---|---|
| 🔍 Filter | Keep or drop rows | amount > 0 |
| 🔗 Join | Combine tables | sales + users |
| ✏️ Rename | Rename a column | amt → amount |
| 🔄 Cast | Change a data type | string → integer |
| 📊 Aggregate | Group and sum | SUM(sales) by region |
# Load data from the Catalog
df = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="sales")
# Drop null / non-positive amounts (DynamicFrame.filter takes a predicate function)
df_clean = df.filter(f=lambda row: row["amount"] is not None and row["amount"] > 0)
# Load into Redshift
glueContext.write_dynamic_frame.from_options(frame=df_clean, ...)
💡 DynamicFrame = it won't crash even when schemas don't match!
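To see what that means in practice: if amount arrives as string in some files and double in others, resolveChoice settles the conflict explicitly instead of letting the job fail. A sketch continuing from the code above:

```python
# 'df' as loaded above; some files may have 'amount' as string, others as double.
# resolveChoice forces one concrete type instead of crashing at read time.
df_fixed = df.resolveChoice(specs=[("amount", "cast:double")])

# Once resolved, it can be converted to a regular Spark DataFrame if needed
spark_df = df_fixed.toDF()
```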
🔔 Section 5 — Triggers and Workflows
| Type | Description | Use Case |
|---|---|---|
| ⏰ Scheduled | Cron expression | Every night at 2 AM |
| ⚡ On-Demand | Manual trigger | From the API / Console |
| 🔗 Conditional | Job A → Job B chain | Job chaining |
| 📡 EventBridge | Fires automatically on an event | When a file lands in S3 |
🔄 E-commerce Daily Pipeline
⏰ 2 AM nightly → 🕷️ Crawler → ⚙️ ETL Job 1 → ⚙️ ETL Job 2 → ✅ Done!
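The scheduled trigger at the start of this pipeline could be created with boto3 roughly like this; the trigger and job names are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# A scheduled trigger that starts the first ETL job every night at 2 AM UTC
glue.create_trigger(
    Name="nightly_trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "etl_job_1"}],   # assumed job name
    StartOnCreation=True,                 # activate the trigger immediately
)
```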
🗺️ Section 6 — Complete Architecture
① Data Sources: 📦 S3 · 🗄️ RDS · ⚡ DynamoDB · 🔌 JDBC
⬇
② Crawler → Data Catalog
⬇
③ ETL Job: 📥 Extract → ⚙️ Transform (PySpark) → 📤 Load
⬇
④ Output Targets: 📊 Redshift · 📦 S3 (Parquet) · 🔍 Athena
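Step ④ in code: a sketch of writing partitioned Parquet to S3 so Athena and Redshift Spectrum can query it efficiently. The output path and partition keys are assumptions.

```python
# Write the transformed DynamicFrame to S3 as Parquet, partitioned so
# Athena / Redshift Spectrum can prune scans by year and month
glueContext.write_dynamic_frame.from_options(
    frame=df_clean,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/output/",      # assumed output location
        "partitionKeys": ["year", "month"],
    },
    format="parquet",
)
```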
🧠 Memory Shortcut
🕷️ Crawler → 📚 Catalog → ⚙️ Job → 🎯 Target
💰 Section 7 — Pricing
| Component | Rate | Minimum |
|---|---|---|
| ⚙️ ETL Job | ~$0.44 / DPU-hour | 1 min (Glue 2.0+) |
| 🕷️ Crawler | ~$0.44 / DPU-hour | 10 min |
| 📚 Data Catalog | First 1 million objects | FREE ✓ |
| 🎨 Glue Studio | Visual editor | FREE ✓ |
📦 DPU = 4 vCPU + 16 GB RAM | 💡 Use Parquet • Enable Job Bookmarks
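A rough worked example with the rate above (actual rates vary by region and Glue version, so treat the numbers as estimates):

```python
# A 10-DPU job running for 30 minutes:
dpus = 10
hours = 0.5
rate = 0.44                      # USD per DPU-hour (approximate)
cost = dpus * hours * rate
print(f"${cost:.2f}")            # → $2.20
```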
🔑 Section 8 — Key Concepts
| Concept | What is it? |
|---|---|
| DynamicFrame | Handles schema mismatches without crashing |
| DPU | 4 vCPU + 16 GB RAM; the billing unit |
| Job Bookmark | Avoids processing the same data twice |
| Connection | Connection config for JDBC, S3, Kafka |
| Dev Endpoint | Debug jobs from an interactive notebook |
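Job Bookmarks in practice: enable them with the job parameter --job-bookmark-option job-bookmark-enable, then give each source read a transformation_ctx so Glue knows what it has already seen. A sketch with assumed names:

```python
# The transformation_ctx is the key the bookmark state is stored under;
# on the next run, Glue skips data this read has already processed.
df = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="sales",    # assumed names
    transformation_ctx="read_sales")

# ... transform and load as usual ...

job.commit()   # persists the bookmark; without this, nothing is remembered
```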
⚖️ Section 9 — Glue vs EMR
| Feature | ☁️ AWS Glue | 🖥️ EMR |
|---|---|---|
| Servers | None (serverless) | You manage the cluster |
| Setup | Minutes | Hours |
| Cost | Pay-as-you-go | Per hour (cluster stays on) |
| Best For | ETL pipelines | Complex big data workloads |
| Visual Tool | ✅ Glue Studio | ❌ Manual |
🎓 Section 10 — Interview Q&A
❓ What's the difference between Glue and EMR? ✅ Glue is serverless, so there are no servers to manage. With EMR you manage the cluster yourself.
❓ When should a Crawler run? ✅ When new data arrives, the schema changes, or a new partition is added.
❓ Why use Parquet? ✅ It's a columnar format: faster queries, lower storage cost.
❓ DynamicFrame vs DataFrame? ✅ A DynamicFrame handles schema mismatches without crashing.
❓ What is a Job Bookmark? ✅ It tracks which data has already been processed, so nothing is processed twice.