Data Engineer with 2+ years of experience building scalable ETL/ELT pipelines using Python, SQL, and Spark across AWS and on-prem environments. Solid grounding in computer science fundamentals, including data structures, algorithms, and distributed systems, applied to batch processing, data quality, and fault-tolerant pipeline design. Experienced in large-scale Spark ETL systems for the FinTech and Automotive domains, with a focus on performance optimization and reliability.
Nomura Capital: Engineered and maintained large-scale Spark ETL pipelines using Spark SQL and complex analytical queries over capital markets datasets (trade, reference, risk, and valuation feeds), relying on multi-way joins and window functions to support downstream reporting and risk analytics.
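A minimal PySpark sketch of the kind of multi-way join and window query described above; the table and column names (trades, ref_instruments, valuations, trade_id, etc.) and the output path are hypothetical placeholders, assuming the source tables are registered in the catalog.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("capital-markets-etl").getOrCreate()

# Join trades with reference data and valuations, keeping the latest valuation
# per trade via a window function (illustrative schema, not the actual feeds).
enriched = spark.sql("""
    SELECT t.trade_id,
           t.notional,
           r.instrument_type,
           v.mark_value,
           ROW_NUMBER() OVER (PARTITION BY t.trade_id ORDER BY v.as_of_date DESC) AS rn
    FROM trades t
    JOIN ref_instruments r ON t.instrument_id = r.instrument_id
    JOIN valuations v      ON t.trade_id      = v.trade_id
""").where("rn = 1")

enriched.write.mode("overwrite").parquet("s3a://example-bucket/curated/trades_enriched/")
```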
Analyzed Spark execution metrics and cut job execution time by 30% through broadcast joins, partition pruning, predicate pushdown, and optimized star-schema joins between fact and dimension tables.
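A sketch of two of the tuning patterns above: broadcasting a small dimension table and filtering on the partition column so Spark can prune partitions and push the predicate down to the Parquet scan. Paths, table layout, and the partition column (trade_date) are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

fact = spark.read.parquet("s3a://example-bucket/facts/trades/")      # large fact table, partitioned by trade_date
dim = spark.read.parquet("s3a://example-bucket/dims/instruments/")   # small dimension table

result = (
    fact
    .where(F.col("trade_date") == "2024-01-31")   # partition filter -> pruning + predicate pushdown
    .join(F.broadcast(dim), "instrument_id")      # broadcast join avoids shuffling the fact table
)
```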
Orchestrated Spark workflows using AutoSys, managing dependency chains, reruns, and recovery logic, improving batch completion reliability and reducing manual intervention by 25%.
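A hedged sketch of a rerun-safe entrypoint that an AutoSys job command might invoke: a success marker turns reruns into no-ops, and a non-zero exit code lets AutoSys retries and downstream success conditions behave correctly. The marker path, script name, and business date are placeholders, not the actual job definitions.

```python
import pathlib
import subprocess
import sys

MARKER = pathlib.Path("/data/markers/trades_etl_2024-01-31.done")   # placeholder marker path

def main() -> int:
    if MARKER.exists():
        print("already completed for this business date; skipping rerun")
        return 0
    result = subprocess.run(["spark-submit", "trades_etl.py", "--date", "2024-01-31"])
    if result.returncode != 0:
        return result.returncode   # non-zero exit -> AutoSys marks FAILURE and can rerun/alert
    MARKER.touch()                 # record success so a manual or scheduled rerun is a no-op
    return 0

if __name__ == "__main__":
    sys.exit(main())
```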
Migrated Spark workloads from YARN to Kubernetes and storage from HDFS to MinIO (S3-compatible object storage), resolving Spark connector and storage compatibility issues along the way.
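A sketch of the kind of configuration involved in running Spark on Kubernetes against MinIO through the S3A connector; the Kubernetes endpoint, container image, credentials, and bucket are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("etl-on-k8s")
    .master("k8s://https://kubernetes.default.svc:443")
    .config("spark.kubernetes.container.image", "registry.example.com/spark:3.5.0")
    # Point the S3A connector at MinIO instead of AWS S3.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.internal:9000")
    .config("spark.hadoop.fs.s3a.access.key", "MINIO_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "MINIO_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")   # MinIO uses path-style addressing
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

df = spark.read.parquet("s3a://example-bucket/raw/trades/")
```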
Nissan: Architected daily batch pipelines using AWS Lambda and Step Functions, and built a Streamlit interface that lets business users trigger ad-hoc file ingestion with idempotent re-runs.
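A minimal sketch of that ad-hoc ingestion path: a Streamlit upload form that drops the file in S3 and starts the Step Functions state machine. The bucket, state machine ARN, and key layout are assumptions, not the production setup.

```python
import json
import uuid

import boto3
import streamlit as st

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

BUCKET = "example-ingestion-bucket"                                                    # placeholder
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:daily-batch"  # placeholder

uploaded = st.file_uploader("Upload a file for ad-hoc ingestion")
if uploaded is not None and st.button("Ingest"):
    key = f"adhoc/{uploaded.name}"
    s3.upload_fileobj(uploaded, BUCKET, key)
    # A deterministic execution name per file key keeps re-runs idempotent: repeating the
    # same start request reuses the existing execution instead of processing the file twice.
    exec_name = f"adhoc-{uuid.uuid5(uuid.NAMESPACE_URL, key)}"
    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        name=exec_name,
        input=json.dumps({"bucket": BUCKET, "key": key}),
    )
    st.success(f"Started ingestion for s3://{BUCKET}/{key}")
```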
Implemented schema validation, data quality checks, and incremental batch processing with idempotent re-runs, preparing curated datasets for ingestion into the Snowflake data warehouse and reducing manual effort by 60%.
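An illustrative sketch of that validation and quality layer: column, null, and duplicate checks run on an incoming batch before it is staged for Snowflake. The expected columns and rules are hypothetical, not the actual curation logic.

```python
import pandas as pd

REQUIRED_COLUMNS = {"vin", "dealer_id", "sale_date", "amount"}   # placeholder schema

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality failures; an empty list means the batch can be staged."""
    errors = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
        return errors
    if df["vin"].isna().any():
        errors.append("null VINs present")
    if df["vin"].duplicated().any():
        errors.append("duplicate VINs in batch (breaks idempotent upsert)")
    if (df["amount"] < 0).any():
        errors.append("negative amounts")
    return errors
```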
Improved pipeline reliability through automatic retries and comprehensive CloudWatch monitoring and alerting, with automated notifications for pipeline failures, stale data, and quality issues, reducing MTTR by 30%.
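A hedged sketch of one such monitoring hook: the pipeline emits a custom data-freshness metric and a CloudWatch alarm notifies an SNS topic when it breaches. The namespace, metric name, topic ARN, and thresholds are illustrative assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_freshness(lag_minutes: float) -> None:
    # Emitted at the end of each run; the alarm below fires if data goes stale.
    cloudwatch.put_metric_data(
        Namespace="Pipelines/DailyBatch",
        MetricData=[{"MetricName": "DataFreshnessMinutes", "Value": lag_minutes, "Unit": "None"}],
    )

cloudwatch.put_metric_alarm(
    AlarmName="daily-batch-data-freshness",
    Namespace="Pipelines/DailyBatch",
    MetricName="DataFreshnessMinutes",
    Statistic="Maximum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=120,                     # alert if data is more than 2 hours stale
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:pipeline-alerts"],   # placeholder topic
    TreatMissingData="breaching",      # a missing metric (run never happened) also alerts
)
```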
Implemented an ensemble-based attrition prediction model, achieving 86% accuracy on validation data.
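A minimal scikit-learn sketch of an ensemble attrition classifier; the dataset path, features, and exact ensemble composition are assumptions rather than the original implementation.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("attrition.csv")                              # placeholder dataset
X, y = df.drop(columns=["attrited"]), df["attrited"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Soft-voting ensemble: average the predicted probabilities of three base models.
model = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
model.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```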
Designed a lightweight FastAPI service to expose the model for real-time inference and testing.
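A sketch of such a lightweight FastAPI inference endpoint; the model artifact path and request fields are hypothetical and would mirror the real training features.

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("attrition_model.joblib")   # placeholder serialized model

class Employee(BaseModel):
    tenure_months: int
    salary: float
    num_projects: int

@app.post("/predict")
def predict(emp: Employee) -> dict:
    # Return the predicted attrition probability for a single employee record.
    proba = model.predict_proba([[emp.tenure_months, emp.salary, emp.num_projects]])[0][1]
    return {"attrition_probability": float(proba)}
```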