- Engineered a high-availability data streaming analytics platform in Python, maintaining 98% real-time uptime across 20 smart homes
- Built an energy dashboard visualizing Key Performance Indicators (KPIs) and metrics using SQL, MariaDB, and Jupyter Notebooks
- Applied SQL query optimization and software engineering best practices, leading to a 60% reduction in SQL query response times
- Performed in-depth Exploratory Data Analysis (EDA) to surface patterns and trends that supported data-driven decisions (KPI query sketch below)
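A minimal sketch of the kind of KPI query and notebook EDA behind the dashboard and query-tuning bullets above; the connection string, the meter_readings table, and its columns are hypothetical placeholders, not the actual schema.

```python
# Minimal sketch: pull a KPI series from MariaDB into pandas for notebook EDA.
# Connection string, table, and column names are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost:3306/energy")

# A narrow, time-bounded aggregate keeps the scan small (one of the query-tuning levers).
query = """
    SELECT home_id,
           DATE(reading_ts) AS day,
           AVG(power_kw)    AS avg_power_kw,
           MAX(power_kw)    AS peak_power_kw
    FROM meter_readings
    WHERE reading_ts >= NOW() - INTERVAL 30 DAY
    GROUP BY home_id, DATE(reading_ts)
"""

df = pd.read_sql(query, engine, parse_dates=["day"])

# Quick EDA: daily average load per home, plotted inline in a Jupyter notebook.
df.pivot(index="day", columns="home_id", values="avg_power_kw").plot(figsize=(10, 4))
```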
Data Engineer | Data Platform Engineer
MontAI | Jul 2022 - Dec 2022
- Established a 100 TB Data Lake on AWS by collecting, cleaning, and aggregating health, food, drug, biotech, and bioinformatics data (PySpark ETL sketch after this list)
- Developed ETL (Extract, Transform, Load) pipelines using AWS Glue, Redshift, S3, PySpark, Lambda, and SQS to process 100 TB
- Implemented test automation with test-driven development (TDD) and GitHub Actions workflows, improving existing code quality by 100%
- Performed data integration of 5 GB/day from disparate data sources, APIs, and file formats (XML, CSV, JSON, Parquet, Avro, ORC)
- Translated business Objectives and Key Results (OKRs) into technical specifications, design documentation, and solution designs
- Engineered Python-based web scrapers to collect, parse, and integrate 10 GB of data from APIs, boosting data collection efficiency
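A minimal PySpark sketch of the S3-to-Parquet ETL step referenced in this role; the bucket names, paths, and drug_events schema are hypothetical placeholders.

```python
# Minimal PySpark sketch of the S3 -> transform -> Parquet step of an ETL pipeline.
# Bucket names, paths, and schema fields are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("health-data-etl").getOrCreate()

# Ingest semi-structured source files (CSV shown; JSON/Parquet/Avro/ORC read similarly).
raw = (spark.read
       .option("header", "true")
       .csv("s3://raw-health-data/drug_events/"))

# Light cleaning: drop duplicates, cast types, filter out records without a key.
clean = (raw.dropDuplicates(["event_id"])
            .withColumn("event_date", F.to_date("event_date", "yyyy-MM-dd"))
            .filter(F.col("event_id").isNotNull()))

# Write partitioned Parquet back to the lake for downstream Glue/Redshift queries.
(clean.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("s3://curated-health-data/drug_events/"))
```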
Senior Big Data Engineer
Legato | Jun 2018 - Aug 2021
- Constructed scalable ETL/ELT data pipelines and led their delivery for 5 US healthcare initiatives across Anthem and Blue Cross Blue Shield
- Optimized large-scale data transformations in AWS Cloud using EMR, EC2, Athena, CloudWatch, Step Functions, cutting costs by 30%
- Automated a data quality framework in Spark (Scala) for Hive and SQL Server, cutting data errors and saving $7,000 per quarter
- Built extensible data architecture with 10 Snowflake data marts and data models, improving Business Intelligence (BI) metrics by 20%
- Mentored and led a team of 4 new graduates in a training program focused on Big Data and DevOps tools (Jira, Bitbucket, Bamboo)
- Orchestrated code migration, continuous integration, and continuous deployment (CI/CD), reducing deployment time by 25%
- Unified Agile Scrum with unit testing, system integration testing, and code reviews, slashing post-deployment defects by 18%
- Enabled the migration of 112 TB of data from on-premises Hadoop clusters to AWS Cloud and Snowflake
- Performed massive parallel processing (MPP) and large-scale data transformations on cloud and on-premises data warehouses
- Built large-scale Airflow data ingestion applications and data structures for Neo4j (graph) and MongoDB (NoSQL OLTP) databases
- Integrated 2 ML pipelines with Continuous Improvement/Learning and automated deployments using Apache Airflow and Docker
- Created robust, high-volume data infrastructure on on-premises Hadoop clusters capable of handling 15 TB
- Designed interactive Tableau Dashboards to derive meaningful insights, leading to a 12% increase in pharmaceutical finance sales
- Developed 10 end-to-end high-volume, multi-layer data processing pipelines from ingestion layers to the serving and reporting layers
- Ingested 26 TB from relational databases (MySQL, Oracle, PostgreSQL) via Sqoop and shell scripts, enhancing data access
- Improved enterprise data warehouse (EDW) scalability by 20% using fact and dimension data modeling (Star Schema, SCD)
- Established DataOps practices covering version control with Git, data cleaning, data management, governance, and dataset lineage tracking
- Maintained data integrity and compliance with clean datasets across 4 projects, contributing to a 15% improvement in accuracy and reliability
- Collaborated with cross-functional teams, managed stakeholders, and gathered requirements for streamlined project communication
- Implemented join optimization, data partitioning, and performance tuning, achieving a 40% improvement in Apache Spark job performance (tuning sketch below)
- Redesigned the Data Lake to use Parquet with Snappy compression, cutting storage and compute costs by 30%
- Performed partitioning, bucketing, join optimizations, and compression in Hive
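A minimal PySpark illustration of the join-optimization and partitioning pattern referenced above (the production framework was written in Spark Scala); the table locations and column names are hypothetical.

```python
# Minimal PySpark sketch of the broadcast-join + partitioned Parquet tuning pattern.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims-tuning").getOrCreate()

claims = spark.read.parquet("s3://edw-curated/claims/")        # large fact table
providers = spark.read.parquet("s3://edw-curated/providers/")  # small dimension table

# Broadcast the small dimension to avoid shuffling the large side of the join.
joined = claims.join(F.broadcast(providers), on="provider_id", how="left")

# Repartition on the write key and store as Snappy-compressed Parquet,
# partitioned by service month, so downstream readers can prune partitions.
(joined.repartition("service_month")
       .write.mode("overwrite")
       .option("compression", "snappy")
       .partitionBy("service_month")
       .parquet("s3://edw-curated/claims_enriched/"))
```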
- Developed a Central Data Repository for MIT, Manipal: a web application for data entry and collection, data analysis, and dynamic report generation tailored to each user's custom report format
- Loaded data from source databases into a Hadoop cluster using Sqoop for analysis
- Generated reports by querying Hive and displayed them in the web application (sketch below)
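A minimal sketch of the Hive report-query path described above, assuming a PyHive connection; the host, database, and faculty_records table are hypothetical placeholders, and the real reports were user-configurable rather than hard-coded.

```python
# Minimal sketch: query Hive and hand the result to the web/reporting layer.
# Host, database, table, and column names are hypothetical placeholders.
from pyhive import hive
import pandas as pd

conn = hive.Connection(host="hadoop-master", port=10000, database="central_repo")

report_df = pd.read_sql(
    """
    SELECT department, academic_year, COUNT(*) AS submissions
    FROM faculty_records
    GROUP BY department, academic_year
    """,
    conn,
)

# The web application would render this frame into the user's chosen report format.
print(report_df.to_string(index=False))
```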
Software Development Intern
CGI | May 2015 - Jul 2015
- Developed a Project Management web application that enabled users across departments to collaborate on their respective projects and access project functions at scale