Hi, my name is

Thomas.

I build data pipelines

A Data Engineer passionate about Artificial Intelligence and Machine Learning Operations. I love building scalable data warehouses, extensible data models, insightful dashboards, and orchestrating Machine Learning pipelines.

About Me

I am a seasoned engineer with over 6 years of experience in the design and deployment of analytical solutions and Machine Learning models.

I have had the privilege of working with Forbes 33 and Fortune 500 organizations and contributing to Amazon and Google Cloud partners across the U.S., the Middle East, and India.

Here are a few technologies I've been working with recently:
  • Python
  • SQL
  • Scala
  • Bash Shell Scripting
  • MLOps
  • Apache Spark/PySpark
  • Extract Transform Load (ETL)
  • Airflow
  • Snowflake
  • Amazon Web Services (AWS)
  • Google Cloud Platform (GCP)
  • Streamlit
  • Vertex AI
  • dbt
  • Tableau
  • Looker
  • Redshift
  • BigQuery
  • Microsoft SQL Server
  • PostgreSQL
  • MariaDB
  • MySQL
  • DynamoDB
  • MongoDB
  • Neo4j
  • Hadoop
  • Kafka
  • Unix
  • DevOps
  • Docker
  • CI/CD
  • Agile
  • Maven
  • GitOps
  • GitHub Actions
  • Atlassian Bamboo
  • Bitbucket
  • Jira
  • Jupyter Notebooks
  • MLFlow
  • Flask
  • API
  • DVC
  • A/B Testing
  • Big Data
  • Data Warehousing
  • Data Modeling
  • Statistics

Experience

Data Engineer/Data Analyst - Northeastern University
Feb 2023 - Sep 2023
  • Engineered a high-availability data streaming analytics platform in Python, ensuring 98% real-time uptime for 20 smart homes
  • Built an energy dashboard visualizing Key Performance Indicators (KPIs) and metrics using SQL, MariaDB, and Jupyter Notebooks
  • Applied SQL query optimization and software engineering best practices, leading to a 60% reduction in SQL query response times
  • Performed in-depth Exploratory Data Analysis (EDA) to unveil patterns and trends to support data-driven decision-making processes
Data Engineer - MontAI
Jul 2022 - Dec 2022
  • Established a 100 TB Data Lake on AWS, collecting, cleaning, and aggregating health, food, drug, biotech, and bioinformatics data
  • Developed ETL (Extract, Transform, Load) pipelines using AWS Glue, Redshift, S3, PySpark, Lambda, and SQS to process 100 TB
  • Implemented test automation with test-driven development (TDD) and GitHub Actions workflows, improving existing code quality by 100%
  • Performed data integration of 5 GB/day from disparate data sources, APIs, and file formats (XML, CSV, JSON, Parquet, Avro, ORC)
  • Translated business Objectives and Key Results (OKRs) into technical specifications, design documentation, and optimal solutions
  • Engineered Python-based web scrapers to collect, integrate, and parse 10 GB of data from APIs, boosting data collection efficiency
Senior Big Data Engineer - Legato
Jun 2018 - Aug 2021
  • Built and led the development of scalable ETL/ELT data pipelines for 5 US healthcare initiatives across Anthem and Blue Cross Blue Shield
  • Optimized large-scale data transformations in AWS Cloud using EMR, EC2, Athena, CloudWatch, Step Functions, cutting costs by 30%
  • Automated a data quality framework in Spark Scala for Hive and SQL Server, cutting errors and saving $7,000 quarterly
  • Built extensible data architecture with 10 Snowflake data marts and data models, improving Business Intelligence (BI) metrics by 20%
  • Mentored and led a team of 4 new graduates in a training program focused on Big Data and DevOps tools (Jira, Bitbucket, Bamboo)
  • Orchestrated code migration, continuous integration, and continuous deployment (CI/CD), reducing deployment time by 25%
  • Unified Agile Scrum with unit testing, system integration testing, and code reviews, slashing post-deployment defects by 18%
  • Enabled the migration of 112 TB of data from on-premises distributed computing Hadoop clusters to AWS Cloud and Snowflake
  • Performed massive parallel processing (MPP) and large-scale data transformations on cloud and on-premises data warehouses
  • Built large-scale applications and data structures for data ingestion from Neo4j, MongoDB (Graph, NoSQL OLTP) databases in Airflow
  • Integrated 2 ML pipelines with Continuous Improvement/Learning and automated deployments using Apache Airflow and Docker
Software Engineer - Hadoop Developer & Big Data Engineer - Middle East Management Consultancy
Jun 2016 - May 2018
  • Created high-volume robust data infrastructure on distributed computing on-premises Hadoop clusters capable of handling 15 TB
  • Designed interactive Tableau Dashboards to derive meaningful insights, leading to a 12% increase in pharmaceutical finance sales
  • Developed 10 end-to-end high-volume, multi-layer data processing pipelines from ingestion layers to the serving and reporting layers
  • Ingested 26 TB from relational databases (MySQL, Oracle, PostgreSQL) via Sqoop and Shell scripts, enhancing data access
  • Improved enterprise data warehouse (EDW) scalability by 20% using fact and dimension data modeling (Star Schema, SCD)
  • Established DataOps for version control with git, data cleaning, data management, governance, and data lineage tracking for datasets
  • Maintained integrity and compliance with clean data sets for 4 projects, contributing to a 15% improvement in accuracy and reliability
  • Collaborated with cross-functional teams, managed stakeholders, and gathered requirements for streamlined project communication
  • Implemented join optimization, data partitioning, and performance tuning, achieving a 40% improvement in Apache Spark job performance
  • Redesigned Data Lake to use Parquet and Snappy compression to cut 30% storage and compute costs
  • Performed partitioning, bucketing, join optimization, and compression in Hive
Practice School Student/Researcher - Manipal Institute of Technology
Jan 2016 - May 2016
  • Developed a Central Data Repository for MIT, Manipal: a web application for data entry, data collection, and analysis, generating reports dynamically according to users' custom report formats
  • Loaded data from the databases using Sqoop and analyzed it on a Hadoop cluster
  • Generated reports by querying with Hive and displayed them in the web application
Software Development Intern - CGI
May 2015 - Jul 2015
  • Developed a Project Management web application enabling users across different departments to interact with their respective projects and access their functions at scale

Education

2021 - 2023
Master of Science in Data Analytics Engineering
Northeastern University, Boston
GPA: 3.9 out of 4.0
2012 - 2016
Bachelor of Technology in Computer Science and Engineering
Manipal Institute of Technology, Manipal

Projects

E-commerce Customer Segmentation
Machine Learning MLOps Python
Developed and deployed real-time clustering Machine Learning models for e-commerce customers on GCP using Python, Vertex AI, GCS, Flask, Airflow, Docker, MLflow, and TensorFlow
YouTube Dashboard
Streamlit Python API
Real-time analytics dashboard generated for any input YouTube video, with a demo. Presents sentiment analysis and 5 KPIs that can be used to drive up ad revenue.
Age of Plastic
Tableau SQL Data Visualization
Data-driven dashboards showing the impact of global plastic pollution on the environment, on land and in the ocean, and the mitigation steps taken by different countries, built using Tableau.
Appliance Energy Prediction
Machine Learning Python
Predicted the energy consumed by appliances using Machine Learning models (PCA, neural networks, Lasso, Ridge, and linear regression) coded from scratch in Python, with 80% confidence.
Clustering Paris and London
Machine Learning Python ArcGIS API
Geospatial analysis clustering similar neighborhoods in Paris and London using ArcGIS and Folium to reveal new insights with Machine Learning
FBI Crime Reports
Machine Learning Tableau Data Visualization
Forecasted and visualized FBI-reported uniform major crimes, focusing on aggravated assault, homicide, intimidation, and motor vehicle theft in every US state, on a Tableau dashboard, with 90% forecasting accuracy
Kafka Tweet Stream
API Kafka Elasticsearch
Streamed and ingested real-time tweets on current affairs from a performance-tuned Kafka 2.0.0 producer and consumer into Elasticsearch using safe, idempotent, and compression-enabled configurations
Movie Analytics
Data Engineering SQL Scala
Analyzed a million movies using Spark Scala to draw useful insights on viewer engagement
Diabetic Readmission Exploration
Data Engineering Data Analysis Python
Explored and drew meaningful insights on patients readmitted with diabetes, summarized in a report
Analysis And Visualizations Of Nursing Home Data
Data Visualization Data Analysis Tableau
Computed and visualized a data-driven story of the Centers for Medicare & Medicaid Services (CMS) nursing facility data, generating visuals that highlight nursing homes' resource limits using Flourish, Datawrapper, and Tableau, hosted on Google Sites.
Investigating GDP Expenditure
Data Visualization Data Analysis
Visualized expenditure trends in sectors such as Education, Pharmaceuticals, Military, Infrastructure, and Research and Development across different countries for the years 1960 - 2020 using Flourish and Datawrapper, hosted on Google Sites.
Forecasting Healthcare costs
Machine Learning Regression Python
Predicted healthcare and insurance costs using Python and a linear regression Machine Learning model with 80% accuracy
Predicting hits on Spotify
Machine Learning Classification Python
Analyzed over 40,000 songs from 6 different decades to predict hit songs on Spotify using various classification Machine Learning models in Python.
Olympic History Analytics
Data Analysis Data Visualization R
Discovering & visualizing various trends in 120 years of Olympic history using R
Social Media Analysis
Data Analysis Data Visualization R
Analyzed data from 1.6 million Twitter users to draw useful insights and explore interesting patterns, using text mining, sentiment analysis, probability, time series analysis, and hierarchical clustering on text in R

Get in Touch

My inbox is always open. Whether you have a question or just want to say hi, I’ll try my best to get back to you!