Portfolio of Thomas George Thomas

Hi, my name is

Thomas.

I provide enterprise data solutions

A Data Engineer passionate about Artificial Intelligence and Machine Learning. I love building scalable data warehouses, extensible data models, and insightful dashboards, and orchestrating Machine Learning pipelines.

About Me

I am a seasoned engineer with over 8 years of experience in the design and deployment of data engineering, analytical solutions, and machine learning models across the U.S., the Middle East, and India. Here are a few technologies I've been working with recently:

Python
SQL
Amazon Web Services (AWS)
dbt
Dagster
MLOps
Scala
Bash Shell Scripting
Apache Spark/PySpark
Extract Transform Load (ETL)
Airflow
Snowflake
Streamlit
Tableau
Redshift
Microsoft SQL Server
PostgreSQL
MariaDB
MySQL
DynamoDB
MongoDB
Neo4j
Hadoop
Kafka
Unix
DevOps
Docker
CI/CD
Agile
Maven
GitOps
GitHub Actions
Atlassian Bamboo
Bitbucket
Jira
Jupyter Notebooks
MLflow
Flask
API
DVC
Big Data
Data Warehousing
Data Modeling
Statistics

Experience

Data Engineer | Data Analyst - Northeastern University

Feb 2023 - Sep 2023

Engineered high-availability data streaming analytics platform, ensuring a 98% real-time uptime for 20 smart homes using Python
Built an energy dashboard visualizing Key Performance Indicators (KPIs) and metrics using SQL, MariaDB, and Jupyter Notebooks
Applied SQL query optimization and software engineering best practices, leading to a 60% reduction in SQL query response times
Performed in-depth Exploratory Data Analysis (EDA) to unveil patterns and trends to support data-driven decision-making processes

Data Engineer | Data Platform Engineer - MontAI

Jul 2022 - Dec 2022

Established a 100 TB Data Lake on AWS by collecting, cleaning, and aggregating health, food, drug, biotech, and bioinformatics data
Developed ETL (Extract, Transform, Load) pipelines using AWS Glue, Redshift, S3, PySpark, Lambda, and SQS to process 100 TB
Implemented test automation through test-driven development (TDD) and GitHub Actions workflows, improving existing code quality by 100%
Performed data integration of 5 GB/day from disparate data sources, APIs, and file formats (XML, CSV, JSON, Parquet, Avro, ORC)
Translated business Objectives and Key Results (OKRs) into technical specifications, design documentation, and optimal solutions
Engineered Python-based web scrapers to collect, integrate, and parse 10 GB from APIs, boosting data collection efficiency

Senior Big Data Engineer - Legato

Jun 2018 - Aug 2021

Constructed and led scalable ETL/ELT data pipelines for 5 US healthcare initiatives across Anthem and Blue Cross Blue Shield
Optimized large-scale data transformations in AWS Cloud using EMR, EC2, Athena, CloudWatch, and Step Functions, cutting costs by 30%
Automated a data quality framework in Spark Scala for Hive and SQL Server, cutting errors and saving $7,000 per quarter
Built extensible data architecture with 10 Snowflake data marts and data models, improving Business Intelligence (BI) metrics by 20%
Mentored and led a team of 4 new graduates in a training program focused on Big Data and DevOps tools (Jira, Bitbucket, Bamboo)
Orchestrated code migration, continuous integration, and continuous deployment (CI/CD), reducing deployment time by 25%
Unified Agile Scrum with unit testing, system integration testing, and code reviews, slashing post-deployment defects by 18%
Enabled the migration of 112 TB of data from on-premises distributed computing Hadoop clusters to AWS Cloud and Snowflake
Performed massive parallel processing (MPP) and large-scale data transformations on cloud and on-premises data warehouses
Built large-scale applications and data structures for data ingestion from Neo4j and MongoDB (graph and NoSQL OLTP) databases in Airflow
Integrated 2 ML pipelines with Continuous Improvement/Learning and automated deployments using Apache Airflow and Docker

Software Engineer - Hadoop Developer & Big Data Engineer - Middle East Management Consultancy

Jun 2016 - May 2018

Created high-volume robust data infrastructure on distributed computing on-premises Hadoop clusters capable of handling 15 TB
Designed interactive Tableau Dashboards to derive meaningful insights, leading to a 12% increase in pharmaceutical finance sales
Developed 10 end-to-end high-volume, multi-layer data processing pipelines from ingestion layers to the serving and reporting layers
Ingested 26 TB from relational databases (MySQL, Oracle, PostgreSQL) via Sqoop and shell scripts, enhancing data access
Improved enterprise data warehouse (EDW) scalability by 20% using fact and dimension data modeling (Star Schema, SCD)
Established DataOps for version control with Git, data cleaning, data management, governance, and data lineage tracking for datasets
Maintained integrity and compliance with clean data sets for 4 projects, contributing to a 15% improvement in accuracy and reliability
Collaborated with cross-functional teams, managed stakeholders, and gathered requirements for streamlined project communication
Implemented join optimization, data partitioning, and performance tuning, achieving a 40% performance improvement in Apache Spark jobs
Redesigned the Data Lake to use Parquet and Snappy compression, cutting storage and compute costs by 30%
Performed Partitioning, Bucketing, Join optimizations, and Compression in Hive

Education

2021 - 2023

Master of Science in Data Analytics Engineering

Northeastern University, Boston

GPA: 3.9 out of 4.0

2012 - 2016

Bachelor of Technology in Computer Science and Engineering

Manipal Institute of Technology, Manipal

Projects

Machine Learning MLOps Python

E-commerce Customer Segmentation

Developed and deployed real-time clustering machine learning models and algorithms for e-commerce customers on GCP using Python, Vertex AI, GCS, Flask, Airflow, Docker, MLflow, and TensorFlow

Streamlit Python API

YouTube Dashboard

Real-time analytics dashboard generated for any input YouTube video, with a demo. Presents sentiment analysis and 5 KPIs that can be used to drive up ad revenue.

Demo

Tableau SQL Data Visualization

Age of Plastic

Data-driven Tableau dashboards showing the impact of global plastic pollution on land and ocean environments, along with the mitigation steps taken by different countries.

Tableau

Machine Learning Python

Appliance Energy Prediction

Predicted the energy consumed by appliances using Machine Learning models and algorithms such as PCA, Neural Networks, Lasso, Ridge, and Linear Regression, custom-coded from scratch in Python, with 80% confidence.

Report

Machine Learning Python ArcGIS API

Clustering Paris and London

Clustered similar neighborhoods with Machine Learning and visualized the geospatial analysis using ArcGIS and Folium to reveal new insights

Machine Learning Tableau Data Visualization

FBI Crime Reports

Forecasted FBI-reported uniform major crimes in every US state, with a focus on Aggravated Assault, Homicides, Intimidation, and Motor Theft, using Machine Learning with 90% accuracy, and visualized the results on a Tableau dashboard

Tableau

API Kafka Elasticsearch

Kafka Tweet Stream

Streamed and ingested real-time Tweets on current affairs into Elasticsearch through a performance-tuned Kafka 2.0.0 producer and consumer, using safe, idempotent, and compression configurations
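
As a rough illustration of what "safe, idempotent, and compression configurations" can mean for a Kafka producer, the sketch below collects commonly used producer properties in a plain Python dictionary. The property names are standard Kafka producer configuration keys, but the specific values and the helper `is_safe_producer` are assumptions made for illustration, not the settings used in this project.

```python
# Illustrative Kafka producer settings; the values are assumptions,
# not the project's actual configuration.
SAFE_PRODUCER_CONFIG = {
    # Safety: wait for all in-sync replicas and avoid duplicates on retry.
    "acks": "all",
    "enable.idempotence": "true",
    "retries": str(2**31 - 1),
    "max.in.flight.requests.per.connection": "5",
    # Throughput: compress and batch messages before sending.
    "compression.type": "snappy",
    "linger.ms": "20",
    "batch.size": str(32 * 1024),
}


def is_safe_producer(config: dict) -> bool:
    """Hypothetical helper: check the settings an idempotent producer relies on."""
    return (
        config.get("acks") == "all"
        and config.get("enable.idempotence") == "true"
        and int(config.get("max.in.flight.requests.per.connection", "5")) <= 5
        and int(config.get("retries", "0")) > 0
    )


if __name__ == "__main__":
    print(is_safe_producer(SAFE_PRODUCER_CONFIG))
```

A dictionary like this could be passed to a Kafka client library when constructing a producer; the exact constructor depends on the client chosen.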

Data Engineering SQL Scala

Movie Analytics

Analyzed a million movies using Spark Scala to draw useful insights on viewer engagement

Data Engineering Data Analysis Python

Diabetic Readmission Exploration

Explored data on patients readmitted with diabetes and drew meaningful insights, documented in a report

Data Visualization Data Analysis Tableau

Analysis And Visualizations Of Nursing Home Data

Computed and visualized a data-driven story from the Centers for Medicare & Medicaid Services (CMS) nursing facility data, generating visuals that highlight nursing homes' resource limits using Flourish, Datawrapper, and Tableau, hosted on Google Sites.

Demo

Data Visualization Data Analysis

Investigating GDP Expenditure

Visualized expenditure trends in sectors such as Education, Pharmaceuticals, Military, Infrastructure, and Research and Development across different countries for the years 1960 - 2020 using Flourish and Datawrapper, hosted on Google Sites

Demo

Machine Learning Regression Python

Forecasting Healthcare costs

Predicted the cost of healthcare and insurance using Python and a Linear Regression Machine Learning model with 80% accuracy

Machine Learning Classification Python

Predicting hits on Spotify

Analyzed over 40,000 songs from 6 different decades to predict hit songs on Spotify using various classification Machine Learning models and Python.

Data Analysis Data Visualization R

Olympic History Analytics

Discovering & visualizing various trends in 120 years of Olympic history using R

Data Analysis Data Visualization R

Social Media Analysis

Explored data from 1.6 million Twitter users to draw useful insights and uncover interesting patterns. The techniques used include text mining, sentiment analysis, probability, time series analysis, and hierarchical clustering on text/words using R

Achievements

Annual Team Innovation

Annual Innovation award for the enhancements and innovation brought about from 2019 to 2020 at Legato Health Technologies, Mar 2020

Iron Man of Technology

Award for being a stand-out performer and contributor under the Enterprise Data Analytics Tower at Legato Health Technologies, Jan 2020

Anthem IMPACT Go Above

IBM Data Science Professional

Arctic Code Vault Contributor

E-Commerce Customer Segmentation

Predicting treatment costs

File Processing

Performance Tuning Apache Sqoop

A Tale of Two Cities