I am Thomas George Thomas

A Data Engineer passionate about Data Science 📊. I like automating things, building pipelines, exploring scalability problems, improving efficiency, and tuning performance. I’m a strong advocate for 📜 open source, ☁️ cloud computing, 🚀 DevOps, 🆕 innovation, and automation 🤖.






















Data Analyst

Northeastern University

Feb 2023 – Present Massachusetts, United States of America


  • Building a dashboard for 20 smart homes across Colorado and Massachusetts that showcases uptime analysis, energy prediction, and KPIs, using Plotly Dash and MariaDB

Technologies: Python, Plotly Dash, MariaDB, MySQL


Data Engineer

Montai Health

Jul 2022 – Dec 2022 Massachusetts, United States of America


  • Built Extract, Transform, and Load (ETL) pipelines on AWS using Redshift, SQS, Lambda, Batch, EMR, EC2, PySpark, Athena, Glue, and Boto3 to transform up to 100 TB of data
  • Developed a 100 TB health, drug, biochemical, and bioinformatics data lake from multiple relational (SQL) and NoSQL databases using SQL and AWS
  • Created multithreaded web scrapers in Python to crawl data from a variety of sources and file formats, including CSV, XML, Parquet, APIs, and FTP servers, collecting 5 GB of data daily
  • Enabled test-driven development (TDD) with test automation using pytest, pylint, and coverage metrics to increase code quality by 100%
  • Implemented Continuous Integration and Continuous Deployment (CI/CD) pipelines using GitHub Actions to ship 90% more efficiently
  • Used Python packages such as Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, Folium, Requests, Beautiful Soup, ElementTree, LXML, and Multiprocess

Technologies: Python, API, AWS: S3, EMR, Athena, Glue, Redshift, Lambda, Batch, PySpark, Shell Scripting, Git, GitHub, Packages: Pandas, Requests, BeautifulSoup, Multiprocess, Pytest


Senior Data Engineer

Legato Health Technologies, Elevance Health

Jun 2018 – Aug 2021 Bangalore, India


  • Built data pipelines for 5 initiatives, including Clinical Investigative Insights, on AWS, Hadoop, and Apache Spark
  • Migrated 112 TB of data from the on-premises Hadoop cluster to AWS and Snowflake
  • Innovated and automated post-migration validation reports in Spark Scala, which led to $7,000 in quarterly savings
  • Redesigned and refactored project architecture and Spark Scala ETL code, bringing down costs by 90%
  • Implemented continuous integration and continuous deployment (CI/CD) pipelines for 4 projects using DevOps
  • Chaired release management and code migration for production/pre-production environments for 5 projects
  • Built Spark Scala and PySpark applications using the Spark RDD, DataFrame/Dataset, and Spark SQL APIs, and performance-tuned Spark jobs
  • Built scalable Extract, Transform, Load (ETL) data pipelines and performed large-scale data transformations on cloud and on-premises infrastructure

Technologies: Hadoop, Spark, Scala, Snowflake, AWS: RDS, S3, EMR, Athena, Hive, Impala, Unix, Shell scripting, Control M, Bamboo, Git, Bitbucket, Maven, Eclipse, Cloudera distribution


Software Engineer - Hadoop Developer & Big Data Engineer

Middle East Management Consultancy and Marketing

Jun 2016 – May 2018 Muscat, Oman


  • Shipped an analytics dashboard that led to a 12% annual increase in pharmaceutical sales
  • Developed pipelines to handle 1.5 TB of data per day from ingestion to the reporting layer using shell scripts, Hadoop, and Spark
  • Implemented a 26 TB dataset transfer between Hadoop and MySQL RDBMS using Sqoop
  • Tuned performance in Spark, SQL, and Sqoop, reducing response time by 60%
  • Redesigned the data lake to use Parquet and Snappy compression, cutting storage and compute costs by 30%
  • Applied Agile DevOps practices to build and maintain code quality, version control, and continuous integration and continuous deployment (CI/CD) using Maven, Git, Bitbucket, Atlassian Bamboo, Jira, Confluence, and GitHub Actions
  • Handled file formats like Parquet, XML, ORC, Avro, JSON, and CSV
  • Performed Partitioning, Bucketing, Join optimizations, and Compression in Hive

Technologies: Hadoop, Sqoop, Hive, Impala, Shell scripting, MySQL, Spark, Scala, SonarQube, Flume, Unix, Git


Practice School Student/Researcher

Manipal Institute of Technology

Jan 2016 – May 2016 Manipal, India

Central Data Repository for MIT, Manipal:

Delivered a web application for entering, collecting, and analyzing data, and for generating reports dynamically according to each user's custom report format. The data was loaded from the databases using Sqoop and analyzed on a Hadoop cluster. The reports were generated from Hive queries and displayed in the web application.


Software Development Intern

CGI Information Systems and Management Consultants Pvt. Ltd

May 2015 – Jul 2015 Manipal, India

Project Management System:

Developed a web application that let users from different departments interact with one another and with their respective projects, and access their functions at scale.



Anthem Go Above IMPACT Award 2021

Awarded for going above and beyond in 2021

IBM Data Science Professional

Earned for completing IBM Data Science Certification

Annual Team Innovation Award

Awarded for innovations delivered for 2019 – 2020.

Arctic Code Vault Contributor

Awarded for OSS contributions towards the GitHub Archive program.

Iron Man of Technology 2

Awarded for being a standout performer for Q4 of 2019.



Diabetic Readmission Exploration

Exploring and drawing meaningful insights for patients readmitted with Diabetes

Appliance Energy Prediction

Predicting the Energy consumed by appliances using Machine Learning algorithms built from scratch

YouTube Analytics Dashboard on Streamlit

Real-time analytics dashboard generated for an input YouTube video. Shows sentiment analysis that can be used to drive up ad revenue.

Age of Plastic

Created a data-driven storyboard in Tableau showing the impact of global plastic pollution on land and ocean environments, along with the recycling rates of different countries.

FBI Crime Reporting Analytics

Mined FBI Uniform Crime Reporting data on major crimes reported in every US state and visualized it on a Tableau dashboard.

Analysis and Visualizations of Nursing Home Data

Computed and visualized a data-driven story from the Centers for Medicare & Medicaid Services (CMS) nursing facility data, generating visuals that highlight nursing homes' resource limits using Flourish, Datawrapper, and Tableau, hosted on Google Sites.

Predicting Hits on Spotify

Predicting hit songs on Spotify by classifying 40,000 songs using various machine learning classification models

Investigating GDP Expenditure

Visualized expenditure trends in sectors such as education, pharmaceuticals, military, infrastructure, and research and development across different countries for the years 1960–2020, using Flourish and Datawrapper, hosted on Google Sites.

Social Media Analytics

Examining data from 1.6 million Twitter users to draw useful insights and explore interesting patterns. The techniques used include text mining, sentiment analysis, probability, time series analysis, and hierarchical clustering on text using R.

Olympic History Analytics

Discovering & visualizing various trends in 120 years of Olympic history using R

Retro Movies Recommender

A content-based recommendation engine API for movies of the 1900s, built using NLP, Flask, Heroku, and Python.

Clustering Paris and London

Clustering Neighborhoods of Paris and London using Machine learning.

Forecasting Healthcare Costs

Predicting the cost of treatment and insurance using Machine Learning.

Covid 19 Tweet Data Scraping

Streams real-time tweets on current affairs such as COVID-19 into Elasticsearch using a Kafka 2.0.0 high-throughput producer and consumer with safe, idempotent, and compression configurations.
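
As a rough illustration of what safe, idempotent, and compressed producer settings can look like, the sketch below collects them in a plain Python dictionary. The keys are standard Kafka producer configuration properties. The function name `safe_producer_config`, the broker address, and the chosen values for `compression.type`, `linger.ms`, and `batch.size` are assumptions made for illustration, not the settings used in this project.

```python
def safe_producer_config(bootstrap_servers: str) -> dict:
    """Hypothetical helper: Kafka producer properties for a safe, high-throughput setup."""
    return {
        "bootstrap.servers": bootstrap_servers,
        # Safety: idempotent writes, acknowledged by all in-sync replicas.
        "enable.idempotence": "true",
        "acks": "all",
        "retries": str(2**31 - 1),
        # Idempotence requires at most 5 in-flight requests per connection.
        "max.in.flight.requests.per.connection": "5",
        # Throughput: compress batches and wait briefly so they fill up.
        # These three values are illustrative assumptions.
        "compression.type": "snappy",
        "linger.ms": "20",
        "batch.size": str(32 * 1024),
    }


if __name__ == "__main__":
    # "localhost:9092" is a placeholder broker address.
    for key, value in safe_producer_config("localhost:9092").items():
        print(f"{key}={value}")
```

A dictionary like this can be handed to whichever Kafka client is in use, after adapting the key format to what that client expects.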

File Processing Comparative Analytics

Determining which programming languages and execution engines are the quickest or the slowest at processing files

Movie Analytics in Spark and Scala

Data cleaning, pre-processing, and Analytics on a million movies using Spark and Scala.

Recent Publications

Some of my recent literary work

Determining insurance and treatment costs based on different features to ultimately build a regression model that accurately predicts trends.

Comparing and benchmarking popular programming languages and execution engines

Six definite ways to improve efficiency and reduce load times.

Clustering Neighborhoods of London and Paris using Machine Learning

