I am Thomas George Thomas

A Data Engineer passionate about Data Science 📊. I like automating things, building pipelines, exploring scalability problems, improving efficiency, and tuning performance. I'm a strong advocate for 📜 open source, ☁️ cloud computing, 🚀 DevOps, 🆕 innovation, and 🤖 automation.

Data Analyst

Northeastern University

Feb 2023 – Present Massachusetts, United States of America
  • Designed and implemented an analytics platform using a Kappa architecture with MariaDB, Python, SQL, and Jupyter Notebook, achieving 98% uptime for 20 smart homes in Colorado and Massachusetts.

Technologies: Python, Plotly Dash, MariaDB, MySQL, Jupyter Notebook


Data Engineer

Montai Health

Jul 2022 – Dec 2022 Massachusetts, United States of America
  • Established a comprehensive health, drug, biochemical, and bioinformatic Data Lake, aggregating data from diverse Relational Database Management Systems (RDBMS), Graph, and NoSQL-based databases on AWS, accumulating 100 TB of high-quality data.
  • Developed Extract, Transform, and Load (ETL) pipelines on the AWS cloud, leveraging Redshift, SQS, Lambda, Batch, EMR, EC2, PySpark, Athena, Glue, and Boto3 to process a substantial 100 TB of data.
  • Created multithreaded web scrapers in Python to extract data from a variety of sources and file formats, including CSV, JSON, XML, Parquet, ORC, Avro, APIs, and FTP servers, collecting 5 GB of data daily.
  • Orchestrated continuous improvement and automated test-driven development (TDD) workflows using GitHub Actions, pytest, pylint, and py-coverage metrics, leading to a 100% improvement in code quality metrics.
  • Utilized Python packages such as Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, Folium, Requests, Beautiful Soup, ElementTree, LXML, and Multiprocess for data analysis and processing.
  • Implemented a robust data pipeline framework, optimizing data processing and transformation workflows.

Technologies: Python, API, AWS: S3, EMR, Athena, Glue, Redshift, Lambda, Batch, PySpark, Shell Scripting, Git, GitHub, Packages: Pandas, Requests, BeautifulSoup, Multiprocess, Pytest
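The multithreaded, multi-format extraction described above can be sketched with Python's `concurrent.futures`; this is a minimal illustration, not the production scrapers — the sample payloads and `parse_record` helper are hypothetical stand-ins for real API and FTP sources.

```python
import concurrent.futures
import json
import xml.etree.ElementTree as ET

# Hypothetical sample payloads standing in for real API/FTP responses.
SOURCES = [
    ("json", '{"id": 1, "value": 10}'),
    ("json", '{"id": 2, "value": 20}'),
    ("xml", '<record id="3"><value>30</value></record>'),
]

def parse_record(fmt, payload):
    """Parse one raw payload into a uniform dict, dispatching on format."""
    if fmt == "json":
        data = json.loads(payload)
        return {"id": data["id"], "value": data["value"]}
    if fmt == "xml":
        root = ET.fromstring(payload)
        return {"id": int(root.get("id")), "value": int(root.find("value").text)}
    raise ValueError(f"unsupported format: {fmt}")

def extract_all(sources, max_workers=4):
    """Parse many payloads concurrently with a thread pool, preserving order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: parse_record(*s), sources))

records = extract_all(SOURCES)
```

Threads suit this workload because scraping is I/O-bound; `pool.map` keeps results in source order, which simplifies downstream loading.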


Senior Data Engineer

Legato Health Technologies, Elevance Health

Jun 2018 – Aug 2021 Bangalore, India
  • Spearheaded the construction of scalable ETL/ELT (Extract, Transform, Load / Extract, Load, Transform) data pipelines for 5 U.S. healthcare initiatives, enabling large-scale data transformations on cloud and on-premises data warehouses.
  • Successfully migrated 112 TB of data from on-premises Hadoop clusters to AWS Cloud and Snowflake, resulting in substantial cost savings.
  • Pioneered an automated data quality framework in Spark Scala, reducing data errors by 35% and saving $7,000 quarterly.
  • Executed Agile DevOps practices, ensuring code quality, version control, continuous integration, and continuous deployment (CI/CD) for 4 projects using tools such as Maven, Git, Bitbucket, Atlassian Bamboo, Jira, and Confluence.
  • Proficient in stakeholder interaction, business requirements gathering, data analysis, design document creation, release management, source control management, code migration, and code reviews.

Technologies: Hadoop, Spark, Scala, Snowflake, AWS: RDS, S3, EMR, Athena, Hive, Impala, Unix, Shell scripting, Control M, Bamboo, Git, Bitbucket, Maven, Eclipse, Cloudera distribution
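The data quality framework above was built in Spark Scala; as a rough illustration of the rule-based checking it performed, here is a minimal plain-Python sketch. The column rules and sample rows are hypothetical, not the actual healthcare rules.

```python
# Minimal rule-based data quality checker: each rule maps a column name
# to a predicate; rows failing any rule are flagged for review.
def run_quality_checks(rows, rules):
    failures = []
    for i, row in enumerate(rows):
        for column, predicate in rules.items():
            if not predicate(row.get(column)):
                failures.append((i, column))
    return failures

# Hypothetical claim records and rules.
rows = [
    {"claim_id": "C1", "amount": 120.0},
    {"claim_id": None, "amount": 50.0},   # missing key -> flagged
    {"claim_id": "C3", "amount": -5.0},   # negative amount -> flagged
]
rules = {
    "claim_id": lambda v: v is not None,
    "amount": lambda v: v is not None and v >= 0,
}

failures = run_quality_checks(rows, rules)
```

In the Spark version the same idea applies per partition, so checks scale out with the data rather than running on a single machine.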


Software Engineer - Hadoop Developer & Big Data Engineer

Middle East Management Consultancy and Marketing

Jun 2016 – May 2018 Muscat, Oman
  • Delivered an analytics product that contributed to a 12% annual increase in pharmaceutical finance sales.
  • Designed and implemented a highly efficient data pipeline capable of processing 1.5 TB of data daily, streamlining data processing and transformation.
  • Achieved seamless data transfer of 26 TB between Hadoop and MySQL RDBMS using Sqoop, improving data accessibility.
  • Implemented best practices and performance tuning in Apache Spark jobs, achieving a 60% reduction in response times for Spark jobs, SQL queries, and Sqoop processes.
  • Redesigned the Data Lake to use Parquet with Snappy compression, cutting storage and compute costs by 30%.
  • Performed partitioning, bucketing, join optimizations, and compression in Hive.
  • Applied advanced dimensional data modeling techniques, including Star schema, Kimball, and Inmon, resulting in a 20% improvement in data warehouse scalability.
  • Established robust data management, governance, and data quality standards, ensuring the reliability and accuracy of datasets for data warehousing and decision-making processes.

Technologies: Hadoop, Sqoop, Hive, Impala, Shell scripting, MySQL, Spark, Scala, SQL, SonarQube, Flume, Unix, Git
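The Hive partitioning mentioned above works by laying data out in `key=value` directories so that queries touching one partition never scan the others. A minimal Python sketch of that layout, using hypothetical sales rows rather than the real datasets:

```python
import csv
import os
import tempfile

def write_partitioned(rows, base_dir, partition_key):
    """Write rows into Hive-style key=value subdirectories."""
    groups = {}
    for row in rows:
        groups.setdefault(row[partition_key], []).append(row)
    for value, group in groups.items():
        part_dir = os.path.join(base_dir, f"{partition_key}={value}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-0000.csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)
    return sorted(os.listdir(base_dir))

# Hypothetical sales rows partitioned by country.
rows = [
    {"sale_id": "1", "country": "IN", "amount": "100"},
    {"sale_id": "2", "country": "OM", "amount": "250"},
    {"sale_id": "3", "country": "IN", "amount": "75"},
]
base_dir = tempfile.mkdtemp()
partitions = write_partitioned(rows, base_dir, "country")
```

A query filtered on `country = 'IN'` can then read only the `country=IN` directory — the same partition-pruning effect Hive provides, here shown with CSV instead of Parquet for simplicity.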


Practice School Student/ Researcher

Manipal Institute of Technology

Jan 2016 – May 2016 Manipal, India

Central Data Repository for MIT, Manipal:

  • Delivered a web application whose main objectives were to serve as a means of data entry, collect the required data, analyze it, and generate reports dynamically according to users' custom report format requirements. Data was loaded from source databases using Sqoop and analyzed on a Hadoop cluster; reports were generated by querying with Hive and displayed in the web application.

Software Development Intern

CGI Information Systems and Management Consultants Pvt. Ltd

May 2015 – Jul 2015 Manipal, India

Project Management System:

  • Developed a web application that enabled users across different departments to interact with their respective projects and access project functions at scale.



Anthem Go Above IMPACT Award 2021

Awarded for going above and beyond in 2021

IBM Data Science Professional

Earned for completing IBM Data Science Certification

Annual Team Innovation Award

Awarded for innovations delivered for 2019 – 2020.

Arctic Code Vault Contributor

Awarded for OSS contributions towards the GitHub Archive program.

Iron Man of Technology 2

Awarded for being a standout performer for Q4 of 2019.



Diabetic Readmission Exploration

Exploring and drawing meaningful insights for patients readmitted with Diabetes

Appliance Energy Prediction

Predicting the Energy consumed by appliances using Machine Learning algorithms built from scratch
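As a minimal sketch of the "built from scratch" idea, here is gradient-descent linear regression with no ML libraries; the one-feature dataset is hypothetical, not the appliance energy data itself.

```python
# Gradient-descent linear regression written from scratch (no ML libraries).
def fit_linear(xs, ys, lr=0.05, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical readings generated from energy = 3 * usage_hours + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 4.0, 7.0, 10.0]
w, b = fit_linear(xs, ys)
```

On this toy data the fit recovers the slope and intercept closely; the same loop generalizes to multiple features by keeping one weight per feature.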

YouTube Analytics Dashboard on Streamlit

Real-time analytics dashboard generated for an input YouTube video; shows sentiment analysis that can be used to drive up ad revenue.

Age of Plastic

Created a data-driven storyboard showing the impact of global plastic pollution on the environment, both land and ocean, and the recycling rates of different countries, using Tableau.

FBI Crime Reporting Analytics

Data mining FBI uniform major crimes reported in every US state and visualized on a Tableau dashboard.

Analysis and Visualizations of Nursing Home Data

Computed and visualized a data-driven story from the Centers for Medicare & Medicaid Services (CMS) nursing facility data, generating visuals that highlight nursing homes' resource limits using Flourish, Datawrapper, and Tableau, hosted on Google Sites.

Predicting Hits on Spotify

Predicting hit songs on Spotify by classifying 40,000 songs using various Classification Machine Learning Models

Investigating GDP Expenditure

Visualized expenditure trends in sectors such as Education, Pharmaceuticals, Military, Infrastructure, and Research and Development across different countries for the years 1960–2020 using Flourish and Datawrapper, hosted on Google Sites.

Social Media Analytics

Analyzed data from 1.6 million Twitter users to draw useful insights and explore interesting patterns. Techniques used include text mining, sentiment analysis, probability, time series analysis, and hierarchical clustering on text/words using R.

Olympic History Analytics

Discovering & visualizing various trends in 120 years of Olympic history using R

Retro Movies Recommender

A Content-based recommendation engine API for movies of the 1900s built using NLP, Flask, Heroku and Python.

Clustering Paris and London

Clustering Neighborhoods of Paris and London using Machine learning.

Forecasting Healthcare Costs

Predicting the cost of treatment and insurance using Machine Learning.

Covid 19 Tweet Data Scraping

Streamed real-time tweets on current affairs such as COVID-19 into Elasticsearch using a Kafka 2.0.0 high-throughput producer and consumer with safe, idempotent, and compression-enabled configurations.

File Processing Comparative Analytics

Determining which programming languages and execution engines are the quickest or the slowest at processing files

Movie Analytics in Spark and Scala

Data cleaning, pre-processing, and Analytics on a million movies using Spark and Scala.

Recent Publications

Some of my recent literary work

Determining insurance and treatment costs based on different features to ultimately build a regression model that accurately predicts trends.

Comparing and Benchmarking popular programming languages and execution engines

Six definite ways to improve efficiency and reduce load times.

Clustering Neighborhoods of London and Paris using Machine Learning

