I am Thomas George Thomas

A Data Engineer passionate about Data Science 📊. I like automating things, building pipelines, exploring scalability problems, improving efficiency, and tuning performance. I’m a strong advocate for 📜 open source, ☁️ cloud computing, 🚀 DevOps, 🆕 innovation, and automation 🤖.

Senior Data Engineer

Legato Health Technologies, Anthem Inc.

Jun 2018 – Aug 2021 Bangalore, India


  • Built data pipelines on AWS and Hadoop for 5 initiatives, including one providing Clinical Investigative Insights
  • Migrated 112 TB of data from the on-premises Hadoop cluster to AWS and Snowflake
  • Automated post-migration validation reports in Spark Scala, bringing down costs by 90% for 2 projects
  • Reduced latency by 50% by refactoring Spark Scala ETL code, which led to $7,000 in quarterly savings
  • Implemented continuous integration and continuous deployment (CI/CD) pipelines for 4 projects using DevOps
  • Chaired release management and code migration for production/pre-production environments for 2 projects

Technologies: Hadoop, Spark, Scala, Snowflake, AWS (RDS, S3, EMR, Athena), Hive, Impala, Unix, Shell scripting, Control-M, Bamboo, Git, Bitbucket, Maven, Eclipse, Cloudera distribution


Software Engineer - Hadoop Developer & Big Data Engineer

Middle East Management Consultancy and Marketing

Jun 2016 – May 2018 Muscat, Oman


  • Shipped an analytics dashboard that led to a 12% annual increase in pharmaceutical sales
  • Developed pipelines handling 1.5 TB of data per day, from ingestion to the reporting layer, using shell scripts, Hadoop, and Spark
  • Implemented dataset transfer of 26 TB between Hadoop and MySQL RDBMS using Sqoop
  • Tuned performance in Spark, SQL, and Sqoop, resulting in a 60% reduction in response time
  • Redesigned the data lake to use Parquet with Snappy compression, cutting storage and compute costs by 30%

Technologies: Hadoop, Sqoop, Hive, Impala, Shell scripting, MySQL, Spark, Scala, SonarQube, Flume, Unix, Git


Practice School Student/ Researcher

Manipal Institute of Technology

Jan 2016 – May 2016 Manipal, India

Central Data Repository for MIT, Manipal:

Delivered a web application for data entry, data collection, analysis, and dynamic report generation according to the user’s custom report format requirements. The data was loaded from the databases using Sqoop and analyzed on a Hadoop cluster. The reports were generated by querying with Hive and displayed in the web application.


Software Development Intern

CGI Information Systems and Management Consultants Pvt. Ltd

May 2015 – Jul 2015 Manipal, India

Project Management System:

Developed a web application that enabled users from different departments to interact with one another and to access the functions of their respective projects at scale.



Anthem Go Above IMPACT Award 2021

Awarded for going above and beyond in 2021
See certificate

IBM Data Science Professional

Earned for completing the IBM Data Science certification
See certificate

Annual Team Innovation Award

Awarded for innovations delivered for 2019 – 2020.

Arctic Code Vault Contributor

Awarded for OSS contributions to the GitHub Archive Program.
See certificate

Iron Man of Technology 2

Awarded for being a standout performer for Q4 of 2019.
See certificate



Age of Plastic

Created a data-driven storyboard in Tableau showing the impact of global plastic pollution on land and ocean environments, along with the recycling rates of different countries.

FBI Crime Reporting Analytics

Analyzed the FBI’s uniform reporting of major crimes in every US state and visualized it on a Tableau dashboard. A data mining hackathon project.

Analysis and Visualizations of Nursing Home Data

Created a data-driven storyline from Centers for Medicare & Medicaid Services (CMS) nursing facility data, with visuals that highlight nursing homes’ resource limits, built using Flourish, Datawrapper, and Tableau and hosted on Google Sites. A computation and visualization hackathon project.

Predicting Hits on Spotify

Predicting hit songs on Spotify by classifying 40,000 songs using various machine learning classification models.

Investigating GDP Expenditure

Visualized the expenditure trends of different countries in sectors such as Education, Pharmaceuticals, Military, Infrastructure, and Research and Development for the years 1960 – 2020 using Flourish and Datawrapper, hosted on Google Sites.

Social Media Analytics

Analyzing data from 1.6 million Twitter users to draw useful insights and explore interesting patterns. The techniques used include text mining, sentiment analysis, probability, time series analysis, and hierarchical clustering on text/words using R.

Olympic History Analytics

Discovering & visualizing various trends in 120 years of Olympic history using R

Retro Movies Recommender

A content-based recommendation engine API for movies of the 1900s, built using NLP, Flask, Heroku, and Python.

Clustering Paris and London

Clustering neighborhoods of Paris and London using machine learning.

Forecasting Healthcare Costs

Predicting the cost of treatment and insurance using Machine Learning.

Covid 19 Tweet Data Scraping

Streams real-time tweets on current affairs such as COVID-19 into Elasticsearch using a high-throughput Kafka 2.0.0 producer and consumer configured for safety, idempotence, and compression.

File Processing Comparative Analytics

Determining which programming languages and execution engines are the quickest or the slowest at processing files

Movie Analytics in Spark and Scala

Data cleaning, pre-processing, and analytics on a million movies using Spark and Scala.

Recent Publications

Some of my recent literary work

Determining insurance and treatment costs based on different features to ultimately build a regression model that accurately predicts trends.

Comparing and Benchmarking popular programming languages and execution engines

Six definite ways to improve efficiency and reduce load times.

Clustering Neighborhoods of London and Paris using Machine Learning