Vishwas Kumar

Data Analyst

Dallas, TX, USA

Careers

Big Data Engineer (Azure)

Republic Services

Contract, 08/2021 - 01/2023

• Designed and developed Azure Data Factory (ADF) pipelines to ingest data from relational and non-relational source systems to meet business functional requirements.
• Created automated Databricks workflows in Python to run multiple data loads.
• Automated jobs using Azure Data Factory and used the ingested data for analytics in Power BI.
• Developed ETL pipelines in PySpark using the Spark SQL and DataFrame APIs.
• Built end-to-end data pipelines from various data sources to Azure SQL Data Warehouse.
• Created an Azure SQL database, migrated Microsoft SQL Server to it, and monitored and restored it.
• Developed a Python script for loading data into Azure Synapse Analytics tables.
• Created database objects such as tables, views, stored procedures, triggers, packages, and functions using T-SQL to provide structure and maintain data efficiently.
• Used Azure Synapse to manage processing workloads and serve data for BI and prediction needs.
• Integrated on-premises data (MySQL, Cassandra) and cloud data (Azure SQL DB, Blob Storage) using Azure Data Factory, applied transformations, and loaded the results into Azure Synapse.
• Developed and maintained data ingestion and CI/CD pipelines following the design architecture and processes: source to landing, landing to curated, and curated to process.
• Used the Databricks Delta Lake storage layer to create versioned Apache Parquet files with transaction logs and audit history.
• Tuned Spark applications (batch interval time, level of parallelism, and memory) to improve processing time and efficiency.
• Wrote Databricks notebooks (Python) to handle large volumes of data, transformations, and computations across various file formats.
• Reduced access time by refactoring data models and optimizing queries, and implemented a Redis cache to support Snowflake.

Big Data Engineer (AWS)

Bio-Rad Laboratories

Contract, 07/2020 - 08/2021

• Used AWS Athena extensively to import structured data from S3 into other systems such as Redshift and to generate reports.
• Used Spark to improve the speed of existing Hadoop algorithms.
• Created a data pipeline using Processor Groups and numerous processors in Apache NiFi for flat-file and RDBMS sources as part of a Proof of Concept (POC) on Amazon EC2.
• Migrated an existing on-premises application to AWS, using services such as EC2 and S3 for data set processing and storage.
• Maintained a Hadoop cluster on AWS EMR (Elastic MapReduce).
• Used Spark Streaming APIs to perform on-the-fly transformations and actions for the common learner data model, which receives data from Kinesis in near real time.
• Performed end-to-end architecture and implementation evaluations of AWS services such as Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis.
• Created external table schemas in Hive, the primary query engine on EMR, for the data being processed.
• Created Presto and Apache Drill configurations on an AWS EMR cluster to integrate different databases such as MySQL and Hive, allowing operations such as joins and inserts across many data sources to be compared from a single platform.
• Created an AWS RDS (Relational Database Service) instance to act as the Hive metastore, consolidating metadata from 20 EMR clusters into a single RDS instance and preventing metadata loss even if an EMR cluster was terminated.
• Developed and implemented ETL pipelines on S3 Parquet files in a data lake using AWS Glue.
• Developed a CloudFormation template in JSON format to provide content delivery with cross-region replication using Amazon Virtual Private Cloud.
• Used an AWS CodeCommit repository to preserve programming logic and scripts, which were subsequently replicated to new clusters.
• Implemented columnar data storage, advanced compression, and massively parallel processing using multi-node Redshift.
• Involved in the development of the new AWS Fargate API, which is comparable to the ECS run task API.
• Moved the code of a quality monitoring application from AWS EC2 to AWS Lambda and built logical datasets to administer quality monitoring on Snowflake warehouses.
• Proficient with container systems such as Docker, container orchestration tools such as EC2 Container Service and Kubernetes, and Terraform.
• Created HDFS workloads on Kubernetes clusters to mimic the production workload for development purposes.
• Deployed Kubernetes pods in EKS using kubectl.
• Wrote Python scripts using Boto3 to spin up instances automatically on AWS EC2 and OpsWorks stacks, integrated with Auto Scaling to set up servers automatically from specified AMIs.

Environment: Python, Databricks, PySpark, Kafka, Reltio, GitLab, PyCharm, AWS S3, Delta Lake, Snowflake, Cloudera CDH 5.9.16, Hive, Impala, Kubernetes, Flume, Apache NiFi, Java, Shell scripting, SQL, Sqoop, Oozie, Oracle, SQL Server, HBase, Power BI, Agile Methodology.

Data Engineer

Snapdeal

Full time, 04/2018 - 06/2020

• Processed data into HDFS, developing solutions that analyze the data using Hive and deliver summary results from Hadoop to downstream systems.
• Loaded data into Amazon Redshift and used AWS CloudWatch to monitor AWS RDS instances.
• Developed and executed a migration strategy to move the data warehouse from an Oracle platform to AWS Redshift.
• Migrated existing on-premises VMware VMs and applications to AWS using Server Migration Service.
• Configured RBAC models and security in AWS IAM to authenticate applications and users in AWS.
• Responsible for configuring, implementing, automating, and maintaining Linux/Unix-based infrastructure and technologies such as VMware vSphere and Amazon AWS EC2 for cloud provisioning.
• Used Spark as an ETL tool to perform transformations, event joins, and some pre-aggregations before storing the data in HDFS.
• Worked with Spark scripts in PySpark and Scala, and maintained the ELK stack (Elasticsearch, Logstash, and Kibana).
• Developed a data pipeline using Flume and Sqoop to extract data from weblogs and store it in HDFS.
• Used Sqoop to import and export data between RDBMS and HDFS.
• Created Hive tables and was involved in data loading and writing Hive UDFs.
• Exported the analysed data to MySQL using Sqoop for visualization and report generation.
• Created HBase tables to load large sets of structured and unstructured data.
• Managed and reviewed Hadoop log files.
• Provided inputs for estimate preparation for a new proposal.
• Developed UDF, UDAF, and UDTF functions and used them in Hive queries.
• Designed MapReduce jobs to convert periodic XML messages into partitioned Avro data.
• Worked extensively with Sqoop to import data from various sources (such as MySQL) into HDFS.
• Created Hive UDFs to add functionality in Hive for analytics.
• Used different file formats such as text files, SequenceFiles, and Avro.
• Worked with cluster coordination services through ZooKeeper.
• Assisted in creating and maintaining technical documentation for launching Hadoop clusters and executing Hive queries.
• Developed software systems using scientific analysis and complex, optimized algorithms such as sectorization to generate statistics for a website analyser.
• Extensive knowledge of data transformation, mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting of Hadoop clusters.
• Converted Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
• Developed DDL and DML scripts in SQL and HQL for creating tables and analysing data in the RDBMS and Hive.
• Assisted in cluster monitoring and maintenance, including adding and removing cluster nodes.
• Trained new developers on the team in how to commit their work and how to use the existing CI/CD pipelines.
• Worked with Jenkins to implement continuous integration (CI) and continuous deployment (CD) processes.
• Integrated Git into the continuous integration (CI) environment along with AnthillPro, Jenkins, and Subversion.

Data Analyst

Techspan Engineering Pvt Ltd

Full time, 09/2017 - 03/2018

• Collaborated across departments with Business Analysts and SMEs to gather business needs and identify feasible items for further development.
• Worked in an Agile model throughout the project and was involved in release management on a weekly and daily basis.
• Used SQL extensively to query data and generate ad-hoc reports.
• Performed data manipulation and visualization with NumPy, Pandas, SciPy, and Matplotlib in Python.
• Wrote complex SQL extensively, including procedures, functions, GROUP BY queries, and joins.
• Developed models using Bayesian HMM and machine learning classification methods such as XGBoost, SVM, and Random Forest.
• Worked with SSAS, SSIS, and SSRS as part of the ETL process.
• Created reports in both Tableau and Power BI after the ETL load, per business requirements.
• Performed data mining and analysis using Python and advanced Excel tools (pivot tables and VLOOKUP).

ETL Developer

PepperTap

Full time, 08/2015 - 09/2017

• Took part in requirement-gathering sessions with the client's technical and functional teams.
• Worked extensively with transformations such as SQL, Merge, Query, and Validation to transform and load data from the staging database to the target database based on complex business rules.
• Developed custom SQL queries to extract data from multiple sources.
• Configured multiple repositories to test and load data without corrupting it.
• Used the data services manager extensively for scheduling jobs.
• Created custom functions where required for specific use cases.
• Tuned jobs using parallelism techniques to reduce overall runtime.
• Wrote SQL queries to validate data after it was transferred to the target tables.
• Created multiple types of output files, such as flat files, CSV files, and Excel workbooks, based on requirements.

Education

The University of Texas at Dallas

Business Analytics

08/2021 - 04/2023, Master's Degree, Class of 2023

Skills

Python, AWS Lambda, Microsoft Azure, AWS, .NET, ML models, HTML, Third-party APIs, Spark, SQL Server

Experience: Senior-level, 8+ years

Hourly rate: $70/hr