Hello! This is Shwetha Sivakumar. I am a Data Engineer with experience building end-to-end data pipelines across transit operations, IoT telemetry, and ML analytics. I have consistently improved system performance and dashboard responsiveness through data modeling, query optimization, and data quality automation. I am strong in Python, SQL, PySpark, Denodo, Informatica IICS, Snowflake, Azure Data Explorer, and Power BI, and hold AWS Solutions Architect and Denodo Developer certifications. I have partnered effectively with operations and BI teams to translate technical insights into business decisions.
GitHub
• Built and maintained enterprise Denodo data virtualization layer integrating 400K+ daily records from Amazon Redshift, SQL Server, Databricks, and MongoDB for a passenger rail network, enabling unified analytics across ridership, revenue, and operational performance.
• Designed and optimized Denodo base, derived, and summary views, applying cached views with scheduled refreshes (daily, monthly, yearly), cost-based optimization, and query tuning to improve Power BI dashboard load times by 30%.
• Developed Power BI dashboards consuming Denodo views for revenue, ridership, route performance, and ticket journey analytics; automated metadata propagation by modifying TMDL (.SemanticModel) files.
• Implemented RBAC and global security policies across Denodo and Power BI, managing multiple user roles and enabling GitHub-based version control for Fabric workspaces to support secure and reproducible BI development.
• Built scalable PySpark pipelines on AWS S3 to process 75K monthly accessibility sessions for a B2B mobile/kiosk application, powering reinforcement learning models generating personalized UI adaptations for users with visual, motor, and cognitive impairments.
• Designed incremental data processing and partitioned data models (by event date and impairment type) enabling efficient trend analysis across accessibility metrics using Spark window functions.
• Optimized downstream analytics and model-training workloads via partition pruning, reducing feature read times by 40%.
• Orchestrated end-to-end Spark workflows using Apache Airflow, ensuring reliable daily execution, reproducibility, and backfill support.
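The partition layout and pruning described above can be sketched in plain Python with pandas standing in for Spark on S3; the column names (`event_date`, `impairment_type`), values, and file layout are illustrative assumptions, not the production schema:

```python
import os
import tempfile
import pandas as pd

# Hypothetical accessibility-session records; fields are illustrative assumptions.
sessions = pd.DataFrame({
    "session_id": [1, 2, 3, 4],
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "impairment_type": ["visual", "motor", "visual", "cognitive"],
    "duration_s": [120, 45, 300, 90],
})

root = tempfile.mkdtemp()

# Write one file per (event_date, impairment_type) partition, Hive-style,
# mirroring how Spark lays out a partitioned dataset.
for (date, imp), part in sessions.groupby(["event_date", "impairment_type"]):
    path = os.path.join(root, f"event_date={date}", f"impairment_type={imp}")
    os.makedirs(path, exist_ok=True)
    part.to_csv(os.path.join(path, "part-0000.csv"), index=False)

def read_partition(root, date, imp):
    """Partition pruning: only the matching directory is ever opened,
    so reads skip all other dates and impairment types."""
    path = os.path.join(
        root, f"event_date={date}", f"impairment_type={imp}", "part-0000.csv"
    )
    return pd.read_csv(path)

visual_jan1 = read_partition(root, "2024-01-01", "visual")
```

In Spark itself the same effect comes from `df.write.partitionBy("event_date", "impairment_type")` plus filters on those columns at read time, which is what makes the reduced feature-read times possible.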
• Built end-to-end Informatica IICS ETL pipelines ingesting 500K+ GPS, GTFS, and fare records daily into Snowflake.
• Engineered transformations in Informatica Mapping Designer, calculating route-level delay metrics and operational KPIs for 200+ transit routes; embedded data quality checks reducing reporting errors by 25%.
• Automated near real-time and batch orchestration with Taskflows, ensuring data availability within 5 minutes of ingestion.
• Enabled Power BI dashboards providing actionable insights on route delays, ridership trends, and revenue metrics.
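The embedded data quality checks mentioned above can be sketched as simple row-level validation; the field names (`route_id`, `fare_usd`, `delay_min`), thresholds, and sample rows are illustrative assumptions rather than the actual IICS mapping logic:

```python
import pandas as pd

# Hypothetical batch of transit records; fields are illustrative assumptions.
records = pd.DataFrame({
    "route_id": ["R10", "R12", None, "R10"],
    "fare_usd": [2.75, -1.00, 2.75, 3.25],
    "delay_min": [4, 2, 7, 130],
})

def quality_check(df):
    """Split a batch into clean and rejected rows:
    missing route keys, negative fares, and implausible delays are rejected."""
    bad = (
        df["route_id"].isna()
        | (df["fare_usd"] < 0)
        | (df["delay_min"] > 120)  # assumed plausibility cutoff
    )
    return df[~bad].copy(), df[bad].copy()

clean, rejected = quality_check(records)
```

Routing rejected rows to a quarantine table instead of the warehouse is the kind of check that keeps bad records out of downstream reporting.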
• Designed 15+ real-time operational dashboards in Azure Data Explorer (ADX) to visualize telemetry from 500+ autonomous haul trucks, providing fleet health, utilization, and data quality insights.
• Built IoT telemetry ingestion pipelines into ADX by defining table schemas and JSON ingestion mappings to structure high-frequency GPS and operational data at scale.
• Developed data transformations using ADX update policies to convert epoch timestamps, normalize vehicle identifiers, and derive operational states; created materialized views that pre-aggregated metrics and improved dashboard performance by 40%.
• Implemented ingestion monitoring using KQL queries and Azure Monitor alert rules to detect missing or delayed telemetry, ensuring pipeline reliability.
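The update-policy transformations above (epoch conversion, identifier normalization, state derivation) can be mirrored in Python; the field names, `HT-` identifier style, and speed thresholds are illustrative assumptions, not the production KQL:

```python
from datetime import datetime, timezone

def transform(event):
    """Mirror of the update-policy logic: epoch ms -> UTC timestamp,
    normalized vehicle id, derived operational state.
    Field names and thresholds are illustrative assumptions."""
    ts = datetime.fromtimestamp(event["epoch_ms"] / 1000, tz=timezone.utc)
    vehicle_id = event["vehicle_id"].strip().upper()
    speed = event["speed_kph"]
    if speed == 0:
        state = "idle"
    elif speed < 10:
        state = "maneuvering"
    else:
        state = "hauling"
    return {"ts": ts.isoformat(), "vehicle_id": vehicle_id, "state": state}

row = transform({"epoch_ms": 1700000000000, "vehicle_id": " ht-042 ", "speed_kph": 32.5})
```

In ADX the equivalent lives in a KQL function attached as an update policy, with `unixtime_milliseconds_todatetime()` doing the epoch conversion.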
• Led a team of 3 to develop a cyber-attack detection system using ensemble and deep learning models, achieving 97% accuracy in classifying normal traffic, automated attacks, and manual intrusions.
• Built robust data preprocessing pipelines using Python and regex to parse Apache web logs into structured pandas DataFrames for threat analysis.
• Conducted global cyber-attack pattern analysis across 87 countries using protocol analysis, attack categorization, and subnet tracking.
• Implemented XGBoost and LSTM/GRU neural networks for feature engineering, model training, and evaluation; visualized insights using seaborn and Plotly.
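The regex-based log parsing above can be sketched against the Apache Common Log Format; the sample log lines are fabricated for illustration:

```python
import re
import pandas as pd

# Apache Common Log Format: ip, identity, user, [timestamp], "request", status, size.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Fabricated sample lines for illustration only.
lines = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '198.51.100.2 - - [10/Oct/2023:13:55:40 -0700] "POST /login HTTP/1.1" 403 199',
]

# Keep only lines the pattern recognizes, then load into a DataFrame.
rows = [m.groupdict() for m in (LOG_RE.match(line) for line in lines) if m]
df = pd.DataFrame(rows)
df["status"] = df["status"].astype(int)
```

From here, grouping by `ip` or `status` gives the per-source and per-outcome views that feed attack categorization.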
Coursework: Deep Learning in Computer Vision, Applied Data Science, Cloud Computing, Statistics
Coursework: Data Structures and Algorithms, Operating Systems, Computer Networks, Computer Architecture
• Programming & Processing: Python, SQL, PySpark, Java, Pandas, NumPy, scikit-learn, Bash
• Data Engineering: Apache Airflow, dbt, Denodo, Informatica IICS, Data Modeling
• Databases & Warehouses: Snowflake, Redshift, PostgreSQL, SQL Server, MongoDB, Firebase
• Cloud & Infrastructure: AWS (Certified), Microsoft Azure, GCP, Docker, Git
• Visualization & BI: Power BI (DAX, TMDL), KQL, Matplotlib, Seaborn, Plotly
• Tools & Technologies: Jira, Postman, Agile/Scrum, REST APIs, Unit Testing
Amazon Web Services Solutions Architect Associate (Apr 2025)
Google Cloud Skills Boost Badges (Apr 2022)
Cambridge Certification Authority – Java Level 2 (Dec 2021)
Denodo Platform 9.0 Certified Developer Associate (Oct 2025)