Job Title: Senior Data Engineer - Cloud Migration & Platform Architecture (GCP/AWS/Azure)
Location: Remote (relocation to Portugal required)
Experience Level: 5+ Years (Mid-Senior to Senior)
Position Overview
We are undergoing a fundamental shift in our data infrastructure, moving away from legacy on-premises Cloudera (CDH/HDP) environments toward a modern, hybrid-cloud data mesh architecture spanning Google Cloud Platform (GCP), Amazon Web Services (AWS), and Microsoft Azure.
We are looking for a Senior Data Engineer who does not just use these platforms but has built them from the ground up. The ideal candidate has the scars and medals from leading large-scale migration projects—specifically, the re-platforming of Hive/Impala workloads and HDFS datasets to cloud-native storage and compute (Snowflake/Databricks). You will be responsible for writing high-performance Python code, optimizing Spark jobs that process petabytes of data, and ensuring our real-time streaming infrastructure (Kafka/PubSub/Event Hubs) is rock-solid. A key part of this role will be architecting the integration layer with our Azure data services and establishing a cohesive multi-cloud strategy.
Detailed Tech Stack & Environment
Technologies and tools used daily, by category:
Languages: Python 3.9+ (Advanced: Decorators, Generators, Multiprocessing, Pydantic, Poetry), PySpark, SQL (ANSI, BigQuery & T-SQL dialects), Scala (maintenance only).
Compute & Processing: Apache Spark 3.x (DataFrames, Structured Streaming), Databricks (Delta Live Tables, Photon, Unity Catalog), GCP Dataproc (Serverless & Cluster Mode), AWS EMR (on EC2 & EKS), Azure Synapse Analytics (Spark Pools).
Streaming & Messaging: Apache Kafka (Schema Registry, Avro), GCP Pub/Sub, AWS Kinesis Data Streams, Azure Event Hubs, Debezium (CDC).
Storage & Warehouse: Snowflake (Snowpipe Streaming, Streams & Tasks, Time Travel), GCP BigQuery (BI Engine, Materialized Views), AWS S3, GCP Cloud Storage, Azure Data Lake Storage (ADLS) Gen2, Delta Lake / Apache Iceberg.
Orchestration & Ops: Apache Airflow 2.x (GCP Cloud Composer, AWS MWAA, Azure Data Factory & Self-Hosted IR), dbt Core/Cloud, Terraform (IaC), Bicep (Azure ARM DSL), Docker, GitHub Actions / Jenkins, Azure DevOps.
Azure Services (Detailed): Azure Data Factory (ADF), Azure Synapse Analytics (Dedicated SQL Pools, Pipelines), Azure Event Hubs, Azure Databricks, Microsoft Purview, Azure Key Vault, Azure Active Directory (AAD/Entra ID).
Legacy (Migration Source): Cloudera CDH/HDP, Apache Hive, Apache Impala, Oozie, HDFS.
Detailed Must-Have Responsibilities & Technical Expectations
1. Core Software Engineering in Python (Deep Dive)
Requirement: 5+ years of professional experience in software engineering.
Detailed Expectations:
Code Quality: You treat data pipelines as software products. You enforce unit testing (PyTest), integration testing, and CI/CD for all Spark jobs, including those deployed on Azure Databricks using Azure DevOps.
Optimization: You can diagnose JVM garbage-collection pressure caused by Spark UDFs and refactor them into native Spark SQL functions or vectorized Pandas UDFs, routinely delivering order-of-magnitude (roughly 10x) performance improvements (see the sketch after this list).
Modularity: You design reusable Python packages and libraries for data ingestion, validation (Great Expectations), and logging that are shared across GCP, AWS, and Azure environments.
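To make the Optimization expectation concrete, here is a minimal sketch of the kind of UDF refactor we mean; the SparkSession, table, and column names are illustrative assumptions rather than production code, and the vectorized path assumes pyarrow is installed.

```python
# Illustrative refactor: row-at-a-time Python UDF -> vectorized Pandas UDF -> native column expression.
# Column names are hypothetical; assumes a local or cluster SparkSession and pyarrow on the workers.
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-refactor-sketch").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "amount_cents")

# Before: each row is serialized to the Python worker individually.
@F.udf(returnType=DoubleType())
def to_euros_slow(amount_cents):
    return amount_cents / 100.0

# After: Arrow-backed batches, one pandas Series per chunk.
@F.pandas_udf(returnType=DoubleType())
def to_euros_fast(amount_cents: pd.Series) -> pd.Series:
    return amount_cents / 100.0

# Best: stay in native Spark SQL expressions whenever the logic allows it.
result = df.select(
    to_euros_fast("amount_cents").alias("euros_vectorized"),
    (F.col("amount_cents") / 100.0).alias("euros_native"),
)
result.explain()  # compare physical plans; the native column avoids the Python worker entirely
```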
2. Large-Scale Migration Expertise (The "Proven" Requirement - Expanded)
Requirement: Proven, hands-on experience migrating from Cloudera (CDH/HDP) to Snowflake or Databricks across multiple clouds.
Detailed Scope of Work You Will Own:
Legacy Decommissioning: You will analyze existing Hive Metastore schemas and Impala query patterns to design a migration strategy to BigQuery, Snowflake, or Azure Synapse Analytics Dedicated SQL Pools.
Data Transfer: You will architect and execute the transfer of hundreds of terabytes from HDFS to GCS, S3, or ADLS Gen2. This includes utilizing DistCp for the initial bulk copy, implementing incremental sync strategies, and converting Hive table formats to optimal cloud storage layouts. You will also use Azure Data Factory for orchestrating data movement and metadata-driven copy activities.
Workflow Refactoring: You will reverse-engineer complex Oozie workflows and rebuild them as robust, idempotent DAGs in Apache Airflow (see the DAG sketch after this list) and, where appropriate, re-platform them as Azure Data Factory pipelines.
Cloud-Native Feature Adoption: You will replace batch "INSERT OVERWRITE" jobs with Snowpipe Streaming, Databricks Auto Loader (sketched below), or Azure Synapse Pipelines to reduce latency from hours to seconds.
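As a flavour of the Workflow Refactoring item above, here is a minimal sketch of an Oozie coordinator rebuilt as an idempotent Airflow 2.x DAG; the DAG id, schedule, and load logic are hypothetical.

```python
# Hypothetical sketch of an Oozie coordinator rebuilt as an idempotent Airflow 2.x DAG.
# DAG id, schedule, and task names are illustrative; assumes Cloud Composer, MWAA, or self-hosted Airflow.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_partition(ds: str, **_) -> None:
    # Re-runnable by design: the load targets exactly one logical-date partition,
    # so a retry or backfill overwrites the same partition instead of appending.
    print(f"Overwriting warehouse partition for {ds}")


with DAG(
    dag_id="orders_daily_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; use schedule_interval on older 2.x
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(task_id="load_orders_partition", python_callable=load_partition)
```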
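And for the Cloud-Native Feature Adoption item, a hedged sketch of a Databricks Auto Loader stream replacing a batch INSERT OVERWRITE job; the storage paths and target table are illustrative and assume a Databricks runtime with its built-in spark session.

```python
# Hypothetical Auto Loader sketch replacing a batch INSERT OVERWRITE job.
# Paths and table name are illustrative; the cloudFiles source requires a Databricks runtime.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "abfss://lake@example.dfs.core.windows.net/_schemas/orders")
    .load("abfss://lake@example.dfs.core.windows.net/raw/orders/")
    .writeStream
    .option("checkpointLocation", "abfss://lake@example.dfs.core.windows.net/_checkpoints/orders")
    .trigger(availableNow=True)  # incremental discovery, but runs and terminates like a batch job
    .toTable("main.silver.orders")
)
```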
3. Greenfield Platform Architecture & Integration (Multi-Cloud)
Requirement: Deep technical expertise in building a high-performance data platform from scratch across GCP, AWS, and Azure.
Detailed Architecture Deliverables:
Reverse ETL: You will build secure pipelines from Snowflake/BigQuery back into operational systems (Salesforce, HubSpot, Postgres) using Apache Beam (Dataflow), AWS Lambda, or Azure Functions with custom retry logic and rate limiting.
CDC Implementation: You will design a Change Data Capture pipeline using Kafka Connect (Debezium) -> Pub/Sub / Event Hubs -> Dataflow / Synapse Spark -> BigQuery / ADLS Gen2, ensuring exactly-once semantics and handling schema evolution seamlessly.
API Ingestions: You will build a serverless ingestion framework on GCP Cloud Functions, AWS Lambda, or Azure Functions (Python) that pulls data from third-party REST APIs, handles pagination and authentication, and lands raw JSON in object storage partitioned by date (see the ingestion sketch after this list).
Azure Integration & Security: You will architect the integration layer leveraging Azure Event Hubs for high-throughput streaming data capture and Azure Active Directory (Entra ID) for unified identity and access management. You will implement Service Principal and Managed Identity authentication for all Azure services, eliminating the need for key-based access (see the identity sketch below).
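To illustrate the API Ingestions item, here is a minimal, cloud-agnostic sketch of the paginated REST pull described above; the endpoint shape, bearer-token auth, and storage layout are assumptions, and the write_blob callable stands in for whichever object-store client the target cloud provides.

```python
# Hypothetical serverless ingestion sketch (Cloud Functions / Lambda / Azure Functions style):
# pull a paginated REST API and land raw JSON into object storage partitioned by date.
# Endpoint, auth scheme, and path layout are illustrative assumptions.
import datetime
import json

import requests


def ingest(api_url: str, token: str, write_blob) -> None:
    """write_blob(path, data) is injected so the same logic runs against GCS, S3, or ADLS."""
    session = requests.Session()
    session.headers["Authorization"] = f"Bearer {token}"

    run_date = datetime.date.today().isoformat()
    page, url = 0, api_url
    while url:
        resp = session.get(url, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        write_blob(
            f"raw/orders/ingest_date={run_date}/page_{page:05d}.json",
            json.dumps(payload["results"]),
        )
        url = payload.get("next")  # cursor-style pagination; None ends the loop
        page += 1
```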
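And for the Azure Integration & Security item, a short sketch of key-less ADLS Gen2 access via Managed Identity / Entra ID using the azure-identity and azure-storage-file-datalake SDKs; the account, container, and path names are illustrative.

```python
# Minimal sketch of key-less access to ADLS Gen2 using Managed Identity / Entra ID.
# Account and container names are illustrative; assumes azure-identity and
# azure-storage-file-datalake are installed and an identity with Storage Blob Data Reader is assigned.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = DefaultAzureCredential()  # resolves Managed Identity, workload identity, or az login
service = DataLakeServiceClient(
    account_url="https://examplelake.dfs.core.windows.net",
    credential=credential,
)
fs = service.get_file_system_client("raw")
for path in fs.get_paths(path="orders/ingest_date=2024-01-01"):
    print(path.name)
```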
4. Data Modeling & Distributed Systems Expertise
Requirement: Expertise in structured/unstructured data and distributed systems.
Detailed Expectations:
Modeling: You can explain the trade-offs between Kimball Star Schema vs. Data Vault 2.0 vs. One Big Table (OBT) and implement the correct approach for specific analytical use cases in dbt, including building models for Azure Synapse Dedicated SQL Pools.
Spark Tuning: You are comfortable reading the Spark UI, diagnosing Data Skew (key salting; see the sketch after this list), optimizing Shuffle Partitions, and managing Broadcast Joins to prevent executor OOM errors on Dataproc, EMR, and Azure Synapse Spark Pools.
Streaming Architecture: You understand the implications of Event Sourcing and can tune retention policies, partition counts, throughput units, and subscription backlogs across Kafka, Pub/Sub, and Event Hubs to ensure data durability during consumer downtime.
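To make the Spark Tuning item concrete, here is a minimal key-salting sketch for a skewed join; the fact and dimension tables and the customer_id column are hypothetical, and in practice Spark 3.x adaptive skew-join handling is the first lever to try before manual salting.

```python
# Illustrative key-salting sketch for a skewed join; tables and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting-sketch").getOrCreate()

# Hypothetical inputs: a large, skewed fact table and a small dimension.
facts = spark.createDataFrame([(1, 9.99)] * 1000 + [(2, 5.00)] * 10, ["customer_id", "amount"])
dim = spark.createDataFrame([(1, "PT"), (2, "ES")], ["customer_id", "country"])

N_SALTS = 16

# Spread each hot key across N_SALTS buckets on the large side...
facts_salted = facts.withColumn("salt", (F.rand() * N_SALTS).cast("int"))
# ...and replicate every dimension row into all buckets so the join keys still line up.
dim_salted = dim.withColumn("salt", F.explode(F.array(*[F.lit(i) for i in range(N_SALTS)])))

joined = facts_salted.join(dim_salted, ["customer_id", "salt"]).drop("salt")
joined.groupBy("country").count().show()
```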
5. Multi-Cloud Platform Proficiency & Governance
Requirement: Proven experience across GCP and Azure data ecosystems.
Detailed Azure & Cross-Cloud Operations:
Azure Synapse Analytics: You will enforce cost and performance governance by designing optimal table distributions (Hash, Round-Robin) and indexing (Clustered Columnstore) in Dedicated SQL Pools. You will leverage workload management for resource allocation.
Orchestration in Azure: You will design and manage complex, event-driven pipelines using Azure Data Factory (ADF), integrating seamlessly with on-premises data sources via a Self-hosted Integration Runtime.
Data Governance & Security: You will implement and manage fine-grained data access control and metadata scanning using Microsoft Purview, ensuring sensitive data is classified and governed across the entire estate. You will centralize secrets management in Azure Key Vault to securely store connection strings and credentials (sketched below).
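A minimal sketch of the Key Vault pattern described above, using the azure-keyvault-secrets SDK; the vault URL and secret name are illustrative.

```python
# Minimal sketch of centralizing connection strings in Azure Key Vault.
# Vault URL and secret name are illustrative; assumes azure-identity and azure-keyvault-secrets
# are installed and the caller's Managed Identity has "get" permission on secrets.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(
    vault_url="https://example-data-kv.vault.azure.net",
    credential=DefaultAzureCredential(),
)
snowflake_dsn = client.get_secret("snowflake-connection-string").value  # never hard-code this
```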
Detailed Nice-to-Have Qualifications
Databricks & Unity Catalog: Experience implementing fine-grained access control and lineage using Unity Catalog in a multi-workspace, multi-cloud environment (AWS, Azure, GCP).
NoSQL & Graph:
Redis: Experience implementing Redis as a distributed cache for lookup tables in Spark Streaming jobs, including Azure Cache for Redis, to reduce latency on joins against cloud warehouses (see the sketch at the end of this section).
Neo4j: Knowledge of building identity resolution graphs or supply chain dependencies using Cypher queries.
Infrastructure as Code (IaC): Mastery of Terraform for GCP and AWS. Experience writing Azure Bicep or ARM templates to provision Azure Databricks workspaces, Synapse Analytics artifacts, and complex role-based access control (RBAC) in a repeatable manner.
Machine Learning Integration: Experience building Feature Stores on Databricks, using BigQuery ML, or leveraging Azure Machine Learning for batch inference and model registration directly within the data platform.
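To illustrate the Redis nice-to-have above, here is a hedged sketch of a lookup-cache enrichment inside a Structured Streaming foreachBatch sink; the Redis host, key layout, and column names are assumptions, and the password would come from Key Vault rather than being inlined.

```python
# Hypothetical sketch: Redis as a low-latency lookup cache inside a foreachBatch sink.
# Host, key layout, and columns are illustrative; the same code targets Azure Cache for Redis
# (TLS on port 6380) or self-managed Redis. Assumes the redis-py client is installed.
import json
import redis


def enrich_batch(batch_df, batch_id: int) -> None:
    r = redis.Redis(host="example.redis.cache.windows.net", port=6380, ssl=True,
                    password="<from-key-vault>")
    rows = []
    for row in batch_df.toLocalIterator():  # fine for modest micro-batches; avoid collect() on large ones
        profile = r.get(f"customer:{row.customer_id}")
        rows.append({**row.asDict(), "profile": json.loads(profile) if profile else None})
    # ...write `rows` to the warehouse / Delta table here...


# Usage (assuming `stream` is a streaming DataFrame with a customer_id column):
# stream.writeStream.foreachBatch(enrich_batch).start()
```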