Job Summary
We are seeking a skilled Data Engineer with strong expertise in Java and big data technologies to design, develop, and maintain scalable batch data pipelines. The ideal candidate will have hands-on experience with modern data lakehouse architectures, cloud-native data platforms, and automation tools to support high-performance analytics and data processing workloads.
Experience - 5-8 Years
Must Haves
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
- Strong proficiency in Java programming with a solid understanding of object-oriented design principles.
- Proven experience designing and building ETL/ELT pipelines and frameworks.
- Excellent command of SQL and familiarity with relational database management systems.
- Hands-on experience with big data technologies such as Apache Spark, Hadoop, and Kafka, or equivalent batch and streaming processing frameworks.
- Knowledge of cloud data platforms, preferably AWS services (Glue, EMR, Lambda) and Snowflake.
- Experience with data modeling, schema design, and data warehousing concepts.
- Understanding of distributed computing, parallel processing, and performance tuning in big data environments.
- Strong analytical, problem-solving, and debugging skills.
- Excellent communication and teamwork skills with experience working in Agile environments.
Nice to Have
- Experience with containerization and orchestration technologies such as Docker and Kubernetes.
- Familiarity with workflow orchestration tools like Apache Airflow.
- Basic scripting skills in languages like Python or Bash for automation tasks.
- Exposure to DevOps best practices and to building robust CI/CD pipelines.
- Prior experience managing data security, governance, and compliance in cloud environments.
Responsibilities
- Design, develop, and optimize scalable batch data pipelines using Java and Apache Spark to handle large volumes of structured and semi-structured data.
- Utilize Apache Iceberg to manage data lakehouse environments, supporting advanced features such as schema evolution and time travel for data versioning and auditing.
- Build and maintain reliable data ingestion and transformation workflows using AWS Glue, EMR, and Lambda services to ensure seamless data flow and integration.
- Integrate with Snowflake as the cloud data warehouse to enable efficient data storage, querying, and analytics workloads.
- Collaborate closely with DevOps and infrastructure teams to automate deployment, testing, and monitoring of data workflows using CI/CD tools like Jenkins.
- Develop and manage CI/CD pipelines for Spark/Java applications, ensuring automated testing and smooth releases in a cloud environment.
- Monitor and continuously optimize the performance, reliability, and cost-efficiency of data pipelines running on cloud-native platforms.
- Implement and enforce data security, compliance, and governance policies in line with organizational standards.
- Troubleshoot and resolve complex issues related to distributed data processing and integration.
- Work collaboratively within Agile teams to deliver high-quality data engineering solutions aligned with business requirements.