
Data Engineer

Skills
  • SQL ꞏ 5y
  • Python ꞏ 5y
  • Scala
  • Java
  • Kafka ꞏ 3y
  • RDBMS ꞏ 5y
  • Design Patterns
  • Kubernetes
  • Apache Spark ꞏ 5y
  • Trino ꞏ 5y
  • Presto ꞏ 5y
  • Non-relational databases ꞏ 5y
  • Lakehouse ꞏ 5y
  • ETL ꞏ 5y
  • Data warehouse ꞏ 5y
  • Cloud Functions ꞏ 3y
  • Vertex ꞏ 3y
  • Task queues ꞏ 3y
  • Pub/Sub ꞏ 3y
  • Streaming technologies ꞏ 3y
  • Stream processing technologies ꞏ 3y
  • Asynchronous programming ꞏ 3y
  • Spark Streaming ꞏ 3y
  • Sagemaker ꞏ 3y
  • Kubeflow ꞏ 3y
  • Iceberg
  • Software engineering concepts
  • Complex data sets
  • Data modeling
  • Paimon
  • Distributed systems
  • Google Cloud Dataproc
  • Amazon EMR
  • Unstructured data

About Us:

At AUI, we are excited to introduce Apollo, our breakthrough language model. Built on a neuro-symbolic architecture, Apollo makes reliable conversational agents possible: it enables the native tool use and controllability that transformer-based agents lack, and it unlocks fine-tuning for agents, allowing continuous evolution from human feedback and ever-improving performance for conversational agents of any kind. We are seeking an experienced Data Engineer to help build the data infrastructure behind these products.


Who are you?

You are a seasoned Data Engineer with a deep understanding of data modeling, massively parallel processing (both real-time and batch), and bringing machine learning capabilities into large-scale production systems. You have experience at a cutting-edge startup and are passionate about building the data infrastructure that fuels the world’s first intelligent agent. You are a team player with excellent collaboration and communication skills and a “can-do” approach.


What will you be doing?

  • Build, maintain, and scale data pipelines for both batch and real-time data processing across multiple sources and ecosystems.
  • Design and implement robust APIs and integrate diverse data systems to support data collection and aggregation.
  • Develop and manage advanced data architectures, including lakehouses, streamhouses, and data warehouses.
  • Collaborate with data scientists and other stakeholders to implement effective data solutions and integrate large language models (LLMs) into our systems.
  • Work with cross-functional teams to define business needs and translate them into technical implementations that leverage your deep understanding of data architectures and software engineering best practices.
  • Develop and lead initiatives to manage, monitor, and debug data systems, enhancing their reliability, efficiency, and overall quality.


What should you have?

  • 5+ years of experience in designing and managing sophisticated lakehouse and data warehouse architectures, ensuring scalable, efficient, and reliable data storage solutions.
  • 5+ years of experience building and maintaining ETLs using Apache Spark.
  • 3+ years of experience working with streaming technologies (e.g., Apache Kafka, Pub/Sub) and implementing real-time data pipelines using stream processing technologies (e.g., Spark Streaming, Cloud Functions).
  • 5+ years of experience with SQL and distributed query engines such as Presto and Trino, with a strong focus on analyzing and optimizing query plans to develop efficient and complex queries.
  • 3+ years of experience developing APIs using Python, with proficiency in asynchronous programming and task queues.
  • Proven expertise in deploying and managing Spark applications on enterprise-grade platforms such as Amazon EMR, Kubernetes (K8S), and Google Cloud Dataproc.
  • Solid understanding of distributed systems and experience with open file formats such as Paimon and Iceberg.
  • 3+ years of experience developing infrastructures that bring machine learning capabilities to production, using solutions such as Kubeflow, Sagemaker, and Vertex.
  • 5+ years of experience writing production-grade Python code and working with both relational and non-relational databases.
  • Solid understanding of software engineering concepts, design patterns, and best practices, with the ability to architect solutions and integrate different system components.
  • Proven experience working with unstructured data, complex data sets, and data modeling.
  • Advantage – Demonstrated experience orchestrating containerized applications in AWS and GCP using EKS and GKE.
  • Advantage – Proficiency in Scala and Java.