Neotech

Role: Data Engineer
Location: San Diego, CA 92129 (onsite)

Job Description:
1. Design, build, and performance-tune Apache Spark workloads using Spark SQL and PySpark for complex transformations (JSON/semi-structured data, nested structures, window functions, joins, aggregations).
2. Profile and optimize Spark jobs: partitioning, shuffles, join strategies, skew, memory/spill, and right-sized resource usage (especially on EMR Serverless) for large, petabyte-scale datasets.
3. Support customers and monitor pipelines around the clock, meeting strict SLAs for fixes and for reinstating failed jobs.
4. Implement reusable patterns for incremental loads, deduplication and CDC-style processing.
5. Build and maintain ETL/ELT on AWS EMR Serverless (Spark), with S3 as the data lake layer: partitioning, compression, external tables, and layouts that support fast Spark and downstream SQL.
6. Design and tune Redshift workloads: sort keys, distribution styles, and SQL patterns that fit S3-to-Spark-to-Redshift flows.
7. Optimize cost and performance across Spark jobs, S3 storage, and Redshift (including data retention and lifecycle policies where relevant).
8. Produce end-to-end designs: pipeline topology, data models, staging vs curated layers, incremental strategies, and clear tradeoffs (freshness, cost, complexity, reliability).
9. Apply access controls for sensitive financial and user data (least privilege, row/column-level patterns where required).
10. Support data governance: metadata, documentation, and alignment with compliance expectations.
11. Implement data quality (validation rules, regex, null-safety) and monitoring/alerting with error handling for production pipelines.
12. Manage schema evolution and migrations with backward compatibility and risk reduction.
13. Partner with IRL and ML/Data Science teams on feature-rich datasets; work with risk/compliance and platform teams.

What we’re looking for
1. Strong Spark + Spark SQL + hands-on performance tuning (not only SQL writing).
2. Python for Spark/data engineering.
3. AWS: EMR Serverless, S3 (Delta and data lake patterns), Redshift (SQL + tuning).
4. Ability to design pipelines and data models and communicate tradeoffs.
5. Familiarity with access control concepts for data platforms (AWS IAM, lake/warehouse permissions, RLS / column-level security as applicable).
6. Ownership of production systems, including support, 24/7 monitoring, and collaboration.

Good to have
1. Fraud, risk, or compliance domains.
2. Athena, GCP, and other S3-query engines alongside Spark.
3. Support and monitoring for highly interactive or SLA-tight workloads on large data pipelines.
4. Deeper Redshift ops (WLM, queues, workload patterns) alongside Spark.

To apply for this job, email your details to Suresh.j@neotechusa.com
