Internship : Testing Data Migrations with Synthetic Data: AI-powered approach

ELCA

4 days ago

Full-time

On-site

Pully, Switzerland

Description

Data platform migrations are common in enterprise environments, moving from legacy systems to modern infrastructure while preserving business logic. The technical challenge isn't just syntax translation; it's validation. When developers migrate SQL scripts or data pipelines between platforms, they face different execution environments, modified data access permissions, and no safe way to test against production data.

This internship tackles synthetic data generation for migration script testing. You'll design and implement a system that generates realistic test datasets mirroring production structure and behavior without exposing sensitive information. There are different approaches, it could be a small dataset living in a git repository, or a fully-fledged synthetic data warehouse. Still, the data must be realistic enough to catch real bugs.

The challenge goes beyond simple data mocking. You'll need to decide whether to generate from real data (anonymization risks), from query analysis alone (requires good documentation), or hybrid approaches. Should categorical values match production exactly or can we substitute them and adapt the scripts? Can we extend unit-testing to end-to-end testing, and what would be the required dataset properties?

Part of the work involves establishing an evaluation methodology—potentially collecting a reference set of migration scripts and their expected behaviors to measure how well different synthetic data approaches catch real issues. There's potential to explore multi-agent architectures where specialized agents handle different aspects: schema analysis, constraint extraction, data generation, anonymization verification, and test validation. This is applied research with immediate production impact.

Objectives

Design a strategy for migration script testing that balances realism, anonymization, and practical constraints
Implement a proof-of-concept system that generates test datasets from schema documentation, existing queries, or (carefully) sampled production data
Define testing strategies: unit tests vs. end-to-end tests, minimum viable data sizes, etc.
Develop an evaluation methodology to measure the effectiveness of different synthetic data generation approaches
Explore multi-agent architectures for decomposing the generation pipeline into specialized components (schema analysis, constraint satisfaction, validation)

Our offer

A dynamic work and collaborative environment with a highly motivated multi-cultural and international sites team
The chance to make a difference in peoples’ life by building innovative solutions
Various internal coding events (Hackathon, Brownbags), see our technical blog
Monthly After-Works organized per locations

Skills required

Strong Python programming: data processing, testing patterns, CI/CD integration
Understanding of relational databases, SQL, and data modeling concepts
Experience with LLMs and agentic systems: prompting, tool use, multi-agent orchestration
Familiarity with data security and data anonymization concepts
Problem-solving mindset: comfort with ambiguous requirements and making justified technical trade-offs
Clear technical writing and documentation skills

Apply now

Internship : Testing Data Migrations with Synthetic Data: AI-powered approach

More jobs

10592 - Analista de NegÃ³cios (Business Analyst) â SAP Data Migration - SÃªnior

RunTalent

Data Integration & API Engineer m/w/d - Bi & Data Automation

loyos bi

Internship : Testing Data Migrations with Synthetic Data: AI-powered approach

More jobs

10592 - Analista de NegÃ³cios (Business Analyst) â SAP Data Migration - SÃªnior

RunTalent

Data Integration & API Engineer m/w/d - Bi & Data Automation

loyos bi

10592 - Analista de NegÃ³cios (Business Analyst) â SAP Data Migration - SÃªnior