Authors: David Lengweiler, Tobias Weber, Heiko Schuldt, Marco Vogt
Modern data is characterized by its high volume and inherent heterogeneity, and is primarily managed by systems tailored to three distinct modeling paradigms: the relational model, which enforces a strict schema and high structural integrity; the document model, which offers schema flexibility for semi-structured data; and the graph model, which prioritizes modeling complex relationships between entities. While the database industry is trending toward multi-model systems that incorporate features from all three paradigms, data management practices still lag behind. Data scientists rely on manual, multi-stage, labor-intensive workflows to integrate disparate data sources. This process forces users to switch tools, incurs high data shipping costs, and forfeits database-level optimizations and structural guarantees, leading to complex, brittle, and non-reusable “one-off” solutions.
We argue that embedding data pipelines directly into a multi-model database, using declarative, database-native operators, streamlines and simplifies these workflows and improves their maintainability.
This paper presents PolyPipe, an extension to the Polypheny multi-model database system. PolyPipe integrates data pipeline functionality as a first-class citizen, allowing the construction of complex pipelines using a hybrid of database and classical operators within a single system.