Complex Software Engineering
Assemble a data engineering team to develop and operate a Spark cluster, or similar, to implement all the needed data transformations. The result is slow software engineering development lifecycles instead of rapid iterations performed directly by data scientists.
Manually Calculated Features
Using bare-bones SQL, you’re forced to manually write pages and pages of column transformations, or at best fall back on scripting string manipulations that are hard to maintain, error-prone, and inherently limited in sophistication. Faced with this difficulty, teams stop generating more features in SQL and rely on downstream users to “deal with it”.
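A minimal sketch of the string-manipulation pattern described above, assuming a hypothetical `orders` table and column names chosen for illustration. Each feature needs its own hand-assembled expression, and a typo in the template silently produces broken SQL for every column it touches:

```python
# Hypothetical example: generating per-column SQL transformations by
# string manipulation. Table and column names are made up; real
# pipelines repeat this pattern across hundreds of columns.
numeric_columns = ["order_total", "item_count", "discount_pct"]

select_clauses = []
for col in numeric_columns:
    # One hand-written z-score expression per column; nothing checks
    # that the generated SQL is valid until it runs downstream.
    select_clauses.append(
        f"(({col} - AVG({col}) OVER ()) / STDDEV({col}) OVER ()) AS {col}_z"
    )

query = "SELECT\n  " + ",\n  ".join(select_clauses) + "\nFROM orders"
print(query)
```

Multiply this by every feature family and every table, and the maintenance burden the section describes becomes clear.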
Manual Sanity Checks
Correctness is checked manually, once in a while or only when a real problem arises. As more development takes place, it becomes very hard to ensure the correctness of the entire pipeline. Teams are often forced to either stop development and start fresh, or suffer significant blockages as unexpected breakages surface in datasets.
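A manual sanity check of this kind might look like the following sketch, with an invented row shape and invented expectations. Each check lives in an ad-hoc script that someone has to remember to run; nothing enforces it as the pipeline evolves:

```python
# Hypothetical one-off sanity check a data scientist might run by hand
# on a sample of rows. The field names and rules are illustrative.
def sanity_check(rows):
    """Ad-hoc correctness checks; raises AssertionError on bad data."""
    assert len(rows) > 0, "table is unexpectedly empty"
    for row in rows:
        assert row["user_id"] is not None, "null user_id found"
        assert 0 <= row["discount_pct"] <= 1, "discount out of range"
    return True

sample = [
    {"user_id": 1, "discount_pct": 0.1},
    {"user_id": 2, "discount_pct": 0.0},
]
print(sanity_check(sample))  # passes today; nothing re-runs it tomorrow
```

The check passes until it doesn’t, and when a breakage finally surfaces it is usually far downstream of the change that caused it.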
Manual Tuning Efforts
Slowdowns and cost increases in data pipelines are investigated ad hoc, relying on extensive debugging sessions and manual observation of runs to find where the issues arise. This is time-consuming and ultimately misses several “low-hanging fruit” opportunities even in a medium-sized pipeline, let alone one with 100s of tables in play.
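The kind of manual observation described above often amounts to hand-instrumenting individual stages with wall-clock timers, as in this sketch. The stage names and toy workloads are invented; the point is that each timing probe is added and removed by hand, one suspect at a time:

```python
import time

# Hypothetical manual instrumentation: wrap one pipeline stage at a
# time in a timer and eyeball the printed numbers across runs.
def timed(stage_name, fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{stage_name}: {elapsed:.3f}s")
    return result

def transform(values):
    # Stand-in for a real transformation step.
    return [v * 2 for v in values]

data = timed("load", list, range(1000))
data = timed("transform", transform, data)
```

This finds the one stage you happened to wrap, while systemic issues across hundreds of tables stay invisible.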