By Dan Jaye, CTO

There’s a common assumption in technical circles: if data starts out structured, then SQL and well-designed schemas should be enough to unlock its value. Embeddings, NLP, and other unstructured techniques are seen as tools for text, images, or documents, but not databases.

That assumption holds only when the data is clean, complete, and of limited complexity. It breaks down the moment data encounters the realities of production systems.

As systems grow and interoperate with more sources and channels, and as tools age or evolve, structured data inevitably becomes complex and messy. Sometimes the mess is subtle: one source reports age as 31–40, another uses 35–44; one defines income as $25–50k, another as $20–40k. All technically “structured,” but no longer structurally compatible. And when this happens, methods designed for unstructured inputs often provide more value than traditional database approaches.
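To make that mismatch concrete: exact matching treats "31–40" and "35–44" as entirely different values, while even a simple interval-overlap measure (a toy stand-in for richer similarity techniques, not anything from a specific product) recovers how much the bins actually agree. A minimal sketch:

```python
def range_overlap(a, b):
    """Jaccard overlap of two numeric ranges given as (low, high) tuples."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    inter = max(0, hi - lo)
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

# "31-40" vs "35-44": string equality says "no match";
# overlap says the bins partially agree (5 / 13, about 0.38).
print(range_overlap((31, 40), (35, 44)))

# "$25-50k" vs "$20-40k": again incompatible as strings, 50% overlap as ranges.
print(range_overlap((25, 50), (20, 40)))
```

A graded similarity like this is the kind of signal schema-level joins throw away.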

Nothing crashes when this happens. Tables still validate. Dashboards still render. Pipelines still run. On the surface, everything looks fine, yet underneath, the structure begins to slow insight rather than support it.

Still Structured, Just Not Very Helpful

Data can remain neatly arranged in rows and columns and still need to be treated as unstructured input. It may carry remnants of past decisions, mix formats, or arrive from sources that never fully align. Manual fixes become routine. Every fix quietly signals that the schema no longer matches the work the business is trying to do with it.

Seeing that isn’t a setback. It’s often the first sign that something better is possible.

Why Unstructured Techniques Matter Here

Structured tools still have value, but they often no longer suffice on their own. Unstructured approaches help surface the parts of the data that the schema no longer expresses:

  • Embeddings retain connections that normalization strips away
  • NLP makes inconsistent text fields usable again
  • ML models find patterns before any schema is agreed on
  • Flexible storage allows data to evolve without redesigning it first
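The first point is easy to see in miniature. Once a categorical field is one-hot encoded, every pair of values sits exactly the same distance apart, so the ordering of, say, income bands is gone. An embedding, even the trivial hand-made one below (illustrative values, not learned ones), keeps neighboring bands closer than distant ones:

```python
import numpy as np

bands = ["<25k", "25-50k", "50-75k", "75k+"]

# One-hot encoding: every pair of bands is exactly the same distance apart,
# so the ordinal relationship between income bands is lost.
one_hot = np.eye(len(bands))
d_adjacent = np.linalg.norm(one_hot[0] - one_hot[1])   # <25k vs 25-50k
d_far = np.linalg.norm(one_hot[0] - one_hot[3])        # <25k vs 75k+

# A hand-made ordinal embedding (one dimension, for illustration) keeps
# neighboring bands closer than distant ones.
emb = np.array([[0.0], [1.0], [2.0], [3.0]])
e_adjacent = np.linalg.norm(emb[0] - emb[1])
e_far = np.linalg.norm(emb[0] - emb[3])
```

In the one-hot space `d_adjacent` equals `d_far`; in the embedding space adjacency survives.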

This isn’t a rejection of structured systems. It’s a way to restore meaning when structure has stopped doing its job.

Embeddings for Structured Dimensions

We’ve found that a modest set of embeddings can address a surprising number of structured-data problems.

With around 256 embedding dimensions, we can retain many of the patterns and relationships of an original file with 10,000 sparsely and inconsistently populated dimensions. The issue isn’t just sparse data. The “Curse of Dimensionality” dilutes the potency of what should be rich, powerful signals.

Abstracting that complexity, inconsistency, and absence into a moderate set of embeddings often results in broader utility and far greater reliability.
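One minimal way to sketch this compression, using truncated SVD as a stand-in for whichever embedding method is actually in play, is to run scikit-learn’s TruncatedSVD directly on a sparse 10,000-column matrix (the data here is random and purely illustrative):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# A sparse, inconsistently populated 10,000-column matrix standing in
# for the original wide file: 1,000 rows, ~1% of cells populated.
X = sparse_random(1_000, 10_000, density=0.01, random_state=0)

# Compress to 256 dense dimensions; TruncatedSVD accepts sparse input
# without densifying the full matrix first.
svd = TruncatedSVD(n_components=256, random_state=0)
Z = svd.fit_transform(X)

print(Z.shape)  # 1,000 rows, now 256 dense dimensions each
```

The resulting 256-dimensional rows are dense and comparable, which is exactly the property the sparse originals lack.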

The False Dichotomy

“SQL and feature selection for structured data.
Embeddings and ML for unstructured data.”

This is a common assumption, but it can unnecessarily handcuff your data.

Often, explainability is the primary objection. Unstructured techniques seem to obscure the raw features people are used to seeing. 

But Explainability Isn’t Binary

Using embeddings for structured data doesn’t mean abandoning explainability—it simply changes how we achieve it. While embeddings transform explicit fields and values into abstract vector representations, several complementary techniques maintain interpretability:

Attribution Methods:
Tools like SHAP and LIME trace predictions back to the original structured features that influenced them. Even after embedding, these methods reveal which customer attributes, transaction patterns, or behavioral signals drove a recommendation or decision.
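SHAP and LIME are the full-featured tools here; as a minimal stand-in that illustrates the same idea without those dependencies, scikit-learn’s permutation importance ranks the original input features by how much scrambling each one degrades a trained model’s score (synthetic data, hypothetical feature count):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic structured data: 6 features, only 3 of them informative.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle one column at a time and measure how much the score drops:
# the features whose shuffling hurts most are the ones driving predictions.
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
```

The attribution still lands on the original structured columns, even though the model’s internal representation is opaque.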

Embedding Space Analysis:
Visualization techniques such as t-SNE or UMAP, along with simple distance metrics, show how the model organizes information. For structured data, this validates that customers with similar purchase histories cluster together, or that accounts with comparable risk profiles map to nearby points in the embedding space.
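A lightweight version of that validation, skipping the visualization step, is to check that nearest neighbors in the embedding space share a label. The sketch below uses two synthetic “segments” in place of real customer embeddings (all values invented):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Two synthetic segments standing in for customers with similar purchase
# histories, embedded in a 256-dimensional space.
segment_a = rng.normal(0.0, 0.1, size=(50, 256))
segment_b = rng.normal(1.0, 0.1, size=(50, 256))
emb = np.vstack([segment_a, segment_b])
labels = np.array([0] * 50 + [1] * 50)

# For each point: do its 5 nearest neighbors (excluding itself)
# belong to the same segment?
nn = NearestNeighbors(n_neighbors=6).fit(emb)
_, idx = nn.kneighbors(emb)
purity = float(np.mean(labels[idx[:, 1:]] == labels[:, None]))
```

A `purity` near 1.0 says the geometry agrees with the business grouping; a low value flags an embedding that needs rework.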

Hybrid Approaches:
Many production systems maintain both the original structured representation and the embedding. Queries execute efficiently in the vector space, while explanations reference the underlying structured attributes. This “best of both worlds” approach preserves performance and interpretability.
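A toy sketch of the hybrid pattern, with field names and embedding values invented for illustration: the similarity query runs in vector space, but the answer that comes back cites the structured record.

```python
import numpy as np

# Hypothetical store: structured attributes kept alongside learned vectors.
records = [
    {"id": 1, "region": "NE", "avg_order": 120.0},
    {"id": 2, "region": "NE", "avg_order": 115.0},
    {"id": 3, "region": "SW", "avg_order": 30.0},
]
embs = np.array([[0.90, 0.10],
                 [0.88, 0.12],
                 [0.10, 0.95]])  # stand-in learned vectors, one per record

def most_similar(query_idx):
    """Find the closest record by cosine similarity; return its structured row."""
    q = embs[query_idx]
    sims = embs @ q / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q))
    sims[query_idx] = -np.inf          # exclude the query itself
    best = int(np.argmax(sims))
    # The query ran in the embedding; the explanation cites structured fields.
    return records[best]

match = most_similar(0)
print(f"Closest to record 1: id={match['id']}, region={match['region']}")
```

The vector side answers “who is similar?” fast; the structured side answers “why?” in terms the business already understands.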

The key insight: embeddings don’t eliminate explainability; they shift it from field-by-field matching to geometric reasoning. Instead of explaining “these records matched on demographics AND purchase_category,” you explain “these customers are similar across a learned representation of their complete behavioral profile.”

For Those Who Work Closely with This

If you build or maintain systems at scale, you already know this isn’t a debate about AI versus SQL. It’s about timing.

Structure reflects early assumptions. Those assumptions eventually become boundaries.

Avoiding that outcome doesn’t require starting over. It requires sequencing the work so the data tells you when to change it.

And when it does, you won’t need to force it … you’ll already see it coming.

About the Author

Daniel Jaye

Chief Technology Officer

Dan has provided strategic, tactical and technology advisory services to a wide range of marketing technology and big data companies. Clients have included Altiscale, ShareThis, Ghostery, OwnerIQ, Netezza, Akamai, and Tremor Media. Dan was the founder and CEO of Korrelate, a leading automotive marketing attribution company, purchased by J.D. Power in 2014. Dan is the former president of TACODA, bought by AOL in 2007, and was the founder and CTO of Permissus, an enterprise privacy compliance technology provider. He was the Founder and CTO of Engage and served as the acting CTO of CMGI. Prior to Engage, he was the director of High Performance Computing at Fidelity Investments and worked at Epsilon and Accenture (formerly Andersen Consulting).

Dan graduated magna cum laude with a BA in Astronomy and Astrophysics and Physics from Harvard University.
