By: Sarah Marquart

Four researchers from Cornell Tech received an Outstanding Paper Award at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), held in December. The winning paper, “Text Embeddings Reveal (Almost) As Much As Text,” was co-authored by Associate Professor of Computer Science Alexander “Sasha” Rush, Professor of Computer Science Vitaly Shmatikov, Assistant Professor of Computer Science Volodymyr Kuleshov, and PhD student Jack Morris.

The paper explores privacy concerns surrounding text embeddings, a technique in natural language processing (NLP) that addresses the challenges posed by the nuanced and sometimes ambiguous nature of words and phrases. Machines handle numbers quickly and efficiently, but human language is much trickier. Text is therefore converted into numerical vectors, known as embeddings, that a machine learning algorithm can readily process. In some systems, such as those built on large language models, auxiliary data is stored in a vector database of dense embeddings until it needs to be retrieved.
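To make the idea concrete, here is a minimal Python sketch of turning text into vectors and retrieving the closest match. It is an illustration only: the word-hashing `embed` function, the `DIM` constant, and the sample documents are assumptions made for this example, and real systems rely on trained neural embedding models and dedicated vector databases rather than anything this simple.

```python
import math
import zlib

DIM = 1024  # toy vector size, chosen for this example only


def embed(text):
    """Toy embedding: hash each lowercase word into one of DIM buckets, then L2-normalize."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        # crc32 is used because it gives the same bucket on every run.
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    # Both vectors are unit length, so the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


# Stand-in for a vector database: each document is stored next to its vector.
documents = [
    "patient reports chest pain and shortness of breath",
    "invoice for office supplies is due next month",
    "meeting notes about the quarterly budget review",
]
index = [(embed(doc), doc) for doc in documents]


def retrieve(query):
    """Return the stored document whose vector is most similar to the query vector."""
    q = embed(query)
    return max(index, key=lambda item: cosine(q, item[0]))[1]


print(retrieve("chest pain"))
```

The search step compares vectors only, which is part of why such databases can seem safer to share than the raw text they were built from.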

But just how private are these vector databases? If someone with malicious intent were to attempt to reverse engineer text embeddings, how much private information could they reveal about the original text?

As it turns out, quite a bit. Using a multi-step method called Vec2Text, which iteratively corrects and re-embeds its guesses, the authors were able to recover 92 percent of 32-token text inputs exactly. Further, the team successfully retrieved 94 percent of first names, 95 percent of last names, and 89 percent of full names from a data set of clinical notes. Their findings have profound implications for data privacy, especially in sensitive domains like healthcare.
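The general idea of reversing an embedding can be illustrated with a toy example. The sketch below is not Vec2Text, which uses trained models; it swaps in a simple greedy word search to show the guess, re-embed, and compare loop. The tiny `VOCAB`, the count-based `embed` function, and the sample sentence are all hypothetical, and because this toy embedder ignores word order, only the set of words can be recovered.

```python
import math

# Hypothetical closed vocabulary; the toy embedder ignores any other word.
VOCAB = ["alice", "bob", "smith", "jones", "patient",
         "reports", "chest", "pain", "fever", "cough"]


def embed(text):
    """Toy embedder: L2-normalized word counts over VOCAB (word order is lost)."""
    counts = [0.0] * len(VOCAB)
    for word in text.lower().split():
        if word in VOCAB:
            counts[VOCAB.index(word)] += 1.0
    norm = math.sqrt(sum(x * x for x in counts))
    return [x / norm for x in counts] if norm else counts


def cosine(a, b):
    # Inputs are unit length (or all zero), so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def invert(target, max_steps=20):
    """Greedy inversion: extend the guess by one word, re-embed, keep the best match."""
    guess, score = [], 0.0
    for _ in range(max_steps):
        candidates = [guess + [word] for word in VOCAB]
        best = max(candidates, key=lambda ws: cosine(embed(" ".join(ws)), target))
        best_score = cosine(embed(" ".join(best)), target)
        if best_score <= score + 1e-9:
            break  # no candidate moves the guess closer to the target
        guess, score = best, best_score
    return guess


# The attacker sees only the vector and can call embed(); the text stays hidden.
secret = "patient alice smith reports chest pain"
recovered = invert(embed(secret))
print(sorted(recovered))
```

Even in this simplified setting, the attacker needs nothing more than the stored vector and the ability to embed new guesses.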

“Large language models are causing us to rethink lots of assumptions about privacy and natural language. While it was known that this technique was theoretically possible, it was quite surprising to see it work so well on real instances,” says Rush.

The researchers conclude that text embeddings and raw data expose similar amounts of sensitive information. Consequently, they advocate treating the two with the same precautions, technically and perhaps legally.