By Patricia Waldron
The data inputs that enable modern search and recommendation systems were thought to be secure, but an algorithm developed by Cornell Tech researchers successfully teased out names, medical diagnoses and financial information from encoded datasets.
People are able to search large databases because an encoder has transformed each piece of data into an “embedding” – a series of numbers representing the meaning of the text, image, sound recording or any other type of information. The new algorithm, called vec2vec, can translate databases of text embeddings back into English – with no knowledge of the original data or how it was encoded. Until recently, companies had assumed these embeddings were as good as encrypted.
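To make the idea of an embedding concrete, here is a minimal sketch of embedding-based search. It does not implement vec2vec or any real encoder. The word list `VOCAB`, the function `toy_embed` and the sample records are hypothetical stand-ins for a trained neural encoder and a real database. It shows only how text becomes a series of numbers, and how a search compares those numbers rather than the original words.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary. A real encoder is a trained neural network
# that produces hundreds or thousands of dimensions, not six word counts.
VOCAB = ["patient", "diagnosis", "diabetes", "loan", "payment", "account"]


def toy_embed(text):
    """Turn text into a list of numbers: one word count per VOCAB entry."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, index):
    """Return the id of the stored embedding most similar to the query."""
    query_vec = toy_embed(query)
    return max(index, key=lambda doc_id: cosine(query_vec, index[doc_id]))


# Made-up sample records, for illustration only.
documents = {
    "record_1": "patient diagnosis diabetes",
    "record_2": "loan payment account",
}

# The search index stores only the embeddings, not the original text.
index = {doc_id: toy_embed(text) for doc_id, text in documents.items()}
```

In this sketch, `search("diabetes diagnosis", index)` compares the query's numbers against each stored vector and picks the closest match. The index holds no readable text, which is why such vectors were long assumed to be safe to share.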
“Everybody should think of these embeddings as being as sensitive as the underlying text,” said senior author Vitaly Shmatikov, professor of computer science in the Cornell Ann S. Bowers College of Computing and Information Science and at Cornell Tech. “Trusting anyone with your embeddings is the same as trusting them with your data.”