In the generative AI boom, data is the new oil. So why shouldn't you be able to sell your own?
From major tech companies to startups, AI makers are licensing e-books, images, videos, audio, and more to data brokers, all in the pursuit of training more capable (and legally defensible) AI-powered products. Shutterstock has deals with Meta, Google, Amazon, and Apple to supply millions of images to train models, while OpenAI has signed agreements with several news organizations to train its models on news archives.
In many cases, the individual creators and owners of that data have never seen a dime of the cash changing hands. A startup called Vana wants to change that.
Anna Kazlauskas and Art Abal, who met in a class at MIT's Media Lab focused on building technology for emerging markets, co-founded Vana in 2021. Before Vana, Kazlauskas studied computer science and economics at MIT, eventually leaving to launch Iambiq, a fintech automation startup, out of Y Combinator. Abal, a corporate lawyer by training, was an associate at the Cadmus Group, a Boston-based consulting firm, before heading up impact sourcing at the data annotation firm Appen.
With Vana, Kazlauskas and Abal set out to build a platform that allows users to “aggregate” their data — including conversations, speech recordings, and images — into datasets that can then be used to train generative AI models. They also want to create more personalized experiences — for example, a daily motivational voicemail based on your health goals, or an art-creating app that understands your style preferences — by fine-tuning general-purpose models on that data.
“Vana’s infrastructure actually creates a user-owned data vault,” Kazlauskas told TechCrunch. “It does this by allowing users to collect their personal data in a non-custodial way… Vana allows users to own AI models and use their data across AI applications.”
Here's how Vana showcases its platform and API to developers:
The Vana API connects a user's personal data across platforms… to allow you to personalize your app. Your app gets instant access to the user's custom AI model or underlying data, simplifying onboarding and removing concerns about compute cost… We believe users should be able to bring their personal data from walled gardens, like Instagram, Facebook, and Google, into your app, so you can create an amazing personalized experience from the first time a user interacts with your consumer AI application.
Creating an account with Vana is fairly simple. After confirming your email, you can attach data to your digital avatar (such as personal photos, descriptions, and audio recordings) and explore apps created with the Vana platform and datasets. The app selection ranges from ChatGPT-style chatbots and interactive storybooks to a detailed profile builder.
Now, you may be wondering why — in this age of heightened awareness of data privacy and ransomware attacks — would anyone volunteer their personal information to an anonymous startup, let alone a venture-backed company? (Vana has so far raised $20 million from Paradigm, Polychain Capital, and other backers.) Can any for-profit company really be trusted not to misuse or mishandle any monetizable data it acquires?
In response to this question, Kazlauskas emphasized that Vana's primary goal is for users to “take back control of their data,” noting that Vana users have the option to self-host their data rather than store it on Vana's servers, and to control how their data is shared with apps and developers. She also argued that because Vana makes money by charging users a monthly subscription (starting at $3.99) and charging developers “data transaction” fees (e.g., for transferring datasets to train AI models), the company is disincentivized from exploiting users and the personal data they bring with them.
“We want to create models that are owned and governed by users who all contribute their data, and allow users to bring their data and models with them into any application,” Kazlauskas said.
Now, while Vana doesn't sell user data to companies to train AI models (or so it claims), it wants to let users do it themselves if they choose — starting with their Reddit posts.
This month, Vana launched what it calls the Reddit Data DAO (Digital Autonomous Organization), a program that pools multiple users' Reddit data (including their karma and posting history) and lets them decide together how that combined data is used. After linking a Reddit account, submitting a data export request to Reddit, and uploading the resulting data to the DAO, users get to vote alongside other members of the DAO on decisions such as licensing the collected data to generative AI companies for a shared profit.
It's an answer of sorts to Reddit's recent moves to commercialize data on its platform.
Reddit previously didn't provide access to its posts and communities for AI training purposes, but it reversed course late last year, ahead of its IPO. Since the policy change, Reddit has taken in more than $203 million in licensing fees from companies including Google.
“The broad idea [with the DAO] is to free user data from the major platforms that seek to store and monetize it,” Kazlauskas said. “This is a first, and it's part of our efforts to help people aggregate their data into user-owned datasets to train AI models.”
Not surprisingly, Reddit — which does not work with Vana in any official capacity — is not happy with the DAO.
Reddit has banned the Vana subreddit dedicated to discussing the DAO. A Reddit spokesperson accused Vana of “exploiting” its data export system, which is designed to comply with data privacy regulations such as the GDPR and the California Consumer Privacy Act.
“Our data arrangements allow us to place guardrails on such entities, even on public information,” the spokesperson told TechCrunch. “Reddit does not share non-public personal data with commercial organizations, and when Redditors request an export of their data from us, they receive non-public personal data from us in accordance with applicable laws. Direct partnerships between Reddit and vetted organizations, with clear terms and accountability, are required, and these partnerships and agreements prevent the misuse and abuse of personal data.”
But does Reddit have any real reason to worry?
Kazlauskas envisions the DAO growing to the point where it impacts how much Reddit can charge customers for its data. That's far-fetched, assuming it happens at all; the DAO has just over 141,000 members, a small fraction of Reddit's 73-million-strong user base, and some of those members may be bots or duplicate accounts.
Then there is the issue of how to fairly distribute the payments the DAO may receive from data buyers.
Currently, the DAO awards “tokens” (a cryptocurrency) to users in proportion to their Reddit karma. But karma may not be the best measure of the quality of contributions to a dataset, especially in smaller Reddit communities with fewer opportunities to earn it.
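To make the distribution question concrete, here is a toy sketch of a karma-proportional token split. This is purely illustrative: the function name, member names, and token pool are invented, and Vana's actual allocation logic has not been made public.

```python
# Hypothetical illustration of a karma-proportional token split.
# Vana's real allocation mechanism is not public; this only shows
# why high-karma accounts would dominate such a payout.

def allocate_tokens(karma_by_user: dict[str, int], pool: int) -> dict[str, int]:
    """Split a token pool among members in proportion to their Reddit karma."""
    total = sum(karma_by_user.values())
    if total == 0:
        # Fall back to an even split when no member has any karma.
        share = pool // len(karma_by_user)
        return {user: share for user in karma_by_user}
    return {user: pool * karma // total for user, karma in karma_by_user.items()}

# A power user in a big subreddit vs. a lurker and a niche moderator:
members = {"power_user": 90_000, "lurker": 500, "niche_modder": 9_500}
print(allocate_tokens(members, pool=1_000_000))
```

Under a split like this, a member with 90% of the group's karma takes 90% of the pool, which is exactly the concern for contributors from smaller communities whose posts may be just as valuable as training data.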
Kazlauskas floats the idea that DAO members could choose to share their data across platforms and demographics, making the DAO more valuable and driving signups. But this also requires users to put more trust in Vana to handle their sensitive data responsibly.
I personally don't see Vana's DAO reaching critical mass; there are too many barriers standing in the way. But I doubt it will be the last grassroots attempt to assert control over the data increasingly being used to train generative AI models.
Startups like Spawning are devising ways to let creators impose rules on how their data is used for training, while vendors like Getty Images, Shutterstock, and Adobe continue to experiment with compensation schemes. But no one has cracked the code yet. Can it even be cracked? Given the aggressive nature of the generative AI industry, it's certainly a tall order. But perhaps someone will find a way, or policymakers will force one.