The shady world of Brave selling copyrighted data for AI training

I'm fairly certain that I was not the only person in the world who thought to himself, "Did they just yoink the entire Internet and bundle it together into a glorified copy and paste machine?" upon the release of ChatGPT.

And even though there are some concerns about the type of data that was used to train OpenAI's latest model, it seems that the overall stance of OpenAI and other companies working on similar projects is that it is fair use. Whether that will hold up in the long run remains to be seen.

After Google published an announcement saying they're interested in exploring alternatives to robots.txt to provide broader control over AI-related content issues, I was curious to see what other search engines are doing in regard to AI, both for dealing with AI-generated content and for handling data.

Personally, I'm not a big fan of these conglomerates ingesting other people's work and then reselling it, which also leads me to the story I'm going to talk about today.

Brave gives you "rights" to use data for AI inference/training

As you may have noticed, I used the word "copyrighted" in the title of this story, and not without reason. I think this story would have been fairly decent even without the copyright part, but before we get to the nitty-gritty stuff: I can 100% confirm that Brave lets you ingest copyrighted material through its Brave Search API, and that it also assigns you "rights" to that material.

[Screenshot of brave.com, taken 2023-07-14]

Brave offers numerous API products, some of which are specifically designed for AI. This one, Data for AI, lets you "Feed results to AI models for inference", while their premium version of this same API lets you "Cache/store data to train AI models" not only with "regular" rights but also "storage rights".

Rather than talking about it too much, I thought the logical thing to do would be to sign up for the API and see what kind of data we can find. For its Data for AI product, Brave offers something called "Extra alternate snippets", which are very similar to what we know as Google's Featured Snippets.

[Screenshot of www.google.com, taken 2023-07-14: an example of a typical Google Featured Snippet]

Google's featured snippets tend to be rather short (no more than 50 words), which from a copyright point of view can be classified as fair use.

Fair use is a doctrine in the law of the United States that allows limited use of copyrighted material without requiring permission from the rights holders. It provides for the legal, non-licensed citation or incorporation of copyrighted material in another author's work under a four-factor balancing test:

The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes

The nature of the copyrighted work

The amount and substantiality of the portion used in relation to the copyrighted work as a whole

The effect of the use upon the potential market for or value of the copyrighted work

After running a few queries against Brave's Search API, I was rather surprised to see how generous their snippets are: in the example below, the "extra_snippets" range from 150 to 260 words.

Here is the cleaned-up JSON response from the API. This particular response was for the query "Brave Search", and the "extra_snippets" are extracted from this Wikipedia page. Mind you, this is for a single query from a single site, not taking into account the other (mentioned below) search features that Brave provides through its Data for AI API.

"extra_snippets":[
"Brave Search is a search engine developed by Brave Software, Inc. and released in Beta in March 2021, following the acquisition of Tailcat, a privacy-focused search engine from Cliqz. Brave Search aims to use its independent index to generate search results. However, the user can allow the Brave browser to anonymously check Google for the same query.",
"In October 2021, Brave Search was made the default search engine for Brave browser users in the United States, Canada, United Kingdom (replacing Google Search), France (replacing Qwant) and Germany (replacing DuckDuckGo). In June 2022, Brave Search ended its beta stage and was fully released.",
"In June 2022, Brave Search ended its beta stage and was fully released. In addition to the launch, the new Goggles feature was added, allowing users to apply their own rules and filters to search queries. Brave search has various features designed to enhance users' searching experience:",
"Brave search has various features designed to enhance users' searching experience: Brave Search uses its own web index. As of May 2022, it covered over 10 billion pages and was used to serve 92% of search results without relying on any third-parties, with the remainder being retrieved server-side from the Bing API or (on an opt-in basis) client-side from Google.",
"Brave Search is a search engine developed by Brave Software, Inc., which is set as the default search engine for Brave web browser users in certain countries. Brave Search is a search engine developed by Brave Software, Inc. and released in Beta in March 2021, following the acquisition of Tailcat, ..."
]
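
To make the length comparison concrete, here is a minimal sketch of how the word count of each snippet could be measured. It works on an already-downloaded response rather than calling the API, and the sample below is a shortened stand-in built from two sentences of the response above; only the "extra_snippets" key comes from the actual output, while the function name is made up for illustration.

```python
import json

# Shortened stand-in for a response. Only the "extra_snippets" key is taken
# from the output shown in the article; the rest of the shape is an assumption.
sample_response = json.loads("""
{
  "extra_snippets": [
    "In June 2022, Brave Search ended its beta stage and was fully released.",
    "Brave Search uses its own web index."
  ]
}
""")


def snippet_word_counts(response: dict) -> list[int]:
    """Return the whitespace-delimited word count of each extra snippet."""
    return [len(snippet.split()) for snippet in response.get("extra_snippets", [])]


counts = snippet_word_counts(sample_response)
print(counts, sum(counts))
```

Running the same function over a full response gives both the per-snippet lengths and the total amount of text returned for a single result.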

I know for a fact that Wikipedia operates under a CC BY-SA 4.0 license, which explicitly states that if you're going to use the data, you must give attribution. As far as search engines go, they can get away with it because linking back to a Wikipedia article on the same page as the search results is considered attribution.

But in the case of Brave, not only are they disregarding the license, they're also charging money for the data and then giving third parties "rights" to that data.

One might argue that even 260 words are not enough to have any real impact. Setting the copyright question aside, I'm not sure that is the case: not only can you manipulate these results and fine-tune the output based on domains, type, date, and other metrics, but Brave also offers additional API features for paid customers, such as:

Schema-enriched Web results

Infobox

Discussions

Locations

All of these can be used to extract very specific information, which can then be used to fine-tune LLMs without any worry about copyright infringement, because Brave acts as a middleman.

Brave doesn't disclose its own robot crawler

I get anywhere from 30 to 50 visitors a day from Brave's search engine. But, if I go through my access.log files, I won't find any indication that a Brave crawler is regularly crawling my content.
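
For anyone who wants to run the same check, here is a rough sketch of tallying user agents in an access log and looking for one that mentions Brave. It assumes the common "combined" log format, where the user agent is the last double-quoted field on each line. The two sample lines are made up purely for illustration (including the "ExampleBot" name) and are not taken from my logs.

```python
import re
from collections import Counter

# Assumption: combined log format, so the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def user_agent_counts(lines):
    """Count how often each user-agent string appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def agents_mentioning(counts, needle):
    """Return the user agents whose string contains `needle` (case-insensitive)."""
    needle = needle.lower()
    return {ua: n for ua, n in counts.items() if needle in ua.lower()}


# Made-up sample lines for illustration only, not real log data.
sample_lines = [
    '203.0.113.5 - - [14/Jul/2023:22:12:41 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.9 - - [14/Jul/2023:22:13:02 +0000] "GET /post HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

counts = user_agent_counts(sample_lines)
print(agents_mentioning(counts, "brave"))
```

On a real server, an open file handle for the access log can be passed to user_agent_counts in place of the sample list.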

They do have something called the Web Discovery Project, but from what I gather, it's an opt-in feature, so you must explicitly agree to it before you partake in the initiative.

The Web Discovery Project is a privacy-preserving way for you to contribute to the growth and independence of Brave Search. If you opt in, you’ll contribute some anonymous data about searches and web page visits made within the Brave Browser (including pages arrived at via some, but not all, other search engines). This data helps build the Brave Search independent index, and ensure we show results relevant to your search queries. By “data” we mean search queries, search result clicks, the URLs of pages visited in the browser, time spent on those pages, and some metadata about the pages themselves.

After some more digging, I was able to find a Reddit comment from Jonathan Sampson, Senior Developer Relations at Brave, who said the following:

We do indeed have our own crawler, actively building our own index. Presently the index consists of over 8 billion pages, with more than 40M crawled each day. The crawler, which does not contain a unique user-agent string, respects robots.txt.

They don't mention their crawler anywhere in their docs, either. So, if you wanted to block Brave from crawling, indexing, and ultimately selling your content to third parties, your only option for the time being would be to block all crawlers, because a blanket rule is the only kind of rule that a crawler without a unique user-agent string can "respect" in robots.txt.
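
To illustrate why a missing user-agent string matters, here is a small sketch using Python's standard urllib.robotparser module. It compares a rule aimed at a named crawler with a blanket rule. "ExampleBot" and the example.com URL are hypothetical placeholders, and "Mozilla/5.0" merely stands in for a crawler that identifies itself with a generic string; none of this is Brave's actual configuration.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Build a parser from robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# A rule that targets one crawler by name (hypothetical "ExampleBot").
targeted = parse_rules("""
User-agent: ExampleBot
Disallow: /
""")

# A blanket rule that applies to every crawler.
blanket = parse_rules("""
User-agent: *
Disallow: /
""")

url = "https://example.com/post"

# The named crawler is blocked by the targeted rule.
print(targeted.can_fetch("ExampleBot", url))
# A crawler with a generic user agent is not matched by the targeted rule.
print(targeted.can_fetch("Mozilla/5.0", url))
# Only the blanket rule blocks it, along with every other crawler.
print(blanket.can_fetch("Mozilla/5.0", url))
```

A targeted rule only works when the crawler announces a name to match against, which is why the blanket rule is the only lever left here.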

And don't get me wrong, I love Brave, and I've given them credit where it's due; it's also my understanding that the Brave Search API feature is new as a whole (released in May 2023), so perhaps it wasn't or hasn't been thought through completely.

I've asked for a comment from the Brave team on their thought process, so as soon as I have a statement, I will make sure to include it here.
