How to restrict AI companies from using your content as training data?

Publication / 2 September 2024 / Oliver Kuusk, Marian Moldau

Generative AI systems, which depend on vast amounts of data to generate new content, are evolving rapidly. These systems use self-training methods based on machine learning algorithms that analyse extensive databases, including images and text available on the web. However, this practice raises significant copyright concerns, particularly because it involves reproducing data, even if only temporarily. So how can you, under EU copyright law, restrict AI companies from mining your website for training data?

Why is mining copyrighted data allowed in the EU?

Generally, copying and using copyrighted material online without authorisation constitutes an infringement. This means that when companies mine your website for copyrighted material to train AI, they would in principle be infringing your rights.

However, to foster innovation and research, the EU introduced an exception to this rule before generative AI entered the mainstream. Directive (EU) 2019/790 (the Copyright Directive) allows text and data mining by default. It permits reproductions and extractions of text and data to be made, provided that:

1) the miner had lawful access to the content for the purpose of text and data mining;
2) the owner of the rights has not expressly opted out from the exception; and
3) copies are retained only as long as is necessary for the purposes of text and data mining.

This means AI companies can, by default, mine and use your data for training purposes. However, this exception only permits the reproduction and extraction of scraped content. It does not allow, for example, the publication or sale of this data.

How to opt out?

Website owners can exercise control over their data by using opt-out mechanisms. Under the Copyright Directive, where content is made available online, the opt-out must be expressed in a machine-readable manner. This enables website owners to declare their preferences regarding data mining explicitly. There are multiple ways to make these declarations:

Digital statements: website owners can use digital statements such as robots.txt and ai.txt files to explicitly prohibit or control access by AI crawlers. These files instruct bots as to which parts of the site they may or may not access, serving as a digital statement of the owner’s preferences regarding data extraction.
Digital rights management (DRM): implementing a DRM system provides a method to enforce copyright restrictions by preventing or restricting copying, or by limiting access to specific devices or IP addresses. DRM systems not only restrict access but also make attempts to bypass these protections easier to detect and prevent.
Clear terms and conditions: when providing data access, particularly in B2B contexts, clear contractual terms in your terms and conditions provide a basis for restricting the use of data for text and data mining purposes.
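
As an illustration of the first option, a robots.txt file can be generated from a list of crawler names. The following is a minimal sketch in Python, not a definitive implementation: the function name `build_robots_txt` is hypothetical, and the user-agent tokens listed are examples only, so check each AI company's own documentation for the tokens its crawlers actually announce.

```python
# Example AI crawler user-agent tokens (an assumption for illustration;
# verify the current tokens in each vendor's documentation).
AI_CRAWLER_AGENTS = ["GPTBot", "CCBot", "Google-Extended"]


def build_robots_txt(agents, disallowed_path="/"):
    """Return robots.txt text that blocks the given agents from a path."""
    blocks = []
    for agent in agents:
        # One rule group per crawler: deny access to the chosen path.
        blocks.append(f"User-agent: {agent}\nDisallow: {disallowed_path}\n")
    # Every other crawler (for example, search engines) stays allowed.
    blocks.append("User-agent: *\nAllow: /\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Print the file content; save it as robots.txt in the site root.
    print(build_robots_txt(AI_CRAWLER_AGENTS))
```

The generated text is meant to be served from the root of the website (for example, at /robots.txt). This sketch covers robots.txt only; it does not produce an ai.txt file.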

What do AI companies need to do?

Companies looking to mine publicly available data need to put in place safeguards to comply with applicable copyright laws. This includes implementing a system to recognise entities that have opted out of the data mining exception. For instance, AI companies will need to update their data crawling algorithms to automatically skip websites that do not allow data mining (e.g. according to robots.txt and ai.txt files).
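
On the crawler side, such a check could look roughly like the sketch below. It uses Python's standard `urllib.robotparser` module to decide whether a given crawler may fetch a URL. The function name `may_mine`, the crawler name `ExampleAIBot` and the sample file are hypothetical, and the sketch reads robots.txt only (the standard library has no parser for ai.txt).

```python
from urllib.robotparser import RobotFileParser


def may_mine(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the robots.txt content that was already downloaded from the site.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that opts out one AI crawler and allows the rest.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/articles/1"
    for agent in ("ExampleAIBot", "SomeSearchBot"):
        verdict = "may fetch" if may_mine(SAMPLE_ROBOTS_TXT, agent, page) else "must skip"
        print(f"{agent}: {verdict} {page}")
```

A crawler would run this check before each request and skip any page for which it returns False.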

