
How to Protect Your Writing from AI Scraping

Robot hands playing with a Rubik's Cube

You've spent months, maybe years, perfecting your manuscript. Every sentence carefully crafted. Every character meticulously developed. Every plot twist earned through countless revisions. Then you wake up one morning to discover that your writing has been fed into an AI training dataset without your permission, without compensation, and without credit.



This isn't a dystopian future. It's happening right now.


Large language models are harvesting written content across the internet, and most writers have no idea their novels and articles are being used this way. The scale is staggering. ChatGPT alone was trained on 300 billion words. To put that in perspective, if you wrote 1,000 words every single day, it would take you roughly 2,740 years to reach just one billion words.


The question every writer should be asking is: where did all those words come from?


The Uncomfortable Truth About AI Training Datasets

Here's what you won't see advertised: many of the largest datasets were compiled from e-book piracy sites. The same illegal repositories that writers have been fighting against for years are now being used to train the very technologies threatening to replace human authors.


AI companies built these datasets without authorization and without compensation. The assumption? Everything on the internet is fair game.


The Authors Guild has taken a firm stance against this practice. In January 2025, they launched a "Human Authored" certification system to help readers identify content created without AI assistance. Their statement makes it clear: "Generative technologies built illegally on vast amounts of copyrighted works without licenses, without giving authors any compensation or say over the use of their works, are used to cheaply and easily produce works that compete with and displace human-authored books, journalism, and other works."


Translation? Your writing is being used to train the machines that will compete against you.


Why Quality Writing Matters to AI Model Development

AI bots don't just scrape random content from the internet; they hunt down high-quality material, because quality input is what makes the models work. When Microsoft let its Tay chatbot learn from Twitter in 2016, it went catastrophically wrong within 24 hours. That failure taught developers an important lesson: garbage in, garbage out.


So they turned to better sources: published novels, literary articles, scientific papers, and carefully written blog posts. Books, articles, your writing. Your years of honing your craft. Your unique voice.


The better the writing, the smarter the model becomes. And while writers are often struggling to make ends meet, these systems are generating billions in revenue. OpenAI brought in approximately $3.7 billion in revenue in 2024 and is projected to reach $12.7 billion by the end of 2025.


Meanwhile, the authors whose writing powered that growth? They've seen nothing.


Legal Battles Are Just Beginning

The New York Times was the first major player to sue over AI scraping, because (not surprisingly) these models don’t just “learn” from content, they often reproduce it verbatim. Imagine discovering that passages from your novel are appearing in AI-generated responses without any attribution or compensation.


Other organizations may soon join the legal fight. But lawsuits take years, and writers need action now.


Practical Steps to Protect Your Writing from AI Scraping and Theft


1. Add a Copyright Notice

If you self-publish or have control over your copyright page, include a clear "No AI Training" notice. 


Here's an example from the Authors Guild:

"NO AI TRAINING: Without in any way limiting the author's [and publisher's] exclusive rights under copyright, any use of this publication to 'train' generative artificial intelligence (AI) technologies to generate text is expressly prohibited. The author reserves all rights to license uses of this work for generative AI training and development of machine learning language models."


While this may not stop all scraping, it establishes your intent clearly and could support future legal action. Think of it as putting a "No Trespassing" sign on your property.


2. Block Bots Using Robots.txt

This is your digital "Do Not Enter" sign. A robots.txt file is a simple text document in your site's root directory that tells web crawlers which parts of your site they can and cannot access, and it's one of your most important tools for controlling bot access.


OpenAI has publicly stated they will respect robots.txt restrictions for GPTBot. Other developers have made similar commitments. While this isn't foolproof, it's currently one of the most reliable ways to limit access to your website content.


For WordPress Users:

  • If you use Yoast SEO, go to Yoast > Tools > File Editor to edit your robots.txt directly

  • Alternatively, install the WP Robots Txt plugin from your WordPress dashboard

  • In your site's privacy settings, look for options to prevent third-party AI training


For Other Website Platforms:

  • Squarespace offers a simple toggle in settings

  • Wix and GoDaddy users can contact support for help editing the robots.txt file


Manual Method: If you access your website files through FTP, you can create or edit the robots.txt file in your root directory. Add these lines:


User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /


This tells major crawlers that your entire site is off-limits for scraping.
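If you want to confirm your rules do what you intend, Python's standard library can parse them locally before you upload the file. This is a rough sketch, not an official checker; the test URL and the bot list are placeholders you would swap for your own:

```python
# Sketch: verify that a robots.txt blocks specific AI crawlers.
# Uses only the Python standard library and parses the rules locally,
# so it works offline against a draft file.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

AI_BOTS = ["GPTBot", "CCBot"]  # extend with any crawlers you care about

def blocked_bots(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the bots from AI_BOTS that may NOT fetch the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, url)]

print(blocked_bots(ROBOTS_TXT))  # both bots should appear as blocked
```

If a bot you meant to block is missing from the output, the directive for it is mistyped or missing from the file.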


3. Opt Out on Publishing Platforms

Many sites where writers share their content now offer AI opt-out settings. These aren't always obvious, so you'll need to dig into your account settings.


Substack allows users to block AI models from scraping content through privacy settings. Look for options labeled "AI training" or "AI bots" and disable them.


Tumblr and DeviantArt have added similar opt-out features in their privacy sections.


Medium and other blogging sites are starting to follow suit, though not all have implemented these protections yet.


If you're not sure whether your platform offers this option, contact their support team directly. Ask specifically about their policies on AI training and whether there are ways to opt out.


Don't hesitate to ask questions about how your content is being used.


4. Be Strategic About What You Share Online

Your best defense is intentionality. Before posting online, consider the following:

  • Post only excerpts or summaries on sites that still allow AI scraping

  • Keep your most valuable content behind paywalls or in formats that are harder to scrape

  • For visual artists accompanying your writing: use watermarks and lower-resolution images when sharing publicly

  • Consider tools like Glaze (a free app from the University of Chicago) that make subtle pixel-level changes, invisible to humans but disruptive to AI processing


The goal isn't to hide your writing entirely. That defeats the purpose of being published. However, you do need to think strategically about where and how you share your pieces.


5. Monitor Your Content

Set up Google Alerts for your book titles, character names, and distinctive phrases. This won't catch everything, but it can help you spot when your writing appears in unexpected places online.
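Alerts can be supplemented with a quick check of any page they flag. The sketch below (the phrases and sample text are hypothetical) scans a page's text for your distinctive phrases, ignoring case and the whitespace reflowing that HTML often introduces:

```python
import re

def find_matches(page_text: str, phrases: list[str]) -> list[str]:
    """Return the phrases that appear in page_text, ignoring case
    and collapsing runs of whitespace (web pages often reflow text)."""
    normalized = re.sub(r"\s+", " ", page_text).lower()
    return [p for p in phrases if re.sub(r"\s+", " ", p).lower() in normalized]

# Hypothetical distinctive line from a manuscript:
phrases = ["the raven keeps its own counsel"]
sample = "and so The Raven keeps\nits own counsel, she wrote"
print(find_matches(sample, phrases))  # found despite the line break and casing
```

Paired with the URLs an alert surfaces, this gives you a fast way to confirm a hit before sending a takedown notice.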


If you find unauthorized copies, send takedown notices immediately. This won't stop training on data already collected, but it can limit future scraping. And if you're not sure how takedown notices work, we're happy to volunteer our services for advice. Our founder, Holly Rhiannon, has handled them regularly and will do her best to walk you through the process.


Just email her at minion@stygiansociety.com 


The Common Crawl Connection

You may have heard of Common Crawl, a nonprofit organization that creates massive datasets of web content for research purposes. The problem is that OpenAI and others have used Common Crawl data to train their models. Instead of scraping websites individually, they went straight to this pre-made collection of text.


If you want a comprehensive method to protect your writing from AI scraping, make sure Common Crawl's bot (CCBot) is among those you block in your robots.txt file; the example above includes it.


Why This Matters for Independent Authors

At The Stygian Society, we understand the stakes. We were founded by authors, for authors, precisely because we believe in prioritizing human creativity over technological shortcuts. When NaNoWriMo endorsed AI-assisted writing in 2024, our founder Holly Rhiannon, a former NaNoWriMo municipal liaison, resigned in protest.


We created The Order of the Written Word as a direct response to that controversy. Our writing challenge champions human creativity, and our Discord community of over 600 members gathers to support each other in reclaiming their craft from AI encroachment. As we approach the end of October, our AI-free writing challenge is starting up November 1st—if you're a writer who wants to champion human creativity, now is the perfect time to join up.


The future of publishing depends on protecting human voices. Not because we fear technology, but because we value the irreplaceable quality of human storytelling. Stories crafted through lived experience, cultural understanding, and emotional depth can't be replicated, no matter how much data a model is trained on.


What the Industry Predicts

According to publishing leaders surveyed by BookBub in 2025, the industry expects significant changes in how AI impacts authors. One prediction stands out: "Old-school authors will dip and then rise anew in popularity as people realize that the human touch is critical for a storyline that reaches their soul."


Readers are already starting to crave authenticity. They're drawn to honesty, vulnerability, and genuine emotion. The more AI-generated content floods the market, the more valuable truly human-created narratives become.


But that value only translates to sustainable careers if we protect what makes human creativity special in the first place. If AI companies can continue training on copyrighted material without consequence, every article and novel becomes potential data for the next generation of models.


The Bigger Picture

Here's the reality: complete safety from AI scraping and theft of your writing isn't possible right now. Not unless you write exclusively in notebooks and never publish digitally, which isn't practical for most authors trying to build careers.


But taking these steps creates barriers. They establish a clear record of your objection to unauthorized use. And if legal frameworks eventually catch up with the technology, you'll have evidence that you took reasonable steps to safeguard your writing.


The Authors Guild and other organizations are fighting for legislative and regulatory changes that would require consent before using books and other written content in training systems. Several lawsuits are working their way through courts right now, potentially setting precedents for copyright in the age of AI.


These legal battles matter. But they take time. In the meantime, writers need to take whatever protective measures we can.


A Word About Copyright Law

Copyright already gives authors exclusive rights to decide how their work is used. This intellectual property protection is fundamental to authorship. The problem is that many tech companies interpret everything on the internet as free to use unless explicitly told otherwise. They'll observe opt-outs when pressed, but they won't proactively seek permission.


This is why the copyright notices, robots.txt files, and platform opt-outs matter. They shift the burden of proof. Instead of relying on copyright law alone, you're creating multiple layers of documented objection.


Taking Action on Your Creative Work

Your words are your livelihood. Your narratives are years of your life transformed into content that connects with people. No one should be able to harvest your writing without authorization, compensation, or acknowledgment.


The steps outlined here won't guarantee absolute safety. But they significantly improve your chances of keeping your content out of AI training datasets. More importantly, they send a clear message. Writers are paying attention, we're taking action, and we won't quietly accept having our creativity exploited. Check your website settings today. Add that copyright notice to your next publication. Opt out of AI training on the sites where you publish. These aren't complicated steps, but they're important ones.


The publishing industry is watching these issues closely. As more authors take protective measures and more lawsuits challenge current practices, we're moving toward a future with clearer rules and better safeguards for human creativity.


At The Stygian Society, we're committed to being part of that future. We believe in human authors, human creativity, and human narratives. We believe your writing deserves to be defended, and we're here to support writers who refuse to let their content be treated as free training data.


Your voice matters. Your narratives matter. And they're worth defending.


Side note? 13 Haunted Nights is still on sale through October! Check it out and enjoy some spooky HUMAN writing.

13 Haunted Nights: CA$15.00 (regularly CA$20.00)
13 Haunted Nights (pdf): CA$4.00 (regularly CA$5.00)

 
 
 
