Robots.txt, done right.
Build a valid robots.txt in 60 seconds. User-agent rules, allow and disallow paths, sitemap location. Block AI crawlers or let them in. Copy-paste to your site root.
What each line actually does.
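Robots.txt has a tiny vocabulary: user-agent, disallow, allow, sitemap. Master those four, plus wildcards and the optional crawl-delay, and you can steer every crawler that chooses to obey the file. Robots.txt is a request, not an enforcement mechanism: well-behaved crawlers follow it, abusive ones may not. Here is what each directive means and the common patterns.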
User-agent Target
User-agent: * (all crawlers)
Which crawler the next rules apply to. * means everyone. Use Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot to target specific ones. Group multiple user-agents under one block by listing each on its own line before the rules.
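A minimal sketch of the grouping rule, assuming you assemble the file from a script. The helper name render_group and the /drafts/ path are made up for illustration and are not part of any library.

```python
def render_group(user_agents, disallow=(), allow=()):
    """Render one robots.txt group: every User-agent line first, then the rules."""
    lines = [f"User-agent: {ua}" for ua in user_agents]
    lines += [f"Allow: {path}" for path in allow]
    lines += [f"Disallow: {path}" for path in disallow]
    return "\n".join(lines)


# Two crawlers share one set of rules (hypothetical path).
group = render_group(["GPTBot", "ClaudeBot"], disallow=["/drafts/"])
print(group)
```

Separate groups with a blank line when you join several of them into one file.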
Disallow Block
Disallow: /admin/
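Path patterns the user-agent should NOT crawl. Start each path with a slash. Every rule is a prefix match, so the trailing slash matters: Disallow: /admin/ blocks only what sits inside that folder, while Disallow: /admin also blocks /admin.html and /administrator. Use * for wildcards. Empty Disallow: means "block nothing" (i.e., allow all).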
Allow Override
Allow: /admin/public/
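Carve out exceptions inside a Disallow path. If /admin/ is blocked but /admin/public/ should be crawled, add Allow for the inner path. Google and other crawlers that follow the robots.txt standard (RFC 9309) apply the most specific (longest) matching rule regardless of order, and Allow wins a tie. Some older parsers stop at the first matching line, so listing Allow before the Disallow it overrides is the safe convention.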
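Here is a rough sketch of the longest-match behavior used by standards-following crawlers. It handles plain prefix rules only (no wildcards), and the function name is_allowed is an illustration, not a real crawler API.

```python
def is_allowed(path, rules):
    """Decide whether path may be crawled.

    rules is a list of (directive, prefix) pairs, where directive is
    "allow" or "disallow". The longest matching prefix wins; on a tie,
    allow wins. A path that matches no rule is allowed.
    """
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len = len(prefix)
                allowed = is_allow
    return allowed


rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
print(is_allowed("/admin/settings", rules))     # blocked by /admin/
print(is_allowed("/admin/public/help", rules))  # longer Allow rule wins
```

Reversing the order of the two rules gives the same answers, which is the point of matching by specificity rather than by position.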
Sitemap Pointer
Sitemap: https://yoursite.com/sitemap.xml
Tells crawlers where to find your XML sitemap. Use absolute URLs (not relative). Can appear anywhere in the file but convention is the last line. Multiple sitemaps are allowed (one per line). Improves discovery of new pages.
Wildcards Pattern
Disallow: /*.pdf$ (all PDFs, $ = end)
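* matches any sequence of characters. $ marks the end of the URL. Disallow: /*?* blocks any URL with query parameters. Disallow: /tag/* blocks all tag pages. In both cases the trailing * is optional, because every rule already matches as a prefix. Google and Bing both honor wildcards; older or niche crawlers may not.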
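To see how these patterns behave, here is a small sketch that translates a robots.txt path pattern into a regular expression. It is one way to model the matching, not the code any particular crawler runs, and the names pattern_to_regex and matches are invented for this example.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    * becomes "any sequence of characters"; a trailing $ anchors the
    end of the URL; without $, the pattern matches as a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def matches(pattern, url_path):
    """Return True if the rule pattern applies to url_path."""
    return pattern_to_regex(pattern).match(url_path) is not None


print(matches("/*.pdf$", "/files/report.pdf"))
print(matches("/*.pdf$", "/files/report.pdf?download=1"))
print(matches("/*?*", "/shop?color=red"))
```

The second call shows why $ matters: once a query string follows .pdf, the URL no longer ends in .pdf and the rule stops applying.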
Crawl-delay Throttle
Crawl-delay: 10 (seconds between requests)
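Tells the crawler to wait N seconds between requests. Google IGNORES this, and the crawl rate limiter that Search Console used to offer was retired in early 2024; to slow Googlebot temporarily, Google's guidance is to return 500, 503, or 429 status codes. Bing and many smaller crawlers respect Crawl-delay; support elsewhere varies (Yandex, for example, has moved to a setting in its own webmaster tools), so check each crawler's documentation. Useful to throttle aggressive bots that strain your server.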
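If you write your own crawler, Python's standard urllib.robotparser can read the delay a site asks for. A minimal sketch, with the robots.txt content inlined as an assumed example rather than fetched from a real site:

```python
from urllib.robotparser import RobotFileParser

# Assumed example file: Bingbot is asked to wait, everyone else is unrestricted.
ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

bing_delay = parser.crawl_delay("Bingbot")     # seconds requested for Bingbot
other_delay = parser.crawl_delay("Googlebot")  # None: no delay in the * group
print(bing_delay, other_delay)
```

A polite crawler would sleep for that many seconds between requests and fall back to its own default when the value is None.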
User-agent names you should know
Sources: Google Search Central docs, OpenAI GPTBot documentation, Anthropic Crawler documentation, current as of 2024.
Honest answers.
Robots.txt is a plain-text file at the root of your website (yoursite.com/robots.txt) that tells search engine crawlers which pages to crawl and which to skip. Every major crawler reads it before fetching anything else. It is the simplest, most widely supported way to tell Google, Bing, and AI crawlers like GPTBot and ClaudeBot what they may access on your site.
The file belongs at the root of your domain: yoursite.com/robots.txt (not yoursite.com/folder/robots.txt). Each subdomain needs its own file. Most CMSes have a setting for this. WordPress: Yoast or RankMath plugin. systeme.io: handles it automatically based on your domain. Webflow: Site settings > SEO > robots.txt. For hand-coded sites: upload the file to your web root via FTP or your hosting panel.
Robots.txt does not keep a page out of search results. It only tells crawlers what to crawl. A blocked page can still appear in search results if other sites link to it (Google indexes the URL but cannot read the content). To prevent indexing entirely, leave the page crawlable and add a noindex meta tag to it, OR remove the page and serve a 404. Robots.txt is a crawl control, not an indexing control.
Disallow (in robots.txt) tells crawlers not to FETCH the page. Noindex (in a robots meta tag or an X-Robots-Tag HTTP header) tells crawlers not to INDEX the page, and it only works if the page can be fetched. Critical implication: if you Disallow a page in robots.txt, Google cannot see the noindex tag on that page, so the URL might stay indexed indefinitely as a blank entry. To deindex something properly, allow crawling AND set noindex.
Most sites do not need to block any crawler. Search crawlers (Googlebot, Bingbot) drive traffic; blocking them hurts SEO. AI crawlers (GPTBot, ClaudeBot, PerplexityBot) are now a strategic decision: block them to keep your content out of AI training, or allow them so AI assistants can cite you as a source in their responses. Block scrapers and bandwidth-hungry bots (SemrushBot, AhrefsBot) if they overload your server, but be aware that those SEO tools will then have less data about your own site when you use them for research.
Whether to block AI crawlers depends on your strategy. Block them if you sell content (publishers, course creators) and do not want it trained on without compensation. Allow them if you want visibility in ChatGPT, Claude, and Perplexity, since AI engines cite sources they were trained on or can browse. Google AI Overviews are part of Google Search and rely on Googlebot, so blocking Google-Extended does not remove you from them. Major AI bots and their robots.txt names: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Gemini training), PerplexityBot (Perplexity), CCBot (Common Crawl, used by many).
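Before you publish, you can sanity-check a blocking policy with Python's standard urllib.robotparser. The sketch below assumes a file that blocks two AI crawlers entirely and keeps /admin/ off limits for everyone else; the URLs are placeholders. Note that this parser does plain prefix matching and takes the first rule that matches, so use it for simple checks like this one, not for wildcard rules.

```python
from urllib.robotparser import RobotFileParser

# Assumed policy: block GPTBot and ClaudeBot everywhere, block /admin/ for the rest.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = {
    ("GPTBot", "/blog/post"): parser.can_fetch("GPTBot", "https://yoursite.com/blog/post"),
    ("ClaudeBot", "/blog/post"): parser.can_fetch("ClaudeBot", "https://yoursite.com/blog/post"),
    ("Googlebot", "/blog/post"): parser.can_fetch("Googlebot", "https://yoursite.com/blog/post"),
    ("Googlebot", "/admin/login"): parser.can_fetch("Googlebot", "https://yoursite.com/admin/login"),
}
for (agent, path), allowed in checks.items():
    print(f"{agent:10} {path:14} {'allowed' if allowed else 'blocked'}")
```

Only the third check should come back as allowed: the AI crawlers are blocked site-wide, and Googlebot is blocked only under /admin/.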
Your input stays private. The generator runs entirely in your browser. Nothing you enter is sent to systeme.io or any other server, stored, or logged. You can verify this in DevTools by watching the Network tab while you use the tool.