How robots.txt actually works — and what it doesn't do
Robots.txt is a voluntary signal, not a security mechanism. Well-behaved crawlers (Googlebot, Bingbot, Twitterbot) check it before crawling; bad actors and scrapers ignore it entirely. Disallowing a URL prevents crawling, not indexing: if a disallowed URL is linked from other pages, Google may still show it in search results as a bare URL with no title or snippet. To prevent indexing, use a noindex meta tag on the page itself, and leave that page crawlable, because Google can only see the tag if it is allowed to fetch the page.
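As a rough illustration of the check a well-behaved crawler performs before fetching a URL, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The bot name `ExampleBot`, the `example.com` URLs, and the helper `is_crawl_allowed` are made up for the example. Python's parser is not Google's parser, so edge cases such as wildcards and rule precedence may be evaluated differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler asks first; nothing stops an impolite one from skipping this.
print(is_crawl_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/report.html"))  # False
print(is_crawl_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/blog/post.html"))       # True
```

The check happens entirely on the crawler's side, which is why robots.txt cannot protect anything from a client that chooses not to run it.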
Directives Google honors vs. ignores
| Directive | Google honors it? | Notes |
|---|---|---|
| User-agent | Yes | Wildcard (*) covers all bots; the user-agent name is matched case-insensitively (paths in Disallow/Allow rules are case-sensitive) |
| Disallow | Yes | Blocks crawling of the path; empty value = allow all |
| Allow | Yes | Overrides Disallow for a sub-path when it is the more specific (longer) match; useful for /path/* exceptions |
| Sitemap | Yes | Absolute URL to sitemap.xml — recommended to include here |
| Crawl-delay | No | Google ignores this; Search Console's crawl-rate limiter was retired in 2024, so slow Googlebot by temporarily returning 429 or 503 responses instead |
| Noindex | Deprecated | Google dropped support in 2019; use meta noindex tag instead |
| Host | No | Not recognized by Google; use canonical tags for domain preference |
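To make the Allow/Disallow interaction in the table concrete, the sketch below implements the precedence rule described in Google's documentation and RFC 9309: the longest matching path wins, and a tie goes to Allow. It is a simplification that only does plain prefix matching (no `*` or `$` wildcards), and the function name `is_allowed` and the sample paths are illustrative assumptions.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Decide whether path may be crawled under (directive, prefix) rules.

    Simplified longest-match precedence: prefix matching only, no wildcards.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        # An empty value matches nothing, so "Disallow:" leaves everything allowed.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # Longer prefix wins; on equal length, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/docs/"), ("Allow", "/docs/public/")]
print(is_allowed(rules, "/docs/public/guide.html"))  # True: the Allow rule is longer
print(is_allowed(rules, "/docs/internal.html"))      # False: only Disallow matches
print(is_allowed(rules, "/about.html"))              # True: no rule matches
```

Because precedence depends on path length and not on the order of lines in the file, an Allow exception can appear before or after the Disallow it carves out of.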
Two mistakes that block your entire site
- Disallow: /
  `Disallow: /` under `User-agent: *` blocks every crawler from every page. This is the correct robots.txt for a staging server, but if accidentally deployed to production it removes your entire site from search results within days. Always verify after deployment.
- Blocking CSS and JS files
  If your robots.txt blocks `/static/` or `/_next/`, Google can't render your pages: it sees unstyled HTML and scores them as low quality. Googlebot must be able to crawl CSS, JS, and font files to render the page the same way a user sees it.
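Both mistakes can be caught with a small post-deployment check. The following is one possible sketch using Python's `urllib.robotparser`. The asset paths in `ASSET_PATHS`, the `example.com` origin, and the `audit_robots` helper are placeholders to adapt to your own site, and the standard-library parser only approximates how Googlebot evaluates rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical asset URLs that must stay crawlable for rendering.
ASSET_PATHS = ["/static/app.css", "/_next/static/chunk.js"]


def audit_robots(robots_txt: str,
                 user_agent: str = "Googlebot",
                 origin: str = "https://example.com") -> list[str]:
    """Return a list of problems found in robots_txt (empty if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    problems = []
    # Mistake 1: a site-wide Disallow also blocks the homepage.
    if not parser.can_fetch(user_agent, origin + "/"):
        problems.append("homepage is blocked")
    # Mistake 2: rendering assets are blocked.
    for path in ASSET_PATHS:
        if not parser.can_fetch(user_agent, origin + path):
            problems.append(f"asset blocked: {path}")
    return problems


staging = "User-agent: *\nDisallow: /\n"
print(audit_robots(staging))
# ['homepage is blocked', 'asset blocked: /static/app.css', 'asset blocked: /_next/static/chunk.js']
```

In practice you would fetch the live robots.txt from production and fail the deployment if the returned list is not empty.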
