unrobotstxt 0.1.0
Translation of Google's robots exclusion protocol (robots.txt) parser and matcher
To use this package, run the following command in your project's root directory:
dub add unrobotstxt
Manual usage
Put the following dependency into your project's dependencies section:
"unrobotstxt": "~>0.1.0"
This is a D translation of Google's robots exclusion protocol (robots.txt) parser and matcher. It's derived from Google's open source project, but not affiliated with Google in any way.
Features
- Matches Google's (open source) implementation of the robots.txt standard
- Available as a library or a standalone test tool
- @safe
Standalone tool
Can be used to test a robots.txt file, to see if it blocks/allows the URLs you expect.
Usage example
$ wget https://dlang.org/robots.txt
$ cat robots.txt
User-agent: *
Disallow: /phobos-prerelease/
Disallow: /library-prerelease/
Disallow: /cutting-edge/
$ robotstxt robots.txt MyBotName /index.html
user-agent 'MyBotName' with URI '/index.html': ALLOWED
$ robotstxt robots.txt MyBotName /cutting-edge/index.html
user-agent 'MyBotName' with URI '/cutting-edge/index.html': DISALLOWED
Building
Run dub build from the repo root. You can put the resulting robotstxt binary in your PATH.
Alternatively, download, build and run from the DUB registry with dub run unrobotstxt.
Library
Usage example
import std;
import unrobotstxt;

void main()
{
    const robots_txt = readText("robots.txt");
    auto matcher = new RobotsMatcher();
    if (matcher.AllowedByRobots(robots_txt, ["MyBotName"], "/index.html"))
    {
        // Do bot stuff
    }
}
There's no API for parsing once and then making multiple URL checks.
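In practice that means passing the robots.txt text to every check. A minimal sketch of testing several URLs, reusing the matcher and robots_txt from the example above:

foreach (url; ["/index.html", "/cutting-edge/index.html"])
{
    // Each call re-parses robots_txt; there is no cached parse to reuse.
    const allowed = matcher.AllowedByRobots(robots_txt, ["MyBotName"], url);
    writeln(url, ": ", allowed ? "ALLOWED" : "DISALLOWED");
}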
For pure Google-style parsing (no matching), you can also implement the callbacks in the RobotsParseHandler abstract class and pass it to ParseRobotsTxt.
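Since the API matches the C++ original, a parse-only handler might look roughly like the sketch below. The callback names are taken from Google's C++ RobotsParseHandler interface; the exact D signatures are an assumption, so check the generated docs before relying on them.

import std.file : readText;
import std.stdio;
import unrobotstxt;

// Logs every directive the parser reports. Signatures are assumed to
// mirror the C++ RobotsParseHandler callbacks; verify against the docs.
class LoggingHandler : RobotsParseHandler
{
    override void HandleRobotsStart() {}
    override void HandleRobotsEnd() {}

    override void HandleUserAgent(int lineNum, string value)
    {
        writefln("%s: User-agent: %s", lineNum, value);
    }

    override void HandleAllow(int lineNum, string value)
    {
        writefln("%s: Allow: %s", lineNum, value);
    }

    override void HandleDisallow(int lineNum, string value)
    {
        writefln("%s: Disallow: %s", lineNum, value);
    }

    override void HandleSitemap(int lineNum, string value)
    {
        writefln("%s: Sitemap: %s", lineNum, value);
    }

    override void HandleUnknownAction(int lineNum, string key, string value)
    {
        writefln("%s: %s: %s (unknown directive)", lineNum, key, value);
    }
}

void main()
{
    ParseRobotsTxt(readText("robots.txt"), new LoggingHandler());
}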
Documentation
See the generated docs. The example above is pretty much what you get, though.
The code supports a StrictSpelling version that corresponds to the kAllowFrequentTypos global boolean in the original C++ code. It disables some typo permissiveness (e.g., accepting "Disalow" for "Disallow"), though various other permissive quirks remain. Otherwise the API matches the original C++ code.
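To enable it as a consumer, you can define the version identifier in your own package file. A dub.json sketch, where the package name "myapp" is just a placeholder and versions is dub's standard setting for defining version identifiers:

{
    "name": "myapp",
    "dependencies": {
        "unrobotstxt": "~>0.1.0"
    },
    "versions": ["StrictSpelling"]
}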
Contributing
Bug fixes and misc. improvements are welcome, but make a fork if you want to extend/change the API in ways that don't match the original. I've named this project unrobotstxt to leave the robotstxt name available for a project with a more idiomatic API.
- Registered by Simon Arneaud
- Repository: sarneaud/unrobotstxt
- License: Apache 2.0
- Copyright © 1999-2020, Google LLC
- Dependencies: none
- Versions: 0.1.0 (2020-Jul-03), ~master (2020-Jul-03)