unrobotstxt 0.1.0

Translation of Google's robots exclusion protocol (robots.txt) parser and matcher


To use this package, run the following command in your project's root directory:

dub add unrobotstxt

Manual usage
Put the following dependency into your project's dependencies section:

dub.json: "unrobotstxt": "~>0.1.0"
dub.sdl: dependency "unrobotstxt" version="~>0.1.0"

unrobotstxt

This is a D translation of Google's robots exclusion protocol (robots.txt) parser and matcher. It's derived from Google's open source project, but not affiliated with Google in any way.

Features

  • Matches Google's (open source) implementation of the robots.txt standard
  • Available as a library or a standalone test tool
  • @safe

Standalone tool

The standalone tool can be used to test a robots.txt file, to see whether it blocks or allows the URLs you expect.

Usage example
$ wget https://dlang.org/robots.txt
$ cat robots.txt 
User-agent: *
Disallow: /phobos-prerelease/
Disallow: /library-prerelease/
Disallow: /cutting-edge/
$ robotstxt robots.txt MyBotName /index.html
user-agent 'MyBotName' with URI '/index.html': ALLOWED
$ robotstxt robots.txt MyBotName /cutting-edge/index.html
user-agent 'MyBotName' with URI '/cutting-edge/index.html': DISALLOWED
Building

Run dub build from the repo root. You can then put the resulting robotstxt binary somewhere in your PATH.

Alternatively, download, build and run it straight from the DUB registry with dub run unrobotstxt.

Library

Usage example
import std;

import unrobotstxt;

void main()
{
	// Load the robots.txt contents from disk.
	const robots_txt = readText("robots.txt");
	auto matcher = new RobotsMatcher();
	// Check whether user-agent "MyBotName" may fetch /index.html
	// according to the rules in robots_txt.
	if (matcher.AllowedByRobots(robots_txt, ["MyBotName"], "/index.html"))
	{
		// Do bot stuff
	}
}

There's no API for parsing once and then making multiple URL checks.
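In practice that means handing the full robots.txt text to every check. Continuing inside main from the example above, a minimal sketch:

	// Each call passes the raw text to the matcher again; there's no
	// parsed representation to reuse between checks.
	foreach (path; ["/index.html", "/cutting-edge/index.html"])
		writeln(path, ": ", matcher.AllowedByRobots(robots_txt, ["MyBotName"], path));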

For pure Google-style parsing (no matching), you can also subclass the RobotsParseHandler abstract class, implement its callbacks, and pass an instance to ParseRobotsTxt, as sketched below.
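
Here's a minimal sketch of such a handler, one that just collects Sitemap: lines. The callback names follow the original C++ API, which this package mirrors; the exact D signatures (assumed here to take an int line number and string values) may differ, so check the generated docs.

import std;

import unrobotstxt;

// Collects every Sitemap: URL encountered while parsing.
class SitemapCollector : RobotsParseHandler
{
	string[] sitemaps;

	override void HandleRobotsStart() {}
	override void HandleRobotsEnd() {}
	override void HandleUserAgent(int lineNum, string value) {}
	override void HandleAllow(int lineNum, string value) {}
	override void HandleDisallow(int lineNum, string value) {}
	// The only callback this handler cares about.
	override void HandleSitemap(int lineNum, string value)
	{
		sitemaps ~= value;
	}
	override void HandleUnknownAction(int lineNum, string key, string value) {}
}

void main()
{
	auto handler = new SitemapCollector();
	ParseRobotsTxt(readText("robots.txt"), handler);
	writeln(handler.sitemaps);
}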

Documentation

See the generated docs. The example above is pretty much what you get, though.

The code supports a StrictSpelling version identifier that corresponds to the kAllowFrequentTypos global boolean in the original C++ code. Compiling with it disables some typo permissiveness (e.g., accepting "Disalow" for "Disallow"), though various other quirks are still tolerated. Otherwise the API matches the original C++ code.
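
For example, to build a dub project with StrictSpelling enabled, declare the version identifier in your dub.json (a minimal sketch; the dub.sdl equivalent is versions "StrictSpelling"):

{
	"dependencies": {
		"unrobotstxt": "~>0.1.0"
	},
	"versions": ["StrictSpelling"]
}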

Contributing

Bug fixes and misc. improvements are welcome, but make a fork if you want to extend/change the API in ways that don't match the original. I've named this project unrobotstxt to leave the robotstxt name available for a project with a more idiomatic API.

Authors:
  • Simon Arneaud
Dependencies:
none
Versions:
0.1.0 2020-Jul-03
~master 2020-Jul-03
Short URL:
unrobotstxt.dub.pm