Description

An HTML scraper command-line application

Package Information

Version: 1.1.2 (2016-Dec-06)
Repository: https://github.com/mab-on/dominator
License: MIT
Copyright: © 2016, Martin Brzenska
Authors: Martin Brzenska
Registered by: Martin Brzenska
Dependencies: libdominator

Installation

To use this package, add dominator as a dependency in the dependencies section of your project's dub.json or dub.sdl.

Readme

dominator

dominator is a forgiving HTML-parser for the command-line.

usage & examples

Parameters

| Parameter | short | Description |
| --- | --- | --- |
| --filter | -f | A dominator-specific filter expression |
| --output-item | -o | Defines the output |
| --output-item-terminator | -t | Character that terminates one item group in the output |
| --output-item-serparator | -s | Character that separates the items in the output |
| --input-file | -i | Read the input from a file instead of stdin |
| --with-html-comments | -c | Include matches found inside HTML comments in the output |
| --squash-whitespaces | -w | Squashes multiple consecutive whitespace characters. Only applies to the output items 'element-strip', 'element-inner' and 'element' |

--output-item: Valid arguments

| Argument | Description |
| --- | --- |
| tag | The name of the node |
| element-opener | The node's opening tag |
| element | The node's full content |
| element-inner | The node's full inner content |
| element-strip | The node's full inner content without tags |
| element-start | The position of the opening tag in the element |
| element-end | The position of the termination tag in the element |
| attrib-keys | A comma-separated list of the node's attributes |
| attrib(ATTRIB) | The value of the attribute ATTRIB of the node |

This example shows a query for a-tags that are children of a li-tag and have a class attribute with the value "link". We want the output to be "Tag"\t"Element attributes csv"\t"value of the element attribute href"\n for each hit:

$ cat ./dummy.html | ./dominator 'li.a{class:link}' -o'tag' -o'attrib-keys' -o'attrib(href)'
a	href,id,class	#a-1-li-1-o2-1
a	href,id,class	#a-2-li-2-o2-1
a	href,id,class	#a-3-li-2-o2-1

This example shows a query for a-tags where the href begins with "http":

$ cat ./dummy.html | ./dominator 'a{href:(regex)^http}' -o'tag' -o'attrib-keys' -o'attrib(href)'
a	href,id,class	https://github.com
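
Because every hit is written as one group of items, the output is easy to consume from a script. The following is a minimal Python sketch, not part of dominator: it assumes the items are separated by a tab and each group is terminated by a newline, as the sample output above suggests (these correspond to what -s and -t control). The `parse_hits` helper and the field names passed to it are hypothetical.

```python
def parse_hits(output, fields, separator="\t", terminator="\n"):
    """Split dominator-style output into one dict per hit.

    `fields` names the items in the order they were requested with -o.
    The tab/newline defaults are an assumption taken from the sample output.
    """
    hits = []
    for record in output.split(terminator):
        if not record:
            continue  # skip the empty chunk after the final terminator
        values = record.split(separator)
        hits.append(dict(zip(fields, values)))
    return hits


# Sample taken from the first example above.
sample = (
    "a\thref,id,class\t#a-1-li-1-o2-1\n"
    "a\thref,id,class\t#a-2-li-2-o2-1\n"
)
hits = parse_hits(sample, ["tag", "attrib-keys", "attrib(href)"])
for hit in hits:
    print(hit["tag"], hit["attrib-keys"].split(","), hit["attrib(href)"])
```

If an item can itself contain tabs or newlines (for example 'element'), choosing a different separator and terminator and passing the same characters to `parse_hits` keeps the records unambiguous.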

Filter Syntax

Expression = TAG[PICK]{ATTRNAME:ATTRVALUE}

Multiple expressions can be concatenated with "." to find nodes inside specific parent nodes.

| Item | Description | Example |
| --- | --- | --- |
| TAG | The name of the node | a , p , div , * |
| [PICK] | (can be omitted) Picks only the n-th match. n starts at 1. PICK can be a list or a range | [1] picks the first match , [1,3] picks the first and third , [1..3] picks the first three matches |
| {ATTRNAME:ATTRVALUE} | (can be omitted) The attribute selector | {id:myID} , {class:someClass} , {href:(regex)^http://} |
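
To make the grammar concrete, here is a minimal Python sketch that splits a filter expression into its segments. It illustrates the syntax described above and is not how dominator itself parses filters. The names `SEGMENT`, `parse_pick` and `parse_filter` are hypothetical, and the sketch makes some simplifying assumptions: tag names consist of word characters, "-" or "*", attribute values contain no "}", and a PICK list may mix single numbers and ranges.

```python
import re

# One segment: TAG, optional [PICK], optional {ATTRNAME:ATTRVALUE}.
SEGMENT = re.compile(
    r"(?P<tag>[\w*-]+)"
    r"(?:\[(?P<pick>[^\]]*)\])?"
    r"(?:\{(?P<name>[^:}]+):(?P<value>[^}]*)\})?"
)


def parse_pick(pick):
    """Expand '1', '1,3' or '1..3' into a list of 1-based match indices."""
    indices = []
    for part in pick.split(","):
        if ".." in part:
            first, last = part.split("..")
            indices.extend(range(int(first), int(last) + 1))  # range is inclusive
        else:
            indices.append(int(part))
    return indices


def parse_filter(expression):
    """Split a "."-chained expression into a list of segment dicts."""
    segments, pos = [], 0
    while True:
        match = SEGMENT.match(expression, pos)
        if match is None:
            raise ValueError(f"cannot parse segment at offset {pos}")
        pick = match.group("pick")
        name = match.group("name")
        segments.append({
            "tag": match.group("tag"),
            "pick": parse_pick(pick) if pick is not None else None,
            "attr": (name, match.group("value")) if name is not None else None,
        })
        pos = match.end()
        if pos == len(expression):
            return segments
        if expression[pos] != ".":
            raise ValueError(f"unexpected character at offset {pos}")
        pos += 1  # step over the "." that joins two segments


print(parse_filter("li.a{class:link}"))
print(parse_filter("p[1..3]"))
```

Scanning segment by segment, rather than splitting the whole string on ".", keeps the dots inside a range such as [1..3] from being mistaken for segment separators.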

Build & install

build

dub build dominator

Then copy the binary into one of your PATH directories.

use an already built binary

Check out the bin/ directory. Occasionally I put Windows and Mac binaries in this directory. Please be aware that these binaries are usually not up to date.

Available versions

1.1.2 1.1.1 1.0.6 1.0.3 1.0.2 1.0.1 1.0.0 0.9.0 ~master