The Compass DeRose Utilities
This page was written by Steven J. DeRose on 2008-08-23, and was last updated on 2011-03-13.
These utilities are mostly in Perl, and should work on most any platform. All of my utilities support a "-h" option that explains them, and a "-version" option to report their current version/date. Where relevant, most can accept and produce Mac, *nix, or DOS style line-breaks.
Remember to save any of these to a directory that's listed in your *nix path (say, ~/bin/), and remember that you'll need to make it executable with: chmod +x [filename]
Below are descriptions of some of the more interesting utilities.
- normalizeXML takes an XML document, and regularizes the markup. It can produce Canonical XML, but can also pretty-print in a very wide range of layouts. It can also delete various things (selected tags, elements, attributes, etc.), and can insert special data (mainly in new attributes), such as explicit xml:lang attributes, the FQGI (the list of all element type names from the root down to the current element), and so on.
- tab2xml provides an easy way to make HTML or similar tables from text files with tab- or otherwise-delimited fields. You pick the delimiter, the tags, XML vs. HTML syntax, whether characters are escaped or whole fields are quoted (even the unusual Microsoft syntax that allows newlines inside of quoted fields), and whether there's a header record with field names (if so, you can have the field names put onto the HTML td elements as class attributes, and it generates a table header row). It can also generate ids for the rows, indent the markup, and more. And if for some strange reason you want a different kind of field-delimited output instead of HTML or XML, it can do that too.
- xmlstats parses an XML document and collects a huge range of statistics about it. Various options control how extensive the statistics are.
- xmlparser is a small, non-validating XML parser. I wrote and debugged it in under 5 hours, after I heard a speaker comment that writing an XML parser was harder than they expected. Since we intended XML to be easy to implement, I just started writing, and had it basically working in about 3 hours 45 minutes. I think the only case it doesn't currently handle is entity references within attribute values. I think it has pretty good error reporting: it should catch and identify all well-formedness (WF) errors, and it tracks useful things like where a problematic element *started*.
- xmlpipe lets you run XSLT really easily within a pipe. It takes stdin and saves it to a temp file, generates a "copy" XSLT file and adds your rules to it, runs it, and puts the output back onto stdout. You can specify XSLT rule(s) on the command line, or refer to a file. Common rules, such as renaming tags, can be done with special short options: "xsltpipe -rename 'PARA>P section>sec chapter>chap'" is all it takes. It also provides an optional, very terse syntax so you can specify quite complex templates even on the command line; say 'xsltpipe -abbrevs' to get the whole scoop on that. It also knows about XML catalogs and can pretty-print the XML output.
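As a rough illustration of the kind of conversion the tab2xml entry describes, here is a minimal Python sketch. It is not the tab2xml script itself (which is Perl and has many more options); the function name tab_to_html, its parameters, and the "rowN" id scheme are all assumptions made up for this example. It handles only the simple case: one delimiter, an optional header record, class attributes taken from the field names, and generated row ids.

```python
from html import escape


def tab_to_html(text, delimiter="\t", header=True, indent="  "):
    """Convert delimited text into an HTML table (hypothetical helper)."""
    rows = [line.split(delimiter) for line in text.splitlines() if line]
    # When there is a header record, its fields become the column names.
    names = rows.pop(0) if header and rows else []
    out = ["<table>"]
    if names:
        head = "".join("<th>" + escape(n) + "</th>" for n in names)
        out.append(indent + "<tr>" + head + "</tr>")
    for number, fields in enumerate(rows, start=1):
        cells = []
        for i, field in enumerate(fields):
            # Use the header name as a class attribute when one exists.
            cls = ' class="' + escape(names[i], quote=True) + '"' if i < len(names) else ""
            cells.append("<td" + cls + ">" + escape(field) + "</td>")
        # The id scheme "rowN" is an assumption for illustration only.
        out.append(indent + '<tr id="row' + str(number) + '">' + "".join(cells) + "</tr>")
    out.append("</table>")
    return "\n".join(out)


if __name__ == "__main__":
    print(tab_to_html("name\tqty\nnuts & bolts\t3\n"))
```

Quoted fields, embedded newlines, and alternative output tags are deliberately left out; they are where most of the real work lies.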
Text and Unicode / Character-set Utilities
- ds double-spaces its input.
- dumpx displays the characters of a file in various ways, such as the octal, decimal, and/or hexadecimal equivalents. It is very similar to *nix "od", but IMHO has much nicer display, and can highlight occurrences of particular characters.
- addup adds up the n-th token (or all tokens) of each input line, and displays the total in octal, decimal, and hexadecimal. It can also calculate the product and average. If it finds ":" in tokens, it assumes they are time-durations (hh:mm[:ss]).
- align breaks each input line into fields based on a regular expression, finds the widest instance of each field (across all lines), and then copies the input, padding each field as needed so everything lines up. Justification of fields can be set manually or automatically (including decimal justification), and box-lines can also be drawn, using ASCII or Unicode box-drawing characters.
- bases will accept numbers in many different bases, and echo them back in a lot of bases. You type "0xDEADBEEF" and it will tell you what that is in decimal, hex, and binary. It can also recognize K, M, G, T, and P suffixes (Kilo, Mega, etc). And it will say what your number is if it can be interpreted as an ASCII or UTF-8 code point. Finally, you can ask it to treat the input as a number of elapsed seconds and convert it into hh:mm:ss form. Like ord and chr, it just does something you need now and then, no muss, no fuss.
- body is like *nix 'head' and 'tail', but lets you pick out any range of lines from the middle of the input. You can choose the start and end points by line number, character offset (global or within a line), or a regex search. Or you can ask for n lines starting from such a point. Like most of my scripts, this can handle any kind of input and output line-end types.
- chr displays the character(s) that correspond to a given number (or string of numbers). For control codes it displays the appropriate mnemonic (or optionally, long name) because the character itself is unprintable. The number may be provided in binary, octal, decimal, or hexadecimal, and may optionally be echoed back in all those bases.
- dropLineBreaks will delete the line-break preceding and/or following all input lines that match a given regular expression. Much easier than trying to coerce most regex processors into doing this. Can optionally replace those line-breaks with some other string, instead of just dropping them.
- iota counts. You can tell it to print out a list of numbers. Boring, right? Except, you can have them in different bases, or a sequence of letters instead, or even a sequence of characters by code point; and you can have them filled in to a template line. Ever need a whole block of lines that are the same except for filling in a sequence of values in one or two places? This makes it trivial.
- nonascii is one of the handiest tools out here. It will scan a file and report all instances of characters outside a specified range (despite the name, it doesn't have to just report non-ASCII characters). It can also do a frequency count of all characters, and can report unprintable and/or Unicode characters by name (is that cool or what?). Even cooler, you can tell it to find not just literal characters, but also HTML or XML named and numeric character references, special character escapes as coded in C, URIs, and several other places.
- ord displays the numeric code point that corresponds to a given ASCII or Latin-1 character, in binary, octal, decimal, and hexadecimal; it also displays the character's Unicode name. For control characters, it accepts the appropriate mnemonics. Characters that cannot be typed due to shell limitations are listed in the help available via -h. The number may be provided in binary, octal, decimal, or hexadecimal, and may optionally be echoed back in all those bases. It also has options to display an entire chart.
- randomrecords pulls out a certain number of random records from a text file. You can use it to random-sample almost anything.
- splitat inserts line-breaks before, after, or in place of all matches to some regular expression. Much easier than trying to coerce most regex processors into doing this. Can optionally indent non-first portions of the original lines.
- bases accepts a number in octal, decimal, or hex, and prints it in all of those plus binary. It also displays the ASCII or UTF-8 equivalent, can use units like Kilo-, Mega-, Giga-, etc., and can convert elapsed seconds to hh:mm:ss form.
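The core idea behind the align entry above (split each line into fields, find the widest instance of each field across all lines, then pad) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the align script itself: the function name align_lines and its parameters are made up here, every field is left-justified, and no box-lines are drawn.

```python
import re


def align_lines(lines, delimiter=r"\s+", pad=" "):
    """Pad fields so that columns line up (hypothetical helper)."""
    # Break each line into fields using the delimiter regular expression.
    rows = [re.split(delimiter, line.strip()) for line in lines]
    ncols = max((len(row) for row in rows), default=0)
    # First pass: find the widest instance of each field across all lines.
    widths = [0] * ncols
    for row in rows:
        for i, field in enumerate(row):
            widths[i] = max(widths[i], len(field))
    # Second pass: copy the input, left-justifying each field to its width.
    out = []
    for row in rows:
        padded = [field.ljust(widths[i]) for i, field in enumerate(row)]
        out.append(pad.join(padded).rstrip())
    return out


if __name__ == "__main__":
    for line in align_lines(["a bbb c", "dddd e ff"]):
        print(line)
```

Two passes are needed because no line can be padded until the widest field in each column is known, which is why a tool like this has to read all of its input before writing anything.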
The "shell" is what you're talking to when you run the Mac OS X "Terminal" program. It allows you to do an incredible range of things, most of which you already know if you use Linux, Unix (particularly BSD), etc.
- colorstring takes a color name, and returns the escape-string needed to switch the shell to displaying in that color (the color "default" switches back). You can have it just give you the escape string, to use in your own output from a script, or in a Bash prompt-string; or you can provide an optional message to print in the named color, to either stdout or stderr; or you can have it color whatever input it receives in one or more colors.
- hilite scans for various things in its input, and colorizes them. You can colorize the matches to any number of regular expressions, each in a specified color. You can colorize just the matches, or the entire lines that contain them. You can omit lines with no matches (thus making this handy as a grep for multiple expressions at once). hilite also has many options for predefined sets of regexes, such as to syntax-color XML or particular XML constructs, CSS, CVS output, diff output, man pages, URIs, etc.
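For readers unfamiliar with terminal color escapes, the following Python sketch shows the general mechanism that tools like colorstring and hilite rely on: standard ANSI SGR escape sequences, where ESC [ 31 m selects a red foreground and ESC [ 39 m restores the default foreground. The function names and the color table are assumptions for illustration; the actual scripts are Perl and may use different sequences or offer more colors.

```python
import re

# Standard ANSI SGR foreground color codes; "default" restores the
# terminal's default foreground color.
ANSI_COLORS = {
    "black": 30, "red": 31, "green": 32, "yellow": 33,
    "blue": 34, "magenta": 35, "cyan": 36, "white": 37,
    "default": 39,
}


def color_escape(name):
    """Return the escape string that switches to the named color."""
    return "\x1b[" + str(ANSI_COLORS[name]) + "m"


def hilite_matches(line, pattern, color="red"):
    """Wrap each match of pattern in the given color (hypothetical helper)."""
    on, off = color_escape(color), color_escape("default")
    return re.sub(pattern, lambda m: on + m.group(0) + off, line)


if __name__ == "__main__":
    print(hilite_matches("error: disk full", r"error"))
```

Coloring several expressions at once, each in its own color, would amount to applying a step like this once per pattern, with care taken not to match inside escape sequences that were already inserted.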
OSIS (Open Scripture Information Standard) Utilities
- osisCheck checks an OSIS XML file to make sure it has all the expected books, chapters, and verses, and in the right order. It has options for Hebrew vs. Greek Psalm numbering, whether to expect the Apocrypha, whether the text is TR-based or UBS/Nestle/WH, and so on.
- Moving your data from PalmOS to iPhone 3G: See Instructions. This is very easy for the Calendar and Contact/Address list; considerably harder for the to-do list and memos (but I provide some AppleScript that helps).
- HTML documentation of the Palm Desktop application (Mac OS X) AppleScript Dictionary is here.
- HTML documentation of the Apple iCal application (Mac OS X) AppleScript Dictionary is here.
- A Perl script, appleScriptDoc, that helps you turn an Apple Script Editor dictionary for some application, into more readable HTML documentation. Basically, you copy and paste the doc for each class, all into one big file, then run this script on it to make the HTML.
- findChild(n,name), findPSibling(n,name), findFSibling(n,name)
- tableSum(frow, fcol, lrow, lcol), Product, NCells, Min, Max
Back to home page of Steve DeRose
or The Bible Technologies Group.
or The Bible Technologies Group Working Groups. Or, contact me via email (fix the punctuation).