2012-03-05
I've just uploaded a signed version of my "Colemak UK" (colemauk) keyboard layout: colemauk installer (asc).
I remembered it wasn't signed when the Windows 8 beta started nagging me about it. I allocated a 5-minute task on my "to-do" list to fix it.
However, taking the generated binaries from before (and verifying them with GPG), signtool is perfectly happy to sign the MSIs and the DLLs, but setup.exe, the actual launcher that asks for elevation in the first place, gives:
$ signtool sign /a setup.exe
Done Adding Additional Store
SignTool Error: SignedCode::Sign returned error: 0x80070057
The parameter is incorrect.
SignTool Error: An error occurred while attempting to sign: setup.exe
Number of errors: 1
Oh.
(Extensive) investigation with STUD_PE reveals that the certificate table, the location where signtool expects to find existing certificates and write new ones, is full of junk: an address and a block that runs past the end of the file. While STUD_PE allows you to fix this by hand, I elected to write a tool to automatically strip evidence of signatures from files: unsigntool (github), the opposite of signtool.
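For the curious, the signature lives in a blob whose file offset and size are recorded in data directory entry 4 of the PE optional header (the "certificate table" STUD_PE shows). Here's a minimal sketch of the general idea of stripping it, written against the PE/COFF layout; this is my reconstruction of the approach, not unsigntool's actual code, and the class name is made up:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch only: zero the PE certificate-table directory entry so signing
// tools treat the file as unsigned. Offsets are from the PE/COFF spec.
public class StripCertTable {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile(args[0], "rw")) {
            byte[] header = new byte[4096];
            f.readFully(header, 0, (int) Math.min(header.length, f.length()));
            ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);

            int peOffset = buf.getInt(0x3C);              // e_lfanew: where "PE\0\0" lives
            int optHeader = peOffset + 4 + 20;            // skip PE signature + COFF header
            int magic = buf.getShort(optHeader) & 0xFFFF; // 0x10B = PE32, 0x20B = PE32+
            int dirBase = optHeader + (magic == 0x20B ? 112 : 96);
            int certEntry = dirBase + 4 * 8;              // data directory entry 4: certificates

            int certOffset = buf.getInt(certEntry);       // a raw file offset, not an RVA
            int certSize = buf.getInt(certEntry + 4);
            System.out.printf("certificate table: offset=0x%x, size=0x%x%n", certOffset, certSize);

            f.seek(certEntry);
            f.writeLong(0);                               // 8 bytes of zeroes: no certificate table
            // Optionally truncate the (possibly junk) certificate blob itself:
            // if (certOffset > 0 && certOffset <= f.length()) f.setLength(certOffset);
        }
    }
}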
2012-02-24
I've just uploaded a first release of tinyjar to github.
It takes a runnable jar and emits a much smaller runnable jar. For my current project, CODENAME GUJ:
- guj-maven-shade.jar: 14MB
- guj-maven-shade-minimised.jar: 7MB
- guj-tiny.jar: 2.8MB
Noting that guj-maven-shade-minimised.jar doesn't start (minimisation deletes half of Spring, as it can't see it's needed when it's only accessed via reflection), this is an 80% reduction. Not bad. It should work for any runnable jar.
TinyJar works by running the jar through pack200, then through LZMA. Neither is a new technology, but both are rather slow, even during unpacking. It adds at least a few seconds to application start-up as it unpacks the application to the temporary directory.
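As a rough illustration of the packing half of that pipeline (my sketch, not TinyJar's actual code: it assumes the XZ for Java library, org.tukaani:xz, which gives you LZMA2 inside an XZ container rather than raw LZMA, and the output file name is made up):
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.jar.JarFile;
import java.util.jar.Pack200;

import org.tukaani.xz.LZMA2Options;
import org.tukaani.xz.XZOutputStream;

public class PackSketch {
    public static void main(String[] args) throws Exception {
        // pack200 strips the redundancy out of the jar/class format...
        try (JarFile in = new JarFile("guj-maven-shade.jar");
             // ...and an LZMA-family compressor squeezes what's left.
             OutputStream out = new XZOutputStream(
                     new BufferedOutputStream(new FileOutputStream("guj.pack.xz")),
                     new LZMA2Options())) {
            Pack200.newPacker().pack(in, out);
        }
        // At start-up the reverse runs: decompress, Pack200.newUnpacker(),
        // write a temporary jar to disk, then launch it.
    }
}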
I thought this would be simpler than using a one-jar-style jar-in-jar classloader... on the stream coming out of the decompressor... in some way... etc.
2011-12-29
I'm paranoid, but also poor. I use gitolite to control access to my git repositories, because github wanted $200/month to meet half of my requirements, and wasn't interested in negotiating (I tried).
Like github, I have two types of git repositories: public repositories, which show up on gitweb, git-daemon and so on, and which everyone can access; and private repositories, which contain my bank details.
My conf file consists of:
A set of user groups: While gitolite supports multiple keys for one user, I prefer to treat my various machines as separate users, for reasons that'll become apparent later.
@faux = admin fauxanoia fauxhoki fauxtak
@trust = @faux alice
@semi = fauxcodd fauxwilf bob
A set of repositories, both public and private:
@pubrepo = canslations
@pubrepo = coke
@pubrepo = cpptracer
...
@privrepo = bank-details
@privrepo = alices-bank-details
Descriptions for all the public repositories, so they show up in gitweb:
repo coke
coke = "Coke prices website"
repo cpptracer
cpptracer = "aj's cppraytracer, now with g++ support"
And permissions:
repo @pubrepo
RW+ = @trust
RW = @semi
R = @all daemon gitweb
config core.sharedRepository = 0664
repo @privrepo
RW+ = @trust
This allows trusted keys to do anything, and semi-trusted keys (i.e. ones on machines where there are other people with root) to only append data (i.e. they can't destroy anything, and can't make any un-auditable changes).
Next, to protect against non-root users on the host itself, I have $REPO_UMASK = 0027; in my .gitolite.rc. This makes the repositories themselves inaccessible to other users. However, gitweb needs to be able to read public repositories; the above config core.sharedRepository = 0664 does this.
This leaves only /var/lib/gitolite/projects.list (which is necessary as non-git users can't ls /var/lib/gitolite/repositories/, so gitweb can't discover the project list itself), and repositories/**/description, again for gitweb.
For this, I have a gitolite-admin.git/hooks/post-update.secondary of:
#!/bin/sh
# After every push to gitolite-admin, make the project list and the
# per-repository description files world-readable so gitweb can see them.
chmod a+r /var/lib/gitolite/projects.list
find /var/lib/gitolite -name description -exec chmod a+r {} +
Now, gitweb can display public projects fine, and local users can't discover or steal private repositories.
2011-10-25
Natural language is horrible. Unicode is an attempt to make it fit inside computers.
I'm going to make up some terms:
- Symbol: A group of related lines representing what English people would call a letter
- Glyph: A group of related lines that might be stand-alone, or might be combined to make a symbol
And use some existing, well-defined terms. If you use one of these wrongly, people will get hurt:
- Code point: A number between 0 and ~1.1 million that uniquely identifies a glyph. They're written like U+0000, and also have names,
- Encoding: A way of converting information to and from a stream of bytes,
- Byte: The basic unit of storage on basically every computer; an octet, 8 bits. This has nothing to do with letters, or characters, or... etc.
Let's start at the top:
- Here's a lower case 'a' with a grave accent: à. This is a symbol.
- It could be represented by a single glyph, the code point numbered "U+00E0" and named "Latin Small Letter A With Grave", like above,
- It could be represented by two glyphs.
- This looks identical, but is actually two glyphs; an 'a' (U+0061: "Latin Small Letter A") followed by a U+0300: "Combining Grave Accent". These two glyphs combine to make an identical symbol.
- This is, of course, pointless in this case, but there are many symbols that can only be made with combining characters. Normalisation is the process of simplifying these cases.
- Don't believe me? Good. Not believing what you see is an important stage of debugging. Open my grave a accent test page, and copy the text into Notepad (or use the provided example), or any other Unicode-safe editor, and press backspace. It'll remove just the accent, and leave you with a plain 'a'. Do this with the first à and it'll delete the entire thing, leaving nothing. See? Different.
- Side note: The test used to be embedded into this post, but the blog software I use is helpfully normalising the combining character version back to the non-combining-character version, so they actually were identical. Thanks, tom!
- So, let's assume we're going with the complex representation of the symbol, the two code points: U+0061 followed by U+0300. We want to write them to any kind of storage, be it a file, or a network, or etc. We need to convert them to bytes: Encoding time.
- Encodings generate "some" bytes to represent a code point. It could be anywhere between zero and infinity; there's really no way to tell in general. Common encodings, however, will generate between one and four bytes per code point. Basically everyone uses one of the following three encodings:
- UTF-8: generates between one and four bytes, depending on the number of the code point. Low-numbered code points use fewer bytes, and are common in English and European languages. Other languages will generally get longer byte sequences. Common around the Internet and on Linux-style systems.
- UTF-16: generates either two or four bytes per code point. The vast majority of real languages in use today fit in two bytes. Common in Windows, Java and related heavy APIs.
- I have no idea what I'm doing: Anyone using anything else is probably doing so by mistake. The most common example of this is ISO-8859-*, which means you don't care about non-Western-European people, i.e. 80% of the people in the world. These generate one byte for every code point, i.e. junk for everything except ~250 selected code points.
- Let's look at UTF-8 and UTF-16 in practice; there's a worked example in code after this list.
- UTF-16 is much easier to recognise and much harder to confuse with other things as, for most Western text (including XML and..), it'll be over 40% nulls (0x00, 0b00000000).
- Now that you've done your conversion, you can write the bytes to your file or network, and ensure that whoever is on the other end has enough information to work out what format the bytes are in. If you don't tell them, they'll have to guess, and will probably get it wrong.
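Here's the worked example promised in the list above: the same symbol as one code point and as two, and the bytes each form becomes under UTF-8 and UTF-16 (a quick sketch in Java; the byte values in the comments are what the standard encoders produce):
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class GraveExample {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x ", b & 0xFF));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String precomposed = "\u00E0"; // U+00E0: Latin Small Letter A With Grave
        String combining = "a\u0300";  // U+0061 followed by U+0300: Combining Grave Accent

        System.out.println(hex(precomposed.getBytes(StandardCharsets.UTF_8)));    // c3 a0
        System.out.println(hex(combining.getBytes(StandardCharsets.UTF_8)));      // 61 cc 80
        System.out.println(hex(precomposed.getBytes(StandardCharsets.UTF_16LE))); // e0 00
        System.out.println(hex(combining.getBytes(StandardCharsets.UTF_16LE)));   // 61 00 00 03

        // They render identically, but only compare equal after normalisation:
        System.out.println(precomposed.equals(combining));                        // false
        System.out.println(precomposed.equals(
                Normalizer.normalize(combining, Normalizer.Form.NFC)));           // true
    }
}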
In summary:
- Have some data, in bytes? Don't pretend it's text, even if it looks like it is; find out, or work out, what encoding it's in and convert it into something you can process first. It's easy to detect whether it's UTF-8, UTF-16, or whether you have a serious problem.
- Have some textual information from somewhere? Find an appropriate encoding, preferably UTF-8 or UTF-16, to use before you send it anywhere. Don't trust your platform or language; it'll probably do it wrong.
- Can't work out what's in a file? Run it through xxd and look for nulls and for bytes with the high bit set (0x80 and above). This'll quickly tell you whether it's UTF-16, UTF-8, or effectively corrupt. (A rough sketch of this check in code follows this list.)
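And the promised sketch of that last check (a crude heuristic for illustration, not a real charset detector; the 30% threshold is arbitrary):
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class GuessEncoding {
    public static String guess(byte[] data) {
        long nulls = 0;
        for (byte b : data) if (b == 0) nulls++;
        // Western text in UTF-16 is roughly half null bytes.
        if (data.length > 0 && nulls * 100 / data.length > 30) {
            return "probably UTF-16";
        }
        try {
            // A strict UTF-8 decode either succeeds or tells you it's junk.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return "valid UTF-8 (which includes plain ASCII)";
        } catch (CharacterCodingException e) {
            return "neither: you have a serious problem";
        }
    }
}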
Hopefully that's enough information for you to know what you don't know.
For more, try Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
2011-10-09
I've released an updated version of PuTTY Tray to puttytray.goeswhere.com; direct download: putty.exe p0.61-t004 (please see the site for the latest version and details).
This is a fork of Barry Haanstra's PuTTY Tray, which is abandoned.
Main advantages:
- Now built against PuTTY 0.61, getting features like Windows 7 Jumplist and Aero support, and four years of core PuTTY development
- Ctrl+mousewheel zoom support
- URL detection works on URLs ending with close-brackets
- Much easier to continue developing: the build script generator works, and source, issue and pull-request tracking are provided by github.
Please raise a bug if you have any problems or requests!
2011-09-04
Summing the integers from 1 to 10,000,000?
- Perl: 1.01 seconds
- Python: 2.04 seconds
- Java, including javac: 0.76 seconds
- Java, including ecj: 0.61 seconds
- Java, including javac, with -Xint: 1.01 seconds.
- Java, including javac, with -Xint on the compiler too: 1.16 seconds.
- -Xint disables practically all optimisations that Java offers, forcing the JVM into interpretation mode, so it'll operate much like perl and python do.
- ecj is Eclipse's compiler for Java, a faster and cleaner implementation of javac that can run standalone.
i.e. even including compilation time, with the compiler itself running entirely unoptimised, Java is still twice the speed of Python.
(This isn't really interesting or surprising to me, but the question comes up often enough that I'd like to have these here to link WRONG people to.)
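For reference, the Java version is roughly this shape (my reconstruction of the benchmark, not the original code):
public class Sum {
    public static void main(String[] args) {
        // The timings above include compiling this and starting the JVM,
        // not just running the loop.
        long sum = 0;
        for (int i = 1; i <= 10000000; i++) {
            sum += i;
        }
        System.out.println(sum); // 50000005000000
    }
}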