Faux' blog

2013-05-18

xqillac

xqillac is a fork of the xqilla command-line XQuery processor to.. make it work better from the command line. XQuery is a language for searching and editing XML; a superset of XPath.

For example, let's say you have some XML coming along a shell pipeline:

<a>
  <b id="5">hello</b>
  <b id="7">world</b>
</a>

..and you want just the text of the b with id 5:

$ printf '<a><b id="5">hello</b><b id="7">world</b></a>' |
    xqilla -i /dev/stdin <(echo "data(//b[@id='5'])")
hello

Groovy. Much safer than trying to do this kind of thing with regexes or cut or whatever.

However, as you can see, this involves horrible abuse of the shell "<(" operator (which turns a command's output into a file (..well, close enough, right?)), and of /dev/stdin.

In xqillac, this is just:

$ printf '<a><b id="5">hello</b><b id="7">world</b></a>' |
    xqillac "data(//b[@id='5'])"

The above shell hack also fails (horribly: "Caught unknown exception"... or worse) if you attempt to use XQuery Update to edit the document in the pipeline.

xqillac allows this (for documents that fit in memory):

$ printf '<a><b id="5">hello</b><b id="7">world</b></a>' |
    xqillac "delete nodes //b[@id='5']"
<a><b id="7">world</b></a>
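For comparison only, here is a rough sketch of the same two operations using Python's standard library. This is not how xqilla or xqillac work internally; ElementTree only understands a small XPath subset (no XQuery, no XQuery Update), and the helper names text_of_b and delete_b are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

DOC = '<a><b id="5">hello</b><b id="7">world</b></a>'


def text_of_b(xml_text: str, wanted_id: str) -> Optional[str]:
    """Return the text of the first <b> whose id attribute equals wanted_id."""
    root = ET.fromstring(xml_text)
    # Assumes wanted_id contains no quote characters (no escaping is done here).
    node = root.find(f".//b[@id='{wanted_id}']")
    return None if node is None else node.text


def delete_b(xml_text: str, unwanted_id: str) -> str:
    """Return the document with every <b id=unwanted_id> element removed."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so visit each element and prune
    # its matching children. The root element itself is never removed.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "b" and child.get("id") == unwanted_id:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


print(text_of_b(DOC, "5"))  # hello
print(delete_b(DOC, "5"))   # <a><b id="7">world</b></a>
```

Like xqillac's update mode, this loads the whole document into memory, so it only suits documents that fit there.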

The code is on xqillac's GitHub. Please use GitHub's issues and pull requests.

2013-05-18

Browser usage stats

I like StatCounter's browser usage statistics, even if there are outstanding arguments about how they deal with data from multiple countries (i.e. if you're using them for decisions, you're probably much better off taking the stats for your target country, rather than the global or regional stats... but you'd probably be doing that anyway).

I dislike, however, the way they handle browser versions. I have always thought that they have more data than they display, and do a very poor job of actually showing it to you. For example, the Browser Versions graph, at the time of writing, has, at one point, 60+% of browsers in "uncategorised". How does that even happen?!

Browser Version (Partially Combined) makes a better stab at useful information, but is still lacking so much.

So... I wrote my own graph.

This picks out some reasonably interesting features:

The huge black wedge between 2010-02 and 2011-04, in the Firefox area, is Firefox 3.6. You can see it absolutely refusing to die for the following few years, despite people using more modern versions.
The relentless Chrome release schedule.
How long of a tail IE has on releases being picked up by users, and how IE10 is doing better (automatic updates, perhaps?)

This is generated by browserstats. Yes, that has a lot of data archived from StatCounter. Yes, the code is awful and it's a pretty manual process to update the graph. Feel free to fix it.

2013-05-18

foo_disccache

(I've been saving up a load of tiny projects for one mega blog post. Apparently that's not working. Small project avalanche!)

foo_disccache (possibly more correctly spelt foo_diskcache) is a small foobar2000 component to help smooth playback on machines with plenty of memory.

It tries to trick the OS into caching some of the files that are soon to be played, so playback won't need to stall (waiting for the disc) when the track comes to play. This additionally allows the drive the music is being read from to spin down (saving noise/power) more frequently, and will assist with uninterrupted music playback when the system is under heavy IO load.
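The component's actual mechanism isn't described here, so treat the following as a guess at the general idea rather than its implementation: on most operating systems, simply reading a file and throwing the bytes away is enough to pull it into the OS file cache. The function name warm_cache and the 1 MiB chunk size are assumptions for this sketch.

```python
from pathlib import Path
from typing import Iterable


def warm_cache(paths: Iterable[str], chunk_size: int = 1 << 20) -> int:
    """Read each file from start to end and discard the data.

    The read itself is the point: it gives the OS a reason to keep the
    file's contents in memory. Returns the total number of bytes read.
    """
    total = 0
    for path in paths:
        with Path(path).open("rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    return total
```

A player would call something like this on the next few tracks in the queue shortly before they are due to play. Whether the data stays cached is up to the OS, which is why it's a trick and not a guarantee.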

If you don't understand any of that, you almost certainly don't care.

Download and installation instructions are at GitHub. Please use GitHub's issue tracker or wiki for discussion, not comments.

2012-09-05

Frameworks, libraries and graphs

The Framework Graph

In his Serious Beard Talk, delivered at DjangoCon 2008, Cal Henderson explains how he thinks frameworks affect productivity. I would like to bring it up again, because it's still as relevant as ever, and it's hard to link busy software engineers to YouTube videos (of things that aren't cats).

A framework (or library?) exists to save you from doing common, repetitive tasks (he's thinking of things like mapping URIs to places, and reading and writing entities to the database) by providing a bunch of pre-existing code that'll do all this stuff for you:

If you decide to write all this stuff from scratch, you're going to be spending a lot of time developing boring, repetitive code (but maybe you're still doing it better than other people?), and you're not going to be delivering anything fancy. Eventually, though, you'll get up to speed, and your well-understood, targeted framework/library will assist you in every way.

If you pick the framework route, it takes you a short amount of time to get one of the damn demos to start, then you get a whole load of stuff for free, but, eventually, you run into things the framework can't or won't do for you, and you have to work out how to ram your workflow down its stupid throat aargh. Your delivery speed plummets, but eventually you'll get it working, and you'll have some free good design and code reuse in there. Now, with a decent grounding in how the framework works, and your extra flexibility layer, you can get back to the good stuff.

This results in roughly The Framework Graph, shown here, reproduced (in Paint) without permission.

I believe an unawareness of this Graph comes up frequently, even when not discussing frameworks directly:

I don't want to use a big complex library, it'll be simpler to do it myself.
I can't get it to do quite what I want, I'll start again from scratch.
I don't really understand why it makes this operation so hard, the tool must be broken. (Hibernate, anyone?)
Why does it want all those parameters/config/values? I just want it to work.

I can pick many examples from my non-professional experience where this has come up; my favourite is (coincidentally!) That Graph Drawing Code:

I have two forks of some disastrously [old and] terrible plotting code for generic popularity data visualisation and political compass aggregation, and I have recommended (and/or forced) other people to use the same technique in their projects, mainly to avoid the tyranny of gnuplot.

For a more recent project plotting hard drive prices, I chose to finally give in and learn gnuplot. It was a pain, but the resulting gnuplot code is much neater than any of that PHP, and I've got things like surprise SVG support for free. Plus, it means that I can quickly bang out other things, like my StatCounter browser versions reinterpretation.

This may not sound like a framework choice on the scale that he's talking about, but I urge you to consider the sheer horror of that PHP code.

In summary: The big horrible library/framework/tool/etc. will almost certainly provide you with more total productivity in the short (when most projects fail, at about the vertical line on the Graph), medium and long term, regardless of any pain it gives you.

2012-06-06

Password policy

In light of today's supposed LinkedIn breach, it seems like an appropriate time to finally write up my password policy.

Many people have cottoned on to the idea that having the same password on different sites is a bad idea. There are various technical solutions to this, such as tools that generate a site-specific password. I, however, believe these schemes to be too inconvenient; they require you to always have access to the site or tool, and don't work well in public places.

What we're really trying to do here is:

Have different passwords on different sites.
Have passwords that are (very) hard to guess.
Be as lazy as possible.

What the first means is: If an attacker is given my password for a specific site, they can't easily derive the password for any other site. I am willing to risk the chance of them retrieving the password for multiple sites.

My proposal is to have a way to generate secure, site-specific passwords in one's head:

Remember an excessively long password.
Come up with some way to obscure the site name.
Put the obscured site name in the middle of your long password, and use that password for the site.

That is:

Remember an excessively long password: 14 characters is a good start. (My) pwgen can help you come up with suggestions. Note that this password doesn't need to be full of capitals, numbers or symbols; the sheer length makes it secure. "c8physeVetersb" is around a thousand times "more secure" (higher entropy) than "A0Tv|6&m".
Come up with a set of rules to obscure the site name: For example, take the "letter in the alphabet after the first character of the site name", and "the last character of the site name, in upper case". e.g. for "amazon", the obscured version of the site name would be "bN".
Mix them together: e.g. I'm going to insert the first bit, 'b', after the 'V', and the second bit, 'N', after the last 's', giving me "c8physeVbetersNb".
Use this password on Amazon.

Even if Amazon are broken into, all the attacker will get (after many CPU-decades of password cracking) will be "c8physeVbetersNb", which, even if they know you're using this password scheme (but not the details of your transformation), doesn't tell them anything about your password on any other site.

All you have to do is remember the alphabet (uh oh).
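The whole point is to do this in your head, but writing the example rules down as code makes them unambiguous. This is only a sketch of the example transformation above; the function names and the first_after / second_after parameters are made up here, and it assumes the site name starts with a letter a-z (wrapping 'z' round to 'a' is my assumption, the rule above doesn't say).

```python
from typing import Tuple


def obscure_site(site: str) -> Tuple[str, str]:
    """Apply the two example rules to a site name.

    Rule 1: the letter in the alphabet after the site's first character.
    Rule 2: the site's last character, in upper case.
    """
    first = site[0].lower()
    # Assumption: 'z' wraps round to 'a'.
    next_letter = chr((ord(first) - ord("a") + 1) % 26 + ord("a"))
    return next_letter, site[-1].upper()


def site_password(base: str, site: str,
                  first_after: str = "V", second_after: str = "s") -> str:
    """Mix the obscured site name into the long base password."""
    head, tail = obscure_site(site)
    # First bit goes straight after the first occurrence of first_after.
    i = base.index(first_after) + 1
    mixed = base[:i] + head + base[i:]
    # Second bit goes straight after the LAST occurrence of second_after.
    j = mixed.rindex(second_after) + 1
    return mixed[:j] + tail + mixed[j:]


print(obscure_site("amazon"))
print(site_password("c8physeVetersb", "amazon"))
```

Obviously, don't use these exact rules (or this exact base password); pick your own, and keep them out of any file.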

2012-05-15

Repetitive crypto miscellany

HTTPS (HTTP over TLS) is the most accessible form of encryption for end users. It protects against real annoyances and attacks. I believe it's probably the most important thing to advocate, even among developers.

Paranoid internet user? Google and DuckDuckGo will run your search results over TLS. Some websites, like Facebook, allow you to specify that you always want to use HTTPS. You should.
Want more? HTTPS Everywhere is a Chrome/Firefox extension that tries to upgrade your connection to HTTPS on any website where it's available.
Host a website? HTTPS is free, easy to set up and isn't CPU intensive any more (for typical sites). While you're there, enable HTTP Strict Transport Security (HSTS).
Yes, CAs sucking ruins some of this.

Cryptography, contrary to what you may have heard, is easy:

Data at rest? GPG. Data in motion? TLS.
You never, ever, ever want to use a "hash function" or a "cipher" directly. Ever.
Storing details about passwords? "Oh, I'll hash them with a hash function? Lots!" No. Use PBKDF2 (with 50,000 or more iterations), bcrypt or scrypt.
Offering any kind of integrity? "Oh, I'll use a hash function?" No. HMAC.
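As a rough sketch of the last two points, Python's standard library already ships both PBKDF2 and HMAC, so there is no excuse for rolling your own. The function names, the 16-byte salt and the choice of SHA-256 underneath are my assumptions; the iteration count is just the floor suggested above.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 50_000  # the floor from the text; more is better


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Derive a storable digest from a password with PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per password (assumed size)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def check_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


def tag_message(key: bytes, message: bytes) -> bytes:
    """Integrity tag for a message: HMAC-SHA256, not a bare hash."""
    return hmac.new(key, message, hashlib.sha256).digest()


def check_message(key: bytes, message: bytes, tag: bytes) -> bool:
    """Verify a tag in constant time."""
    return hmac.compare_digest(tag_message(key, message), tag)
```

Store the salt and digest together for each user; the key for tag_message must be a secret that only the parties checking integrity know.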

Rough overview of primitive deprecation:

MD5 has been deprecated for all uses since last century. Why do people still use it for anything? Please mock anyone who does.
SHA-1 (from 1995) has been deprecated for most uses since 2010. Please don't use it for anything new, and start migrating away from it.
RC4 was designed in 1987 (25 years ago!), but is still supported everywhere because Windows XP (2001 technology) doesn't support AES for TLS. It has no other advantages.
Don't compromise security for speed. Why bother, if it's not going to be secure? Don't use old benchmarks for your decisions. My five year old computer is about twice as fast as that. Run real benchmarks, yourself. Want something actually fast? Use Salsa20/X.
Basically: SHA-2 (or 3!), AES (in CBC or CTR mode) or, if you're desperate, Salsa20.

Rough overview of key recommendations:

2⁸⁰ security: 160-bit SHA-1, 1024-bit modulus on public keys (thanks, GNFS), 160-bit EC keys. Attacking 2⁸⁰ combinations could plausibly be done in a human lifespan on a supercomputer or two; not enough.
2¹¹² security: 2048-bit modulus on public keys, currently believed to be okay until 2020-2030.
2¹²⁸ security: 256-bit SHA-2, 256-bit EC keys, 3072-bit modulus on public keys, etc. are likely to be fine for the foreseeable future.
2²⁵⁶ security: 512-bit SHA-2, 512-bit EC keys, 15360-bit modulus on public keys. That's a big step up.
128-bit AES falls somewhere into the middle, i.e. use 192-bit AES after 2020.
Note that these dates are when you expect your data to still be relevant, or your system in use; not when you plan to design or release the thing.
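To put numbers on those levels, here is a back-of-the-envelope sketch of how long it takes to try every combination. The rate of 10¹⁵ guesses per second is purely my assumption (think "a supercomputer or two"), and it pretends each guess costs a single operation, which flatters the attacker.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# Assumption: a petascale machine managing one guess per operation.
ASSUMED_GUESSES_PER_SECOND = 1e15


def years_to_exhaust(bits: int, guesses_per_second: float) -> float:
    """Years needed to try all 2**bits combinations at the given rate."""
    return 2 ** bits / guesses_per_second / SECONDS_PER_YEAR


for bits in (80, 112, 128, 256):
    years = years_to_exhaust(bits, ASSUMED_GUESSES_PER_SECOND)
    print(f"2^{bits}: about {years:.3g} years")
```

At that assumed rate, 2⁸⁰ comes out at roughly 38 years, which is where "a human lifespan" comes from. Each extra bit doubles the time, so 2¹¹² is already billions of times worse for the attacker.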
