2015-04-22

Fat data and Postgres

I spend a reasonable portion of my life dealing with Medium Data: columnar files (typically CSV) you can easily fit on disc, can easily fit in disc cache after compression, and can probably fit in RAM after some pre-processing; files slightly too big to open in modern versions of Excel, and far, far too big for LibreOffice Calc or Apple's Pages.

Sometimes, I like to load these files into a database table to have a look around; say, to look at the distinct values in some columns. I was slightly surprised to find, however, that most database engines have a column limit around the 1,000-2,000 column mark.

I've got a fat dataset, ~2,000 columns, 600MB total. Let's load it into Postgres anyway.

My first attempt was to insert it directly from psycopg2 (all operations inside a single transaction):

for line in file:
    cur.execute("""insert into big values (%s)""", (line,))

This results in binary (bytea) values all over your db (thanks, Python!), and takes ~90s.

Fixing:

cur.execute("""insert into big values (%s)""", (line.decode('utf-8'),))

...gets it down to 60s. I'm not sure where the performance difference comes from. More efficient/aggressive compression when TOASTing the non-binary text, even though it's bit-identical (as all the data is close enough to low-ascii)? More likely, the wire format and/or the driver code for strings has had more love.

Right! Data's in the db. Can't turn it into columns directly, so ... wait, there's no limit on array lengths!

create table arrays as select regexp_split_to_array(line, E'\t') arr from big;

... yeah. Well, it was a good idea. I didn't wait for this to complete, but I estimate 1h45m. Not sure what's slow here... maybe compiling the regexp, or it not expecting such a large array to be returned, as might happen if it were heavily optimised for statements like:

... regexp_split_to_array(line, E'\t')[3] ...

Never mind. Python can probably do a better job!

cur.execute("""insert into arr values (%s)""", (line.decode('utf-8').split('\t'),))

Much to my surprise, Python actually does do a better job: 8m55s, with around 50% of the time spent in Python, so it would probably be a lot faster with a non-naive implementation, or after a JIT had fixed it up.
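One detail worth noting with the naive split (easy to verify without a database): each line read in binary mode still carries its trailing newline, which ends up glued to the last array element unless you strip it first. A quick illustration:

```python
# a line read in binary mode still carries its newline; the naive
# split leaves it glued to the last field
line = b'a\tb\tc\n'
assert line.decode('utf-8').split('\t') == ['a', 'b', 'c\n']

# stripping the newline first keeps the last column's values clean
assert line.decode('utf-8').rstrip('\n').split('\t') == ['a', 'b', 'c']
```

Since every line (except perhaps the last) ends the same way, distinct counts mostly survive, but rstrip-before-split keeps the last column comparable.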

This table is actually usable:

select max(array_length(line, 1)) from arr;
...
Execution time: 1971.869 ms

Good sign! Right, now, on to some data analysis. Let's look at the distinct values in each column:

echo select $(for i in {1..1959};echo "count(distinct line[$i]) line$i,") \
    | sed 's/,$/ from arr;/' \
    | psql -U big -x

For those that don't read shell, or are confused by the zshisms, this generates:

select count(distinct line[1]) line1, count(distinct line[2]) line2, ... from arr;

And it returns:

ERROR:  target lists can have at most 1664 entries

I actually had to look this error up. You can't select more than that totally random number of results at a time. Bah! I set it going on 1,600 columns, but also got bored of that running after around 20m.
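Chunking the target list to stay under that limit is easy enough to script; a hedged Python sketch (table and column naming as above, with 1664 taken from the error message):

```python
def chunked_queries(ncols, limit=1664, table='arr'):
    # yield SELECT statements whose target lists stay under
    # Postgres's 1664-entry limit
    for start in range(1, ncols + 1, limit):
        end = min(start + limit, ncols + 1)
        exprs = ', '.join('count(distinct line[%d]) line%d' % (i, i)
                          for i in range(start, end))
        yield 'select %s from %s;' % (exprs, table)

queries = list(chunked_queries(1959))  # two statements for 1,959 columns
```

Each statement can then be fed to psql independently.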

It's worth noting at this point that Postgres executes each query on a single thread, and that this isn't considered a serious problem. It's not ideal for many of my use cases, but it's also not all that hard to get some parallelism in the client.

Let's parallelise the client:

P=8
COLS=1959
echo {1..$COLS} | xargs -n $(($COLS/$P+$P)) -P $P \
    zsh -c 'echo select $(for i in "$@"; echo "count(distinct line[$i]) line$i,") \
        | sed "s/,\$/ from arr;/" \
        | psql -U big -x' zsh

(I really need to find a good way to manage parallelism in shell without xargs. Wait, no. I really need to stop writing things like this in shell.)

This takes around 17m total, and consumes a minimum of 20gb of RAM. But, at least it works!
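The same fan-out is arguably saner in Python; a sketch using concurrent.futures, where run_chunk is a hypothetical stand-in for building one chunked query and running it via psycopg2 or psql:

```python
from concurrent.futures import ThreadPoolExecutor

COLS, PROCS = 1959, 8

def run_chunk(cols):
    # hypothetical stand-in: build the count(distinct line[i]) query
    # for these columns and run it via psycopg2 or psql
    return list(cols)

# split 1..COLS into one contiguous chunk per worker
per = COLS // PROCS + 1
chunks = [range(i, min(i + per, COLS + 1))
          for i in range(1, COLS + 1, per)]

with ThreadPoolExecutor(max_workers=PROCS) as pool:
    results = list(pool.map(run_chunk, chunks))
```

Threads are fine here because the work is IO-bound waiting on the server; the server side is still doing eight single-threaded scans either way.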

For comparison:

  • pandas can ingest and compute basic stats on the file in ~1m / 6gb RAM (although it's cheating and only supporting numerics).
  • shell (cut -f $i | sort -u | wc -l) would take about 1h20m.
  • Naive Java implementation took me 4m to write from scratch, and takes about 35s to compute the answer.
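The naive implementation is the same shape in any language: one set of seen values per column, one streaming pass over the file. A Python sketch of that shape (not the actual Java code):

```python
def distinct_counts(lines):
    # one set of seen values per column, built in a single streaming pass
    seen = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        # grow the per-column sets if this row is wider than any before
        while len(seen) < len(fields):
            seen.append(set())
        for i, field in enumerate(fields):
            seen[i].add(field)
    return [len(s) for s in seen]
```

For example, distinct_counts(['a\tb\n', 'a\tc\n']) returns [1, 2].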

In summary: Don't use Postgres for this, and Java wins, as usual.



2015-04-10

LXC, dnsmasq and nginx

I have started using lxc extensively. It's reasonably easy to set up lxc on modern Ubuntu (I would argue that it's easier than Docker), and, with its ability to run as a limited user, it's much more security friendly. It also gels much better with the way I think about virtualisation and services; I don't want to be locked in to a vendor-specific way of thinking or deploying. All my tools and my muscle memory already understand ssh and shell.

A (long, oops) while ago, I spent a while diagnosing an interesting issue. Upon reboot, nginx would be unable to talk to some lxcs, but not others. This would persist for an annoyingly long time; restarting things would have very little effect.

It turned out that dnsmasq, which the Ubuntu setup uses for dhcp and dns resolution on containers, has some very non-ideal behaviour (spoilers). Let's have a look.

First, let's try and resolve a machine that's turned off:

% lxc-ls -f example
NAME    STATE    IPV4  IPV6  GROUPS  AUTOSTART
----------------------------------------------
example STOPPED  -     -     -       NO

% dig -t A example @10.0.3.1; dig -t AAAA example @10.0.3.1

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 44338
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;example.               IN  A

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 23299
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;example.                IN  AAAA

Here, we can see dnsmasq (listening on 10.0.3.1) report that the name is not found. My understanding of DNS, at the time, was that this was correct. You asked for a domain that didn't exist, and you get a NXDOMAIN, right? Turns out this is wrong, but let's go with my original understanding for now.

Let's start the machine, and see that it now has a name:

% lxc-start -n example
% lxc-ls -f example
NAME    STATE    IPV4       IPV6  GROUPS  AUTOSTART
---------------------------------------------------
example RUNNING  10.0.3.52  -     -       NO


% dig -t A example @10.0.3.1

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56889
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; ANSWER SECTION:
example.         0   IN  A   10.0.3.52

Yep! It's picked up the name, as expected. Instantly. No cache timeouts or anything... after all, the thing managing the DHCP is the same as the thing managing the DNS, so it should all be instant, right?

Let's check AAAA again:

% dig -t AAAA example @10.0.3.1

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5445
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;example.                IN  AAAA

Still returning NXDOMAIN, as I was expecting.

However, in this state, nginx will frequently fail to resolve the name "example".

nginx appears to send both a v4 (A) and a v6 (AAAA) request to the configured DNS server simultaneously, and will act on whichever response it sees first. Again, I thought this would be okay: it will ignore the missing v6 record and continue with the v4 record, right?

Well, no. That's not how DNS works. In DNS, NXDOMAIN does not mean "I don't know". It means "this definitely doesn't exist". This is not the appropriate error for something which is unknown. nginx sees the "this definitely doesn't exist" v6 response, and hence ignores the v4 response: it doesn't exist, so couldn't possibly have a v4 response.

What dnsmasq is doing is forwarding the query for unknown names to the upstream DNS server, which is correctly informing us that it will never give an answer for a single-word name. dnsmasq is caching this response, separately for v6 and v4. When the lxc boots, and gets its DHCP/DNS name assigned for v4, dnsmasq doesn't purge the NXDOMAIN cache entry for v6. I would call this a bug, but the people who understand DNS say otherwise.

The solution is simple: start dnsmasq with the domain-needed option. This prevents it from forwarding requests for single words (like example) to the upstream server, so it never sees an NXDOMAIN, and you don't get into this weird state.

On the Ubuntu setup, I am doing this by pointing /etc/default/lxc-net at an extra dnsmasq config file:

LXC_DHCP_CONFILE=/etc/dnsmasq-lxc-overrides.conf

... which itself can contain:

domain-needed

... in addition to any other settings you might want to set, in the hope that dnsmasq might honour them.

(It won't. More on that another time.)


2015-04-10

Violating CORS with just nginx

I have come across a number of situations where I "want" to write an application mostly in Javascript, but it needs to access a couple of resources from other people's websites, which would be blocked by CORS. Previously, at this point, I'd have either given up and written the page in PHP, or at least had a PHP script somewhere doing the fetch and returning the content. Looking around, I can find many examples of both; none pretty.

It turns out, however, that PHP is generally hard to host, is slow, frequently has security issues, and isn't deployed anywhere fun... for example, on your CDN nodes. Rightly so.

You can solve this problem using (regex) locations and proxy_pass. For example, let's say we want our dynamic webpage to request something from api.micro.mug:

location = /viral {
    proxy_pass https://api.micro.mug/3/gallery/hot/viral/0.json;
}

Done! Our client app (served from the same server block as index.html) can now...

$.getJSON('viral', function(data) {
    ...
});

... at will (or manually, if you hate libraries).

You have full control over the connection, so you can add headers. For example, you might want to add a required header with proxy_set_header:

proxy_set_header Neutralization 'ChitDoI abcdef0123456';

You can take this further by accepting information from the client, as path parameters. You may have a monitoring page for example.com, which wants to check the /status on various foo.example.coms.

location ~ /status/([a-z0-9.-]+\.example\.com)$ {
    proxy_pass https://$1/api/status;
    proxy_connect_timeout 2s;
    proxy_read_timeout 2s;
}

The regex here is slight paranoia; maybe you could accept .+? Who knows what problems that could cause.

The client is also simple, again, with libraries:

$.ajax('status/foo.example.com')
    .done(function() { $('#foo').text('ok'); })
    .fail(function(xhr) { $('#foo').text(xhr.status); });

... or whatever.

As I mentioned previously, this can sit directly on the CDN or load balancer nodes, alongside the otherwise static content. This makes it very likely that the status page will be up.


2015-04-10

New blog

My blog used to be hosted using self-hosted WordPress. As part of the most recent yak shaving machine migration, I'd decided to eliminate PHP from normal web serving. I looked at a number of static site generators, but they all had significant issues, specifically:

  • The awful formatting in my old blog posts. WordPress lets you get away with pretty much anything except using scheme-relative urls (//google.com).
  • Being written in things that were hard to install, or that required specific versions of software. Always a bad sign; I need to be convinced that something I don't care about at all will not require any input from me in the next five years. I'm looking at things like Jekyll and pandoc here, but I also had issues with Python tools like Hyde.
  • Letting me actually control the output format, so I can waste my life fiddling with CSS.

I was too lazy to move the content across to anything modern, or to work out how to customise any of the existing solutions to meet my requirements, so I wrote my own; some sed, then some python, then some kind of markdown library and some more python, and woo.

Notable features:

  • RFC-2822 (Header: value) and Markdown/HTML/mixed crap input
  • Totally invalid HTML output
  • Barely valid RSS and Atom feeds
  • No comments, so no spam. Woo.

It's totally hardcoded to generate the exact site, but it's so small (~160 lines of Python and <100 lines of HTML templates) that you can probably just fork it. I didn't even name it: Faux' blog generator.
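The input format is just RFC-2822-style headers plus a body, which the stdlib email parser happens to handle; a hypothetical sketch (not the generator's actual code, and the post content here is made up):

```python
from email.parser import Parser

# a made-up post in the blog's input format: RFC-2822 headers,
# blank line, Markdown body
raw = """Title: New blog
Date: 2015-04-10

My blog used to be *WordPress*.
"""

msg = Parser().parsestr(raw)
headers = dict(msg.items())  # {'Title': 'New blog', 'Date': '2015-04-10'}
body = msg.get_payload()     # everything after the blank line
```

From there it's a Markdown render of the body and some string templating.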


2014-03-09

Plover

I've been trying to learn Plover for a couple of weeks. It's an entertaining experience. Quite a few words still stump me and I end up going to the dictionary.

I've built a Plover summary and hints chart, and a new mobile-compatible Plover dictionary look-up tool.

Plover keying chart

Some good examples:

This post was composed on Plover at a total speed of around four words per minute. It's rather hard going so far.


2014-02-03

Fosdem 2014

I was at Fosdem 2014. I experienced some things which may count as learning, or just exclamations. I'm not really sure how this kind of thing works. (editor's note: I guess, having written it, that the term is "braindump")

  • Reproducible Builds for Debian: 65% of the archive can be reproduced from binaries we have available. readdir() is a great source of entropy; timestamps are a great source of source-control noise. Maybe we should make a vcs-give-me-a-version for people to use instead of $(date). We're heavily dependent upon old binaries, and on toolchains that will never be audited. We trust debian developers to upload a binary that matches their source(??!).
  • Shenandoah GC: Non-gc threads can do useful gc work, by copying stuff off the dead bits of heap onto live bits during writes. "Brooks pointers"(?) are a formalisation of that kind of store-and-forward kind of model in an atomic-compare-and-exchange world. People build non-generational garbage collectors but apparently this is silly and always abandoned. Concurrency without locking is hilarious. 8mb sub-heaps are cool.
  • Sage is a nice bundle of ipython --notebook and loads of maths libraries, and a magic cloud thing, but demos are hard, and nearly any audience is going to respond better to notebook (yes that's really what it looks like when you're editing it) than the command-line. Also, it was annoyingly broken on Android Chrome.
  • SPARK 2014: Extending Ada to have the compiler check asserts; e.g. ensuring you never get bounds checks, cool and perfectly sensible; looked like they hadn't quite made it as automatic as I'd've hoped for Ada, which I chose to assume was a sensible language. Additionally: Let's stop unit testing trivial stuff, and let the compiler show that it's correct. Some great slides (which I can't find, yet) on how to layer your testing infrastructure when you have a competent compiler and devs who can tell it how to be happy.
  • Scaling Dovecot: Nice overview of threads, clients, handoff, etc. in a world where you're expecting long-running IMAP connections (instead of the HTTP/REST nonsense I generally think about), and an overview of their load-balancing infrastructure which sounded pretty ... unique, rings of machines, rings of rings, ...
  • Motivations for MAGEEC: Why and how we should care about optimising for power; changing compiler optimisation cost models for power usage (instead of instruction length or expected speed). Pretty heatmaps showing power usage across instructions, e.g. sometimes it uses less power (but is slower) to do operation on 32-bit units instead of 8-bit, etc. Proposed a performance test framework (programming competition?!) targeting "total joules" instead of "code length" or "time".
  • FOSDEM wifi: Defaulting to ipv6, and how to configure bind/etc. to make people able to access the ipv4 bits of the internet. Massive shock at getting a reverse-dns response for an ipv6 address with a work hostname in it. Big (yeah, big) discussion with anyone who would listen about why DNSSEC and this didn't or couldn't be resolved, which was eventually laid to rest by Muz. Android does its wonderful wifi-reboot-loop if presented with wifi without ipv4 connectivity. mosh doesn't support ipv6 yet.
  • Autovectorisation in LLVM: LLVM is flexible enough to deal with loads of people having a stab at things. Cost models are hard, especially when they're across instruction boundaries and dependent on what the CPU is currently doing (none of the vectorisers do this right now). It's hard to keep the right amount of information available when you're running much later in the compiler, although if it was in the original source then it's fine. They're heavily dependent on inlining. I wish Java was better at inlining. It seems to fix everything.
  • Avatar, running analysis tools on embedded stuff: Let's just pause for a moment here and look at this:

    AVATAR block diagram

    Wow. It's some kind of black magic for running a huge binary blob that they have neither the code nor the toolchain for, both on the real hardware and on an emulator, and kind of syncing all of it backwards and forwards over python3 and multiple different connectors and multiple gdbs, writing random memory editors into spare space in the embedded device, and recompiling the qemu backend code with llvm (??!!) and running it through their symbolic execution framework and syncing that state all over and... basically I had no idea what was going on, but it was wondrous. No serious issues found in a Nokia 3310 yet, though.
  • LLVM Linux: The kernel and gcc have grown together, and hence the kernel gets away with all kinds of awful things that llvm just outright refuses to add support for. The guy was amazingly angry about a tiny bit of C that gcc supports (badly), named VLAIS. It's in some of the crypto code in the kernel that gets copy-pasted between modules. This was far more horrifying to me.
  • Power management: A system wide challenge: Hardware engineers are really smart and like to show it with maths and clock gating. Understood some interesting things; e.g. it's worth shipping one high-nm/high-bin core (to run at a lower power and a lower frequency more of the time), and loads of low-nm/low-bin cores to run fast and loose. Power usage is related to the frequency cubed.
  • Helgrind for Valgrind: Let's re-interpret concurrency as a graph. It doesn't really matter what you were trying to do with your locks; just the presence of a lock or unlock implies that you meant to do something. Add a graph edge to nearby other uses of that lock. Other edges are time on the same thread. Dominators. Bam, done. Then they optimised it with vector clocks, which I thought I'd escaped when it became acceptable to only understand Raft, not the other horror.
  • Capsicum: Capabilities are a good way to do things. Allows applications to drop arbitrary privileges, i.e. "don't let me make any directories I've currently got open for reading writable, or let me do anything scary to the network card". tcpdump as an example, which was nicely real-world. Good slides comparing the other options on Linux and other OSes for sandboxing (e.g. Chromium).
  • No more IPv4: A review of loads of stuff that would have been useful to know for the previously mentioned argument, that we had to learn from the horrors of wikipedia and nearby Cisco employees.
  • Software Transactional Memory: Might help you write better code (declarative instead of imperative concurrency), and what looked like a declarative distributed transaction manager, which I completely failed to understand in any way.

9/10 would attend again.

