2016-10-11

HTTP2 slowed my site down!

At work, we have a page which asynchronously fetches information for a dashboard. This involves making many small requests back to the proxy, which is exactly the kind of thing that's supposed to be faster under HTTP2.

However, when we enabled HTTP2, the page went from loading in around two seconds to taking over twenty. This is bad. For a long time, I thought there was a bug in nginx's HTTP2 code, or in Chrome (and Firefox, and Edge...). The page visibly loads in blocks, with exactly five-second pauses between the blocks.

The nginx config is simply:

resolver 8.8.4.4;
location ~ /proxy/(.*)$ {
  proxy_pass https://$1/some/thing;
}

.. where 8.8.4.4 is Google Public DNS.


It turns out that the problem isn't with HTTP2 at all. What's happening is that nginx is processing the requests successfully, and generating DNS lookups. It sends these on to Google, and the first few get processed; the rest are dropped. I don't know whether this is down to the network (it's UDP, after all), or to Google deciding it's attack traffic. The dropped requests are retried by nginx's custom DNS resolver after 5s, and another batch gets processed.

So, why is this happening under HTTP2? Under http/1.1, the browser can't deliver requests quickly enough to trigger this flood protection. HTTP2 has sped it up to the point that there's a problem. Woo? On localhost, a custom client can actually generate requests quickly enough, even over http/1.1.

nginx recommend not using their custom DNS resolver over the internet, and I can understand why; I've had trouble with it before. To test, I deployed dnsmasq between nginx and Google:

dnsmasq -p 1337 -d -S 8.8.4.4 --resolv-file=/dev/null

dnsmasq generates identical (as far as I can see) traffic, and is only slightly slower (52 packets in 11ms, vs. 9ms), but I am unable to catch it getting rate limited. In production, on a much smaller machine than the one I'm testing on, dnsmasq is significantly slower (100ms+), so it makes sense that it wouldn't trigger the rate limiting. dnsmasq does have --dns-forward-max= (default 150), so there's a nice way out there.
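
For completeness, a sketch of what that way out would look like: the proxy config stays the same, but nginx's resolver points at the local dnsmasq (on the port used in the command above) rather than straight at Google:

resolver 127.0.0.1:1337;
location ~ /proxy/(.*)$ {
  proxy_pass https://$1/some/thing;
}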


In summary: when deploying HTTP2, or any upgrade, be aware of rate limits, in your systems or other people's, that you may now be fast enough to trigger.


2016-02-01

Fosdem 2016

I was at Fosdem 2016. Braindump:

  • Poettering on DNSSEC: Quick overview of some things Team Systemd is working on, but primarily about adding DNSSEC to systemd-resolved.
    • DNSSEC is weird, scary, and doesn't really have any applications in the real world; it doesn't enable anything that wasn't already possible. Still important and interesting for defence-in-depth.
    • Breaks in interesting cases, e.g. home routers inventing *.home or fritz.box, the latter of which is a real public name now. Captive portal detection (assuming we can't just make those go away).
    • systemd-resolved is a separate process with a client library; DNS resolution is too complex to put in libc, even if you aren't interested in caching, etc.
  • Contract testing for JUnit: Some library sugar for instantiating multiple implementations of an interface, and running blocks of tests against them. Automatically running more tests when anything returns a class with any interface that's understood.
    • I felt like this could be applied more widely (or perhaps exclusively); if you're only testing the external interface, why not add more duck-like interfaces everywhere, and test only those? Speaker disagreed, perhaps because it never happens in practice...
    • Unfortunately, probably only useful if you are actually an SPI provider, or e.g. Commons Collections. This was what was presented, though, so not a big surprise.
    • The idea of testing mock/stub implementations came up. That's one case where, at least, I end up with many alternative implementations of an interface that are supposed to pass the same tests. Also discussed whether XFAIL was a good thing; the Ruby people suggested WIP (i.e. must fail or the test suite fails) works better.
  • Frida for closed-source interop testing: This was an incredibly ambitious project, talk and demo. ("We're going to try some live demos. Things are likely to catch fire. But, don't worry, we're professionals.").
    • Injects the V8 JS VM into arbitrary processes, and interacts with it from a normal JS test tool; surprisingly nice API, albeit using some crazy new JS syntaxes. Roughly: file.js: {foo: function() { return JVM.System.currentTimeMillis(); } } and val api = injectify(pid, 'file.js'); console.log(api.foo());.
    • Great bindings for all kinds of language runtimes and UI toolkits, to enable messing with them, and all methods are overridable in-place, i.e. you can totally replace the read syscall, or the iOS get me a url method. Lots of work on inline hooking and portable and safe call target rewriting.
    • Live demo of messing with Spotify's ability to fetch URLs... on an iPhone... over the network. "System dialog popped up? Oh, we'll just inject Frida into the System UI too...".
  • USBGuard: People have turned off the maddest types of USB attack (e.g. autorun), but there are still lots of ways to get a computer to do something bad with an unexpected (modified) USB stick; generating keypresses and mouse movements, or even presenting a new network adaptor that the machine may then choose to send traffic to.
    • Ask the user if they expect to see a new USB device at all, and whether they think it should have these device classes (e.g. "can be a keyboard"). Can only reject or accept as a whole; kernel limitation. UX WIP.
    • Potential for read-only devices; filter any data being exfiltrated at all from a USB stick, but still allow reading from them? Early boot, maybe? Weird trust model. Not convinced you could mount a typical FS with USB-protocol level filtering of writes.
    • Mentioned device signing; no current way to do identity for devices, so everything is doomed if you have any keyboards. Also mentioned CVEs in USB drivers, including cdc-wdm, which I reviewed during the talk and oh god oh no goto abuse.
  • C safety, and whole-system ASAN: Hanno decided he didn't want his computer to work at all, so has been trying to make it start with ASAN in enforce mode for everything. Complex due to ASAN loading order/dependencies, and the fact that gcc and libc have to be excluded because they're the things being overridden.
    • Everything breaks (no real surprise there): bash, coreutils, man, perl, syslog, screen, nano. As with the Fuzzing Project, people aren't really interested in fixing bugs they consider theoretical, even though those bugs are very real under angry C compilers, or will be in the future. Custom allocators have to be disabled, which is widely but not totally supported.
    • Are there times where you might want ASAN on in production? Would it increase security (trading off outages), or would it add more vulnerabilities, due to huge attack surface? ~2x slowdown at the moment, which is probably irrelevant.
    • Claimed ASLR is off by default in typical Linux distros; I believe Debian's hardening-wrapper enables this, but Lintian reports poor coverage, so maybe a reasonable claim.
  • SSL management: Even in 2014, when Heartbleed happened, Facebook had not really got control of what was running in their infrastructure. Public terminators were fine, but everything uses SSL. Even CAs couldn't cope with reissue rate, even if you could find your certs. Started IDSing themselves to try and find missed SSL services.
    • Generally, interesting discussion of why technical debt accumulates, especially in infra. Mergers and legacy make keeping track of everything hard. No planning for things that seem like they'll never happen. No real definition of ownership; alerts pointed to now-empty mailing lists, or to people who have left (this sounds very familiar). That service you can't turn off but nobody knows why.
    • Some cool options. Lots of ways to coordinate and monitor SSL (now), e.g. Lemur. EC certs are a real thing you can use on the public internet (instead of just me on my private internet), although I bet it needs Facebook's cert switching middleware. HPKP can do report-only.
    • Common SSL fails that aren't publicly diagnosed right now: Ticket encryption with long-life keys (bad), lack of OCSP stapling and software to support that.
  • Flight flow control: Europe-wide flight control is complex, lots of scarce resources: runways, physical sky, radar codes, radio frequencies. Very large, dynamic, safety-critical optimisation problem.
    • 35k flights/day. 4k available radar codes to identify planes. Uh oh. Also, route planning much more computationally expensive in 4D. Can change routes, delay flights, but also rearrange ATC for better capacity of bits of the sky.
    • Massive redundancy to avoid total downtime; multiple copies of the live system, archived plans stored so they can roll-back a few minutes, then an independently engineered fall-back for some level of capacity if live fails due to data, then tools to help humans do it, and properly maintained capacity information for when nobody is contactable at all.
    • Explained optimising a specific problem in Ada; no actual Ada tooling so stuck with binary analysis (e.g. perf, top, ..). Built own parallelisation and pipelining as there's no tool or library support. Ada codebase and live system important, but too complex to change, so push new work into copies of it on the client. Still have concurrency bugs due to shared state.
  • glusterfs and beyond glusterfs: Large-scale Hadoop alternative, more focused on reliability, and NFS/CIFS behaviour than custom API, but also offer object storage and others.
    • Checksumming considered too slow, so done out of band (but actually likely to catch problems), people don't actually want errors to block their reads (?!?). Lots more things are part of a distributed system than I would have expected, e.g. they're thinking of adding support for geographical clustering, so you can ensure parts of your data are in different DCs, or that there is a copy near Australia (dammit).
    • The idea that filesystems have to be in kernel mode is outdated; real perf comes from e.g. user-mode networking stacks. Significantly lower development costs in user mode mean FSes are typically faster in user space, as the devs have spent more time (with better tooling) getting them to work properly: the real speedups are algorithmic, not code-level.
    • Went on to claim that Python is fine, don't even need C or zero-copy (but do need to limit copies), as everything fun is offloaded anyway. Ships binaries with debug symbols (1-5% slowdown) as it's totally irrelevant. Team built out of non-FS, non-C people writing C. They're good enough to know what not to screw up.
    • Persistent memory is coming (2TB of storage between DRAM and SSD speed), and cache people are behind. NFS-Ganesha should be your API in all cases.
  • What even are Distros?: Distros aren't working, can't support things for 6mo, 5y, 10y or whatever as upstream and devs hate you. Tried to build hierarchies of stable -> more frequently updated, but failed; build deps at the lower levels; packaging tools keep changing; no agreement on promotion; upstreams hate you.
    • PPAs? They had invented yet another PPA host, and they're all still bad. Packaging is so hard to use, especially as non-upstreams don't use it enough to remember how to use it.
    • Components/Modules? Larger granularity installable which doesn't expose what it's made out of, allowing distros to mess it up inside? Is this PaaS, is this Docker? It really feels like it's not, and it's really not where I want to go.
    • Is containerisation the only way to protect users from badly packaged software? Do we want to keep things alive for ten years? I have been thinking about the questions this talk had for a while, but need a serious discussion before I can form opinions.
  • Reproducible Builds: Definitely another post some time.
    • Interesting questions about why things are blacklisted, and whether the aim is to support everything. Yes. Everything. EVERYTHING.
  • Postgres buffer manager: Explaining the internals of the shared buffers data structure, how it's aged poorly, what can be done to fix it up, or what it should be replaced with. Someone actually doing proper CS data-structures, and interested in cache friendliness, but also tunability, portability, future-proofing, etc.
    • shared_buffers tuning advice (with the usual caveat that it's based on workload); basically "bigger is normally always better, but definitely worth checking you aren't screwing yourself".
    • Also talked about general IO tuning on Linux, e.g. dirty_writeback, which a surprising number of people didn't seem to have heard of. Setting it to block earlier reduces maximum latency; numbers as small as 10MB were considered.
  • Knot DNS resolver: DNS resolvers get to do a surprising number of things, and some people run them at massive scale. Centralising caching on e.g. Redis is worthwhile sometimes. Scriptable in Lua so you can get it to do whatever you feel like at the time (woo!).
    • Implements some interesting things: Happy Eyeballs, QNAME minimisation, built-in serving of pretty monitoring. Some optimisations can break CDNs (possibly around skipping recursive resolving due to caching), didn't really follow.
  • Prometheus monitoring: A monitoring tool. Still unable to understand why people are so much more excited about it than anything else. Efficient logging and powerful querying. Clusterable through alert filtering.
    • "Pull" monitoring, i.e. multiple nodes fetch data from production, which is apparently contentious. I am concerned about credentials for machine monitoring, but the daemon is probably not that much worse.
  • htop porting: Trying to convince the community to help you port to other platforms is hard. If you don't, they'll fork, and people will be running an awful broken fork for years (years). Eventually resolved by adding the ability to port to BSD, then letting others do the OSX port.
  • API design for slow things: Adding APIs to the kernel is hard. Can't change anything ever. Can't optimise for the sensible case because people will misuse it and you can't change it. Can't "reduce" compile-time settings as nobody builds a kernel to run an app.
    • Lots of things in Linux get broken, or just never work to start with, due to poor test coverage, maybe the actually funded kselftest will help, but people can help themselves by making real apps before submitting apis, or at least real documentation. e.g. recvmsg timeout was broken on release. timerfd tested by manpage.
    • API versioning is hard when you can't turn anything off. epoll_create1, renameat2, dup3. ioctl, prctl, netlink, aren't a solution, but maybe seccomp is. Capabilities are hard; 1/3rd of things just check SYS_ADMIN (which is ~= root). Big argument afterwards about whether versioning can ever work, and what deprecation means. Even worse for this than for Java, where this normally comes up.
  • Took a break to talk to some Go people about how awful their development process is. CI is broken, packaging is broken, builds across people's machines are broken. Everything depends on github and maybe this is a problem.
  • Fosdem infra review: Hardware has caught up with demand, now they're just having fun with networking, provisioning and monitoring. Some plans to make the conference portable, so others can clone it (literally). Video was still bad but who knows. Transcoding is still awfully slow.
    • Fosdem get a very large temporary ipv4 assignment from IANA. /17? Wow. Maybe being phased out as ipv6 and nat64 kind of works in the real world now.
    • Argument about why they could detect how many devices there were, before we realised mac hiding on mobile is probably disabled when you actually connect, because that's how people bill.
  • HOT OSM tasking: Volunteers digitising satellite photos of disaster zones, and ways to allocate and parallelise that. Surprisingly fast; 250 volunteers did a 250k person city in five days, getting 90k buildings.
    • Additionally, provide mapping for communities that aren't currently covered, and train locals to annotate with resources like hospitals and water-flow information.
    • Interesting that sometimes they want to prioritise for "just roads", allowing faster mapping. Computer vision is still unbelievably useless; claiming 75% success rate at best on identifying if people even live in an area.
    • Lots of ethical concerns; will terror or war occur because there's maps? Sometimes they ask residents and they're almost universally in favour of mapping. Sometimes drones are donated to get better imagery, and residents jump on it.
  • Stats talk. Lots of data gathered; beer sold, injuries fixed, network usage and locations. Mostly mobile (of which, mostly android). Non-mobile was actually dominated by OSX, with Linux a close second. Ouch.

Take-aways: We're still bad at software, and at distros, and at safety, and at shared codebases.

Predictions: Something has to happen on distros, but I think people will run our current distros (without containers for every app) for a long time.


2015-12-23

BlobOperations: A JDBC PostgreSQL BLOB abstraction

BlobOperations provides a JdbcTemplate-like abstraction over BLOBs in PostgreSQL.

The README contains most of the technical details.

It allows you to deal with large files; ones for which you don't want to load the whole thing into memory, or even onto the local storage on the machine. Java 8 lambdas are used to provide a not-awful API:

blops.store("some key", os -> { /* write to the OutputStream */ });
blops.read("other key", (is, meta) -> { /* read the InputStream */ });

Here, the (unseekable) Streams are connected directly to the database, with minimal buffering happening locally. You are, of course, free to load the stream into memory as you go; the target project for this library does that in some situations.

In addition to not being so ugly, you get free deduplication and compression, and a place to put your metadata, etc. Please read the README for further details about the project.


And, some observations I had while writing it:

I continue to be surprised at how hard it is to find good advice on locking techniques and patterns for Postgres. For example,

SELECT * FROM foo WHERE pk=5 FOR UPDATE;

... does nothing if the row with pk=5 doesn't exist (yet). That is, there's no neat way to block until you know whether you can insert a record. Typically, you don't want to block, but if your code then progresses to do:

var a = generateReallySlowThing();
INSERT INTO foo (pk, bar) VALUES (5, a);
COMMIT;

...it seems a shame to have waited for that slow operation, and then have the INSERT explode on you. The "best" solution here appears to be to insert a blank record, commit, then lock the record, do your slow operation, and then update it. As far as I'm aware, none of the UPSERT-related changes in PostgreSQL 9.5 help with this case at all. I would love to link to a decent discussion of this... but I'm not aware of one.
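
In the same pseudo-code style as above, that workaround looks something like:

INSERT INTO foo (pk, bar) VALUES (5, NULL); -- placeholder row
COMMIT;
SELECT * FROM foo WHERE pk=5 FOR UPDATE;    -- now there is a row to lock
var a = generateReallySlowThing();
UPDATE foo SET bar = a WHERE pk=5;
COMMIT;

If the placeholder INSERT explodes, at least it does so before the slow work.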

A similar case comes up later, where I wish for INSERT ON CONFLICT DO NOTHING, which is in PostgreSQL 9.5. Soon.
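
For reference, the 9.5 syntax is just:

INSERT INTO foo (pk, bar) VALUES (5, a) ON CONFLICT DO NOTHING;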


2015-11-18

xlines: stdin round-robiner

xlines is a combination of xargs and split. It takes a bunch of lines, and sends them to a number of child processes. Each process sees only one of the lines.

e.g.

seq 16 | xlines -c 'cat > $(mktemp)'

...will give you 8 temporary files (on an 8-core machine) containing:

1
9

and:

2
10

etc.

Why would you care?

You have a bunch of INSERT statements coming off a stream, but your database will only use a single core if you run them in series:

zcat sql.gz | xlines -P 32 -- psql

Some speed-up.

zcat sql.gz | xlines -P 32 -c 'buffer | psql'

Zoom.

A specific tool to fix a specific job. I still don't think it makes up for the lack of limited parallelism in shell, however. Still thinking about that one...


2015-11-12

Teensy weensy crypto

As the UK's politicians continue to fail to understand what "strong cryptography" or "banning" even mean, I thought I would have a look at how simple strong cryptography can be.

nanorc4 is a working RC4 encryption and decryption implementation in 16-bit assembly. It will run on any 32-bit (or, presumably, 16-bit!) Windows machine (which, admittedly, are going out of fashion), and on dosbox:

uwACiB/+w3X6MckxwIjIih6AAP7L9vOI44qHgv6Iy4onAOAAxegvAP7Bdd8xybQIzSH+wYjLAi/o
HACIy4oXiOsCF4jTihcwwrQCzSG0C80hhMB12c0giMuKF4jrijeIF4jLiDfD

Yep, that's it, base64 encoded. 102 bytes, or 138 encoded. Fits in a tweet. Probably small enough to memorise. Certainly pretty hard to ban.

With this (and your computer) you can secure a message with a password in a way that's unbreakable. I can't break it, your government can't break it, other people's governments can't break it. Secure.

Why's it so small?

  1. The problem is (relatively) easy. This is known as "pre-shared key cryptography", or "symmetric cryptography", which is one of the easier problems in the science. Things get much harder when you don't have a good way to tell the target the key in advance.
  2. RC4 is surprisingly secure for how simple the code is (see the sketch just after this list).
  3. 16-bit assembly, and the COM "format" have no preamble: it's just the code. It just starts executing at the start. (And I hacked at it a bit.)
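
For the curious, here is the whole algorithm as a sketch in Java rather than assembly (illustrative only; it shares all of the caveats below, and then some):

public class Rc4Sketch {
    public static byte[] rc4(byte[] key, byte[] data) {
        int[] s = new int[256];
        for (int i = 0; i < 256; i++) s[i] = i;
        for (int i = 0, j = 0; i < 256; i++) {      // key scheduling
            j = (j + s[i] + (key[i % key.length] & 0xff)) & 0xff;
            int t = s[i]; s[i] = s[j]; s[j] = t;
        }
        byte[] out = new byte[data.length];
        for (int n = 0, i = 0, j = 0; n < data.length; n++) {
            i = (i + 1) & 0xff;                     // keystream generation...
            j = (j + s[i]) & 0xff;
            int t = s[i]; s[i] = s[j]; s[j] = t;
            out[n] = (byte) (data[n] ^ s[(s[i] + s[j]) & 0xff]);  // ...XORed with the data
        }
        return out;
    }
}

Encryption and decryption are the same operation, which is part of why the 102 bytes cover both.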

Demo!

> echo hi | one.com secure password>out ; in DOS (note: no trailing space)

$ make c && ./c 'secure password' <out  # on linux
hi

Should you use it? No. There are many important features missing that are present in proper symmetric encryption tools, such as proper key derivation, protection against modification, IVs, and fewer bugs. Yes, even this 102-byte program has some significant bugs I couldn't be bothered to fix.

Is RC4 secure? For this use-case, yes. For TLS, most certainly not. Even today there are many plausible attacks against RC4 in the TLS context, but none of them apply to this static-data world.

I was actually hoping to be able to fit RC4-drop-N in, which is probably secure in many more contexts, but I couldn't get the byte count down to the (tweet-derived) target. I guess this makes for a reasonable golf competition...

Development notes:

  • dosbox is pretty annoying, but so is cmd. The dosbox debugger is cool, but there doesn't seem to be any current documentation on it. That Forum Post is pretty wrong.
  • dosbox doesn't support pipes or <input redirection, so I couldn't debug with binary files, which is one of the reasons it doesn't work.
  • I have no idea what the actual semantics of the input interrupts are, all the useful documentation seems to have been lost to history, or was commercial (and/or paper) in the first place.
  • Everything fits in three 256-byte blocks, so the bh register == block number, and there's no use of memory segmentation (WOOO).
  • block 0: the PSP, which I couldn't overwrite as it has the key in (as the command-line argument).
  • block 1: the code segment
  • block 2: the 256-byte state for RC4.
  • After the key setup, the bh is left at 2 forever.
  • cl and ch are used for the i and j state parts in RC4.

Update:

  • A number of people pointed me at Odzhan's RC4 implementation in normal x86(_64) which shows a much better understanding of actual assembly programming. For example, their "swap" implementation is amazing compared to mine.
  • Some people asked how much hacking it took to get the size down. It took about six hours, but it was great. I love golf competitions, even if they're just against myself.
  • There was some concern that people might actually accidentally run or incorporate the code without understanding the flaws, as there isn't a big enough warning on this page, or on github. These people additionally didn't read any of the rest of the article, where it is explained that it's broken, 16-bit x86 assembly which you actually can't run anywhere, even if you wanted to.

2015-10-04

Capturing users' ssh keys

Four years ago, I was working on a project that would require users to connect to it over ssh. At the time, asking typical users (even developers!) to send you an ssh public key was a bit of an involved operation.

The situation hasn't improved much.

For example, github suggests generating the keys manually, then using Windows' clip.exe or apt-get install xclip && xclip (from the command line) to get the key into the clipboard, then pasting it into their web-interface. Ugh.

The situation is a little better with PuTTYTray: it has built-in support for SSH agent, and a reasonably streamlined way to get keys into the clipboard, but then we're still stuck with the clipboard-into-the-web-interface story. This was written in 2013-08, two years too late (although I'm sure the author could have been convinced to move the development forward).

For this project, I came up with a better way.

I realised I could simply ask the new user to ssh in, and capture their keys. To distinguish concurrent users, I could issue them a fake username, and ask them to ssh to account-setup-for-USERNAME@my.service.com. When they do, I can capture their keys and automatically associate them with their account. No platform-specific commands, no unnecessary messing around in the terminal.

This is possible due to how ssh authentication works:

  • Client sends the username.
  • Server replies: Sure, you can try logging in with keys, or with passwords if you want.
  • Client sends Public Key 1.
  • Server replies: Nope, but you can try other keys or passwords.
  • Client sends Public Key 2.
  • Server replies: ...

That is, the standard ssh client will just send you all the user's public keys.

Note that this isn't (normally) considered a security problem; the keys are public, after all, and the server isn't leaking any information by saying "nope".

As I was already running a custom SSH server which practically required you to implement authentication yourself anyway, it was a simple step to add key capture to the account setup procedure. I've uploaded a stripped down version to github if you want to see how it works. For example,

Start the server:

server% git clone https://github.com/FauxFaux/ssh-key-capture.git
server% cd ssh-key-capture
server% ./gradlew -q run

The user can try to log in, but gets rejected (this isn't required):

john% ssh -p 9422 john@localhost
Permission denied (publickey).

Server logs from the (unnecessary) failed authentication:

KeyCapture - john trying to authenticate with RSA MIIBIjANBg...
KeyCapture - john trying to authenticate with EC MFkwEwYHKoZ...

Tell the server that john has signed up, or wants to add keys, or...

Enter a new user name, or blank to exit: john
Ask 'john' to ssh to '18a74d9f-5c7d-41d0-8369-bae4aaba9867@...'

John now adds his keys, and hence can login:

john% ssh -p 9422 18a74d9f-5c7d-41d0-8369-bae4aaba9867@localhost
Added successfully!  You can now log-in normally.
Connection to localhost closed.

john% ssh -p 9422 john@localhost
Hi!  You've successfully authenticated as john
Bye!
Connection to localhost closed.
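
The interesting part is just the public-key authenticator callback. A rough sketch of the idea (the real project is Java on top of an SSH server library; the class and method names here are illustrative, not the actual API, so see the github repo for the real code):

import java.security.PublicKey;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CapturingAuthenticator {
    // fake setup-username -> real username, created by the admin prompt above
    private final Map<String, String> pendingSetups = new ConcurrentHashMap<>();
    // real username -> their captured public key
    private final Map<String, PublicKey> knownKeys = new ConcurrentHashMap<>();

    // Called by the SSH server once for each key the client offers.
    public boolean authenticate(String username, PublicKey key) {
        String realUser = pendingSetups.remove(username);
        if (realUser != null) {
            // Setup flow: capture the first key the client offers, and accept.
            // Accepting ends the key negotiation, which is why only the first
            // key gets captured -- see "Future work" below.
            knownKeys.put(realUser, key);
            return true;
        }
        // Normal login flow: only accept a key we've previously captured.
        return key.equals(knownKeys.get(username));
    }
}

The real thing also generates the fake UUID usernames and prints the "Added successfully!" banner shown above.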

Future work:

  • It could capture all of the user's keys (it currently just captures the first).
  • More meaningful behaviour after the first authentication, or during the admin part of the setup?
  • Some way to do this on top of OpenSSH, or other tools people actually run in the wild. PAM?

Update: There was some decent discussion on reddit's /r/netsec about this post.

