Arduino radio communication

Once again, I have ordered the wrong hardware from eBay.

This time, it was a set of 433MHz radio transceivers for "Arduino". The majority of these come with embedded circuitry for sending and receiving bits. The ones I ordered, however, did not.

The transmitter emits power when its data line is powered. The receiver emits a varying voltage, which can be ADC'd back into a value, roughly 1 to 800. This is not digital.

I decided to do everything from scratch. Everything.

A useful simple radio protocol is known as "OOK" or "ASK": You turn the radio on when you're sending a "1", you turn it off when you're not.

The transmitter is amazingly simple; you turn on the radio, and you turn it off. These fourteen lines of code actually send two bits, for reasons which will become horrifying later.

Or now. Radio is incredibly unreliable. This is worked around by layering all kinds of encodings / checksums together, and hoping everything works out. (Narrator: It doesn't work out.)

The first type of encoding used is called "Manchester Encoding". This involves doubling the amount of data you send, but gives you lots of scope for detecting problems. For a 1, you send 01, and for a 0, 10. That is, if you see a 111 or a 000 in your stream, you know something's gone wrong.

So, to send the number 6, binary 0110, we're going to send 10_01_01_10. This is why the sending code sends two bits.
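The real code is Arduino C, but the scheme is small enough to sketch in a few lines of Python (function names are mine):

```python
def manchester_encode(bits):
    """Encode a bit string: a 1 becomes "01", a 0 becomes "10"."""
    return "".join("01" if b == "1" else "10" for b in bits)

def manchester_decode(chips):
    """Decode pairs back to bits; "00" or "11" means the stream is corrupt."""
    out = []
    for i in range(0, len(chips), 2):
        pair = chips[i:i + 2]
        if pair == "01":
            out.append("1")
        elif pair == "10":
            out.append("0")
        else:
            raise ValueError("encoding violated: %r" % pair)
    return "".join(out)
```

So `manchester_encode("0110")` gives `"10010110"`, the 10_01_01_10 above.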

The receiver's job is much more horrifying. The receiver gets "samples" from the radio (three-digit integers) at unknown time intervals. The minimum value read varies wildly with environmental conditions, as does the peak value (the value you hope to see when the transmitter is sending).

For this purpose, the receiver has multiple levels of filtering.

First, it takes a fast moving average over the received signal, a "slow" moving average over the background noise (the average of all samples), and a guess at the high value. If the fast moving average is more than half way up this band, it's probably a hi.
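A sketch of that filtering, with invented constants (the real averaging rates, and the update rules for sig_high, are the code's own):

```python
class Thresholder:
    """Classify noisy ADC samples as hi/lo using adaptive moving averages."""

    def __init__(self):
        self.fast = 0.0        # fast average of the incoming signal
        self.background = 0.0  # slow average of everything (the noise floor)
        self.sig_high = 100.0  # guess at the value seen during a transmission

    def sample(self, value):
        self.fast += (value - self.fast) * 0.5
        self.background += (value - self.background) * 0.001
        if value > self.sig_high:
            self.sig_high += (value - self.sig_high) * 0.1
        else:
            self.sig_high += (value - self.sig_high) * 0.0001  # decay slowly
        # trigger sits half way between the noise floor and the peak guess
        trigger = self.background + (self.sig_high - self.background) / 2
        return self.fast > trigger
```

The background, sig_high and trigger values are what the DEBUG_BACKGROUND lines below are printing.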

This can be observed in the code by enabling DEBUG_BACKGROUND and rebooting the board. The board initially has a bad idea of what the noise environment looks like, so the output will look like this:

background: 8 sig_high:99 high:47 trigger:53 -- ..XXX.XXXXXXX.XXXXXXXXXXXXXXX...................................XXX.............................................................
background: 6 sig_high:96 high:87 trigger:51 -- .....................................................XXX....XX..................................................................

Here, it's got a very narrow range, so it's triggering too often and emitting lots of nonsense bits (the XXXs). After a while, it will adjust:

background: 28 sig_high:159 high:757 trigger:93 -- XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX....X..XX..XXX...............................................................................
background: 27 sig_high:163 high:450 trigger:95 -- ................................................................................................................................
background: 26 sig_high:165 high:26 trigger:95 -- ................................................................................................................................

Here, its background estimate is higher, but its sig_high estimate is much higher, so the trigger is higher, and it doesn't incorrectly trigger at all. (Those XXXs are part of a real signal.)

Second, we "decimate" this signal down a lot, by taking a binary average of finite blocks. As the sample rate is still significantly higher than the length of a bit, it does not matter that these are not well aligned. We then count the length of runs of each state we see, ignoring single errors and overly long runs.
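A sketch of the decimation and run-counting (the block size is invented, and the single-error and over-long-run handling is omitted):

```python
def decimate(samples, block=8):
    """Reduce hi/lo samples to one value per block, by majority vote."""
    out = []
    for i in range(0, len(samples), block):
        chunk = samples[i:i + block]
        out.append(sum(chunk) * 2 >= len(chunk))
    return out

def run_lengths(bits):
    """Collapse a bit stream into (value, length) runs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return [(v, n) for v, n in runs]
```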

As Arduinos, and the radio hardware, don't do anything like what you tell them, it's impossible to know in advance how long (in milliseconds) a pulse will be, or how long of a run represents a 1.

Fixing this problem is called "clock recovery": we need to guess how long a pulse is according to us, regardless of what the sender thinks it's doing.

Manchester encoding helps with clock recovery. The transmitter sends a "preamble" of zeros, which are encoded as 10101010, that is, a series of pulses. The receiver uses this to guess how long a pulse is, and to check the guess is correct.

This code is looking for a high (and keeping the length of this high), then a low of the same length, then another high/low. If we see these, then we're reasonably confident we're synchronised to the signal.
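That check might look something like this, operating on (value, length) runs of the decimated signal (the tolerance is invented):

```python
def try_sync(runs, tolerance=1):
    """Look for a high, then a low/high/low of about the same length.
    Returns the guessed pulse length, or None if we're not synchronised."""
    if len(runs) < 4 or runs[0][0] != 1:
        return None
    pulse = runs[0][1]  # keep the length of the first high
    for _, length in runs[1:4]:
        if abs(length - pulse) > tolerance:
            return None
    return pulse
```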

There's a DEBUG_CLOCK which watches this phase working:

7: (XXXXXXX_.......) 0 (XXXXXXX_.......) 0 (_XXXXXXX_..............) 0 (_XXXXXXXXXXXXXX) 1 (...................) end (encoding violated)

Here, it's guessed a length of seven, then seen two normal valid 0s, then a 0 and a 1, with the double-length low pulse in the centre. After this, the transmitter went silent, and hence we saw a stream of 000s. Three zeros is invalid in Manchester encoding, so we stopped decoding.

So! We've got a stream of bits, and an end. From this, we need to find the start of the message. I've chosen to implement this by sending a long stream of zeros, then two ones, then immediately the data. This scheme doesn't seem ideal, but it does work.

The decoder waits for this condition to happen, then starts to read bytes.

The bytes are transmitted as 8-bits (MSB last, unlike normal), with a parity bit. This explains the last piece of unexplained code in the transmitter!
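Reading the bit order and parity off the debug output below (least significant bit first, even parity), a sketch of the byte decoder:

```python
def decode_byte(bits):
    """bits: nine 0/1 values, eight data bits (least significant first,
    i.e. "MSB last") followed by an even-parity bit."""
    assert len(bits) == 9
    value = 0
    for i, b in enumerate(bits[:8]):
        value |= b << i  # first bit on the wire is the lowest bit
    if sum(bits[:8]) % 2 != bits[8]:
        raise ValueError("parity error")
    return value
```

For example, the `X..XX..X.` below is `decode_byte([1,0,0,1,1,0,0,1,0])`, which is 153.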

There's also a debugger for this, in DEBUG_DECODE. Here, we can see it waiting for XX (the second accepted X is bracketed), then reading the next nine bits and checking the parity. Note that there's no synchronisation for the second byte, as it's assumed we're still synchronised:

..X(X)...XX.... => 24 (parity: 00)
X..XX..X. => 153 (parity: 00)
X......X. => 129 (parity: 00)

Here, a failure looks like:

X(.X(..X(X)..X.X.... => 20 (parity: 00)
..X.X...X(X)...XX.... => 24 (parity: 00)

To be honest, I have no real idea what's gone wrong here. The cleaned up data stream looks like 101001100101000000101. The 001100 could be synchronisation, or it could be the encoded "24". Argh! Why would you pick this sequence for temperatures?! Why?

The actual data being sent is a temperature reading, encoded as two bytes: the (int)Celsius part, then the decimal part as a single byte.

As corruption was still getting through at this level, an extra checksum is computed as the xor of the two bytes. Finally, it's mostly reliable. With all the debugging disabled, it looks like:

Value checks out: 24.70
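The message layer is tiny; a sketch, assuming the decimal part is in hundredths (positive temperatures only, for simplicity):

```python
def encode_reading(celsius):
    """Pack a reading as [int part, decimal part, xor checksum]."""
    whole = int(celsius)
    frac = int(round((celsius - whole) * 100))
    return [whole, frac, whole ^ frac]

def check_reading(message):
    """Return the temperature, or None if the checksum doesn't match."""
    whole, frac, check = message
    if whole ^ frac != check:
        return None
    return whole + frac / 100.0
```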

Shame the temperature sensor varies (by about 2C) from my other sensors. It also loses about half the messages to errors, as there's no error recovery at all.

Wasn't that fun?

  • What would a normal software decoder look like for this?

Probably about as bad. I wrote an example FSK decoder as part of quadrs, a radio manipulation tool of mine.

  • How far is this radio transmitting?


[breadboard setup photo]

About three centimetres.

  • What's the data rate?



Rebuilding Debian with Java 9

It's about three months until Java 9 is supposed to be released. Debian contains around 1,200 packages that build against Java 8. I have been trying to build them with Java 9.

It's not been going well.

The first attempt found an 87% failure rate, the majority of which were toolchain issues of one kind or another. This was too bad to get an idea of what's actually broken, so I gave up.

The second attempt has gone better, only 57% failures. This had a number of issues fixed, but there's still a large number of problems masked by toolchain failures.

However, some real Java 9 breakages are coming to the fore!

Oh, and 135 packages have an unknown problem, so maybe there's a whole other class of bug I've missed.

This is (obviously) an ongoing project, but I thought I'd write up what I'd seen so far.

Also, I wanted to mention how cool it is to hack up a dashboard for a ghetto make/Docker build process in ghetto shell (although slightly less ghetto than the previous shell):

Every 2.0s: ./classify.sh

ascii           cast            deps            doclint         javadoc         keyword         modules         unknown         version
47 total        9 total         203 total       82 total        165 total       15 total        83 total        114 total       206 total
====            ====            ====            ====            ====            ====            ====            ====            ====
antelope        charactermanaj  access-*hecker  android*-tools  access-*hecker  avalon-*mework  activem*otobuf  adql            389-adm*onsole
axis            dom4j           activem*tiveio  antlr4          akuma           axis            android*ork-23  aspectj         389-ds-console
azureus         electric        activemq        args4j          animal-sniffer  biojava-live    android*dalvik  aspectj*plugin  airport-utils
clirr           findbugs        activem*otobuf  bindex          annotat*ndexer  bnd             android*oclava  bouncycastle    android*-tools
cmdreader       jajuk           afterburner.fx  cdi-api         antlr3          dbus-java       android*silver  bsh             antlr
cortado         jsymphonic      akuma           classycle       apache-log4j2   jalview         android*inding  closure*mpiler  artemis
cronometer      libjaud*r-java  animal-sniffer  commons-math3   apache-mime4j   javacc4         android*ibcore  cofoja          beansbinding
dita-ot         olap4j          annotat*ndexer  ditaa           async-h*client  java-gnome      android*ibrary  commons-jcs     bindex
eclipselink     sweethome3d     antlr3          felix-g*-shell  axmlrpc         jmol            android*apksig  convers*ruptor  biojava-live
eclipse                         antlr4          fest-assert     bcel            jruby-openssl   android*s-base  davmail         biomaj
entagged                        apache-log4j2   fest-reflect    bridge-*jector  libcomm*g-java  android*ls-swt  diffoscope      brig
fop                             args4j          fest-util       bsaf            libxml-*1-java  android*helper  dnsjava         brltty
geronim*0-spec                  atinjec*jsr330  ganymed-ssh2    build-h*plugin  libxpp2-java    ant             dumbster        cadencii
imagej                          bcel            gentlyw*-utils  canl-java       mvel            apktool         eclipse*nyedit  castor
jasmin-sable                    bridge-*jector  glassfish       cglib           squareness      basex           eclipse-cdt     cdi-api
jas                             build-h*plugin  hdf5            commons*nutils                  bintray*t-java  eclipse*config  ceph
jasypt                          carrots*h-hppc  hessian         commons*ration                  biojava4-live   eclipse-eclox   cobertura
javacc4                         cglib           intelli*ations  commons-csv                     easybind        eclipse-emf     coco-java
javaparser                      checkstyle      jacksum         commons-io                      eclipse-mylyn   eclipse-gef     colorpicker
jets3t                          codenarc        jcm             commons*vaflow                  eclipse-wtp     eclipse*clipse  commons*client
jgromacs                        commons*nutils  jfugue          commons-jci                     eclipse-xsd     eclipse*es-api  concurr*t-dfsg
king                            commons*ration  jmock           commons-math                    freeplane       eclipse-rse     cvc3
knopfle*h-osgi                  commons-csv     jnr-ffi         cssparser                       gant            eclipse*clipse  db5.3
libcds-*t-java                  commons-io      jpathwatch      csvjdbc                         gradle-*plugin  emma-coverage   dbus-java
libcomm*g-java                  commons*vaflow  jsurf-alggeo    dirgra                          gradle          gdcm            dicomscope
libidw-java                     commons-jci     jts             dnssecjava                      gradle-*otobuf  geronim*upport  ditaa
libiscwt-java                   commons-math    libcds-*c-java  dokujclient                     gradle-*plugin  gettext         docbook*-saxon
libitext-java                   commons-parent  libcomm*c-java  doxia-s*etools                  graxxia         gluegen2        doxia
libjdbm-java                    commons-vfs     libcomm*4-java  dtd-parser                      groovycsv       gnome-split     easyconf
libjt400-java                   core-ca*lojure  libcomm*2-java  easymock                        groovy          h2database      excalib*logger
libstax-java                    cssparser       libhac-java     felix-b*sitory                  gs-collections  ha-jdbc         excalib*logkit
libvldo*g-java                  data-xm*lojure  libhtml*r-java  felix-f*mework                  htsjdk          hdrhistogram    f2j
libxpp3-java                    dirgra          libirclib-java  felix-g*ommand                  ice-bui*gradle  icu4j           felix-osgi-obr
livetri*jsr223                  dnssecjava      libjaba*t-java  felix-g*untime                  insubstantial   icu4j-4.2       fontchooser
mathpiper                       dokujclient     libjgoo*n-java  felix-shell                     ivyplusplus     icu4j-4.4       ganymed-ssh2
maven-a*helper                  doxia           libjhla*s-java  felix-s*ll-tui                  jabref          icu4j-49        gentlyw*-utils
metastudent                     doxia-s*etools  libjoda*e-java  felix-utils                     jackson*tabind  istack-commons  geogebra
naga                            dtd-parser      libjsonp-java   geronim*0-spec                  jackson*-guava  jakarta-jmeter  gridengine
ognl                            easymock        libjsr1*y-java  geronim*1-spec                  java3d          janino          healpix-java


find-deleted: checkrestart replacement

checkrestart, part of debian-goodies, checks what you might need to restart. You can run it after a system update, and it will find running processes using outdated libraries. If these can be restarted, and there are no new kernel updates, then you can save yourself a reboot.

However, checkrestart is pretty dated, and has some weird behaviour. It frequently reports that there are things needing a restart, but that it doesn't feel like telling you what they are (this makes me moderately angry). It's pretty bad at locating services on a systemd-managed system. It tries to look things up through Debian packages, making it Debian-specific (as well as unreliable). This is especially odd, because systemd knows which pids belong to which units, as does /proc, and...

Instead of fixing it, I have rewritten it from scratch.

find-deleted is a tool to find deleted files which are still in use, and to suggest systemd units to restart.

The default is to try and be helpful:

% find-deleted
 * blip
   - sudo systemctl restart mysql.service nginx.service
 * drop
   - sudo systemctl restart bitlbee.service
 * safe
   - sudo systemctl restart fail2ban.service systemd-timesyncd.service tlsdate.service
 * scary
   - sudo systemctl restart dbus.service lxc-net.service lxcfs.service polkitd.service
Some processes are running outside of units, and need restarting:
 * /bin/zsh5
  - [1000] faux: 7161 17338 14539
 * /lib/systemd/systemd
  - [1000] faux: 2082
  - [1003] vrai: 8551 8556

Here, it is telling us that a number of services need a restart. The services are categorised based on some patterns defined in the associated configuration file, deleted.yml.

For this machine, I have decided that restarting mysql and nginx will cause a blip in the service to users; I might do it at an off-peak time, or ensure that there's other replicas of the service available to pick up the load.

My other categories are:

  • drop: A loss of service will happen that will be annoying for users.
  • safe: These services could be restarted all day, every day, and nobody would notice.
  • scary: Restarting these may log you out, or cause the machine to stop functioning.
  • other: Things which don't currently have a classification.

If you're happy with its suggestions, you can copy-paste the above commands, or you can run it in a more automated fashion:

systemctl restart $(find-deleted --show-type safe)

This can effectively be run through provisioning tools, on a whole selection of machines, if you trust your matching rules! I have done this with a much more primitive version of this tool at a previous employer.

It can also print the full state that it's working from, using --show-paths.
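That state is essentially a walk over /proc, looking for maps entries marked "(deleted)"; a minimal sketch of the idea (the real tool also maps pids back to systemd units):

```python
import os

def deleted_files_in_use():
    """Map pid -> the set of mapped files that have been deleted."""
    found = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        deleted = set()
        try:
            with open("/proc/%s/maps" % pid) as maps:
                for line in maps:
                    # address perms offset dev inode [pathname]
                    fields = line.split(None, 5)
                    if len(fields) == 6:
                        path = fields[5].rstrip()
                        if path.endswith(" (deleted)"):
                            deleted.add(path[:-len(" (deleted)")])
        except (IOError, OSError):
            continue  # process exited, or we can't read it
        if deleted:
            found[int(pid)] = deleted
    return found
```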


Busy work: pasane and syscall extractor

Today, I wrote a load of Python and a load of C to work around pretty insane problems in Linux, my choice of OS for development.

pasane is a command-line volume control that doesn't mess up channel balance. This makes it unlike all of the other volume control tools I've tried that you can reasonably drive from the command line.

It's necessary to change volume from the command line as I launch commands in response to hotkeys (e.g. to implement the volume+ button on my keyboard). It's also occasionally useful to change the volume over SSH. Maybe there's another option? Maybe something works via dbus? This seemed to make the most sense at the time, and isn't too awful.

Am I insane?

Next up: Some code to parse the list of syscalls from the Linux source tree.

It turns out that it's useful to have these numbers available in other languages: so you can pass them to tools, so you can decode raw syscall numbers you've seen, or simply so you can make the syscalls yourself.

Anyway, they are not available in the Linux source. What? Yes, for most architectures, this table is not available. It's there on i386, amd64, and arm (32-bit), but not for anything else. You have to.. uh.. build a kernel for one of those architectures, then compile C code to get the values. Um. What?

This is insane.

The Python code (linked above) does a reasonably good job of extracting them from this insanity, and generating the tables for a couple more arches.
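For the architectures that do have a table (e.g. arch/x86/entry/syscalls/syscall_64.tbl), the extraction itself is trivial; a sketch of parsing that format:

```python
def parse_syscall_tbl(text):
    """Parse the syscall .tbl format: "number abi name [entry point]".
    Returns {name: number}, skipping comments and x32-only entries."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        number, abi, name = fields[0], fields[1], fields[2]
        if abi == "x32":
            continue  # not part of the plain amd64 ABI
        table[name] = int(number)
    return table
```

The pain is everything around this: getting hold of the equivalent numbers for the architectures that don't ship a table at all.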

I needed to do this so I can copy a file efficiently. I think. I've kind of lost track. Maybe I am insane.


Playing with prctl and seccomp

I have been playing with low-level Linux security features, such as prctl no new privs and seccomp. These tools allow you to reduce the harm a process can do to your system.

They're typically deployed as part of systemd, although the default settings in many distros are not yet ideal. This is partly because it's hard to confirm what a service actually needs, and partly because many services support many more things than a typical user cares about.

For example, should a web server be able to make outgoing network connections? Probably not, it's accepting network connections from people, maybe running some code, then returning the response. However, maybe you're hosting some PHP that you want to be able to fetch data from the internet? Maybe you're running your web-server as a proxy?

To address these questions, Debian/Ubuntu typically err on the side of "let it do whatever, so users aren't inconvenienced". CentOS/RHEL have started adding a large selection of flags you can toggle to fiddle with security (although through yet another mechanism, not the one we're talking about here...).

Anyway, let's assume you're paranoid, and want to increase the security of your services. The two features discussed here are exposed in systemd as NoNewPrivileges= and SystemCallFilter=.

The first, NoNewPrivileges=, prevents a process from getting privileges in any common way, e.g. by trying to change user, or trying to run a command which has privileges (e.g. capabilities) attached.

This is great. Even if someone knows your root password, they're still stuck:

% systemd-run --user -p NoNewPrivileges=yes --tty -- bash
$ su - root
su: Authentication failure

$ sudo ls
sudo: effective uid is not 0, is /usr/bin/sudo on a file system with the 'nosuid'
  option set or an NFS file system without root privileges?

The errors aren't great, as the tools have no idea what's going on, but at least it works!
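If you want to play with the flag outside systemd, it's a single prctl; a sketch in Python rather than C (Linux only; the constants are from linux/prctl.h):

```python
import ctypes

libc = ctypes.CDLL(None, use_errno=True)
PR_SET_NO_NEW_PRIVS = 38  # from linux/prctl.h
PR_GET_NO_NEW_PRIVS = 39

def no_new_privs():
    """Irreversibly stop this process (and its children) gaining privileges."""
    if libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_NO_NEW_PRIVS) failed")

no_new_privs()
# from here on, setuid/setgid binaries and file capabilities grant nothing
```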

This seems like a big, easy win; I don't need my PHP application to become a different user... or do I? It turns out that the venerable mail command, on a postfix system, eventually tries to run a setuid binary, which fails. And PHP defaults to sending mail via this route. Damn!

Let's try it out:

% systemd-run --user -p NoNewPrivileges=yes --tty -- bash
faux@astoria:~$ echo hi | mail someone@example.com

postdrop: warning: mail_queue_enter: create file maildrop/647297.680: Permission denied

Yep, postdrop is setgid (the weird s in the permissions string):

% ls -al =postdrop
-r-xr-sr-x 1 root postdrop 14328 Jul 29  2016 /usr/sbin/postdrop

It turns out that Debian dropped support for alternative ways to deliver mail. So, we can't use that!

Earlier I implied that NoNewPrivileges=, despite the documentation, doesn't remove all ways to get some privileges. One way to do this is to enter a new user namespace (only widely supported by Ubuntu as of today). e.g. we can get CAP_NET_RAW (and its associated vulnerabilities) through user namespaces:

% systemd-run --user -p NoNewPrivileges=yes --tty -- \
    unshare --map-root-user --net -- \
    capsh --print \
        | fgrep Current: | egrep -o 'cap_net\w+'

To harden against this, I wrote drop-privs-harder which simply breaks unshare (and its friend clone)'s ability to make new user namespaces, using seccomp.

Unlike NoNewPrivileges=, SystemCallFilter= takes many more arguments, and requires significantly more research to work out whether a process is going to work. Additionally, systemd-run doesn't support SystemCallFilter=. I'm not sure why.

To assist people playing around with this (on amd64 only!), I wrote a tool named seccomp-tool and a front-end named seccomp-filter.

There's a binary of seccomp-tool available for anyone who doesn't feel like compiling it. It depends only on libseccomp2 (sudo apt install libseccomp2), and needs to be on your path as seccomp-tool.

seccomp-filter supports the predefined system call sets from the systemd documentation, in addition to an extra set, named @critical, which systemd seems to silently include without telling you. Both of these tools set NoNewPrivileges=, so you will also be testing that.

Let's have a play:

% seccomp-filter.py @critical -- ls /
ls: reading directory '/': Function not implemented

Here, we're trying to run ls with only the absolutely critical syscalls enabled. ls, after starting, tries to call getdents() ("list the directory"), and gets told that it's not supported. Returning ENOSYS ("function not implemented") is the default behaviour for seccomp-filter.py.

We can have a permissions error, instead, if we like:

% seccomp-filter.py --exit-code EPERM @critical -- ls /
ls: reading directory '/': Operation not permitted

If we give it getdents, it starts working... almost:

% ./seccomp-filter.py --exit-code EPERM @critical getdents -- ls /proc

Why does the output look like it's been piped through a pager? ls has tried to talk to the terminal, has been told it can't, and is okay with that. This looks the same as:

seccomp-filter.py --blacklist ioctl -- ls /

If we add ioctl to the list again, ls pretty much works as expected, ignoring the fact that it segfaults during shutdown. systemd's @default group of syscalls is useful to include to remove this behaviour.

Next, I looked at what Java required. It turns out to be much better than I expected: the JVM will start up, compile things, etc. with just: @critical @default @basic-io @file-system futex rt_sigaction clone.

This actually works as a filter, too: if Java code tries to make a network connection, it is denied. Or, er, at least, something in that area is denied. Unfortunately, the JVM cra.. er.. "hard exits" for many of these failures, as they come through as unexpected asserts:


Assertion 'sigprocmask_many(SIG_BLOCK, &t, 14,26,13,17,20,29,1,10,12,27,23,28, -1) >= 0' failed at ../src/nss-myhostname/nss-myhostname.c:332, function _nss_myhostname_gethostbyname3_r(). Aborting.

It then prints out loads of uninitialised memory, as it doesn't expect uname to fail. e.g.

Memory: 4k page, physical 10916985944372480k(4595315k free), swap 8597700727024688k(18446131672566297518k free)

uname: [box][box]s[box]

This demonstrates only one of the operation modes for seccomp. Note that, as of today, the Wikipedia page is pretty out of date, and the manpage is outright misleading. Consider reading man:seccomp_rule_add(3), part of libseccomp2, to work out what's available.

Summary: Hardening good, hardening hard. Run your integration test suite under seccomp-filter.py --blacklist @obsolete @debug @reboot @swap @resources and see if you can at least get to that level of security?


HTTP2 slowed my site down!

At work, we have a page which asynchronously fetches information for a dashboard. This involves making many small requests back to the proxy, which is exactly the kind of thing that's supposed to be faster under HTTP2.

However, when we enabled HTTP2, the page went from loading in around two seconds, to taking over twenty seconds. This is bad. For a long time, I thought there was a bug in nginx's HTTP2 code, or in Chrome (and Firefox, and Edge..). The page visibly loads in blocks, with exactly five second pauses between the blocks.

The nginx config is simply:

location ~ /proxy/(.*)$ {
  resolver 8.8.8.8;
  proxy_pass https://$1/some/thing;
}

.. where the resolver is Google Public DNS.

It turns out that the problem isn't with HTTP2 at all. What's happening is that nginx is processing the requests successfully, and generating DNS lookups. It's sending these on to Google, and the first few are getting processed; the rest are being dropped. I don't know if this is due to the network (it's UDP, after all), or because Google thinks it's attack traffic. The remainder of the requests are retried by nginx's custom DNS resolver after 5s, and another batch gets processed.

So, why is this happening under HTTP2? Under http/1.1, the browser can't deliver requests quickly enough to trigger this flood protection. HTTP2 has sped it up to the point that there's a problem. Woo? On localhost, a custom client can actually generate requests quickly enough, even over http/1.1.

nginx recommend not using their custom DNS resolver over the internet, and I can understand why; I've had trouble with it before. To test, I deployed dnsmasq between nginx and Google:

dnsmasq -p 1337 -d -S --resolv-file=/dev/null

dnsmasq generates identical (as far as I can see) traffic, and is only slightly slower (52 packets in 11ms, vs. 9ms), but I am unable to catch it getting rate limited. In production, on a much smaller machine than the one I'm testing on, dnsmasq is significantly slower (100+ms), so it makes sense that it wouldn't trigger rate limiting. dnsmasq also has --dns-forward-max= (default 150), so there's a nice way out there.

In summary: When deploying HTTP2, or any upgrades, be aware of rate limits in your, or other people's, systems, that you may now be able to trigger.
