Sunday, June 19, 2016

Netflix and IPv6 -- Problem solved

I have been griping about Netflix's handling of IPv6 as it interacts with their GeoIP database. This leads them to believe that I am behind a proxy (as I use Hurricane Electric's excellent IPv6 Tunnel Broker service). 

Netflix could fix this problem themselves (if they chose to do it). The simplest approach would be to trigger a redirect to an IPv4-only version of the site if they don't like the IPv6 source address. However, they don't want to do that (it is work, I suppose). This leaves me no choice but to take action on my side (I'm getting grief from my kids that they can't watch their shows). The problem doesn't affect viewing Netflix on the big screen as we use Tivo boxes for that (and I guess they only support IPv4).

My setup at home uses dnscache as a local DNS cache, and I also have a DNS server written in Perl that handles special domains like my SPF record (and its references) and my ip6.arpa space.

To fix the Netflix problem, I added a forwarding entry to dnscache to point netflix.com to my local perl DNS server. The implementation of the handler for this is:

sub no_aaaa_handler {
    my ($base, $qname, $qclass, $qtype, $peerhost) = @_;
    my ($rcode, @ans, @auth, @add);
    $rcode = "NXDOMAIN";
    my $res = Net::DNS::Resolver->new(
                 nameservers => [qw(8.8.8.8 8.8.4.4)]);
    if ($qtype eq 'ANY') {
        $qtype = 'A';
    }
    my $ans = $res->send($qname, $qtype, $qclass);
    if ($ans) {
        @ans = grep { $_->type ne "AAAA" } $ans->answer;
        @add = grep { $_->type ne "AAAA" } $ans->additional;
        $rcode = $ans->header->rcode;
    }
    push @auth, @soa if $rcode eq 'NXDOMAIN';
    return ($rcode, \@ans, \@auth, \@add, { aa => 1 });
}



Problem solved -- traffic to Netflix is now forced over IPv4, and they think that they know where we live (actually, Maxmind gets the town right, though most of the others don't. They nearly all get the state right).

Monday, June 13, 2016

Netflix and IPv6

I have been running IPv6 at home for a few years now. I've been using a Hurricane Electric tunnel running over my Comcast IPv4 service. It performs startlingly well, with reduced latency compared to the native IPv6 Comcast service (which wasn't available when I started this process).

All was good until late May 2016, when my kids started asking me why Netflix was complaining about proxies and not letting them watch whatever it is that they watch. I ignored this for as long as possible -- whatever the problem was, it didn't affect my use of Netflix (we use Tivos as the main TV viewing platform). Then I caught a tweet which indicated that this message was a result of running an IPv6 tunnel. Why?

The Netflix help for the issue is completely useless. It was written by (charitably) a technical person who doesn't understand that the vast majority of Netflix viewers have no idea what IPv6 is (or even what IPv4 is). The message is:
Netflix supports any IPv6 connection that is natively provided to you by your ISP. Tunneling services that provide IPv6 over an IPv4 Network are not supported by Netflix, and may trigger an error message.
This message does not give you any clue as to what to do about the problem. Are they really saying "Reconfigure your network connectivity in order to view Netflix."?

I now understand what the problem is -- their GeoIP database is unable to locate the country where the IPv6 address is, and so they don't provide service to it. Does anybody know which GeoIP database they use? Maybe I could get that DB fixed. However, the whole idea behind Netflix is that it is easy and seamless to use (the aim being to discourage people from using pirated content). So why are they being so anti-paying-customer?

The only thing that I can think of is that they are not getting enough complaints. There are two things that they could do that are simple:
  1. Provide a list of IPv6 server addresses that people could block. This would force a fallback to IPv4 and then things would work.
  2. Fix the code so that if an IPv6 address cannot be geolocated, then force a redirect to IPv4. 
For now, I've had to disable the IPv6 stack on the kids' laptops. This hardly seems like an ideal solution.

Update: See Netflix-and-ipv6-problem-solved for the resolution.

Saturday, May 21, 2016

Adventures with NodeMCU

I've always wanted to build a retro-themed display for some weather data, and I've been thinking about how to do this for a few years. Recently, I started to assemble the hardware to actually make it a reality.

The essential piece of the system is an old-fashioned looking analog meter with a simple mechanism to choose the variable to be displayed (temperature, humidity, etc). I always wanted to be able to display two values on the same meter, so I needed a drive mechanism that could handle that. Eventually I found the VID28-05 which is a dual instrument stepper motor. These are designed for displays like car instrument panels so they are made in large volumes and hence are economical! Also they can be driven at 5 volts at low current.

The device that seemed to be suitable to drive these was the NodeMCU -- this is an ESP8266 based board that is very cheap but includes programming hardware and standard pin spacings. It is programmed in Lua -- which is great for prototyping.

The interface to the variable selector device was a cheap rotary encoder (as used in car stereo equipment) and I wrote a module for the NodeMCU to provide a sensible interface. In the course of doing this, I fixed a number of other issues with the base Lua firmware and ended up as a contributor to the nodemcu-firmware project.

One of the big issues with the ESP8266 chipset is that there is very limited RAM available and this is normally the limiting constraint on writing Lua code -- it all gets loaded into RAM at runtime and then interpreted.

It occurred to me that if this could be copied into the flash memory (of which there is a lot), and it could be executed directly, then this would enable much larger applications to be written. More importantly it would allow larger sets of standard libraries to be written and shared.

The base object in Lua for a piece of code is a 'function' which corresponds to the C structure 'Proto'.

typedef struct Proto {
  CommonHeader;
  TValue *k;  /* constants used by the function */
  Instruction *code;
  struct Proto **p;  /* functions defined inside the function */
  unsigned char *packedlineinfo;
  struct LocVar *locvars;  /* information about local variables */
  TString **upvalues;  /* upvalue names */
  TString  *source;
  int sizeupvalues;
  int sizek;  /* size of `k' */
  int sizecode;
  int sizep;  /* size of `p' */
  int sizelocvars;
  int linedefined;
  int lastlinedefined;
  GCObject *gclist;
  lu_byte nups;  /* number of upvalues */
  lu_byte numparams;
  lu_byte is_vararg;
  lu_byte maxstacksize;
} Proto;

It was fairly easy to copy the 'code' to flash and then replace the pointer to point at the readonly copy. Very quickly I discovered that, after writing to the flash directly, the memory mapped, readonly, view of the flash did not update. The documentation on the ESP8266 is pretty rudimentary. It is an Xtensa lx106 core with a number of custom peripherals designed by Espressif.

After some experimentation, it appears that if you read memory at +32k and +64k, then the original cached data is lost and so, if you access it again, then the data is fetched from the flash chip. I haven't done the experiments to see if the cache can be flushed with a single read.
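As a rough illustration of that experiment, here is a minimal sketch in C. The helper name evict_cached_word and the two offset macros are made up for this example, and the 32k/64k offsets are simply the values I observed, not anything documented by Espressif. It assumes that both addresses are still inside the mapped flash region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed offsets, taken from the experiment described in the text. */
#define EVICT_OFFSET_1 (32u * 1024u)
#define EVICT_OFFSET_2 (64u * 1024u)

/*
 * Hypothetical helper: read the words 32k and 64k past `addr` in the hope
 * that the cached copy of `addr` is discarded, so that the next read of
 * `addr` is fetched from the flash chip again.
 *
 * The pointers are volatile so the compiler cannot drop the reads. The XOR
 * of the two words is returned only to give the reads a visible result.
 */
uint32_t evict_cached_word(const volatile uint32_t *addr)
{
    /* The mapped flash view only supports aligned 32 bit loads. */
    assert(((uintptr_t)addr & 3u) == 0);

    const volatile uint32_t *first  = addr + EVICT_OFFSET_1 / sizeof(uint32_t);
    const volatile uint32_t *second = addr + EVICT_OFFSET_2 / sizeof(uint32_t);

    return *first ^ *second;
}
```

The function only reads memory, so it is harmless to call, but whether two reads are always enough (or whether one would do) is exactly the part I have not verified.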

However, it turns out that just moving the code into flash doesn't get much memory back. A lot is consumed in strings (the constants, the local variable names, the upvalue names etc). There is a 16 byte Lua header for each string, and an 8 (or possibly 16) byte memory management overhead per block. This eats into the 48k of RAM that is available. So the next step was to move the strings (represented as TString) into flash. The code seemed fairly straightforward...
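To put numbers on the string overhead, here is a small sketch in C that estimates the RAM cost of a set of strings. The function and macro names are invented for this example. It uses the 16 byte header from above, assumes the smaller 8 byte allocator overhead, counts the trailing NUL that Lua stores after each string, and ignores any rounding the allocator may do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TSTRING_HEADER_BYTES 16u  /* Lua string header, per the text */
#define ALLOC_OVERHEAD_BYTES 8u   /* assumption: could be 16 on some builds */

/* Estimated RAM consumed by one Lua string holding the C string `s`. */
size_t string_ram_cost(const char *s)
{
    assert(s != NULL);
    /* header + allocator bookkeeping + characters + trailing NUL */
    return TSTRING_HEADER_BYTES + ALLOC_OVERHEAD_BYTES + strlen(s) + 1u;
}

/* Estimated RAM consumed by `count` strings. */
size_t total_string_ram_cost(const char *const *strings, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += string_ram_cost(strings[i]);
    }
    return total;
}
```

Under these assumptions a three character string such as "GET" costs 28 bytes of RAM, of which only 3 are the actual characters.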

However, it didn't work except in the simplest case. The platform would lock up until the watchdog expired and triggered a reset. I had my suspicions that the garbage collector might be trying to write to my flash strings, but this should cause an exception rather than a watchdog timeout.

After some time, I recalled that the NodeMCU code had a custom exception handler that handled exceptions on 8 or 16 bit loads from flash. Apparently, the glue logic to the flash chip can only handle 32 bit loads (although it isn't clear whether this is always true or only when there is a cache miss). It turns out that the exception handler also gets triggered when there is a store to the flash region. The exception handler detects that it is a store, and then (effectively) does a busy wait till the watchdog times out. The underlying SDK (from Espressif) tries to register interrupt handlers so that it can print out a nice message and save the exception parameters for the next reboot. It was a quick fix to make writes to the flash trigger an immediate crash.
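The idea behind emulating narrow loads can be sketched in plain C. This is not the NodeMCU exception handler (which has to decode the faulting instruction), just an illustration of the core trick: do an aligned 32 bit load and then shift out the byte that was wanted. The function names are hypothetical, and the byte numbering assumes the little-endian layout used on the ESP8266.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Emulate an 8 bit load at byte `offset` from `words`, using only an
 * aligned 32 bit load. Byte 0 is the least significant byte of a word
 * (little-endian numbering, assumed here).
 */
uint8_t load_u8_via_word(const uint32_t *words, size_t offset)
{
    assert(words != NULL);
    uint32_t word = words[offset / 4u];              /* aligned 32 bit load */
    unsigned shift = (unsigned)(offset % 4u) * 8u;   /* position of the byte */
    return (uint8_t)(word >> shift);
}

/*
 * Emulate a 16 bit load at byte `offset`. Built from two byte loads so
 * that a halfword straddling two words is still handled.
 */
uint16_t load_u16_via_word(const uint32_t *words, size_t offset)
{
    uint16_t low  = load_u8_via_word(words, offset);
    uint16_t high = load_u8_via_word(words, offset + 1u);
    return (uint16_t)(low | (uint16_t)(high << 8));
}
```

There is no equivalent trick for stores, since the mapped view of the flash is readonly, which is why a store can only sensibly end in a crash.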

This did help me track down a number of places in the garbage collector where it was trying to 'mark' my readonly TString objects. I fixed these.

I started out testing with the following code

function validate(method)
   local httpMethods = {GET=true, HEAD=true, POST=true, PUT=true, DELETE=true, TRACE=true, OPTIONS=true, CONNECT=true, PATCH=true}
   return (httpMethods[method])
end

Once I got the copying to flash to not crash the platform immediately, I tried to exercise the code above (after it was copied to flash). 

> validate("GET")
nil

What??? After lots more investigation, it turns out that the table implementation in Lua relies on the fact that two strings with the same value ("GET") are represented as the same pointer. This is no longer true once the value inside the function is stored in flash, and the interactive prompt version is located in RAM. 

I fixed the rawequal function so that it would compare the values of strings (without any significant performance penalty). It then turned out that the table implementation also used another equality checking function, so I needed to fix that as well.
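The shape of the fix is roughly as follows. This is a simplified sketch, not the actual patch: SketchString is a made-up stand-in for Lua's TString (which does carry a hash and a length), and sketch_string_equal is a hypothetical name. The pointer comparison stays as the fast path, and the hash and length checks mean that the byte comparison is only reached for strings that are very likely equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for Lua's TString. */
typedef struct {
    uint32_t    hash;  /* precomputed hash of the contents */
    size_t      len;   /* length in bytes, excluding the trailing NUL */
    const char *data;  /* the characters (in RAM or in flash) */
} SketchString;

/* Compare two strings by value, keeping pointer equality as the fast path. */
bool sketch_string_equal(const SketchString *a, const SketchString *b)
{
    assert(a != NULL && b != NULL);

    if (a == b) {
        return true;   /* same object: the only case stock Lua needs */
    }
    if (a->hash != b->hash || a->len != b->len) {
        return false;  /* cheap rejection, no bytes compared */
    }
    return memcmp(a->data, b->data, a->len) == 0;
}
```

With a check like this, a "GET" held in flash and a "GET" created at the interactive prompt compare equal even though they are different objects.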

It feels as though I am heading down a rabbit hole.

The current state is that the platform still triggers a watchdog timeout for complicated cases, but simple cases now work. There is a significant reduction in the amount of memory consumed by code. I am hopeful that I can get the code to work reliably. Then the task will be to clean it up and make sure that there is no penalty when this copy-to-flash mode is not compiled in.