Monday, December 21, 2009

The ACMA blacklist, and can it be distributed securely?

One sore point is that the ACMA Blacklist for online restricted content is "closed" - that is, we the public have no way at present of viewing what is on it. The pro-filtering advocates quite validly state that opening up the ACMA Blacklist would basically be publishing URLs for naughty people to view - an "illegal content directory", if you will.

So what? If they want to find it, they'll find it - whether it is public or not.

The current downside though is that the blacklist can't be easily used by third-party filter software producers without what I understand to be an elaborate and expensive process.

So not only is it currently impossible for the public to vet the list to make sure only illegal content makes it on there, but the list also can't be widely used unless you're a company with a lot of money to burn.

It seems like a bit of a silly situation to be in, doesn't it?

So, is it feasible to distribute the list in some encrypted form? How easy would it be to recover what is on the list itself? This is a good question. The honest answer is "no, it isn't feasible to keep it hidden." Like everything technological, the true question is how much effort you're willing to go to in order to hide said list - and how much effort someone else is willing to spend uncovering it.

The ACMA blacklist is integrated into a few products which are currently available. The problem is hiding the URLs from the user. Software hackers are a clever bunch. If your computer runs the software then it is very possible to determine how to decrypt the URL list and use it. So simply shipping the ACMA blacklist out to client software - encrypted or not - is just never going to be secure. I believe this is how the ACMA blacklist was leaked to Wikileaks earlier in 2009.

There already exists a perfectly good way to distribute this sort of URL blacklist - Google's Safe Browsing list, for example. The ACMA could take the list of URLs, convert them to sets of MD5 strings to match against, and distribute that. They could distribute this openly - so everyone and anyone who wished to filter content based on this list could do so without having to pay the ACMA some silly amount of money. Finally, it means that web site owners could compare their own URLs against the content of the blacklist to see if any of their pages are on it. It may not be that feasible for very large, dynamic sites - but it certainly is more feasible than what can be done today.
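Here's a quick sketch of how that could work (hypothetical URLs, obviously - and MD5 only because that's the scheme described above; a modern list would probably use SHA-256):

```python
import hashlib

def build_hashed_list(urls):
    """Publisher side: hash each blacklisted URL, publish only the digests."""
    return sorted(hashlib.md5(u.encode("utf-8")).hexdigest() for u in urls)

def is_blacklisted(url, digests):
    """Filter side: hash the requested URL, check for digest membership."""
    return hashlib.md5(url.encode("utf-8")).hexdigest() in digests

# The published file contains only opaque hex strings...
published = set(build_hashed_list(["http://bad.example.invalid/page"]))

# ...but anyone can still check a URL against it - including a site
# owner checking whether their own pages are listed.
print(is_blacklisted("http://bad.example.invalid/page", published))  # True
print(is_blacklisted("http://good.example.invalid/", published))     # False
```

The important property: the distributed file never contains a readable URL, yet any filter - or any site owner - can run the same check against it.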

If the ACMA did this then I'd even write up a Squid plugin to filter against said ACMA blacklist. Small companies and schools can then use it for free. That would get the ACMA blacklist more exposure - which benefits the ACMA as much as it benefits anti-censor advocates. More use will translate to a larger cross-section of visited web sites - so people will be more likely to discover if something which shouldn't be blocked suddenly appears on the blacklist.

But is it truly secure? There's currently no practical way to take an MD5 string and turn it back into a URL. You could theoretically generate URLs until one hashed to that MD5 string, but it'd take a damned long time. So, for all practical purposes, the list can't be directly reversed.

But what can be done is to log the URLs which match the filter and slowly build up a list of sites that way. Naughty people could then publish the set of URLs which match the blacklist rules. There's no technological method of avoiding that. If people discover a URL has been filtered, they may just share the link online.

The only real way the government has to counter sharing the cleartext URLs from the blacklist would be to make it illegal and enforce that law very strictly. This means enforcing it when naughty stuff is shared - but it also means that anyone who publishes URLs for content which should not be on the list may also get punished. That is a whole other debate.

So in summary - yes, the ACMA could publish the blacklist in a form that is more open, and no less secure, than the current arrangement. They could publish it - like Google does - to the public, so it can be integrated into arbitrary pieces of software. This may help it be more widely adopted and tested. But they will never be able to publish the list in a way that makes it impossible to identify and publish the cleartext URLs.

Let me be clear here - there is no technological method for restricting what information people can share between each other, and this includes URLs identified to be on the ACMA blacklist.

Sunday, December 20, 2009

On filtering proxy/anonymizing servers..

I'd like to briefly talk about anonymizing/proxy servers. These services act as gateways between the user (and their web browser, for example) and the general internet. They typically hide the real origin of the user from the web site and the ISPs in question, so access cannot be easily traced. They are also useful diagnostic tools (eg to see whether web sites work from far-away networks.) Others use them to circumvent country-level filters which block access to free-speech and social networking web sites (eg in China, Iran, etc.)

I'm not going to talk about the legitimate and illegitimate uses of these. Plenty of other communication systems are used and abused in nefarious ways, but we don't see the postal system implement mandatory filtering of written correspondence; nor do we see (legal!) mandatory monitoring and filtering of the telephone/cellular network.

One common way of working around URL filters in the workplace, schools and libraries is to use an anonymizer/proxy service on the internet. This is how many schoolchildren log onto facebook and myspace. Their use is dangerous (as you're typically giving the service your facebook/myspace/hotmail/gmail/etc credentials!) but again, there are plenty of legitimate and safe uses for them.

The problem is constructing filters which block access through these anonymizer/proxy services. Some of them include the original URL in the request - those are relatively easy to block. Others encrypt or obfuscate the URL so a normal URL filter won't work. There are plenty of tricks pulled here; describing them all would take a long time.

A growing number of these anonymizer/proxy services are using SSL encryption to totally hide what is going on (ie, hiding not only the URL, but the content itself.) This is just not possible to break without some intrusive additions to the user's computer. Let's not go there.

So, there really are only a few ways to combat this:
  1. You create complicated rules for each anonymizer/proxy service which attempts to track and decode the URL, and filter on that; or
  2. You create complicated fingerprints to identify types of traffic which indicate the user of an anonymizer/proxy service, and filter on that; or
  3. You just block any and all anonymizer/proxy sites.
The problems!
  • 1 is difficult and longwinded. A lot of effort would have to be spent continuously updating the set of rules as new proxy services come on board, many designed specifically to thwart such rules.
  • 2 is just as difficult and longwinded - and it becomes possible that these fingerprints will identify legitimate sites as proxy services and filter traffic incorrectly.
  • 3 is what the majority of current content filters do. They don't bother trying to filter what people are doing with anonymizer/proxy services; they just blanket filter all of them.
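To give a feel for what option 1 involves, here's a toy rule for one hypothetical CGI proxy which happens to put its destination in a URL-safe-base64-encoded "u" parameter (the path and parameter name are my invention; every real service needs its own hand-built rule):

```python
import base64
from urllib.parse import urlparse, parse_qs

def extract_target(request_url):
    """Decode the real destination out of one specific proxy's requests."""
    parsed = urlparse(request_url)
    if parsed.path != "/browse.php":   # not this particular proxy service
        return None
    params = parse_qs(parsed.query)
    if "u" not in params:
        return None
    try:
        return base64.urlsafe_b64decode(params["u"][0]).decode("utf-8")
    except Exception:
        return None                    # not the encoding we expected

token = base64.urlsafe_b64encode(b"http://blocked.example.invalid/").decode()
print(extract_target("http://someproxy.invalid/browse.php?u=" + token))
```

Now multiply that by every proxy service on the internet, each with its own path layout and encoding, and keep it all up to date forever. That's option 1.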
Now, as I've mentioned, plenty of new anonymizer/proxy services pop up every day. I'd hazard a guess and suggest that the majority of them are run by shady, nefarious people who see the value in logging your access credentials to popular webmail/social networking sites and selling them to third parties.

The real concern - I've seen more than one user log onto their internet banking and work sites using these anonymizer/proxy services because they're so used to using them, they forget not to. Imagine, for a moment, that gambling sites are blocked and users turn to anonymizer/proxy services to gamble online. They use their credit card details. Ruh roh.

This is another example of the arms race which filtering companies deal with every day.
New anonymizer/proxy services are created every day - many specifically to allow users to bypass country-level filtering. Many of them may be logging and selling your authentication credentials to third parties. Users will simply begin using new anonymizer/proxy services as they crop up to work around any filtering which may be put in place. There is a non-trivial amount of effort required to keep track of all of these sites and no one will ever be 100% effective.

A large amount of effort will be needed to filter these services and perfectly legitimate uses will be blocked.

You don't want to push users to begin using anonymizing/proxy services - that is a battle that you won't win.

Saturday, December 19, 2009

Filtering via BGP, and why this doesn't always quite work out..

Another interesting thing to look at is the increasingly popular method of filtering by using a BGP "exception list". I've heard this anecdotally touted by various parties as "the solution" (but nothing I can quote publicly in any way, sorry) but really, is it?

This employs a little bit of routing trickery, redirecting traffic for the sites to be filtered via the proxy while passing the rest through untouched. This hopefully means that the amount of traffic and the number of websites which need to pass through the filter is a lot less than "everything."

This is how the British Telecom "cleanfeed" solution worked. They redirected traffic to be filtered via a bunch of Squid proxies to do the actual filtering. This worked out great - until they filtered Wikipedia for a specific image on a specific article. (I'd appreciate links to the above please so I can reference them.) Then everything went pear-shaped:
  • From my understanding, all the requests to Wikipedia came from one IP address, rather than properly pretending to be the client IP - this noticeably upset Wikipedia, who use IP addresses to identify potential spammers; and
  • The sheer volume of requests to Wikipedia going through the filtering service caused it to slow right down.
So that is problem number 1 - it looks like it will work fine on a set of hardly-viewed sites, but it may not work on a very busy site such as Wikipedia. Squid-based filtering solutions certainly won't work on the scale of filtering Youtube (at least, not without using the magic Adrian-Squid version which isn't so performance-limited [/advertisement].)

The next problem is determining which IP addresses to redirect. Websites may change their IP addresses often - or have many IP addresses! - and so the list of IP addresses needs to be constantly updated. The number of IP addresses which need to be injected into BGP is based on all of the possible IP addresses returned for each site - this varies in the real world from one to hundreds. Again, they may change frequently - requiring constant updating to be correct. This leads to two main potential issues:
  • Hardware routing devices (ie, the top of the line ones which large ISPs use to route gigabits of traffic) have limited "slots" for IP addresses/networks. The router typically stops working correctly when those run out. If you're lucky, the number of IP addresses being filtered will fit inside the hardware routing table. If you're unlucky, they won't. The big problem - different equipment from different vendors has different limitations. Upgrading this equipment can cost tens or hundreds of thousands of dollars.
  • The only traffic being filtered is traffic being redirected to the filter. If the list of IP addresses for a nasty website is not kept 100% up to date, the website will not be properly filtered.
The third main problem is filtering websites which employ Content Delivery Networks. This is a combination of the above two problems. So I'm going to pose the question:

How do you filter a web page on Google?

No, the answer here isn't "Contact Google and ask them to kill the Web page." I'm specifically asking how one filters a particular web page on a very distributed infrastructure. You know; the kind of infrastructure which everyone is deploying these days. This may be something like Google/Yahoo; this may be being hosted on a very large set of end user machines on an illegally run botnet. The problem space is still the same.
  • The set of IP addresses/networks involved in potentially serving that website is dynamic - you don't get an "easy" list of IPs when you resolve the hostname! For example - there are at least hundreds of potential hostnames serving Youtube streaming media content. It just isn't a case of filtering a single, well-known hostname.
  • There are a number of services running on the same infrastructure as Youtube. You may get lucky and only have one website - but you also may get unlucky and end up having to intercept all of the Google services just to filter one particular website.
All of a sudden you will end up potentially redirecting a significant portion of your web traffic to your filtering infrastructure. It may be happy filtering a handful of never-visited websites; but then you start feeding it a large part of the internet.

In summary, BGP based selective filtering doesn't work anywhere near as well as indicated in the ACMA report.
  • You can't guarantee that you'll enumerate all IP addresses involved for a specific website;
  • The ACMA may list something on a large website/CDN which will result in your filtering proxies melting; you may as well have paid the upfront cost in filtering everything in the first place;
  • The ACMA may list something with so many IP addresses that your network infrastructure either stops working; or the filter itself stops working.
Personally - I don't like the idea of the ACMA being able to crash ISPs because they list something which ISPs are just unable to economically filter. Thus, the only logical solution here is to specify a filtering infrastructure to filter -everything-.

How does a proxy interfere with throughput?

Another big question with the filtering debate is figuring out how much of an impact on performance an inline proxy filter will have.

Well, it's quite easy to estimate how much of an impact an inline server running Windows/UNIX will have on traffic. And this is important, as inline proxying is among the filtering mechanisms tested by the government and will be implemented by more than one of the ISPs.

Inline proxies are very popular today. They're used in a variety of ISP and corporate environments. Traffic is either forced there via configuration on users' machines, or redirected there transparently by some part of the network (eg a router, or transparent bridge.)

The inline proxy will "hijack" the TCP sessions from the client and terminate them locally. The client believes it is talking to the web server but is actually talking to the proxy server.

Then the inline proxy will issue outbound TCP sessions to the web server as requested by the user - and in some configurations, the web server will think it is talking directly to the client.

This is all relatively well understood stuff. It's been going on for 10-15 years. I was involved in some of the early implementations of this stuff in Australia back in the mid 1990's. What isn't always well understood is how it impacts performance and throughput. Sometimes this doesn't matter for a variety of reasons - the users may not have a large internet connection in the first place, or the proxy is specifically in place to limit how much bandwidth each user can consume. But these proxies are going to be used for everyone, some of which will have multi-megabit internet connections. Performance and throughput suddenly matter.

I'll cover one specific example today - how inline proxies affect data throughput. There are ways that inline proxies affect perceived request times (ie, how long it takes to begin and complete a web request) which will take a lot more space to write about.

Each request on the inline proxy - client facing and server facing - will have a bit of memory reserved to store data which is being sent and received. The throughput of a connection is, roughly speaking, limited by how big this buffer is. If the buffer is small, then you'll only get fast speeds when speaking to web sites that are next door to you. If the buffer is large, you'll get fast speeds when speaking to web sites that are overseas - but only if they too have large buffers on their servers.

These buffers take memory, and memory is a fixed commodity in a server. Just to give you an idea - if you have 1GB of RAM assigned for "network buffers", and you're using 64 kilobyte buffers for each session, then you can only hold open (1 gigabyte / 64 kilobytes) sessions - ie, 16,384 sessions. This may sound like a lot of sessions! But how fast can you download with a 64 kilobyte buffer?

If you're 1 millisecond away from the webserver (ie, it's on the same LAN as you), then that 64 kilobyte buffer will give you (64 / 0.001) kilobytes/second - or ~64 megabytes a second. That's 512 megabits. Quite quick, no?

But if you're on DSL, your latency will be at least 10 milliseconds on average. That's 6.4 megabytes a second, or 51.2 megabits. Hm, it's still faster than ADSL2, but suddenly it's slower than the bandwidth the NBN is going to give you.

Say you're streaming from Google. My Perth ISP routes traffic to/from Google in Sydney for a few services. That's 53 milliseconds. With 64 kilobyte buffers, that's (64 / 0.053), or 1207 kilobytes/second. Or, around a megabyte a second. Or, say, 8-10 megabits a second. That isn't even ADSL2 speed (24 megabits), let alone NBN speeds (100 megabits.)
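All of the numbers above fall out of the same simple relationship - throughput is roughly the buffer (window) size divided by the round trip time. A quick sketch, using the same round figures as above (64 kilobytes treated as 64,000 bytes so the arithmetic matches the text):

```python
def max_throughput_mbit(buffer_bytes, rtt_sec):
    """Rough upper bound: one buffer-full delivered per round trip."""
    return buffer_bytes / rtt_sec * 8 / 1e6

BUF = 64 * 1000  # 64 kilobyte per-session buffer, in round numbers

for label, rtt in [("LAN (1ms)", 0.001),
                   ("DSL (10ms)", 0.010),
                   ("Perth->Sydney (53ms)", 0.053)]:
    print(f"{label}: {max_throughput_mbit(BUF, rtt):.1f} megabits/sec")
# LAN (1ms): 512.0 megabits/sec
# DSL (10ms): 51.2 megabits/sec
# Perth->Sydney (53ms): 9.7 megabits/sec
```

Double the buffer and you double the ceiling; double the latency and you halve it. That trade-off drives everything below.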

So the operative question here is - how do you get such fast speeds when talking to websites in different cities, states or countries to you? The answer is quite simple. Your machine has plenty of RAM for you - so your buffers can be huge. Those streaming websites you're speaking to build servers which are optimised for handling large buffered streams - they'll buy servers which -just- stream flash/video/music, which handle a few thousand clients per server, and have gigabytes of RAM. They're making enough money from the service (I hope!) to just buy more streaming servers where needed - or they'll put the streaming servers all around the world, closer to end-users, so they don't need such big buffers when talking to end users.

What does this all mean for the performance through a filtering proxy?

Well, firstly, the ISP filtering proxy is going to be filtering all requests to a website. So, it'll have to filter all requests (say) to Youtube, or Wikipedia. This means that all streaming content is potentially passing through it. It's going to handle streaming and non-streaming requests for the websites in question.

So say you've got a filtering proxy with 16GB of RAM, and you've got 64 kilobyte buffers. You have:
  • Say, minimum, 262,144 concurrent sessions (16 gigabytes / 64 kilobytes) going through the proxy before you run out of network buffers. You may have more sessions available if there aren't many streaming/downloading connections, but you'll always have that minimum you need to worry about.
  • Actually, it's half of that - as you have a 64 kilobyte buffer for transmit and a 64 kilobyte buffer for receive. So that's 131,072 concurrent sessions.
  • If your streaming site is luckily on a LAN, and you're on a LAN to the proxy - you'll get ~ 100mbit.
  • If you're on ADSL (10 milliseconds) from the proxy - you'll get 6.4 megabytes/second, or 51 megabits/sec from the proxy.
  • If you're on NBN (1 millisecond, say) from the proxy - the buffer allows 64 megabytes/second, or 512 megabits, from the proxy - more than your link can use.
  • BUT - if the proxy is 50 milliseconds from the web server - then no matter how fast your connection is, you're only going to get a maximum of (65536 / 0.050) bytes/sec - around 1.3 megabytes/second, or roughly 10 megabits/second.
  • And woe betide you if you're talking to a US site (say, 200 milliseconds away). No matter how fast your connection is, the proxy will only achieve speeds of around 320 kilobytes/sec, or 2.5 megabits. Not even ADSL1 speed.
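Putting the numbers in that list together in one place (16GB of buffer RAM, 64 kilobyte buffers each way, and my assumed 200 millisecond round trip to the US):

```python
RAM = 16 * 2**30   # 16GB of network buffer memory
BUF = 64 * 1024    # 64 kilobyte buffer

def max_sessions(ram_bytes, buffer_bytes):
    """One transmit buffer plus one receive buffer per session."""
    return ram_bytes // (2 * buffer_bytes)

def session_mbit(buffer_bytes, rtt_sec):
    """Per-session throughput ceiling: one buffer-full per round trip."""
    return buffer_bytes / rtt_sec * 8 / 1e6

print(max_sessions(RAM, BUF))              # 131072 concurrent sessions
print(round(session_mbit(BUF, 0.200), 1))  # ~2.6 megabits each to a US server
```

The per-session worst case is the killer: no single client can exceed a couple of megabits to a distant site, regardless of how fast their own link is.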
The only way to increase the throughput your proxy has is to use larger buffers - which means either packing much more RAM into a server, or limiting the number of connections you can handle, or buying more servers. Or, if you're very unlucky, all of the above.

Now, the technical people in the know will say that modern operating systems have auto-tuning buffering. You only need to use big buffers for distant connections, rather than for all connections. And sure, they're right. This means that the proxy will handle more connections and obtain higher throughput. But the question now is how you design a proxy service which meets certain goals. Sure, you can design for the best case - and things like auto-tuning buffering is certainly good for raising the best-case performance. But the worst case performance doesn't change. If you have lots of international streaming sessions suddenly being filtered (because, say, some very popular US-centric website gets filtered because of one particular video), the performance will suddenly drop to the worst case scenario, and everyone suffers.

Now, just because I can blow my own trumpet a bit - when I design Squid/Lusca web proxy solutions, I always design for the worst case. And my proxies work better for it. Why? Because I make clear to the customer that the worst case solution is what we should be designing and budgeting for, and new equipment should be purchased based on that. The best case performance is just extra leg room during peak periods. That way clients are never surprised by poorly performing and unstable proxies, and the customer themselves knows exactly when to buy hardware. (They can then choose not to buy new proxies and save money - but then, when you're saving $100k a month on a $6k server, buying that second $6k server to save another $100k a month suddenly makes a lot of sense. Skimping on $6k and risking the wrath of your clients isn't appealing.)

Wednesday, December 16, 2009

People who don't understand an arms race aren't doomed to repeat it...

This article is amusing. Apparently geeks can build Napster to circumvent "stuff" so geeks should be able to build a better RC filter.

Here's some history for you.

"Stuff" initially was "we're already sharing files via DCC on the Internet Relay Chat system (IRC); let's make an indexed, shiny, graphical, automated version of that!" It wasn't to circumvent any kind of censorship or filtering, and it wasn't a great leap of imagination. It was a small, incremental improvement over what existed. The only reason you think it was a big leap for a lone teenager is that Napster popularised file sharing. It made it easy for the average teenager to do.

Secondly, there are most likely individuals and companies profiting off the construction and use of non-web-based distribution of RC materials. Filtering web traffic won't stop this distribution - it will simply stop the web distribution of RC materials. The filtering technology will quickly grow to counter the new distribution channels, and then new tools will appear to circumvent the filter. This is a classic arms race, pure and simple.

The only people who profiteer from an arms race are the arms dealers. In this case, the arms dealers are the companies developing tools to distribute the material, and companies developing tools to filter the material.

The astute reader should draw a parallel between what I've described and malware/viruses versus anti-virus software. Why is it we can't filter viruses 100%? Because there's money to be made in both writing the nasty software and filtering the nasty software. The end-users end up paying the price.

This censorship nonsense will suffer the same fate.

Why would more than 10,000 URLs be a problem?

I'm going to preface this (and all other censorship/filtering related posts) with a disclaimer:

I believe that mandatory censorship and filtering is wrong, inappropriate and risky.

That said, I'd like others to better understand the various technical issues behind implementing a filter. My hope is that people begin to understand the proper technical issues rather than simply re-stating others' potentially misguided opinions.

The "10,000 URL" limit is an interesting one. Since the report doesn't mention the specifics behind this view, and I can't find anything about it in my simple web searching, I'm going to make a stab in the dark.

Many people who implement filters using open source methods such as Squid will typically implement them as a check against a list of URLs. This searching can be implemented via two main methods:
  1. Building a list of matches (regular expressions, exact-match strings, etc) which is compared against; and
  2. Building a tree/hash/etc to match against in one pass.
Squid implements the former for regular expression matching and the latter for dstdomain/IP address matching.
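In miniature, the difference between the two methods looks like this (toy patterns and hostnames only):

```python
import re

# Method 1: a list of regular expressions, each checked in turn -
# the cost grows linearly with the number of entries.
regex_rules = [re.compile(p) for p in (r"nasty\.example\.invalid", r"/banned/")]

def regex_blocked(url):
    return any(r.search(url) for r in regex_rules)

# Method 2: a hash (set) of exact destination domains - a single
# lookup, no matter how many entries the list holds.
domain_rules = {"nasty.example.invalid", "evil.example.invalid"}

def domain_blocked(host):
    return host in domain_rules

print(regex_blocked("http://nasty.example.invalid/page"))  # True
print(domain_blocked("fine.example.invalid"))              # False
```

With ten entries the difference is invisible; with tens of thousands, the list scan is what melts your proxy and the hash lookup is what doesn't.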

What this unfortunately means is that the cost of full URL matching with regular expressions depends not only on the complexity of each regular expression, but on the number of entries - each entry in the list is checked in turn.

So when Squid (and similar) software is used to filter a large set of URLs, and regular expressions are used to match against, it is quite possible that there will be a limitation on how many URLs can be included before performance degrades.

So, how would one work around it?

It is possible to combine regular expression matches into one larger rule, versus checking against many smaller ones. Technical details - instead of /a/, /b/, /c/; one may use /(a|b|c)/. But unfortunately not all regular expression libraries handle very long regular expressions so for portability reasons this is not always done.
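A tiny illustration of combining many expressions into one alternation (toy patterns, nothing more):

```python
import re

parts = ["aaa", "bbb", "ccc"]

# Many small rules: each URL is scanned once per rule.
small_rules = [re.compile(p) for p in parts]
def match_many(url):
    return any(r.search(url) for r in small_rules)

# One merged alternation: each URL is scanned once, total.
merged = re.compile("(" + "|".join(parts) + ")")
def match_merged(url):
    return merged.search(url) is not None

url = "http://example.invalid/xbbbx"
print(match_many(url), match_merged(url))  # True True
```

Both give the same answers; the merged form just does it in one pass - provided your regular expression library copes with a pattern that long.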

Squid, at least, doesn't make it easy to match on the full URL without using regular expressions. Exact-match and glob-style matching (eg, a domain wildcard) would work very nicely. (I should also write that for Squid/Lusca at some point.)

A Google "Safe Browsing" type methodology may be used to avoid the use of regular expressions. This normalises the URL, breaks it up into parts, creates MD5 hashes for each part and compares them in turn to a large database of MD5 hashes. This provides a method of distributing the filtering list without specifically providing the cleartext list of URLs, and it turns all of the lookups into simple MD5 comparisons. The downside is that the filtering is a lot less powerful than regular expressions.
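Roughly, that style of lookup might go like this (the normalisation and part-splitting here are greatly simplified - my own invention for illustration, not the real Google scheme):

```python
import hashlib
from urllib.parse import urlparse

def url_parts(url):
    """Yield host/path combinations to try, most specific first."""
    p = urlparse(url.lower())
    host, path = p.netloc, (p.path or "/")
    yield host + path         # full host and path
    yield host + "/"          # the whole host
    if host.count(".") >= 2:  # and the parent domain
        yield host.split(".", 1)[1] + "/"

def blocked(url, digest_db):
    return any(hashlib.md5(part.encode("utf-8")).hexdigest() in digest_db
               for part in url_parts(url))

# A one-entry "database" blocking everything under example.invalid.
db = {hashlib.md5(b"example.invalid/").hexdigest()}
print(blocked("http://www.example.invalid/some/page", db))  # True
print(blocked("http://other.invalid/", db))                 # False
```

The filter only ever handles hex digests, so distributing the database doesn't hand out a readable URL directory - though, as noted above, matches can still be logged and leaked.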

To wrap up, I'm specifically not discussing the effectiveness of URL matching and these kinds of rules in building filters. That is a completely different subject - one which will typically end with "it's an arms race; we'll never really win it." The point is that it is possible to filter requests against a list of URLs and regular expressions much, much greater than a low arbitrary limit.

.. summary from Retro Night, take #2

Gah, I deleted the wrong post. Typical me.

Three of us got together on Tuesday night to resurrect some Amiga hardware. In summary, we have working machines, one sort of working machine, and a few bad floppy drives. The prize thus far is a working Amiga 2000 with a few megabytes of RAM expansion, a not-yet-working SCSI-and-drive expansion card (ghetto indeed!), a video genlock device to overlay graphics on a PAL/NTSC signal, and a functional Amiga 1200 with extra goodies.

The aim now is to get an environment working enough to write Amiga floppy images out so we can start playing some more games. I'm hoping the Amiga 1200, when paired with a floppy drive and some form of MS-DOS readable flash device, will fit that bill reasonably nicely.

More to come next week.

Tuesday, November 10, 2009

More issues with Lighttpd

So occasionally Lighttpd on FreeBSD-7.x+ZFS gets all upset. I -think- there's something weird going on where I hit mbuf exhaustion somehow when ZFS starts taking a long time to complete IO requests; then all socket IO fails in Lighttpd until it is restarted.

More investigation is required. Well, more statistics are needed so I can make better judgements. Well, actually, more functional backends are needed so I can take one out of production when something like this occurs, properly debug what is going on and try to fix it.

Cacheboy Update / October/November 2009


Just a few updates this time around!
  • Cacheboy was pushing around 800-1200mbit during the Firefox 3.5.4 release cycle. I started to hit issues with the backend server not keeping up with revalidating requests and so I'll have to improve the edge caching logic a little more.
  • Lusca seems quite happy serving up 300-400mbit from a single node though; which is a big plus.
  • I've found some quite horrible memory leaks in Quagga on only one of the edge nodes. I'll have to find some time to login and debug this a little more.
  • The second backend server is now officially toast. I need to acquire another 1RU server with 2 SATA slots and have it magically appear in downtown Manhattan, NY.

Thursday, October 8, 2009

Cacheboy downtime - hardware failures


I've had both backend servers fail today. One is throwing undervolt errors on one PSU line and is having disk issues (most likely related to the undervoltage); the other has simply crashed.

I'm waiting for remote hands to prod the other box into life.

This is why I'd like some more donated equipment and hosting - I can make things much more fault tolerant. Hint hint.

Wednesday, September 30, 2009

Just a few Lusca related updates!

  • All of the Cacheboy CDN nodes are running Lusca-HEAD now and are nice and stable.
  • I've deployed Lusca at a few customer sites and again, it is nice and stable.
  • The rebuild logic changes are, for the most part, nice and stable. There seems to be some weirdness with 32 vs 64 bit compilation options which I need to suss out but everything "just works" if you compile Lusca with large file/large cache file support regardless of the platform you're using. I may make that the default option.
  • I've got a couple of small coding projects to introduce a couple of small new features to Lusca - more on those when they're done!
  • Finally, I'm going to be migrating some more of the internal code over to use the sqinet_t type in preparation for IPv4/IPv6 agnostic support.
Stay Tuned!

Monday, September 21, 2009

My current wishlist

I'm going to put this on the website at some point, but I'm currently chasing a few things for Cacheboy:

  • More US nodes. I'll take anything from 50mbit to 5gbit at this point. I need more US nodes to be able to handle enough aggregate traffic to make optimising the CDN content selection methods worthwhile.
  • Some donations to cover my upcoming APNIC membership for ASN and IPv4/IPv6 space. This will run to about AUD $3500 this year and then around AUD $2500 a year after that.
  • Some 1ru/2ru server hardware in the San Francisco area
  • Another site or two willing to run a relatively low bandwidth "master" mirror site. I have one site in New York but I'd prefer to run a couple of others spread around Europe and the United States.
I'm sure more will come to mind as I build things out a little more.

New project - sugar labs!

I've just put the finishing touches on the basic sugar labs software repository. I'll hopefully be serving part or all of their software downloads shortly.

Sugar is the software behind the OLPC environment. It works on normal Intel-based PCs as far as I can tell. More information can be found at

Monday, August 31, 2009

Cacheboy presentation at AUSNOG

I've just presented on Cacheboy at AUSNOG in Sydney. The feedback so far has been reasonably positive.

There's more information available at

Monday, August 17, 2009

Cacheboy status update

So by and large, the pushing of bits is working quite well. I have a bunch of things to tidy up and a DNS backend to rewrite in C or C++ but that won't stop the bits from being pushed.

Unfortunately what I'm now lacking is US hosts to send traffic from. I still have more Europe and Asian connectivity than North American - and North America is absolutely where I need connectivity the most. Right now I'm only able to push 350-450 megabits of content from North America - and this puts a big, big limit on how much content I can serve overall.

Please contact me as soon as possible if you're interested in hosting a node in North America. I ideally need enough nodes to push between a gigabit and ten gigabits of traffic.

I will be able to start pushing noticeable amounts of content out of regional areas once I've sorted out North America. This includes places like Australia, Africa, South America and Eastern Europe. I'd love to be pushing more open source bits out of those locations to keep the transit use low but I just can't do so at the moment.

Canada node online and pushing bits!

The Canada/TORIX node is online thanks to John Nistor at prioritycolo in Toronto, Canada.

Thanks John!

Cacheboy is on WAIX!

Yesterday's traffic into WAIX:
ASN | MBytes | Requests | % of overall | AS Name
AS7545 | 17946.77 | 7437 | 29.85 | TPG-INTERNET-AP TPG Internet Pty Ltd
AS4802 | 12973.47 | 4476 | 21.58 | ASN-IINET iiNet Limited
AS4739 | 8497.92 | 2947 | 14.13 | CIX-ADELAIDE-AS Internode Systems Pty Ltd
AS9543 | 2524.57 | 1241 | 4.20 | WESTNET-AS-AP Westnet Internet Services
AS4854 | 2097.32 | 941 | 3.49 | NETSPACE-AS-AP Netspace Online Systems
AS17746 | 1881.17 | 1050 | 3.13 | ORCONINTERNET-NZ-AP Orcon Internet
AS9822 | 1425.44 | 456 | 2.37 | AMNET-AU-AP Amnet IT Services Pty Ltd
AS17435 | 1161.01 | 411 | 1.93 | WXC-AS-NZ WorldxChange Communications LTD
AS9443 | 1140.62 | 701 | 1.90 | INTERNETPRIMUS-AS-AP Primus Telecommunications
AS7657 | 891.93 | 1187 | 1.48 | VODAFONE-NZ-NGN-AS Vodafone NZ Ltd.
AS7718 | 740.74 | 272 | 1.23 | TRANSACT-SDN-AS TransACT IP Service Provider
AS7543 | 732.11 | 423 | 1.22 | PI-AU Pacific Internet (Australia) Pty Ltd
AS24313 | 527.38 | 252 | 0.88 | NSW-DET-AS NSW Department of Education and Training
AS9790 | 436.80 | 389 | 0.73 | CALLPLUS-NZ-AP CallPlus Services Limited
AS17412 | 365.13 | 228 | 0.61 | WOOSHWIRELESSNZ Woosh Wireless
AS17486 | 349.27 | 116 | 0.58 | SWIFTEL1-AP People Telecom Pty. Ltd.
AS17808 | 311.65 | 248 | 0.52 | VODAFONE-NZ-AP AS number for Vodafone NZ IP Networks
AS24093 | 303.40 | 114 | 0.50 | BIGAIR-AP BIGAIR. Multihoming ASN
AS9889 | 288.85 | 197 | 0.48 | MAXNET-NZ-AP Auckland
AS17705 | 282.49 | 84 | 0.47 | INSPIRENET-AS-AP InSPire Net Ltd

Query content served: 54878.07 mbytes; 23170 requests.
Total content served: 60123.25 mbytes; 28037 requests.

BGP aware DNS

I've just written up the first "test" hack of BGP aware DNS.

The basic logic is simple but evil. I'm simply mapping BGP next-hop to a set of weighted servers. A server is then randomly chosen from this pool.

I'm not doing this for -all- prefixes and POPs - it is only being used for two specific POPs where there is a lot of peering and almost no transit. There are a few issues regarding split horizon BGP/DNS and request routing which I'd like to fully sort out before I enable it for everything. I don't want a quirk to temporarily redirect -all- requests to -one- server cluster!

In any case, the test is working well. I'm serving ~10mbit to WAIX (Western Australia) and ~ 30mbit to TORIX (Toronto, Canada.)

All of the DNS based redirection caveats apply - most certainly that not all client requests to the caches will also be over peering. I'll have to craft some method(s) of tracking this.
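The next-hop-to-weighted-pool mapping described above can be sketched roughly as follows. This is an illustrative model only, not the actual Cacheboy DNS backend: the pool names, next-hop labels and the toy RIB lookup are all invented for the example.

```python
import random

# Hypothetical mapping: BGP next-hop (the peering router a client's prefix
# was learned from) -> weighted pool of CDN nodes. All names are made up.
NEXTHOP_POOLS = {
    "waix-peer": [("node-perth", 10)],
    "torix-peer": [("node-toronto-1", 7), ("node-toronto-2", 3)],
}
DEFAULT_POOL = [("node-global-1", 5), ("node-global-2", 5)]

def lookup_nexthop(client_prefix, rib):
    # Longest-prefix matching elided: this toy RIB maps prefix -> next-hop.
    return rib.get(client_prefix)

def pick_server(client_prefix, rib):
    # A server is randomly chosen from the pool, biased by its weight.
    pool = NEXTHOP_POOLS.get(lookup_nexthop(client_prefix, rib), DEFAULT_POOL)
    names = [name for name, _ in pool]
    weights = [weight for _, weight in pool]
    return random.choices(names, weights=weights, k=1)[0]

rib = {"203.0.113.0/24": "torix-peer"}
print(pick_server("203.0.113.0/24", rib))   # one of the Toronto nodes
print(pick_server("198.51.100.0/24", rib))  # unknown prefix: default pool
```

A real deployment would feed the RIB from a BGP session and do proper longest-prefix matching, but the selection logic itself really is this simple.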

Sunday, August 16, 2009

Squid-3 isn't a rewrite!


There seems to be this strange misconception that Squid-3 is a "rewrite" of Squid in C++. I am not sure where this particular little tidbit gets copy/pasted from but just for the record:

Squid-3 is the continuation of Squid-2.5, made to compile using the GNU C++ compiler. It is not a rewrite.

If Squid-3 -were- a rewrite, and the resultant code -was- as much of a crappy-performing, bastardised C/C++ hybrid, then I'd have suggested the C++ coders in question need to relearn C++. Luckily for them, the codebase is a hybrid of C and C++ because it did just start as a C codebase with bits and pieces part-migrated to C++.

Sunday, August 9, 2009

Updates - or why I've not been doing very much

G'day! Cacheboy has been running on autopilot for the last couple of months whilst I've been focusing on paid work and growing my little company. So far (mostly) so good there.

The main issue scaling traffic has been the range request handling in Squid/Lusca, so I've been working on fixing things up "just enough" to make it work in the firefox update environment. I think I've finally figured it out - and figured out the bugs in the range request handling in Squid too! - so I'll push out some updates to the network next week and throw it some more traffic.

I really am hoping to ramp traffic up past the gigabit mark once this is done. We'll just have to see!

Thursday, August 6, 2009

Preparation for next release; IPv6 checklist

I've been slowly working on tidying up the codebase before the next snapshot release. I've been avoiding doing further large scale code reorganisation until I'm confident that this codebase is as stable and performs as well as it should.

I'll hopefully have the next stable snapshot online tonight. I'll then re-evaluate where things are at right now and come up with a short-list of things to do over the next couple of weeks. It'll almost certainly be the remainder of the IPv6 preparation work - I'd like to prepare the last few bits of infrastructure for IPv6 - and make certain that is all stable before I start converting the client-side and server-side code to actively using the IPv6 routines.

The current IPv6 shortlist, if I decide to do it:
  1. client database code - convert to a radix tree instead of a hash on the IP address; make IPv4/IPv6 agnostic.
  2. persistent connection code - up the pconn hash key length to fit the text version of the IPv6 address. I'll worry about migrating the pconn code to a tree later on.
  3. Importing the last remaining bits of the IPv6 related code into the internal DNS code.
  4. Make sure the internal and external DNS choices both function properly when handling IPv6 addresses for forward and reverse lookups.
  5. Import the IP protocol ACL type and IPv6 address ACL types - src6 and dst6.
  6. Modify the ACL framework to use the IPv6 datatype instead of "sockaddr_in" and "inaddr" structs; then enable src6/dst6.
  7. Make certain the source and destination hostname ACLs function correctly for both IPv4 and IPv6.
  8. Test, test, test!
The last time I did a "hack" conversion to support IPv6 client side code I found a number of places which expected a newly-allocated struct to be zero'ed, and thus the "in_addr" embedded inside it to be INADDR_ANY. This caused some crashes to occur in production testing. I'm thus going to hold off on pushing through the IPv6 client side changes (which are actually surprisingly simple once the above is done!) until I've enumerated and fixed all of those particular nightmares.
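For a feel of what the src6/dst6-style agnostic matching in the checklist amounts to, here's a minimal sketch. Python's ipaddress module stands in for Lusca's internal sqinet_t type, and the ACL entries are illustrative, not real configuration.

```python
import ipaddress

# Hypothetical ACL: one IPv4 and one IPv6 source network.
acl_src = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

def acl_match(addr_text, networks):
    # ip_address() parses v4 and v6 transparently - the agnosticism the
    # sqinet_t conversion is working towards.
    addr = ipaddress.ip_address(addr_text)
    return any(addr.version == net.version and addr in net
               for net in networks)

print(acl_match("192.0.2.44", acl_src))     # True
print(acl_match("2001:db8::1", acl_src))    # True
print(acl_match("198.51.100.1", acl_src))   # False
```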

The IPv6 server-side stuff is a whole different barrel of fun. I'm going to ignore a lot of that for now until I've made certain the client-side code is stable and performing as well as the current IPv4-only code.

I don't even want to think about the FTP related changes that need to occur. I may leave the FTP support IPv4 only until someone asks (nicely) about it. The FTP code is rife with C string pointer manipulations which need to be rewritten to use the provided string primitives. I'd really like to do -that- before I consider upgrading it to handle IPv6.

Anyway. Lots to do, not enough spare time to do it all in.

Tuesday, July 28, 2009

Updates - rebuild logic, peering and COSS work

I've committed the initial modifications to the storage rebuilding code. The changes mostly live in the AUFS and COSS code - the rest of Lusca isn't affected.

The change pushes the rebuild logic itself into external helpers which simply stream swaplog entries to the main process. Lusca doesn't care how the swaplog entries are generated.

The external helper method is a big boost for AUFS. Each storedir creates a single rebuild helper process which can block on disk IO without blocking anything else. The original code in Squid will do a little disk IO work at a time - which almost always involved blocking the process until said disk IO completed.

The main motivation of this work was the removal of a lot of really horrible, twisty code and further modularisation of the codebase. The speedups to the rebuild process are a nice side-effect. The next big improvement will be sorting out how the swap logs are written. Fixing that will be key to allowing enormous caches to properly function without log rotation potentially destroying the proxy service.
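The helper-per-storedir flow can be modelled roughly like this. It is only a toy model: Python threads stand in for the external helper processes, and the swaplog entries are fabricated. The point is the shape - each helper streams entries independently, and the main loop just consumes them.

```python
import threading
import queue

def rebuild_helper(storedir_id, entries, out_q):
    # In Lusca this is the external process doing disk IO that may block -
    # here it just streams fabricated swaplog entries to the main process.
    for entry in entries:
        out_q.put((storedir_id, entry))
    out_q.put((storedir_id, None))  # end-of-stream marker

def rebuild_index(storedirs):
    out_q = queue.Queue()
    for sd_id, entries in storedirs.items():
        threading.Thread(target=rebuild_helper,
                         args=(sd_id, entries, out_q)).start()
    index, finished = [], 0
    while finished < len(storedirs):
        sd_id, entry = out_q.get()  # main process never blocks on disk
        if entry is None:
            finished += 1
        else:
            index.append((sd_id, entry))
    return index

storedirs = {0: ["obj-a", "obj-b"], 1: ["obj-c"]}
print(len(rebuild_index(storedirs)))  # 3 entries, rebuilt in parallel
```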

Monday, July 13, 2009

Caching Windows Updates

There are two issues with caching windows updates in squid/lusca:

* the requests for data themselves are all range requests, which means the content is never cached in Squid/Lusca;
* the responses contain validation information (eg ETags) but the object is -always- returned regardless of whether the validators match or not.

This feels a lot like Google Maps, which did the same thing with revalidation. Grr.

I'm not sure why Microsoft (and Google!) did this with their web services. I'll see if I can find someone inside Microsoft who can answer questions about the Windows Update related stuff to see if it is intentional (and document why) or whether it is an oversight which they would be interested in fixing.

In any case, I'm going to fix it for the handful of commercial supported customers which I have here.

Wednesday, July 8, 2009

VLC 1.0 released

VLC-1.0 has been released. The CDN is pushing out between 550 and 700mbit of VLC downloads. I'm sure it can do more but as I'm busy working elsewhere, I'm going to be overly conservative and leave the mirror weighting where it is.

Graphs to follow!

Storage rebuilding / logging project - proposal

I've put forward a basic proposal to the fledgling Lusca community to get funding to fix up the storage logging and rebuilding code.

Right now the storage logging (ie, "swap.state" logging) is done using synchronous IO and this starts to lag Lusca if there is a lot of disk file additions/deletions. It also takes a -long- time to rotate the store swap log (which effectively halts the proxy whilst the logs are rotated) and an even longer time to rebuild the cache index at startup.

I've braindumped the proposal here - .

Now, the good news is that I've implemented the rebuild helper programs and the results are -fantastic-. UFS cache dirs will still take forever to rebuild if the logfile doesn't exist or is corrupt but the helper programs speed this up by a factor of "LOTS". It also parallelises correctly - if you have 15 disks and you aren't hitting CPU/bus/controller limits, all the cache dirs will rebuild at full speed in parallel.

Rebuilding from the log files takes seconds rather than minutes.

Finally, I've sketched out how to solve the COSS startup/rebuild times and the even better news is that fixing the AUFS rebuild code will give me about 90% of what I need to fix COSS.

The bad news is that integrating this into the Lusca codebase and fixing up the rebuild process to take advantage of this parallelism is going to take 4 to 6 weeks of solid work. I'm looking for help from the community (and other interested parties) who would like to see this work go in. I have plenty of testers but nothing to help -coding- along and I unfortunately have to focus on projects that provide me with some revenue.

Please contact me if you're able to help with either coding or funding for this.

Monday, June 29, 2009

Current Downtime/issues

There's a current issue with content not being served correctly. It stemmed from a ZFS related panic on one of the backend servers (note to self - update to the very latest FreeBSD-7-stable code; these are all fixed!) which then came up with lighttpd but no ZFS mounts. Lighttpd then started returning 404's.

I'm now watching the backend(s) throw random connection failures and the Lusca caches then cache an error rather than the object.

I've fixed the backend giving trouble so it won't start up in that failed mode again and I've set the negative caching in the Lusca cache nodes to 30 seconds instead of the default 5 minutes. Hopefully the traffic levels will now pick up to where they're supposed to be.
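The negative-caching change is a one-line tweak in the proxy config; assuming Lusca keeps Squid-2's directive name, it looks like this:

```
# squid.conf / lusca.conf fragment (assumes Squid-2's directive name)
negative_ttl 30 seconds   # cache errors for 30s instead of the 5 minute default
```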

EDIT: The problem is again related to the Firefox range requests and Squid/Lusca's inability to cache range request fragments.

The backend failure(s) removed the objects from the cache. The problem now is that the objects aren't re-entering the cache because they are all range requests.

I'm going to wind down the Firefox content serving for now until I get some time to hack up Lusca "enough" to cache the range request objects. I may just do something dodgy with the URL rewriter to force a full object request to occur in the background. Hm, actually..
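The dodgy URL-rewriter idea above could look something like this: the helper passes every URL through unchanged, but the first time it sees a URL it kicks off a background full-object GET back through the proxy so the complete object lands in cache. This is a hypothetical sketch of a Squid/Lusca url_rewriter helper, not anything that exists; the proxy address and the demo input lines are invented.

```python
import sys
import threading
import urllib.request

PROXY = {"http": "http://127.0.0.1:3128"}   # assumption: local Lusca instance
seen = set()

def prefetch(url):
    # Best-effort full GET (no Range header) routed back through the proxy,
    # so the whole object enters the cache.
    try:
        opener = urllib.request.build_opener(urllib.request.ProxyHandler(PROXY))
        opener.open(url, timeout=60).read()
    except Exception:
        pass  # warm-up only; failures don't matter

def handle_line(line, spawn):
    # One url_rewriter exchange: the URL is the first field on the line,
    # and a bare newline reply means "no rewrite".
    url = line.split()[0]
    if url not in seen:
        seen.add(url)
        spawn(url)
    return "\n"

# Demo with fake helper input; a real helper would loop over sys.stdin and
# spawn prefetch() in a daemon thread instead of appending to a list.
fetched = []
for line in ["http://example.org/big.iso 192.0.2.1/- - GET",
             "http://example.org/big.iso 192.0.2.2/- - GET"]:
    handle_line(line, fetched.append)
print(fetched)  # prefetch triggered once per unique URL
```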

Saturday, June 27, 2009

New mirror node - italy

I've just turned on a new mirror node in Italy thanks to New Media Labs. They've provided some transit services and (I believe) 100mbit access to the local internet exchange.

Thanks guys!

Friday, June 26, 2009

Lusca in Production, #2

Here's some 24 hour statistics from a Lusca-HEAD forward proxy install:

Request rate:
File descriptor count (so clients, servers and disk FDs) :

Byte and Request hit ratio:

Traffic to clients (blue) and from the internet (red):

CPU Utilisation (1.0 is "100% of one core):

Lusca-head pushing >100mbit in production..

Here's the Lusca HEAD install under FreeBSD-7.2 + TPROXY patch. This is a basic configuration with minimal customisation. There's ~ 600gig of data in the cache and there's around 5TB of total disk storage.

You can see when I turned on half of the users, then all of the users. I think there's now around 10,000 active users sitting behind this single Lusca server.

Tuesday, June 23, 2009

Lusca-head in production with tproxy!

I've deployed Lusca head (the very latest head revision) in production for a client who is using a patched FreeBSD-7.2-STABLE to implement full transparency.

I'm using the patches and ipfw config available at .

The latest Lusca fixes some of the method_t related crashes due to some work done last year in Squid-2.HEAD. It seems quite stable now. The bugs only get tickled with invalid requests - so they show up in production but not with local testing. Hm, I need to "extend" my local testing to include generating a wide variety of errors.

Getting back on track, I've also helped another Lusca user deploy full transparency using the TPROXY4 support in the latest Linux kernel (I believe under Debian-unstable?) He helped me iron out some of the bugs which I've just not seen in my local testing. The important ones (method_t in particular) have been fixed; he's been filing Lusca issues in the google code tracker so I don't forget them. Ah, if all users were as helpful. :)

Anyway. It's nice to see Lusca in production. My customer should be turning it on for their entire satellite link (somewhere between 50 and 100mbit I think) in the next couple of days. I believe the other user has enabled it for 5000-odd users. I'll be asking them both for some statistics to publish once the cache has filled and has been tuned.

Stay tuned for example configurations and tutorials covering how this all works. :)

Wednesday, June 17, 2009

And the GeoIP summary..

And the geoip summary:

From Sun Jun 7 00:00:00 2009 to Sun Jun 14 00:00:00 2009



Tuesday, June 16, 2009

A quick snapshot of Cacheboy destinations..

The following is a snapshot of the per destination AS traffic information I'm keeping.

If you're peering with any of these ASes and are willing to sponsor a cacheboy node or two then please let me know. How well I can scale things at this point is rapidly becoming limited to where I can push traffic from, rather than anything intrinsic to the software.

From Sun Jun 7 00:00:00 2009 to Sun Jun 14 00:00:00 2009

ASN | MBytes | Requests | % of overall | AS Name
AS3320 | 602465.01 | 1021975 | 3.26 | DTAG Deutsche Telekom AG
AS7132 | 583164.05 | 778259 | 3.16 | SBIS-AS - AT&T Internet Services
AS19262 | 459322.30 | 603127 | 2.49 | VZGNI-TRANSIT - Verizon Internet Services Inc.
AS3215 | 330962.95 | 553299 | 1.79 | AS3215 France Telecom - Orange
AS3269 | 317534.06 | 333114 | 1.72 | ASN-IBSNAZ TELECOM ITALIA
AS9121 | 259768.32 | 434932 | 1.41 | TTNET TTnet Autonomous System
AS22773 | 244573.65 | 283427 | 1.32 | ASN-CXA-ALL-CCI-22773-RDC - Cox Communications Inc.
AS12322 | 224708.25 | 343686 | 1.22 | PROXAD AS for Proxad/Free ISP
AS3352 | 206093.84 | 305183 | 1.12 | TELEFONICADATA-ESPANA Internet Access Network of TDE
AS812 | 204120.74 | 166633 | 1.10 | ROGERS-CABLE - Rogers Cable Communications Inc.
AS8151 | 198918.22 | 328632 | 1.08 | Uninet S.A. de C.V.
AS6327 | 197906.53 | 152861 | 1.07 | SHAW - Shaw Communications Inc.
AS3209 | 191429.18 | 303787 | 1.04 | ARCOR-AS Arcor IP-Network
AS20115 | 182407.09 | 225151 | 0.99 | CHARTER-NET-HKY-NC - Charter Communications
AS577 | 181167.02 | 152383 | 0.98 | BACOM - Bell Canada
AS12874 | 172973.42 | 108429 | 0.94 | FASTWEB Fastweb Autonomous System
AS6389 | 165445.73 | 236133 | 0.90 | BELLSOUTH-NET-BLK - Inc.
AS6128 | 165183.07 | 210300 | 0.89 | CABLE-NET-1 - Cablevision Systems Corp.
AS2856 | 164332.96 | 219267 | 0.89 | BT-UK-AS BTnet UK Regional network

Query content served: 5234195.61 mbytes; 6878234 requests (ie, what was displayed in the table.)

Total content served: 18473721.25 mbytes; 26272660 requests (ie, the total amount of content served over the time period.)

Saturday, June 13, 2009

Seeking a few more US / Canada hosts

G'day everyone!

I'm now actively looking for some more Cacheboy CDN nodes in the United States and Canada. I've got around 3gbit of available bandwidth in Europe, 1gbit of available bandwidth in Japan but only 300mbit of available bandwidth in North America.

I'd really, really appreciate a couple of well-connected North American nodes so I can properly test the platform and software that I'm building. The majority of traffic is still North American in destination; I'm having to serve a fraction of it from Sweden and the United Kingdom at the moment. Erk.

Please drop me a line if you're interested. The node requirements are at . Thankyou!

Friday, June 12, 2009

Another day, another firefox release done..

The June Firefox 3.0.11 release rush is all but over and Cacheboy worked without much of a problem.

The changes I've made to the Lusca load shedding code (ie, being able to disable it :) works well for this workload. Migrating the backend to lighttpd (and fixing up the ETag generation to be properly consistent between 32 bit and 64 bit platforms) fixed the initial issues I was seeing.

The network pushed out around 850mbit at peak. Not a lot (heck, I can do that on one CPU of a mid-range server without a problem!) but it was a good enough test to show that things are working.

I need to teach Lusca a couple of new tricks, namely:

  • It needs to be taught to download at the fastest client speed, not the slowest; and

  • Some better range request caching needs to be added.

The former isn't too difficult - that's a weekend 5 line patch. The latter is more difficult. I don't really want to shoehorn range request caching into the current storage layer. It would look a lot like how Vary and ETag are currently handled (ie, with "magical" store entries acting as indexes to the real backend objects.) I'd rather put in a dirtier hack that is easy to undo now and use the opportunity to tidy up the whole storage layer a whole lot. But the "tidying up" rant is not for this blog entry, it's for the Lusca development blog.

The hack will most likely be a little logic to start downloading full objects that aren't in the cache when their first range request comes in - so subsequent range requests for those objects will be "glued" to the current request. It means that subsequent requests will "stall" until enough of the object is transferred to start satisfying their range request. The alternative is to pass through each range request to a backend until the full object is transferred and this would improve initial performance but there's a point where the backend could be overloaded with too many range requests for highly popular objects and that starts affecting how fast full objects are transferred.
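The "glue subsequent range requests onto one in-progress full download" behaviour can be sketched as follows. This is a toy model of the proposed hack, not Lusca code: the first range request for an uncached object would start one full-object fetch, and later range requests stall until enough of the object has arrived.

```python
class InFlightObject:
    """One full-object download in progress, with range requests glued on."""

    def __init__(self):
        self.buf = b""
        self.waiters = []  # (end_offset, callback) pairs still stalled

    def feed(self, chunk):
        # Data arrives from the single backend full-object fetch.
        self.buf += chunk
        still_waiting = []
        for end, cb in self.waiters:
            if len(self.buf) >= end:
                cb(self.buf[:end])       # enough data: satisfy the range
            else:
                still_waiting.append((end, cb))
        self.waiters = still_waiting

    def want_range(self, start, end, cb):
        # A client range request: serve now if possible, else stall ("glue").
        if len(self.buf) >= end:
            cb(self.buf[start:end])
        else:
            self.waiters.append((end, lambda b, s=start: cb(b[s:])))

obj = InFlightObject()
got = []
obj.want_range(0, 4, got.append)  # arrives before any data: stalls
obj.want_range(2, 6, got.append)  # second range glued to same download
obj.feed(b"abcd")                 # first range now satisfied
obj.feed(b"ef")                   # second range now satisfied
print(got)                        # [b'abcd', b'cdef']
```

The trade-off discussed above shows up directly here: every waiter is served from one backend transfer, at the cost of stalling until the bytes it needs have streamed in.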

As a side note, I should probably do up some math on a whiteboard here and see if I can model some of the potential behaviour(s). It would certainly be a good excuse to brush up on higher math clue. Hm..!

Thursday, June 11, 2009

Migrating to Lighttpd on the backend, and why aren't my files being cached..

I migrated away from apache-1.3 to Lighttpd-1.4.19 to handle the load better. Apache-1.3 handles lots of concurrent disk IO on large files fine but it bites for lots of concurrent network connections.

In theory, once all of the caching stuff is fixed, the backends will spend most of their time revalidating objects.

But for some weird reason I'm seeing TCP_REFRESH_MISS on my Lusca edge nodes and generally poor performance during this release. I look at the logs and find this:

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/2009042316 Firefox/3.0.10\r\n
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n
Accept-Language: en-us,en;q=0.5\r\n
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
If-Modified-Since: Wed, 03 Jun 2009 15:09:39 GMT\r\n
If-None-Match: "1721454571"\r\n
Cache-Control: max-stale=0\r\n
Connection: Keep-Alive\r\n
Pragma: no-cache\r\n
X-BlueCoat-Via: 24C3C50D45B23509\r\n]

[HTTP/1.0 200 OK\r\n
Content-Type: application/octet-stream\r\n
Accept-Ranges: bytes\r\n
ETag: "1687308715"\r\n
Last-Modified: Wed, 03 Jun 2009 15:09:39 GMT\r\n
Content-Length: 2178196\r\n
Date: Fri, 12 Jun 2009 04:25:40 GMT\r\n
Server: lighttpd/1.4.19\r\n
X-Cache: MISS from\r\n
Via: 1.0 (Lusca/LUSCA_HEAD)\r\n
Connection: keep-alive\r\n\r]

Notice the different ETags? Hm! I wonder what's going on. On a hunch I checked the ETags from both backends. master1 for that object gives "1721454571"; master2 gives "1687308715". They both have the same size and same timestamp. I wonder what is different?

Time to go digging into the depths of the lighttpd code.

EDIT: the etag generation is configurable. By default it uses the mtime, inode and filesize. Disabling inode and inode/mtime didn't help. I then found that earlier lighttpd versions have different etag generation behaviour based on 32 or 64 bit platforms. I'll build a local lighttpd package and see if I can replicate the behaviour on my 32/64 bit systems. Grr.

Meanwhile, Cacheboy isn't really serving any of the mozilla updates. :(

EDIT: so it turns out the bug is in the ETag generation code. They create an unsigned 32-bit integer hash value from the etag contents, then shovel it into a signed long for the ETag header. Unfortunately for FreeBSD-i386, "long" is a signed 32 bit type, and thus things go awry from time to time. Grrrrrr.
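The bug class is easy to reproduce in miniature: an unsigned 32-bit hash printed through a signed 32-bit "long" flips negative whenever the high bit is set, so a 32-bit and a 64-bit backend emit different ETag strings for the same file. The hash value below is made up for illustration, not taken from lighttpd.

```python
def etag_64bit_long(h32):
    # On a 64-bit platform, long holds the unsigned value intact.
    return '"%d"' % h32

def etag_32bit_long(h32):
    # On FreeBSD/i386, long is 32 bits: the high bit becomes a sign bit.
    if h32 >= 2**31:
        h32 -= 2**32
    return '"%d"' % h32

h = 0xCAFEBABE  # 3405691582 - an arbitrary hash with the high bit set
print(etag_64bit_long(h))  # "3405691582"
print(etag_32bit_long(h))  # "-889275714"
```

Hashes below 2^31 agree on both platforms, which is why the mismatch only shows up "from time to time".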

EDIT: fixed in a newly-built local lighttpd package; both backend servers are now doing the right thing. I'm going back to serving content.

Tuesday, June 2, 2009

New mirrors - and

I've just had two new sponsors show up with a pair of UK mirrors. is thanks to UK Broadband, who have graciously given me access to a few hundred megabits of traffic and space on an ESX server. (due to be turned up today!) is thanks to a private donor named Alex who has given me a server in his colocation space and up to a gigabit of traffic.

Shiny! Thanks to you both.

Tuesday, May 19, 2009

More profiling results

So whilst I wait for some of the base restructuring in LUSCA_HEAD to "bake" (ie, settle down, be stable, etc) I've been doing some more profiling.

Top CPU users on my P3 testing box:

root@jennifer:/home/adrian/work/lusca/branches/LUSCA_HEAD/src# oplist ./squid
CPU: PIII, speed 634.464 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        image name               symbol name
1832194   8.7351            _int_malloc
969838    4.6238            memcpy
778086    3.7096            malloc_consolidate
647097    3.0851            vfprintf
479264    2.2849            _int_free
468865    2.2354            free
382189    1.8221            calloc
326540    1.5568  squid                    memPoolAlloc
277866    1.3247            re_search_internal
256727    1.2240            strncasecmp
249860    1.1912  squid                    httpHeaderIdByName
248918    1.1867  squid                    comm_select
238686    1.1380            strtok
215302    1.0265  squid                    statHistBin

Sigh. snprintf() leads to this:

root@jennifer:/home/adrian/work/lusca/branches/LUSCA_HEAD/src# opsymbol ./squid snprintf
CPU: PIII, speed 634.464 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        image name               symbol name
  15840     1.7869  squid                    safe_inet_addr
  29606     3.3398  squid                    pconnPush
  54021     6.0940  squid                    httpBuildRequestHeader
  64916     7.3230            inet_ntoa
  105692   11.9229  squid                    clientSendHeaders
  127929   14.4314  squid                    urlCanonicalClean
  196481   22.1646  squid                    pconnKey
  265474   29.9476  squid                    urlCanonical
36879    100.000            snprintf
  813284   91.7449            vsnprintf
  36879     4.1602            snprintf [self]
  12516     1.4119            vfprintf
  9209      1.0388            _IO_no_init

Double sigh. Hi, the 90's called, they'd like their printf()-in-performance-critical code back.

Wednesday, May 13, 2009

LUSCA_HEAD (as of today) is in production

I'm deploying the current LUSCA_HEAD build over the Cacheboy CDN network. So far so good.

Now, need more traffic.. :)

Tuesday, April 28, 2009

More brain-damage code..

I queued up another test run of LUSCA_HEAD on the 600mhz celeron box and discovered that performance had dropped slightly.

It turns out that there's an increase in malloc/free calls which I traced back to the ACL code and further back to the HTTP logging code.

The HTTP request logging code is using ACLs to determine what to log. I have the access log disabled AND I'm not even -using- the log_access ACL directive to restrict what is being logged but it still is taking a trip through the ACL checklist code to create an empty check list, check it, and return NULL. This is aside from all the rest of the logging code which will create all the duplicate strings for the logging layer, then not log anything, then free them.

Sigh. The more I look at the code the more I want to stab out my own eyes. Oh well, I'll add "fix the damned logging code" to my TODO list for later.

Wednesday, April 22, 2009

More traffic!

970mbit and rising...!

- mozilla!

The nice folk at have provided me with a .jp CDN node. I'm now serving a good stack of bits from it into Australia, Malaysia, India, Japan, China, Korea and the Philippines.

Thanks guys!

Mozilla 3.0.9 release..

The mozilla release (3.0.9) is currently going on. The traffic levels are ramping up now to the release peak.

880mbit/sec and counting..

Monday, April 13, 2009

Changing tack - more modularisation?

I've decided that tackling the storage manager codebase in its current shape is a bit too much right now. I may end up going down the path of using refcounted buffers as part of the data store to temporarily work around the shortcomings in the code. I don't like the idea of doing that long-term because, to be quite honest, that area of the codebase needs a whole lot of reorganisation and sanity to make it consistent and understandable. I think it should be done before any larger-scale changes are done and this includes the obvious performance boosts available by avoiding the copying.

In any case, I'm going to move onto my next short-term goal - a very, very basic module framework. I'd like to at least shoehorn in some very simple dynamic module loading and unloading into the core, not unlike what TMF did for Squid-3 as part of their initial eCAP work. My initial plans are to do the bare minimum necessary to start breaking out a very small chunk of code into modules - namely the request rewriters (url, storeurl and location) so they don't have to be compiled in. It will also force a little bit of tidying up around the HTTP and client-side code.

The initial aim is purely code reorganisation. Instead of having a nested list of callbacks forming the request processing chain, I plan on having a simple iterative process finite state machine which will route the request through different modules as required before passing it along to the rest of the client-side code. I hope that I can (slowly!) unwind a large part of the request path hairiness and enumerate it with said state engine.

In any case, I won't be going anywhere near as far with this as I'd like to in the first pass. There are plenty of problems with this (the biggest being parsing compound configuration types like ACLs - for example, if I wanted to modularise the "asn" ACL type, the module will need to be loaded far before the rest of the configuration file is parsed; then it needs to hook itself into the ACL code and register (a la what happened with Squid-3) itself in there; then subsequent ACL line parsing needs to direct things at the ACL module; then the ACL module needs to make sure its not unloaded until everything referencing it is gone!) but I'm going to pleasantly ignore them all the first time around.

By keeping the scope low and the changes minimal, I hope that the amount of re-recoding needed later on down the track (once I've established exactly what is needed for all of this) should be limited.
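The iterative request-routing idea above - a flat state machine walking a chain of loadable rewriter modules instead of nested callbacks - could be sketched like this. The module names (url, location) come from the post; the registration API and return values are entirely invented.

```python
MODULES = []  # (stage, handler) pairs, in registration order

def register(stage):
    # Hypothetical module registration hook; a loadable module would call
    # this when loaded and the entry would be removed on unload.
    def wrap(fn):
        MODULES.append((stage, fn))
        return fn
    return wrap

@register("url")
def url_rewrite(req):
    req["url"] = req["url"].replace("http://old.example/",
                                    "http://new.example/")
    return "continue"   # keep routing through the chain

@register("location")
def location_check(req):
    return "done"       # hand the request to the client-side code

def process(req):
    # Simple iterative routing - no nested callback chain.
    for stage, handler in MODULES:
        if handler(req) == "done":
            break
    return req

req = process({"url": "http://old.example/file"})
print(req["url"])  # http://new.example/file
```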

Oh, and as an aside but related project, I'm slowly fixing the SNMP core startup code to -not- use 15-20 nested deep function calls as part of its MIB tree construction. It is a cute functional programming type construct but it is absolutely horrible to try and add something. The related bit is allowing for SNMP mib leaves to be added -and- removed at runtime - so modules can register themselves with the cachemgr and SNMP core to provide their stats. 

Thursday, April 9, 2009

Fixing the disk code, part 4

My changes are now occasionally randomly crashing in malloc() - which means something is modifying memory it shouldn't and confusing things. It's a shame that I can't run this under valgrind - the required traffic load to generate this issue at the moment makes valgrind impossible to use.

My guess is that something is ripping the read buffer out from underneath the pread() in the worker thread. Figuring out exactly where this is happening is going to be tricky.

The prime candidate at the moment is where the read is cancelled indirectly via a call to aioClose(). aioClose() calls aioCancel() which attempts to cancel all of the pending events but I'm absolutely sure that it is possible for a read event to be in progress. The fact this is occurring was hidden by the read() buffer being a temporary local buffer - the read() would complete (error or not) and then the result would be tossed away.

The solution? I'm not sure. I have a feeling I'm going to be introducing something like AsyncCalls in Squid-3 to delay the IO callbacks from occurring - but then, for those keeping score, the AsyncCalls stuff in squid-3 (in its current inclusion) has made some code unstable and other bits of code difficult to debug. I wonder what repercussions there would be to splitting out the generation of IO callbacks and their execution; it would certainly help with diskd..
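Splitting callback generation from execution amounts to something like the following sketch. It is a toy model of the AsyncCalls-style idea, with invented names: completions are queued rather than invoked directly, so a cancel (e.g. via aioClose()) can void a completion before the main loop ever runs it against a freed buffer.

```python
from collections import deque

class CallQueue:
    """Queue IO completion callbacks instead of running them inline."""

    def __init__(self):
        self.q = deque()

    def schedule(self, op_id, cb, result):
        # Worker thread side: completion is *generated* here, not executed.
        self.q.append({"op": op_id, "cb": cb, "result": result, "valid": True})

    def cancel(self, op_id):
        # Cancellation marks the pending completion invalid; the late
        # read() result arrives but is never delivered.
        for call in self.q:
            if call["op"] == op_id:
                call["valid"] = False

    def run_all(self):
        # Main loop side: completions are *executed* here.
        while self.q:
            call = self.q.popleft()
            if call["valid"]:
                call["cb"](call["result"])

calls = CallQueue()
done = []
calls.schedule(1, done.append, "read-ok")
calls.schedule(2, done.append, "read-into-freed-buffer")
calls.cancel(2)   # aioClose() cancelled this read while it was in flight
calls.run_all()
print(done)       # ['read-ok']
```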

Wednesday, April 8, 2009

Fixing the disk code, part 3

The current store client code experiment involves this:

  • store clients are created and destroyed as per normal;
  • creation sets "", destruction clears "";
  • destruction doesn't deallocate the store client - it removes it from various lists and only -attempts- to free it via a new function, storeClientComplete();
  • storeClientComplete() (badly named) will only free the store client if it is marked as inactive and the callbacks it is a part of have returned;
  • in the callback functions, storeClientComplete() is called to see whether the store client should be freed, and if so, the callback terminates;
  • in the callback functions, if "" is clear, the callback is terminated.

Since cbdataFree() is not called until the flags (active, disk_io_pending, event_pending) are clear, it is possible that some callbacks will be called when they shouldn't be. In order to preserve the existing behaviour I need to further audit the store client code, find out which functions are passed in as callbacks, and make sure those functions correctly check the active state of the store client and also try calling storeClientComplete(). The problem here is that what controls the callback being called is the validity of the callback -data- - so I am hoping that there are no non-store client callbacks being scheduled with a store client instance as the callback data. Gah.

The other issue at the moment is that there is currently no guarantee (read: I haven't ensured it happens!) from the store API that the FileClosed() and FileNotify() routines get called before the store client is deregistered. I therefore can't delay free'ing the store client until those are also called. It should not be a big deal, because the main motivation for delaying the store client -free- is to allow pending async operations using other data (in this case, read buffers) to complete before free'ing the data. In the two above instances the behaviour should be the same as before - the store client pointer will simply go invalid and the callbacks won't be made.

Monday, April 6, 2009

Lusca and Cacheboy improvements in the pipeline..

After profiling Lusca-HEAD rather extensively on the CDN nodes, I've discovered that the largest CPU "use" on the core 2 duo class boxes is memcpy(). On the ia64-2 node memcpy() shows up much lower down in the list. I'm sure this has to do with the differing FSB and general memory bus bandwidth available on the two architectures.

I'm planning out the changes to the store client needed to support fully copy-free async read and write. This should reduce the CPU overhead on core 2 duo class machines to the point where Lusca should break GigE throughput on this workload without too much CPU use. (I'm sure it could break GigE throughput right now on this workload though.)

I'll code this all up during the week and build a simulated testing rig at home "pretending" to be a whole lot of clients downloading partial bits of mozilla/firefox updates, complete with random packet loss, latency and abort probabilities.

I also plan on finally releasing the bulk of the Cacheboy CDN software (hackish as it is!) during the week, right after I finally remove the last few bits of hard-coded configuration locations. :) I still haven't finished merging in the bits of code which do the health check, calculate the current probabilities to assign each host and then write out the geoip map files. I'll try to sort that out over the next few days and get a public subversion repository with the software online.

By the way, I plan on releasing the Cacheboy CDN software under the Affero GPL (AGPL) licence.

Sunday, April 5, 2009

Fixing the disk code, part 2

I've been reviewing and documenting the store client code to get a feel for how the code works and is used. Sure enough, it relies heavily on the callback data invalidation magic to avoid having to wait around until store disk file operations complete (or until an event scheduled via eventAdd() fires at a later date.) I added a little bit of logging to print out a warning if a pending disk operation was there and, sure enough, it happens often enough in production to probably cause all those random crashes that I remember in Squid-1.2.

Anyway. I have an idea for eliminating the read copy in the async IO code. Something I've done in other C programs is this:
  • Track the pending events which the store client code schedules callbacks into itself (ie, the eventAdd() for later processing, and pending disk IO)
  • Split storeClientUnregister() into two parts - one which sets a "shutting down" flag, and another which does the hard work
  • assert() that API methods aren't called (ie, storeClientRef(), etc) if the shutting down flag is set - ie, they shouldn't be at the present time, so any instance of that happening means that some callback is occurring where it shouldn't be and getting further than it should be!
  • On event/disk callback completion, clear the relevant flags and see if the store client is completely ready to be destroyed. Only then destroy it.
This -should- preserve the current behaviour (callbacks will return immediately instead of not being called, so they effectively behave the same) but it won't prevent non-callback code from running if the callback data pointer is "valid". (Which, hopefully isn't actually happening, but god knows in this codebase..) It means that the callback data "is valid" check in the store client code (and hopefully, the disk IO code) really becomes a debugging check rather than being used to control whether the callback is made or not.
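That flag-tracking idea can be sketched roughly like this - the structure and function names are hypothetical, not the real store client code:

```c
#include <stdlib.h>

/* Hypothetical store-client lifecycle: the free is deferred until all
 * pending async operations (events, disk IO) have drained. */
struct store_client {
    int active;            /* cleared by the "unregister" half */
    int event_pending;     /* eventAdd()-style callback outstanding */
    int disk_io_pending;   /* async disk read/write outstanding */
};

/* Free only when inactive and nothing is outstanding; returns 1 if freed. */
static int store_client_try_free(struct store_client *sc)
{
    if (sc->active || sc->event_pending || sc->disk_io_pending)
        return 0;
    free(sc);
    return 1;
}

/* First half of the split unregister: just flag shutdown... */
static int store_client_unregister(struct store_client *sc)
{
    sc->active = 0;
    return store_client_try_free(sc);   /* ...and free if already idle */
}

/* Completion path for a pending disk IO: clear the flag, retry the free. */
static int store_client_disk_done(struct store_client *sc)
{
    sc->disk_io_pending = 0;
    return store_client_try_free(sc);
}
```

The point is that "destroy" becomes "clear a flag and retry the free", so whichever path finishes last - unregister, event callback, or disk IO completion - is the one that actually releases the memory.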

I'm just worried that there's going to be weirdness in how the store client code is being called. Grr.

Anyway. If this is successful and proves itself in production, I'll then change the disk APIs to also enforce that the callback data remains valid for the lifetime of the scheduled IO. Which IMHO it should be. As I said earlier, using tricks like reference counted buffers shouldn't be a crutch for a bad API design..

Saturday, April 4, 2009

Fixing the disk APIs (or, reading Squid-1.2 code)

One of the performance limiting things I've found with the mozilla workload is the memcpy() being done from the temporary read buffer into the reader buffer (ie, what was supplied to storeClientRead*()) and, curious to know the history of this, I decided to go digging into the earlier revisions of all of this code. I went back to Squid-1.2.beta25.

I found that file_read() and file_write() (ie, the current "sync only" disk routines) had the hooks into calling the async IO functions (ie, aioRead() and aioWrite()) if they were compiled in. Like now, aio_read() makes a temporary buffer to read into and then copies it to the original request buffer if the request wasn't cancelled between submission and completion time.

In the 1.2 code, the completion path copied the data into the supplied buffer and then called the callback. This didn't check that the callback data was still valid. Tsk. In more recent code, the AIO read callback actually supplies the buffer to the completion callback (which is only called if valid!) and it's up to the callee to do the copy. Tsk.

The problem is that neither "API" forced the caller to ensure the buffer stays valid for the length of the call. I'm going to investigate doing this, but I'm not sure whether it is easier or more difficult than shoehorning in reference counted buffers at this point. I want to use refcounted buffers throughout the code, but not as a crutch for a badly designed API.

Tuesday, March 31, 2009

Lusca snapshot released

I've just put up a snapshot of the version of lusca-head which is running on the cacheboy cdn. Head to .

lusca release - rev 13894

I've just put the latest Lusca-HEAD release up for download on the downloads page. This is the version which is currently running on the busiest Cacheboy CDN nodes (> 200mbit each) with plenty of resources to spare.

The major changes from Lusca-1.0 (and Squid-2 / Squid-3, before that):

  • The memory pools code has been gutted so it now acts as a statistics-keeping wrapper around malloc() rather than trying to cache memory allocations; this is in preparation for finding and fixing the worst memory users in the codebase!
  • The addition of reference counted buffers and some support framework has appeared!
  • The server-side code has been reorganised somewhat in preparation for copy-free data flow from the server to the store (src/http.c)
  • The asynchronous disk IO code has been extracted out from the AUFS codebase and turned into its own (mostly - one external variable left..) standalone library - it should be reusable by other parts of Lusca now
  • Some more performance work across the board
  • Code reorganisation and tidying up in preparation for further IPv6 integration (which was mostly completed in another branch, but I decided it moved along too quickly and caused some stability issues I wasn't willing to keep in Lusca for now..)
  • More code has been shuffled into separate libraries (especially libhttp/ - the HTTP code library) in preparation for some widescale performance changes.
  • Plenty more headerdoc-based code documentation!
  • Support for FreeBSD-current full transparent interception and Linux TPROXY-4 based full transparent interception
The next few weeks should be interesting. I'll post a TODO list once I'm back in Australia.

Documentation pass - disk IO

I've begun documenting the disk IO routines in Lusca-HEAD. I've started with the legacy disk routines in libiapp/disk.c - these are used by some of the early code (errorpages, icon loading, etc) and the UFS based swap log (ufs, aufs, diskd).

This code is relatively straightforward, but I'm also documenting the API shortcomings so others (including myself!) will realise why it isn't quite good enough in its current form for fully asynchronous disk IO.

I'll make a pass through the async IO code on the flight back to Australia (so libasyncio/aiops.c and libasyncio/async_io.c) to document how the existing async IO code works and its shortcomings.

Monday, March 30, 2009

Mirroring a new project - Cyberduck!

I've just started providing mirror download services for Cyberduck - a file manager for a wide variety of platforms including the traditional (SFTP, FTP) and the new (Amazon/S3, WebDAV.) Cacheboy is listed as the primary download site on the main page.


Mozilla 3.0.8 release!

The CDN handled the load with oodles to spare. The aggregate client traffic peak was about 650mbit across 5 major boxes. The boxes themselves peaked at about 160mbit each, depending upon the time of day (ie, whether Europe or the US was active.) None of the nodes were anywhere near maximum CPU utilisation.

About 2 and a half TB of mozilla updates a day are being shuffled out.

I'd like to try pushing a couple of the nodes up to 600mbit -each- but I don't have enough CDN nodes to guarantee the bits will keep flowing if said node fails. I'll just have to be patient and wait for a few more sponsors to step up and provide some hardware and bandwidth to the project.

So far so good - the bits are flowing, I'm able to use this to benchmark Lusca development and fix performance bottlenecks before they become serious (in this environment, at least) and things are growing at about the right rate for me to not need to panic. :)

My next major goal will be to finish off the BGP library and lookup daemon; flesh out some BGP related redirection map logic; and start investigating reporting for "services" on the box. Hm, I may have to write some nagios plugins after all..

Sunday, March 29, 2009

Async IO changes to Lusca

I've just cut out all of the mempool'ed buffers from Lusca and converted said buffers to just use xmalloc()/xfree(). Since memory pools in Lusca now don't "cache" memory - ie, they're just for statistics keeping - the only extra thing said memory pooled buffers were doing was providing NUL'ed memory for disk buffers.

So now, for the "high hit rate, large object" workload which the mirror nodes are currently doing, the top CPU user is memcpy() - via aioCheckCallbacks(). At least it wasn't -also- memset() as well.

That memcpy() is taking ~ 17% of the total userland CPU used by Lusca in this particular workload.

I have this nagging feeling that said memcpy() is the one done in storeAufsReadDone(), where the AUFS code copies the result from the async read into the supplied buffer. It does this because it's entirely possible the caller has disappeared between the time the storage manager read was scheduled and the time the filesystem read() was scheduled.

Because the Squid codebase doesn't explicitly cancel or wait for completion of async events - and instead relies on this "locking" and "invalidation" semantics provided by the callback data scheme - trying to pass buffers (and structs in general) into threads is pretty much plainly impossible to do correctly.

In any case, the performance should now be noticeably better.

(obnote: I tried explaining this to the rest of the core Squid developers last year and somehow I don't quite think I convinced them that the current approach, with or without the AsyncCallback scheme in Squid-3, is going to work without significant re-engineering of the source tree. Alas..)

Saturday, March 28, 2009

shortcomings in the async io code

Profiling the busy(ish) Lusca nodes during the Mozilla 3.0.8 release cycle has shown significant CPU wastage in memset() (ie, 0'ing memory) - via the aioRead and aioCheckCallbacks code paths.

The problem stems from the disk IO interface inherited from Squid. With Squid, there's no explicit cancel-and-wait-for-cancel in either the network or disk IO code, so the async disk IO read code would actually allocate its own read buffer, read into that, and then provide said read buffer to the completion callback to copy the read data out of. If the request is cancelled while the worker thread is currently read()'ing data, it'll read into its own buffer and not a potentially free()'d buffer from the owner. It's a bit inefficient but, in the grand scheme of Squid CPU use, it's not that big a waste on modern hardware.
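That inherited pattern looks roughly like this - a hedged sketch with hypothetical names, not the actual aioRead() internals:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the inherited aioRead() pattern: the worker thread reads
 * into a private buffer, so a cancelled caller's buffer can never be
 * written to from the thread. The cost is one allocation (including
 * the zeroing complained about above) and one memcpy per read. */
struct aio_read_req {
    char *thread_buf;   /* private buffer the worker reads into */
    char *caller_buf;   /* where the caller wanted the data */
    size_t len;
    int cancelled;      /* set if the caller went away mid-read */
};

static struct aio_read_req *aio_read_submit(char *caller_buf, size_t len)
{
    struct aio_read_req *r = malloc(sizeof(*r));
    r->thread_buf = calloc(1, len);   /* the memset() hot spot */
    r->caller_buf = caller_buf;
    r->len = len;
    r->cancelled = 0;
    return r;
}

/* Completion, run on the main thread: copy out only if still wanted.
 * Returns 1 if the data was delivered, 0 if it was thrown away. */
static int aio_read_complete(struct aio_read_req *r)
{
    int copied = 0;
    if (!r->cancelled) {
        memcpy(r->caller_buf, r->thread_buf, r->len);  /* the memcpy() hot spot */
        copied = 1;
    }
    free(r->thread_buf);
    free(r);
    return copied;
}
```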

In the short term, I'm going to re-jig the async IO code to not zero buffers that are involved in the aioRead() path. In the longer term, I'm not sure. I prefer cancels which may fail - ie, if an operation is in progress, let it complete; if not, return immediately. I'd like this for the network code too, so I can use async network IO threads for lower-copy network IO (eg FreeBSD's aio_read() / aio_write()); but there's a significant amount of existing code which assumes things can be cancelled immediately and assumes temporary copies of data are made everywhere. Sigh.

Anyway - grr'ing aside, fixing the pointless zero'ing of buffers should drop the CPU use for large file operations reasonably noticeably - by at least 10% to 15%. I'm sure that'll be a benefit to someone.

New Lusca-head features to date

I've been adding in a few new features to Lusca.

Firstly is the config option "n_aiops_threads" which allows configuration / runtime tuning of the number of IO threads. I got fed up recompiling Lusca every time I wanted to fiddle with the number of threads, so I made it configurable.

Next is a "client_socksize" - which overrides the compiled and system default TCP socket buffer sizes for client-side sockets. This allows the admin to run Lusca with restricted client side socket buffers whilst leaving the server side socket buffers (and the default system buffer sizes) large. I'm using this on my Cacheboy CDN nodes to help scale load by having large socket buffers to grab files from the upstream servers, but smaller buffers to not waste memory on a few fast clients.
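Under the hood, an option like this boils down to a setsockopt() call on each accepted client socket. A sketch of the idea - the helper name is made up, but the socket options are standard:

```c
#include <sys/socket.h>
#include <sys/types.h>

/* Apply a configured client-side socket buffer size, as a
 * "client_socksize"-style option would. 0 means leave the system
 * defaults alone. Returns 0 on success, -1 on error. */
static int set_client_socksize(int fd, int bytes)
{
    if (bytes <= 0)
        return 0;   /* keep compiled/system defaults */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    return 0;
}
```

The server-side sockets simply never have this applied, which is what leaves them (and the system defaults) large.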

The async IO code is now in a separate library rather than in the AUFS disk storage module. This change is part of a general strategy to overhaul the disk handling code and introduce performance improvements to storage and logfile writing. I also hope to include asynchronous logfile rotation. The change breaks the "auto tuning" done via various hacks in the AUFS and COSS storage modules. Just set "n_aiops_threads" to a sensible amount (say, 8 * the number of storage directories you have, up to about 128 or so threads in total) and rely on that instead of the auto-tuning. I found the auto-tuning didn't quite work as intended anyway..
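Applying that rule of thumb, a box with four cache_dirs would be configured along these lines (a hypothetical squid.conf-style fragment using the directive described above):

```
# 8 threads per cache_dir, 4 cache_dirs => 32 IO threads.
# Cap around 128 total on bigger boxes.
n_aiops_threads 32
```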

Finally, I've started exporting some of the statistics from the "info" cachemgr page in an easier, computer-parsable format. The "mgr:curcounters" cachemgr page includes current client/server counts, current hit rates and disk/memory storage sizes. I'll be using these counters as part of my Cacheboy CDN statistics code.

Googletalk: "Getting C++ threads to work right"

I've been watching a few Google dev talks on Youtube. I thought I'd write up a summary of this one:

In summary:
  • Writing "correct" thread code using pthreads and CPU instructions (fencing, for example) requires the code to know what's going on under the hood;
  • Gluing concurrency to the "side" of a language which was specified without concurrency has proven to be a bit of a problem - eg, concurrent access to different variables in a structure, and how various compilers have implemented this (eg, changing a byte in a struct becoming a 32 bit load, 8 bit modify, 32 bit store);
  • Most programmers should really use higher level constructs, like what the C++0x and Java specification groups have been doing.
If you write threaded code or you're curious about it, you should watch this talk. It provides a very good overview of the problems and should open your mind up a little to what may go wrong..
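The struct-byte point is worth spelling out. On a target without byte-sized stores, writing one field can compile down to a wider read-modify-write; the sketch below writes that transformation out explicitly in plain C (names are illustrative). If another thread updates a neighbouring byte between the load and the store, that update is silently lost:

```c
#include <stdint.h>
#include <string.h>

/* What "store one byte" can become on a machine without byte stores:
 * a 32-bit load, a byte-wide modify, and a 32-bit store. */
struct flags {
    char a, b, c, d;   /* four adjacent one-byte fields */
};

/* The "naive compiler" expansion of: f->b = v; */
static void store_b_via_word(struct flags *f, char v)
{
    uint32_t word;
    memcpy(&word, f, sizeof(word));   /* 32-bit load of all four fields */
    ((char *)&word)[1] = v;           /* modify byte 1 (field b) */
    memcpy(f, &word, sizeof(word));   /* 32-bit store - clobbers a/c/d if
                                         another thread changed them since
                                         the load above */
}
```

This is exactly why the talk's conclusion is to use language-level concurrency constructs rather than reasoning it out by hand.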

Friday, March 27, 2009

Another open cdn project - mirrorbrain

I've been made aware of Mirrorbrain, another project working towards an open CDN framework. Mirrorbrain uses Apache as the web server and some Apache module smarts to redirect users between mirrors.

I like it - I'm going to read through their released source and papers to see what clue can be cribbed from them - but they still base the CDN on an untrusted, third-party mirror network out of their control. I still think the path forward to an "open CDN" involves complete control right out to the mirror nodes and, in some places, the network which the mirror nodes live on.

There are a couple of shortcomings - most notably, their ASN implementation currently uses snapshots of the BGP network topology table rather than a live BGP feed distributed out to each mirror and DNS node. They also store central indexes of files and attempt to maintain maps of which mirror nodes have which updated versions of files, rather than building on top of perfectly good HTTP/1.1 caching semantics. I wonder why..

Monday, March 23, 2009

Example CDN stats!

Here's a snapshot of the global aggregate traffic level:

.. and top 10 AS stats from last Sunday (UTC) :

Sunday, March 22, 2009

I've set up a simple mediawiki install which will serve as a place for me to braindump stuff into.

Thursday, March 19, 2009

More "Content Delivery" done open

Another network-savvy guy in Europe is doing something content-delivery related: .

AS250 is building a BGP anycast based platform for various 'open' content delivery and other applications. I plan on doing something similar (or maybe just partnering with him, I'm not sure!) but anycast is only part of my overall solution space.

He's put up some slides from a presentation he did earlier in the year:

Filesystem Specifications, or EXT4 "Losing Data"

This is a bit off-topic for this blog, but the particular issue at hand bugs the heck out of me.

EXT4 "meets" the POSIX specifications for filesystems. The specification does not make any requirements for data to be written out in any order - and for very good reason. If the application developer -requires- data to be written out in order, they should serialise their operations through use of fsync(). If they do -not- require it, then the operating system should be free to optimise away the physical IO operations.

As a clueful(!) application developer, -I- appreciate being given the opportunity to provide this kind of feedback to the operating system. I don't want one or the other. I'd like to be able to use both where and when I choose.

Application developers - stop being stupid. Fix your applications. Read and understand the specification and what it provides -everyone- rather than just you.
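For the application side, "serialise through fsync()" amounts to the classic write-temp, fsync, rename sequence. A minimal sketch (the helper name is made up; the syscalls are standard POSIX):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace "path" with "len" bytes of "data": write a
 * temporary file, fsync() it so the bytes are durable before the
 * rename, then rename() over the target. Skipping the fsync() is
 * what leaves a zero-length file under the new name after a crash -
 * the behaviour the EXT4 complaints were about. */
static int write_file_ordered(const char *path, const char *tmp,
                              const void *data, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    return rename(tmp, path);
}
```

An application that doesn't need the ordering simply omits the fsync() and lets the OS schedule the physical IO as it sees fit - which is the "both, where and when I choose" point above.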

Monday, March 16, 2009

Breaking 200mbit..

The CDN broke 200mbit at peak today - roughly half mozilla and half videolan.

200mbit is still tiny in the grand scheme of things, but it proves that things are working fine.

The next goal is to handle 500mbit average traffic during the day, and to keep a very close eye on the overheads in doing so (specifically - making sure that things don't blow up when the number of concurrent clients grows.)

GeoIP backend, or "reinventing the wheel"

The first incarnation of the Cacheboy CDN uses 100% GeoIP to redirect users. This is roughly how it goes:

  1. Take a GeoIP map to break up IPs into "country" regions (thanks!) ;
  2. Take the list of "up" CDN nodes;
  3. For each country in my redirection table, find the CDN node that is up with the highest weight;
  4. Generate a "geo-map" file consisting of the highest-weight "up" CDN node for each country in "3";
  5. Feed that to the PowerDNS geoip module (thanks Mark @ Wikipedia!)
This really is a good place to start - it's simple, it's tested, and it provides me with some basic abilities for distributing traffic across multiple sites, to both speed up transfer times to end-users and make better use of the available bandwidth. The trouble is that it knows very little about the current state of the "internet" at any point in time. But, as I said, as a first (coarse!) step to get the CDN delivering bits, it worked out.
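Steps 2-4 of the list above boil down to a per-country argmax over the "up" nodes. A toy sketch - the data structures are hypothetical, and the real generator runs this once per country and writes the winner into the geo-map file for PowerDNS:

```c
#include <stddef.h>

/* Hypothetical CDN node entry: up/down state plus a static weight. */
struct cdn_node {
    const char *name;
    int up;
    int weight;
};

/* Step 3: pick the highest-weight "up" node, or NULL if all are down. */
static const struct cdn_node *pick_node(const struct cdn_node *nodes, size_t n)
{
    const struct cdn_node *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!nodes[i].up)
            continue;
        if (!best || nodes[i].weight > best->weight)
            best = &nodes[i];
    }
    return best;
}
```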

My next step is to build a much more easily "hackable" backend which I can start adding functionality to. I've reimplemented the geoip backend in Perl and glued it to the "pipe-backend" module in PowerDNS, which simply passes DNS requests to an external process that spits back DNS replies. The trouble is that multiple backend processes will be invoked whether you want them or not. This means that I can't simply load large databases into the backend process, as it'll take time to load, waste RAM, and generally make things scale (less) well.

So I broke out the first memory-hungry bit - the "geoip" lookup - and stuffed it into a small C daemon. All the daemon does is take a client IP and answer with the geoip information for that IP. It will periodically check and reload the GeoIP database file in the background if it's changed - maintaining whatever request rate I'm throwing at it rather than pausing for a few seconds whilst things are loaded in.
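The background-reload check is just a periodic stat() comparing the database file's mtime against the last one seen. Roughly (helper name made up):

```c
#include <sys/stat.h>
#include <time.h>

/* Return 1 if "path" has changed since *last_mtime (and update it),
 * so the caller knows to reload the GeoIP database in the background.
 * On a stat() failure, keep serving the data we already have. */
static int file_changed(const char *path, time_t *last_mtime)
{
    struct stat st;
    if (stat(path, &st) < 0)
        return 0;
    if (st.st_mtime == *last_mtime)
        return 0;
    *last_mtime = st.st_mtime;
    return 1;
}
```

Starting with *last_mtime = 0 conveniently makes the first check report "changed", which triggers the initial load.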

I can then use the "geoip daemon" (let's call it "geoipd") from the PowerDNS pipe-backend process I'm writing. All this process has to do at the moment is load the geo maps (which are small) and reload them as required. It sends all geoip requests to the geoipd and uses the replies. If there is a problem talking to the geoipd, the backend process will simply fall back to a weighted round robin of well-connected servers as a last resort.

The aim is to build a flexible backend framework for processing redirection requests which can be used by a variety of applications. For example, when it's time for the CDN proxy nodes to also do 302 redirections to "closer" nodes, I can simply reuse a large part of the modular libraries already written. When I integrate BGP information into the DNS infrastructure, I can reuse all of those libraries in the CDN proxy redirection logic, the webserver URL rewriting logic, or anywhere else it's needed.

The next step? Figuring out how to load balance traffic destined to the same AS / GeoIP region across multiple CDN end nodes. This should let me scale the CDN up to a gigabit of aggregate traffic given the kind of sponsored boxes I'm currently receiving. More to come..