Making Noise
A post I made a few weeks ago – Flacrobat, Please, No – got linked in a couple of places. That would be great if my post weren’t wrong. The points were fine – lots of reasons why Acrobat Reader and Flash Player shouldn’t be combined. But Reader and Player aren’t going to be combined. It was like writing a post about why the US shouldn’t invade Canada. Basically, I jumped the gun. I should have waited before posting – novice blogging lesson learned. Fortunately, there’s a silver lining. My post is a good example of noise, which is an interesting data viz topic.
Making noise
OK – I was wrong. Big deal. However, in addition to being wrong, I created noise - unwanted, distracting information. Some people read the post and thought about Flash and Acrobat Reader being combined. That was a waste of their time. Additionally, one of the goals of data visualization (what this blog is about) is eliminating noise. By aggregating data, you see trends and can ignore the bad data - chop off the ends of the bell curve or ignore the data point that sits well outside the norm. That kind of filtering allows you to focus on the useful data and ignore the noise - e.g., learn about Flash, rather than worry about Acrobat.
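As a rough illustration of "chopping off the ends of the bell curve," here is a minimal Python sketch. The function name `drop_outliers`, the two-standard-deviation cutoff, and the sample numbers are all assumptions made up for this example; a real visualization tool might use a different rule entirely.

```python
from statistics import mean, stdev

def drop_outliers(values, max_sigma=2.0):
    """Keep only values within max_sigma standard deviations of the mean.

    The 2-sigma default is an arbitrary choice for this sketch.
    """
    if len(values) < 3:
        return list(values)  # too few points to judge what "the norm" is
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return list(values)  # every value is identical, nothing to trim
    return [v for v in values if abs(v - mu) <= max_sigma * sigma]

# Made-up data: eight ordinary readings and one that sits well outside the norm.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 48]
filtered = drop_outliers(readings)
print(filtered)
```

With this sample the 48 is dropped and the other eight readings are kept. One caveat: a single extreme value also inflates the mean and standard deviation, so a median-based rule is often sturdier on small data sets.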
But how can data visualization help you ignore incorrect blog posts? Well, it can’t…yet. The problem is not the visualization - the problem is the data. It’s not structured in a way that can be visualized. You need attributes to tag the post as good or bad. And you want to compare differing opinions. Blog aggregators help by adding ratings and categories to posts. Some even add “smart categories” which group like posts by searching their content. But there are limitations - if a blog isn’t aggregated, its content won’t show up; and new categories won’t show up until the aggregator is set to look for them.
What you need is a more universal and automatic way to compare data. That’s the goal of semantic web. Many people are writing great things about semantic web. But here’s the quick summary - semantic web is an attempt to add computer-readable content to internet information. It adds structure to what is primarily text. And that structure can help identify data as noise. The idea of semantic web has been around since 2001. But getting people to adopt and create the standards seems to be an ongoing effort with limited success. It’s a chicken-and-egg problem: there aren’t many aggregators using semantic web data because there’s not a lot of data out there, and there’s little reason to add semantic tags when there aren’t many aggregators to read them.
But it is starting. Microformats are being used to identify data. And certain sites (Technorati) are supporting them. Another (rival) format is also emerging - structured blogging. Two plug-ins for structured blogging were recently released. There are lots of differing opinions on whether structured blogging or microformats is the right direction. It seems that microformats are more flexible and, interestingly, combine both the presentation (what a user sees) and data (what the computer sees) in one tag. But structured blogging (now that they have the plug-ins) seems to make it much easier for non-developers to add semantic tags.
And that’s a big deal. (microformats vs. structured blogging)
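To make the "presentation and data in one tag" point concrete, here is a small Python sketch. The HTML snippet uses class names from the hReview microformat (`hreview`, `item`, `fn`, `rating`, `reviewer`), but the review itself, the `ClassTextParser` helper, and the choice of fields are invented for illustration. It is not how Technorati or any other aggregator actually reads microformats.

```python
from html.parser import HTMLParser

# Made-up review. A browser shows it as an ordinary sentence; the class
# names tell a program which piece of text is which.
SNIPPET = """
<div class="hreview">
  <span class="item"><span class="fn">King Kong</span></span>
  rated <span class="rating">4</span> out of 5 by
  <span class="reviewer">A. Blogger</span>
</div>
"""

class ClassTextParser(HTMLParser):
    """Collect the text directly inside elements that carry a wanted class.

    Simplification: assumes every tag is explicitly closed (no <br> or <img>).
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.stack = []   # one entry per open tag: matched class name or None
        self.fields = {}  # class name -> collected text

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.wanted), None)
        self.stack.append(match)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Only text whose innermost open element has a wanted class counts.
        if self.stack and self.stack[-1] is not None:
            name = self.stack[-1]
            self.fields[name] = self.fields.get(name, "") + data.strip()

parser = ClassTextParser(["fn", "rating", "reviewer"])
parser.feed(SNIPPET)
print(parser.fields)
```

A reader sees "King Kong rated 4 out of 5 by A. Blogger"; the program pulls out the title, the rating, and the reviewer from the same markup.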
Anyway, both ideas are new and interesting. But this post is about aggregating blog opinions to reduce blog noise. It seems that structured blogging has the advantage in this area. Can it eliminate noisy blog posts? Well, not quite, although it should be able to eliminate certain types of posts soon. The problem is that the tools to add blog structure are still being developed and are currently focused on easily comparable or quantifiable subjects, e.g., movie, restaurant, and website reviews (comparable by how many stars) and events (comparable by time and location). Here’s a sample post. More amorphous posts (like this one) aren’t being structured yet. But that will be figured out. And most of the usefulness of structured blogs will be in comparing more quantifiable information anyway. E.g., a movie post that disagrees with all the others is noise and can be ignored. A product listed for sale on a blog at a ridiculously high price will also be ignored.
Because a structured blog post will conform to semantic web specs (the structured blog plug-in adds machine-readable XML tags in RDF/OWL format to your post’s page source), structured posts will (in theory) instantly show up in search engines, aggregators, and web apps (similar to RSS notifications). E.g., you’ll be able to list an item on eBay and every other on-line auction house just by making a single blog post (again, in theory). It’s cool stuff. And even if eBay listings take a while to get going, it’s going to result in a lot more structured information – which is information that can be visualized.
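Here is a sketch of how an aggregator might use that structure to ignore the ridiculously priced listing. Everything in it is hypothetical: the `<listing>` XML is a simplified stand-in (not the plug-in’s real RDF/OWL output), and the blog names, prices, and "more than three times the median" rule are made up for the example.

```python
import xml.etree.ElementTree as ET
from statistics import median

# Hypothetical machine-readable fragments, one per blog post.
# This schema is invented for illustration only.
POSTS = {
    "blog-a": '<listing><item>Used iPod</item><price currency="USD">120</price></listing>',
    "blog-b": '<listing><item>Used iPod</item><price currency="USD">110</price></listing>',
    "blog-c": '<listing><item>Used iPod</item><price currency="USD">130</price></listing>',
    "blog-d": '<listing><item>Used iPod</item><price currency="USD">125</price></listing>',
    "blog-e": '<listing><item>Used iPod</item><price currency="USD">2500</price></listing>',
}

def parse_listing(xml_text):
    """Return (item name, price) from one hypothetical listing fragment."""
    root = ET.fromstring(xml_text)
    return root.findtext("item"), float(root.findtext("price"))

def flag_noise(posts, factor=3.0):
    """Return the blogs whose price is more than `factor` times the median."""
    prices = {blog: parse_listing(xml_text)[1] for blog, xml_text in posts.items()}
    typical = median(prices.values())
    return sorted(blog for blog, price in prices.items() if price > factor * typical)

print(flag_noise(POSTS))
```

Because the price is a tagged field rather than a number buried in a sentence, the comparison takes a few lines. With the sample data only "blog-e" is flagged.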
While semantic web is kind of the holy grail of on-line data noise elimination, there are other areas where noise is being reduced. Here are two - one old (spam blocking) and one new (folksonomy).
Spam is the most obvious and prevalent example of internet noise. Fortunately, spam blocking is getting more and more effective (I get less than I used to). And, while spam blocking may not be RDF/OWL compliant, the result of running a spam filter on content is actually very similar to structured blogging – it adds computer-readable content (an “I am spam” tag) to a piece of data. Spam is also going to be very important in the context of semantic web. Plenty of bogus sites will be using semantic tools to misrepresent themselves.
Another example of progress is folksonomy. Usually people think of folksonomies and wikis (community-monitored data) as contributors to noise. Many have written about Wikipedia’s erroneous information and whether community-moderated content is as good as expert-moderated (check related links at bottom). Bad information in these contexts is noise. But folksonomies can also reduce noise…
The first time I heard noise used to describe internet information was at a talk by Josh Porter on Web2.0. (Internet noise is usually used to describe unwanted traffic on web sites – e.g., port scans, worms, etc.) Josh was talking about how searching on delicious is great because the noise level is so low. His point was that because delicious users are technology users, the technology links they save are useful links. Search for “Flex” and you’ll get the links that technology users think are the best – and are therefore usually the best links. If you make the same search on one of the big search engines (obviously, delicious’s search usefulness is skewed toward technology), you probably won’t get results that are consistently as good - because along with useful results, you’ll also get results ranked higher than they should be because someone knows how to optimize for search engines, e.g., someone who meta-tagged their poker site with “flash development.” (For even lower noise levels on delicious, browse the bookmarks of someone who saves many of the same bookmarks as you.)
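The "someone who saves many of the same bookmarks as you" idea can be sketched with a simple set-overlap score. This is only an illustration: the Jaccard measure is my choice for the example, the user names and bookmarks are invented, and none of it reflects how delicious actually works.

```python
def jaccard(a, b):
    """Overlap between two bookmark collections: shared links / all links."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty collections have nothing in common
    return len(a & b) / len(a | b)

def most_similar_user(mine, others):
    """Return the name of the user whose bookmarks overlap most with mine."""
    return max(others, key=lambda name: jaccard(mine, others[name]))

# Invented bookmark sets, purely for illustration.
mine = {"flex-docs", "as3-ref", "tufte", "processing"}
others = {
    "ana": {"flex-docs", "as3-ref", "tufte", "ajax-intro"},
    "raj": {"poker", "flex-docs"},
    "li": {"knitting"},
}

best = most_similar_user(mine, others)
print(best, jaccard(mine, others[best]))
```

Here "ana" shares three of the five distinct links (a score of 0.6), so her remaining bookmarks are the ones most likely to be useful and least likely to be noise.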
This post is about eliminating data noise. But the trends that are eliminating noise are doing more than that. Adding structure to data makes it more searchable, more organized, and more easily visualized. And that structuring is one of three big trends that are changing how we get information on the web… the data is becoming more useful (structured blogging, delicious, Wikipedia vs. Encyclopaedia Britannica), the data is more public (APIs are sprouting up everywhere), and the tools (Ajax, Flash) for visualizing the data are becoming more widely used and understood. Finding accurate information quickly and easily is what the internet is all about. These trends are making it happen.
So pretty soon when I make misleading posts, they can be recognized as noise and ignored. Then I can learn about Flex, instead of writing really long posts in response.
And here are a couple interesting posts on structured blogging: PubSub - one of the sponsors of the structured blogging plug-in - talking about many of structured blogging’s benefits, Josh Porter on learning about structured blogging.
UPDATE APRIL 5, 2006 - for some reason this post is getting spammed - so I’m turning comments off. Not that anyone has commented. In fact, this post is way too long, so I doubt anyone’s even read it. If you have, well, you’re awesome. Unless you’re a spammer, in which case, you suck and I won’t buy your pharmaceuticals.
Comments
One Response to “Making Noise”
I made a couple changes to this post on Dec 29. I added a reference to microformats and removed a reference to tagging podcasts, which is very cool, but, not semantic web.
Also here’s another article on semantic web - this one more of a technical primer. I’m a bit over my head with the semantic web stuff and should probably get back to writing about visualization. But when semantic web finally takes off there will be such an enormous amount of structured data that a lot is going to change - business models will change, the need for visualization will increase (see? this post is on topic), and there will be a big increase in connectedness (which for me is the coolest thing about Google Earth - there are maps, and webcams, and overlays and they all work together). Ten years ago, when I started doing visualization, the biggest problem was there was no data. The amount of data has changed dramatically in the last 10 years. And semantic web will increase it a lot more.