Tag Archives: data

Drowning in data

Maybe we’ll have flooded our culture-lungs with angry YouTube comments and pharmaceutical spamblogs before the rising sea-levels get a chance to touch our toes… [via MetaFilter]

According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes. Merely keeping up with this flood, and storing the bits that might be useful, is difficult enough. Analysing it, to spot patterns and extract useful information, is harder still.

Actually, I don’t see this deluge of data as a bad thing, but I’m very interested in how we’re going to store, manage and curate it.

Here today, gone tomorrow: why the next decade’s web won’t feel familiar

People seem to be waking up to the impermanence of the web of late. TechDirt points us to a mainstream journalism article at the Globe & Mail, which springboards from the imminent nuking of GeoCities to worrying what will happen to all of your pictures uploaded to Facebook when it eventually (and inevitably) goes the same way. [image by jonas_therkildson]

Lately, there’s been so much discussion about the permanence of information – especially the embarrassing kind – that we have overlooked the fact that it can also disappear. At a time when we’re throwing all kinds of data and memories onto free websites, it’s a blunt reminder that the future can bring unwelcome surprises.

Ten years ago, you could have called GeoCities the garish, beating heart of the Web. It was one of the first sites that threw its doors open to users and invited them to populate its pages according to their own creativity. At a time when the Web was still daunting, it encouraged laypeople to set up their own homepages free of charge.

Kinda like the forerunner of MySpace, then, albeit (somewhat ironically) easier on the eyes and ears… and MySpace’s days are certainly (and mercifully) numbered, if the traffic figures are to be believed. But I digress…

And now, it’s curtains. GeoCities won’t disappear entirely. The Internet Archive – a non-profit foundation based in San Francisco dedicated to backing up the Web for posterity’s sake – is trying to salvage as much as it can before the deadline hits. At least one other independent group is trying to do the same. But this complicates things, because it puts GeoCities users’ data into the hands of an unaccountable third party.

Money-losing websites aren’t exactly novelties. Smaller sites flicker in and out of existence like those bugs that only have 18 hours to mate before they die. But it’s disconcerting to see a big site – one that, long ago, was one of the most popular on the Web – not just fade into obscurity, but come to its end game.

It brings to light some truths about data that are easily overlooked. Websites are like buildings: you can’t just abandon them indefinitely and expect them to keep working. For one thing, electronic storage isn’t free. Storing files requires media that degrade, computers that fail, and power that needs paying for.

The obvious answer here is to make sure you have local backups of anything stored “in the cloud” that you couldn’t bear to lose… but it’s only obvious to those with some degree of computer savvy, and (based on personal experience) everyone else is insufficiently bothered to worry about it ahead of time, no matter how patiently you try to explain the situation. If nothing else, there’ll always be good money for people who can write custom API scraping tools for defunct social networks… that business model will be the new equivalent of the photography studios that now make their income by scanning and retouching old snapshots from the pre-digital era.

But other changes in the way we use the web are very much afoot, as pointed out by Clive Thompson at Wired. For the last decade, classic search has been the dominant internet tool, propelling Google to the top of the pyramid. But this is the age of Twitter, the temporal gateway into the “real-time web”; maybe the old surfing metaphor will finally make more sense when we’re all riding the Zeitgeist of trending topics:

For more than 10 years, Google has organized the Web by figuring out who has authority. The company measures which sites have the most links pointing to them—crucial votes of confidence—and checks to see whether a site grew to prominence slowly and organically, which tends to be a marker of quality. If a site amasses a zillion links overnight, it’s almost certainly spam.

But the real-time Web behaves in the opposite fashion. It’s all about “trending topics”—zOMG a plane crash!—which by their very nature generate a massive number of links and postings within minutes. And a search engine can’t spend days deciding what is the most crucial site or posting; people want to know immediately.


“It’s exactly what your friends are going to be talking about when you get to the bar tonight,” OneRiot executive Tobias Peggs says. “That’s what we’re finding.” Google settles arguments; real-time search starts them.
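The contrast Thompson draws can be caricatured in a few lines of code. This is purely a toy illustration of the two ranking philosophies described in the quote: the site names, the 90% spike threshold, and the scoring functions are invented for the sketch, not anything Google or OneRiot actually uses.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    links_total: int     # inbound links accumulated over the site's lifetime
    links_last_day: int  # inbound links arriving in the past 24 hours

def authority_rank(site: Site) -> float:
    """Classic-search heuristic: more inbound links means more authority,
    but if nearly all of them arrived overnight, treat the site as spam."""
    if site.links_total > 0 and site.links_last_day / site.links_total > 0.9:
        return 0.0  # "a zillion links overnight" -> almost certainly spam
    return float(site.links_total)

def realtime_rank(site: Site) -> float:
    """Real-time-web heuristic: the sudden spike IS the signal."""
    return float(site.links_last_day)

old = Site("established-blog", links_total=50_000, links_last_day=40)
hot = Site("plane-crash-post", links_total=5_000, links_last_day=4_900)

assert authority_rank(old) > authority_rank(hot)  # classic search trusts the old site
assert realtime_rank(hot) > realtime_rank(old)    # real-time search surfaces the spike
```

The same event — a flood of fresh links — is a disqualifying signal under one model and the whole point of the other, which is why a single engine struggles to serve both queries.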

Well, at least we’re not going to be short of things to argue about. If that ever happened, the web would probably close down due to lack of interest… 😉

Data pr0n: the demographics of employment and leisure

Just a quick one: even if you’re not particularly interested in demographic research into how different segments of the population of the United States spend their time each day, the interactive graphical data thingy that the New York Times have produced to illustrate it is pretty sweet, and good for killing ten minutes of idle time… not to mention allowing you to reflect that the idle time in question is theoretically represented in the data you’re observing; how delightfully post-modern! [via MetaFilter]

What other data sets would benefit from this sort of presentation?

Digital Rosetta stone

Japanese researchers are developing a means of storing data for periods of thousands of years, to help solve the problem of an imminent digital dark age:

The team, led by Professor Tadahiro Kuroda of Tokyo’s Keio University, has proposed storing data on semiconductor memory-chips made of what he describes as the most stable material on the Earth – silicon.

Tightly sealed, and powered and read wirelessly, such a device, he claims, would yield its digital secrets even after 1,000 years, making any stored information as resilient as if it were set in stone itself.

It’s a realisation that moved the researchers to name the disc-like, 15in (38cm) wide device the “Digital Rosetta Stone” after the revolutionary 2,200-year-old Egyptian original unearthed by Napoleon’s army.

This is a very similar concept to the Long Now Foundation’s Rosetta Disk, which is intended to be a very-long-term record of contemporary languages.

It’s encouraging to know that this problem is being studied, and that so many groups are looking for solutions.

[from the BBC][image from bwhistler on flickr]

The half-life of data: bug or feature?

The great paradox of electronic data must surely be that while the stuff we want to keep is considered frangible and at risk (think of the old programmer’s adage: “if your data doesn’t exist in three separate locations, it might as well not exist at all”), the stuff we’d rather have disappear (that inebriated email to your ex-partner or lawyer, or that Facebook picture of you smoking crack on the steps of the town hall) has a tendency to hang around out in “the cloud” long enough to embarrass or incriminate. [image by rpongsaj]

The answer to the first problem is obviously to take multiple geographically-separated back-ups (and make a yearly sacrifice to Cthulhu for peace of mind); the latter is a bit more tricky, but a team at the University of Washington think they may have cracked it with a system named Vanish, which is designed to “give users control over the lifetime of personal data stored on the web or in the cloud. Specifically, all copies of Vanish encrypted data — even archived or cached copies — will become permanently unreadable at a specific time, without any action on the part of the user or any third party or centralized service.”

Sounds intriguing – so how does it work?

We created self-destructing data to try to address this problem. Our prototype system, called Vanish, shares some properties with existing encryption systems like PGP, but there are also some major differences. First, someone using Vanish to “encrypt/encapsulate” information, like an email, never learns the encryption key. Second, there is a pre-specified timeout associated with each encrypted/encapsulated message. Prior to the timeout, anyone can read the encrypted/encapsulated message. After the timeout, no one can read that message, because the encryption key is lost due to a set of both natural and programmed processes. It is therefore impossible for anyone to decrypt/decapsulate that email after the timer expires.


We leverage an unusual storage medium in a novel way: namely, global-scale peer-to-peer networks. Vanish creates a secret key to encrypt a user’s data item (such as an email), breaks the key into many pieces, and then sprinkles the pieces across the P2P network. As machines constantly join and leave the P2P network, the pieces of the key gradually disappear. By the time a hacker or someone with a subpoena actually tries to obtain access to the message, the pieces of the key will have permanently disappeared.
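The actual Vanish prototype uses threshold (Shamir) secret sharing over a distributed hash table, so the key survives until enough shares have churned away. The sketch below illustrates the underlying principle with a cruder all-or-nothing XOR split: the key is recoverable only while every share is still reachable, and the loss of any single share destroys it. The function names are mine for illustration, not the Vanish API.

```python
import os

def split_key(key: bytes, n: int) -> list[bytes]:
    """Split key into n XOR shares; ALL n shares are needed to rebuild it."""
    shares = [os.urandom(len(key)) for _ in range(n - 1)]  # n-1 random pads
    last = key
    for s in shares:
        last = bytes(a ^ b for a, b in zip(last, s))  # last = key XOR all pads
    return shares + [last]

def combine(shares: list[bytes]) -> bytes:
    """XOR all shares back together to recover the key."""
    out = bytes(len(shares[0]))
    for s in shares:
        out = bytes(a ^ b for a, b in zip(out, s))
    return out

key = os.urandom(16)         # symmetric key that encrypted the message
shares = split_key(key, 10)  # "sprinkle" the pieces across ten peers

assert combine(shares) == key   # every peer reachable: key recoverable
# Simulate churn: one peer leaves the network, taking its share with it.
assert combine(shares[1:]) != key  # key is now gone for good
```

Vanish’s threshold scheme is gentler than this all-or-nothing version — it tolerates some churn before the deadline, then fails once too many shares are lost — but the punchline is the same: nobody has to remember to delete anything, because the network forgets on its own.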

It’s a clever idea, that’s for certain, and its application to sensitive emails makes a great deal of sense (though I’d want the low-down from Bruce Schneier before deploying it on anything that mattered). As far as Facebook messages are concerned, though, anyone stupid enough to post incriminating material about themselves or others on the biggest social network on the planet can be assumed to lack the gumption to avail themselves of encryption technologies like Vanish. Maybe it’s just because I spend a lot more time on the internet than is really healthy, but I can’t understand how it isn’t more widely acknowledged that the best way to keep something secret is to avoid talking about it in public spaces… [via BoingBoing]