Thursday, February 26, 2009

'AWS Public Data sets' has full Wikipedia available in TSV format

The 'Amazon Web Services Blog' reports that the AWS public data sets include the Wikipedia Extraction (WEX), a processed, machine-readable dump of the English-language section of Wikipedia. At nearly 67 GB, this is a handy and formidable data set. The data is provided in the TSV format as exported by PostgreSQL.
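For anyone wanting to read such files outside of PostgreSQL, here is a minimal sketch in Python of parsing tab-separated text of the kind PostgreSQL's COPY command writes. It assumes the default text format: tab-delimited fields, `\N` for NULL, and backslash escapes for tabs and newlines inside values. The function names are my own, nothing here is specific to the WEX files, and octal and hex escapes are not handled.

```python
import re

# Common backslash escapes in PostgreSQL's COPY text format.
# Octal (\123) and hex (\x41) escapes are deliberately left out of this sketch.
_ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v"}
_ESCAPE_RE = re.compile(r"\\(.)", re.DOTALL)


def decode_field(raw):
    """Decode one field; a field that is exactly \\N stands for SQL NULL."""
    if raw == r"\N":
        return None
    # Unknown escapes (including \\) fall back to the escaped character itself.
    return _ESCAPE_RE.sub(lambda m: _ESCAPES.get(m.group(1), m.group(1)), raw)


def parse_copy_line(line):
    """Split one exported line into a list of decoded field values."""
    # Tabs inside values are escaped, so a literal tab is always a delimiter.
    return [decode_field(field) for field in line.rstrip("\n").split("\t")]


def read_rows(path):
    """Yield one list of fields per line, without loading the whole file."""
    with open(path, encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield parse_copy_line(line)
```

Because `read_rows` is a generator, it can walk a multi-gigabyte file one row at a time. The UTF-8 encoding is an assumption; adjust it if the files turn out to use something else.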

There are a number of other data sets also available; read more here.

They also describe how easily you can use these data sets:
Instantiating these data sets is basically trivial. You create a new EBS volume of the appropriate size, basing it on the snapshot id of the data. Next, you attach the volume to a running EC2 instance in the same availability zone. Finally, you create a mount point and mount the EBS volume on the instance.

Awesome.
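
The three steps quoted above could be scripted roughly as follows. This is only a sketch, assuming the boto3 library for Python (not something the AWS post mentions); the function names, the default device name and the mount point are my own placeholders. No size is passed, on the assumption that a volume created from a snapshot defaults to the snapshot's size.

```python
def attach_public_dataset(ec2, snapshot_id, instance_id, availability_zone,
                          device="/dev/sdf"):
    """Create an EBS volume from a data set snapshot and attach it.

    `ec2` is assumed to behave like boto3.client("ec2").
    """
    # Step 1: new volume based on the snapshot id of the data.
    # It has to be in the same availability zone as the instance.
    volume = ec2.create_volume(SnapshotId=snapshot_id,
                               AvailabilityZone=availability_zone)
    volume_id = volume["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Step 2: attach the volume to the running instance.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
    return volume_id


def mount_commands(device, mount_point):
    """Step 3: commands to run as root on the instance itself."""
    return [["mkdir", "-p", mount_point], ["mount", device, mount_point]]


# Example usage (needs AWS credentials; all ids below are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2")
#   attach_public_dataset(ec2, "snap-XXXXXXXX", "i-XXXXXXXX", "us-east-1a")
#   mount_commands("/dev/sdf", "/mnt/wex")
```

The device may show up under a different name inside the instance than the one requested, so check before mounting.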
