Hadoop: Text Input Format

Simple text-based files are common in the non-Hadoop world, and they’re super common in the Hadoop world too. Data is laid out in lines, with each line being a record. Lines are terminated by a newline character \n in the typical UNIX fashion. Text-files are inherently splittable (just split on \n characters!), but if you want to compress them you’ll have to use a file-level compression codec that support splitting, such as BZIP2 Because these files are just text files you can encode anything you like in a line of the file. One common example is to make each line a JSON document to add some structure. While this can waste space with needless column headers, it is a simple way to start using structured data in HDFS.

Slow to read and write
Can’t split compressed files (Leads to Huge maps)
Need to read/decompress all fields.

An Input format for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text. Advantages: Light weight Disadvantages: Slow to read and write, Can’t split compressed files (Leads to Huge maps)

Hadoop

Text Input Format

No comments:

Post a Comment