Introduction to Hadoop
The data bazooka

Iván de Prado Alonso // @ivanprado // @datasalt

Datasalt

Focus on Big Data
– Open Source contributions
– Consulting & Development
– Training

BIG
“MAC”
DATA

Anatomy of a Big Data project

Acquisition

Processing

Serving

Types of Big Data systems

● Offline
– Latency is not a problem
● Online
– Data freshness matters
● Mixed
– The most common case

Offline              Online
MapReduce            NoSQL databases
Hadoop               Search engines
Distributed RDBMS

“Swiss army knife of the
21st century”
Media Guardian Innovation
Awards

http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop

History

● 2004-2006
– Google publishes the GFS and MapReduce papers
– Doug Cutting implements an Open Source version in Nutch
● 2006-2008
– Hadoop is split off from Nutch
– Web scale is reached in 2008
● 2008-present
– Hadoop becomes popular and starts to be exploited commercially.

Source: Hadoop: a brief history. Doug Cutting

Hadoop

“The Apache Hadoop
software library is a
framework that allows for
the distributed
processing of large data
sets across clusters of
computers using a
simple programming
model”
From the Hadoop website

Distributed File System

● Distributed file system (HDFS)
– Large blocks: 64 MB
● Stored on the file system of the OS
– Fault tolerant (replication)
– Common formats:
● Text files (CSV)
● SequenceFiles
– Sequences of [key, value] pairs
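
To make the block size concrete, the sketch below estimates how many blocks a file occupies and how many raw bytes the cluster ends up storing for it. The class and method names are hypothetical, and the replication factor of 3 used in the example is an assumption (it is the usual HDFS default, not a figure taken from these slides).

```java
// Back-of-the-envelope HDFS arithmetic; nothing here talks to a real cluster.
class HdfsBlockMath {

    // 64 MB block size, as quoted on the slide.
    static final long BLOCK_SIZE_BYTES = 64L * 1024 * 1024;

    // Number of blocks needed to hold a file (the last block may be only partially filled).
    static long blocksFor(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE_BYTES - 1) / BLOCK_SIZE_BYTES;
    }

    // Raw bytes stored across the cluster when every block is kept `replication` times.
    static long rawBytesStored(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blocksFor(oneGiB));         // 16
        System.out.println(blocksFor(oneGiB + 1));     // 17
        System.out.println(rawBytesStored(oneGiB, 3)); // 3221225472
    }
}
```

Large blocks keep the number of blocks per file small, so a 1 GiB file is tracked as 16 pieces rather than hundreds of thousands.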

MapReduce

● Two functions (Map and Reduce)
– Map(k, v) : [z, w]*
– Reduce(k, v*) : [z, w]*
● Example: counting words
– Map(document, null) -> [word, 1]*
– Reduce(word, 1*) -> [word, total]
● MapReduce and SQL
– SELECT word, count(*) GROUP BY word
● Distributed execution on a cluster with horizontal scalability
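
The two signatures above can be simulated on a single machine. The following is a minimal sketch in plain Java, with no Hadoop classes involved: the class name and the in-memory "shuffle" are illustrative assumptions, and a real cluster would spread the same three steps (map, group by key, reduce) across many nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process simulation of the MapReduce model applied to word counting.
class InMemoryWordCount {

    // Map(document) -> [word, 1]*
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : document.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Reduce(word, 1*) -> total
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: map every document, group the values by key (the "shuffle"), reduce each group.
    static Map<String, Integer> run(List<String> documents) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String document : documents) {
            for (Map.Entry<String, Integer> pair : map(document)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {a=1, is=1, line=1, this=2, too=1}
        System.out.println(run(List.of("This is a line", "This too")));
    }
}
```

The grouping step plays the role of GROUP BY in the SQL analogy, and reduce plays the role of count(*).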

The typical Word Count

Input:
This is a line
This too

Map:
map(“This is a line”) =
  this, 1
  is, 1
  a, 1
  line, 1
map(“This too”) =
  this, 1
  too, 1

Reduce:
reduce(a, {1}) = a, 1
reduce(is, {1}) = is, 1
reduce(line, {1}) = line, 1
reduce(this, {1, 1}) = this, 2
reduce(too, {1}) = too, 1

Result:
a, 1
is, 1
line, 1
this, 2
too, 1

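Word Count in Hadoop

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// HadoopUtils is a helper class that is not shown on the slides.
public class WordCountHadoop extends Configured implements Tool {

    // Emits (word, 1) for every token of every input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts received for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
                InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    @Override
    public int run(String[] args) throws Exception {

        if (args.length != 2) {
            System.err.println("Usage: wordcount-hadoop <in> <out>");
            return 2;
        }

        // Remove any previous output so the job can be re-run.
        Path output = new Path(args[1]);
        HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);

        Job job = new Job(getConf(), "word count hadoop");
        job.setJarByClass(WordCountHadoop.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, output);

        // Report failure through the exit code instead of always returning 0.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCountHadoop(), args));
    }
}

Better to take it piece by piece!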

Word Count in Hadoop - Mapper

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the line into tokens and emit (token, 1) for each one.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

Word Count in Hadoop - Reducer

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
            InterruptedException {
        // Add up every count received for this word.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

Word Count in Hadoop – Configuration and execution

if (args.length != 2) {
    System.err.println("Usage: wordcount-hadoop <in> <out>");
    System.exit(2);
}

// Remove any previous output; getConf() comes from Configured.
Path output = new Path(args[1]);
HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);

Job job = new Job(getConf(), "word count hadoop");
job.setJarByClass(WordCountHadoop.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, output);
job.waitForCompletion(true);

Execution of a MapReduce Job

[Diagram: blocks of the input file → Mappers (Node 1, Node 2) → intermediate data → Reducers (Node 1, Node 2) → result]
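
The diagram does not spell out how a piece of intermediate data finds its reducer. A common scheme, and the one Hadoop's default HashPartitioner follows, is to hash the key and take the remainder modulo the number of reducers. The sketch below reproduces that idea in plain Java; the class and method names are hypothetical.

```java
// Sketch of hash partitioning: every occurrence of a key lands on the same reducer.
class PartitionSketch {

    // Clear the sign bit so the result is never negative, then wrap around the reducer count.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"this", "is", "a", "line", "too"}) {
            System.out.println(word + " -> reducer " + partitionFor(word, 2));
        }
    }
}
```

Because the partition depends only on the key, all values for one word reach a single reducer no matter which mapper produced them.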

Serialization

● Writables
• Hadoop's native serialization
• Very low level
• Basic types: IntWritable, Text, etc.
● Others
• Thrift, Avro, Protostuff
• Backward compatibility.
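
To illustrate what "very low level" means, here is a minimal sketch of a Writable-style record written with plain java.io. It mirrors the write/readFields pair of Hadoop's Writable interface but does not depend on Hadoop; the WordCountRecord class and its helper methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// A record that serializes itself field by field, Writable-style.
class WordCountRecord {
    String word = "";
    int count;

    // Fields are written positionally, with no names or tags.
    void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    // Fields must be read back in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readInt();
    }

    static byte[] toBytes(WordCountRecord record) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            record.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static WordCountRecord fromBytes(byte[] bytes) {
        try {
            WordCountRecord record = new WordCountRecord();
            record.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordCountRecord in = new WordCountRecord();
        in.word = "hadoop";
        in.count = 2;
        byte[] bytes = toBytes(in);
        WordCountRecord out = fromBytes(bytes);
        // prints: hadoop, 2 (12 bytes)
        System.out.println(out.word + ", " + out.count + " (" + bytes.length + " bytes)");
    }
}
```

Since the bytes carry no field names or tags, adding a field changes the layout for every reader. That is one reason schema-aware formats such as Thrift or Avro are attractive when backward compatibility matters.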

Hadoop's learning curve is steep

Tuple MapReduce

● A simpler MapReduce
– Tuples instead of key/value pairs

– Defined at the job level:
● The fields to group by
● The fields to sort by
– Tuple MapReduce-join
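
What "group by some fields, sort by others" means can be sketched in plain Java. The example below is an illustration only, not Pangool code: the Visit tuple with its url, timestamp and ip fields is a hypothetical schema, grouped by url and sorted by timestamp inside each group.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Group tuples by one field and sort them by another, on a single machine.
class TupleGroupingSketch {

    // A tuple with three named fields (hypothetical example schema).
    static final class Visit {
        final String url;
        final long timestamp;
        final String ip;

        Visit(String url, long timestamp, String ip) {
            this.url = url;
            this.timestamp = timestamp;
            this.ip = ip;
        }
    }

    // Sort by (url, timestamp), then cut the sorted run into one group per url.
    static Map<String, List<Visit>> groupAndSort(List<Visit> tuples) {
        List<Visit> sorted = new ArrayList<>(tuples);
        sorted.sort(Comparator.comparing((Visit v) -> v.url).thenComparingLong(v -> v.timestamp));
        Map<String, List<Visit>> groups = new LinkedHashMap<>();
        for (Visit visit : sorted) {
            groups.computeIfAbsent(visit.url, k -> new ArrayList<>()).add(visit);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Visit> visits = List.of(
                new Visit("b.example", 7, "10.0.0.2"),
                new Visit("a.example", 5, "10.0.0.1"),
                new Visit("a.example", 1, "10.0.0.3"));
        groupAndSort(visits).forEach((url, group) -> {
            StringBuilder line = new StringBuilder(url + ":");
            for (Visit v : group) {
                line.append(' ').append(v.timestamp);
            }
            System.out.println(line); // a.example: 1 5, then b.example: 7
        });
    }
}
```

With plain key/value pairs, the same effect usually requires packing the sort field into a composite key by hand; declaring the group-by and sort-by fields at the job level removes that step.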

Pangool
● Implementation of Tuple MapReduce
– Developed by Datasalt
– Open Source
– Efficiency comparable to Hadoop
● Goal: replace the Hadoop API
● If you want to learn Hadoop, start with Pangool

Pangool efficiency
● Comparable to Hadoop

See http://pangool.net/benchmark.html

Pangool – URL resolution

● Example of a join
– Very hard in Hadoop. Easy in Pangool.
● Problem:
– There are many URL shorteners and redirects
– When analyzing data, it is often useful to replace URLs with their canonical URL
– Suppose we have both datasets
● A map with URL → canonical URL entries
● A dataset with URLs (which we want to resolve) and other fields.
– The following Pangool job solves the problem in a scalable way.

URL Resolution – Defining the Schemas

// Schema of the dataset to resolve: url, timestamp, ip
static Schema getURLRegisterSchema() {
    List<Field> urlRegisterFields = new ArrayList<Field>();
    urlRegisterFields.add(Field.create("url", Type.STRING));
    urlRegisterFields.add(Field.create("timestamp", Type.LONG));
    urlRegisterFields.add(Field.create("ip", Type.STRING));
    return new Schema("urlRegister", urlRegisterFields);
}

// Schema of the URL map: url, canonicalUrl
static Schema getURLMapSchema() {
    List<Field> urlMapFields = new ArrayList<Field>();
    urlMapFields.add(Field.create("url", Type.STRING));
    urlMapFields.add(Field.create("canonicalUrl", Type.STRING));
    return new Schema("urlMap", urlMapFields);
}

URL Resolution – Loading the file to resolve

public static class UrlProcessor extends TupleMapper<LongWritable, Text> {

    private Tuple tuple = new Tuple(getURLRegisterSchema());

    @Override
    public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException {

        // Input lines are tab-separated: url, timestamp, ip
        String[] fields = value.toString().split("\t");
        tuple.set("url", fields[0]);
        tuple.set("timestamp", Long.parseLong(fields[1]));
        tuple.set("ip", fields[2]);
        collector.write(tuple);
    }
}

URL Resolution – Loading the URL map

public static class UrlMapProcessor extends TupleMapper<LongWritable, Text> {

    private Tuple tuple = new Tuple(getURLMapSchema());

    @Override
    public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException {

        // Input lines are tab-separated: url, canonicalUrl
        String[] fields = value.toString().split("\t");
        tuple.set("url", fields[0]);
        tuple.set("canonicalUrl", fields[1]);
        collector.write(tuple);
    }
}

URL Resolution – Resolution in the reducer

public static class Handler extends TupleReducer<Text, NullWritable> {

    // Created lazily on first use.
    private Text result;

    @Override
    public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException, TupleMRException {
        if (result == null) {
            result = new Text();
        }
        // The job sorts by url and then by schema, so within a group the urlMap
        // tuple (if there is one) arrives before the urlRegister tuples.
        String canonicalUrl = null;
        for (ITuple tuple : tuples) {
            if ("urlMap".equals(tuple.getSchema().getName())) {
                canonicalUrl = tuple.get("canonicalUrl").toString();
            } else {
                // Emit tab-separated output: canonicalUrl, timestamp, ip
                result.set(canonicalUrl + "\t" + tuple.get("timestamp") + "\t" + tuple.get("ip"));
                collector.write(result, NullWritable.get());
            }
        }
    }
}

URL Resolution – Configuring and launching the job

String input1 = args[0];
String input2 = args[1];
String output = args[2];

deleteOutput(output);

TupleMRBuilder mr = new TupleMRBuilder(conf, "Pangool Url Resolution");
mr.addIntermediateSchema(getURLMapSchema());
mr.addIntermediateSchema(getURLRegisterSchema());
// Group by url; sort by url and then by schema.
mr.setGroupByFields("url");
mr.setOrderBy(
    new OrderBy().add("url", Order.ASC).addSchemaOrder(Order.ASC));
mr.setTupleReducer(new Handler());
mr.setOutput(new Path(output),
    new HadoopOutputFormat(TextOutputFormat.class),
    Text.class,
    NullWritable.class);
mr.addInput(new Path(input1),
    new HadoopInputFormat(TextInputFormat.class),
    new UrlMapProcessor());
mr.addInput(new Path(input2),
    new HadoopInputFormat(TextInputFormat.class),
    new UrlProcessor());
mr.createJob().waitForCompletion(true);
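
Stripped of the Hadoop and Pangool machinery, the join performed by this job can be sketched in plain Java: tag every tuple with its source, sort by (url, source) so the map entry comes first within each url, then walk the sorted list once. All names below are hypothetical. Unlike the reducer on the slides, this sketch falls back to the original URL when no mapping exists, which is an assumption about the desired behaviour rather than something the slides specify.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Single-machine sketch of a sort-based join between a URL map and a URL register.
class UrlJoinSketch {

    static final int URL_MAP = 0;      // payload = canonical URL
    static final int URL_REGISTER = 1; // payload = "timestamp\tip"

    static final class Tagged {
        final String url;
        final int source;
        final String payload;

        Tagged(String url, int source, String payload) {
            this.url = url;
            this.source = source;
            this.payload = payload;
        }
    }

    static List<String> resolve(List<Tagged> tuples) {
        // Sorting by url and then by source puts each map entry ahead of its registers.
        List<Tagged> sorted = new ArrayList<>(tuples);
        sorted.sort(Comparator.comparing((Tagged t) -> t.url).thenComparingInt(t -> t.source));

        List<String> output = new ArrayList<>();
        String currentUrl = null;
        String canonical = null;
        for (Tagged tuple : sorted) {
            if (!tuple.url.equals(currentUrl)) {
                // A new group starts: forget the previous group's mapping.
                currentUrl = tuple.url;
                canonical = null;
            }
            if (tuple.source == URL_MAP) {
                canonical = tuple.payload;
            } else {
                // Assumption: keep the original URL when no canonical form is known.
                String resolved = canonical != null ? canonical : tuple.url;
                output.add(resolved + "\t" + tuple.payload);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Tagged> tuples = List.of(
                new Tagged("http://t.co/x", URL_REGISTER, "100\t1.1.1.1"),
                new Tagged("http://t.co/x", URL_MAP, "http://example.com/page"),
                new Tagged("http://other.example/", URL_REGISTER, "200\t2.2.2.2"));
        resolve(tuples).forEach(System.out::println);
    }
}
```

Only one canonical URL has to be held in memory per group, which is what lets the same approach scale when the sort is distributed across a cluster.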
