Apache Hadoop: Manipulation and Transformation of Data Performance

Course Code

ApHadm1

Duration

21 hours (usually 3 days including breaks)

Requirements

Attendees do not need any specific prior skills: the training focuses on end-user skills for both administering and manipulating data under Apache Hadoop.

Overview


This course is intended for developers, architects, data scientists or anyone who needs access to data, either intensively or on a regular basis.

The major focus of the course is data manipulation and transformation.

Among the tools in the Hadoop ecosystem, this course covers Pig and Hive, both of which are heavily used for data transformation and manipulation.

This training also addresses performance metrics and performance optimisation.

The course is hands-on throughout, with short presentations covering the theoretical aspects.

Course Outline

1.1 Hadoop Concepts

1.1.1 HDFS

  • The Design of HDFS
  • Command line interface
  • Hadoop File System

1.1.2 Clusters

  • Anatomy of a cluster
  • Master Node / Slave Node
  • Name Node / Data Node

1.2 Data Manipulation

1.2.1 MapReduce in Detail

  • Map phase
  • Reduce phase
  • Shuffle
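As a rough illustration of how the three phases fit together, the following is a minimal sketch in plain Python. It is not Hadoop code and is not taken from the course material; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the shuffle is simulated with an in-memory dictionary rather than a network transfer.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence over a list of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["pig hive pig", "hive hadoop"])` counts each word across both lines. In a real cluster the map and reduce functions run in parallel on different nodes, and only the shuffle moves data between them.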

1.2.2 Analytics with MapReduce

  • Group-By with MapReduce
  • Frequency distributions and sorting with MapReduce
  • Plotting results (gnuplot)
  • Histograms with MapReduce
  • Scatter plots with MapReduce
  • Parsing complex datasets
  • Counting with MapReduce and Combiners
  • Build reports
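To show why a combiner matters for counting jobs, here is a small sketch in plain Python (again a simulation, not Hadoop code). The two input splits and the helper names `map_split`, `combine` and `reduce_all` are assumptions made for the example. The combiner pre-aggregates each map task's output, so fewer records cross the shuffle while the final counts stay the same.

```python
from collections import Counter

def map_split(lines):
    """Map over one input split, emitting a (word, 1) pair per word."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Combiner: pre-aggregate the pairs of a single map task before the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def reduce_all(pairs):
    """Reduce: sum the (possibly pre-aggregated) values per key across all map tasks."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

# Two hypothetical input splits, each handled by its own map task.
splits = [["a b a a"], ["b b a"]]
raw = [map_split(split) for split in splits]
combined = [combine(pairs) for pairs in raw]

shuffled_without = sum(len(pairs) for pairs in raw)    # records shuffled with no combiner
shuffled_with = sum(len(pairs) for pairs in combined)  # records shuffled after combining
totals = reduce_all(pair for pairs in combined for pair in pairs)
frequency_table = totals.most_common()                 # frequency distribution, highest first
```

This only works because addition is associative and commutative. A combiner is safe for sums, counts, minima and maxima, but not for operations such as a plain average.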

1.2.3 Data Cleansing

  • Document Cleaning
  • Fuzzy string search
  • Record linkage / data deduplication
  • Transform and sort event dates
  • Validate source reliability
  • Trim Outliers
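Fuzzy matching and deduplication can be sketched on a single machine with Python's standard `difflib` module. The example below is one possible approach under stated assumptions, not the method taught in the course: the `normalise` and `deduplicate` helpers and the 0.85 similarity cutoff are illustrative choices, and pairwise comparison like this would need a blocking or grouping step before it could scale to Hadoop-sized data.

```python
import difflib

def normalise(name):
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())

def deduplicate(names, cutoff=0.85):
    """Keep the first spelling of each name; drop later names that closely match a kept one."""
    kept = []
    for name in names:
        candidate = normalise(name)
        known = [normalise(k) for k in kept]
        # get_close_matches returns the known strings whose similarity ratio is >= cutoff
        matches = difflib.get_close_matches(candidate, known, n=1, cutoff=cutoff)
        if not matches:
            kept.append(name)
    return kept
```

For example, `deduplicate(["Apache Hadoop", "apache  hadoop", "Apache Hadop", "Apache Hive"])` treats the second and third entries as variants of the first and keeps "Apache Hive" as a distinct record.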

1.2.4 Extracting and Transforming Data

  • Transforming logs
  • Using Apache Pig to filter
  • Using Apache Pig to sort
  • Using Apache Pig to sessionize
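Sessionizing means grouping each user's events into visits separated by periods of inactivity. The sketch below shows the idea in plain Python rather than Pig; the `(user, timestamp)` event shape, the `sessionize` name and the 30-minute gap are assumptions chosen for illustration.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold; tune it to the data

def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp) events into per-user sessions split on inactivity gaps."""
    by_user = defaultdict(list)
    for user, timestamp in events:
        by_user[user].append(timestamp)

    sessions = {}
    for user, timestamps in by_user.items():
        timestamps.sort()
        user_sessions = [[timestamps[0]]]
        for previous, current in zip(timestamps, timestamps[1:]):
            if current - previous > gap:
                user_sessions.append([current])    # gap exceeded: start a new session
            else:
                user_sessions[-1].append(current)  # still within the current session
        sessions[user] = user_sessions
    return sessions
```

The same shape carries over to a distributed setting: group by user, sort each group by time, then walk the sorted events once.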

1.2.5 Advanced Joins

  • Joining data in the Mapper using MapReduce
  • Joining data using Apache Pig replicated join
  • Joining sorted data using Apache Pig merge join
  • Joining skewed data using Apache Pig skewed join
  • Using a map-side join in Apache Hive
  • Using optimized full outer joins in Apache Hive
  • Joining data using an external key value store
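The idea shared by a map-side join and a Pig replicated join is that the smaller relation fits in memory on every mapper, so the larger relation can be streamed past it without a shuffle. A minimal single-process sketch in Python follows, assuming tuple records with the join key in the first field; `replicated_join` and the sample data in the usage note are hypothetical.

```python
def replicated_join(large_records, small_table, key_index=0):
    """Inner-join each large-side record against an in-memory lookup of the small side."""
    # The small relation is loaded once into a dictionary keyed on its first field.
    lookup = {row[0]: tuple(row[1:]) for row in small_table}
    # The large relation is streamed once; no sorting or shuffling is needed.
    for record in large_records:
        match = lookup.get(record[key_index])
        if match is not None:  # unmatched records are dropped (inner join)
            yield tuple(record) + match
```

With `users = [(1, "alice"), (2, "bob")]` as the small side and a list of `(user_id, url)` clicks as the large side, `list(replicated_join(clicks, users))` appends the user name to every click whose user is known. The approach stops working once the small side no longer fits in memory, which is where merge and skewed joins come in.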

1.3 Performance Diagnosis and Optimization Techniques

  • Map
    • Investigating spikes in input data
    • Identifying map-side data skew problems
    • Map task throughput
    • Small files
    • Unsplittable files
  • Reduce
    • Too few or too many reducers
    • Reduce-side data skew problems
    • Reduce tasks throughput
    • Slow shuffle and sort
  • Competing jobs and scheduler throttling
  • Stack dumps & unoptimized code
  • Hardware failures
  • CPU contention
  • Tasks
    • Extracting and visualizing task execution times
    • Profiling your map and reduce tasks
  • Avoid the reducer
  • Filter and project
  • Using the combiner
  • Fast sorting with comparators
  • Collecting skewed data
  • Reduce skew mitigation
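One common way to mitigate reduce-side skew is key salting: a random suffix spreads a hot key over several reducers, and a second pass merges the partial results. The sketch below simulates this in plain Python; it is one technique among several and is not necessarily the one used in the course. The `salted_count` name, the number of salts and the fixed seed are assumptions for the example.

```python
import random
from collections import Counter

def salted_count(keys, salts=4, seed=0):
    """Count keys in two stages so that one hot key is spread over several reducers."""
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    # Stage 1: pair each key with a random salt, so a hot key becomes up to
    # `salts` distinct (key, salt) groups that different reducers can handle.
    partial = Counter((key, rng.randrange(salts)) for key in keys)
    # Stage 2: strip the salt and merge the partial counts per original key.
    totals = Counter()
    for (key, _salt), count in partial.items():
        totals[key] += count
    return partial, totals
```

The cost is a second aggregation stage, so salting is usually worth it only when a few keys dominate the input.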

Bookings, Prices and Enquiries

Guaranteed to run even with a single delegate!

Private Classroom

From £3750

Private Remote

From £3300

