Up-to-date data is the foundation of any kind of work, because without it we cannot reach the goal. Technology moves fast, and new data arrives all the time. This post shows how to find the latest records in a growing collection of daily data, for use in big data application development with MapReduce: what the process is, which environment it runs in, and so on.


Data Processing by MapReduce: Detect Latest Data From Daily Data - Technoligent

Introduction
We collect daily data from many sources, such as streaming, copy and put operations, and log data from other data sources, into our system. The question is: how do we detect the latest data in it?

The solution is MapReduce: a job takes the daily data as input and produces the latest data as output. Let's learn how to do it.

Here, we use Apache Pig to express the MapReduce job. The script loads the old data and the new data and sorts the records by collect date. For each id it keeps only the most recently collected record, so a record collected today replaces the older version that was processed on a previous day.

Initial Steps
Step 1: We need to prepare some input data files. Open a new file in a Linux terminal:

    vi file1

Enter some input data in the format id;product;price;collectdate:

    1;XY milk;2000;20160730000000
    2;AB candy;5000;20160730000000
    3;B chair;6000;20160730000000

Then create the second file:

    vi file2

Enter some input data in the same format:

    2;AB candy;3000;20160731000000
    3;B chair;1000;20160731000000
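If you prefer not to type the rows into an editor, the same two files can be created from the shell. This is only a sketch: it assumes the files contain just the data rows (no header line), and the awk sanity check at the end is an addition for illustration, not part of the original steps.

```shell
# Create the two sample input files without opening an editor.
# File names and rows are the ones used in Step 1.
cat > file1 <<'EOF'
1;XY milk;2000;20160730000000
2;AB candy;5000;20160730000000
3;B chair;6000;20160730000000
EOF

cat > file2 <<'EOF'
2;AB candy;3000;20160731000000
3;B chair;1000;20160731000000
EOF

# Sanity check (illustrative): every row must have exactly 4 ';'-separated fields.
awk -F';' 'NF != 4 { bad = 1; print FILENAME ": bad row: " $0 } END { exit bad }' file1 file2
```

The quoted 'EOF' marker keeps the shell from expanding anything inside the rows.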

Step 2: We need to put the local files into the Hadoop Distributed File System (HDFS). Create one directory per collect date, then upload each file into its directory:

    hadoop fs -mkdir -p /data/mysample/mergedData_20160730000000/ /data/mysample/mergedData_20160731000000/
    hadoop fs -put file1 /data/mysample/mergedData_20160730000000/
    hadoop fs -put file2 /data/mysample/mergedData_20160731000000/

Code Walk Through
This Pig script merges the old and new data and keeps only the latest record for each id from the daily data set.

    -- Give the job a readable name
    SET job.name 'merge old and new data with map reduce by Pig script';

    -- Load the old data, which was already processed yesterday
    previousDayData = LOAD '/data/mysample/mergedData_20160730000000/' USING PigStorage(';') AS (id:chararray, product:chararray, price:chararray, collectdate:chararray);

    -- Load the data collected today
    todayData = LOAD '/data/mysample/mergedData_20160731000000/' USING PigStorage(';') AS (id:chararray, product:chararray, price:chararray, collectdate:chararray);

    -- Combine the two data sets
    unionData = UNION previousDayData, todayData;

    -- Group the data by id, which is the key of the data set
    groupData = GROUP unionData BY id;

Within each group, sort the records by collect date in descending order, so that the latest record is at the top. Then keep only that top record. Because collectdate is a fixed-width timestamp string, sorting it as text gives the same order as sorting by time. This is the de-duplication step: each id ends up with exactly one record, the most recent one.

    outputData = FOREACH groupData {
        -- Latest collect date first
        sortedData = ORDER unionData BY collectdate DESC;
        -- Keep only the newest record for this id
        removeDuplication = LIMIT sortedData 1;
        GENERATE FLATTEN(removeDuplication);
    };
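The same "newest record per id" rule can be checked locally on the sample rows before running the job on the cluster. The sketch below is an assumption-laden stand-in that uses standard Unix tools rather than Pig; it recreates the Step 1 sample files so that it runs on its own, and the variable name latest is only illustrative.

```shell
# Local, illustrative check of the de-duplication rule; not part of the Pig job.
# Recreate the Step 1 sample rows so the sketch is self-contained.
printf '%s\n' \
  '1;XY milk;2000;20160730000000' \
  '2;AB candy;5000;20160730000000' \
  '3;B chair;6000;20160730000000' > file1
printf '%s\n' \
  '2;AB candy;3000;20160731000000' \
  '3;B chair;1000;20160731000000' > file2

# Sort by id (field 1), then by collectdate (field 4) in descending order,
# and keep only the first row seen for each id.
latest=$(cat file1 file2 | sort -t';' -k1,1 -k4,4r | awk -F';' '!seen[$1]++')
printf '%s\n' "$latest"
```

The awk filter plays the role of LIMIT 1 inside each group: once an id has been seen, its older rows are dropped.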

Store outputData to HDFS:

    STORE outputData INTO '/data/mysample/mergedData_20160731000000_processed/' USING PigStorage(';');

Verify the result
We can check the HDFS location to see the output:

    hadoop fs -text /data/mysample/mergedData_20160731000000_processed/* | head -n 10

The latest data in HDFS for 31 July will be:

    1;XY milk;2000;20160730000000
    2;AB candy;3000;20160731000000
    3;B chair;1000;20160731000000
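Beyond eyeballing the first rows, one property is easy to check mechanically: after de-duplication, no id should appear twice. The helper below is a hypothetical sketch (check_unique_ids is not a Hadoop or Pig command); it reads ';'-separated records on standard input and is demonstrated here on the expected rows from this post.

```shell
# Hypothetical helper: reads ';'-separated records on stdin and fails
# if any id (first field) appears more than once.
check_unique_ids() {
  dup=$(cut -d';' -f1 | sort | uniq -d)
  if [ -n "$dup" ]; then
    echo "duplicate ids: $dup" >&2
    return 1
  fi
  echo "ok: one record per id"
}

# Local demonstration with the expected output rows.
printf '%s\n' \
  '1;XY milk;2000;20160730000000' \
  '2;AB candy;3000;20160731000000' \
  '3;B chair;1000;20160731000000' | check_unique_ids
```

On a cluster, the output of the hadoop fs -text command for the processed directory could be piped into check_unique_ids in the same way.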

Conclusion
This post walked through the steps to merge daily data with MapReduce in big data application development: prepare the input files, upload them to HDFS, run the Pig script to keep the latest record for each id, and verify the output. Follow the steps in order for the best results. We hope this helps.

