NextGen: XML parsing using PIG

Wednesday, August 28, 2013

These are the steps for parsing XML files with Pig.

Step 1: Add the Pig bin directory to your PATH

export PATH=/home/hadoop/Documents/pig-0.11.1/bin:$PATH

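Step 2: Register the piggybank jar (run this and the following steps inside the Pig Grunt shell or a Pig script, not in the system shell)

REGISTER '/home/hadoop/Documents/pig-0.11.1/contrib/piggybank/java/piggybank.jar';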

Step 3: Load the data

xml = load '/user/hadoop/input/xml.txt' USING org.apache.pig.piggybank.storage.XMLLoader('name') as (doc:chararray);
The data looks like this (XMLLoader('name') returns each <name> element, tags included, as one record):
<Property>
<name>Ryan</name>
</Property>

Step 4: Parse the file and retrieve the value

value = foreach xml GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<name>(.*)</name>'))  AS name:chararray;
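
To see why the pattern in Step 4 returns a value, it can help to try it outside Pig. The sketch below is plain Java, not Pig, and rests on one assumption: that REGEX_EXTRACT_ALL follows java.util.regex semantics and requires the pattern to match the whole input string, as Matcher.matches() does. The class and method names (NameTagCheck, extractAll) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: mimics REGEX_EXTRACT_ALL under the
// assumption that the pattern must match the entire input string.
class NameTagCheck {

    // Returns the captured groups, or null when the whole string does not match.
    static List<String> extractAll(String doc, String regex) {
        Matcher m = Pattern.compile(regex).matcher(doc);
        if (!m.matches()) {
            return null;
        }
        List<String> groups = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            groups.add(m.group(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        String regex = "<name>(.*)</name>";
        // One record as XMLLoader('name') hands it over: tags included.
        System.out.println(extractAll("<name>Ryan</name>", regex)); // prints [Ryan]
        // Extra text around the element means there is no whole-string match.
        System.out.println(extractAll("<Property><name>Ryan</name></Property>", regex)); // prints null
    }
}
```

The second call shows why the tag passed to XMLLoader and the tags in the pattern have to agree: if the record carried the enclosing <Property> element, the same pattern would extract nothing.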

Step 5: Show the value

dump value;

Parse a file with multiple elements per record

The data looks like this:
<Property>
 <fname>joseph</fname>
 <lname>christino</lname>
 <landmark>peter tower</landmark>
 <city>panji</city>
 <state>Goa</state>
 <contact>89456123</contact>
 <email>joseph@gmail.com</email>
 <PAN_Card>0011542</PAN_Card>
 <URL>blog.joseph.com</URL>
</Property>

Load the data:

pigdata = load '/input/file.txt' USING org.apache.pig.piggybank.storage.XMLLoader('Property') as (doc:chararray);

Parse the values (PAN_Card is declared as chararray so that its leading zeros are preserved):

values = foreach pigdata GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<Property>\\s*<fname>(.*)</fname>\\s*<lname>(.*)</lname>\\s*<landmark>(.*)</landmark>\\s*<city>(.*)</city>\\s*<state>(.*)</state>\\s*<contact>(.*)</contact>\\s*<email>(.*)</email>\\s*<PAN_Card>(.*)</PAN_Card>\\s*<URL>(.*)</URL>\\s*</Property>')) AS (fname:chararray, lname:chararray, landmark:chararray, city:chararray, state:chararray, contact:int, email:chararray, PAN_Card:chararray, URL:chararray);

Output:

dump values;

(joseph,christino,peter tower,panji,Goa,89456123,joseph@gmail.com,0011542,blog.joseph.com)
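
If the dump shows empty tuples such as () instead of values, a likely cause is that the pattern does not match the whole record, for example because the tags appear in a different order than the pattern expects. Here is a minimal sketch for checking a pattern against a sample record in plain Java. It assumes REGEX_EXTRACT_ALL requires a whole-string match in the sense of java.util.regex Matcher.matches(). The pattern is shortened to three fields to keep it readable, and the names PropertyRegexCheck and extractAll are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig.
class PropertyRegexCheck {

    // Shortened version of the pattern in the post: three fields instead of nine.
    static final String REGEX =
        "<Property>\\s*<fname>(.*)</fname>\\s*<lname>(.*)</lname>\\s*<city>(.*)</city>\\s*</Property>";

    // Returns the captured groups, or null when the whole record does not match.
    static String[] extractAll(String doc) {
        Matcher m = Pattern.compile(REGEX).matcher(doc);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Same layout as the sample data: one element per line, indented.
        String ok = "<Property>\n <fname>joseph</fname>\n <lname>christino</lname>\n"
                  + " <city>panji</city>\n</Property>";
        // Same elements, but fname and lname are swapped.
        String swapped = "<Property>\n <lname>christino</lname>\n <fname>joseph</fname>\n"
                       + " <city>panji</city>\n</Property>";
        System.out.println(Arrays.toString(extractAll(ok)));      // prints [joseph, christino, panji]
        System.out.println(Arrays.toString(extractAll(swapped))); // prints null
    }
}
```

The \\s* between tags is what absorbs the line breaks and indentation. The (.*) groups stay inside one line because . does not match a newline by default.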

12 comments:

SIVA KUMAR, October 28, 2013 at 5:43 AM
It's great work.
I have a doubt: if the XML file has multiple repeating (loop) tags, how can you convert that XML into CSV, and which commands would you change?

SIVA KUMAR, October 28, 2013 at 5:45 AM
If you have any documents about XML files, please post them.

Unknown, January 9, 2014 at 1:06 AM
How can I read only two columns, fname and lname?

rvi, January 9, 2014 at 2:12 AM
@Kiran you can generate the two columns in a follow-up statement, like f_l_name = foreach values generate fname,lname;

Priyadharshini, January 27, 2014 at 4:05 AM
Hi,
I am a newbie to Pig and am currently working on a multi-attribute XML file. I found your post very useful. But when I try to generate the values of the tags, I get ()()()() instead of the values.

Unknown, June 18, 2014 at 2:46 AM
Hi,
I am trying this with nested tags. How do I need to change my regular expression? Please explain with a simple example.

Unknown, July 10, 2014 at 12:54 AM
Hi,
I am unable to register piggybank.jar on Ubuntu 14.04 (on which I have installed Hadoop 2.2 and Pig). Whenever I run the register command in the terminal, it says REGISTER: command not found.
Please help!

Unknown, October 2, 2014 at 8:59 AM
Hi Ravi, nice info. If I have multiple repeating (loop) tags in XML, how can I parse them using MapReduce or Pig Latin? For example:

Please dont mention any nick names or alias names

Discount applied only if paid full fees at a time
Pass with Distinction
Atleast one number is mandatory
Only Regular students information is available not for distance or summer course students

Output:
Sname Gender Sid Cname Cid Branch totalfees Totalmarks textinmarks Mobile TextinCollegeinfo textinFees

Pritish, November 3, 2014 at 2:30 PM
Suppose I have multiple tags with the same name. How do we parse them? For example:

<city="name" id="1">panji</city>
<city="state" id="2">Goa</city>
<city="zip" id="3">123456</city>

Unknown, March 14, 2015 at 10:43 PM
How do I load XML files with ':' in the tag name?