These are the steps for parsing XML files with Pig.
Step 1: Add the Pig bin directory to your PATH
export PATH=/home/hadoop/Documents/pig-0.11.1/bin:$PATH
Step 2: Register the piggybank jar (run this in the Grunt shell, not in the terminal)
REGISTER '/home/hadoop/Documents/pig-0.11.1/contrib/piggybank/java/piggybank.jar'
Step 3: Load the data
xml = load '/user/hadoop/input/xml.txt' USING
org.apache.pig.piggybank.storage.XMLLoader('name') as(doc:chararray);
The data looks like this:
<Property>
<name>Ryan</name>
</Property>
Step 4: Parse the file and retrieve the value
value = foreach xml GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<name>(.*)</name>')) AS name:chararray;
Step 5: Show the value
dump value;
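To see what the pattern in Step 4 captures, here is a small Python sketch of the same regular expression. It is only an illustration, not Pig code: the `doc` string is an assumed example of one record as XMLLoader('name') would pass it on, and `re.fullmatch` stands in for the whole-string matching that, as far as I know, REGEX_EXTRACT_ALL applies.

```python
import re

# Assumed shape of one record: the whole <name> element as a single string.
doc = "<name>Ryan</name>"

# Same pattern as in Step 4; fullmatch requires the entire string to match.
match = re.fullmatch(r"<name>(.*)</name>", doc)

# The single capture group is the text between the tags.
name = match.group(1) if match else None
print(name)  # Ryan
```

If the record had extra text before or after the element, the match would fail and no value would come back, so the pattern has to describe the whole record.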
*Parse a file with multiple elements per record
The data looks like this:
<Property>
<fname>joseph</fname>
<lname>christino</lname>
<landmark>peter tower</landmark>
<city>panji</city>
<state>Goa</state>
<contact>89456123</contact>
<email>joseph@gmail.com</email>
<PAN_Card>0011542</PAN_Card>
<URL>blog.joseph.com</URL>
</Property>
Load the data:
pigdata = load '/input/file.txt' USING
org.apache.pig.piggybank.storage.XMLLoader('Property') as (doc:chararray);
Parse the values:
values = foreach pigdata GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<Property>\\s*<fname>(.*)</fname>\\s*<lname>(.*)</lname>\\s*<landmark>(.*)</landmark>\\s*<city>(.*)</city>\\s*<state>(.*)</state>\\s*<contact>(.*)</contact>\\s*<email>(.*)</email>\\s*<PAN_Card>(.*)</PAN_Card>\\s*<URL>(.*)</URL>\\s*</Property>')) AS (fname:chararray, lname:chararray, landmark:chararray, city:chararray, state:chararray, contact:int, email:chararray, PAN_Card:long, URL:chararray);
Output:
dump values;
(joseph,christino,peter tower,panji,Goa,89456123,joseph@gmail.com,0011542,blog.joseph.com)
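The long pattern above is easier to check outside Pig first. The following Python sketch rebuilds the same expression from the list of tag names and runs it against the sample record. The names `FIELDS`, `PATTERN` and `extract` are made up for this illustration, and `fullmatch` is used on the assumption that Pig matches the pattern against the whole record.

```python
import re

# Child tags in the order they appear inside <Property>.
FIELDS = ["fname", "lname", "landmark", "city", "state",
          "contact", "email", "PAN_Card", "URL"]

# One capture group per tag, with optional whitespace between tags,
# which is the same structure as the Pig pattern.
PATTERN = re.compile(
    r"<Property>\s*"
    + r"\s*".join(rf"<{f}>(.*)</{f}>" for f in FIELDS)
    + r"\s*</Property>"
)


def extract(doc):
    """Return the captured values as a tuple, or None if the record does not match."""
    m = PATTERN.fullmatch(doc)
    return m.groups() if m else None


record = """<Property>
<fname>joseph</fname>
<lname>christino</lname>
<landmark>peter tower</landmark>
<city>panji</city>
<state>Goa</state>
<contact>89456123</contact>
<email>joseph@gmail.com</email>
<PAN_Card>0011542</PAN_Card>
<URL>blog.joseph.com</URL>
</Property>"""

row = extract(record)
print(row)

# A record with a missing tag no longer matches the pattern at all.
broken = record.replace("<lname>christino</lname>\n", "")
print(extract(broken))  # None
```

Note that every captured value is a string here, and that one missing or reordered tag makes the whole record fail to match. That may be one reason for getting empty tuples such as () in Pig, so it is worth comparing the tag order in the pattern against the real data.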
It's great work.
I have a doubt: if the XML file has multiple repeating (looped) tags, how can you convert that XML into CSV, and which commands would you change? If you have any documentation on XML files, please post it.
How can I read only two columns, fname and lname?
@Kiran You can generate the two columns at the next level, like f_l_name = foreach values generate fname,lname;
Hi, I am a newbie to Pig and am currently working on a multi-attribute XML file. I found your post very useful, but when I try to generate the values of the tags I get ()()()() instead of the values.
Hi, I am trying this with nested tags. How do I need to change my regular expression? Please explain with a simple example.
Hi, I am unable to register piggybank.jar on Ubuntu 14.04 (on which I have installed Hadoop 2.2 and Pig). Whenever I run the register command in the terminal it says "REGISTER: command not found". Please help!
@kulbeer Try writing the REGISTER command in the Grunt shell.
Hi Ravi, nice info. If I have multiple looped tags in the XML, how can I parse it using MapReduce or Pig Latin? For example:
Please dont mention any nick names or alias names
Discount applied only if paid full fees at a time
Pass with Distinction
Atleast one number is mandatory
Only Regular students information is available not for distance or summer course students
Output:
Sname Gender Sid Cname Cid Branch totalfees Totalmarks textinmarks Mobile TextinCollegeinfo textinFees
Suppose I have multiple tags with the same name. How do we parse them? E.g.:
P.S. Since posting HTML tags is not allowed here, $lt stands for the less-than symbol and $gt for the greater-than symbol in the XML tags below.
$ltcity="name" id="1"$gtpanji$lt/city$gt
$ltcity="state" id="2"$gtGoa$lt/city$gt
$ltcity="zip" id="3"$gt123456$lt/city$gt
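For the repeated-tag sample above, one possible approach is to pull out every occurrence of the tag instead of matching the whole record once. The Python sketch below only illustrates the regular expression. It assumes the $lt/$gt placeholders are turned back into angle brackets, and it does not show how to do the same inside Pig, where a different function or a UDF would probably be needed.

```python
import re

# The sample from the comment with $lt/$gt replaced by < and >.
doc = '''<city="name" id="1">panji</city>
<city="state" id="2">Goa</city>
<city="zip" id="3">123456</city>'''

# For every <city ...> element, capture its id attribute and its text.
# (.*?) is non-greedy so each match stops at the nearest closing tag.
pairs = re.findall(r'<city[^>]*\bid="(\d+)"[^>]*>(.*?)</city>', doc)
print(pairs)  # [('1', 'panji'), ('2', 'Goa'), ('3', '123456')]
```

The id value can then be used to tell the three elements apart even though they share one tag name.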
How can I load XML files with ':' in the tag name?