Wednesday, August 28, 2013

XML parsing using PIG

These are the steps for parsing XML files with Pig.

Step 1: Add the Pig bin directory to your PATH
export PATH=/home/hadoop/Documents/pig-0.11.1/bin:$PATH

Step 2: Register the jar file

REGISTER '/home/hadoop/Documents/pig-0.11.1/contrib/piggybank/java/piggybank.jar'

Step 3: Load the data

xml = load '/user/hadoop/input/xml.txt' USING 
org.apache.pig.piggybank.storage.XMLLoader('name') as (doc:chararray);
The data looks like:
<Property>
<name>Ryan</name>
</Property>
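
As a rough mental model of what the load step hands to the next statement: as far as I understand it, XMLLoader('name') emits one record per <name>...</name> element, tags included, and each record lands in doc. The sketch below imitates that in Python purely for illustration. split_records is a hypothetical helper, not part of Pig or piggybank, the second record is made-up sample data, and the real loader reads input splits rather than a string in memory.

```python
import re

def split_records(text, tag):
    """Return every <tag>...</tag> element found in text, tags included."""
    # [\s>] after the tag name keeps <name> from also matching <names>.
    pattern = re.compile(rf"<{tag}[\s>].*?</{tag}>", re.DOTALL)
    return [m.group(0) for m in pattern.finditer(text)]

# First record is the sample from this post; the second is invented for illustration.
SAMPLE = """<Property>
<name>Ryan</name>
</Property>
<Property>
<name>Asha</name>
</Property>"""

for doc in split_records(SAMPLE, "name"):
    print(doc)
```

Passing "Property" instead of "name" returns the whole <Property> blocks, which is the shape used in the multi-element example further down.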

Step 4: Parse the file and retrieve the value

value = foreach xml GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<name>(.*)</name>'))  AS name:chararray;
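
It can help to try the pattern outside Pig before running the job. To my knowledge REGEX_EXTRACT_ALL only returns groups when the pattern matches the whole string, so the sketch below uses Python's re.fullmatch as a stand-in. extract_name is a hypothetical helper, and Pig uses Java regular expressions, which behave the same way for a pattern this simple.

```python
import re

# Same pattern as in the Pig statement above.
NAME_PATTERN = re.compile(r"<name>(.*)</name>")

def extract_name(doc):
    """Return the captured groups as a tuple, or None if the whole string does not match."""
    match = NAME_PATTERN.fullmatch(doc)
    return match.groups() if match else None

print(extract_name("<name>Ryan</name>"))                       # ('Ryan',)
print(extract_name("<Property><name>Ryan</name></Property>"))  # None
```

The second call fails because the string carries extra text around the <name> element. A mismatch of this kind between the loaded record and the pattern is one common reason for getting empty tuples back.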

Step 5: show the value

dump value;

Parsing a file with multiple elements
The data looks like:
<Property>
 <fname>joseph</fname>
 <lname>christino</lname>
 <landmark>peter tower</landmark>
 <city>panji</city>
 <state>Goa</state>
 <contact>89456123</contact>
 <email>joseph@gmail.com</email>
 <PAN_Card>0011542</PAN_Card>
 <URL>blog.joseph.com</URL>
</Property>

Load the data:
pigdata = load '/input/file.txt' USING 
org.apache.pig.piggybank.storage.XMLLoader('Property') as (doc:chararray);

Parse the values:
values = foreach pigdata GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'<Property>\\s*<fname>(.*)</fname>\\s*<lname>(.*)</lname>\\s*<landmark>(.*)</landmark>\\s*<city>(.*)</city>\\s*<state>(.*)</state>\\s*<contact>(.*)</contact>\\s*<email>(.*)</email>\\s*<PAN_Card>(.*)</PAN_Card>\\s*<URL>(.*)</URL>\\s*</Property>')) AS (fname:chararray, lname:chararray, landmark:chararray, city:chararray, state:chararray, contact:int, email:chararray, PAN_Card:long, URL:chararray);
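
A long pattern like this is easy to get wrong by hand, so here is a minimal sketch that builds the same expression from a list of tag names and checks it against the sample record. It is written in Python only to test the regex. build_pattern and FIELDS are hypothetical names, and the backslash is single here because a Python raw string is used, whereas the Pig string literal needs it doubled.

```python
import re

# Child tags in the order they appear inside <Property>.
FIELDS = ["fname", "lname", "landmark", "city", "state",
          "contact", "email", "PAN_Card", "URL"]

def build_pattern(root, fields):
    """Build the same pattern as the Pig statement: one (.*) group per child tag."""
    body = "".join(rf"<{tag}>(.*)</{tag}>\s*" for tag in fields)
    return rf"<{root}>\s*{body}</{root}>"

SAMPLE = """<Property>
 <fname>joseph</fname>
 <lname>christino</lname>
 <landmark>peter tower</landmark>
 <city>panji</city>
 <state>Goa</state>
 <contact>89456123</contact>
 <email>joseph@gmail.com</email>
 <PAN_Card>0011542</PAN_Card>
 <URL>blog.joseph.com</URL>
</Property>"""

match = re.fullmatch(build_pattern("Property", FIELDS), SAMPLE)
print(match.groups() if match else None)
```

Because the pattern fixes both the set and the order of the child tags, a record with a missing or reordered tag does not match at all, and no values come back for it.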

Output:

dump values;

(joseph,christino,peter tower,panji,Goa,89456123,joseph@gmail.com,0011542,blog.joseph.com)
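
One thing to watch in the schema: the output above shows PAN_Card as 0011542, with its leading zeros, while the AS clause declares it as long. A numeric type cannot carry leading zeros, so whenever such a value is really converted to a number they are lost. Declaring identifier-like fields as chararray is one way to keep them. The sketch below illustrates the point in Python rather than Pig, and to_long is a hypothetical helper.

```python
def to_long(text):
    """Convert a digit string to an integer, the way a numeric cast would."""
    return int(text)

raw = "0011542"
print(to_long(raw))              # 11542
print(str(to_long(raw)) == raw)  # False: the leading zeros are gone
```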

12 comments:

  1. Great work.
    I have a question: if the XML file has multiple repeating tags, how can you convert it to CSV, and which commands would you change?

  2. If you have any documentation on XML files, please post it.

  3. How can I read only two columns, fname and lname?

  4. @Kiran, you can project the two columns in a further step, for example: f_l_name = foreach values generate fname, lname;

  5. Hi,
    I am a newbie to Pig and am currently working on a multi-attribute XML file. I found your post very useful. But when I try to generate the values of the tags, I get ()()()() instead of the values.

  6. Hi,
    I am working with nested tags. How do I need to change my regular expression? Please explain with a simple example.

  7. Hi,
    I am unable to register piggybank.jar on Ubuntu 14.04 (on which I have installed Hadoop 2.2 and Pig). Whenever I run the register command in the terminal, it says "REGISTER: command not found".
    Please help!

    Replies
    1. @kulbeer, try running the REGISTER command in the Grunt shell.

  8. Hi Ravi, nice info. If I have multiple repeating tags in the XML, how can I parse it using MapReduce or Pig Latin? For example:

    Please dont mention any nick names or alias names

    Discount applied only if paid full fees at a time
    Pass with Distinction
    Atleast one number is mandatory
    Only Regular students information is available not for distance or summer course students




    Output:
    Sname Gender Sid Cname Cid Branch totalfees Totalmarks textinmarks Mobile TextinCollegeinfo textinFees

  9. Suppose I have multiple tags with the same name. How do we parse them?
    e.g.

    P.S. Since the comment form does not allow HTML tags, $lt stands for the less-than symbol and $gt for the greater-than symbol in the XML tags below.


    $ltcity="name" id="1"$gtpanji$lt/city$gt
    $ltcity="state" id="2"$gtGoa$lt/city$gt
    $ltcity="zip" id="3"$gt123456$lt/city$gt

  10. How do I load XML files with ':' in the tag name?
