KAPOW


4 posters

    Creating robots to crawl PDF

    E^E_2016


    Posts : 10
    Points : 2831
    Join date : 2016-08-29

    Creating robots to crawl PDF

    Post by E^E_2016 Fri Sep 02, 2016 9:50 am

    Hi,

    I am really new to Kapow and was hoping to get some direction. I could not find any tutorial on this, so I am hoping that someone here
    who knows can help.

    I would like to crawl only certain contents inside a PDF file, not everything. For example, the PDF content may already be in table format and contain
    values as well as text. How can I create a robot that structures the contents properly and retrieves only the values, omitting the text?
    When I extract the contents from the PDF, the values and text are all mixed up.

    I understand that content in a PDF cannot be exported to Excel format, right? Is there a workaround for this?

    I hope you can guide me or share your experience in resolving this.

    Thanks.

    Shyam Kumar


    Posts : 113
    Points : 4155
    Join date : 2013-07-05
    Location : Kerala, India

    Re: Creating robots to crawl PDF

    Post by Shyam Kumar Fri Sep 02, 2016 12:18 pm

    Hi,

    Thanks for your post. We are always ready to help you.

    Extracting data from PDF files can be difficult because PDF files vary so much in type.
    It is possible to export content from a PDF to Excel format.

    Can you share a sample PDF file and say which content you need to extract and export?


    Thank you.


    Last edited by Shyam Kumar on Thu Jan 19, 2017 10:12 am; edited 1 time in total

    E^E_2016


    Posts : 10
    Points : 2831
    Join date : 2016-08-29

    Re: Creating robots to crawl PDF

    Post by E^E_2016 Fri Sep 02, 2016 1:05 pm

    Hi,

    Thanks for the reply. Do you mean that you can export the content of a PDF to Excel in Kapow? I tried using the function to extract to Excel, and it says that the function does not work with the PDF format. Could you please advise me which steps I need to take in Kapow to create a robot that can export PDF content into Excel?

    Shyam Kumar


    Posts : 113
    Points : 4155
    Join date : 2013-07-05
    Location : Kerala, India

    Re: Creating robots to crawl PDF

    Post by Shyam Kumar Fri Sep 02, 2016 2:14 pm

    Hi,

    Yes, we can export the content of a PDF to Excel in Kapow.

    I am confused by what you mention in your reply, 'function extract to excel not working'. Can you please attach some screenshots? They may help me give a better solution. Also, which version are you using?

    Kapow offers several options for exporting content from a PDF to Excel, depending on the type of PDF.

    Are you using a database?


    If you are, you can convert the PDF file, extract the data into a variable and store it in the database. Then you can export it from the database.

    Otherwise, you can use the 'Write File' action step available in Kapow and write the content directly to an Excel file.

    If you provide some sample PDF files, I will work on them and give you a proper solution.


    Thank you


    Last edited by Shyam Kumar on Thu Jan 19, 2017 10:12 am; edited 1 time in total

    E^E_2016


    Posts : 10
    Points : 2831
    Join date : 2016-08-29

    Re: Creating robots to crawl PDF

    Post by E^E_2016 Fri Oct 21, 2016 9:33 am

    Hi,

    Thank you for getting back to me so fast. I really appreciate it. I think I understand what you are saying, but it would be helpful if you could provide some screenshots showing how to do this. Let me send over a sample PDF. The requirement is to extract the transaction details in the bill statement, which appear in table format. Can you guide me through how to create a robot to do this? Thank you.

    Please download the PDF file for free at this link, since I can't attach it here. It is too big.
    https://ufile.io/ed91

    Hope to hear from you soon. Thanks.

    Shyam Kumar


    Posts : 113
    Points : 4155
    Join date : 2013-07-05
    Location : Kerala, India

    Re: Creating robots to crawl PDF

    Post by Shyam Kumar Mon Oct 24, 2016 5:02 pm

    Hi,


    As I understand it, you need to extract only the TRANSACTION DETAILS from the PDF file.

    [Screenshot]

    If you need to process multiple PDF files, use an action step: under File System, select the “For Each File” action step. Alternatively, you can load the PDF directly from its URL.

    [Screenshot]

    After loading the PDF file, convert it using “Extract Binary Content” and “Extract from PDF”.


    Here, extract the full binary content into the PDF variable, then use the same variable in the “Extract from PDF” step so that the data is displayed.


    For the PDF mentioned above, you need to extract the transaction details. When I examined the PDF file, I found that the contents of each transaction are located in a paragraph tag (<p>).


    Every transaction's details are contained in a <p> tag that starts with the date of the transaction, so the first step is to extract the date. Because we loop over all the paragraph tags, any paragraph from which a date cannot be extracted is not a transaction; skip it and move on to the next one.
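
    The date check described above can be sketched outside Kapow in Python. This is only an illustration of the logic, not the robot itself: the DD/MM/YYYY date pattern, the function name and the sample paragraphs are assumptions, so adjust the pattern to the date format used in your statement.

```python
import re

# Assumption: a transaction paragraph starts with a DD/MM/YYYY date.
# Change this pattern to match the date format in your own PDF.
DATE_AT_START = re.compile(r"^\s*(\d{2}/\d{2}/\d{4})\s+(.*)$", re.DOTALL)


def extract_transactions(paragraphs):
    """Return (date, details) pairs for paragraphs that begin with a date."""
    rows = []
    for text in paragraphs:
        match = DATE_AT_START.match(text)
        if match is None:
            continue  # no date at the start, so not a transaction: skip it
        rows.append((match.group(1), match.group(2).strip()))
    return rows


# Hypothetical paragraph texts, standing in for the <p> tags of the converted PDF.
sample = [
    "STATEMENT OF ACCOUNT",
    "01/09/2016 GROCERY STORE 45.10",
    "Page 1 of 3",
    "03/09/2016 FUEL STATION 60.00",
]
print(extract_transactions(sample))
```

    In the robot, the same idea is expressed with a date extraction step inside the loop, so that paragraphs failing the extraction are skipped.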

    [Screenshot]

    Then extract the other content you need into variables (here I am using Kapow's default ScratchPad variables).

    If you are using a database, you can insert the data into a database table using the “Store in Database” action step.


    If you want to write the content of the PDF directly to a file, you can write it to a CSV file using the “Write File” action step.

    [Screenshot]
    In the Write File action step, you should give the file name (location).

    File Name: /root/Desktop/Excel.csv // Here you can give your own location

    variable1+"\t"+variable2+"\t"+variable3+"\t"+variable4+"\t"+variable5+"\n"

    \t (tab: moves to the next column)

    \n (newline: moves to the next line)
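
    The row format built by the expression above (values joined by tabs, each row ended by a newline) can be sketched in Python as follows. This is a stand-in for the Write File step, not Kapow code; the function name and the row values are made up for illustration.

```python
import csv
import io


def write_rows(handle, rows):
    """Write each row as tab-separated values terminated by a newline."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Hypothetical extracted values; an in-memory buffer stands in for the output file.
buffer = io.StringIO()
write_rows(buffer, [("01/09/2016", "GROCERY STORE", "45.10")])
print(repr(buffer.getvalue()))
```

    To write to a real file, open it with open(path, "w", newline="") and pass the handle instead of the buffer. The csv module also quotes any value that itself contains a tab or a newline, which plain string concatenation does not do.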

    Then run the robot to see the extracted data.

    [Screenshot]


    If you don't understand anything, please let me know.




    Thank you.


    Regards,

    Shyam kumar P

    E^E_2016


    Posts : 10
    Points : 2831
    Join date : 2016-08-29

    Re: Creating robots to crawl PDF

    Post by E^E_2016 Wed Oct 26, 2016 5:20 pm

    Hi Shyam,

    Thanks for going through the earlier exercise with me here. Your help has been very valuable, and I appreciate the time you have spent on this with me. The only area I do not understand is the pattern and expression configuration. Hopefully, when you have time, we can discuss this part in more detail. Thanks.

    kaundalsajan10@gmail.com


    Posts : 7
    Points : 2669
    Join date : 2017-02-01

    PDF

    Post by kaundalsajan10@gmail.com Wed Feb 01, 2017 6:35 pm

    You can also configure the robot using the "Merge Text" option available in the Extract from PDF step. The data is displayed in two different formats, depending on whether this option is enabled or disabled.

    jinitha kumari.j.r


    Posts : 1
    Points : 3960
    Join date : 2013-07-08

    Re: Creating robots to crawl PDF

    Post by jinitha kumari.j.r Fri Feb 03, 2017 12:26 pm

    Hi Kaundalsajan

    By default, the HTML generated from the PDF merges text that is on the same line into one HTML element, even though these pieces are represented as separate text in the PDF document.

    It is better to turn off this feature (Merge Text) if the PDF document contains more than one column. This helps to maintain the column structure.
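
    As a conceptual illustration only, the difference between the two settings can be sketched in Python. This is not how Kapow implements Merge Text; the fragment layout (line number, x position, text) and the function name are assumptions made for the example.

```python
from itertools import groupby

# Hypothetical text fragments from a two-line, three-column table:
# (line_number, x_position, text)
fragments = [
    (1, 10, "01/09/2016"),
    (1, 120, "GROCERY STORE"),
    (1, 400, "45.10"),
    (2, 10, "03/09/2016"),
    (2, 120, "FUEL STATION"),
    (2, 400, "60.00"),
]


def to_lines(fragments, merge_text):
    """Group fragments by line; merge each line into one element if requested."""
    ordered = sorted(fragments, key=lambda f: (f[0], f[1]))
    lines = []
    for _, group in groupby(ordered, key=lambda f: f[0]):
        texts = [text for _, _, text in group]
        lines.append([" ".join(texts)] if merge_text else texts)
    return lines


print(to_lines(fragments, merge_text=True))   # one element per line
print(to_lines(fragments, merge_text=False))  # one element per column value
```

    With merging on, each line becomes a single string and the column boundaries are lost; with merging off, each value stays in its own element, which is what keeps the column structure.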

    Regards,
    Jinitha
