4 posters

Creating robots to crawl PDF

E^E_2016

E^E_2016: Posts : 10
Points : 3020
Join date : 2016-08-29

Post n°1

Creating robots to crawl PDF

by E^E_2016 Fri Sep 02, 2016 9:50 am

Hi,

I am really new to Kapow and was hoping to get some direction. I could not find a tutorial on this, so I am hoping that someone here who knows can help.

I would like to crawl only certain contents inside a PDF file, not everything. For example, the PDF content may already be in table format and contain values as well as text. How can I create robots that structure the contents properly and retrieve only the values, omitting the text?
When I extract the contents of the PDF, the values and text come out all mixed up.

I understand that the contents of a PDF cannot be exported to Excel format, right? Is there a workaround for this?

Hope you can guide me or provide your experience in resolving this.

Thanks.

Shyam Kumar

Shyam Kumar: Ranks; Posts : 113
Points : 4344
Join date : 2013-07-05
Location : Kerala, India

Post n°2

Re: Creating robots to crawl PDF

by Shyam Kumar Fri Sep 02, 2016 12:18 pm

Hi,

Thanks for your post. We are always ready to help you.

Data extraction from PDF is somewhat difficult because there are many different types of PDF files.
We can export content from PDF to Excel format.

Can you share some sample PDF files and mention which content you need to extract and export?

Thank you.

Last edited by Shyam Kumar on Thu Jan 19, 2017 10:12 am; edited 1 time in total

Post n°3

Re: Creating robots to crawl PDF

by E^E_2016 Fri Sep 02, 2016 1:05 pm

Hi,

Thanks for the reply. Do you mean that you can export the content of a PDF to Excel in Kapow? I tried using the extract-to-Excel function and it says that the function does not work with the PDF format. Could you advise me which steps I need to use in Kapow to create robots that can export PDF content to Excel?

Post n°4

Re: Creating robots to crawl PDF

by Shyam Kumar Fri Sep 02, 2016 2:14 pm

Hi,

Yes, we can export the content of a PDF to Excel in Kapow.

I am confused by what you mention in your reply: 'function extract to excel not working'. Can you please attach some screenshots? They may help me give a better solution. Also, which version are you using?

Kapow offers many options for exporting content from PDF to Excel, depending on the type of PDF.

Are you using a database?

If you are using a database, you can convert the PDF file, extract the data content into a variable, and store the data. Then you can export from the database.

Otherwise you can use the 'Write File' action step available in Kapow and write the content directly to Excel.

If you provide some sample PDF files, I will work on this and give you a proper solution.

Thank you

Last edited by Shyam Kumar on Thu Jan 19, 2017 10:12 am; edited 1 time in total

Post n°5

Re: Creating robots to crawl PDF

by E^E_2016 Fri Oct 21, 2016 9:33 am

Hi,

Thank you for getting back to me so fast. I really appreciate it. I think I get what you are trying to say, but it would be helpful if you could provide some screenshots showing how to do this. Let me send over a sample PDF here. Basically, the requirement is to extract the transaction details in the bill statement, which appear in table format. Can you guide me through how to create a robot to perform this? Thank you.

Please download the PDF file for free at this link, since I can't attach it here. It is too big.
https://ufile.io/ed91

Hope to hear from you soon. Thanks.

Post n°6

Re: Creating robots to crawl PDF

by Shyam Kumar Mon Oct 24, 2016 5:02 pm

Hi,

As I understand it, you need to extract only the TRANSACTION DETAILS from the PDF file.

If you need to process multiple PDF files, add an action step, choose File System, and then select the “For Each File” action step. Alternatively, you can load a PDF directly from a URL.
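For readers who want to picture what a per-file loop does outside Kapow, here is a minimal Python sketch. It is not Kapow code: the helper name `pdf_files` and the demo folder are hypothetical, and it only shows the idea of visiting every PDF in one folder.

```python
import tempfile
from pathlib import Path

def pdf_files(folder):
    """Return the PDF files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

# Demo folder with made-up file names, created only for illustration.
demo_dir = Path(tempfile.mkdtemp())
for name in ("a.pdf", "b.PDF", "notes.txt"):
    (demo_dir / name).write_bytes(b"")

names = [p.name for p in pdf_files(demo_dir)]
print(names)
```

Each path returned would then go through the same load, convert, and extract steps, one file at a time.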

After loading the PDF file, convert it using “Extract Binary Content” and “Extract from PDF”.

Here, extract the full binary content into the pdf variable, and use the same variable when extracting from the binary, so that the data can be displayed.

For the PDF mentioned above, you need to extract the transaction details. When I examined the PDF file, I found that the contents of each transaction are located in a paragraph tag (<p>).

Each transaction's details sit inside a <p> tag, and the tag starts with the date of the transaction. So, as the first step, you should extract the date. Because we are looping over all the paragraph tags, any paragraph tag that does not satisfy the date extraction must be skipped, since it is not a transaction detail, and the loop moves on to the next one.
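The skip-if-no-date logic can be sketched in Python as follows. This is only an illustration of the idea, not the pattern used in the robot: the date formats in the regular expression (such as "02/09/2016" or "02 Sep 2016") and the sample lines are assumptions, so adjust them to match the real statement.

```python
import re

# Assumed date formats at the start of a paragraph, e.g. "02/09/2016" or "02 Sep 2016".
DATE_AT_START = re.compile(
    r"^\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{1,2} [A-Za-z]{3} \d{4})\b"
)

def transaction_paragraphs(paragraphs):
    """Keep only paragraphs whose text starts with a date; skip the rest."""
    rows = []
    for text in paragraphs:
        match = DATE_AT_START.match(text)
        if match is None:
            continue  # not a transaction, move on to the next paragraph
        date = match.group(1)
        rest = text[match.end():].strip()
        rows.append((date, rest))
    return rows

# Made-up paragraph texts standing in for the <p> tags of the converted PDF.
sample = [
    "TRANSACTION DETAILS",
    "02/09/2016 GROCERY STORE 45.10",
    "Page 1 of 3",
    "03/09/2016 FUEL STATION 60.00",
]
print(transaction_paragraphs(sample))
```

Headings and page footers fail the date test and are dropped, which is the same effect as skipping a paragraph tag in the loop.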

Then extract the remaining contents you need and store them in variables (here I am using Kapow's default ScratchPad variables).

If you are using a database, you can insert the data into a database table using the “Store in Database” action step.

If you want to write the PDF contents out directly, you can simply write them to a CSV file using the “Write File” action step.

In the Write File action step, you should give the file name (location).

File Name: /root/Desktop/Excel.csv // Here you can give your own location

variable1+"\t"+variable2+"\t"+variable3+"\t"+variable4+"\t"+variable5+"\n"

\t (tab: moves to the next column)

\n (newline: moves to the next line)

Then run the robot to see the extracted data.
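The tab-and-newline row format used in the Write File expression can be reproduced in Python like this. It is a rough sketch, not Kapow code: the helper name `append_row`, the temporary file location, and the sample values are all made up for illustration.

```python
import csv
import os
import tempfile

def append_row(path, values):
    """Append one row to `path`, with tabs between columns and a newline at the end."""
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(values)

# Hypothetical output location; a real robot would use its own path.
demo_path = os.path.join(tempfile.mkdtemp(), "Excel.csv")
append_row(demo_path, ["02/09/2016", "GROCERY STORE", "45.10"])
append_row(demo_path, ["03/09/2016", "FUEL STATION", "60.00"])

with open(demo_path, encoding="utf-8") as handle:
    content = handle.read()
print(content)
```

Opening the file in append mode matters here, because the robot writes one row per loop iteration rather than the whole table at once.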

If anything is unclear, please let me know.

Thank you.

Regards,

Shyam kumar P

Post n°7

Re: Creating robots to crawl PDF

by E^E_2016 Wed Oct 26, 2016 5:20 pm

Hi Shyam,

Thanks for going through the earlier exercise with me here. Your help has been very useful, and I appreciate the time you have spent on this with me. The only area I do not understand is the patterns and expression configuration. Hopefully, when you have time, we can discuss this part in more detail. Thanks.

kaundalsajan10@gmail.com

kaundalsajan10@gmail.com: Posts : 7
Points : 2858
Join date : 2017-02-01

Post n°8

PDF

by kaundalsajan10@gmail.com Wed Feb 01, 2017 6:35 pm

You can also configure the robot using the "Merge Text" option available in the Extract from PDF step. The data will be displayed in two different formats, depending on whether this option is enabled or disabled.

jinitha kumari.j.r

jinitha kumari.j.r: Posts : 1
Points : 4149
Join date : 2013-07-08

Post n°9

Re: Creating robots to crawl PDF

by jinitha kumari.j.r Fri Feb 03, 2017 12:26 pm

Hi Kaundalsajan,

By default, the HTML generated from the PDF merges text that is on the same line into one HTML element, even though the pieces are represented as separate text in the PDF document.

It is better to turn off this feature (Merge Text) if the PDF document contains more than one column. This helps to maintain the column structure.

Regards,
Jinitha
