• How to extract text from PDF (image) files with OCR


    Background: the example below uses SuiteScript 1.0 (SS1.0) because it comes from a NetSuite email plugin; the same approach works in SS2.0.

    1. Register an API key at https://ocr.space/OCRAPI

    Note that the free plan has limitations.

    2. Save the email attachment (the PDF file) to the NetSuite File Cabinet, mark it as available without login, build its full URL, and URL-encode it.

    var importFile = attachments[indexAtt];
    importFile.setIsOnline(true); // make the file available without login
    var intFileId = nlapiSubmitFile(importFile);
    var objInvoiceFileRec = nlapiLoadFile(intFileId); // reload the saved file to read its URL
    var strInvFileUrl = "https://" + nlapiGetContext().getCompany() + ".app.netsuite.com" + objInvoiceFileRec.getURL();
    strInvFileUrl = encodeURIComponent(strInvFileUrl);
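The URL assembly in step 2 can be isolated into a small pure function, which makes it easy to check outside NetSuite. This is a minimal sketch: buildFileUrl is a hypothetical helper name, the account ID and media path are placeholders, and the lowercase-and-hyphen handling of sandbox-style account IDs is an assumption that should be verified against your own account's domain.

```javascript
// Hypothetical helper (not part of the original script): builds the absolute
// file URL from the account ID and the relative path returned by getURL().
function buildFileUrl(strAccountId, strRelativeUrl) {
    // Assumption: sandbox-style IDs such as "123456_SB1" map to the host
    // "123456-sb1"; plain numeric IDs pass through unchanged.
    var strHost = String(strAccountId).toLowerCase().replace(/_/g, '-');
    return 'https://' + strHost + '.app.netsuite.com' + strRelativeUrl;
}

// Placeholder values for illustration only.
var strPlainUrl = buildFileUrl('123456_SB1', '/core/media/media.nl?id=42');
// Encode once, right before appending it to the OCR request URL.
var strEncodedUrl = encodeURIComponent(strPlainUrl);
```

Keeping the plain and the encoded URL in separate variables helps avoid encoding the same value twice.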

     

    3. Send a request to https://api.ocr.space/parse/imageurl?apikey=abcAPIKEYabc&filetype=PDF&isTable=true&url= followed by the encoded file URL from step 2.

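    var strReqUrl = "https://api.ocr.space/parse/imageurl?apikey=abcAPIKEYabc&filetype=PDF&isTable=true&url=" + strInvFileUrl;
    var response = nlapiRequestURL(strReqUrl, null, a); // third argument of nlapiRequestURL: request headers

    The API accepts a variety of parameters. In my case the invoice is formatted as a table, so I send isTable=true; this helps me locate the expected cells and values.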
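The request URL from step 3 can also be assembled from its parts, so that each parameter is easy to toggle. This is a minimal sketch: buildOcrRequestUrl is a hypothetical helper name, the API key and file URL are placeholders, and only the endpoint and the apikey, filetype, isTable and url parameters shown above are used.

```javascript
// Hypothetical helper (not part of the original script): builds the
// OCR.space GET request URL used in step 3.
function buildOcrRequestUrl(strApiKey, strFileUrl, bIsTable) {
    var strBase = 'https://api.ocr.space/parse/imageurl';
    var arrParams = [
        'apikey=' + encodeURIComponent(strApiKey),
        'filetype=PDF',
        'isTable=' + (bIsTable ? 'true' : 'false'),
        // Encodes the file URL here, so pass in the raw (unencoded) URL.
        'url=' + encodeURIComponent(strFileUrl)
    ];
    return strBase + '?' + arrParams.join('&');
}

// Placeholder key and file URL for illustration only.
var strSampleReqUrl = buildOcrRequestUrl(
    'abcAPIKEYabc',
    'https://123456.app.netsuite.com/core/media/media.nl?id=42',
    true
);
```

If the file URL has already been passed through encodeURIComponent, append it directly instead of using this helper; otherwise it would be encoded twice.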


    4. Parse the response to get the text recognized in the PDF or image.

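    var objOcrRes = JSON.parse(response.getBody()); // the response body is a JSON string
    var arrParsedLines = (objOcrRes['ParsedResults'] && objOcrRes['ParsedResults'][0] && objOcrRes['ParsedResults'][0]['TextOverlay']) ? objOcrRes['ParsedResults'][0]['TextOverlay']['Lines'] : null;
    var objVndBillData = parseDataFromInvPdf(arrParsedLines); // parseDataFromInvPdf is my own helper (not shown here)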
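The helper parseDataFromInvPdf is not shown in this post, so the following is only a sketch of one way such a helper might pull values out of the parsed lines. It assumes each line object carries a Words array whose items have a WordText string; the function names, labels and sample data are all hypothetical.

```javascript
// Hypothetical sketch; not the original parseDataFromInvPdf.
// Joins the words of one OCR line back into a single string.
function lineToText(objLine) {
    var arrWords = (objLine && objLine.Words) || [];
    var arrTexts = [];
    for (var i = 0; i < arrWords.length; i++) {
        arrTexts.push(arrWords[i].WordText);
    }
    return arrTexts.join(' ');
}

// Returns the text that follows strLabel on the first line containing it,
// with leading whitespace and colons stripped, or null if no line matches.
function findValueAfterLabel(arrLines, strLabel) {
    if (!arrLines) { return null; }
    for (var i = 0; i < arrLines.length; i++) {
        var strText = lineToText(arrLines[i]);
        var intPos = strText.indexOf(strLabel);
        if (intPos !== -1) {
            return strText.substring(intPos + strLabel.length).replace(/^[\s:]+/, '');
        }
    }
    return null;
}

// Sample data, made up for illustration only.
var arrSampleLines = [
    { Words: [{ WordText: 'Invoice' }, { WordText: 'No:' }, { WordText: 'INV-001' }] },
    { Words: [{ WordText: 'Total' }, { WordText: '125.50' }] }
];
var strInvoiceNo = findValueAfterLabel(arrSampleLines, 'Invoice No');
var strTotal = findValueAfterLabel(arrSampleLines, 'Total');
```

Matching on a label and taking the rest of the line is fragile when the OCR splits a table row across lines, so real invoices may need layout-specific rules on top of this.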

  • Original article: https://www.cnblogs.com/backuper/p/How_to_extract_text_from_PDF_or_Image_files.html